OpenMP

Open Multi-Processing (OpenMP) is a standardised API for writing shared-memory multi-threaded applications in Fortran and C/C++.

It consists of compiler directives, runtime library routines, and environment variables that enable thread-based parallel programming using a fork-join model. A master thread parallelises parts of a program by creating several threads (fork). The parallel region is then processed by these threads until they finish, synchronise, and terminate (join).

The shared-memory model allows all threads to share data inside parallel regions. Shared-memory access can be restricted with data scope attribute clauses.

Add OpenMP support to a Fortran application by using the module omp_lib. All compiler directives start with the sentinel !$omp, followed by the actual directive name and optional clauses. The parallel directive marks a region to be executed in parallel. The subroutine omp_set_num_threads() sets the number of threads to use, and the function omp_get_thread_num() returns the number of the calling thread.

! example.f90
program example
    use, intrinsic :: omp_lib
    implicit none

    ! Set number of threads to use.
    call omp_set_num_threads(4)

    !$omp parallel

        print '("Thread: ", i0)', omp_get_thread_num()

    !$omp end parallel
end program example

The compiler must have support for OpenMP, and the OpenMP shared library has to be present. GNU Fortran uses the -fopenmp compiler flag to enable OpenMP:

$ gfortran13 -fopenmp -o example example.f90

The example program simply prints the number of each running thread. The output order may differ between runs:

$ ./example
Thread: 2
Thread: 1
Thread: 0
Thread: 3

Data Scope Attribute Clauses

The shared-memory model of OpenMP allows threads to share data with one another. The visibility of variables can be changed with data scope attribute clauses in the compiler directive.

private
Private variables are only visible to the current thread. They will not be initialised on entering the parallel region.
firstprivate
These are similar to private variables, but are initialised with the value the variable had before entering the parallel region.
shared
Shared variables can be accessed and changed by all threads. The changes are visible to all threads.
default
The clause specifies a default scope for all variables in the parallel region: default(private | firstprivate | shared | none). Using none requires all variables to be scoped explicitly (see the sketch below).
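
The following program is a minimal sketch of these clauses, assuming four threads: each thread receives a firstprivate copy of i, initialised to the value set before the region, and accumulates its result in the shared variable total. The atomic directive, which is not covered in this section, synchronises the updates of the shared variable:

! scope.f90
program scope
    use, intrinsic :: omp_lib
    implicit none
    integer :: i, total

    i     = 100
    total = 0

    call omp_set_num_threads(4)

    !$omp parallel default(none) firstprivate(i) shared(total)

        ! Each thread modifies its own copy of i, initialised to 100.
        i = i + omp_get_thread_num()

        ! The update of the shared variable has to be synchronised.
        !$omp atomic
        total = total + i

    !$omp end parallel

    ! With four threads: total = 100 + 101 + 102 + 103 = 406.
    print '("total: ", i0)', total
end program scope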

Sections

The sections construct is used to distribute tasks among several threads, each section being executed by one of them. The tasks have to be pre-determined and independent of each other:

! example.f90
program main
    use, intrinsic :: omp_lib
    implicit none
    integer :: n

    call omp_set_dynamic(.false.)
    call omp_set_num_threads(2)

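    ! The clause private(n) gives each thread its own copy of n.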
    !$omp parallel private(n)
    !$omp sections
    !$omp section

        ! The first thread.
        n = omp_get_thread_num()
        print '("[Thread ", i0, "] ", a)', n, 'Hello, World!'

    !$omp section

        ! The second thread.
        n = omp_get_thread_num()
        print '("[Thread ", i0, "] ", a)', n, 'Hello, World!'

    !$omp end sections
    !$omp end parallel
end program main
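
Compile and run the program like the first example. Each section is executed exactly once, but the assignment of sections to threads and the output order may vary between runs:

$ gfortran13 -fopenmp -o example example.f90
$ ./example
[Thread 0] Hello, World!
[Thread 1] Hello, World!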

Example

In the following example, A and B are matrices of size n × n, with n = 2000. The program will calculate the matrix product:

C = AB

At first, the matrix product will be calculated sequentially with triple do-loops in jki order, which is optimal for Fortran's column-major array storage. Then, the calculation will be repeated, but parallelised with OpenMP. The constant nthreads defines the number of threads to use.

! benchmark.f90
program benchmark
    use, intrinsic :: omp_lib
    implicit none
    integer, parameter :: dp       = kind(0.0d0)
    integer, parameter :: n        = 2000
    integer, parameter :: nthreads = 2

    integer       :: i, j, k
    real(kind=dp) :: t1, t2
    real(kind=dp) :: a(n, n), b(n, n), c(n, n)

    ! Set number of threads to use.
    call omp_set_num_threads(nthreads)

    ! Initialise the PRNG and fill matrices A and B with random numbers.
    call random_seed()
    call random_number(a)
    call random_number(b)

    c = 0.0_dp
    t1 = omp_get_wtime()

    ! Calculate C = AB sequentially.
    do j = 1, n
        do k = 1, n
            do i = 1, n
                c(i, j) = c(i, j) + a(i, k) * b(k, j)
            end do
        end do
    end do

    t2 = omp_get_wtime()
    print '(a, f7.3, a)', 'single: ', t2 - t1, ' s'

    c = 0.0_dp
    t1 = omp_get_wtime()

    ! Calculate C = AB in parallel with OpenMP, using static scheduling.
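    ! schedule(static) splits the iterations of the following loop into
    ! contiguous chunks that are assigned to the threads in advance.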
    !$omp parallel shared(a, b, c) private(i, j, k)
    !$omp do schedule(static)

        do j = 1, n
            do k = 1, n
                do i = 1, n
                    c(i, j) = c(i, j) + a(i, k) * b(k, j)
                end do
            end do
        end do

    !$omp end do
    !$omp end parallel

    t2 = omp_get_wtime()
    print '(a, f7.3, a)', 'OpenMP: ', t2 - t1, ' s'
end program benchmark

For real-world applications, one would probably prefer the intrinsic Fortran function matmul(a, b), or use a highly optimised third-party library, like ATLAS, for matrix multiplication.
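
As a quick sanity check, the result of the parallelised loop could be compared against matmul() at the end of benchmark.f90 (a hypothetical addition; the error tolerance is an assumption):

    ! Verify the OpenMP result against the intrinsic matmul().
    if (maxval(abs(c - matmul(a, b))) > 1.0e-6_dp) then
        print '(a)', 'error: results differ'
    end if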

Compile the example with GNU Fortran:

$ gfortran13 -O3 -march=native -ffast-math -fopenmp -o benchmark benchmark.f90

The parallelised loop, running in two threads on an Intel Core i7 (Ivy Bridge), is nearly twice as fast as the non-parallelised loop:

$ ./benchmark
single:   5.250 s
OpenMP:   2.755 s

Depending on the CPU, increasing the number of threads may improve the performance even further.
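
Alternatively, if the call to omp_set_num_threads() is removed from the program, the number of threads can be set at runtime through the standard OpenMP environment variable OMP_NUM_THREADS:

$ OMP_NUM_THREADS=8 ./benchmark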
