OpenMP

Open Multi-Processing (OpenMP) is a standardised API providing multi-threaded, shared-memory parallelism for Fortran and C/C++ applications.

It consists of compiler directives, runtime library routines, and environment variables to enable thread-based parallel programming using a fork-join model. A master thread parallelises parts of a program by creating several threads (fork). The parallel region is then processed by these threads, until they finish, synchronise, and terminate (join).

The shared memory model allows to share data in parallel regions between all threads. Shared memory access can be limited by using data scope attribute clauses.

Add OpenMP support to your Fortran application by importing the library omp_lib. All compiler directives start with the sentinel !$omp, followed by the actual directive name. Optional clauses can be added. The parallel directive is used to parallelise parts of a program. The subroutine omp_set_num_threads() sets the number of threads to use. The function omp_get_thread_num() returns the number of the current thread.

! example.f90
program example
    use :: omp_lib
    implicit none

    ! Set number of threads to use.
    call omp_set_num_threads(4)

    !$omp parallel
    print '(a, i0)', 'Thread: ', omp_get_thread_num()
    !$omp end parallel
end program example

The compiler must have OpenMP support and the OpenMP shared library has to be installed. GNU Fortran requires the -fopenmp compiler flag:

$ gfortran8 -Wl,-rpath=/usr/local/lib/gcc8/ -fopenmp -o example example.f90

The example program just outputs the number of each running thread:

$ ./example
Thread: 2
Thread: 1
Thread: 0
Thread: 3

Data Scope Attribute Clauses

The shared-memory model of OpenMP allows threads to share data between each other. The visibility of variables can be changed by data scope attribute clauses in the compiler directive.

private
Private variables are only visible to the current thread. They will not be initialised on entering the parallel region.
firstprivate
These are similiar to private variables, but are initialised with the last value before entering the parallel region.
shared
Shared variables can be accessed and changed by all threads. The changes are visible to all threads.
default
The cause specifies a default scope for all variables in the parallel region: default(private | firstprivate | shared | none). Using none requires to explicitly scope all variables.

Example

In the following example, A and B are matrices of size n × n, with n = 1000. The program will calculate the matrix product C = AB twice: at first, sequentially with triple do-loops in the optimal jki-form, and then, parallised with OpenMP. The constant nthreads defines the number of threads to use.

For real-life applications, one would probably prefer the intrinsic Fortran function matmul(a, b), or use a highly optimised third-party library, like ATLAS, for matrix multiplication.

! benchmark.f90
program benchmark
    use :: omp_lib
    implicit none
    integer, parameter :: dp       = kind(0.0d0)
    integer, parameter :: n        = 1000
    integer, parameter :: nthreads = 2

    integer                        :: i, j, k
    real(kind=dp)                  :: t1, t2
    real(kind=dp), dimension(n, n) :: a, b, c

    ! Initialise the PRNG.
    call random_seed()

    ! Fill matrices A and B with random numbers.
    do i = 1, n
        do j = 1, n
            call random_number(a(i, j))
            call random_number(b(i, j))
        end do
    end do

    call cpu_time(t1)

    ! Calculate C = A x B sequentially.
    do j = 1, n
        do k = 1, n
            do i = 1, n
                c(i, j) = c(i, j) + a(i, k) * b(k, j)
            end do
        end do
    end do

    call cpu_time(t2)
    print '(a, f7.3, a)', 'default: ', t2 - t1, ' s'

    ! Set number of threads to use.
    call omp_set_num_threads(nthreads)

    ! Please be aware that `cpu_time()` cannot be called for parallel
    ! operations. Use `omp_get_wtime()` instead.
    t1 = omp_get_wtime()

    ! Calculate C = A x B in parallel with OpenMP, using static scheduling.
    !$omp parallel shared(a, b, c) private(i, j, k)
    !$omp do schedule(static)
    do j = 1, n
        do k = 1, n
            do i = 1, n
                c(i, j) = c(i, j) + a(i, k) * b(k, j)
            end do
        end do
    end do
    !$omp end do
    !$omp end parallel

    t2 = omp_get_wtime()
    print '(a, f7.3, a)', ' openmp: ', t2 - t1, ' s'
end program benchmark

Compile the example with GNU Fortran:

$ gfortran8 -Wall -Wl,-rpath=/usr/local/lib/gcc8/ -O3 -march=native -ffast-math -fopenmp \
  -o benchmark benchmark.f90

Running the example on an Intel Core i7 3520M shows the difference in performance between no threading and threading with OpenMP:

$ ./benchmark
default:   6.057 s
 openmp:   2.530 s

Depending on the CPU, increasing the number of threads may improve the performance even further.

References