OpenMP
Open Multi-Processing (OpenMP) is a standardised API for writing shared-memory multi-threaded applications in Fortran and C/C++.
It consists of compiler directives, runtime library routines, and environment variables to enable thread-based parallel programming using a fork-join model. A master thread parallelises parts of a program by creating several threads (fork). The parallel region is then processed by these threads, until they finish, synchronise, and terminate (join).
The shared-memory model allows data in parallel regions to be shared between all threads. Shared-memory access can be limited with data scope attribute clauses.
Add OpenMP support to your Fortran application by importing the module omp_lib. All compiler directives start with the sentinel !$omp, followed by the actual directive name and optional clauses. The parallel directive is used to parallelise parts of a program. The subroutine omp_set_num_threads() sets the number of threads to use, and the function omp_get_thread_num() returns the number of the current thread.
! example.f90
program example
    use, intrinsic :: omp_lib
    implicit none

    ! Set number of threads to use.
    call omp_set_num_threads(4)

    !$omp parallel
    print '("Thread: ", i0)', omp_get_thread_num()
    !$omp end parallel
end program example
The compiler must have support for OpenMP, and the OpenMP shared library has to be present. GNU Fortran uses the -fopenmp compiler flag to enable OpenMP:
$ gfortran13 -fopenmp -o example example.f90
The example program simply outputs the number of each running thread (the order depends on the thread scheduling and may differ between runs):
$ ./example
Thread: 2
Thread: 1
Thread: 0
Thread: 3
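The number of threads can also be set at run time through the standard OMP_NUM_THREADS environment variable. Note that the explicit call to omp_set_num_threads() in the example takes precedence over it; if the call were removed, one could simply run:
$ OMP_NUM_THREADS=2 ./example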
Data Scope Attribute Clauses
The shared-memory model of OpenMP allows threads to share data between each other. The visibility of variables can be changed by data scope attribute clauses in the compiler directive.
private
- Private variables are only visible to the current thread. They will not be initialised on entering the parallel region.
firstprivate
- These are similar to private variables, but are initialised with the last value before entering the parallel region.
shared
- Shared variables can be accessed and changed by all threads. The changes are visible to all threads.
default
- The clause specifies a default scope for all variables in the parallel region: default(private | firstprivate | shared | none). Using none requires all variables to be scoped explicitly.
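A minimal sketch illustrating the clauses (the program and variable names are made up for this illustration): each thread gets an undefined private copy of i, a firstprivate copy of j initialised with the value set before the region, and read access to the shared variable k:
! scope.f90
program scope
    use, intrinsic :: omp_lib
    implicit none
    integer :: i, j, k

    j = 10
    k = 20

    call omp_set_num_threads(2)

    ! With default(none), every variable used inside the region
    ! has to be scoped explicitly.
    !$omp parallel default(none) private(i) firstprivate(j) shared(k)
    i = omp_get_thread_num() ! The private i is undefined on entry.
    j = j + i                ! Each thread modifies its own copy of j.
    print '("thread ", i0, ": j = ", i0, ", k = ", i0)', i, j, k
    !$omp end parallel
end program scope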
Sections
The sections construct is used to distribute tasks to different threads that divide the sections between themselves. The tasks have to be pre-determined and independent of each other:
! example.f90
program main
    use, intrinsic :: omp_lib
    implicit none
    integer :: n

    call omp_set_dynamic(.false.)
    call omp_set_num_threads(2)

    !$omp parallel private(n)
    !$omp sections
    !$omp section
    ! The first thread.
    n = omp_get_thread_num()
    print '("[Thread ", i0, "] ", a)', n, 'Hello, World!'
    !$omp section
    ! The second thread.
    n = omp_get_thread_num()
    print '("[Thread ", i0, "] ", a)', n, 'Hello, World!'
    !$omp end sections
    !$omp end parallel
end program main
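With two threads, each section is typically executed by a different thread, although the assignment of sections to threads is up to the OpenMP runtime, so the output may vary between runs:
$ ./example
[Thread 0] Hello, World!
[Thread 1] Hello, World!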
Example
In the following example, A and B are matrices of size n × n, with n = 2000. The program will calculate C = AB:
At first, the matrix product will be calculated sequentially with triple do-loops in the optimal jki-form. Then, the calculation will be repeated, but parallelised with OpenMP. The constant nthreads defines the number of threads to use.
! benchmark.f90
program benchmark
    use, intrinsic :: omp_lib
    implicit none
    integer, parameter :: dp = kind(0.0d0)
    integer, parameter :: n = 2000
    integer, parameter :: nthreads = 4
    integer :: i, j, k
    real(kind=dp) :: t1, t2
    real(kind=dp) :: a(n, n), b(n, n), c(n, n)

    ! Set number of threads to use.
    call omp_set_num_threads(nthreads)

    ! Initialise the PRNG and fill matrices A and B with random numbers.
    call random_seed()
    call random_number(a)
    call random_number(b)

    c = 0.0_dp
    t1 = omp_get_wtime()

    ! Calculate C = AB sequentially.
    do j = 1, n
        do k = 1, n
            do i = 1, n
                c(i, j) = c(i, j) + a(i, k) * b(k, j)
            end do
        end do
    end do

    t2 = omp_get_wtime()
    print '(a, f7.3, a)', 'single: ', t2 - t1, ' s'

    c = 0.0_dp
    t1 = omp_get_wtime()

    ! Calculate C = AB in parallel with OpenMP, using static scheduling.
    !$omp parallel shared(a, b, c) private(i, j, k)
    !$omp do schedule(static)
    do j = 1, n
        do k = 1, n
            do i = 1, n
                c(i, j) = c(i, j) + a(i, k) * b(k, j)
            end do
        end do
    end do
    !$omp end do
    !$omp end parallel

    t2 = omp_get_wtime()
    print '(a, f7.3, a)', 'OpenMP: ', t2 - t1, ' s'
end program benchmark
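As a side note, the parallel region and the work-sharing loop can also be merged into the combined !$omp parallel do construct. A minimal sketch of the same loop (the loop variable j is implicitly private in the combined construct):
!$omp parallel do shared(a, b, c) private(i, k) schedule(static)
do j = 1, n
    do k = 1, n
        do i = 1, n
            c(i, j) = c(i, j) + a(i, k) * b(k, j)
        end do
    end do
end do
!$omp end parallel do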
For real-world applications, one would probably prefer the intrinsic Fortran function matmul(a, b), or use a highly optimised third-party library, like ATLAS, for matrix multiplication.
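With the intrinsic function, the whole triple loop collapses to a single statement:
c = matmul(a, b) ! Intrinsic matrix multiplication.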
Compile the example with GNU Fortran:
$ gfortran13 -O3 -march=native -ffast-math -fopenmp -o benchmark benchmark.f90
The parallelised loop, running in two threads on an Intel Core i7 (Ivy Bridge), is nearly twice as fast as the non-parallelised loop:
$ ./benchmark
single: 5.250 s
OpenMP: 2.755 s
Depending on the CPU, increasing the number of threads may improve the performance even further.