OpenMP
An API (Application Programming Interface) for shared-memory, explicit, thread-based parallelism.
Goals of OpenMP:
Standardization
Ease of Use
Portability (across different platforms)
OpenMP Thread
A thread that is managed by the OpenMP runtime system.
Team
A set of one or more threads participating in the execution of a parallel region.
Task
A specific instance of executable code and its data environment that the OpenMP implementation can schedule for
execution by threads.
The following base languages are given in [OpenMP-5.2, 1.7]: C90, C99, C11, C18, C++98, C++11, C++14, C++17, C++20,
Fortran 77, Fortran 90, Fortran 95, Fortran 2003, Fortran 2008, and a subset of Fortran 2018.
Base Program
A program written in the base language.
OpenMP Program
A program that consists of a base program that is annotated with OpenMP directives or that calls OpenMP API
runtime library routines.
The name of the command-line argument that enables OpenMP is not mandated by the specification and differs from
one compiler to another.
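For example (flags taken from common compilers; the file name prog.c is illustrative):
$ gcc -fopenmp prog.c        (GCC, Clang)
$ icx -qopenmp prog.c        (Intel oneAPI)
$ nvc -mp prog.c             (NVIDIA HPC SDK)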
C
#include <omp.h>
F08
include "omp_lib.h"
or the Fortran 90 module
use omp_lib
omp_get_num_threads()
Returns the number of threads in the team executing the parallel region from which this routine is called.
omp_set_num_threads()
Sets the number of threads that will be used in subsequent parallel regions; takes the desired number as its argument.
omp_get_thread_num()
Returns the thread number of the thread within a team calling this routine.
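A minimal C sketch combining these three routines; note that omp_get_num_threads() returns 1 when called outside a parallel region:

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(4);   /* request 4 threads for subsequent parallel regions */
    printf("outside: team of %d\n", omp_get_num_threads());   /* prints 1 */
    #pragma omp parallel
    printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    return 0;
}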
csh/tcsh
$ setenv ENV_VAR {NUM}
sh/bash
$ export ENV_VAR={NUM}
OMP_NUM_THREADS
Sets the number of threads to use in the OpenMP program.
Modifications to environment variables after the OpenMP program has started are ignored by the OpenMP
implementation. ICVs can, however, be changed through directive clauses and OpenMP runtime routines.
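A short sketch of the resulting precedence: a num_threads clause overrides omp_set_num_threads(), which in turn overrides OMP_NUM_THREADS:

omp_set_num_threads(4);               /* overrides the OMP_NUM_THREADS setting */
#pragma omp parallel num_threads(2)   /* clause overrides the routine: team of 2 */
printf("team of %d\n", omp_get_num_threads());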
Structured Block
An executable statement, possibly compound, with a single entry at the top and a single exit at the bottom, or an
OpenMP construct.
sentinel = !$OMP
The usual line length, white space and continuation rules apply
A thread that reaches a parallel directive creates a team of threads and becomes the master of the team
Creates a team of threads to execute the parallel region
The code is duplicated and all threads in the team will execute the code contained in the structured block
Inside the region, threads are identified by consecutive numbers starting at zero
There is an implied barrier at the end of a parallel region; only the master thread continues past this point
Optional clauses (explained later) can be used to modify the behaviour and data environment of the parallel
region
C
#include <stdio.h>
#include <omp.h>
int main(void) {
    printf("Hello from your main thread.\n");
    #pragma omp parallel
    printf("Hello from thread %d of %d.\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}
F08
program hello
    use omp_lib
    print *, "Hello from your main thread."
    !$omp parallel
    print *, "Hello from thread ", omp_get_thread_num(), " of ", &
             omp_get_num_threads(), "."
    !$omp end parallel
end program
The serial base program, for comparison:
C
#include <stdio.h>
#include <omp.h>
int main(void) {
    printf("master thread: hello world.\n");
    return 0;
}
Private Variable
With respect to a given set of task regions that bind to the same parallel region, a variable for which the name
provides access to a different block of storage for each task region.
Shared Variable
With respect to a given set of task regions that bind to the same parallel region, a variable for which the name
provides access to the same block of storage for each task region.
FIRSTPRIVATE Clause behaves like the private clause, but the listed variables are automatically initialized according to
the value of their original objects prior to entry into the parallel or work-sharing construct. C/C++: firstprivate(list), F:
FIRSTPRIVATE(list).
LASTPRIVATE Clause behaves like the private clause, but the value from the sequentially last loop iteration or section is
copied back to the original variable object. C/C++: lastprivate(list), F: LASTPRIVATE(list).
SHARED Clause declares variables in its list to be shared among all threads in the team. A shared variable exists in
only one memory location and all threads can read or write to that address. C/C++: shared(list), F: SHARED(list).
DEFAULT Clause specifies a default scope for all variables in the lexical extent of any parallel region.
C/C++: default (shared | none), F: DEFAULT (PRIVATE | FIRSTPRIVATE | SHARED | NONE).
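A hedged C sketch combining these clauses; the variables x, y, n and the loop body are illustrative, not from the original:

int x = 1, y = 0, n = 100;
#pragma omp parallel for default(none) shared(n) firstprivate(x) lastprivate(y)
for (int i = 0; i < n; i++)
    y = i + x;   /* every thread starts with its private x initialized to 1 */
/* after the loop, y holds the value from the last iteration: (n-1) + 1 */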
Data Race
A data race occurs when
multiple threads write to the same memory unit without synchronization or
at least one thread writes to and at least one thread reads from the same memory unit without
synchronization.
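An illustrative race, assuming an array a of length n: all threads update sum without synchronization, so the result is non-deterministic. A reduction(+:sum) clause, or a critical or atomic construct, would fix it.

int sum = 0;
#pragma omp parallel for
for (int i = 0; i < n; i++)
    sum += a[i];   /* data race: unsynchronized read-modify-write of sum */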
!$omp barrier
Threads are only allowed to continue execution of code after the barrier once all threads in the current team
have reached the barrier.
A barrier region must be executed by all threads in the current team or none.
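A sketch of a typical two-phase use, with hypothetical functions prepare() and consume():

#pragma omp parallel
{
    prepare(omp_get_thread_num());   /* phase 1, runs concurrently */
    #pragma omp barrier              /* no thread enters phase 2 before all have finished phase 1 */
    consume(omp_get_thread_num());   /* phase 2 may safely read phase-1 results */
}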
Execution of critical regions with the same name is restricted to one thread at a time.
name is a compile time constant.
In C, names live in their own name space.
In Fortran, names of critical regions can collide with other identifiers.
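A C sketch using a named critical region to protect a shared maximum; a and n are assumed to exist:

int max_val = INT_MIN;   /* requires <limits.h> */
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    #pragma omp critical (update_max)   /* one thread at a time */
    if (a[i] > max_val)
        max_val = a[i];
}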
An ordered directive can only appear in the dynamic extent of the for or parallel for (C/C++) directives and,
equivalently, the DO or PARALLEL DO (Fortran) directives.
A loop that contains an ordered directive must itself carry an ordered clause.
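A sketch with a hypothetical function heavy_work(); the calls run in parallel, but the output appears in loop order:

#pragma omp parallel for ordered
for (int i = 0; i < n; i++) {
    int v = heavy_work(i);      /* executed concurrently */
    #pragma omp ordered
    printf("%d: %d\n", i, v);   /* executed in iteration order */
}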
copyprivate(list): list contains variables that are private in the enclosing parallel region.
At the end of the single construct, the values of all list items on the thread that executed the single block are copied
to all other threads.
E.g. serial initialization
copyprivate cannot be combined with nowait.
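A sketch of serial initialization with single and copyprivate; read_config() is hypothetical:

int seed;
#pragma omp parallel private(seed)
{
    #pragma omp single copyprivate(seed)
    seed = read_config();   /* executed by exactly one thread */
    /* now every thread's private seed holds the same value */
}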
Declares the iterations of a loop to be suitable for concurrent execution on multiple threads.
By default, the loop directive applies to the outermost loop of a set of nested loops
collapse(n) extends the scope of the loop directive to the n outermost loops
All associated loops must be perfectly nested, i.e. there is no code between the headers and bodies of the
individual loops
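A sketch of collapse(2) over a perfectly nested loop pair; a, n and m are assumed:

#pragma omp parallel for collapse(2)   /* the combined iteration space of size n*m is divided */
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)        /* no statements between the two loop headers */
        a[i][j] = i + j;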
Determines how the iteration space is divided into chunks and how these chunks are distributed among threads.
static Divide iteration space into chunks of chunk_size iterations and distribute them in a round-robin
fashion among threads. If chunk_size is not specified, chunk size is chosen such that each thread gets
at most one chunk.
dynamic Divide into chunks of size chunk_size (defaults to 1). When a thread is done processing a chunk it
acquires a new one.
guided Like dynamic but chunk size is adjusted, starting with large sizes for the first chunks and decreasing to
chunk_size (default 1).
auto Let the compiler and runtime decide.
runtime Schedule is chosen based on ICV run-sched-var.
If no schedule clause is present, the default schedule is implementation defined.
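A sketch contrasting two schedules on a loop with irregular per-iteration cost; irregular_work() and b are hypothetical:

#pragma omp parallel for schedule(static)      /* one contiguous chunk per thread */
for (int i = 0; i < n; i++) b[i] = irregular_work(i);

#pragma omp parallel for schedule(dynamic, 4)  /* chunks of 4, handed out on demand */
for (int i = 0; i < n; i++) b[i] = irregular_work(i);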
Compile and execute the file openmp_hello_world.{c|f90}. It contains a ‘hello world’ from all
available OpenMP threads.
How many OpenMP threads are used? Implement different ways to set the number of OpenMP threads. Which
setting overrides which others? Make a hierarchical list.
Compile and execute the file openmp_ws_manual.{c|f90}. It contains a small loop inside a parallel
region.
Compile and execute the file openmp_simple_sum.{c|f90}. It contains a serial version of a simple sum
from 0 to large_number.
What result do you get? How long does it take? How does the runtime scale with the number of OpenMP
threads and the value of large_number (degrees of freedom)?
What result do you get? How long does it take? How does the runtime scale with the number of OpenMP
threads and the matrix size (degrees of freedom)?
Exercise 4 – Race Conditions
What result do you get? How long does it take? How does the runtime scale with the number of OpenMP
threads and the matrix size (degrees of freedom)?
Experiment!
Play with various variants, e.g. array size, number of threads, chunk size.