OpenMP 01 Introduction

The document provides an introduction to OpenMP, including its history and evolution over time. OpenMP is a standard for shared-memory parallelization and was first released in 1997 for FORTRAN. It has since expanded to support C and C++. OpenMP uses a fork-join model in which worker threads are spawned to handle parallel regions, starting from one initial master thread. It provides directives for defining parallel regions and work-sharing constructs such as the for construct to distribute work across threads.


Introduction to OpenMP

Christian Terboven, Dirk Schmidl


IT Center, RWTH Aachen University
Member of the HPC Group
{terboven,schmidl}@itc.rwth-aachen.de

IT Center der RWTH Aachen University


History

- De-facto standard for Shared-Memory Parallelization.
- 1997: OpenMP 1.0 for FORTRAN
- 1998: OpenMP 1.0 for C and C++
- 1999: OpenMP 1.1 for FORTRAN (errata)
- 2000: OpenMP 2.0 for FORTRAN
- 2002: OpenMP 2.0 for C and C++
- 2005: OpenMP 2.5 now includes both programming languages.
- 05/2008: OpenMP 3.0 release
- 07/2011: OpenMP 3.1 release
- 07/2013: OpenMP 4.0 release
- 11/2015: OpenMP 4.5 release

http://www.OpenMP.org

RWTH Aachen University has been a member of the OpenMP Architecture Review Board (ARB) since 2006.
2 Introduction to OpenMP
C. Terboven | IT Center der RWTH Aachen University
Multi-Core System Architecture
Moore's Law still holds!

- The number of transistors on a chip is still doubling every 24 months ...
- ... but the clock speed is no longer increasing that fast!
- Instead, we will see many more cores per chip!

Source: Herb Sutter, www.gotw.ca/publications/concurrency-ddj.htm
Example for a SMP system

- Dual-socket Intel Woodcrest (dual-core) system
  - Two cores per chip, 3.0 GHz
  - Each chip has 4 MB of L2 cache on-chip, shared by both cores
  - No off-chip cache
  - Bus: Frontside bus
- SMP: Symmetric Multi-Processor
  - Memory access time is uniform on all cores
  - Limited scalability

[Figure: two dual-core chips, each with a shared on-chip cache, connected via a bus to memory]
OpenMP Overview & Parallel Region
OpenMP's machine model

- OpenMP: Shared-Memory Parallel Programming Model. All processors/cores access a shared main memory.
- Real architectures are more complex, as we will see later / as we have seen.
- Parallelization in OpenMP employs multiple threads.

[Figure: processors with private caches connected via a crossbar / bus to a shared memory]
OpenMP Execution Model

- OpenMP programs start with just one thread: the Master.
- Worker threads are spawned at Parallel Regions; together with the Master they form the Team of threads.
- In between Parallel Regions the Worker threads are put to sleep. The OpenMP Runtime takes care of all thread management work.
- Concept: Fork-Join.
- Allows for an incremental parallelization!

[Figure: serial parts executed by the Master thread alternate with Parallel Regions executed by the whole Team]
Parallel Region and Structured Blocks

- The parallelism has to be expressed explicitly.

C/C++:

    #pragma omp parallel
    {
        ... structured block ...
    }

Fortran:

    !$omp parallel
    ...
    structured block
    ...
    !$omp end parallel

- Structured Block
  - Exactly one entry point at the top
  - Exactly one exit point at the bottom
  - Branching in or out is not allowed
  - Terminating the program is allowed (abort / exit)
- Specification of the number of threads:
  - Environment variable: OMP_NUM_THREADS=...
  - Or via the num_threads clause: add num_threads(num) to the parallel construct
Demo: Hello OpenMP World
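The slides do not reproduce the demo source, so here is a minimal sketch of what such a "Hello OpenMP World" program might look like. The function name hello_openmp and the serial fallback branch are my own additions; the code compiles with or without OpenMP support, using the _OPENMP preprocessor symbol discussed later in the Runtime Library section.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Every thread of the team executes the structured block and prints
 * its id; returns the team size (1 when compiled without OpenMP). */
int hello_openmp(void)
{
    int nthreads = 1;
    #pragma omp parallel
    {
        #ifdef _OPENMP
        #pragma omp critical
        printf("Hello OpenMP World from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
        #pragma omp single
        nthreads = omp_get_num_threads();
        #else
        printf("Hello OpenMP World from the only thread\n");
        #endif
    }
    return nthreads;
}
```

Compiled with, e.g., gcc -fopenmp and OMP_NUM_THREADS=4, each of the four threads prints one line.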
Demo: Hello orphaned OpenMP World
Starting OpenMP Programs on Linux

- From within a shell, global setting of the number of threads:

    export OMP_NUM_THREADS=4
    ./program

- From within a shell, one-time setting of the number of threads:

    OMP_NUM_THREADS=4 ./program
For Worksharing Construct
For Worksharing

- If only the parallel construct is used, each thread executes the Structured Block.
- Program Speedup: Worksharing
- OpenMP's most common Worksharing construct: for

C/C++:

    int i;
    #pragma omp for
    for (i = 0; i < 100; i++)
    {
        a[i] = b[i] + c[i];
    }

Fortran:

    INTEGER :: i
    !$omp do
    DO i = 0, 99
        a(i) = b(i) + c(i)
    END DO

- Distribution of loop iterations over all threads in a Team.
- Scheduling of the distribution can be influenced.
- Loops often account for most of a program's runtime!
Worksharing illustrated

Here: 4 threads. The serial pseudo-code

    do i = 0, 99
        a(i) = b(i) + c(i)
    end do

is split over the team:

    Thread 1:   do i = 0, 24        Thread 3:   do i = 50, 74
                    a(i) = b(i) + c(i)              a(i) = b(i) + c(i)
                end do                          end do

    Thread 2:   do i = 25, 49       Thread 4:   do i = 75, 99
                    a(i) = b(i) + c(i)              a(i) = b(i) + c(i)
                end do                          end do

All threads work on the same shared arrays A(0..99), B(0..99), C(0..99) in memory.
Demo: Vector Addition
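The vector-addition demo itself is not listed on the slides; a plausible sketch wraps the worked loop from the previous slide in a function (the name vector_add is my own). The iterations are independent, so the for worksharing construct can distribute them safely:

```c
#include <stddef.h>

/* Element-wise vector addition a[i] = b[i] + c[i]; without OpenMP
 * the pragma is ignored and the loop simply runs serially. */
void vector_add(const double *b, const double *c, double *a, int n)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```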
Influencing the For Loop Scheduling

- for construct: OpenMP allows to influence how the iterations are scheduled among the threads of the team, via the schedule clause:
  - schedule(static [, chunk]): Iteration space divided into blocks of chunk size; blocks are assigned to threads in a round-robin fashion. If chunk is not specified: #threads blocks.
  - schedule(dynamic [, chunk]): Iteration space divided into blocks of chunk size (default: 1); blocks are scheduled to threads in the order in which threads finish previous blocks.
  - schedule(guided [, chunk]): Similar to dynamic, but the block size starts with an implementation-defined value, then is decreased exponentially down to chunk.
- Default on most implementations is schedule(static).
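As a small sketch of the clause in use: the function below attaches schedule(static, 4) to a parallel for, so thread 0 gets iterations 0-3, thread 1 gets 4-7, and so on, wrapping around round-robin. The function name sum_of_squares and the chunk size 4 are illustrative choices, not taken from the slides.

```c
/* Sums i*i for i in [0, n) with a static schedule in chunks of 4.
 * The reduction clause (introduced later in the course) makes the
 * accumulation into s safe; serially the pragma is just ignored. */
long sum_of_squares(int n)
{
    long s = 0;
    int i;
    #pragma omp parallel for schedule(static, 4) reduction(+:s)
    for (i = 0; i < n; i++)
        s += (long)i * i;
    return s;
}
```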
Synchronization Overview

- Can all loops be parallelized with for constructs? No!
- Simple test: If the results differ when the code is executed backwards, the loop iterations are not independent. BUT: this test alone is not sufficient:

C/C++:

    int i, s = 0;
    #pragma omp parallel for
    for (i = 0; i < 100; i++)
    {
        s = s + a[i];
    }

- Data Race: If, between two synchronization points, at least one thread writes to a memory location from which at least one other thread reads, the result is not deterministic (race condition).
Synchronization: Critical Region

- A Critical Region is executed by all threads, but by only one thread simultaneously (Mutual Exclusion).

C/C++:

    #pragma omp critical (name)
    {
        ... structured block ...
    }

- Do you think this solution scales well?

C/C++:

    int i, s = 0;
    #pragma omp parallel for
    for (i = 0; i < 100; i++)
    {
        #pragma omp critical
        { s = s + a[i]; }
    }
Data Scoping
Scoping Rules

- Managing the Data Environment is the challenge of OpenMP.
- Scoping in OpenMP: dividing variables into shared and private:
  - private-list and shared-list on Parallel Region
  - private-list and shared-list on Worksharing constructs
  - General default is shared for Parallel Regions, firstprivate for Tasks.
  - Loop control variables on for constructs are private
  - Non-static variables local to Parallel Regions are private
  - private: A new uninitialized instance is created for each thread
    - firstprivate: Initialization with the Master's value
    - lastprivate: Value of the last loop iteration is written back to the Master
  - Static variables are shared
Privatization of Global/Static Variables

- Global / static variables can be privatized with the threadprivate directive
  - One instance is created for each thread
  - Before the first parallel region is encountered
  - Instance exists until the program ends
  - Does not work (well) with nested Parallel Regions
- Based on thread-local storage (TLS)
  - TlsAlloc (Win32 threads), pthread_key_create (POSIX threads), keyword __thread (GNU extension)

C/C++:

    static int i;
    #pragma omp threadprivate(i)

Fortran:

    SAVE INTEGER :: i
    !$omp threadprivate(i)
The Barrier Construct

- OpenMP barrier (implicit or explicit)
  - Threads wait until all threads of the current Team have reached the barrier

C/C++:

    #pragma omp barrier

- All worksharing constructs contain an implicit barrier at the end.
Back to our bad scaling example

C/C++:

    int i, s = 0;
    #pragma omp parallel for
    for (i = 0; i < 100; i++)
    {
        #pragma omp critical
        { s = s + a[i]; }
    }
It's your turn: Make It Scale!

Skeleton to complete:

    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < 99; i++)
        {
            s = s + a[i];
        }
    } // end parallel

Goal (here: 4 threads): the serial loop

    do i = 0, 99
        s = s + a(i)
    end do

should be distributed so that each thread sums one quarter of the iterations (i = 0..24, 25..49, 50..74, 75..99) without the threads racing on s.
The Reduction Clause

- In a reduction operation the operator is applied to all variables in the list. The variables have to be shared.
- reduction(operator:list)
- The result is provided in the associated reduction variable

C/C++:

    int i, s = 0;
    #pragma omp parallel for reduction(+:s)
    for (i = 0; i < 99; i++)
    {
        s = s + a[i];
    }

- Possible reduction operators with initialization value:
  + (0), * (1), - (0), & (~0), | (0), && (1), || (0), ^ (0), min (largest number), max (least number)
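The operator list above includes min and max, which became available as reduction operators for C/C++ with OpenMP 3.1. As a sketch (the function name array_max is my own), each thread's private copy starts at the least representable value and the partial maxima are combined at the end:

```c
/* Finds the largest element of a[0..n-1] with a max reduction
 * (C/C++ min/max reductions require OpenMP 3.1 or newer).
 * Without OpenMP the pragma is ignored and the loop runs serially. */
double array_max(const double *a, int n)
{
    double m = a[0];
    int i;
    #pragma omp parallel for reduction(max:m)
    for (i = 1; i < n; i++)
        if (a[i] > m)
            m = a[i];
    return m;
}
```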
False Sharing

    double s_priv[nthreads];
    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (i = 0; i < 99; i++)
        {
            s_priv[t] += a[i];
        }
    } // end parallel
    for (i = 0; i < nthreads; i++)
    {
        s += s_priv[i];
    }
Data in Caches

- When data is used, it is copied into caches.
- The hardware always copies chunks into the cache, so-called cache lines.
- This is useful when:
  - the data is used frequently (temporal locality)
  - consecutive data is used which is on the same cache line (spatial locality)

[Figure: two cores with on-chip caches connected via a bus to memory]
False Sharing

- False Sharing occurs when
  - different threads use elements of the same cache line
  - one of the threads writes to the cache line
- As a result the cache line is moved back and forth between the threads, although there is no real dependency
- Note: False Sharing is a memory performance problem, not a correctness issue

[Figure: two cores with on-chip caches connected via a bus, bouncing a cache line between them]
False Sharing

- No performance benefit for more threads: MFLOPS stay flat from 1 to 12 threads
- Reason: false sharing of s_priv
- Solution: padding, so that only one variable per cache line is used

[Figure: MFLOPS vs. #threads (1-12) with and without false sharing; layout of s_priv elements across cache lines, standard vs. with padding]
False Sharing avoided

    double s_priv[nthreads * 8];
    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (i = 0; i < 99; i++)
        {
            s_priv[t * 8] += a[i];
        }
    } // end parallel
    for (i = 0; i < nthreads; i++)
    {
        s += s_priv[i * 8];
    }
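An alternative to the index-stride trick above is to pad explicitly with a struct, which makes the intent visible in the type. This is only a sketch under assumptions not stated on the slides: a 64-byte cache line and at most 64 threads; the names padded_sum_t and sum_padded are my own.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* One partial sum per thread, padded to an assumed 64-byte cache
 * line so that no two threads ever write to the same line. */
typedef struct {
    double val;
    char   pad[64 - sizeof(double)];
} padded_sum_t;

double sum_padded(const double *a, int n, int nthreads)
{
    padded_sum_t s_priv[64];     /* assumes nthreads <= 64 */
    double s = 0.0;
    int i, t;

    for (t = 0; t < nthreads; t++)
        s_priv[t].val = 0.0;

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = 0;
        #ifdef _OPENMP
        tid = omp_get_thread_num();
        #endif
        #pragma omp for
        for (i = 0; i < n; i++)
            s_priv[tid].val += a[i];
    }

    for (t = 0; t < nthreads; t++)
        s += s_priv[t].val;
    return s;
}
```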
Example: PI
Example: Pi (1/2)

    π = ∫₀¹ 4 / (1 + x²) dx

    double f(double x)
    {
        return (4.0 / (1.0 + x*x));
    }

    double CalcPi (int n)
    {
        const double fH = 1.0 / (double) n;
        double fSum = 0.0;
        double fX;
        int i;

        #pragma omp parallel for
        for (i = 0; i < n; i++)
        {
            fX = fH * ((double)i + 0.5);
            fSum += f(fX);
        }
        return fH * fSum;
    }

As written, this version has a data race: fX and fSum are shared among the threads.
Example: Pi (1/2)

The corrected version privatizes fX and the loop variable and uses a reduction for fSum:

    double f(double x)
    {
        return (4.0 / (1.0 + x*x));
    }

    double CalcPi (int n)
    {
        const double fH = 1.0 / (double) n;
        double fSum = 0.0;
        double fX;
        int i;

        #pragma omp parallel for private(fX,i) reduction(+:fSum)
        for (i = 0; i < n; i++)
        {
            fX = fH * ((double)i + 0.5);
            fSum += f(fX);
        }
        return fH * fSum;
    }
Example: Pi (2/2)

- Results:

    # Threads | Runtime [sec.] | Speedup
    --------- | -------------- | -------
    1         | 1.11           | 1.00
    2         |                |
    4         |                |
    8         | 0.14           | 7.93

- Scalability is pretty good:
  - About 100% of the runtime has been parallelized.
  - As there is just one parallel region, there is virtually no overhead introduced by the parallelization.
  - Problem is parallelizable in a trivial fashion ...
Single and Master Construct
The Single Construct

C/C++:

    #pragma omp single [clause]
    ... structured block ...

Fortran:

    !$omp single [clause]
    ... structured block ...
    !$omp end single

- The single construct specifies that the enclosed structured block is executed by only one thread of the team.
- It is up to the runtime which thread that is.
- Useful for:
  - I/O
  - Memory allocation and deallocation, etc. (in general: setup work)
  - Implementation of the single-creator parallel-executor pattern, as we will see now ...
The Master Construct

C/C++:

    #pragma omp master [clause]
    ... structured block ...

Fortran:

    !$omp master [clause]
    ... structured block ...
    !$omp end master

- The master construct specifies that the enclosed structured block is executed only by the master thread of a team.
- Note: The master construct is no worksharing construct and does not contain an implicit barrier at the end.
Section and Ordered Construct
How to parallelize a Tree Traversal?

- How would you parallelize this code?

    void traverse (Tree *tree)
    {
        if (tree->left)  traverse(tree->left);
        if (tree->right) traverse(tree->right);
        process(tree);
    }

- One option: use OpenMP's parallel sections.
The Sections Construct

C/C++:

    #pragma omp sections [clause]
    {
        #pragma omp section
        ... structured block ...
        #pragma omp section
        ... structured block ...
        ...
    }

Fortran:

    !$omp sections [clause]
    !$omp section
    ... structured block ...
    !$omp section
    ... structured block ...
    ...
    !$omp end sections

- The sections construct contains a set of structured blocks that are to be distributed among and executed by the team of threads.
How to parallelize a Tree Traversal?!

- With nested Parallel Regions:

    void traverse (Tree *tree)
    {
        #pragma omp parallel sections   // nested Parallel Regions
        {
            #pragma omp section
            if (tree->left)  traverse(tree->left);
            #pragma omp section
            if (tree->right) traverse(tree->right);
        } // end omp parallel -- barrier here!
        process(tree);
    }

- Downsides of this option:
  - Unnecessary overhead and synchronization points
  - Not always well supported (how many threads to be used?)
The ordered Construct

- Allows to execute a structured block within a parallel loop in sequential order
- In addition, an ordered clause has to be added to the for construct in which any ordered construct may occur

    #pragma omp parallel for ordered
    for (i = 0; i < 10; i++) {
        ...
        #pragma omp ordered
        {
            ...
        }
        ...
    }

- Use cases:
  - Can be used, e.g., to enforce ordering on the printing of data
  - May help to determine whether there is a data race
Runtime Library

- C and C++:
  - If OpenMP is enabled during compilation, the preprocessor symbol _OPENMP is defined. To use the OpenMP runtime library, the header omp.h has to be included.
  - omp_set_num_threads(int): The specified number of threads will be used for the parallel region encountered next.
  - int omp_get_num_threads(): Returns the number of threads in the current team.
  - int omp_get_thread_num(): Returns the number of the calling thread in the team; the Master always has id 0.
- Additional functions are available, e.g., to provide locking functionality.
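The _OPENMP symbol makes it possible to write code that also compiles without OpenMP. A sketch (the function name query_team_size and the serial fallback stubs are my own; the fallbacks return the values the API prescribes outside a parallel region):

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks with the values the runtime functions return
 * in a program with a single thread. */
static int  omp_get_num_threads(void)  { return 1; }
static int  omp_get_thread_num(void)   { return 0; }
static void omp_set_num_threads(int n) { (void)n; }
#endif

/* Requests four threads, then queries the actual team size from
 * inside the parallel region; returns 1 when OpenMP is disabled. */
int query_team_size(void)
{
    int nthreads = 1;
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        #pragma omp master
        nthreads = omp_get_num_threads();
    }
    return nthreads;
}
```

Note that the runtime may deliver fewer threads than requested, so the return value should be treated as a query result, not a guarantee.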
Tasking
Recursive approach to compute Fibonacci

    int main(int argc, char* argv[])
    {
        [...]
        fib(input);
        [...]
    }

    int fib(int n) {
        if (n < 2) return n;
        int x = fib(n - 1);
        int y = fib(n - 2);
        return x + y;
    }

- On the following slides we will discuss three approaches to parallelize this recursive code with Tasking.
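For reference, this serial recursion can be kept around unchanged; the omp-v3 variant later falls back to exactly this function (called serfib there) below a cut-off. The name fib_serial is my own:

```c
/* Plain serial Fibonacci recursion -- the baseline that the three
 * tasking versions parallelize, and the cut-off fallback in omp-v3. */
int fib_serial(int n)
{
    if (n < 2) return n;
    return fib_serial(n - 1) + fib_serial(n - 2);
}
```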
The Task Construct

C/C++:

    #pragma omp task [clause]
    ... structured block ...

Fortran:

    !$omp task [clause]
    ... structured block ...
    !$omp end task

- Each encountering thread/task creates a new Task
  - Code and data is being packaged up
  - Tasks can be nested
    - Into another Task directive
    - Into a Worksharing construct
- Data scoping clauses:
  - shared(list)
  - private(list), firstprivate(list)
  - default(shared | none)
Tasks in OpenMP: Data Scoping

- Some rules from Parallel Regions apply:
  - Static and global variables are shared
  - Automatic storage (local) variables are private
- If shared scoping is not derived by default:
  - Orphaned Task variables are firstprivate by default!
  - Non-orphaned Task variables inherit the shared attribute!
  - Variables are firstprivate unless shared in the enclosing context
First version parallelized with Tasking (omp-v1)

    int main(int argc, char* argv[])
    {
        [...]
        #pragma omp parallel
        {
            #pragma omp single
            {
                fib(input);
            }
        }
        [...]
    }

    int fib(int n) {
        if (n < 2) return n;
        int x, y;
        #pragma omp task shared(x)
        {
            x = fib(n - 1);
        }
        #pragma omp task shared(y)
        {
            y = fib(n - 2);
        }
        #pragma omp taskwait
        return x + y;
    }

- Only one Task / Thread enters fib() from main(); it is responsible for creating the two initial work tasks
- taskwait is required, as otherwise x and y would be lost
Fibonacci Illustration

- T1 enters fib(4)
- T1 creates tasks for fib(3) and fib(2)
- T1 and T2 execute tasks from the queue
- T1 and T2 create 4 new tasks
- T1 - T4 execute tasks
- ...

[Figure: call tree of fib(4); the task queue successively holds fib(3), fib(2), then fib(2), fib(1), fib(1), fib(0)]
Scalability measurements (1/3)

- Overhead of task creation prevents better scalability!

[Figure: Speedup of Fibonacci with Tasks over 1, 2, 4, 8 threads; omp-v1 stays far below the optimal speedup]
if Clause

- If the expression of an if clause on a task evaluates to false:
  - The encountering task is suspended
  - The new task is executed immediately
  - The parent task resumes when the new task finishes
- Used for optimization, e.g., to avoid the creation of small tasks
Improved parallelization with Tasking (omp-v2)

- Improvement: don't create yet another task once a certain (small enough) n is reached

    int main(int argc, char* argv[])
    {
        [...]
        #pragma omp parallel
        {
            #pragma omp single
            {
                fib(input);
            }
        }
        [...]
    }

    int fib(int n) {
        if (n < 2) return n;
        int x, y;
        #pragma omp task shared(x) if(n > 30)
        {
            x = fib(n - 1);
        }
        #pragma omp task shared(y) if(n > 30)
        {
            y = fib(n - 2);
        }
        #pragma omp taskwait
        return x + y;
    }
Scalability measurements (2/3)

- Speedup is ok, but we still have some overhead when running with 4 or 8 threads

[Figure: Speedup of Fibonacci with Tasks over 1, 2, 4, 8 threads; omp-v2 close to optimal, omp-v1 far below]
Improved parallelization with Tasking (omp-v3)

- Improvement: skip the OpenMP overhead once a certain n is reached (no issue w/ production compilers)

    int main(int argc, char* argv[])
    {
        [...]
        #pragma omp parallel
        {
            #pragma omp single
            {
                fib(input);
            }
        }
        [...]
    }

    int fib(int n) {
        if (n < 2) return n;
        if (n <= 30)
            return serfib(n);
        int x, y;
        #pragma omp task shared(x)
        {
            x = fib(n - 1);
        }
        #pragma omp task shared(y)
        {
            y = fib(n - 2);
        }
        #pragma omp taskwait
        return x + y;
    }
Scalability measurements (3/3)

- Everything ok now.

[Figure: Speedup of Fibonacci with Tasks over 1, 2, 4, 8 threads; omp-v3 matches the optimal speedup]
Data Scoping Example (1/7)

    int a = 1;
    void foo()
    {
        int b = 2, c = 3;
        #pragma omp parallel shared(b)
        #pragma omp parallel private(b)
        {
            int d = 4;
            #pragma omp task
            {
                int e = 5;

                // Scope of a:
                // Scope of b:
                // Scope of c:
                // Scope of d:
                // Scope of e:
    } } }
Data Scoping Example (7/7)

Hint: Use default(none) to be forced to think about every variable if you do not see it clearly.

    int a = 1;
    void foo()
    {
        int b = 2, c = 3;
        #pragma omp parallel shared(b)
        #pragma omp parallel private(b)
        {
            int d = 4;
            #pragma omp task
            {
                int e = 5;

                // Scope of a: shared,       value of a: 1
                // Scope of b: firstprivate, value of b: 0 / undefined
                // Scope of c: shared,       value of c: 3
                // Scope of d: firstprivate, value of d: 4
                // Scope of e: private,      value of e: 5
    } } }
The Barrier and Taskwait Constructs

- OpenMP barrier (implicit or explicit)
  - All tasks created by any thread of the current Team are guaranteed to be completed at barrier exit

C/C++:

    #pragma omp barrier

- Task barrier: taskwait
  - Encountering Task suspends until child tasks are complete
  - Only direct children, not descendants!

C/C++:

    #pragma omp taskwait
Task Synchronization

- Task Synchronization explained:

    #pragma omp parallel num_threads(np)
    {
        #pragma omp task        // np Tasks created here, one for each thread
        function_A();
        #pragma omp barrier     // all Tasks guaranteed to be completed here
        #pragma omp single
        {
            #pragma omp task    // 1 Task created here
            function_B();
        }                       // B-Task guaranteed to be completed here
    }
Questions?
