
The OpenMP Common Core

The Fortran Supplement (Version 1.1)

Timothy G. Mattson, Yun (Helen) He, and Alice E. Koniges

The MIT Press


Cambridge, Massachusetts
London, England

© 2020 Timothy G. Mattson, Yun (Helen) He, and Alice E. Koniges

All rights reserved.

This book was set in LaTeX by the authors.


Contents

Preface to the Fortran supplement v

1 Parallel Computing 1

2 The Language of Performance 5

3 What is OpenMP? 7

4 Threads and the OpenMP Programming Model 9

5 Parallel Loops 23

6 OpenMP Data Environment 33

7 Tasks in OpenMP 47

8 OpenMP Memory Model 67

9 Common Core Recap 73

10 Multithreading beyond the Common Core 75

11 Synchronization and the OpenMP Memory Model 87

12 Beyond OpenMP Common Core Hardware 95

13 Your Continuing Education in OpenMP 111


Preface to the Fortran supplement

OpenMP is defined for C, C++, and Fortran. Ideally, when you write a book about
OpenMP, everything is covered in triplicate: once for each of these programming
languages. If that were done, however, the resulting book would be cluttered and
painful to read. What’s a poor author to do?
After struggling with the language problem for many moons, we came upon what
we hope is an effective compromise. Our book on the OpenMP Common Core covers
C and Fortran. Every time we present an item from the OpenMP API, we define it
in both C and Fortran. The code discussed in the book, however, only addresses C.
The result is a book tightly woven around a set of C examples, free from the clutter
of replicated content from other languages.
For C++ programmers, this solution works quite well. While this may offend a “proper”
C++ programmer, you can think of C++ as a superset of C. With few exceptions,
if you move OpenMP for C into C++, things just work. We thought this was an
adequate solution for Fortran programmers as well since surely modern Fortran
programmers understand C. Based on informal surveys at numerous OpenMP
tutorials, however, we’ve learned that this assumption is not universally true. Many
Fortran programmers are not comfortable with C. The authors of the OpenMP
Common Core started life as Fortran programmers. We love Fortran and would
hate to leave our fellow “Formula Translation” buddies out in the cold without the
benefit of our excellent book.
We came up with a simple solution to this problem. We produced a supplement
to our book that presents every example from the book implemented in Fortran.
Our Fortran friends would buy the OpenMP Common Core book and download this
free supplement. Having the two side by side, Fortran programmers could easily
absorb the contents of our book and apply what they have learned to their own
Fortran programs.
There is one technical complication to this solution. When presenting OpenMP
to C and C++ programmers, we delay introduction of clauses that manipulate the
data environment. There is so much to grapple with when learning multithreaded
programming. It greatly simplifies the discussion if we don’t move beyond the
default rules for data sharing until much later.
For Fortran programmers however, this is not possible. A C programmer can
declare a new variable almost anywhere. Fortran, on the other hand, requires that
all variables are declared before any executable code. Therefore, to discuss OpenMP
with Fortran programmers, we need to present one of the clauses from Chapter 6
(OpenMP Data Environment) right from the beginning. This clause is the private
clause.

The private(list) clause takes a comma-separated list of variables as an
argument. Each of the variables in the list is declared earlier (in the declarative
section of the program). We call those the original variables. When the private
clause is used with a construct that creates threads, each thread allocates a variable
for each of the original variables. These new variables are local or “private” to
each thread. They are uninitialized, so inside the code executed by the
OpenMP threads, you must initialize all private variables before using them.
There are numerous details with the private clause, but we leave those to
Chapter 6. To get started with the OpenMP Common Core, you only need to use
the private clause with simple scalar variables. Once you know about this clause,
its use is almost self-evident. You’ll see when you get to its first use in Figure 4.2;
a minimal sketch of the idea appears below.
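As a quick preview, here is a minimal sketch of the private clause in action (our
own illustration, not a figure from the book). Each thread gets its own copy of id,
which is uninitialized and must be assigned before use:

program private_sketch
   use omp_lib
   implicit none
   integer :: id   ! the original variable

   !$omp parallel private(id)
   ! each thread works with its own, uninitialized copy of id
   id = omp_get_thread_num()
   print *, 'hello from thread ', id
   !$omp end parallel
end program private_sketch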
With that small technical detail out of the way, we return to our discussion of
the Fortran Supplement to our book on the OpenMP Common Core. Production of
the Fortran Supplement took far more time than we expected. As is often the case
when writing a book, we were so fixated on the text that we did not organize all the
code we used in one place. So one of us (Helen) extracted each and every fragment
of C code from the book. She then converted each of them to Fortran. While she
tested the resulting code, another member of the team (Tim) independently verified
all of her code. He then made a copy of the LaTeX source for the OpenMP
Common Core book and modified it to include just the figures-with-source-code and
code-embedded-in-text, replacing the C code with Helen’s Fortran code.
The result is this document. With Fortran versions for all the code from our
OpenMP Common Core book, it should make our book useful to Fortran programmers.
We have mirrored the chapter structure of the Common Core book with
section headings that call out:

• The figure number for programs presented in the OpenMP Common Core book

• The page number from the OpenMP Common Core book where code-embedded-in-text is found.

We hope you find our solution to the multi-language problem in OpenMP useful.
We really want our Fortran readers to benefit from our wonderful little book on the
OpenMP Common Core.

Acknowledgments

When we first started teaching OpenMP, Scientific Computing was almost exclusively
centered on Fortran. Even if programmers used a different language, they were
generally comfortable reading Fortran. Over the years, the dominance of Fortran has
slipped, only to be replaced by C. In response, our materials for teaching OpenMP
shifted, taking us to a place where we largely ignored Fortran.
This situation is changing. We are moving Fortran back into the core materials we
use for teaching OpenMP. We can only do this, however, with help from the OpenMP
Fortran community. We all still use Fortran and are comfortable with the language,
but we need an expert who is up to date with the latest developments in Fortran
to check our work and make sure it is correct. Henry Jin (NASA Ames Research
Center) played the “expert Fortran reviewer” role for us. He found numerous errors,
both large and small, in our code. We are grateful to Henry for the many hours he
spent reviewing our code and making sure that this Fortran Supplement is of the
highest quality.
1 Parallel Computing

Figure 1.1: Our first OpenMP Program

A simple “Hello World” program to demonstrate concurrent execution.


program hello_world
   use omp_lib
   !$omp parallel
   write (*, '(a)', advance='no') 'Hello '
   write (*, '(a)') 'World'
   !$omp end parallel
end program hello_world

Figure 1.5: A pthreads “Hello World” program

A “Hello World” program where a C function using Pthreads is called from a
Fortran program.
! This is a Fortran wrapper to call a C Pthreads function.
! Save the contents in 2 files as below.

! To compile:
!   gfortran -c Fig_1.5_PosixHello.f90
!   gcc -c Fig_1.5_PosixHello_external.c
!   gfortran Fig_1.5_PosixHello.o Fig_1.5_PosixHello_external.o

! File 1: "Fig_1.5_PosixHello.f90"

program main
   implicit none
   external :: pthreads_c
   call pthreads_c()
end program main

! File 2: "Fig_1.5_PosixHello_external.c"

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 4

void *PrintHelloWorld(void *InputArg)
{
   printf("Hello ");
   printf("World\n");
   return NULL;
}

/* the trailing underscore matches gfortran's default name
   mangling for the Fortran call to pthreads_c() */
void pthreads_c_()
{
   pthread_t threads[NUM_THREADS];
   int id;
   pthread_attr_t attr;

   pthread_attr_init(&attr);
   pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

   for (id = 0; id < NUM_THREADS; id++) {
      pthread_create(&threads[id], &attr, PrintHelloWorld, NULL);
   }

   for (id = 0; id < NUM_THREADS; id++) {
      pthread_join(threads[id], NULL);
   }

   pthread_attr_destroy(&attr);
   pthread_exit(NULL);
}
2 The Language of Performance

In this chapter, we focus on the words we use when talking about the performance
of a parallel program. These concepts are language independent. Hence, there is no
C, C++, or Fortran code in this chapter.
3 What is OpenMP?

Table 3.1: The contents of the OpenMP Common Core

A structured block is implied between any directive and its “end directive” form.
Square brackets ([ ]) denote optional items.

Directives, subprograms, clauses,             Description
and an environment variable
------------------------------------------------------------------------------------
!$OMP parallel                                Create a team of threads to
!$OMP end parallel                            execute a structured block of code

integer function omp_get_num_threads()       Number of threads in a team (N)

integer function omp_get_thread_num()        Thread ID (from 0 to (N-1))

subroutine omp_set_num_threads(numthrds)     Set default number of threads to request
   integer numthrds                          when creating a team of threads

double precision function omp_get_wtime()   Wall clock time

export OMP_NUM_THREADS=N                     Environment variable: number of threads

!$OMP barrier                                 Wait for all threads in the team

!$OMP critical                                Mutual exclusion synchronization
!$OMP end critical

!$OMP do                                      Divide a loop's work between the team
[!$OMP end do]

!$OMP parallel do                             Create a team then share work of a
[!$OMP end parallel do]                       loop across the team

reduction(op: list)                           Reduction across a team

schedule(static [,chunk])                     Fixed loop-distribution at compile time

schedule(dynamic [,chunk])                    Loop-distribution varies at runtime

private(list)                                 Create variable local to each thread/task

firstprivate(list)                            Create and initialize a private variable

shared(list)                                  Share a variable between threads/tasks

default(none)                                 Force explicit storage attribute definitions

nowait                                        Disable implied barriers

!$OMP single                                  Workshare work so it is done
!$OMP end single                              by a single thread

!$OMP task                                    Create an explicit task
!$OMP end task

!$OMP taskwait                                Wait for tasks to complete

4 Threads and the OpenMP Programming Model

Figure 4.2: Shared data, private data, and parallel regions

Data movement and parallel regions – This simple program sets the default
number of threads to request for a parallel region to 4. A parallel region is defined with a
private clause so each thread has its own copy of ID. The thread ID is set and a simple
subroutine is called. Key points from this program: (1) all the threads independently
execute the same block of code in this parallel region, (2) all threads have access to the
array declared prior to the parallel region, and (3) each thread has its own, private copy of
the integer ID.

Program parReg
   use omp_lib
   implicit none

   real :: A(10)
   integer :: ID

   A = 0
   call omp_set_num_threads(4)

   !$omp parallel private(ID)
   ID = omp_get_thread_num() + 1
   call pooh(ID, A)
   write (*, 100) ID, A(ID)
100 format ("A of ID(", I3, ")=", f10.4)
   !$omp end parallel

contains

   subroutine pooh(ID, A)
      integer :: ID
      real, dimension(:) :: A
      A(ID) = ID
   end subroutine pooh

end Program parReg

Figure 4.3: Thread counts and IDs

Library routines to manage threads – This program shows how to set the default
number of threads to request in parallel regions, query the number of threads in a team,
and set a unique thread ID. Notice the care taken to avoid a data race when assigning to
size_of_team.
Program parReg1
   use omp_lib
   implicit none

   integer :: ID, size_of_team, NThrds
   call omp_set_num_threads(4)
   !$omp parallel private(ID, NThrds)
   ID = omp_get_thread_num()
   NThrds = omp_get_num_threads()
   if (ID == 0) size_of_team = NThrds
   !$omp end parallel
   print *, "We just did the join on a team of size ", size_of_team
end Program parReg1

Figure 4.5: The Pi program

Serial program to numerically estimate a definite integral using the midpoint
rule – The loop iterations are independent other than the summation into sum.
PROGRAM MAIN

   USE OMP_LIB
   IMPLICIT NONE

   INTEGER :: i
   INTEGER, PARAMETER :: num_steps = 100000000
   REAL*8 :: x, pi, sum, step
   REAL*8 :: start_time, run_time

   sum = 0.0

   step = 1.0 / num_steps
   start_time = OMP_GET_WTIME()

   DO i = 1, num_steps
      x = (i - 0.5) * step
      sum = sum + 4.0 / (1.0 + x * x)
   ENDDO

   pi = step * sum
   run_time = OMP_GET_WTIME() - start_time

   WRITE (*, 100) pi, num_steps, run_time
100 FORMAT('pi = ', f15.8, ',', i14, ' steps,', f8.3, ' secs')

END PROGRAM MAIN

Page 59: Loop level parallelism with the SPMD pattern

integer :: i
ID = omp_get_thread_num()
numthreads = omp_get_num_threads()

do i = ID + 1, num_steps, numthreads
! body of the loop
end do

Figure 4.6: SPMD parallel Pi program

SPMD parallel numerical integration with cyclic distribution of the loop
iterations – The program computes the integral by filling the area under the curve
defined by the integrand with rectangles and summing their areas. This version
of the program promotes the accumulation variable sum to an array and uses a cyclic
distribution of loop iterations between threads.

PROGRAM MAIN
   USE OMP_LIB
   IMPLICIT NONE

   INTEGER, PARAMETER :: MAX_THREADS = 4
   INTEGER :: i, j, id, numthreads, nthreads
   INTEGER, PARAMETER :: num_steps = 100000000
   REAL*8 :: pi, real_sum, step, x
   REAL*8 :: start_time, run_time
   REAL*8 :: sum(0:MAX_THREADS-1)

   step = 1.0 / num_steps

   CALL OMP_SET_NUM_THREADS(MAX_THREADS)
   start_time = omp_get_wtime()

   !$OMP PARALLEL PRIVATE(id, x, numthreads)
   id = omp_get_thread_num()
   numthreads = OMP_GET_NUM_THREADS()
   sum(id) = 0.0

   IF (id == 0) THEN
      nthreads = numthreads
   ENDIF

   DO i = id, num_steps - 1, numthreads
      x = (i + 0.5) * step
      sum(id) = sum(id) + 4.0 / (1.0 + x * x)
   ENDDO
   !$OMP END PARALLEL

   pi = 0.0
   DO i = 0, nthreads - 1
      pi = pi + sum(i)
   ENDDO

   pi = step * pi
   run_time = OMP_GET_WTIME() - start_time
   WRITE (*, 100) pi, num_steps, run_time
100 FORMAT('pi = ', f15.8, ',', i14, ' steps,', f8.3, ' secs')

END PROGRAM MAIN

Figure 4.7: SPMD Pi program with block-distribution of loop iterations

SPMD parallel numerical integration with block decomposition of the
loop iterations – The parallel region from the code in Figure 4.6, but replacing the
cyclic distribution of loop iterations with a block distribution.
step = 1.0 / num_steps

!$OMP PARALLEL PRIVATE(id, x, numthreads, istart, iend)
id = omp_get_thread_num()
numthreads = OMP_GET_NUM_THREADS()
sum(id) = 0.0

IF (id == 0) THEN
   nthreads = numthreads
ENDIF

istart = id * num_steps / numthreads + 1
iend = (id + 1) * num_steps / numthreads
if (id == (numthreads - 1)) iend = num_steps

DO i = istart, iend
   x = (i - 0.5) * step
   sum(id) = sum(id) + 4.0 / (1.0 + x * x)
ENDDO
!$OMP END PARALLEL

Figure 4.9: Remove false sharing with a padded array

Padded sum array numerical integration – The sum array is padded to fill an L1
cache line with the extra dimension, putting subsequent rows of sum, i.e., each sum(1,id),
on different cache lines.

PROGRAM MAIN
   USE OMP_LIB
   IMPLICIT NONE

   INTEGER :: i, j, id, numthreads, nthreads
   INTEGER, PARAMETER :: num_steps = 100000000
   INTEGER, PARAMETER :: MAX_THREADS = 4
   INTEGER, PARAMETER :: CBLK = 8
   REAL*8 :: pi, step, x
   REAL*8 :: start_time, run_time
   REAL*8 :: sum(CBLK, 0:MAX_THREADS-1)

   step = 1.0 / num_steps

   CALL OMP_SET_NUM_THREADS(MAX_THREADS)
   start_time = omp_get_wtime()

   !$OMP PARALLEL PRIVATE(id, x, numthreads)
   id = omp_get_thread_num()
   numthreads = OMP_GET_NUM_THREADS()
   sum(1, id) = 0.0

   IF (id == 0) THEN
      nthreads = numthreads
   ENDIF

   DO i = id, num_steps - 1, numthreads
      x = (i + 0.5) * step
      sum(1, id) = sum(1, id) + 4.0 / (1.0 + x * x)
   ENDDO
   !$OMP END PARALLEL

   pi = 0.0
   DO i = 0, nthreads - 1
      pi = pi + sum(1, i)
   ENDDO

   pi = step * pi
   run_time = OMP_GET_WTIME() - start_time
   WRITE (*, 100) pi, run_time, nthreads
100 FORMAT('pi is ', f15.8, ' in ', f8.3, ' secs and ', i3, ' threads')

END PROGRAM MAIN

Figure 4.10: Mutual exclusion synchronization with critical

A critical section – consume() must be called by one thread at a time.


! sample compile command to generate the *.o object file:
! gfortran -fopenmp -c Fig_4.10_crit.f90

program crit
   use omp_lib
   implicit none
   real :: res = 0.0
   integer :: niters = 1000
   real :: B
   integer :: i, id, nthrds

   interface
      function big_job(i)
         real :: big_job
         integer, intent(in) :: i
      end function big_job

      function consume(a)
         real :: consume
         real, intent(in) :: a
      end function consume
   end interface

   !$omp parallel private(id, nthrds, B)
   id = omp_get_thread_num()
   nthrds = omp_get_num_threads()
   do i = id, niters - 1, nthrds
      B = big_job(i)
      !$omp critical
      res = res + consume(B)
      !$omp end critical
   end do
   !$omp end parallel
end program crit

Figure 4.11: Pi program with a Critical section

Numerical integration with a critical section – The partial sums go into a
private variable allocated by each thread. These private variables are extremely unlikely to
reside on the same L1 cache line and therefore, there will be no false sharing. The partial
sums are combined inside a critical section so there is no data race.
PROGRAM MAIN
   USE OMP_LIB
   IMPLICIT NONE

   INTEGER :: i, j, id, numthreads, nthreads
   INTEGER, PARAMETER :: num_steps = 100000000
   INTEGER, PARAMETER :: MAX_THREADS = 4
   REAL*8 :: pi, real_sum, step, full_sum, x
   REAL*8 :: start_time, run_time
   REAL*8 :: partial_sum

   full_sum = 0.0

   step = 1.0 / num_steps

   CALL OMP_SET_NUM_THREADS(MAX_THREADS)
   full_sum = 0.0
   start_time = OMP_GET_WTIME()

   !$OMP PARALLEL PRIVATE(i, id, numthreads, partial_sum, x)
   id = OMP_GET_THREAD_NUM()
   numthreads = OMP_GET_NUM_THREADS()
   partial_sum = 0.0

   if (id == 0) nthreads = numthreads

   DO i = id, num_steps - 1, numthreads
      x = (i + 0.5) * step
      partial_sum = partial_sum + 4.0 / (1.0 + x * x)
   ENDDO

   !$OMP CRITICAL
   full_sum = full_sum + partial_sum
   !$OMP END CRITICAL

   !$OMP END PARALLEL

   pi = step * full_sum
   run_time = OMP_GET_WTIME() - start_time
   WRITE (*, 100) pi, run_time, nthreads
100 FORMAT('pi is ', f15.8, ' in ', f8.3, ' secs and ', i3, ' threads')

END PROGRAM MAIN

Figure 4.12: Barrier synchronization

Example of an explicit barrier – An explicit barrier is used to assure that all threads
complete filling the array Arr before using it to compute Brr. We assume the SPMD
pattern, so we pass the thread ID and the number of threads to all the functions. Notice
that only one thread saves the number of threads to a shared variable should it be needed
after the parallel region.

real*8 :: Arr(8), Brr(8)
integer :: numthrds
integer :: id, nthrds
real*8, external :: lots_of_work, needs_all_of_Arr

call omp_set_num_threads(8)
!$omp parallel private(id, nthrds)
id = omp_get_thread_num() + 1
nthrds = omp_get_num_threads()
if (id == 1) numthrds = nthrds
Arr(id) = lots_of_work(id, nthrds)
!$omp barrier
Brr(id) = needs_all_of_Arr(id, nthrds, Arr)
!$omp end parallel
5 Parallel Loops

Page 75: Basic vector addition loop

do i = 1, N
a(i) = a(i) + b(i)
end do

Figure 5.1: SPMD pattern for loop-level parallelism

SPMD parallel vector add program – Create a team of threads and assign one
chunk of loop iterations to each thread.

! OpenMP parallel region and SPMD pattern

integer :: id, i, Nthrds, istart, iend

!$omp parallel private(id, i, istart, iend, Nthrds)
id = omp_get_thread_num()
Nthrds = omp_get_num_threads()
istart = id * N / Nthrds + 1
iend = (id + 1) * N / Nthrds
if (id == Nthrds - 1) iend = N
do i = istart, iend
   a(i) = a(i) + b(i)
end do
!$omp end parallel

Page 76: Worksharing-loop directives

!$omp do
...
!$omp end do

Figure 5.2: The worksharing-loop construct

Loop-level parallelism for the vector add program – We create a team of
threads and then add a single directive to split up loop iterations among threads.

! OpenMP parallel region and a worksharing-loop construct

!$omp parallel
!$omp do
do i = 1, N
   a(i) = a(i) + b(i)
end do
!$omp end do
!$omp end parallel

Page 77: Standard do loop format to use with a worksharing-loop construct

do i = init, end, incr
   structured block
end do

Figure 5.3: Parallel and worksharing-loop constructs

An example of a parallel worksharing-loop construct – Create multiple threads,
then split the loop iterations among multiple threads to share the work.
!$omp parallel
!$omp do
do i = N, 0, -2
   call NEAT_STUFF(i)
enddo
!$omp end do
!$omp end parallel

Page 79: Combined parallel worksharing-loop construct

The following pattern with a pair of OpenMP constructs, one to create the team of
threads and the other to split up loop iterations among threads, is very common:

!$omp parallel
!$omp do
do-loop
!$omp end do
!$omp end parallel

As a convenience, these two directives can be combined into a single directive:

!$omp parallel do
do-loop
!$omp end parallel do

Figure 5.4: A simple program with a reduction

A serial reduction example – This loop has a loop-carried dependence through
the variable ave and therefore, the loop cannot be parallelized with a worksharing-loop
directive without completely changing the body of the loop.
integer :: i
real*8 :: ave, A(N)

call Init(A, N)
ave = 0.0

do i = 1, N
   ave = ave + A(i)
enddo
ave = ave / N

Figure 5.5: Worksharing-loop with a reduction

An OpenMP reduction – Each thread has a private copy of the variable ave to use
for its loop iterations. At the end of the loop, these values are combined to create the final
value of the reduction, which is then combined with the globally visible, shared copy of the
variable ave.
integer :: i
real*8 :: ave, A(N)

call Init(A, N)
ave = 0.0

!$omp parallel do reduction(+: ave)
do i = 1, N
   ave = ave + A(i)
enddo
!$omp end parallel do

ave = ave / N

Figure 5.6: Loop schedules specified “at compile time”

A worksharing-loop with a static schedule – In this example, the work is
predictable and balanced for each loop index. Using the static schedule is expected to work
best in this case.

program main
   use omp_lib
   implicit none

   ! Use a smaller value of ITER if available memory is too small
   integer, parameter :: ITER = 100000000
   integer :: i, id
   real*8 :: A(ITER)
   real*8 :: tdata
   real :: x

   do i = 1, ITER
      A(i) = 2.0 * i
   enddo

   !$omp parallel private(id, tdata, x)

   id = omp_get_thread_num()
   tdata = omp_get_wtime()

   !$omp do schedule(static)
   do i = 1, ITER
      x = i * 1.0
      A(i) = A(i) * sqrt(x) / (sin(x) ** tan(x))
   enddo

   tdata = omp_get_wtime() - tdata

   if (id == 0) print *, "Time spent is ", tdata, " sec"

   !$omp end parallel
end program main

Figure 5.7: Dynamic loop schedules that vary at runtime

A worksharing-loop with a dynamic schedule – In this program, the work per
iteration is highly variable. The dynamic schedule should be much better at balancing the
load across the team of threads.

program main

   use omp_lib
   implicit none

   ! Use a smaller value of ITER if available memory is too small
   integer, parameter :: ITER = 50000000
   integer :: i, id
   real*8 :: tdata
   integer :: sum = 0

   !$omp parallel private(i, id, tdata)
   id = omp_get_thread_num()
   tdata = omp_get_wtime()

   !$omp do reduction(+:sum) schedule(dynamic)
   do i = 2, ITER
      if (check_prime(i) == 1) sum = sum + 1
   enddo
   !$omp end do

   tdata = omp_get_wtime() - tdata

   if (id == 0) print *, "Number of prime numbers is ", &
      & sum, " in ", tdata, " sec"
   !$omp end parallel

contains
   integer function check_prime(num)
      implicit none
      integer, intent(in) :: num
      integer :: i, iend

      iend = int(sqrt(num * 1.0))
      do i = 2, iend
         if (mod(num, i) == 0) then
            check_prime = 0
            return
         endif
      enddo

      check_prime = 1

   end function check_prime

end program main

Figure 5.8: Using nowait to disable implied barriers

Using a nowait clause with worksharing-loops – In this example, we explore
the need for barriers and cases where they can be disabled with a nowait clause.

real*8 :: A(big), B(big), C(big)
integer :: id

!$omp parallel private(id)
id = omp_get_thread_num() + 1
A(id) = big_calc1(id)

!$omp barrier

!$omp do
do i = 1, N
   B(i) = big_calc2(C, i)
end do
!$omp end do nowait

A(id) = big_calc4(id)
!$omp end parallel

Figure 5.9: A particularly simple parallel Pi program

Pi program with a worksharing-loop and a reduction – The program computes
the integral of a function by filling the area under a curve with rectangles and summing
their areas. Loop iterations are divided among threads by the compiler under direction
of the worksharing-loop construct. The reduction creates a private copy of sum for each
thread, initializes it to zero, accumulates partial sums into the sum variable, and then
combines partial sums to generate the global sum.

PROGRAM MAIN
   USE OMP_LIB
   IMPLICIT NONE

   INTEGER :: i, id
   INTEGER, PARAMETER :: num_steps = 100000000
   INTEGER :: NTHREADS = 4
   REAL*8 :: x, pi, sum, step
   REAL*8 :: start_time, run_time

   sum = 0.0
   step = 1.0 / num_steps
   start_time = OMP_GET_WTIME()

   CALL OMP_SET_NUM_THREADS(NTHREADS)

   !$OMP PARALLEL PRIVATE(i, x)
   !$OMP DO REDUCTION(+:sum)
   DO i = 1, num_steps
      x = (i - 0.5) * step
      sum = sum + 4.0 / (1.0 + x * x)
   ENDDO
   !$OMP END DO
   !$OMP END PARALLEL

   pi = step * sum
   run_time = OMP_GET_WTIME() - start_time
   WRITE (*, 100) pi, run_time
100 FORMAT('pi is ', f15.8, ' in ', f8.3, ' secs')
END PROGRAM MAIN

Figure 5.10: Making loop iterations independent

Loop dependence example – The first loop is sequential and contains a loop-carried
dependence. The value of j for a loop index is dependent on the value of j for the previous
loop index. In the parallel code in the second loop, the loop-carried dependence has been
removed by calculating j from the loop control index.
! Sequential code with loop dependence
integer :: i, j, A(MAX)
j = 5
do i = 1, MAX
   j = j + 2
   A(i) = big(j)
end do

! parallel code with loop dependence removed
integer :: i, j, A(MAX)
!$omp parallel do private(j)
do i = 1, MAX
   j = 5 + 2 * i
   A(i) = big(j)
end do
!$omp end parallel do
6 OpenMP Data Environment

Figure 6.1: How data is stored by default

An example of default storage attributes – A, index, and count are shared variables
since A is a global variable defined in a module, index is defined prior to the parallel region,
and count is a saved variable. temp is a private variable since it is declared inside the
parallel region.

! File #1:
module data_mod
   real*8 :: A(10)
end module data_mod

program main
   use data_mod
   implicit none
   integer :: index(10)
   !$omp parallel
   call work(index)
   !$omp end parallel
   print *, index(1)
end program main


! File #2:
subroutine work(index)
   use data_mod
   implicit none
   integer :: index(10)
   real*8 :: temp(10)
   integer, save :: count
   ...
end subroutine work

Figure 6.3: The shared and private clauses

The shared clause – An example of a shared clause on a parallel construct. Strictly
speaking, this clause is not needed. It is included in this case to remind the programmer
that of the three variables A, B, and C, only B and C are shared. A copy of the variable A
is created for each thread by the clause private(A).
! sample compile command to generate *.o object file:
! gfortran -fopenmp -c Fig_6.3_sharedEx.f90

program sharedEx
   use omp_lib
   implicit none
   integer :: A, B, C

   interface
      subroutine initialize(A, B, C)
         integer, intent(out) :: A, B, C
      end subroutine
   end interface

   call initialize(A, B, C)

   ! remember the value of A before the parallel region
   print *, 'A before = ', A

   !$omp parallel shared(B, C) private(A)
   A = omp_get_thread_num()
   !$omp critical
   C = B + A
   !$omp end critical
   !$omp end parallel

   ! A in the parallel region goes out of scope, we revert
   ! to the original variable for A
   print *, 'A after = ', A, ' and C = ', C

end program sharedEx

Figure 6.4: The Private clause (note: this program is not correct)

An example of a private clause – The original variable tmp is masked by the private
copy of the variable inside the parallel do region. This program is incorrect since a private
variable is not initialized.

! sample compile command to generate *.o object file:
! gfortran -fopenmp -c Fig_6.4_wrongPrivate.f90

program wrong
   integer :: tmp
   tmp = 0

   !$omp parallel do private(tmp)
   do j = 1, 1000
      tmp = tmp + j
   enddo
   !$omp end parallel do

   print *, tmp   ! tmp is 0 here
end program wrong

Figure 6.5: OK to use a local variable of same name outside of a module

A second example of the private clause – This Fortran program works (unlike
the corresponding C code, which has a subtle bug). tmp is a local variable in subroutine OK,
not the one from the module file as updated in subroutine work; hence the value printed
in line 13 should be the same as the original value 0 defined in line 9 before the parallel
region.

 1  ! File #1
 2  module data_mod
 3     integer :: tmp
 4  end module data_mod
 5
 6  subroutine OK()
 7     implicit none
 8     integer :: tmp
 9     tmp = 0
10     !$omp parallel private(tmp)
11     call work()
12     !$omp end parallel
13     print *, tmp   ! tmp is 0, same as the original local value
14                    ! defined before the parallel region
15  end subroutine OK
16
17  ! File #2
18  subroutine work()
19     use data_mod
20     implicit none
21     tmp = 5
22  end subroutine work

Figure 6.6: Creating private variables that are initialized

Example of using the firstprivate clause – incr is a firstprivate variable, so it is
private to each thread and has an initial value (zero).

incr = 0
!$omp parallel do firstprivate(incr)
do i = 1, MAX
   if (mod(i, 2) == 0) incr = incr + 1
   A(i) = incr
end do
!$omp end parallel do

Figure 6.7: Data environment quiz

An OpenMP data environment quiz – Consider the storage attributes and values
for A, B and C.

A = 1
B = 1
C = 1
!$omp parallel private(B) firstprivate(C)

Figure 6.8 and 6.9: Find the area of the Mandelbrot set

Mandelbrot set area: original code with errors – This version of the program
has multiple bugs. Your job is to inspect the code and find the bugs.

!C PROGRAM: Mandelbrot area
!C
!C PURPOSE: Program to compute the area of a Mandelbrot set.
!C          Correct answer should be around 1.510659.
!C WARNING: this program may contain errors
!C
!C USAGE: Program runs without input ... just run the executable
!C

MODULE mandel_module
implicit none

INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(14)

REAL(KIND = DP) :: r

INTEGER, PARAMETER :: NPOINTS=1000
INTEGER, PARAMETER :: MAXITER=1000
INTEGER :: numoutside=0

TYPE d_complex
REAL(KIND = DP) :: r
REAL(KIND = DP) :: i
END TYPE d_complex

TYPE(d_complex) :: c

contains

SUBROUTINE testpoint()

!C iterate over z=z*z+c. |z| > 2 means the point is outside set

!C If loop count reaches MAXITER, point is inside the set

implicit none
TYPE(d_complex) :: z
INTEGER :: iter
REAL(KIND = DP) :: temp

z = c

DO iter = 1, MAXITER
temp = (z%r*z%r) - (z%i*z%i) + c%r
z%i = z%r*z%i*2 + c%i
z%r = temp

IF ((z%r*z%r + z%i*z%i) > 4.0) THEN
numoutside = numoutside + 1
EXIT
ENDIF
ENDDO

END SUBROUTINE

END MODULE mandel_module

PROGRAM mandel_wrong
USE OMP_LIB
USE mandel_module
IMPLICIT NONE

INTEGER :: i, j
REAL(KIND = DP) :: area, error
REAL(KIND = DP) :: eps = 1.0e-5

!C Loop over grid of complex points in the domain of the
!C Mandelbrot set, testing each point to see whether it is
!C inside or outside the set.

!$OMP PARALLEL DO DEFAULT(shared) PRIVATE(c,eps)

DO i = 1, NPOINTS
DO j = 1, NPOINTS
c%r = -2.0 + 2.5 * DBLE(i-1) / DBLE(NPOINTS) + eps
c%i = 1.125 * DBLE(j-1) / DBLE(NPOINTS) + eps
CALL testpoint()
ENDDO
ENDDO
!$OMP END PARALLEL DO

!C Calculate area of set and error estimate and output the results

area = 2.0 * 2.5 * 1.125 * DBLE(NPOINTS*NPOINTS - numoutside) &
     & /DBLE(NPOINTS*NPOINTS)
error = area / DBLE(NPOINTS)

PRINT *, "numoutside=", numoutside


WRITE(*,100) area, error
100 FORMAT("Area of Mandlebrot set = ", f12.8, " +/-", f12.8)
PRINT *, "Correct answer should be around 1.510659"

END PROGRAM mandel_wrong



Figures 6.10 and 6.11: Debugging the Mandelbrot set program

Mandelbrot set area solution – c is passed as an argument to subroutine testpoint.
eps is declared as firstprivate and the inner loop index j is declared as private.

!C PROGRAM: Mandelbrot area
!C
!C PURPOSE: Program to compute the area of a Mandelbrot set.
!C          Correct answer should be around 1.510659.
!C
!C USAGE: Program runs without input ... just run the executable
!C

MODULE mandel_par_module
implicit none

INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(14)

INTEGER, PARAMETER :: NPOINTS=1000
INTEGER, PARAMETER :: MAXITER=1000
INTEGER :: numoutside=0

TYPE d_complex
REAL(KIND = DP) :: r
REAL(KIND = DP) :: i
END TYPE d_complex

contains

SUBROUTINE testpoint(c)

!C iterate over z=z*z+c. |z| > 2 means the point is outside set
!C If loop count reaches MAXITER, point is inside the set

implicit none
TYPE(d_complex) :: z, c
INTEGER :: iter
REAL(KIND = DP) :: temp

z = c

DO iter = 1, MAXITER
temp = (z%r*z%r) - (z%i*z%i) + c%r
z%i = z%r*z%i*2 + c%i
z%r = temp

IF ((z%r*z%r + z%i*z%i) > 4.0) THEN
!$OMP CRITICAL
numoutside = numoutside + 1
!$OMP END CRITICAL
EXIT
ENDIF
ENDDO

END SUBROUTINE

END MODULE mandel_par_module

PROGRAM mandel_par

USE OMP_LIB
USE mandel_par_module
IMPLICIT NONE

INTEGER :: i, j
REAL(KIND = DP) :: area, error
REAL(KIND = DP) :: eps = 1.0e-5

TYPE(d_complex) :: c

! CALL OMP_SET_NUM_THREADS(4)

!C Loop over grid of complex points in the domain of the
!C Mandelbrot set, testing each point to see whether it is
!C inside or outside the set.

!$OMP PARALLEL DO DEFAULT(shared) FIRSTPRIVATE(eps) PRIVATE(c,j)

DO i = 1, NPOINTS
DO j = 1, NPOINTS
c%r = -2.0 + 2.5 * DBLE(i-1) / DBLE(NPOINTS) + eps
c%i = 1.125 * DBLE(j-1) / DBLE(NPOINTS) + eps
CALL testpoint(c)
ENDDO
ENDDO
!$OMP END PARALLEL DO

!C Calculate area of set and error estimate and output the results
write(*,*)"numoutside=", numoutside

area = 2.0*2.5*1.125 * DBLE(NPOINTS*NPOINTS - numoutside) &
     & / DBLE(NPOINTS*NPOINTS)
error = area / DBLE(NPOINTS)
error = area / DBLE(NPOINTS)

WRITE(*,100) area, error
100 FORMAT("Area of Mandelbrot set = ", f12.8, f12.8)
WRITE(*,*)"Correct answer should be around 1.510659"

END PROGRAM mandel_par



Figure 6.12: Worksharing-loops and data environment clauses help us write particularly simple parallel Pi programs

Pi Program with combined parallel worksharing-loop and reduction –
Each thread accumulates its local sum that is later combined into the global sum with the
reduction operation. Variable x is declared as private with a data environment clause.
PROGRAM MAIN
   USE OMP_LIB
   IMPLICIT NONE

   INTEGER :: i, id
   INTEGER, PARAMETER :: num_steps = 100000000
   INTEGER :: NTHREADS = 4
   REAL*8 x, pi, sum, step
   REAL*8 start_time, run_time

   sum = 0.0
   step = 1.0 / num_steps

   CALL OMP_SET_NUM_THREADS(NTHREADS)
   start_time = OMP_GET_WTIME()

   !$OMP PARALLEL DO PRIVATE(i, x) REDUCTION(+:sum)
   DO i = 1, num_steps
      x = (i - 0.5) * step
      sum = sum + 4.0 / (1.0 + x * x)
   ENDDO
   !$OMP END PARALLEL DO

   pi = step * sum
   run_time = OMP_GET_WTIME() - start_time
   WRITE (*, 100) pi, run_time
100 FORMAT('pi is ', f15.8, ' in ', f8.3, ' secs')

END PROGRAM MAIN

Figure 6.13: Static arrays in data environment clauses

Static arrays in data environment clauses – The compiler creates a private array
with 1000 values of type integer on the stack for each thread.

integer :: varray(1000)
call initv(1000, varray)   ! function to initialize the array

!$omp parallel private(varray)
! body of parallel region not shown
!$omp end parallel

Figure 6.14: Data environment clauses with dynamic arrays and pointers

Dynamic arrays and pointers in data environment clauses – The compiler
gives each thread its own pointer pointing to the same block of memory.

real, pointer :: vptr(:)

allocate (vptr(1000))
call initv(1000, vptr)   ! function to initialize the array

!$omp parallel firstprivate(vptr)
! body of parallel region not shown
!$omp end parallel

Page 119: Array sections

You define an array section in terms of the lower-bound, the upper-bound, and the
stride.

(lower-bound:upper-bound:stride)
(lower-bound:upper-bound)   ! stride implied as one
(:upper-bound:stride)       ! lower-bound implied as one

Using an array section in the previous example, we can have each thread allocate and
copy an original variable that is an array into a parallel region with the directive:

!$omp parallel firstprivate(vptr(1:1000:1))

Array sections also work for the other clauses that create private copies of variables,
such as private and reduction; a small sketch appears below.
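For instance, here is a minimal sketch of our own (not from the book) of a reduction
over an array section; the array partial, its size, and the indexing scheme are all
assumptions made for illustration:

integer :: i
real*8 :: partial(8), A(1000)

A = 1.0
partial = 0.0
! reduce over the first four elements of partial only
!$omp parallel do reduction(+: partial(1:4))
do i = 1, 1000
   partial(mod(i, 4) + 1) = partial(mod(i, 4) + 1) + A(i)
end do
!$omp end parallel do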
7 Tasks in OpenMP

Figure 7.1: Traversing a linked list

Serial linked list program – Traverse the linked list and do a block of work (processwork(p))
for each node in the list, where we assume processwork(p) for any node is independent of
the other nodes.
p => head
do
   call processwork(p)
   p => p%next
   if (.not. associated(p)) exit
end do

Figure 7.2: Traversing a linked list in parallel using worksharing-loop constructs

Parallel linked list program without using tasks – Three passes through the
data to count the length of the list, collect values into an array, and process the array in
parallel. This is an example of the inspector-executor design pattern.
! sample compile command to generate *.o object file:
! gfortran -fopenmp -c Fig_7.2_linkedListNoTasks.f90

module list_mod
   integer, parameter :: NMAX = 10
   type :: node
      integer :: data
      integer :: procResult
      type(node), pointer :: next
   end type node
end module list_mod

program main
   use list_mod
   implicit none

   type(node), pointer :: p => null()
   type(node), pointer :: temp => null()
   type(node), pointer :: head => null()
   type(node), dimension(:), allocatable, target :: parr

   interface
      ! initialize the list (not shown)
      subroutine initList(p)
         use list_mod
         implicit none
         type(node), pointer :: p
      end subroutine initList

      ! a long computation (not shown)
      integer function work(data)
         implicit none
         integer :: data
      end function work
   end interface

   integer :: i, count

   call initList(p)

   ! save head of the list
   head => p

   count = 0
   do
      p => p%next
      count = count + 1
      if (.not. associated(p)) exit
   end do

   allocate (parr(count))

   p => head
   do i = 1, count
      parr(i)%data = p%data
      p => p%next
   end do

   !$omp parallel do schedule(static, 1)
   do i = 1, count
      call procWork(parr(i))
   end do
   !$omp end parallel do

contains

   subroutine procWork(a_node)
      use list_mod
      implicit none
      type(node) :: a_node
      integer :: n
      integer, external :: work
      n = a_node%data
      a_node%procResult = work(n)
   end subroutine procWork

end program main

Figure 7.4: A simple program with OpenMP tasks

Schrödinger’s Program – Two threads each generate two tasks. They wait a random
bit of time and then set a shared variable to true or false. Whichever task executes last
determines the final value of the variable and whether the cat is “dead” or “alive”.

! Schrodingers racy program ... is the cat dead or alive?
!
! You can use atomics and make the program race free, or comment out
! the atomics and run with a race condition. It works in both cases
!
! History: Written by Tim Mattson, Feb 2019
!          Converted to Fortran by Helen He, Nov 2019


program main
   use omp_lib
   implicit none

   ! random number generator parameters
   ! (from numerical recipes)
   integer, parameter :: MULT = 4096
   integer, parameter :: ADD = 150889
   integer, parameter :: MOD_val = 714025

   real*8 :: wait_val, val
   integer*8 :: rand, i, dcount, lcount, coin
   logical :: dead_or_alive, HorT
   integer, parameter :: NTRIALS = 10

   dcount = 0
   lcount = 0

   do i = 1, NTRIALS
      !$omp parallel num_threads(2) shared(dead_or_alive) private(val)
      if (omp_get_thread_num() == 0) then
         print *, " with ", omp_get_num_threads(), " threads."
         write (*, '(a)', advance='no') "Schrodingers program says the cat is "
      endif

      !$omp single
      ! "flip a coin" to choose which task is for the dead
      ! cat and which for the living cat.
      call seedIt(coin)
      HorT = flip(coin)

      ! without the atomics, these tasks are participating in a data race
      !$omp task
      val = waitAbit()
      ! a store of a single machine word (bool)
      !$omp atomic write
      dead_or_alive = HorT
      !$omp end atomic
      !$omp end task
      !$omp task
      val = waitAbit()
      ! a store of a single machine word (bool)
      !$omp atomic write
      dead_or_alive = .not. HorT
      !$omp end atomic
      !$omp end task
      !$omp end single
      !$omp end parallel

      if (dead_or_alive) then
         print *, "alive."
         lcount = lcount + 1
      else
         print *, "dead."
         dcount = dcount + 1
      endif
   end do   ! end loop over trials (for testing only)

   print *, "dead ", dcount, " times", " and alive ", lcount, " times."

contains

   ! seed the pseudo random sequence with time of day
   subroutine seedIt(val)
      implicit none
      integer*8 :: val
      val = int(omp_get_wtime())
   end subroutine seedIt

   ! Linear congruential random number generator
   integer*8 function nextRan(last) result(next)
      implicit none
      integer*8, intent(in) :: last
      next = mod(MULT * last + ADD, MOD_val)
   end function nextRan

   ! flip a coin ... heads (true) or tails (false)
   logical function flip(coin)
      implicit none
      integer*8 :: coin
      coin = nextRan(coin)
      if (coin > MOD_val / 2) then
         flip = .true.
      else
         flip = .false.
      endif
   end function flip

   ! wait a short random amount of time
   real*8 function waitAbit() result(val)
      implicit none
      integer*8 :: i, count, rand
      val = 0.0
      call seedIt(rand)
      count = nextRan(rand)

      ! do some math to make us wait a while
      do i = 1, count
         rand = nextRan(rand)
         val = val + dble(rand) / dble(MULT)
      end do
   end function waitAbit

end program main

Figure 7.5: Single worksharing construct

An OpenMP single construct example – All threads execute do_many_things and
do_many_other_things, but only one thread executes exchange_boundaries.

!$omp parallel
call do_many_things()
!$omp single
call exchange_boundaries()
!$omp end single
call do_many_other_things()
!$omp end parallel

Figure 7.6: Creating explicit tasks

A basic task example – Inside a parallel region, 3 tasks are created by a single thread.
!$omp parallel
!$omp single
!$omp task
call fred()
!$omp end task
!$omp task
call daisy()
!$omp end task
!$omp task
call billy()
!$omp end task
!$omp end single
!$omp end parallel

Figure 7.7: Waiting for tasks to finish with taskwait

A taskwait example – Tasks fred and daisy must complete before task billy starts.
!$omp parallel
!$omp single
!$omp task
call fred()
!$omp end task
!$omp task
call daisy()
!$omp end task
!$omp taskwait
!$omp task
call billy()
!$omp end task
!$omp end single
!$omp end parallel

Figure 7.8: How data moves between tasks

Tasks data environment example – A is shared, B is firstprivate, and C is private.


integer :: C
!$omp parallel shared(A) private(B)
...
!$omp task private(C)
call compute(A, B, C)
!$omp end task
!$omp end parallel

Figure 7.9: Tasks make parallel linked list traversal really simple

Linked list with tasks – The implementation with OpenMP tasks is much more
elegant than the three-pass solution in Figure 7.2.

!$omp parallel
!$omp single
p => head
do
   !$omp task firstprivate(p)
   call processwork(p)
   !$omp end task
   p => p%next
   if (.not. associated(p)) exit
end do
!$omp end single
!$omp end parallel

Figure 7.10: A really inefficient way to compute Fibonacci numbers

Fibonacci example – This is the serial recursive implementation.

recursive integer function fib(n) result(res)
   implicit none
   integer, intent(in) :: n
   integer :: x, y
   if (n < 2) then
      res = n
   else
      x = fib(n - 1)
      y = fib(n - 2)
      res = x + y
   endif
end function fib

program main
   implicit none

   interface
      function fib(n)
         integer :: fib
         integer, intent(in) :: n
      end function fib
   end interface

   integer :: NW, result
   NW = 30
   result = fib(NW)
   print *, "fib(", NW, ") = ", result
end program main

Figure 7.11: Parallel Fibonacci program using the divide and conquer pattern

Parallel implementation of the Fibonacci program using OpenMP tasks
– Two tasks create child tasks recursively. taskwait ensures the direct child tasks complete
before the merge. The base case to exit the recursion is defined for when n < 2.
recursive integer function fib(n) result(res)
   use omp_lib
   implicit none

   integer, intent(in) :: n
   integer :: x, y
   if (n < 2) then
      res = n
   else
      !$omp task shared(x)
      x = fib(n - 1)
      !$omp end task
      !$omp task shared(y)
      y = fib(n - 2)
      !$omp end task
      !$omp taskwait
      res = x + y
   endif
end function fib

program main
   use omp_lib
   implicit none

   interface
      function fib(n)
         integer :: fib
         integer, intent(in) :: n
      end function fib
   end interface

   integer :: NW, result
   NW = 30
   !$omp parallel
   !$omp single
   result = fib(NW)
   !$omp end single
   !$omp end parallel
   print *, "fib(", NW, ") = ", result
end program main

Figure 7.13: The Pi program (from Figure 4.5)

Serial Pi program to numerically estimate a definite integral using the
midpoint rule – The loop iterations are independent other than the summation into
sum.

PROGRAM MAIN

   USE OMP_LIB
   IMPLICIT NONE

   INTEGER :: i
   INTEGER, PARAMETER :: num_steps = 100000000
   REAL*8 :: x, pi, sum, step
   REAL*8 :: start_time, run_time

   sum = 0.0

   step = 1.0 / num_steps
   start_time = OMP_GET_WTIME()

   DO i = 1, num_steps
      x = (i - 0.5) * step
      sum = sum + 4.0 / (1.0 + x * x)
   ENDDO

   pi = step * sum
   run_time = OMP_GET_WTIME() - start_time

   WRITE (*, 100) pi, num_steps, run_time
100 FORMAT('pi = ', f15.8, ',', i14, ' steps,', f8.3, ' secs')

END PROGRAM MAIN

Figure 7.14: Serial recursive Pi program

Serial Pi program using the divide and conquer pattern – Just to make the
code simpler, we pick a number of steps that is a power of 2. This way we can split the
number of steps in half repeatedly and always create intervals that are divisible by 2.

module data_mod
   integer, parameter :: num_steps = 1024*1024*1024
   integer, parameter :: MIN_BLK = 1024*256

contains
   real*8 recursive function pi_comp(Nstart, Nfinish, step) result(sum)
      implicit none

      integer, intent(in) :: Nstart, Nfinish
      real*8 :: x, sum1, sum2, step
      integer :: i, iblk

      sum = 0.0

      if (Nfinish - Nstart < MIN_BLK) then
         do i = Nstart, Nfinish - 1
            x = (i + 0.5) * step
            sum = sum + 4.0 / (1.0 + x * x)
         enddo
      else
         iblk = Nfinish - Nstart
         sum1 = pi_comp(Nstart, Nfinish - iblk / 2, step)
         sum2 = pi_comp(Nfinish - iblk / 2, Nfinish, step)
         sum = sum1 + sum2
      endif

   end function

end module data_mod

program main
   use data_mod
   implicit none

   integer :: i
   real*8 :: pi, sum, step

   step = 1.0 / num_steps
   sum = pi_comp(0, num_steps, step)
   pi = step * sum

   WRITE (*, 100) pi, num_steps
100 FORMAT('pi = ', f15.8, ',', i14, ' steps')

end program main

Figure 7.15: Parallel recursive Pi program

Parallel Pi program using tasks – It is accomplished with the divide and conquer
pattern by splitting the problem into two subtasks to calculate sum1 and sum2, recursively
solving each task, and then combining the results.

module data_mod
   integer, parameter :: num_steps = 1024*1024*1024
   integer, parameter :: MIN_BLK = 1024*256

contains
   real*8 recursive function pi_comp(Nstart, Nfinish, step) result(sum)
      use omp_lib
      implicit none

      integer, intent(in) :: Nstart, Nfinish
      real*8 :: x, sum1, sum2, step
      integer :: i, iblk

      sum = 0.0

      if (Nfinish - Nstart < MIN_BLK) then
         do i = Nstart, Nfinish - 1
            x = (i + 0.5) * step
            sum = sum + 4.0 / (1.0 + x * x)
         enddo
      else
         iblk = Nfinish - Nstart
         !$omp task shared(sum1)
         sum1 = pi_comp(Nstart, Nfinish - iblk / 2, step)
         !$omp end task
         !$omp task shared(sum2)
         sum2 = pi_comp(Nfinish - iblk / 2, Nfinish, step)
         !$omp end task
         !$omp taskwait
         sum = sum1 + sum2
      endif

   end function

end module data_mod

program main
   use omp_lib
   use data_mod
   implicit none

   integer :: i
   real*8 :: pi, sum, step

   step = 1.0 / num_steps

   !$omp parallel
   !$omp single
   sum = pi_comp(0, num_steps, step)
   !$omp end single
   !$omp end parallel

   pi = step * sum

   WRITE (*, 100) pi, num_steps
100 FORMAT('pi = ', f15.8, ',', i14, ' steps')

end program main
8 OpenMP Memory Model

Figure 8.2: Relaxed memory models and race conditions

A program with a race condition – A relaxed memory model permits the assertion
to fail; i.e., the thread with id == 1 can observe values in memory such that r == 1 while
x is still 0.

program main
   use omp_lib
   implicit none

   integer :: x, y, r
   integer :: id, nthrds
   x = 0
   y = 0
   r = 0
   call omp_set_num_threads(2)   ! request two threads
!$omp parallel private(id, nthrds)
   id = omp_get_thread_num()
!$omp single
   nthrds = omp_get_num_threads()
   ! verify that we have at least two threads
   if (nthrds < 2) stop 1
!$omp end single

   if (id == 0) then
      x = 1
      r = x
   else if (id == 1) then
      if (r == 1) then
         y = x
         if (y /= 1) then
            stop "fails y==1"    ! assertion will occasionally fail;
                                 ! i.e., r == 1 while x == 0
         endif
      endif
   endif
!$omp end parallel
end program main
Figure 8.3: Updates may not be fully shared

An erroneous program where updates may not be fully shared – This
program carries out an iterative computation over the elements of an array A. Assume
the function doit() carries out a computation that is embarrassingly parallel with a fixed
subset of the array A selected by the thread ID. This program could fall into an infinite
loop if the value of conv never triggers the exit from the do while loop and the shared
variable iter is not propagated across all the threads, allowing it to trigger the loop exit
condition (iter < MAX).

! sample compile command:
!    gfortran -fopenmp -c Fig_8.3_regPromote.f90
! to generate the *.o object file

program main
   use omp_lib
   implicit none

   interface
      function doit(A, N, id)
         integer :: N, id
         real*8 :: A(N)
         real*8 :: doit
      end function
   end interface

   integer, parameter :: MAX = 10000
   integer, parameter :: NMAX = 1000
   real, parameter :: TOL = 0.0001

   integer :: iter, N
   real*8 :: A(NMAX)
   real*8 :: conv
   integer :: id, nthrd

   iter = 0
   N = 1000
   A = 0.0
   conv = 0.0

!$omp parallel shared(A, N, iter) firstprivate(conv) private(id, nthrd)
   id = omp_get_thread_num()
   nthrd = omp_get_num_threads()

   do while (iter < MAX)
      conv = doit(A, N, id)
      if (conv < TOL) exit
      if (id == 0) iter = iter + 1
   end do

!$omp end parallel

end program main
Figure 8.4: Races due to nowait

Reductions need a barrier – This program carries out a computation inside a parallel
loop and accumulates the result with a reduction. The function called after the loop uses
the SPMD pattern and does not use any of the values computed in the loop, hence the
programmer used a nowait clause. The last function uses the reduction variable which may
not be available for all threads since the reduction is only guaranteed to complete at the
next barrier following the loop. As a result, this is an incorrect program.

integer :: id, nthrds, i
!$omp parallel shared(A, B, sum) private(id, nthrds)
id = omp_get_thread_num()
nthrds = omp_get_num_threads()

!$omp do reduction(+:sum)
do i = 1, N
   sum = sum + big_job(A, N)
end do
!$omp end do nowait

call bigger_job(B, id)     ! a function that does not use A
call another_job(sum, id)  ! sum may not be available
!$omp end parallel
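
One way to repair the program, sketched here under the assumption that bigger_job
really does not use sum, is to keep the nowait on the worksharing-loop but add an
explicit barrier before the reduction variable is read:

!$omp end do nowait

call bigger_job(B, id)     ! safe to overlap: does not use sum
!$omp barrier              ! the reduction is guaranteed complete here
call another_job(sum, id)  ! sum is now valid on every thread
!$omp end parallel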
Figure 8.5: Synchronization in producer-consumer programs

Pairwise synchronization – A producer-consumer pattern with one thread producing
a result that another thread will consume. This program uses a spin-lock to make the
consumer wait for the producer to finish. Note: This program is not properly synchronized
and as written will not work.

integer :: flag   ! flag to communicate when the consumer can start
integer :: id, nthrds
flag = 0

!$omp parallel shared(A, flag) private(id, nthrds)
id = omp_get_thread_num()
nthrds = omp_get_num_threads()

! we need two or more threads for this program
if ((id == 0) .and. (nthrds < 2)) stop 1

if (id == 0) then
   call produce(A)
   flag = 1
endif
if (id == 1) then
   do while (flag == 0)
      ! spin through the loop waiting for flag to change
   enddo
   call consume(A)
endif
!$omp end parallel
9 Common Core Recap

This chapter provides a summary of the items from OpenMP that make up the
Common Core. Hence, there are no example programs in this chapter.
10 Multithreading beyond the Common Core

Page 176: Additional clauses for the Parallel construct

integer :: nthreads
nthreads = omp_get_num_threads()

!$omp parallel if (nthreads < maxthreads/4) num_threads(4)


...

!$omp end parallel


Figure 10.1: Clauses on parallel constructs

Examples of clauses on the Parallel construct – The matrix A is transformed
by a transformation which is assumed to be a unitary transform (i.e., a trace-preserving
transform). Notice how continuation of a directive onto an additional line is indicated by
an ampersand (&). We do not show code for initMats() and transform() as their function
bodies are not relevant for this example.

! sample command to compile to an object file:
!    gfortran -fopenmp -c Fig_10.1_parClaw.f90

program main
   use omp_lib

   interface
      ! initialization function
      subroutine initMats(N, A, T)
         integer :: N
         real, dimension(:,:), allocatable :: A, T
      end subroutine
      ! transform function
      subroutine transform(N, id, Nthrds, A, T)
         integer :: N, id, Nthrds
         real, dimension(:,:), allocatable :: A, T
      end subroutine
   end interface

   real :: trace = 0
   integer :: i, id, N, Nthrds
   real, dimension(:,:), allocatable :: A, T

   integer :: narg                 ! number of arguments
   character(len=10) :: name       ! argument name

   narg = command_argument_count()
   if (narg == 1) then
      call get_command_argument(1, name)
      read(name, *) N
   else
      N = 10
   endif
   print *, "N=", N

   ! allocate space for two N x N matrices and initialize them
   allocate(T(N,N))
   allocate(A(N,N))
   call initMats(N, A, T)

!$omp parallel if(N > 100) num_threads(4) default(none) &
!$omp&   shared(A, T, N) private(i, id, Nthrds) reduction(+:trace)
   id = omp_get_thread_num()
   Nthrds = omp_get_num_threads()
   call transform(N, id, Nthrds, A, T)

   ! compute the trace of the A matrix,
   ! i.e., the sum of its diagonal elements
!$omp do
   do i = 1, N
      trace = trace + A(i,i)
   enddo
!$omp end do
!$omp end parallel
   print *, "transform complete with trace = ", trace
end program main

Page 180: The lastprivate clause

integer :: i, ierr

!$omp do lastprivate(ierr)
do i = 1, N
   ierr = work(i)
enddo
!$omp end do
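
After the worksharing-loop, ierr holds the value assigned in the sequentially last
iteration (i = N). A hypothetical follow-up check (work() and its error convention are
assumptions, not part of the original snippet) might read:

if (ierr /= 0) print *, "work() failed on the last iteration, ierr =", ierr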
Figure 10.3: Manipulating schedules for worksharing-loops at runtime

Use of runtime schedules – Function computes forces in a simple molecular dynamics
program. Prints information about the runtime schedule when enabled by the DEBUG
variable. Notice how we do line continuation for an OpenMP compiler directive in our
parallel construct.

! sample compile command to generate the *.o object file:
!    gfortran -fopenmp -c Fig_10.3_runtimeEx.f90

subroutine forces(npart, x, f, side, rcoff)
   use omp_lib
   implicit none

   interface
      ! external function for the potential energy term
      function pot(dist) result(res)
         real*8 :: dist
         real*8 :: res
      end function pot
   end interface

   integer(kind=omp_sched_kind) :: kind
   integer :: chunk_size
   logical :: DEBUG
   integer :: npart, i, j
   real*8 :: x(0:npart*3+2), f(0:npart*3+2)
   real*8 :: side, rcoff
   real*8 :: fxi, fyi, fzi
   real*8 :: xx, yy, zz, rd, fcomp

   character(len=:), allocatable :: schdKind(:)
   allocate(character(8) :: schdKind(0:4))

   ! map schedule kind enum values to strings for printing
   schdKind(0) = "ERR"
   schdKind(1) = "static"
   schdKind(2) = "dynamic"
   schdKind(3) = "guided"
   schdKind(4) = "auto"
   DEBUG = .true.

!$omp parallel do schedule(runtime) &
!$omp&   private(fxi, fyi, fzi, j, xx, yy, zz, rd, fcomp)
   do i = 0, npart*3-1, 3
      ! zero the force components on particle i
      fxi = 0.0
      fyi = 0.0
      fzi = 0.0

      ! loop over all particles with index > i
      do j = i+3, npart*3-1, 3

         ! compute the distance between i and j with wraparound
         xx = x(i)   - x(j)
         yy = x(i+1) - x(j+1)
         zz = x(i+2) - x(j+2)

         if (xx < (-0.5*side)) xx = xx + side
         if (xx > ( 0.5*side)) xx = xx - side
         if (yy < (-0.5*side)) yy = yy + side
         if (yy > ( 0.5*side)) yy = yy - side
         if (zz < (-0.5*side)) zz = zz + side
         if (zz > ( 0.5*side)) zz = zz - side
         rd = xx*xx + yy*yy + zz*zz

         ! if the distance is inside the cutoff radius, compute forces
         if (rd <= rcoff*rcoff) then
            fcomp = pot(rd)
            fxi = fxi + xx*fcomp
            fyi = fyi + yy*fcomp
            fzi = fzi + zz*fcomp
!$OMP critical
            f(j)   = f(j)   - xx*fcomp
            f(j+1) = f(j+1) - yy*fcomp
            f(j+2) = f(j+2) - zz*fcomp
!$OMP end critical
         endif
      enddo
      ! update the forces on particle i
      f(i)   = f(i)   + fxi
      f(i+1) = f(i+1) + fyi
      f(i+2) = f(i+2) + fzi
   enddo
!$omp end parallel do

   if (DEBUG) then
      call omp_get_schedule(kind, chunk_size)
      print *, "schedule ", schdKind(kind), " chunk_size =", chunk_size
   endif
end subroutine forces
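
With schedule(runtime), the schedule is selected when the program executes, most
commonly through the OMP_SCHEDULE environment variable. For example (the executable
name is an assumption), a dynamic schedule with a chunk size of 7 could be requested with:

$ export OMP_SCHEDULE="dynamic,7"
$ ./md.exe

The same choice can be made from inside the program with
call omp_set_schedule(omp_sched_dynamic, 7), which is what omp_get_schedule()
reports back in the DEBUG block.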
Figure 10.4: Combining loops to generate one large worksharing-loop

The collapse clause on a worksharing-loop construct – The Apply subroutine
applies a procedure, passed in as a dummy procedure argument, to each element of an
N by M array A. The collapse(2) clause fuses the two loops into a single worksharing-loop
over all N*M iterations.

! sample compile command to generate the *.o object file:
!    gfortran -fopenmp -c Fig_10.4_loopCollapse.f90

subroutine Apply(N, M, A, MFUNC)
   use omp_lib
   implicit none

   integer :: N, M
   real :: A(N,M)
   integer :: i, j

   interface
      subroutine MFUNC(i, j, x)
         integer, intent(in) :: i, j
         real :: x
      end subroutine MFUNC
   end interface

   ! apply the subroutine MFUNC to each element of an N by M array

!$omp parallel do num_threads(4) collapse(2) if(N*M > 100)
   do i = 1, N
      do j = 1, M
         call MFUNC(i, j, A(i,j))
      enddo
   enddo
!$omp end parallel do
end subroutine
Figure 10.6: Creating a DAG of tasks

Task Dependencies – This program implements the DAG (Directed Acyclic Graph)
shown in Figure 10.5. The functions represent the nodes and the edges of the DAG are
captured by the patterns of depend clauses.

! sample compile command to generate the *.o object file:
!    gfortran -fopenmp -c Fig_10.6_taskDep.f90

program main
   use omp_lib
   implicit none
   external :: Awork, Bwork, Cwork, Dwork, Ework
   real :: A, B, C, D, E

!$omp parallel shared(A, B, C, D, E)
!$omp single
!$omp task depend(out: A)
   call Awork(A)
!$omp end task
!$omp task depend(out: E)
   call Ework(E)
!$omp end task
!$omp task depend(in: A) depend(out: B)
   call Bwork(B)
!$omp end task
!$omp task depend(in: A) depend(out: C)
   call Cwork(C)
!$omp end task
!$omp task depend(in: B, C, E)
   call Dwork(E)
!$omp end task
!$omp end single
!$omp end parallel
end program main
Figure 10.7: Threadprivate variables to make variables private to a thread but
global inside a thread

Counting task executions with a threadprivate counter – This program
traverses a linked list in parallel with tasks doing a random amount of work for each
node in the list. A threadprivate variable is used to keep track of how many tasks were
executed by each thread. Note: we do not provide the functions used for the list nor the
list processing.

! sample compile command to generate the *.o object file:
!    gfortran -fopenmp -c Fig_10.7_threadpriv.f90

module data_mod
   implicit none
   integer :: counter
!$omp threadprivate(counter)
   type node
      integer :: data
      type(node), pointer :: next
   end type node
contains
   subroutine init_list(p)
      type(node), pointer :: p
      ! initialize the list here
   end subroutine
   subroutine processwork(p)
      type(node), pointer :: p
      ! process work here
   end subroutine
   subroutine freeList(p)
      type(node), pointer :: p
      ! free the list here
   end subroutine
   subroutine inc_count()
      counter = counter + 1
   end subroutine
end module data_mod

program main
   use omp_lib
   use data_mod
   implicit none

   type(node), pointer :: head
   type(node), pointer :: p
   counter = 0
   call init_list(p)
   head => p

!$omp parallel private(p) copyin(counter)
!$omp single
   p => head
   do
!$omp task firstprivate(p)
      call inc_count()
      call processwork(p)
!$omp end task
      p => p%next
      if (.not. associated(p)) exit
   end do
!$omp end single
!$omp end parallel

   call freeList(p)
end program main
Figure 10.8: Threadprivate variables and Fortran common blocks

A counter with threadprivate in Fortran – This code comes from the OpenMP
4.5 Examples document (threadprivate.1.f). This Fortran function creates a global scope
variable through a common block. Hence, the counter is placed in a named common block
and that block is made threadprivate.

INTEGER FUNCTION INCREMENT_COUNTER()
   COMMON /INC_COMMON/ COUNTER
!$OMP THREADPRIVATE(/INC_COMMON/)
   COUNTER = COUNTER + 1
   INCREMENT_COUNTER = COUNTER
   RETURN
END FUNCTION INCREMENT_COUNTER
11 Synchronization and the OpenMP Memory Model

Figure 11.1: Sequence points from C expressed in Fortran

Examples of sequence points – This code shows the most common sequence points
and the relations sequenced-before, indeterminately sequenced, and unsequenced.

! sample compile command to generate the *.o object file:
!    gfortran -fopenmp -c Fig_11.1_seqPts.f90
program main
   implicit none
   integer :: a, b, c, d, e
   integer :: i, N = 100
   integer, external :: func1, func2, func3

   ! Each semicolon defines a sequence point ...
   ! all ordered by sequenced-before relations.

   a = 1; b = 2; c = 0

   ! 3 sequence points: the full statement plus the 2 function calls.
   ! The + operator is not a sequence point, so the function calls
   ! are unordered and therefore indeterminately sequenced.

   d = func2(a) + func3(b)

   ! Each expression in the do statement is a sequence point.
   ! They occur in a sequenced-before relation.

   do i = 1, N
      ! Function invocations are each a sequence point. Argument
      ! evaluations are unordered or indeterminately sequenced.
      e = func1(func2(a), func3(b))
   enddo

   ! There is no Fortran increment syntax such as a++ in C.
   ! a = a + 1 evaluates a + 1 first, then stores the new value in a.

   a = a + 1
end program main
Figure 11.3: Producer-consumer program with data races

Pairwise synchronization with incorrect synchronization – A producer-
consumer pattern with one thread producing a result that another thread will consume.
This program uses a spin-lock to make the consumer wait for the producer to finish. Note:
While the logic in this program is correct, it contains a data race. Hence it is not a valid
OpenMP program and as written will not work.

integer :: flag   ! flag to signal when the consumer can start
integer :: id, nthrds
flag = 0
call omp_set_num_threads(2)

!$omp parallel shared(A, flag) private(id, nthrds)
id = omp_get_thread_num()
nthrds = omp_get_num_threads()

! we need two or more threads for this program
if ((id == 0) .and. (nthrds < 2)) stop 1

if (id == 0) then
   call produce(A)
   flag = 1
endif
if (id == 1) then
   do while (flag == 0)
      ! spin through the loop waiting for flag to change
   enddo
   call consume(A)
endif
!$omp end parallel
Figure 11.4: Spin locks and flushes

Pairwise synchronization with flushes – A producer-consumer program with a
spin-lock and explicit flushes. This code is incorrect since the operations on the flag
define a data race.

integer :: flag   ! flag to signal when the consumer can start
integer :: id, nthrds
flag = 0
call omp_set_num_threads(2)

!$omp parallel shared(A, flag) private(id, nthrds)
id = omp_get_thread_num()
nthrds = omp_get_num_threads()

! we need two or more threads for this program
if ((id == 0) .and. (nthrds < 2)) stop 1

if (id == 0) then
   call produce(A)
!$omp flush
   flag = 1
!$omp flush(flag)
endif
if (id == 1) then
!$omp flush(flag)
   do while (flag == 0)
!$omp flush(flag)
   enddo
!$omp flush
   call consume(A)
endif
!$omp end parallel
Figure 11.5: Spin-lock and flushes with atomics

Pairwise synchronization with flushes and atomics – A producer-consumer
program with a spin-lock and explicit flushes. With the use of atomics to update and then
read flag, this program is race free on any processor.

integer :: flag, flag_temp   ! flag to signal when the consumer can start
integer :: id, nthrds
flag = 0
call omp_set_num_threads(2)

!$omp parallel shared(A, B, flag) private(id, nthrds, flag_temp)
id = omp_get_thread_num()
nthrds = omp_get_num_threads()

! we need two or more threads for this program
if ((id == 0) .and. (nthrds < 2)) stop 1

if (id == 0) then
   call produce(A)
!$omp flush
!$omp atomic write
   flag = 1
!$omp end atomic
endif
if (id == 1) then
   flag_temp = 0   ! spin until the atomic read sees flag change to 1
   do while (flag_temp == 0)
!$omp atomic read
      flag_temp = flag
!$omp end atomic
   enddo
!$omp flush
   call consume(A)
endif
!$omp end parallel
Figure 11.6: Synchronization mapped onto the elements of a data structure

Locks to protect updates to a histogram – Generates a sequence of pseudorandom
numbers and assigns them to the bins of a histogram.

! sample compile command to generate the *.o object file:
!    gfortran -fopenmp -c Fig_11.6_hist.f90

program main
   use omp_lib
   implicit none

   integer, parameter :: num_trials = 1000000   ! number of x values
   integer, parameter :: num_bins = 100         ! number of bins in the histogram
   real*8, save :: xlow = 0.0                   ! low end of the x range
   real*8, save :: xhi = 10.0                   ! high end of the x range

   real*8 :: x
   integer*8 :: hist(num_bins)                  ! the histogram
   integer*8 :: ival, i
   real*8 :: bin_width                          ! the width of each bin in the histogram
   real*8 :: sumh, sumhsq, ave, std_dev
   ! hist_lcks is an array of locks, one per bucket
   integer(kind=omp_lock_kind) :: hist_lcks(num_bins)

   interface
      function drandom() result(val)
         real*8 :: val
      end function
      subroutine seed(low_in, hi_in)
         real*8, intent(in) :: low_in, hi_in
      end subroutine
   end interface

   call seed(xlow, xhi)   ! seed the random generator over the range of x
   bin_width = (xhi - xlow) / dble(num_bins)

   ! initialize the histogram and the array of locks
!$omp parallel do schedule(static)
   do i = 1, num_bins
      hist(i) = 0
      call omp_init_lock(hist_lcks(i))
   enddo
!$omp end parallel do

   ! test the uniform pseudorandom sequence by assigning values
   ! to the right histogram bin
!$omp parallel do schedule(static) private(x, ival)
   do i = 1, num_trials
      x = drandom()
      ival = int((x - xlow) / bin_width, kind=8) + 1   ! 1-based bin index
      ! protect the histogram bins.
      ! Low overhead due to uncontended locks
      call omp_set_lock(hist_lcks(ival))
      hist(ival) = hist(ival) + 1
      call omp_unset_lock(hist_lcks(ival))
   enddo
!$omp end parallel do

   sumh = 0.0
   sumhsq = 0.0
   ! compute statistics (ave, std_dev) and destroy the locks
!$omp parallel do schedule(static) reduction(+:sumh, sumhsq)
   do i = 1, num_bins
      sumh = sumh + hist(i)
      sumhsq = sumhsq + hist(i) * hist(i)
      call omp_destroy_lock(hist_lcks(i))
   enddo
!$omp end parallel do

   ave = sumh / dble(num_bins)
   std_dev = sqrt(sumhsq / dble(num_bins) - ave * ave)
end program main
Figure 11.7: Atomics make our producer-consumer program much simpler

Pairwise synchronization with sequentially consistent atomics – A producer-
consumer program, but now the form of the atomic construct used implies all the flushes
we need.

integer :: flag, flag_temp   ! flag to signal when the consumer can start
integer :: id, nthrds
flag = 0

call omp_set_num_threads(2)

!$omp parallel shared(A, flag) private(id, nthrds, flag_temp)
id = omp_get_thread_num()
nthrds = omp_get_num_threads()

! we need two or more threads for this program
if ((id == 0) .and. (nthrds < 2)) stop 1

if (id == 0) then
   call produce(A)
!$omp atomic write seq_cst
   flag = 1
!$omp end atomic
endif

if (id == 1) then
   do
!$omp atomic read seq_cst
      flag_temp = flag
      if (flag_temp /= 0) exit
   enddo
   call consume(A)
endif
!$omp end parallel
12 Beyond OpenMP Common Core Hardware

Figure 12.6: First touch and reducing memory movement costs

STREAM initialization with and without first touch – Without first touch:
step 1.a + step 2. With first touch: step 1.b + step 2.

! Step 1.a  Initialization by the initial thread only
do j = 1, VectorSize
   a(j) = 1.0
   b(j) = 2.0
   c(j) = 0.0
enddo

! Step 1.b  Initialization by all threads (first touch)
call omp_set_dynamic(.false.)
!$omp parallel do schedule(static)
do j = 1, VectorSize
   a(j) = 1.0
   b(j) = 2.0
   c(j) = 0.0
enddo
!$omp end parallel do

! Step 2  Compute
!$omp parallel do schedule(static)
do j = 1, VectorSize
   a(j) = b(j) + d * c(j)
enddo
!$omp end parallel do
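
First touch only pays off if the threads stay near the memory they touched, so steps 1.b
and 2 must see the same thread-to-core mapping. One reasonable (but not the only) way to
arrange that is to pin the threads with the affinity controls shown in Figure 12.12 before
running, for example (executable name assumed):

$ export OMP_PLACES=cores
$ export OMP_PROC_BIND=spread
$ ./stream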
Figure 12.9: Nested parallelism

Nested OpenMP parallel constructs – There are 3 levels of nested OpenMP
parallel regions, with 2 threads in each level. The num_threads clause is used to specify
the number of threads desired for each parallel region.

subroutine report_num_threads(level)
   use omp_lib
   implicit none

   integer :: level
!$omp single
   write(*, 100) level, omp_get_num_threads()
100 format("Level ", I3, ": number of threads in the team is ", I6)
!$omp end single
end subroutine report_num_threads

program main
   use omp_lib
   implicit none
   external :: report_num_threads

   call omp_set_dynamic(.false.)
!$omp parallel num_threads(2)
   call report_num_threads(1)
!$omp parallel num_threads(2)
   call report_num_threads(2)
!$omp parallel num_threads(2)
   call report_num_threads(3)
!$omp end parallel
!$omp end parallel
!$omp end parallel
end program main
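
Note that many runtimes disable nested parallelism by default, in which case each inner
region would run with a team of one thread. A minimal sketch of how to enable three
active levels before the first parallel region (omp_set_nested is deprecated as of
OpenMP 5.0 in favor of this routine):

call omp_set_max_active_levels(3)

With nesting enabled, the program should print one Level 1 line, two Level 2 lines, and
four Level 3 lines, each reporting a team of 2 threads.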
Figure 12.12: Controlling thread affinity

Affinity format example – We set the thread affinity format string and then ran the
STREAM benchmark on the server node with logical CPU numbering from Figure 12.4.
We show two different executions of the STREAM benchmark: one with OMP_PROC_BIND
set to spread and the other with OMP_PROC_BIND set to close.

$ ifort -qopenmp -DNTIMES=20 -DSTREAM_ARRAY_SIZE=64000000 -c stream.f
$ ifort -qopenmp -o stream stream.o
$ export OMP_DISPLAY_AFFINITY=true
$ export OMP_AFFINITY_FORMAT="Thrd Lev=%3L, thrd_num=%5n, thrd_aff=%15A"
$ export OMP_PLACES=threads
$ export OMP_NUM_THREADS=8
$ export OMP_PROC_BIND=spread

$ ./stream | sort -k3

<stream results omitted ...>
Thrd Lev=1 , thrd_num=0 , thrd_aff=0
Thrd Lev=1 , thrd_num=1 , thrd_aff=8
Thrd Lev=1 , thrd_num=2 , thrd_aff=16
Thrd Lev=1 , thrd_num=3 , thrd_aff=24
Thrd Lev=1 , thrd_num=4 , thrd_aff=1
Thrd Lev=1 , thrd_num=5 , thrd_aff=9
Thrd Lev=1 , thrd_num=6 , thrd_aff=17
Thrd Lev=1 , thrd_num=7 , thrd_aff=25

$ export OMP_PROC_BIND=close
$ ./stream | sort -k3
<stream results omitted ...>
Thrd Lev=1 , thrd_num=0 , thrd_aff=0
Thrd Lev=1 , thrd_num=1 , thrd_aff=32
Thrd Lev=1 , thrd_num=2 , thrd_aff=2
Thrd Lev=1 , thrd_num=3 , thrd_aff=34
Thrd Lev=1 , thrd_num=4 , thrd_aff=4
Thrd Lev=1 , thrd_num=5 , thrd_aff=36
Thrd Lev=1 , thrd_num=6 , thrd_aff=6
Thrd Lev=1 , thrd_num=7 , thrd_aff=38
Figure 12.13: Serial Pi program (from Figure 4.5)

Serial Pi program – This program approximates a definite integral using the midpoint
rule. The loop iterations are independent other than the summation into sum. Note that
we must explicitly represent all constants in single precision (the default REAL) to prevent
internal operations from using double precision.

PROGRAM MAIN
   IMPLICIT NONE

   INTEGER :: i
   INTEGER, PARAMETER :: num_steps = 1000000
   REAL :: x, pi, sum, step

   sum = 0.0

   step = 1.0 / num_steps

   DO i = 1, num_steps
      x = (i - 0.5) * step
      sum = sum + 4.0 / (1.0 + x * x)
   ENDDO

   pi = step * sum
   print *, "pi =", pi

END PROGRAM MAIN
Figure 12.14: Unrolled loop in the Pi program

Serial Pi program with loops unrolled by 4 – Numerical integration to estimate
Pi. We assume the number of steps is evenly divisible by 4 just to keep the program simpler.

PROGRAM MAIN
   IMPLICIT NONE

   INTEGER :: i
   INTEGER, PARAMETER :: num_steps = 100000
   REAL :: x0, x1, x2, x3, pi, sum
   REAL :: step

   sum = 0.0
   step = 1.0 / num_steps

   DO i = 1, num_steps, 4
      x0 = (i - 0.5) * step
      x1 = (i + 0.5) * step
      x2 = (i + 1.5) * step
      x3 = (i + 2.5) * step
      sum = sum + 4.0 * (1.0 / (1.0 + x0 * x0) &
        &              + 1.0 / (1.0 + x1 * x1) &
        &              + 1.0 / (1.0 + x2 * x2) &
        &              + 1.0 / (1.0 + x3 * x3))
   ENDDO

   pi = step * sum

   WRITE(*, 100) pi, num_steps
100 FORMAT('pi = ', f15.8, ',', i14, ' steps')

END PROGRAM MAIN
Figure 12.15: Calling a C function with SSE code from Fortran

Pi program using SSE vector intrinsics – Numerical integration to estimate Pi.
We assume the number of steps is evenly divisible by 4 just to keep the program simpler.

Save the contents in 2 files as below:


! To build and run:
! % gcc -c get_pi_vec.c
! % gfortran get_pi_vec.o Fig_12.15_explicitVecPi.f90
! % ./a.out

# File 1: "Fig_12.15_explicitVecPi.f90"

program main
interface
function get_pi_vec () result(r) bind(C, name="get_pi_vec")
use, intrinsic :: iso_c_binding, only : c_float
real(c_float) :: r
end function get_pi_vec
end interface

real :: pi
pi = get_pi_vec()
print *, "in Fortran: pi=", pi
end program

# File 2: "get_pi_vec.c"

#include <x86intrin.h>
static long num_steps = 100000;
float scalar_four = 4.0f, scalar_zero = 0.0f, scalar_one = 1.0f;
float step;
float get_pi_vec ()
{
int i;
float pi;
float vsum[4], ival;
step = 1.0f/(double) num_steps;

__m128 ramp = _mm_setr_ps(0.5, 1.5, 2.5, 3.5);


__m128 one = _mm_load1_ps(&scalar_one);
__m128 four = _mm_load1_ps(&scalar_four);
__m128 vstep = _mm_load1_ps(&step);
__m128 sum = _mm_load1_ps(&scalar_zero);
__m128 xvec;
__m128 denom;
__m128 eye;

for (i = 0; i < num_steps; i = i + 4){


ival = (float) i;
eye = _mm_load1_ps(&ival);
xvec = _mm_mul_ps(_mm_add_ps(eye,ramp), vstep);
denom = _mm_add_ps(_mm_mul_ps(xvec,xvec), one);
sum = _mm_add_ps(_mm_div_ps(four,denom), sum);
}
_mm_store_ps(&vsum[0], sum);

pi = step * (vsum[0] + vsum[1] + vsum[2] + vsum[3]);


return pi;
}
Figure 12.16: Multithreading with SSE vectorization

A multithreaded and vectorized Pi program – This program carries out a


numerical integration to estimate Pi. We assume the number of steps is evenly divisible by
4 and that we got 4 threads just to keep the program simple.

Save the contents in 2 files as below:

! To build and run:
! % gcc -c -fopenmp get_pi_par_vec.c
! % gfortran -fopenmp get_pi_par_vec.o Fig_12.16_parVecPi.f90
! % ./a.out

# File 1: "Fig_12.16_parVecPi.f90"

program main
   use omp_lib
   interface
      function get_pi_par_vec () result(r) bind(C, name="get_pi_par_vec")
         use, intrinsic :: iso_c_binding, only : c_float
         real(c_float) :: r
      end function get_pi_par_vec
   end interface

   real :: pi
   pi = get_pi_par_vec()
   print *, "in Fortran: pi=", pi
end program

# File 2: "get_pi_par_vec.c"

#include <x86intrin.h>
#include <omp.h>
static long num_steps = 100000;
#define MAX_THREADS 4
float scalar_four = 4.0f, scalar_zero = 0.0f, scalar_one = 1.0f;
float step;
float get_pi_par_vec ()
{
   int i, k;
   float local_sum[MAX_THREADS];
   float pi = 0.0f, sum = 0.0f;
   step = 1.0f/(double) num_steps;

   for (k = 0; k < MAX_THREADS; k++) local_sum[k] = 0.0;

   #pragma omp parallel num_threads(4) private(i)
   {
      int ID = omp_get_thread_num();
      float vsum[4], ival, scalar_four = 4.0;

      __m128 ramp  = _mm_setr_ps(0.5, 1.5, 2.5, 3.5);
      __m128 one   = _mm_load1_ps(&scalar_one);
      __m128 four  = _mm_load1_ps(&scalar_four);
      __m128 vstep = _mm_load1_ps(&step);
      __m128 sum   = _mm_load1_ps(&scalar_zero);
      __m128 xvec;
      __m128 denom;
      __m128 eye;

      // unroll the loop 4 times ... assume num_steps%4 == 0
      #pragma omp for schedule(static)
      for (i = 0; i < num_steps; i = i + 4){
         ival  = (float) i;
         eye   = _mm_load1_ps(&ival);
         xvec  = _mm_mul_ps(_mm_add_ps(eye, ramp), vstep);
         denom = _mm_add_ps(_mm_mul_ps(xvec, xvec), one);
         sum   = _mm_add_ps(_mm_div_ps(four, denom), sum);
      }
      _mm_store_ps(&vsum[0], sum);
      local_sum[ID] = step * (vsum[0] + vsum[1] + vsum[2] + vsum[3]);
   }
   for (k = 0; k < MAX_THREADS; k++) pi += local_sum[k];
   return pi;
}
Figure 12.17: Vectorization with the OpenMP SIMD construct

OpenMP program to vectorize the Pi program – The simd construct directs the
compiler to explicitly vectorize the loop. As with many OpenMP features, it asserts to the
compiler that it is safe to vectorize the code and it will do so, even if there are loop-carried
dependencies that should prevent vectorization.

PROGRAM MAIN
   IMPLICIT NONE

   INTEGER :: i
   INTEGER, PARAMETER :: num_steps = 100000
   REAL :: x, pi, sum, step

   sum = 0.0

   step = 1.0 / num_steps

!$OMP simd private(x) reduction(+:sum)
   DO i = 1, num_steps
      x = (i - 0.5) * step
      sum = sum + 4.0 / (1.0 + x * x)
   ENDDO
!$OMP end simd

   pi = step * sum
   print *, "pi =", pi

END PROGRAM MAIN
Figure 12.18: Combining multithreading and vectorization

OpenMP program to multithread and vectorize the Pi program – This is
a familiar "parallel do" approach to solving the problem, but we have added one addition:
simd for explicit vectorization.

PROGRAM MAIN

   USE OMP_LIB
   IMPLICIT NONE

   INTEGER :: i

   INTEGER, PARAMETER :: num_steps = 100000
   REAL :: x, pi, sum, step

   sum = 0.0

   step = 1.0 / num_steps

!$omp parallel do simd private(x) reduction(+:sum)
   DO i = 1, num_steps
      x = (i - 0.5) * step
      sum = sum + 4.0 / (1.0 + x * x)
   ENDDO
!$omp end parallel do simd

   pi = step * sum

   print *, "pi =", pi
END PROGRAM MAIN
Figure 12.19: Target construct with default data movement

OpenMP program for elementwise multiplication of vectors on a GPU –
Default data movement moves the vectors a, b, and c onto the device before the computation
starts and back onto the host (the CPU) when the computation has completed.

program main
   use omp_lib
   implicit none

   integer, parameter :: N = 1024
   real*8 :: a(N), b(N), c(N)
   integer :: i

   ! initialize a, b, and c (code not shown)

!$omp target
!$omp teams distribute parallel do simd
   do i = 1, N
      c(i) = c(i) + a(i) * b(i)
   enddo
!$omp end teams distribute parallel do simd
!$omp end target

end program main
Figure 12.20: Explicit data movement between the host and a device

Explicit data movement with the target directive – The map clause controls
movement of data from the host to a device or from the device onto the host. When
working with allocatable or pointer arrays, you need to use array sections to define
precisely which data to move.

program main
   use omp_lib

   integer, parameter :: N = 1024
   real*8, allocatable, dimension(:) :: a, b, c
   integer :: i

   allocate(a(N))
   allocate(b(N))
   allocate(c(N))

   ! initialize a, b, and c (code not shown)

!$omp target map(to: a(1:N), b(1:N)) map(tofrom: c(1:N))
!$omp teams distribute parallel do simd
   do i = 1, N
      c(i) = c(i) + a(i) * b(i)
   enddo
!$omp end teams distribute parallel do simd
!$omp end target

end program main
Figure 12.21: Data movement between multiple target regions

Multiple target regions – The map clause controls movement of data from the host
to a device or from the device onto the host. With two separate target regions, the input
arrays are mapped and transferred for each region; Figure 12.22 shows how a target data
region avoids this repeated movement.

program main
   use omp_lib

   integer, parameter :: N = 1024
   real*8, allocatable, dimension(:) :: a, b, c, d
   integer :: i

   allocate(a(N))
   allocate(b(N))
   allocate(c(N))
   allocate(d(N))

   ! initialize a, b, c, and d (code not shown)

!$omp target map(to: a(1:N), b(1:N)) map(tofrom: c(1:N))
!$omp teams distribute parallel do simd
   do i = 1, N
      c(i) = c(i) + a(i) * b(i)
   enddo
!$omp end teams distribute parallel do simd
!$omp end target

!$omp target map(to: a(1:N), c(1:N)) map(tofrom: d(1:N))
!$omp teams distribute parallel do simd
   do i = 1, N
      d(i) = d(i) + a(i) * c(i)
   enddo
!$omp end teams distribute parallel do simd
!$omp end target

end program main
Figure 12.22: Managing data movement across multiple target regions

Target Data Region – A single target data region manages data at the level of a
device. It persists and is used between multiple target constructs.

program main
   use omp_lib
   integer, parameter :: N = 1024
   real*8, allocatable, dimension(:) :: a, b, c, d
   integer :: i
   allocate(a(N))
   allocate(b(N))
   allocate(c(N))
   allocate(d(N))

   ! initialize a, b, c, and d (code not shown)

!$omp target data map(to: a(1:N), b(1:N), c(1:N)) map(tofrom: d(1:N))

!$omp target
!$omp teams distribute parallel do simd
   do i = 1, N
      c(i) = c(i) + a(i) * b(i)
   enddo
!$omp end teams distribute parallel do simd
!$omp end target

!$omp target
!$omp teams distribute parallel do simd
   do i = 1, N
      d(i) = d(i) + a(i) * c(i)
   enddo
!$omp end teams distribute parallel do simd
!$omp end target

!$omp end target data

end program main
13 Your Continuing Education in OpenMP

Figure 13.1: Task with critical constructs can deadlock

A subtle deadlock with tasks – This is the tasking.9.c example from the OpenMP
4.5 Examples document. This function can deadlock if the thread suspends task 1 to begin
work on task 2: the two unnamed critical constructs share a single lock, so if task 1 is
suspended inside Critical Region 2 (at the creation of task 3), task 2 will block forever
on Critical Region 1.

! sample compile command to generate the *.o object file:
!    gfortran -fopenmp -c Fig_13.1_taskBug.f90

subroutine work()
   use omp_lib
   implicit none

!$omp task                   ! task 1
!$omp task                   ! task 2
!$omp critical               ! Critical Region 1
   ! do work here
!$omp end critical           ! end Critical Region 1
!$omp end task               ! end task 2
!$omp critical               ! Critical Region 2
!$omp task                   ! task 3
   ! do work here
!$omp end task               ! end task 3
!$omp end critical           ! end Critical Region 2
!$omp end task               ! end task 1
end subroutine
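
One way to avoid the deadlock, assuming task 3 does not have to be created while the
lock is held, is to move the task creation out of the critical construct so that no task
scheduling point occurs while the lock is held:

!$omp critical               ! Critical Region 2
   ! do work here, but create no tasks while holding the lock
!$omp end critical           ! end Critical Region 2
!$omp task                   ! task 3, created after the lock is released
   ! do work here
!$omp end task               ! end task 3

This is a sketch of the general rule rather than the book's prescribed fix: do not hold a
critical construct (or a lock) across a task scheduling point.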