Fortran Supplement Ver 1.1
Fortran Supplement Ver 1.1
1 Parallel Computing 1
3 What is OpenMP? 7
5 Parallel Loops 23
7 Tasks in OpenMP 47
OpenMP is defined for C, C++, and Fortran. Ideally, when you write a book about
OpenMP, everything is covered in triplicate; once for each of these programming
languages. If that was done, however, the resulting book would be cluttered and
painful to read. What’s a poor author to do?
After struggling with the language problem for many moons we came upon what
we hope is an effective compromise. Our book on the OpenMP Common Core covers
C and Fortran. Every time we present an item from the OpenMP API, we define it
in both C and Fortran. The code discussed in the book, however, only addresses C.
The result is a book tightly woven around a set of C examples; free from the clutter
of replicated content from other languages.
For C++ programmers, this solution works quite well. While offensive to a “proper”
C++ programmer, you can think of C++ as a superset of C. With few exceptions,
if you move OpenMP for C into C++, things just work. We thought this was an
adequate solution for Fortran programmers as well since surely modern Fortran
programmers understand C. Based on informal surveys at numerous OpenMP
tutorials, however, we’ve learned that this assumption is not universally true. Many
Fortran programmers are not comfortable with C. The authors of the OpenMP
Common Core started life as Fortran programmers. We love Fortran and would
hate to leave our fellow “Formula Translation” buddies out in the cold without the
benefit of our excellent book.
We came up with a simple solution to this problem. We produced a supplement
to our book that presents every example from the book implemented in Fortran.
Our Fortran friends would buy the OpenMP Common Core book and download this
free supplement. Having the two side by side, Fortran programmers could easily
absorb the contents of our book and apply what they have learned to their own
Fortran programs.
There is one technical complication to this solution. When presenting OpenMP
to C and C++ programmers, we delay introduction of clauses that manipulate the
data environment. There is so much to grapple with when learning multithreaded
programming. It greatly simplifies the discussion if we don’t move beyond the
default rules for data sharing until much later.
For Fortran programmers however, this is not possible. A C programmer can
declare a new variable almost anywhere. Fortran, on the other hand, requires that
all variables are declared before any executable code. Therefore, to discuss OpenMP
with Fortran programmers, we need to present one of the clauses from Chapter 6
(OpenMP Data Environment) right from the beginning. This clause is the private
clause.
vi Preface to the Fortran supplement
• The figure number for programs presented in the OpenMP Common Core
book
• The page number from the OpenMP Common Core book where code-embedded-
in-text is found.
We hope you find our solution to the multi-language problem in OpenMP useful.
We really want our Fortran readers to benefit from our wonderful little book on the
OpenMP Common Core.
Preface to the Fortran supplement vii
Acknowledgments
When we first started teaching OpenMP, Scientific Computing was almost exclusively
centered on Fortran. Even if programmers used a different language, they were
generally comfortable reading Fortran. Over the years, the dominance of Fortran has
slipped only to be replaced by C. In response, our materials for teaching OpenMP
shifted taking us to place where we largely ignored Fortran.
This situation is changing. We are moving Fortran back into the core materials we
use for teaching OpenMP. We can only do this, however, with help from the OpenMP
Fortran community. We all still use Fortran and are comfortable with the language,
but we need an expert who is up to date with the latest developments in Fortran
to check our work and make sure it is correct. Henry Jin (NASA Ames Research
Center) played the “expert Fortran reviewer” role for us. He found numerous errors,
both large and small, in our code. We are grateful to Henry for the many hours he
spent reviewing our code and making sure that this Fortran Supplement is of the
highest quality.
1 Parallel Computing
In this chapter, we focus on the words we use when talking about the performance
of a parallel program. These concepts are language independent. Hence, there is no
C, C++, or Fortran code in this chapter.
3 What is OpenMP?
A structured block is implied between any directive and its “end directive” form.
Square brackets ([ ]) denote optional items.
subroutine omp set num threads(numthrds) Set default number of threads to request
integer numthrds when creating a team of threads
double precision function omp get wtime() Wall clock time
export OMP NUM THREADS=N Environment Variable: number of threads
!$OMP barrier Wait for all threads in the team
Data movement and parallel regions – This simple program sets the default
number of threads to request for a parallel region to 4. A parallel region is defined with a
private clause so each thread has its own copy of ID. The thread ID is set and a simple
subroutine is called. Key points form this program: (1) all the threads independently
execute the same block of code in this parallel region, (2) all threads have access to the
array declared prior to the parallel region, and (3) each thread has its own, private copy of
the integer ID.
1 Program parReg
2 use omp lib
3 i m p l i c i t none
4
5 r e a l : : A( 1 0 )
6 i n t e g e r : : ID
7
8 A = 0
9 c a l l omp set num threads (4)
10
11 ! $omp p a r a l l e l p r i v a t e ( ID )
12 ID = omp get thread num ( ) + 1
13 c a l l pooh ( ID , A)
14 w r i t e ( ∗ , 1 0 0 ) ID , A( ID )
15 100 format ( ”A o f ID ( ” , I3 , ”)=” , f 1 0 . 4 )
16 ! $omp end p a r a l l e l
17
18 contains
19
20 s u b r o u t i n e pooh ( ID , A)
21 i n t e g e r : : ID
22 r e a l , dimension ( : ) : : A
23 A( ID ) = ID
24 end s u b r o u t i n e pooh
25
26 end Program parReg
10 Chapter 4
Library routines to manage threads – This program shows how to set the default
number of threads to request in parallel regions, query the number of threads in a team,
and set a unique thread ID. Notice the care taken to avoid a data race when assigning to
size of team.
1 Program parReg1
2 use omp lib
3 i m p l i c i t none
4
5 i n t e g e r : : ID , s i z e o f t e a m , NThrds
6 c a l l omp set num threads (4)
7 ! $omp p a r a l l e l p r i v a t e ( ID , NThrds )
8 ID = omp get thread num ( )
9 NThrds = omp get num threads ( )
10 i f ( ID == 0 ) s i z e o f t e a m = NThrds
11 ! $omp end p a r a l l e l
12 p r i n t ∗ , ”We j u s t d i d t h e j o i n on a team o f s i z e ” , s i z e o f t e a m
13 end Program parReg1
Threads and the OpenMP Programming Model 11
integer :: i
ID = omp_get_thread_num()
numthreads = omp_get_num_threads()
do i = ID + 1, num_steps, numthreads
! body of the loop
end do
Threads and the OpenMP Programming Model 13
1 PROGRAM MAIN
2 USE OMP LIB
3 IMPLICIT NONE
4
5 INTEGER, PARAMETER : : MAX THREADS = 4
6 INTEGER : : i , j , id , numthreads , n t h r e a d s
7 INTEGER, PARAMETER : : num steps = 100000000
8 REAL∗8 : : pi , r e a l s u m , s t e p , x
9 REAL∗8 : : s t a r t t i m e , r u n t i m e
10 REAL∗8 : : sum ( 0 :MAX THREADS−1)
11
12 s t e p = 1 . 0 / num steps
13
14 CALL OMP SET NUM THREADS(MAX THREADS)
15 s t a r t t i m e = omp get wtime ( )
16
17 !$OMP PARALLEL PRIVATE( id , x , numthreads )
18 i d = omp get thread num ( )
19 numthreads = OMP GET NUM THREADS( )
20 sum ( i d ) = 0 . 0
21
22 IF ( i d == 0 ) THEN
23 n t h r e a d s = numthreads
24 ENDIF
25
26 DO i = id , num steps − 1 , numthreads
27 x = ( i + 0.5) ∗ step
28 sum ( i d ) = sum ( i d ) + 4 . 0 / ( 1 . 0 + x ∗ x )
29 ENDDO
30 !$OMP END PARALLEL
31
14 Chapter 4
32 pi = 0.0
33 DO i = 0 , n t h r e a d s −1
34 p i = p i + sum ( i )
35 ENDDO
36
37 pi = step ∗ pi
38 r u n t i m e = OMP GET WTIME( ) − s t a r t t i m e
39 WRITE( ∗ , 1 0 0 ) pi , num steps , r u n t i m e
40 100 FORMAT( ’ p i = ’ , f 1 5 . 8 , ’ , ’ , i 1 4 , ’ s t e p s , ’ , f 8 . 3 , ’ s e c s ’ )
41
42 END PROGRAM MAIN
Threads and the OpenMP Programming Model 15
Padded sum array numerical integration – The sum array padded to fill an L1
cache line with the extra dimension and put subsequent rows of sum, i.e., each sum(0,id),
on different cache lines.
1 PROGRAM MAIN
2 USE OMP LIB
3 IMPLICIT NONE
4
5 INTEGER : : i , j , id , numthreads , n t h r e a d s
6 INTEGER, PARAMETER : : num steps =100000000
7 INTEGER, PARAMETER : : MAX THREADS=4
8 INTEGER, PARAMETER : : CBLK=8
9 REAL∗8 : : pi , s t e p , x
10 REAL∗8 : : s t a r t t i m e , r u n t i m e
11 REAL∗8 : : sum (CBLK, 0 :MAX THREADS−1)
12
13 s t e p = 1 . 0 / num steps
14
15 CALL OMP SET NUM THREADS(MAX THREADS)
16 s t a r t t i m e = omp get wtime ( )
17
18 !$OMP PARALLEL PRIVATE( id , x , numthreads )
19 i d = omp get thread num ( )
20 numthreads = OMP GET NUM THREADS( )
21 sum ( 1 , i d ) = 0 . 0
22
23 IF ( i d == 0 ) THEN
24 n t h r e a d s = numthreads
25 ENDIF
26
27 DO i = id , num steps −1, numthreads
28 x = ( i + 0.5) ∗ step
29 sum ( 1 , i d ) = sum ( 1 , i d ) + 4 . 0 / ( 1 . 0 + x ∗ x )
30 ENDDO
31 !$OMP END PARALLEL
32
33 pi = 0.0
Threads and the OpenMP Programming Model 17
34 DO i = 0 , n t h r e a d s − 1
35 p i = p i + sum ( 1 , i )
36 ENDDO
37
38 pi = step ∗ pi
39 r u n t i m e = OMP GET WTIME( ) − s t a r t t i m e
40 WRITE( ∗ , 1 0 0 ) pi , run ti me , n t h r e a d s
41 100 FORMAT( ’ p i i s ’ , f 1 5 . 8 , ’ i n ’ , f 8 . 3 , ’ s e c s and ’ , i 3 , ’ t h r e a d s ’ )
42
43 END PROGRAM MAIN
18 Chapter 4
Example of an explicit barrier – An explicit barrier is used to assure that all threads
complete filling the array Arr before using it to compute Brr. We assume the SPMD
pattern so we pass the thread id and the number of threads to all the functions. Notice
that only one thread saves the number of threads to a shared variable should it be needed
after the parallel region.
1 r e a l ∗8 : : Arr ( 8 ) , Brr ( 8 )
2 i n t e g e r : : numthrds
3 i n t e g e r : : id , n t h r d s
4 r e a l ∗8 , e x t e r n a l : : l o t s o f w o r k , n e e d s a l l o f A r r
5
6 c a l l omp set num threads (8)
7 ! $omp p a r a l l e l p r i v a t e ( id , n t h r d s )
8 i d = omp get thread num ( ) + 1
9 n t h r d s = omp get num threads ( )
10 i f ( i d == 1 ) numthrds = n t h r d s
11 Arr ( i d ) = l o t s o f w o r k ( id , n t h r d s )
12 #pragma omp b a r r i e r
13 Brr ( i d ) = n e e d s a l l o f A r r ( id , nthrds , Arr )
14 ! $omp end p a r a l l e l
5 Parallel Loops
do i = 1, N
a(i) = a(i) + b(i)
end do
SPMD parallel vector add program – Create a team of threads and assign one
chunk of loop iterations to each thread.
!$omp do
...
!$omp end do
24 Chapter 5
The following pattern with a pair of OpenMP constructs, one to create the team of
threads and the other to split up loop iterations among threads, is very common:
!$omp parallel
!$omp do
do-loop
!$omp end do
!$omp end parallel
!$omp parallel do
do-loop
!$omp end parallel do
An OpenMP reduction –Each thread has a private copy of the variable ave to use
for its loop iterations. At the end of the loop, these values are combined to create the final
value of the reduction which is then combined with the globally visible, shared copy of the
variable ave.
1 integer : : i
2 r e a l ∗8 : : ave , A(N)
3
4 c a l l I n i t (A,N)
5 ave = 0 . 0
6
7 ! $omp p a r a l l e l do r e d u c t i o n (+: ave )
8 do i = 1 , N
9 ave = ave + A( i )
10 enddo
11 ! $omp end p a r a l l e l do
12
13 ave = ave /N
Parallel Loops 27
1 program main
2 use omp lib
3 i m p l i c i t none
4
5 ! Use a s m a l l e r v a l u e o f ITER i f a v a i l a b l e memory i s t o o s m a l l
6 i n t e g e r , parameter : : ITER = 100000000
7 integer : : i , id
8 r e a l ∗8 : : A( i t e r )
9 r e a l ∗8 : : t d a t a
10 real : : x
11
12 do i = 1 , ITER
13 A( i ) = 2 . 0 ∗ i
14 enddo
15
16 ! $omp p a r a l l e l p r i v a t e ( id , tdata , x )
17
18 i d = omp get thread num ( )
19 t d a t a = omp get wtime ( )
20
21 ! $omp do s c h e d u l e ( s t a t i c )
22 do i = 1 , ITER
23 x = i ∗ 1.0
24 A( i ) = A( i ) ∗ s q r t ( x ) / ( s i n ( x ) ∗∗ tan ( x ) )
25 enddo
26
27 t d a t a = omp get wtime ( ) − t d a t a
28
29 i f ( i d == 0 ) p r i n t ∗ , ”Time s p e n t i s ” , tdata , ” s e c ”
30
31 ! $omp end p a r a l l e l
32 end program main
28 Chapter 5
1 program main
2
3 use omp lib
4 i m p l i c i t none
5
6 ! Use a s m a l l e r v a l u e o f ITER i f a v a i l a b l e memory i s t o o s m a l l
7 i n t e g e r , parameter : : ITER = 50000000
8 integer : : i , id
9 r e a l ∗8 : : t d a t a
10 i n t e g e r : : sum = 0
11
12 ! $omp p a r a l l e l p r i v a t e ( i , id , t d a t a )
13 i d = omp get thread num ( )
14 t d a t a = omp get wtime ( )
15
16 ! $omp do r e d u c t i o n (+:sum ) s c h e d u l e ( dynamic )
17 do i = 2 , ITER
18 i f ( c h e c k p r i m e ( i ) == 1 ) sum = sum + 1
19 enddo
20 ! $omp end do
21
22 t d a t a = omp get wtime ( ) − t d a t a
23
24 i f ( i d == 0 ) p r i n t ∗ , ”Number o f prime numbers i s ” , &
25 & sum , ” i n ” , tdata , ” s e c ”
26 ! $omp end p a r a l l e l
27
28 contains
29 i n t e g e r f u n c t i o n c h e c k p r i m e (num)
30 i m p l i c i t none
31 i n t e g e r , i n t e n t ( i n ) : : num
32 i n t e g e r : : i , iend
33
Parallel Loops 29
34 i e n d = i n t ( s q r t (num ∗ 1 . 0 ) )
35 do i = 2 , i e n d
36 i f (mod(num , i ) == 0 ) then
37 check prime = 0
38 return
39 endif
40 enddo
41
42 check prime = 1
43
44 end f u n c t i o n c h e c k p r i m e
45
46 end program main
30 Chapter 5
1 r e a l ∗8 : : A( b i g ) , B( b i g ) , C( b i g )
2 integer : : id
3
4 ! $omp p a r a l l e l p r i v a t e ( i d )
5 i d = omp get thread num ( ) + 1
6 A( i d ) = b i g c a l c 1 ( i d )
7
8 ! $omp b a r r i e r
9
10 ! $omp do
11 do i = 1 , N
12 B( i ) = b i g c a l c 2 (C, i )
13 end do
14 ! $omp enddo nowait
15
16 A( i d ) = b i g c a l c 4 ( i d )
17 ! $omp end p a r a l l e l
Parallel Loops 31
1 PROGRAM MAIN
2 USE OMP LIB
3 IMPLICIT NONE
4
5 INTEGER : : i , i d
6 INTEGER, PARAMETER : : num steps =100000000
7 INTEGER : : NTHREADS = 4
8 REAL∗8 : : x , pi , sum , step
9 REAL∗8 : : s t a r t t i m e , run time
10
11 sum = 0 . 0
12 s t e p = 1 . 0 / num steps
13 s t a r t t i m e = OMP GET WTIME( )
14
15 CALL OMP SET NUM THREADS(NTHREADS)
16
17 !$OMP PARALLEL PRIVATE( i , x )
18 !$OMP DO REDUCTION(+:sum )
19 DO i = 1 , num steps
20 x = ( i − 0.5) ∗ step
21 sum = sum + 4 . 0 / ( 1 . 0 + x ∗ x )
22 ENDDO
23 !$OMP END DO
24 !$OMP END PARALLEL
25
26 p i = s t e p ∗ sum
27 r u n t i m e = OMP GET WTIME( ) − s t a r t t i m e
28 WRITE( ∗ , 1 0 0 ) pi , r u n t i m e
29 100 FORMAT( ’ p i i s ’ , f 1 5 . 8 , ’ i n ’ , f 8 . 3 , ’ s e c s ’ )
30 END PROGRAM MAIN
32 Chapter 5
Loop dependence example –The first loop is sequential and contains a loop-carried
dependence. The value of j for a loop index is dependent on the value of j for the previous
loop index. In the parallel code in the second loop, the loop-carried dependence has been
removed by calculating j from the loop control index.
1 ! S e q u e n t i a l code with l o o p dependence
2 i n t e g e r : : i , j , A(MAX)
3 j = 5
4 do i = 1 , MAX
5 j = j + 2
6 A( i ) = b i g ( j )
7 end do
8
9 ! p a r a l l e l code with l o o p dependence removed
10 i n t e g e r : : i , j , A(MAX)
11 ! $omp p a r a l l e l do p r i v a t e ( j )
12 do i = 1 , MAX
13 do j = 5 + 2 ∗ ( i +1)
14 A( i ) = b i g ( j )
15 end do
16 end do
17 ! $omp end p a r a l l e l do
6 OpenMP Data Environment
1 ! F i l e #1:
2 module data mod
3 r e a l ∗8 : : A( 1 0 )
4 end module data mod
5
6 program main
7 u s e data mod
8 i m p l i c i t none
9 i n t e g e r : : index (10)
10 ! $omp p a r a l l e l
11 c a l l work ( i n d e x )
12 ! $omp end p a r a l l e l
13 p r i n t ∗ , index (1)
14 end program main
15
16
17 ! F i l e #2:
18 s u b r o u t i n e work ( i n d e x )
19 u s e data mod
20 i m p l i c i t none
21 i n t e g e r : : index
22 r e a l ∗8 : : temp ( 1 0 )
23 i n t e g e r , s a v e : : count
24 ...
25 end s u b r o u t i n e work
34 Chapter 6
Figure 6.4: The Private clause (note: this program is not correct)
An example of a private clause – The original variable tmp is masked by the private
copy of the variable inside the parallel do region. This program is incorrect since a private
variable is not initialized.
1 ! sample c o m p i l e command t o g e n e r a t e ∗ . o o b j e c t f i l e :
2 ! g f o r t r a n −fopenmp −c F i g 6 . 4 w r o n g P r i v a t e . f 9 0
3
4 program wrong
5 i n t e g e r : : tmp
6 tmp = 0
7
8 ! $omp p a r a l l e l do p r i v a t e ( tmp )
9 do j = 1 , 1000
10 tmp = tmp +j
11 enddo
12 ! $omp end p a r a l l e l do
13
14 p r i n t ∗ , tmp ! tmp i s 0 h e r e
15 end program wrong
36 Chapter 6
A second example of the private clause – This Fortran program works (unlike
the corresponding C code which has a subtle bug). tmp is a local variable in subroutine OK,
not the one from the module file as updated in subroutine work, hence the value printed
in line 13 should be the same as the original value 0 as defined in line 9 before the parallel
region.
1 ! F i l e #1
2 module data mod
3 i n t e g e r : : tmp
4 end module data mod
5
6 s u b r o u t i n e OK( )
7 i m p l i c i t none
8 i n t e g e r : : tmp
9 tmp = 0
10 ! $omp p a r a l l e l p r i v a t e ( tmp )
11 c a l l work ( )
12 ! $omp end p a r a l l e l
13 p r i n t ∗ , tmp ! tmp i s 0 , same a s t h e o r i g i n a l l o c a l v a l u e
14 ! d ef i ne d b e f o r e the p a r a l l e l r e g i o n
15 end s u b r o u t i n e OK
16
17 ! F i l e #2
18 s u b r o u t i n e work ( )
19 u s e data mod
20 i m p l i c i t none
21 tmp = 5
22 end s u b r o u t i n e work
OpenMP Data Environment 37
1 incr = 0
2 ! $omp p a r a l l e l do f i r s t p r i v a t e ( i n c r )
3 do i = 1 , MAX
4 i f (mod( i , 2 ) == 0 ) i n c r = i n c r + 1
5 A( i ) = i n c r
6 end do
7 ! $omp end p a r a l l e l do
An OpenMP data environment quiz – Consider the storage attributes and values
for A, B and C.
1 A = 1
2 B = 1
3 C = 1
4 ! $omp p a r a l l e l p r i v a t e (B) f i r s t p r i v a t e (C)
38 Chapter 6
Figure 6.8 and 6.9: Find the area of the Mandelbrot set
Mandelbrot set area: original code with errors – This version of the program
has multiple bugs. Your job is to inspect the code and find the bugs.
MODULE mandel_module
implicit none
REAL(KIND = DP) :: r
TYPE d_complex
REAL(KIND = DP) :: r
REAL(KIND = DP) :: i
END TYPE d_complex
TYPE(d_complex) :: c
contains
SUBROUTINE testpoint()
!C iterate over z=z*z+c. |z| > 2 means the point is outside set
OpenMP Data Environment 39
implicit none
TYPE(d_complex) :: z
INTEGER :: iter
REAL(KIND = DP) :: temp
z = c
DO iter = 1, MAXITER
temp = (z%r*z%r) - (z%i*z%i) + c%r
z%i = z%r*z%i*2 + c%i
z%r = temp
END SUBROUTINE
PROGRAM mandel_wrong
USE OMP_LIB
USE mandel_module
IMPLICIT NONE
INTEGER :: i, j
REAL(KIND = DP) :: area, error
REAL(KIND = DP) :: eps = 1.0e-5
DO i = 1, NPOINTS
DO j = 1, NPOINTS
c%r = -2.0 + 2.5 * DBLE(i-1) / DBLE(NPOINTS) + eps
c%i = 1.125 * DBLE(j-1) / DBLE(NPOINTS) + eps
CALL testpoint()
ENDDO
ENDDO
!$OMP END PARALLEL DO
!C Calculate area of set and error estimate and output the results
MODULE mandel_par_module
implicit none
TYPE d_complex
REAL(KIND = DP) :: r
REAL(KIND = DP) :: i
END TYPE d_complex
contains
SUBROUTINE testpoint(c)
!C iterate over z=z*z+c. |z| > 2 means the point is outside set
!C If loop count reaches MAXITER, point is inside the set
implicit none
TYPE(d_complex) :: z,c
INTEGER :: iter
42 Chapter 6
z = c
DO iter = 1, MAXITER
temp = (z%r*z%r) - (z%i*z%i) + c%r
z%i = z%r*z%i*2 + c%i
z%r = temp
END SUBROUTINE
PROGRAM mandel_par
USE OMP_LIB
USE mandel_par_module
IMPLICIT NONE
INTEGER :: i, j
REAL(KIND = DP) :: area, error
REAL(KIND = DP) :: eps = 1.0e-5
TYPE(d_complex) :: c
! CALL OMP_SET_NUM_THREADS(4)
DO i = 1, NPOINTS
DO j = 1, NPOINTS
c%r = -2.0 + 2.5 * DBLE(i-1) / DBLE(NPOINTS) + eps
c%i = 1.125 * DBLE(j-1) / DBLE(NPOINTS) + eps
CALL testpoint(c)
ENDDO
ENDDO
!$OMP END PARALLEL DO
!C Calculate area of set and error estimate and output the results
write(*,*)"numoutside=", numoutside
Static arrays in data environment clauses – The compiler creates a private array
with 1000 values of type int on the stack for each thread.
int varray(1000)
call initv(1000, varray) ! function to initialize the array
allocate (vptr(1000))
call initv(1000, vptr) ! function to initialize the array
You define an array section in terms of the lower-bound, the length of the section,
and the stride.
(lower-bound:upper-bound:stride)
(lower-bound:upper-bound) ! stride implied as one
(:upper-bound:stride ) ! lower-bound implied as one
Using an array section in the previous example, we can have each thread allocate and
copy an original variable that is an array into a parallel region with the directive:
Array sections also work for the other clauses that create private copies of variables
such as private and reduction.
7 Tasks in OpenMP
Serial linked list program– Traverse the linked list and do a block of work (processwork(p))
for each node in the list where we assume processwork(p) for any node is independent of
the other nodes.
1 p => head
2 do
3 c a l l processwork (p)
4 p => p%next
5 i f ( . not . a s s o c i a t e d ( p ) ) e x i t
6 end do
48 Chapter 7
Parallel linked list program without using tasks – Three passes through the
data to count the length of the list, collect values into an array, and process the array in
parallel. This is an example of the inspector-executor design pattern.
1 ! sample c o m p i l e command t o g e n e r a t e ∗ . o o b j e c t f i l e :
2 ! g f o r t r a n −fopenmp −c F i g 7 . 2 l i n k e d L i s t N o T a s k s . f 9 0
3
4 module l i s t m o d
5 i n t e g e r , parameter : : NMAX = 10
6 type : : node
7 i n t e g e r : : data
8 integer : : procResult
9 type ( node ) , p o i n t e r : : next
10 end type node
11 end module l i s t m o d
12
13 program main
14 use l i s t m o d
15 i m p l i c i t none
16
17 type ( node ) , p o i n t e r : : p => n u l l ( )
18 type ( node ) , p o i n t e r : : temp => n u l l ( )
19 type ( node ) , p o i n t e r : : head => n u l l ( )
20 type ( node ) , dimension ( : ) , a l l o c a t a b l e , t a r g e t : : parr
21
22 interface
23 ! i n i t i a l i z e t h e l i s t ( not shown )
24 subroutine i n i t L i s t (p)
25 use l i s t m o d
26 i m p l i c i t none
27 type ( node ) , p o i n t e r : : p
28 end s u b r o u t i n e i n i t L i s t
29
30 ! a l o n g computation ( not shown )
31 i n t e g e r f u n c t i o n work ( data )
32 i m p l i c i t none
Tasks in OpenMP 49
33 i n t e g e r : : data
34 end f u n c t i o n work
35 end i n t e r f a c e
36
37 i n t e g e r : : i , count
38
39 c a l l i n i t L i s t (p)
40
41 ! s a v e head o f t h e l i s t
42 head => p
43
44 count = 0
45 do
46 p = p%next
47 count = count + 1
48 i f ( . not . a s s o c i a t e d ( p ) ) e x i t
49 end do
50
51 a l l o c a t e ( p a r r ( count ) )
52
53 p => head
54 do i = 1 , count
55 p a r r ( i )%data = p%data
56 p => p%next
57 end do
58
59 ! $omp p a r a l l e l do s c h e d u l e ( s t a t i c , 1 )
60 do i = 1 , count
61 c a l l procWork ( p a r r ( i ) )
62 end do
63 ! $omp end p a r a l l e l do
64
65 contains
66
67 s u b r o u t i n e procWork ( a node )
68 use l i s t m o d
69 i m p l i c i t none
70 type ( node ) : : a node
71 integer : : n
50 Chapter 7
72 i n t e g e r , e x t e r n a l : : work
73 n = a node%data
74 a node%p r o c R e s u l t = work ( n )
75 end s u b r o u t i n e procWork
76
77 end program main
Tasks in OpenMP 51
Schrödinger’s Program – Two threads each generates two tasks. They wait a random
bit of time and then set a shared variable to true or false. Whichever task executes last
determines the final value of the variable and whether the cat is “dead” or “alive”.
1 ! S c h r o d i n g e r s r a c y program . . . i s t h e c a t dead o r a l i v e ?
2 !
3 ! You can u s e a t o m i c s and make t h e program r a c e f r e e , o r comment out
4 ! t h e a t o m i c s and run with a r a c e c o n d i t i o n . I t works i n both c a s e s
5 !
6 ! H i s t o r y : Written by Tim Mattson , Feb 2019
7 ! Converted t o F o r t r a n by Helen He , Nov 2019
8
9
10 program main
11 use omp lib
12 i m p l i c i t none
13
14 ! random number g e n e r a t o r p a r a m e t e r s
15 ! ( from n u m e r i c a l r e c i p i e s )
16 i n t e g e r , parameter : : MULT = 4096
17 i n t e g e r , parameter : : ADD = 150889
18 i n t e g e r , parameter : : MOD val = 714025
19
20 r e a l ∗8 : : w a i t v a l , v a l
21 i n t e g e r ∗8 : : rand , i , dcount , l c o u n t , c o i n
22 l o g i c a l : : d e a d o r a l i v e , HorT
23 i n t e g e r , parameter : : NTRIALS = 10
24
25 dcount = 0
26 lcount = 0
27
28 do i = 1 , NTRIALS
29 ! $omp p a r a l l e l num threads ( 2 ) s h a r e d ( d e a d o r a l i v e ) p r i v a t e ( v a l )
30 i f ( omp get thread num ( ) == 0 ) then
31 p r i n t ∗ , ” with ” , omp get num threads ( ) , ” t h r e a d s . ”
32 w r i t e ( ∗ , ’ ( a ) ’ , advance =’no ’ ) ” S c h r o d i n g e r s program s a y s t h e c a t i s ”
33 endif
52 Chapter 7
34
35 ! $omp s i n g l e
36 ! ” f l i p a c o i n ” t o c h o o s e which t a s k i s f o r t h e dead
37 ! c a t and which f o r t h e l i v i n g c a t .
38 c a l l seedIt ( coin )
39 HorT = f l i p ( c o i n )
40
41 ! w i t h o u t t h e atomics , t h e s e t a s k s a r e p a r t i c i p a t i n g i n a data r a c e
42 ! $omp t a s k
43 val = waitAbit ( )
44 ! a s t o r e o f a s i n g l e machine word ( b o o l )
45 ! $omp atomic w r i t e
46 d e a d o r a l i v e = HorT
47 ! $omp end atomic
48 ! $omp end t a s k
49 ! $omp t a s k
50 val = waitAbit ( )
51 ! a s t o r e o f a s i n g l e machine word ( b o o l )
52 ! $omp atomic w r i t e
53 d e a d o r a l i v e = . not . HorT
54 ! $omp end atomic
55 ! $omp end t a s k
56 ! $omp end s i n g l e
57 ! $omp end p a r a l l e l
58
59 i f ( d e a d o r a l i v e ) then
60 print ∗ , ” alive .”
61 lcount = lcount + 1
62 else
63 p r i n t ∗ , ” dead . ”
64 dcount = dcount + 1
65 endif
66 end do ! end l o o p o v e r t r i a l s ( f o r t e s t i n g o n l y )
67
68 p r i n t ∗ , ” dead ” , dcount , ” t i m e s ” , ” and a l i v e ” , l c o u n t , ” t i m e s . ”
69
70 contains
71
72 ! s e e d t h e pseudo random s e q u e n c e with time o f day
Tasks in OpenMP 53
112 end f u n c t i o n w a i t A b i t
113
114 end program main
Tasks in OpenMP 55
An OpenMP single construct example – All threads execute do many things and
do many other things, but only one thread executes exchange boundaries.
!$omp parallel
call do_many_things()
!$omp single
call exchange_boundaries()
!$omp end single
call do_many_other_things()
!$omp end parallel
A basic task example – Inside a parallel region, 3 tasks are created by a single thread.
1 ! $omp p a r a l l e l
2 ! $omp s i n g l e
3 ! $omp t a s k
4 call fred ()
5 ! $omp end t a s k
6 ! $omp t a s k
7 c a l l daisy ()
8 ! $omp end t a s k
9 ! $omp t a s k
10 call billy ()
11 ! $omp end t a s k
12 ! $omp end s i n g l e
13 ! $omp end p a r a l l e l
56 Chapter 7
A taskwait example – Tasks fred and daisy must complete before task billy starts.
1 ! $omp p a r a l l e l
2 ! $omp s i n g l e
3 ! $omp t a s k
4 call fred ()
5 ! $omp end t a s k
6 ! $omp t a s k
7 c a l l daisy ()
8 ! $omp end t a s k
9 ! $omp t a s k w a i t
10 ! $omp t a s k
11 call billy ()
12 ! $omp end t a s k
13 ! $omp end s i n g l e
14 ! $omp end p a r a l l e l
Figure 7.9: Tasks make parallel linked list traversal really simple
Linked list with tasks – The implementation with OpenMP tasks is much more
elegant than the three-pass solution in Figure 7.2.
1 ! $omp p a r a l l e l
2 ! $omp s i n g l e
3 p => head
4 do
5 ! $omp t a s k f i r s t p r i v a t e ( p )
6 c a l l processwork (p)
7 ! $omp end t a s k
8 p => p%next
9 i f ( . not . a s s o c i a t e d ( p ) ) e x i t
10 end do
11 ! $omp end s i n g l e
12 ! $omp end p a r a l l e l
58 Chapter 7
Figure 7.11: Parallel Fibonacci program using the divide and con-
quer pattern
32 i n t e g e r : : NW, r e s u l t
33 NW = 30
34 ! $omp p a r a l l e l
35 ! $omp s i n g l e
36 r e s u l t = f i b (NW)
37 ! $omp end s i n g l e
38 ! $omp end p a r a l l e l
39 p r i n t ∗ , ” f i b ( ” ,NW, ” ) = ” , r e s u l t
40 end program main
Tasks in OpenMP 61
1 PROGRAM MAIN
2
3 USE OMP LIB
4 IMPLICIT NONE
5
6 INTEGER : : i
7 INTEGER, PARAMETER : : num steps = 100000000
8 REAL∗8 : : x , pi , sum , s t e p
9 REAL∗8 : : s t a r t t i m e , r u n t i m e
10
11 sum = 0 . 0
12
13 s t e p = 1 . 0 / num steps
14 s t a r t t i m e = OMP GET WTIME( )
15
16 DO i = 1 , num steps
17 x = ( i − 0.5) ∗ step
18 sum = sum + 4 . 0 / ( 1 . 0 + x ∗ x )
19 ENDDO
20
21 p i = s t e p ∗ sum
22 r u n t i m e = OMP GET WTIME( ) − s t a r t t i m e
23
24 WRITE( ∗ , 1 0 0 ) pi , num steps , r u n t i m e
25 100 FORMAT( ’ p i = ’ , f 1 5 . 8 , ’ , ’ , i 1 4 , ’ s t e p s , ’ , f 8 . 3 , ’ s e c s ’ )
26
27 END PROGRAM MAIN
62 Chapter 7
Serial Pi program using the divide and conquer pattern –Just to make the
code simpler, we pick a number of steps that is a power of 2. This way we can split the
number of steps in half repeatedly and always create intervals that are divisible by 2.
34
35 integer : : i
36 r e a l ∗8 : : pi , sum , s t e p
37
38 s t e p = 1 . 0 / num steps
39 sum = pi comp ( 0 , num steps , s t e p )
40 p i = s t e p ∗ sum
41
42 WRITE( ∗ , 1 0 0 ) pi , num steps
43 100 FORMAT( ’ p i = ’ , f 1 5 . 8 , ’ , ’ , i14 , ’ steps ’ )
44
45 end program main
64 Chapter 7
Parallel Pi program using tasks – It is accomplished with the divide and conquer
pattern by splitting the problem into two subtasks to calculate sum1 and sum2, recursively
solving each task, and then combining the results.
34
35 end module data mod
36
37 program main
38 use omp lib
39 u s e data mod
40 i m p l i c i t none
41
42 integer : : i
43 r e a l ∗8 : : pi , sum , s t e p
44
45 s t e p = 1 . 0 / num steps
46
47 ! $omp p a r a l l e l
48 ! $omp s i n g l e
49 sum = pi comp ( 0 , num steps , s t e p ) ;
50 ! $omp end s i n g l e
51 ! $omp end p a r a l l e l
52
53 p i = s t e p ∗ sum
54
55 WRITE( ∗ , 1 0 0 ) pi , num steps
56 100 FORMAT( ’ p i = ’ , f 1 5 . 8 , ’ , ’ , i14 , ’ steps ’ )
57
58 end program main
8 OpenMP Memory Model
A program with a race condition – A relaxed memory model permits the assertion
to fail; i.e., the thread with id == 1 can observe values in memory such that r ==1 while
x is still 0.
1 program main
2 use omp lib
3 i m p l i c i t none
4
5 integer : : x , y , r
6 i n t e g e r : : id , n t h r d s
7 x = 0
8 y = 0
9 r = 0
10 c a l l o m p s e t n u m t h r e a d s ( 2 ) ! r e q u e s t two t h r e a d s
11 ! $omp p a r a l l e l p r i v a t e ( id , n t h r d s )
12 i d = omp get thread num ( )
13 ! $omp s i n g l e
14 n t h r d s = omp get num threads ( )
15 ! v e r i f y t h a t we have a t l e a s t two t h r e a d s
16 i f ( nthrds < 2) stop 1
17 ! $omp end s i n g l e
18
19 i f ( i d == 0 ) then
20 x = 1
21 r = x
22 e l s e i f ( i d == 1 ) then
23 i f ( r == 1 ) then
24 y = x;
25 i f ( y /= 1 ) then
26 s t o p ” f a i l s y==1” ! Assertion w i l l occasionally f a i l ;
27 ! i . e . , r == 1 w h i l e x == 0
28 endif
29 endif
30 endif
31 ! $omp end p a r a l l e l
32 end program main
68 Chapter 8
1
2 ! sample c o m p i l e command :
3 ! g f o r t r a n −fopenmp −c F i g 8 . 3 regPromote . f 9 0
4 ! to generate ∗ . o o b j e c t f i l e
5
6 program main
7 use omp lib
8 i m p l i c i t none
9
10 interface
11 f u n c t i o n d o i t (A, N, i d )
12 i n t e g e r : : N, i d
13 r e a l ∗8 : : A(N)
14 r e a l ∗8 : : d o i t
15 end f u n c t i o n
16 end i n t e r f a c e
17
18 i n t e g e r , parameter : : MAX = 10000
19 i n t e g e r , parameter : : NMAX = 1000
20 r e a l , parameter : : TOL = 0 . 0 0 0 1
21
22 integer : : iter , N
23 r e a l ∗8 : : A(NMAX)
24 r e a l ∗8 : : conv
25 integer : : id , nthrd
26
27 iter = 0
28 N = 1000
29 A = 0.0
OpenMP Memory Model 69
30 conv = 0 . 0
31
32 ! $omp p a r a l l e l s h a r e d (A, N, i t e r ) f i r s t p r i v a t e ( conv ) p r i v a t e ( id , nthrd )
33 i d = omp get thread num ( )
34 nthrd = omp get num threads ( )
35
36 do w h i l e ( i t e r < MAX)
37 conv = d o i t (A, N, i d )
38 i f ( conv < TOL) e x i t
39 i f ( i d == 0 ) i t e r = i t e r + 1
40 end do
41
42 ! $omp end p a r a l l e l
43
44 end program main
70 Chapter 8
Reductions need a barrier – This program carries out a computation inside a parallel
loop and accumulates the result with a reduction. The function called after the loop uses
the SPMD pattern and does not use any of the values computed in the loop, hence the
programmer used a nowait clause. The last function uses the reduction variable which may
not be available for all threads since the reduction is only guaranteed to complete at the
next barrier following the loop. As a result, this is an incorrect program.
1 i n t e g e r : : id , nthrds , i
2 ! $omp p a r a l l e l s h a r e d (A, B, sum ) p r i v a t e ( id , n t h r d s )
3 i d = omp get thread num ( )
4 n t h r d s = omp get num threads ( )
5
6 ! $omp do r e d u c t i o n (+:sum )
7 do i = 1 , N
8 sum = sum + b i g j o b (A,N)
9 end do
10 ! $omp end do nowait
11
12 b i g g e r j o b (B, i d ) ! a f u n c t i o n t h a t d o e s not u s e A
13 a n o t h e r j o b ( sum , i d ) ! sum may not be a v a i l a b l e
14 ! $omp end p a r a l l e l
OpenMP Memory Model 71
This chapter provides a summary of the items from OpenMP that make up the
Common Core. Hence, there are not any example programs in this chapter.
10 Multithreading beyond the Common Core
integer :: nthreads
nthreads = omp_get_num_threads()
1 ! sample command t o c o m p i l e t o o b j e c t f i l e :
2 ! g f o r t r a n −fopenmp −c F i g 1 0 . 1 parClaw . f 9 0
3
4 program main
5 use omp lib
6
7 interface
8 ! i n i t i a l i z a t i o n function
9 s u b r o u t i n e i n i t M a t s (N, A, T)
10 integer : : N
11 r e a l , d i m e n s i o n ( : , : ) , a l l o c a t a b l e : : A, T
12 end s u b r o u t i n e
13 ! transform function
14 s u b r o u t i n e t r a n s f o r m (N, id , Nthrds , A, T)
15 i n t e g e r : : N, id , Nthrds
16 r e a l , d i m e n s i o n ( : , : ) , a l l o c a t a b l e : : A, T
17 end s u b r o u t i n e
18 end i n t e r f a c e
19
20 real : : trace = 0
21 i n t e g e r : : i , id , N, Nthrds
22 r e a l , d i m e n s i o n ( : , : ) , a l l o c a t a b l e : : A, T
23
24 i n t e g e r : : narg ! number o f Arg
25 c h a r a c t e r ( l e n =10) : : name ! Arg name
26
27 narg = command argument count ( )
28 i f ( narg == 1 ) then
29 c a l l get command argument ( 1 , name )
30 r e a d ( name , ∗ ) N
31 else
Multithreading beyond the Common Core 77
32 N = 10
33 endif
34 p r i n t ∗ , ”N=”, N
35
36 ! a l l o c a t e s p a c e f o r two N x N m a t r i c e s and i n i t i a l i z e them
37 a l l o c a t e (T(N,N) )
38 a l l o c a t e (A(N,N) )
39 c a l l i n i t M a t s (N, A, T)
40
41 ! $omp p a r a l l e l i f (N>100) num threads ( 4 ) d e f a u l t ( none ) &
42 ! $omp& s h a r e d (A, T,N) p r i v a t e ( i , id , Nthrds ) r e d u c t i o n (+: t r a c e )
43 i d = omp get thread num ( )
44 Nthrds = omp get num threads ( )
45 c a l l t r a n s f o r m (N, id , Nthrds , A, T)
46
47 ! compute t r a c e o f A matrix
48 ! i . e . , t h e sum o f d i a g o n a l e l e m e n t s
49 ! $omp do
50 do i = 1 , N
51 t r a c e = t r a c e + A( i , i )
52 enddo
53 ! $omp end do
54 ! $omp end p a r a l l e l
55 p r i n t ∗ , ” t r a n s f o r m c o m p l e t e with t r a c e = ” , t r a c e
56 end program main
integer :: i
!$omp do lastprivate(ierr)
do i = 1, N
ierr = work(i)
enddo
!$omp end do
78 Chapter 10
1 ! sample c o m p i l e command t o g e n e r a t e ∗ . o o b j e c t f i l e
2 ! g f o r t r a n −fopenmp −c F i g 1 0 . 3 runtimeEx . f 9 0
3
4 s u b r o u t i n e f o r c e s ( npart , x , f , s i d e , r c o f f )
5 use omp lib
6 i m p l i c i t none
7
8 interface
9 ! e x t e r n a l f u n c t i o n f o r p o t e n t i a l e n e r g y term
10 f u n c t i o n pot ( d i s t ) r e s u l t ( r e s )
11 r e a l ∗8 : : d i s t
12 r e a l ∗8 : : r e s
13 end f u n c t i o n pot
14 end i n t e r f a c e
15
16 i n t e g e r ( kind=omp sched kind ) : : kind
17 integer : : chunk size
18 l o g i c a l : : DEBUG
19 i n t e g e r : : npart , i , j
20 r e a l ∗8 : : x ( 0 : n p a r t ∗3+2) , f ( 0 : n p a r t ∗3+2)
21 r e a l ∗8 : : s i d e , r c o f f
22 r e a l ∗8 : : f x i , f y i , f z i
23 r e a l ∗8 : : xx , yy , zz , rd , fcomp
24
25 c h a r a c t e r ( l e n = : ) , a l l o c a t a b l e : : schdKind ( : )
26 a l l o c a t e ( c h a r a c t e r ( 8 ) : : schdKind ( 0 : 4 ) )
27 ! map s c h e d u l e kind enum v a l u e s t o s t r i n g s f o r p r i n t i n g
28
29 schdKind ( 0 ) = ”ERR”
30 schdKind ( 1 ) = ” s t a t i c ”
31 schdKind ( 2 ) = ” dynamic ”
Multithreading beyond the Common Core 79
32 schdKind ( 3 ) = ” g u i d e d ”
33 schdKind ( 4 ) = ” auto ”
34 DEBUG = . t r u e .
35
36 ! $omp p a r a l l e l do s c h e d u l e ( runtime ) &
37 ! $omp p r i v a t e ( f x i , f y i , f z i , j , xx , yy , zz , rd , fcomp )
38
39 do i = 0 , n p a r t ∗3 −1 , 3
40 ! zero f o r c e components on p a r t i c l e i
41 fxi = 0.0
42 fyi = 0.0
43 fzi = 0.0
44
45 ! l o o p o v e r a l l p a r t i c l e s with i n d e x > i
46 do j = i +3, n p a r t ∗3 −1 , 3
47
48 ! compute d i s t a n c e between i and j with wraparound
49 xx = x ( i ) − x ( j )
50 yy = x ( i +1) − x ( j +1)
51 z z = x ( i +2) − x ( j +2)
52
53 if ( xx < ( −0.5∗ s i d e ) ) xx = xx + s i d e
54 if ( xx > ( 0 . 5 ∗ s i d e ) ) xx = xx − s i d e
55 if ( yy < ( −0.5∗ s i d e ) ) yy = yy + s i d e
56 if ( yy > ( 0 . 5 ∗ s i d e ) ) yy = yy − s i d e
57 if ( z z < ( −0.5∗ s i d e ) ) z z = z z + s i d e
58 if ( zz > ( 0 . 5 ∗ s i d e ) ) zz = zz − s i d e
59 rd = xx ∗ xx + yy ∗ yy + z z ∗ z z
60
61 ! i f d i s t a n c e i s i n s i d e c u t o f f r a d i u s , compute f o r c e s
62 i f ( rd <= r c o f f ∗ r c o f f ) then
63 fcomp = pot ( rd )
64 f x i = f x i + xx∗ fcomp
65 f y i = f y i + yy∗ fcomp
66 f z i = f z i + z z ∗ fcomp
67 !$OMP c r i t i c a l
68 f ( j ) = f ( j ) − xx∗ fcomp
69 f ( j +1) = f ( j +1) − yy∗ fcomp
70 f ( j +2) = f ( j +2) − z z ∗ fcomp
80 Chapter 10
71 !$OMP end c r i t i c a l
72 endif
73 enddo
74 ! update f o r c e s on p a r t i c l e i
75 f ( i ) = f ( i ) + fxi
76 f ( i +1) = f ( i +1) + f y i
77 f ( i +2) = f ( i +2) + f z i
78 enddo
79 ! $omp end p a r a l l e l do
80
81 i f (DEBUG) then
82 c a l l o m p g e t s c h e d u l e ( kind , c h u n k s i z e )
83 p r i n t ∗ , ” s c h e d u l e ” , schdKind ( kind ) , ” c h u n k s i z e =”, c h u n k s i z e
84 endif
85 end s u b r o u t i n e f o r c e s
Multithreading beyond the Common Core 81
Task Dependencies – This program implements the DAG (Directed Acyclic Graph)
shown in Figure 10.5. The functions represent the nodes and the edges of the DAG are
captured by the patterns of depend clauses.
1 ! sample c o m p i l e command t o g e n e r a t e ∗ . o o b j e c t f i l e
2 ! g f o r t r a n −fopenmp −c F i g 1 0 . 6 taskDep . f 9 0
3
4 program main
5 use omp lib
6 i m p l i c i t none
7 e x t e r n a l : : AWork , BWork , Cwork , Dwork , Ework
8 r e a l : : A, B, C, D, E
9
10 ! $omp p a r a l l e l s h a r e d (A, B, C, D, E)
11 ! $omp s i n g l e
12 ! $omp t a s k depend ( out :A)
13 c a l l Awork (A)
14 ! $omp end t a s k
15 ! $omp t a s k depend ( out : E)
16 c a l l Ework (E)
17 ! $omp end t a s k
18 ! $omp t a s k depend ( i n :A) depend ( out : B)
19 c a l l Bwork (B)
20 ! $omp end t a s k
21 ! $omp t a s k depend ( i n :A) depend ( out : C)
22 c a l l Cwork (C)
23 ! $omp end t a s k
24 ! $omp t a s k depend ( i n : B, C, E)
25 c a l l Dwork (E)
26 ! $omp end t a s k
27 ! $omp end s i n g l e
28 ! $omp end p a r a l l e l
29 end program main
Multithreading beyond the Common Core 83
1 ! sample c o m p i l e command t o g e n e r a t e ∗ . o o b j e c t f i l e
2 ! g f o r t r a n −fopenmp −c F i g 1 0 . 7 t h r e a d p r i v . f 9 0
3
4 module data mod
5 i m p l i c i t none
6 i n t e g e r : : counter
7 ! $omp t h r e a d p r i v a t e ( c o u n t e r )
8 type node
9 i n t e g e r : : data
10 type ( node ) , p o i n t e r : : next
11 end type node
12 contains
13 subroutine i n i t l i s t (p)
14 type ( node ) , p o i n t e r : : p
15 ! i n i t l i s t here
16 end s u b r o u t i n e
17 subroutine processwork (p)
18 type ( node ) , p o i n t e r : : p
19 ! p r o c e s work h e r e
20 end s u b r o u t i n e
21 subroutine f r e e L i s t (p)
22 type ( node ) , p o i n t e r : : p
23 ! f r e e l i s t here
24 end s u b r o u t i n e
25 subroutine inc count ()
26 counter = counter + 1
27 end s u b r o u t i n e
28 end module data mod
29
30 program main
84 Chapter 10
A counter with threadprivate in Fortran – This code come from the OpenMP
4.5 Examples document (threadprivate.1.f). This Fortran function creates a global scope
variable in Fortran through common blocks. Hence, the counter is placed in a named
common block and that block is made threadprivate.
1 INTEGER FUNCTION INCREMENT COUNTER( )
2 COMMON/INC COMMON/COUNTER
3 !$OMP THREADPRIVATE( /INC COMMON/ )
4 COUNTER = COUNTER +1
5 INCREMENT COUNTER = COUNTER
6 RETURN
7 END FUNCTION INCREMENT COUNTER
11 Synchronization and the OpenMP Memory Model
Examples of sequence points – This code shows the most common sequence points
and the relations sequenced-before, indeterminately sequenced, and unsequenced.
1 ! sample c o m p i l e command t o g e n e r a t e ∗ . o o b j e c t f i l e :
2 ! g f o r t r a n −fopenmp −c F i g 1 1 . 1 s e q P t s . f 9 0
3 program main
4 i m p l i c i t none
5 integer : : a , b, c , d, e
6 i n t e g e r : : i , N = 100
7 i n t e g e r , e x t e r n a l : : func1 , func2 , f u n c 3
8
9 ! Each s e m i c o l o n d e f i n e s a s e q u e n c e p o i n t . . .
10 ! a l l o r d e r e d by sequenced−b e f o r e r e l a t i o n s .
11
12 a = 1; b = 2; c = 0
13
14 ! 3 sequence points : the f u l l statement plus the 2 f u n c t i o n c a l l s .
15 ! The + o p e r a t o r i s not a s e q u e n c e p o i n t s o t h e f u n c t i o n c a l l s
16 ! a r e u n o r d e r e d and t h e r e f o r e , i n d e t e r m i n a t e l y s e q u e n c e d .
17
18 d = func2 ( a ) + func3 (b)
19
20 ! each e x p r e s s i o n i n t h e f o r s t a t e m e n t i s a s e q u e n c e p o i n t .
21 ! they o c c u r i n a sequenced−b e f o r e r e l a t i o n .
22
23 do i = 1 , N
24 ! f u n c t i o n i n v o c a t i o n s a r e each a s e q u e n c e p o i n t . Argument
25 ! e v a l u a t i o n s are unordered or i n d e t e r m i n a t e l y sequenced .
26 e = func1 ( func2 ( a ) , func3 (b ) )
27 enddo
28
29 ! There i s no F o r t r a n i n c r e m e n t s y n t a x such a s a++ i n C .
30 ! a = a + 1 e v a l u a t e s a + 1 f i r s t , then s t o r e t h e new v a l u e i n a .
31
32 a = a + 1
33 end program main
88 Chapter 11
1 ! sample c o m p i l e command t o g e n e r a t e ∗ . o o b j e c t f i l e
2 ! g f o r t r a n −fopenmp −c F i g 1 1 . 6 h i s t . f 9 0
3
4 program main
5 use omp lib
6 i m p l i c i t none
7
8 i n t e g e r , parameter : : n u m t r i a l s = 1000000 ! number o f x v a l u e s
9 i n t e g e r , parameter : : num bins = 100 ! number o f b i n s i n h i s t o g r a m
10 r e a l ∗ 8 , s a v e : : xlow = 0 . 0 ; ! low end o f x r a n g e
11 r e a l ∗8 , save : : xhi = 1 0 . 0 ; ! h i g h end o f x r a n g e
12
13 r e a l ∗8 : : x
14 i n t e g e r ∗8 : : h i s t ( num bins ) ! the histogram
15 i n t e g e r ∗8 : : i v a l , i
16 r e a l ∗8 : : b i n w i d t h ! t h e width o f each b i n i n t h e hi , stogram
17 r e a l ∗8 : : sumh , sumhsq , ave , s t d d e v
18 ! h i s t l c k s i s an a r r a y o f l o c k s , one p e r bucket
19 i n t e g e r ( kind=o m p l o c k k i n d ) : : h i s t l c k s ( num bins )
20
21
22 interface
23 f u n c t i o n drandom ( ) r e s u l t ( v a l )
24 r e a l ∗8 : : v a l
25 end f u n c t i o n
26 subroutine seed ( low in , h i i n )
27 r e a l ∗8 , i n t e n t ( in ) : : low in , h i i n
28 end s u b r o u t i n e
29 end i n t e r f a c e
30
31 c a l l s e e d ( xlow , x h i ) ! s e e d random g e n e r a t o r o v e r r a n g e o f x
32 b i n w i d t h = ( x h i − xlow ) / d b l e ( num bins )
33
92 Chapter 11
34 ! i n i t i a l i z e t h e h i s t o g r a m and t h e a r r a y o f l o c k s
35 ! $omp p a r a l l e l do s c h e d u l e ( s t a t i c )
36 do i = 1 , num bins
37 hist ( i ) = 0
38 c a l l omp init lock ( hist lcks ( i ))
39 enddo
40 ! $omp end p a r a l l e l do
41 ! t e s t uniform pseudorandom s e q u e n c e by a s s i g n i n g v a l u e s
42 ! to the r i g h t histogram bin
43 ! $omp p a r a l l e l do s c h e d u l e ( s t a t i c ) p r i v a t e ( x , i v a l )
44 do i = 1 , n u m t r i a l s
45 x = drandom ( )
46 i v a l = i n t 8 ( ( x − xlow ) / b i n w i d t h )
47 ! protect histogram bins .
48 ! Low overhead due t o uncontended l o c k s
49 c a l l omp set lock ( h i s t l c k s ( i v a l ))
50 hist ( ival ) = hist ( ival ) + 1
51 c a l l omp unset lock ( h i s t l c k s ( i v a l ))
52 enddo
53 ! $omp end p a r a l l e l do
54
55 sumh = 0 . 0
56 sumhsq = 0 . 0
57 ! compute s t a t i s t i c s ( ave , s t d d e v ) and d e s t r o y l o c k s
58 ! $omp p a r a l l e l do s c h e d u l e ( s t a t i c ) r e d u c t i o n (+:sumh , sumhsq )
59 do i = 1 , num bins
60 sumh = sumh + h i s t ( i )
61 sumhsq = sumhsq + h i s t ( i ) ∗ h i s t ( i )
62 c a l l omp destroy lock ( h i s t l c k s ( i ))
63 enddo
64 ! $omp end p a r a l l e l do
65
66 ave = sumh / d b l e ( num bins )
67 s t d d e v = s q r t ( sumhsq / d b l e ( num bins ) − ave ∗ ave )
68 end program main
Synchronization and the OpenMP Memory Model 93
STREAM initialization with and without first touch – Without first touch:
step 1.a + step 2. With first touch: step 1.b + step 2.
1 ! Step 1 . a I n i t i a l i z a t i o n by i n i t i a l t h r e a d o n l y
2 do j = 1 , V e c t o r S i z e
3 a( j ) = 1.0
4 b( j ) = 2.0
5 c ( j ) = 0.0
6 enddo
7
8 ! Step 1 . b I n i t i a l i z a t i o n by a l l t h r e a d s ( f i r s t touch )
9 c a l l omp set dynamic ( 0 )
10 ! $omp p a r a l l e l do s c h e d u l e ( s t a t i c )
11 do j = 1 , V e c t o r S i z e
12 a( j ) = 1.0
13 b( j ) = 2.0
14 c ( j ) = 0.0
15 enddo
16 ! $omp end p a r a l l e l do
17
18 ! Step 2 Compute
19 ! $omp p a r a l l e l do s c h e d u l e ( s t a t i c )
20 do j = 1 , V e c t o r S i z e
21 a( j ) = b( j ) + d ∗ c ( j )
22 enddo
23 ! $omp end p a r a l l e l do
96 Chapter 12
Affinity format example – We set the thread affinity format string and then ran the
STREAM benchmark on the server-node with logical CPU numbering from Figure 12.4.
We show two different executions of the STREAM benchmark: one with OMP PROC BIND
set to spread and the other with OMP PROC BIND set to close.
$ export OMP_PROC_BIND=close
$ ./stream |sort -k3
<stream results omitted ...>
Thrd Lev=1 , thrd_num=0 , thread_aff=0
Thrd Lev=1 , thrd_num=1 , thread_aff=32
Thrd Lev=1 , thrd_num=2 , thread_aff=2
Thrd Lev=1 , thrd_num=3 , thread_aff=34
Thrd Lev=1 , thrd_num=4 , thread_aff=4
Thrd Lev=1 , thrd_num=5 , thread_aff=36
Thrd Lev=1 , thrd_num=6 , thread_aff=6
Thrd Lev=1 , thrd_num=7 , thread_af=38
98 Chapter 12
Serial Pi program –This program approximates a definite integral using the midpoint
rule The loop iterations are independent other than the summation into sum. Note that we
must explicitly represent all constants as floats to prevent internal operations from using
double precision.
1 PROGRAM MAIN
2 IMPLICIT NONE
3
4 INTEGER : : i
5 INTEGER, PARAMETER : : num steps = 1000000
6 REAL : : x , pi , sum , s t e p
7
8 sum = 0 . 0
9
10 s t e p = 1 . 0 / num steps
11
12 DO i = 1 , num steps
13 x = ( i − 0.5) ∗ step
14 sum = sum + 4 . 0 / ( 1 . 0 + x ∗ x )
15 ENDDO
16
17 p i = s t e p ∗ sum
18 p r i n t ∗ , ” p i =”, p i
19
20 END PROGRAM MAIN
Beyond OpenMP Common Core Hardware 99
1 PROGRAM MAIN
2 IMPLICIT NONE
3
4 INTEGER : : i
5 INTEGER, PARAMETER : : num steps = 100000
6 REAL : : x0 , x1 , x2 , x3 , pi , sum
7 REAL : : s t e p
8
9 sum = 0 . 0
10 s t e p = 1 . 0 / num steps
11
12 DO i = 1 , num steps , 4
13 x0 = ( i − 0 . 5 ) ∗ s t e p
14 x1 = ( i + 0 . 5 ) ∗ s t e p
15 x2 = ( i + 1 . 5 ) ∗ s t e p
16 x3 = ( i + 2 . 5 ) ∗ s t e p
17 sum = sum + 4 . 0 ∗ ( 1 . 0 / ( 1 . 0 + x0 ∗ x0 ) &
18 & + 1 . 0 / ( 1 . 0 + x1 ∗ x1 ) &
19 & + 1 . 0 / ( 1 . 0 + x2 ∗ x2 ) &
20 & + 1 . 0 / ( 1 . 0 + x3 ∗ x3 ) )
21 ENDDO
22
23 p i = s t e p ∗ sum
24
25 WRITE( ∗ , 1 0 0 ) pi , num steps
26 100 FORMAT( ’ p i = ’ , f 1 5 . 8 , ’ , ’ , i 1 4 , ’ s t e p s ’ )
27
28 END PROGRAM MAIN
100 Chapter 12
# File 1: "Fig_12.15_explicitVecPi.f90"
program main
interface
function get_pi_vec () result(r) bind(C, name="get_pi_vec")
use, intrinsic :: iso_c_binding, only : c_float
real(c_float) :: r
end function get_pi_vec
end interface
real :: pi
pi = get_pi_vec()
print *, "in Fortran: pi=", pi
end program
#File 2: "get_pi_vec.c"
#include <x86intrin.h>
static long num_steps = 100000;
float scalar_four = 4.0f, scalar_zero = 0.0f, scalar_one = 1.0f;
float step;
float get_pi_vec ()
{
int i;
float pi;
float vsum[4], ival;
Beyond OpenMP Common Core Hardware 101
1 Save t h e c o n t e n t s i n 2 f i l e s a s below :
2 ! To b u i l d and run :
3 ! % g c c −c −fopenmp g e t p i p a r v e c . c
4 ! % g f o r t r a n −fopenmp g e t p i p a r v e c . o F i g 1 2 . 1 6 pa r Ve cP i . f 9 0
5 ! % . / a . out
6
7 ! Save t h e c o n t e n t s i n 2 f i l e s a s below :
8 ! To b u i l d and run :
9 ! % g c c −c −fopenmp g e t p i p a r v e c . c
10 ! % g f o r t r a n −fopenmp g e t p i p a r v e c . o F i g 1 2 . 1 6 pa r Ve cP i . f 9 0
11 ! % . / a . out
12
13 #F i l e 1 : ” F i g 1 2 . 1 6 p ar Ve cP i . f 9 0 ”
14
15 program main
16 use omp lib
17 interface
18 f u n c t i o n g e t p a r p i v e c ( ) r e s u l t ( r ) bind (C, name=” g e t p a r p i v e c ” )
19 use , i n t r i n s i c : : i s o c b i n d i n g , o n l y : c f l o a t
20 real ( c float ) : : r
21 end f u n c t i o n g e t p a r p i v e c
22 end i n t e r f a c e
23
24 real : : pi
25 pi = get par pi vec ()
26 p r i n t ∗ , ” i n F o r t r a n : p i =”, p i
27 end program
28
29 #F i l e 2 : ” g e t p i p a r v e c . c ”
30
31 #i n c l u d e <x 8 6 i n t r i n . h>
32 s t a t i c l o n g num steps = 1 0 0 0 0 0 ;
33 #d e f i n e MAX THREADS 4
Beyond OpenMP Common Core Hardware 103
73 }
Beyond OpenMP Common Core Hardware 105
OpenMP program to vectorize the Pi program – The simd clause directs the
compiler to explicitly vectorize the program. As with many OpenMP features, this clause
asserts to the compiler that it is safe to vectorize the code and it will do so, even if there
are loop-carried dependencies that should prevent vectorization.
1 PROGRAM MAIN
2 IMPLICIT NONE
3
4 INTEGER : : i
5 INTEGER, PARAMETER : : num steps = 100000
6 REAL : : x , pi , sum , s t e p
7
8 sum = 0 . 0
9
10 s t e p = 1 . 0 / num steps
11
12 !$OMP p a r a l l e l SIMD p r i v a t e ( x ) r e d u c t i o n (+:sum )
13 DO i = 1 , num steps
14 x = ( i − 0.5) ∗ step
15 sum = sum + 4 . 0 / ( 1 . 0 + x ∗ x )
16 ENDDO
17 !$OMP end p a r a l l e l SIMD
18
19 p i = s t e p ∗ sum
20 p r i n t ∗ , ” p i =”, p i
21
22 END PROGRAM MAIN
106 Chapter 12
1 PROGRAM MAIN
2
3 USE OMP LIB
4 IMPLICIT NONE
5
6 INTEGER : : i
7
8 INTEGER, PARAMETER : : num steps = 100000
9 REAL : : x , pi , sum , s t e p
10
11 sum = 0 . 0
12
13 s t e p = 1 . 0 / num steps
14
15 ! $omp p a r a l l e l do simd p r i v a t e ( x ) r e d u c t i o n (+:sum )
16 DO i = 1 , num steps
17 x = ( i − 0.5) ∗ step
18 sum = sum + 4 . 0 / ( 1 . 0 + x ∗ x )
19 ENDDO
20 ! $omp end p a r a l l e l do simd
21
22 p i = s t e p ∗ sum
23
24 p r i n t ∗ , ” p i =”, p i
25 END PROGRAM MAIN
Beyond OpenMP Common Core Hardware 107
1 program main
2 use omp lib
3 i m p l i c i t none
4
5 i n t e g e r , parameter : : N = 1024
6 r e a l ∗8 : : a (N) , b (N) , c (N)
7 integer : : i
8
9 ! i n i t i a l i z e a , b , and c ( code not shown )
10
11 ! $omp t a r g e t
12 ! $omp teams d i s t r i b u t e p a r a l l e l do simd
13 do i = 1 , N
14 c ( i ) = c ( i ) + a( i ) ∗ b( i )
15 enddo
16 ! $omp end teams d i s t r i b u t e p a r a l l e l do simd
17 ! $omp end t a r g e t
18
19 end program main
108 Chapter 12
Explicit data movement with the target directive – The map clause controls
movement of data from the host to a device or from the device onto the host. When
working with pointers to arrays, you need to use array sections to define precisely which
data to move.
1 program main
2 use omp lib
3
4 i n t e g e r , parameter : : N = 1024
5 r e a l ∗8 , a l l o c a t a b l e , dimension ( : ) :: a, b, c
6 integer : : i
7
8 a l l o c a t e ( a (N) )
9 a l l o c a t e ( b (N) )
10 a l l o c a t e ( c (N) )
11
12 ! i n i t i a l i z e a , b , and c ( code not shown )
13
14 ! $omp t a r g e t map( t o : a ( 1 :N) , b ( 1 :N) ) map( tofrom : c ( 1 :N) )
15 ! $omp teams d i s t r i b u t e p a r a l l e l do simd
16 do i = 1 , N
17 c ( i ) = c ( i ) + a( i ) ∗ b( i )
18 enddo
19 ! $omp end teams d i s t r i b u t e p a r a l l e l do simd
20 ! $omp end t a r g e t
21
22 end program main
Beyond OpenMP Common Core Hardware 109
Multiple target regions – The map clause controls movement of data from the host
to a device or from the device onto the host. When working with pointers to arrays, you
need to use array sections to define precisely which data to move.
1 program main
2 use omp lib
3
4 i n t e g e r , parameter : : N = 1024
5 r e a l ∗8 , a l l o c a t a b l e , dimension ( : ) :: a, b, c , d
6 integer : : i
7
8 allocate ( a (N) )
9 allocate ( b (N) )
10 allocate ( c (N) )
11 allocate ( d (N) )
12
13 ! i n i t i a l i z e a , b , and c ( code not shown )
14
15 ! $omp t a r g e t map( t o : a ( 1 :N) , b ( 1 :N) ) map( tofrom : c ( 1 :N) )
16 ! $omp teams d i s t r i b u t e p a r a l l e l do simd
17 do i = 1 , N
18 c ( i ) = c ( i ) + a( i ) ∗ b( i )
19 enddo
20 ! $omp end teams d i s t r i b u t e p a r a l l e l do simd
21 ! $omp end t a r g e t
22
23 ! $omp t a r g e t map( t o : a ( 1 :N) , b ( 1 :N) ) map( tofrom : d ( 1 :N) )
24 ! $omp teams d i s t r i b u t e p a r a l l e l do simd
25 do i = 1 , N
26 d( i ) = d( i ) + a( i ) ∗ c ( i )
27 enddo
28 ! $omp end teams d i s t r i b u t e p a r a l l e l do simd
29 ! $omp end t a r g e t
30
31 end program main
110 Chapter 12
Target Data Region – A single target data region manages data at the level of a
device. It persists and is used between multiple target constructs. code:ompTargDat
1 program main
2 use omp lib
3 i n t e g e r , parameter : : N = 1024
4 r e a l ∗8 , a l l o c a t a b l e , dimension ( : ) :: a, b, c , d
5 integer : : i
6 a l l o c a t e ( a (N) )
7 a l l o c a t e ( b (N) )
8 a l l o c a t e ( c (N) )
9 a l l o c a t e ( d (N) )
10
11 ! i n i t i a l i z e a , b , and c ( code not shown )
12
13 ! $omp t a r g e t data map( t o : a ( 1 :N) , b ( 1 :N) , c ( 1 :N) ) map( tofrom : d ( 1 :N) )
14
15 ! $omp t a r g e t
16 ! $omp teams d i s t r i b u t e p a r a l l e l do simd
17 do i = 1 , N
18 c ( i ) = c ( i ) + a( i ) ∗ b( i )
19 enddo
20 ! $omp end teams d i s t r i b u t e p a r a l l e l do simd
21 ! $omp end t a r g e t
22
23 ! $omp t a r g e t
24 ! $omp teams d i s t r i b u t e p a r a l l e l do simd
25 do i = 1 , N
26 d( i ) = d( i ) + a( i ) ∗ c ( i )
27 enddo
28 ! $omp end teams d i s t r i b u t e p a r a l l e l do simd
29 ! $omp end t a r g e t
30
31 ! $omp end t a r g e t data
32
33 end program main
13 Your Continuing Education in OpenMP
A subtle deadlock with tasks: – This is the tasking.9.c example from the OpenMP
4.5 Examples document. This function can deadlock if the thread suspends task 1 to begin
work on task 2.
1 ! sample c o m p i l e command t o g e n e r a t e ∗ . o o b j e c t f i l e :
2 ! g f o r t r a n c −fopenmp −c F i g 1 3 . 1 taskBug . f 9 0
3
4 s u b r o u t i n e work ( )
5 use omp lib
6 i m p l i c i t none
7
8 ! $omp t a s k ! task 1
9 ! $omp t a s k ! task 2
10 ! $omp c r i t i c a l ! Critical region 1
11 ! do work h e r e
12 ! $omp end c r i t i c a l ! end C r i t i c a l Region 1
13 ! $omp end t a s k ! end t a s k 2
14 ! $omp c r i t i c a l ! C r i t i c a l Region 2
15 ! $omp t a s k ! t a s k 3
16 ! do work h e r e }
17 ! $omp end t a s k ! end t a s k 3
18 ! $omp end c r i t i c a l ! end C r i t i c a l Region 2
19 ! $omp end t a s k ! end t a s k 1
20 end s u b r o u t i n e