
OpenMP Lab

Antonio Gómez-Iglesias
[email protected]
Texas Advanced Computing Center
Introduction

What you will learn
• How to compile code (C and Fortran) with OpenMP
• How to parallelize code with OpenMP
  – Use the correct header declarations
  – Parallelize simple loops
• How to effectively hide OpenMP statements

What you will do
• Modify the example code (READ the CODE COMMENTS)
• Compile and execute the examples
• Compare the run-times of the serial codes and the OpenMP parallel codes with different scheduling methods
Accessing Lab Files

• Log on to Stampede using your account.
• Untar the file lab_OpenMP.tar (in ~train00).
• The new directory (lab_openmp) contains sub-directories for exercises 1-3.
• cd into the appropriate subdirectory for an exercise.

ssh [email protected]
tar -xvf ~train00/lab_OpenMP.tar
cd lab_openmp
Running on Compute Nodes Interactively

• You will be compiling your code on the login node.
• You will be running on the compute nodes.
• In one of the sessions, run idev:
  – idev -A TRAINING-HPC -t 1:00:00
  – idev -A TG-TRA140011 -t 1:00:00
• This will give you access to a compute node.
Compiling

• All OpenMP statements are activated by the OpenMP flag:
  – Intel compiler: icc/ifort -openmp -fpp source.<c,f90>
• Compilation with the OpenMP flag (-openmp):
  – Activates OpenMP comment directives:
      Fortran: !$OMP ...
      C: #pragma omp ...
  – Enables the macro named _OPENMP:
      #ifdef _OPENMP evaluates to true
      (Fortran users: compile with -fpp)
  – Enables "hidden" statements (Fortran only!):
      !$ ...
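
A minimal C sketch (not part of the lab files) of what the _OPENMP macro makes possible; the file name test_openmp.c and the messages are illustrative:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
    /* _OPENMP is defined only when the OpenMP flag is used, so the
       same source builds both serially and in parallel */
#ifdef _OPENMP
    printf("OpenMP enabled, up to %d threads\n", omp_get_max_threads());
#else
    printf("serial build\n");
#endif
    return 0;
}

Compile once with icc -openmp test_openmp.c and once with plain icc to see both branches.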
Exercises – Lab 1

• Exercise 1: Kernel check
  f_kernel.f90/c_kernel.c
  Kernel of the calculation (see exercise 2)
  Parallelize one loop
• Exercise 2: Calculation of π
  f_pi.f90/c_pi.c
  Parallelize one loop with a reduction
• Exercise 3: daxpy (y = a*x + y)
  f_daxpy.f90/c_daxpy.c
  Parallelize one loop
Exercise 1: π Integration Kernel Check

• cd exercise_kernel
• Codes: f_kernel.f90/c_kernel.c
• Number of intervals is varied (Trial loop)

Code structure:
  Kernel
    Trial Loop: itrial
      Calculation of n and deltax
      Loop over i
      make sure area > 0.0

1. Parallelize the loop over i (see the sketch below):
   Use omp parallel do/for
   Set appropriate variables to private
2. Compile with:
   ifort -openmp f_kernel.f90
   icc -openmp c_kernel.c
3. Run with 1, 2, 4, 8, 12, 16 threads:
   e.g. export OMP_NUM_THREADS=4
   ./a.out
   Try also: export KMP_AFFINITY=compact
4. Compare the timings:
   Timings decrease with more threads.
   If you execute with more threads than cores, the timings will NOT decrease. Why?
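
The lab's kernel source is not reproduced here; the following is a hedged C sketch of step 1, assuming a midpoint-rule integrand of 4/(1+x*x) stored per interval. The names n, deltax, and area follow the code structure above; everything else is illustrative:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 10000000, i;
    double deltax = 1.0 / n;
    double *f = malloc(n * sizeof *f);
    double x, area = 0.0;

    /* x is assigned in every iteration, so it must be private;
       n, deltax, and f can stay shared (f is indexed by i) */
    #pragma omp parallel for private(x)
    for (i = 0; i < n; i++) {
        x = (i + 0.5) * deltax;
        f[i] = 4.0 / (1.0 + x * x);
    }

    for (i = 0; i < n; i++)    /* serial sum: this exercise only checks the kernel */
        area += f[i] * deltax;

    printf("area = %f (make sure area > 0.0)\n", area);
    free(f);
    return 0;
}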
Exercise 2: π Integration

• cd exercise_pi
• Codes: f_pi.f90/c_pi.c
• Number of intervals is varied (Trial loop)

Code structure:
  π calculation
    Trial Loop: itrial
      Calculation of n and deltax
      Loop over i

1. Parallelize the loop over i:
   Use omp parallel do/for
   with the default(none) clause
2. Complete the OpenMP statements (see the sketch below):
   – Initialization
   – omp_get_max_threads
   – omp_get_thread_num
3. Compile with:
   make f_pi
   or
   make c_pi
4. Run with 1, 2, 4, 8, 12 threads:
   e.g. export OMP_NUM_THREADS=4
   ./c_pi or ./f_pi
5. Compare timings:
   Timings decrease with more threads.
   What is the scale-up at 12 threads?
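
Again a hedged C sketch, not the lab's c_pi.c: it shows the parallel for with default(none) and a reduction on the π accumulator (named in the Lab 1 overview), plus the omp_get_max_threads call listed above. Loop bounds and variable names are assumptions:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
    int n = 10000000, i;
    double deltax = 1.0 / n, x, pi = 0.0;

#ifdef _OPENMP
    printf("running with up to %d threads\n", omp_get_max_threads());
#endif

    /* default(none): every variable must be scoped explicitly;
       reduction(+:pi) gives each thread a private partial sum
       and combines the sums when the loop ends */
    #pragma omp parallel for default(none) shared(n, deltax) private(x) reduction(+:pi)
    for (i = 0; i < n; i++) {
        x = (i + 0.5) * deltax;
        pi += 4.0 / (1.0 + x * x) * deltax;
    }

    printf("pi = %.12f\n", pi);
    return 0;
}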
Exercise 3: daxpy

• cd exercise_daxpy
• Codes: f_daxpy.f90/c_daxpy.c
• Number of intervals is varied (Trial loop)

Code structure:
  daxpy
    Trial Loop: itrial
      Loop over i

1. Parallelize the loop over i:
   Use omp parallel do/for
   with the default(none) clause
2. Complete the OpenMP statements:
   – Initialization
   – omp_get_max_threads
3. Compile with:
   make f_daxpy
   or
   make c_daxpy
4. Run with 1 and 12 threads
5. Compare timings:
   Why is performance only doubled?
   Hint: Parallel performance can be limited by memory bandwidth. What is happening for every daxpy operation? Is there cache reuse? (See the sketch below.)
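
A hedged C sketch of the daxpy loop (array sizes and values are illustrative, not the lab's c_daxpy.c); the comment points at the bandwidth limit behind the question above:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 50000000, i;
    double a = 2.0;
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);

    for (i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* every iteration loads x[i] and y[i] and stores y[i], and no
       element is ever touched again: there is no cache reuse, so
       the loop is bound by memory bandwidth, not by core count */
    #pragma omp parallel for default(none) shared(n, a, x, y)
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    free(x);
    free(y);
    return 0;
}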
Exercises – Lab 2

• Exercise 4: Update from neighboring cells (2 arrays)
  f_neighbor.f90/c_neighbor.c
  Create a parallel region
  Use a single construct to initialize
  Use a critical construct to update
  Use dynamic or guided scheduling
• Exercise 5: Update from neighboring cells (same array)
  f_red_black.f90/c_red_black.c
  Parallelize 3 individual loops, use a reduction
  Create a parallel region
  Combine loops 1 and 2
  Use a single construct to initialize
Exercise 4: Neighbor Update; Part 1

• cd exercise_neighbor
• Codes: f_neighbor.f90/c_neighbor.c
• Compile with: make f_neighbor
                make c_neighbor

Code structure:
  neighbor update
    Parallel Region
      Initialization: j_update
      Parallelize loop i
      Loop i
        Loop j
          increment j_update
        Loop k
          b is calculated from a

• Parallelize the loop over i
• Use omp parallel do/for with the default(none) clause
• Use a single construct for initialization
• Would a master construct work, too?
• Use critical for the increment of j_update
• Try different schedules: static, dynamic, guided
  (a sketch of these constructs follows below)
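
The lab's neighbor code is not reproduced here; the following is a condensed C sketch of the constructs requested above: a parallel region, a single for initialization, a critical around the shared counter, and a dynamic schedule. The array shapes, the stencil, and the bounds are assumptions:

#include <stdio.h>

#define N 500

double a[N][N], b[N][N];

int main(void) {
    int j_update = 0;

    #pragma omp parallel default(none) shared(a, b, j_update)
    {
        int i, j;   /* declared inside the region, hence private */

        /* one thread initializes the shared counter; the implicit
           barrier of single holds the team until it is done */
        #pragma omp single
        j_update = 0;

        /* dynamic (or guided) helps when iteration costs differ */
        #pragma omp for schedule(dynamic)
        for (i = 1; i < N - 1; i++) {
            for (j = 1; j < N - 1; j++) {
                /* b is calculated from neighboring cells of a */
                b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j]
                                + a[i][j-1] + a[i][j+1]);

                /* an unprotected ++ on a shared counter is a data
                   race; critical serializes the increment */
                #pragma omp critical
                j_update++;
            }
        }
    }
    printf("j_update = %d\n", j_update);
    return 0;
}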
Exercise 4: Neighbor Update; Part 2

• Compile with: make f_neighbor
                make c_neighbor

Code structure:
  neighbor update
    Parallel Region
      Initialization: j_update
      Parallelize loop i
      Loop i
        Loop j
          single or master
            increment j_update
          end single or end master
        Loop k
          b is calculated from a

• Change the single to a master construct (see the contrast sketch below)
• Run with 1 and 12 threads
• How does the number of j_update change?
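
A minimal C contrast of single versus master (the counters and messages are illustrative, not the lab code). The semantic difference, which is one reason the j_update count can change, is in the comments:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        /* single: executed once, by the FIRST thread to arrive,
           with an implicit barrier for the whole team at the end */
        #pragma omp single
        printf("single ran on thread %d\n", omp_get_thread_num());

        /* master: executed ONLY by thread 0; there is no implied
           barrier, so other threads skip it without waiting */
        #pragma omp master
        printf("master ran on thread %d\n", omp_get_thread_num());
    }
    return 0;
}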
Exercise 5: Red-Black Update; Part 1

• cd exercise_redblack
• Codes: f_red_black.f90/c_red_black.c
• Make a copy and create f_red_black_v1.f90/c_red_black_v1.c
• Compile with: make f_red_black_v1
                make c_red_black_v1

Code structure:
  red-black update
    Iteration Loop: niter
      Loop: Update even elements
      Loop: Update odd elements
      Initialize error
      Loop-summation: error

Part 1
• Parallelize each loop separately
• Use omp parallel do/for for the update loops
• Use a reduction for the error calculation
• Use the default(none) clause
• Try static scheduling
  (a sketch of the three loops follows below)
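
A condensed C sketch of the three-loop pattern; the lab's arrays, bounds, and error metric are not reproduced, so the neighbor averaging and the absolute-difference error below are assumptions:

#include <stdio.h>
#include <math.h>

#define N 1000000

double a[N];

int main(void) {
    double error = 0.0;
    int i, iter;

    for (i = 0; i < N; i++) a[i] = (double)i / N;

    for (iter = 0; iter < 10; iter++) {
        /* loop 1: update one color from the left neighbor */
        #pragma omp parallel for default(none) shared(a) schedule(static)
        for (i = 2; i < N; i += 2)
            a[i] = 0.5 * (a[i] + a[i-1]);

        /* loop 2: update the other color from the right neighbor */
        #pragma omp parallel for default(none) shared(a) schedule(static)
        for (i = 1; i < N - 1; i += 2)
            a[i] = 0.5 * (a[i] + a[i+1]);

        /* loop 3: reduction gives each thread a private partial
           sum for error and combines them after the loop */
        error = 0.0;
        #pragma omp parallel for default(none) shared(a) reduction(+:error)
        for (i = 1; i < N; i++)
            error += fabs(a[i] - a[i-1]);
    }
    printf("error = %e\n", error);
    return 0;
}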
Exercise 5: Red-Black Update; Part 2

• cd exercise_redblack
• Start from version 1
• Codes: f_red_black.f90/c_red_black.c
• Make a copy and create f_red_black_v2.f90/c_red_black_v2.c
• Compile with: make f_red_black_v2
                make c_red_black_v2

Code structure:
  red-black update
    Iteration Loop: niter
      Loop: Update even and odd elements
      Initialize error
      Loop-summation: error

Part 2
• Can the loops be combined?
• Why can the update loops be combined?
• Why can the error loop not be combined with the update loops?
• Try static scheduling
• Task: Combine the update loops
Solution 5: Red-Black Update; Part 2

Separate update loops:

!*** Update even elements
do i=2, n, 2
   a(i) = 0.5 * (a(i) + a(i-1))
enddo
!*** Update odd elements
do i=1, n-1, 2
   a(i) = 0.5 * (a(i) + a(i+1))
enddo

Combined update loop:

!*** Update even and odd
!*** in one loop
do i=2, n, 2
   a(i)   = 0.5 * (a(i) + a(i-1))
   a(i-1) = 0.5 * (a(i-1) + a(i))
enddo
Exercise 5: Red-Black Update; Part 3

• cd exercise_redblack
• Start from version 2
• Codes: f_red_black.f90/c_red_black.c
• Make a copy and create f_red_black_v3.f90/c_red_black_v3.c
• Compile with: make f_red_black_v3
                make c_red_black_v3

Code structure:
  red-black update
    Iteration Loop: niter
      parallel region
        Loop: Update even and odd elements
        single
          Initialize error
        end single
        Loop-summation: error
      end parallel region

Part 3
• Make one parallel region around both loops: update and error
• The initialization of error has to be done by one thread
• Use a single construct
• Would a master construct work? (see the sketch below)
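
A hedged C sketch of Part 3, carrying over the assumed names and error metric from the Part 1 sketch: one parallel region contains the combined update loop, a single for the error initialization, and the reduction loop:

#include <stdio.h>
#include <math.h>

#define N 1000000
#define NITER 10

double a[N];

int main(void) {
    double error = 0.0;
    int i, iter;

    for (i = 0; i < N; i++) a[i] = (double)i / N;

    for (iter = 0; iter < NITER; iter++) {
        /* one team for the whole iteration: both work-sharing
           loops and the initialization run inside it */
        #pragma omp parallel default(none) shared(a, error)
        {
            int j;   /* declared inside the region, hence private */

            /* combined even/odd update (see the solution above) */
            #pragma omp for schedule(static)
            for (j = 2; j < N; j += 2) {
                a(j); /* see next lines */
                a[j]   = 0.5 * (a[j]   + a[j-1]);
                a[j-1] = 0.5 * (a[j-1] + a[j]);
            }

            /* one thread resets error; the implicit barrier of
               single holds the team until that happens. master
               has no implied barrier, so in general nothing would
               keep the summation from racing ahead of the reset */
            #pragma omp single
            error = 0.0;

            #pragma omp for reduction(+:error)
            for (j = 1; j < N; j++)
                error += fabs(a[j] - a[j-1]);
        }
    }
    printf("error after %d iterations = %e\n", NITER, error);
    return 0;
}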
Exercise 6: Orphaned work-sharing

• cd exercise_print
• Codes: f_print.f90/c_print.c
• Make a copy and create f_print_v1.f90/c_print_v1.c
• Compile with: make f_print
                make c_print

Code structure:
  parallel region
    print 1
    parallel Loop
      print 2
    call printer_sub
    master
      print 5

  subroutine printer_sub
    parallel Loop
      print 3
    Loop
      print 4

• Inspect the code
• Run with 1, 2, ... threads
• Explain the output
• How often are the 5 print statements executed? Why?
  (a sketch of the structure follows below)
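
A hedged C sketch mirroring the structure above (the loop bounds are assumptions); the omp for inside printer_sub is the orphaned work-sharing construct:

#include <stdio.h>

/* the omp for below has no lexically enclosing parallel region:
   it is "orphaned" and binds to whatever team is active when the
   routine is called */
void printer_sub(void) {
    int i;
    #pragma omp for
    for (i = 0; i < 4; i++)
        printf("print 3, i=%d\n", i);   /* iterations shared by the team */

    for (i = 0; i < 2; i++)             /* plain loop: EVERY thread runs all iterations */
        printf("print 4, i=%d\n", i);
}

int main(void) {
    int i;
    #pragma omp parallel
    {
        printf("print 1\n");            /* once per thread */

        #pragma omp for
        for (i = 0; i < 4; i++)
            printf("print 2, i=%d\n", i);   /* iterations split among threads */

        printer_sub();                  /* the orphaned omp for binds to this team */

        #pragma omp master
        printf("print 5\n");            /* master thread only */
    }
    return 0;
}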