OpenACC 1

This document provides an overview of parallel programming and introduces OpenACC. It defines parallel programming as exposing an algorithm's ability to execute tasks or operations in parallel to improve performance on modern hardware. A real-world example is presented in which cancer detection was improved by simulating light propagation through tissue in parallel rather than sequentially. Amdahl's Law is discussed, noting that the speedup from parallelization is limited by the remaining serial part of a program. Finally, OpenACC is introduced as a directive-based approach to parallel programming that aims for performance, portability, and ease of use across CPUs and GPUs.

INTRODUCTION

MODULE OVERVIEW
Topics to be covered

▪ Introduction to parallel programming
▪ Common difficulties in parallel programming
▪ Introduction to OpenACC
▪ Parallel programming in OpenACC
INTRODUCTION TO PARALLEL PROGRAMMING

WHAT IS PARALLEL PROGRAMMING?
▪ "Performance Programming"
▪ Parallel programming involves exposing an algorithm's ability to execute in parallel
▪ This may involve breaking a large operation into smaller tasks (task parallelism)
▪ Or doing the same operation on multiple data elements (data parallelism)
▪ Parallel execution enables better performance on modern hardware

[Figure: summing A+B+C+D. The sequential version adds one element at a time and takes 3 steps; the parallel version combines pairs (A+B and C+D) simultaneously and takes 2 steps.]
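As a concrete illustration (a minimal sketch in C, not from the slides; the arrays and size N are hypothetical), data parallelism looks like a loop whose iterations are independent of one another:

#include <stdio.h>
#define N 8

int main(void) {
    double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {  /* set up some sample data */
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* Data parallelism: the same operation is applied to each element,
       and no iteration depends on another, so all could run at once. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[%d] = %g\n", N - 1, c[N - 1]);
    return 0;
}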
A REAL WORLD CASE STUDY
Modern cancer research

▪ The Russian Academy of Science created a program to simulate light propagation through human tissue
▪ This program was used to detect cancerous cells more accurately by simulating billions of random paths that light could take through human tissue
▪ With parallel programming, they were able to run thousands of these paths simultaneously
▪ The sequential program took 2.5 hours to run
▪ The parallel version took less than 2 minutes

Parallel Computing Illuminating a Path to Early Cancer Detection


WHAT IS PARALLEL PROGRAMMING?
A real world example

▪ A professor and his 3 teaching assistants (TAs) are grading 1,000 student exams
▪ Each exam has 8 questions on it
▪ Let's assume it takes 1 minute to grade 1 question on 1 exam: 8 questions per exam, 8,000 questions in total, at 1 minute per question
▪ To maintain fairness, if someone grades a question (for example, question #1) then they must grade that question on all other exams
▪ The following is a sequential version of exam grading
SEQUENTIAL SOLUTION

One grader grades everything:

Grade Exams 1-1000 : Questions #1, 2, 3, 4, 5, 6, 7, 8 : 8,000m

Total: 8,000m
SEQUENTIAL SOLUTION

Grading one question at a time across the whole stack:

Exams 1-1000 : Q #1 : 1,000m        Exams 1-1000 : Q #5 : 1,000m
Exams 1-1000 : Q #2 : 1,000m        Exams 1-1000 : Q #6 : 1,000m
Exams 1-1000 : Q #3 : 1,000m        Exams 1-1000 : Q #7 : 1,000m
Exams 1-1000 : Q #4 : 1,000m        Exams 1-1000 : Q #8 : 1,000m

Total: 8,000+ m
SEQUENTIAL SOLUTION

Four graders, each owning two questions, but there is only one stack of exams, so each grader must wait for the previous one to finish:

Exams 1-1000 : Q #1, 2 : 2,000m
Exams 1-1000 : Q #3, 4 : 2,000m
Exams 1-1000 : Q #5, 6 : 2,000m
Exams 1-1000 : Q #7, 8 : 2,000m

Total: 8,000+ m
PARALLEL SOLUTION

Each grader owns one question pair, and the exams are split into four stacks of 250 that rotate among the graders every round (500m per round):

Grader 1 : Q #1, 2 on Exams 1-250, then 251-500, then 501-750, then 751-1000 : 500m per stack
Grader 2 : Q #3, 4 on a different stack each round : 500m per stack
Grader 3 : Q #5, 6 on a different stack each round : 500m per stack
Grader 4 : Q #7, 8 on a different stack each round : 500m per stack

Total: 2,000+ m
PIPELINE

Instead of rotating large stacks, exams flow one at a time through the graders: as soon as Grader 1 finishes Q #1, 2 on an exam (2m), it passes to Grader 2 for Q #3, 4, and so on. After a short ramp-up, all four graders work simultaneously on different exams.

[Figure: staggered schedule of 2m blocks; the Q #1, 2 row starts first, and each following row (Q #3, 4 / Q #5, 6 / Q #7, 8) starts one exam later.]

Total: 2,006+ m
PIPELINE STALL

If one grader falls behind, downstream graders run out of exams and sit idle until the pipeline refills: a stall.

[Figure: the same staggered schedule, but with fewer completed 2m blocks in the same amount of time because of idle gaps.]

Total: 2,006+ m (the "+" grows with every stall)
GRADING EXAMPLE SUMMARY
It’s critical to understand the problem before trying to parallelize it
▪ Can the work be done in an arbitrary order, or must it be done in sequential order?
▪ Does each task take the same amount of time to complete? If not, it may be
necessary to “load balance.”
In our example, the only restriction is that a single question be graded by a single
grader, so we could divide the work easily, but had to communicate periodically.
▪ This case study is an example of task-based parallelism. Each grader is assigned a
task like “Grade questions 1 & 2 on the first 500 tests”
▪ If instead each question could be graded by different graders, then we could have
data parallelism: all graders work on Q1 of different tests at the same time, then Q2, etc.
AMDAHL'S LAW

AMDAHL'S LAW
Serialization Limits Performance

▪ Amdahl's law is the observation that the speed-up gained from parallelizing code is limited by the remaining serial part.
▪ Any remaining serial code will reduce the possible speed-up
▪ This is why it's important to focus on parallelizing the most time consuming parts, not just the easiest.
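For reference (the formula is implied by the examples on the next slide but not written out): if a fraction $p$ of the runtime can be parallelized across $s$ workers, Amdahl's law gives the overall speed-up as

$$S(s) = \frac{1}{(1 - p) + p/s}, \qquad \max_s S(s) = \lim_{s \to \infty} S(s) = \frac{1}{1 - p}$$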
APPLYING AMDAHL'S LAW
Estimating Potential Speed-up

▪ What's the maximum speed-up that can be obtained by parallelizing 50% of the code?
  1 / (100% - 50%) = 1 / (1.0 - 0.50) = 2.0X

▪ What's the maximum speed-up that can be obtained by parallelizing 25% of the code?
  1 / (100% - 25%) = 1 / (1.0 - 0.25) = 1.3X

▪ What's the maximum speed-up that can be obtained by parallelizing 90% of the code?
  1 / (100% - 90%) = 1 / (1.0 - 0.90) = 10.0X

[Figure: bar chart comparing total serial runtime against total parallel runtime when 50%, 25%, and 90% of the code is parallelized.]
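A minimal sketch in C (not from the slides; the function name amdahl_max_speedup is hypothetical) that reproduces these three estimates:

#include <stdio.h>

/* Maximum speed-up when a fraction p of the runtime is parallelized
   and the parallel part is assumed to take zero time (infinite workers). */
static double amdahl_max_speedup(double p) {
    return 1.0 / (1.0 - p);
}

int main(void) {
    double fractions[] = {0.50, 0.25, 0.90};
    for (int i = 0; i < 3; i++)
        printf("parallelize %.0f%% -> max speed-up %.1fX\n",
               fractions[i] * 100.0, amdahl_max_speedup(fractions[i]));
    return 0;  /* prints 2.0X, 1.3X, and 10.0X */
}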
INTRODUCTION TO OPENACC

OpenACC is a directives-based programming approach to parallel computing designed for performance and portability on CPUs and GPUs for HPC.

Add Simple Compiler Directive:

main()
{
  <serial code>
  #pragma acc kernels
  {
    <parallel code>
  }
}


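As a usage note (an assumption, not stated on this slide): with the NVIDIA HPC SDK compilers, OpenACC directives are typically enabled with the -acc flag, and -Minfo=accel asks the compiler to report which regions it parallelized, e.g.:

nvc -acc -Minfo=accel main.c
nvfortran -acc -Minfo=accel main.f90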
3 WAYS TO ACCELERATE APPLICATIONS

Applications can be accelerated in three ways:

▪ Libraries: easy to use, most performance
▪ Compiler Directives (OpenACC): easy to use, portable code
▪ Programming Languages: most performance, most flexibility
OPENACC PORTABILITY
Describing a generic parallel machine

▪ OpenACC is designed to be portable to many existing and future parallel platforms
▪ The programmer need not think about specific hardware details, but rather express the parallelism in generic terms
▪ An OpenACC program runs on a host (typically a CPU) that manages one or more parallel devices (GPUs, etc.). The host and device(s) are logically thought of as having separate memories.

[Figure: a host with host memory connected to a device with device memory.]
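Because host and device memories are logically separate, OpenACC provides data directives for moving arrays between them. A minimal sketch (not from the slides; the array name x is hypothetical):

#include <stdlib.h>

int main(void) {
    int n = 1 << 20;
    float *x = malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) x[i] = 1.0f;

    /* copy(x[0:n]) copies x to device memory on entry to the region
       and copies it back to host memory on exit. */
    #pragma acc data copy(x[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            x[i] = 2.0f * x[i];
    }

    free(x);
    return 0;
}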
OPENACC
Three major strengths

Incremental ● Single Source ● Low Learning Curve
OPENACC
Incremental: Enhance Sequential Code

▪ Maintain existing sequential code
▪ Add annotations to expose parallelism
▪ After verifying correctness, annotate more of the code

The workflow: begin with a working sequential code; parallelize it with OpenACC; rerun the code to verify correct behavior, removing or altering OpenACC code as needed.

#pragma acc parallel loop
for( i = 0; i < N; i++ )
{
  < loop code >
}
OPENACC
Single Source

▪ Rebuild the same code on multiple architectures
▪ Compiler determines how to parallelize for the desired machine
▪ Sequential code is maintained

The compiler can ignore your OpenACC code additions, so the same code can be used for parallel or sequential execution:

int main(){
  ...
  #pragma acc parallel loop
  for(int i = 0; i < N; i++)
    < loop code >
}

Supported platforms include POWER, Sunway, x86 CPU, x86 Xeon Phi, NVIDIA GPU, and PEZY-SC.
OPENACC
Low Learning Curve

▪ OpenACC is meant to be easy to use, and easy to learn
▪ Programmer remains in familiar C, C++, or Fortran
▪ No reason to learn low-level details of the parallel hardware

The programmer gives hints to the compiler about which parts of the code to parallelize. The compiler then generates parallelism for the target hardware:

int main(){

  <sequential code>

  #pragma acc kernels  /* compiler hint */
  {
    <parallel code>
  }
}
OPENACC
Three major strengths, recapped

▪ Incremental: maintain existing sequential code; add annotations to expose parallelism; after verifying correctness, annotate more of the code
▪ Single Source: rebuild the same code on multiple architectures; the compiler determines how to parallelize for the desired machine; sequential code is maintained
▪ Low Learning Curve: OpenACC is meant to be easy to use and easy to learn; the programmer remains in familiar C, C++, or Fortran, with no reason to learn low-level details of the hardware
GAUSSIAN 16 (Mike Frisch, Ph.D., President and CEO, Gaussian, Inc.):
"Using OpenACC allowed us to continue development of our fundamental algorithms and software capabilities simultaneously with the GPU-related work. In the end, we could use the same code base for SMP, cluster/network and GPU parallelism. PGI's compilers were essential to the success of our efforts."

ANSYS FLUENT (Sunil Sathe, Lead Software Developer, ANSYS Fluent):
"We've effectively used OpenACC for heterogeneous computing in ANSYS Fluent with impressive performance. We're now applying this work to more of our models and new platforms."

VASP (Prof. Georg Kresse, Computational Materials Physics, University of Vienna):
"For VASP, OpenACC is the way forward for GPU acceleration. Performance is similar and in some cases better than CUDA C, and OpenACC dramatically decreases GPU development and maintenance efforts. We're excited to collaborate with NVIDIA and PGI as an early adopter of CUDA Unified Memory."
OPENACC SUCCESSES

▪ LSDalton (Quantum Chemistry, Aarhus University): 12X speedup in 1 week
▪ PowerGrid (Medical Imaging, University of Illinois): 40 days to 2 hours
▪ COSMO (Weather and Climate, MeteoSwiss, CSCS): 40X speedup, 3X energy efficiency
▪ INCOMP3D (CFD, NC State University): 4X speedup
▪ NekCEM (Comp Electromagnetics, Argonne National Lab): 2.5X speedup, 60% less energy
▪ MAESTRO and CASTRO (Astrophysics, Stony Brook University): 4.4X speedup, 4 weeks effort
▪ CloverLeaf (Comp Hydrodynamics, AWE): 4X speedup, single CPU/GPU code
▪ FINE/Turbo (CFD, NUMECA International): 10X faster routines, 2X faster app
OPENACC RESOURCES
Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow

▪ Resources: https://www.openacc.org/resources
▪ Success Stories: https://www.openacc.org/success-stories
▪ FREE Compilers and Tools: https://www.openacc.org/tools
▪ Events: https://www.openacc.org/events
▪ Community Slack: https://www.openacc.org/community#slack
EXPRESSING PARALLELISM WITH OPENACC

CODING WITH OPENACC
Array pairing example (C)

void pairing(int *input, int *output, int N){
  for(int i = 0; i < N; i++)
    output[i] = input[i*2] + input[i*2+1];
}

input:  6 3 10 7 2 4 3 8 9 2 0 1
output: 9 17 6 11 11 1
CODING WITH OPENACC
Array pairing example (Fortran)

subroutine pairing(input, output, N)
  do i = 1, N
    output(i) = input(i*2-1) + input(i*2)  ! pair i is elements (2i-1, 2i) with 1-based indexing
  end do
end subroutine

input:  6 3 10 7 2 4 3 8 9 2 0 1
output: 9 17 6 11 11 1
CODING WITH OPENACC
Array pairing example - parallel (C)

void pairing(int *input, int *output, int N){
  #pragma acc parallel loop
  for(int i = 0; i < N; i++)
    output[i] = input[i*2] + input[i*2+1];
}

input:  6 3 10 7 2 4 3 8 9 2 0 1
output: 9 17 6 11 11 1
CODING WITH OPENACC
Array pairing example - parallel (Fortran)

subroutine pairing(input, output, N)
  !$acc parallel loop
  do i = 1, N
    output(i) = input(i*2-1) + input(i*2)  ! pair i is elements (2i-1, 2i) with 1-based indexing
  end do
end subroutine

input:  6 3 10 7 2 4 3 8 9 2 0 1
output: 9 17 6 11 11 1
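A minimal driver (not from the slides) showing how the C version of pairing might be called; note that input must hold 2*N elements:

#include <stdio.h>

void pairing(int *input, int *output, int N){
    #pragma acc parallel loop
    for(int i = 0; i < N; i++)
        output[i] = input[i*2] + input[i*2+1];
}

int main(void) {
    int input[12] = {6, 3, 10, 7, 2, 4, 3, 8, 9, 2, 0, 1};
    int output[6];

    pairing(input, output, 6);  /* N = 6 pairs, input holds 2*N = 12 values */

    for (int i = 0; i < 6; i++)
        printf("%d ", output[i]);  /* prints: 9 17 6 11 11 1 */
    printf("\n");
    return 0;
}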
DATA DEPENDENCIES
Not all loops are parallel

void pairing(int *a, int N){
  for(int i = 1; i < N; i++)
    a[i] = a[i] + a[i-1];  /* running (prefix) sum */
}

[Figure: a running sum of a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Executed sequentially (i=1 through i=9), the array becomes 1, 3, 6, 10, 15, 21, 28, 36, 45, 55; each iteration depends on the result of the previous one.]
DATA DEPENDENCIES
Not all loops are parallel

void pairing(int *a, int N){
  #pragma acc parallel loop
  for(int i = 1; i < N; i++)
    a[i] = a[i] + a[i-1];
}

If we attempted to parallelize this loop we would get wrong answers due to a forward dependency.

[Figure: sequential execution yields 1, 3, 6, 10, 15, 21, 28, 36, 45, 55; the parallel execution shown produces a correct result only for the first updated element, with the remaining eight wrong.]
DATA DEPENDENCIES
Not all loops are parallel

void pairing(int *a, int N){
  #pragma acc parallel loop
  for(int i = 1; i < N; i++)
    a[i] = a[i] + a[i-1];
}

Even changing how the iterations are parallelized will not make this loop safe to parallelize.

[Figure: with a different iteration schedule, the first five updates (1, 3, 6, 10, 15, 21) happen to match the sequential results, but the last four elements come out as 13, 21, 30, 40 instead of 28, 36, 45, 55.]
DATA DEPENDENCIES
Not all loops are parallel

subroutine pairing(a, N)
  do i = 2, N  ! start at 2 so a(i-1) stays in bounds for a 1-based array
    a(i) = a(i) + a(i-1)
  end do
end subroutine

[Figure: as in the C version, the sequential running sum of a = (1, 2, ..., 10) is 1, 3, 6, 10, 15, 21, 28, 36, 45, 55.]
DATA DEPENDENCIES
Not all loops are parallel

subroutine pairing(a, N)
  !$acc parallel loop
  do i = 2, N
    a(i) = a(i) + a(i-1)
  end do
end subroutine

If we attempted to parallelize this loop we would get wrong answers due to a forward dependency.

[Figure: same behavior as the C version; the parallel execution shown produces a correct result only for the first updated element.]
DATA DEPENDENCIES
Not all loops are parallel

subroutine pairing(a, N)
  !$acc parallel loop
  do i = 2, N
    a(i) = a(i) + a(i-1)
  end do
end subroutine

Even changing how the iterations are parallelized will not make this loop safe to parallelize.

[Figure: same behavior as the C version; a different iteration schedule still leaves the last four elements wrong (13, 21, 30, 40 instead of 28, 36, 45, 55).]
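By contrast, some dependency patterns can be expressed in a way OpenACC parallelizes safely. A minimal sketch (not from the slides): a sum over an array carries a dependency through the variable sum, but declaring it as a reduction lets the compiler combine per-worker partial sums correctly:

#include <stdio.h>

int main(void) {
    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int sum = 0;

    /* The reduction clause tells the compiler each worker may keep a
       private partial sum, to be combined at the end of the loop. */
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 0; i < 10; i++)
        sum += a[i];

    printf("sum = %d\n", sum);  /* prints: sum = 55 */
    return 0;
}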
MODULE 1 REVIEW

CLOSING SUMMARY
Module One: Introduction

▪ Parallel programming is a technique of utilizing modern hardware to do lots of work all at once.
▪ Amdahl's law is the gravity of parallel programming: break this law at your own peril.
▪ Not all loops are parallel, but many can be rewritten to be parallelizable.
▪ OpenACC is a high-level model for generating parallel code from serial loops.
THANK YOU
