OpenACC 1
MODULE OVERVIEW
Topics to be covered
3 Steps
A REAL WORLD CASE STUDY
Modern cancer research
▪ The Russian Academy of Sciences created a program to
simulate light propagation through human tissue
▪ The program helped detect cancerous cells more
accurately by simulating billions of random paths that
light could take through human tissue
▪ With parallel programming, they were able to run
thousands of these paths simultaneously
▪ The sequential program took 2.5 hours to run
▪ The parallel version took less than 2 minutes
SEQUENTIAL SOLUTION
8000+ m
PARALLEL SOLUTION
Grader (questions)   Step 1 (500m)    Step 2 (500m)    Step 3 (500m)    Step 4 (500m)
1 (Q #1, 2)          Exams 1-250      Exams 751-1000   Exams 501-750    Exams 251-500
2 (Q #3, 4)          Exams 251-500    Exams 1-250      Exams 751-1000   Exams 501-750
3 (Q #5, 6)          Exams 501-750    Exams 251-500    Exams 1-250      Exams 751-1000
4 (Q #7, 8)          Exams 751-1000   Exams 501-750    Exams 251-500    Exams 1-250
2000+ m
PIPELINE
2006+ m
PIPELINE STALL
2006+ m
GRADING EXAMPLE SUMMARY
It’s critical to understand the problem before trying to parallelize it
▪ Can the work be done in an arbitrary order, or must it be done in sequential order?
▪ Does each task take the same amount of time to complete? If not, it may be
necessary to “load balance.”
In our example, the only restriction is that a single question be graded by a single
grader, so we could divide the work easily, but had to communicate periodically.
▪ This case study is an example of task-based parallelism. Each grader is assigned a
task like “Grade questions 1 & 2 on the first 500 tests”
▪ If instead each question could be graded by different graders, then we could have
data parallelism: all graders work on Q1 of the following tests, then Q2, etc.
AMDAHL’S LAW
Serialization Limits Performance
Applications
Libraries ● Compiler Directives ● Programming Languages
▪ Maintain existing
sequential code
▪ Add annotations to
expose parallelism
▪ After verifying
correctness, annotate
more of the code
OPENACC
int main(){
    <sequential code>
    #pragma acc kernels
    {
        <parallel code>
    }
}
Compiler Hint: The programmer will give hints to the compiler about which
parts of the code to parallelize. The compiler will then generate parallelism
for the target parallel hardware.
▪ OpenACC is meant to be easy to use, and easy to learn
▪ Programmer remains in familiar C, C++, or Fortran
▪ No reason to learn low-level details of the hardware.
OPENACC
Application       Domain                  Organization             Results
NekCEM            Comp Electromagnetics   Argonne National Lab     2.5X speedup, 60% less energy
MAESTRO, CASTRO   Astrophysics            Stony Brook University   4.4X speedup, 4 weeks effort
CloverLeaf        Comp Hydrodynamics      AWE                      4X speedup, Single CPU/GPU code
FINE/Turbo        CFD                     NUMECA International     10X faster routines, 2X faster app
OPENACC RESOURCES
Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow
FREE
Compilers
https://www.openacc.org/community#slack
EXPRESSING PARALLELISM WITH OPENACC
CODING WITH OPENACC
Array pairing example
void pairing(int *input, int *output, int N){
    for(int i = 0; i < N; i++)
        output[i] = input[i*2] + input[i*2+1];
}
input:  6 3 10 7 2 4 3 8 9 2 0 1
output: 9 17 6 11 11 1
CODING WITH OPENACC
Array pairing example
subroutine pairing(input, output, N)
    integer :: N, i
    integer :: input(2*N), output(N)
    do i=1,N
        output(i) = input(i*2-1) + input(i*2)
    end do
end subroutine
input:  6 3 10 7 2 4 3 8 9 2 0 1
output: 9 17 6 11 11 1
CODING WITH OPENACC
Array pairing example - parallel
void pairing(int *input, int *output, int N){
    #pragma acc parallel loop
    for(int i = 0; i < N; i++)
        output[i] = input[i*2] + input[i*2+1];
}
input:  6 3 10 7 2 4 3 8 9 2 0 1
output: 9 17 6 11 11 1
CODING WITH OPENACC
Array pairing example - parallel
subroutine pairing(input, output, N)
    integer :: N, i
    integer :: input(2*N), output(N)
    !$acc parallel loop
    do i=1,N
        output(i) = input(i*2-1) + input(i*2)
    end do
end subroutine
input:  6 3 10 7 2 4 3 8 9 2 0 1
output: 9 17 6 11 11 1
DATA DEPENDENCIES
Not all loops are parallel
void pairing(int *a, int N){
    for(int i = 1; i < N; i++)
        a[i] = a[i] + a[i-1];
}
Initial a:      1   2   3   4   5   6   7   8   9  10
Iteration:        i=1 i=2 i=3 i=4 i=5 i=6 i=7 i=8 i=9
Sequential:     1   3   6  10  15  21  28  36  45  55
Parallel:       1   3   5   9   9  15  13  21  17  27
Correct:            ✓   ⌧   ⌧   ⌧   ⌧   ⌧   ⌧   ⌧   ⌧
DATA DEPENDENCIES
Not all loops are parallel
void pairing(int *a, int N){
    #pragma acc parallel loop
    for(int i = 1; i < N; i++)
        a[i] = a[i] + a[i-1];
}
Even changing how the iterations are parallelized will not make this
loop safe to parallelize.
Initial a:      1   2   3   4   5   6   7   8   9  10
Iteration:        i=1 i=2 i=3 i=4 i=5 i=6 i=7 i=8 i=9
Sequential:     1   3   6  10  15  21  28  36  45  55
Parallel:       1   3   6  10  15  21  13  21  30  40
Correct:            ✓   ✓   ✓   ✓   ✓   ⌧   ⌧   ⌧   ⌧
DATA DEPENDENCIES
Not all loops are parallel
subroutine pairing(a, N)
    integer :: N, i
    integer :: a(N)
    do i = 2,N
        a(i) = a(i) + a(i-1)
    end do
end subroutine
DATA DEPENDENCIES
Not all loops are parallel
subroutine pairing(a, N)
    integer :: N, i
    integer :: a(N)
    !$acc parallel loop
    do i = 2,N
        a(i) = a(i) + a(i-1)
    end do
end subroutine
Even changing how the iterations are parallelized will not make this
loop safe to parallelize.
MODULE 1 REVIEW
CLOSING SUMMARY
Module One: Introduction