
High Performance Computing

(HPC)
Lecture 3

By: Dr. Maha Dessokey


Programming with Shared Memory
(OpenMP)
Parallel Computer Memory Architectures

 Shared Memory
All processors access all memory as a single global address space.
Data sharing is fast.
Main disadvantage: lack of scalability between memory and CPUs, since adding more CPUs increases traffic on the shared memory path.
Multithreading vs. Multiprocessing

 Threads: share the same process memory space and global variables between routines.
 Processes: "heavyweight", each a completely separate program with its own variables, stack, and memory allocation.
Programming with Shared Memory

The most popular shared-memory multithreading APIs are:

 POSIX Threads (Pthreads)
 OpenMP
Agenda

 Introduction to OpenMP
 Creating Threads
 Synchronization
 Parallel Loops
What is OpenMP?

 OpenMP: An API for Writing Multithreaded Applications


 “Standard” API for defining multi-threaded shared-memory
programs
 Set of compiler directives and library routines for parallel
application programmers
 Greatly simplifies writing multi-threaded (MT) programs in Fortran, C, and C++
OpenMP Solution Stack
A Programmer’s View of OpenMP

 OpenMP will:
 Allow a programmer to separate a program into serial regions and parallel
regions, rather than concurrently-executing threads.
 Hide stack management
 Provide synchronization constructs
 OpenMP will not:
 Parallelize automatically
 Guarantee speedup
 Provide freedom from data races
 race condition: when the program’s outcome changes as
the threads are scheduled differently
An instance of a program

 Threads interact through reads/writes


to a shared address space.
 OS scheduler decides when to run
which threads … interleaved for
fairness.
 Synchronization is used to ensure that every legal ordering of operations produces correct results.
OpenMP core syntax

 Most of the constructs in OpenMP are compiler directives.


Example: #pragma omp parallel num_threads(4), where omp is an OpenMP keyword.
 Function prototypes and types are in the header file:
#include <omp.h>
 OpenMP constructs apply to a “structured block”.
 Structured block: a block of one or more statements with one point of entry at
the top and one point of exit at the bottom.
 It’s OK to have an exit() within the structured block.
 A non-structured block lacks clear control flow and can lead to "spaghetti code,"
where the logic is tangled and difficult to follow. This often includes the use of
GOTO statements or deeply nested control structures.
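For illustration, here is a minimal sketch (not from the slides) of a directive applied to a structured block:

#include <stdlib.h>
#include <omp.h>

void structured_block_demo(void)
{
    /* The pragma applies to the structured block below: one point of entry
       at the top and one point of exit at the bottom. */
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();
        if (id < 0)
            exit(1);    /* calling exit() inside the block is allowed */
        /* jumping out of the block with goto, break, or return is not allowed,
           since that would make the block non-structured */
    }
}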
A multi-threaded “Hello world” program

 Write a multithreaded program where each thread prints “hello world”.
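One possible version, as a minimal sketch (the slide's own code is not reproduced in this text):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel                      /* fork a team of threads */
    {
        int id = omp_get_thread_num();        /* each thread gets its own ID */
        printf("hello world from thread %d\n", id);
    }                                         /* threads join back here */
    return 0;
}

With GCC, for example, this can be built with gcc -fopenmp hello.c.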
How do threads interact?

 OpenMP is a multi-threading, shared address model.


 Threads communicate by sharing variables.
 Unintended sharing of data causes race conditions:
 Race Condition: when the program’s outcome changes as the threads are
scheduled differently.
 To control race conditions, use synchronization to protect data conflicts.
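For instance, a minimal sketch (not taken from the slides) of protecting a shared update with the critical construct introduced later in this lecture:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int count = 0;                  /* shared by all threads */

    #pragma omp parallel
    {
        /* An unprotected "count = count + 1;" here would be a race: the final
           value would depend on how the threads are scheduled. */
        #pragma omp critical        /* synchronization protects the shared update */
        count = count + 1;
    }

    printf("count = %d\n", count);
    return 0;
}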
OpenMP Programming Model

 Fork-Join Model:
 Master thread spawns a team
of threads as needed.
 Parallelism added
incrementally until
performance goals are met:
i.e. the sequential program
evolves into a parallel
program.
Thread Creation: Parallel Regions

 You create threads in OpenMP with the parallel construct.
 For example, to create a 4-thread parallel region:

Each thread calls pooh(ID, A) for ID = 0 to 3.
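A sketch of what such a region might look like; pooh is the routine named above and is only declared here, since its body is not shown:

#include <omp.h>

void pooh(int ID, double *A);          /* declaration only; body not shown on the slide */

void example(void)
{
    double A[1000];                    /* shared data passed to every thread */
    omp_set_num_threads(4);            /* request a team of 4 threads */
    #pragma omp parallel
    {
        int ID = omp_get_thread_num(); /* ID = 0, 1, 2, 3 */
        pooh(ID, A);                   /* each thread calls pooh with its own ID */
    }                                  /* threads wait here until all have finished */
}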


Example- Numerical Integration

 Mathematically, we know that:

    ∫₀¹ 4/(1+x²) dx = π

 We can approximate the integral as a sum of rectangles:

    Σᵢ F(xᵢ) ∆x ≈ π

 where each rectangle has width ∆x and height F(xᵢ) at the middle of interval i.
Serial PI Program

static long num_steps = 100000;
double step;

int main ()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i=0; i<num_steps; i++)
    {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
}
A simple Parallel pi program

 To create a parallel version of the pi program, pay close attention to shared versus private variables.
 We will need the runtime library routines:

    int omp_get_num_threads();   // number of threads in the team
    int omp_get_thread_num();    // thread ID (rank)
    double omp_get_wtime();      // time in seconds since a fixed point in the past
A simple Parallel pi program

#include <omp.h>
static long num_steps = 100000;
double step;
#define NUM_THREADS 2

void main ()
{
    int i, nthreads; double pi, sum[NUM_THREADS];
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i, id, nthrds; double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;
        for (i=id, sum[id]=0.0; i<num_steps; i=i+nthrds) {
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    } // End of parallel region
    for (i=0, pi=0.0; i<nthreads; i++)
        pi += sum[i] * step;
}
How to calculate the runtime?

#include <stdio.h>
#include <omp.h>
static long num_steps = 100000;
double step;
#define NUM_THREADS 2

void main ()
{
    int i, nthreads; double pi, sum[NUM_THREADS];
    double runtime;
    runtime = omp_get_wtime();
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i, id, nthrds; double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;
        for (i=id, sum[id]=0.0; i<num_steps; i=i+nthrds) {
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    } // End of parallel region
    for (i=0, pi=0.0; i<nthreads; i++)
        pi += sum[i] * step;
    runtime = omp_get_wtime() - runtime;
    printf("In %lf seconds, pi = %lf\n", runtime, pi);
}
Algorithm strategy

 The SPMD (Single Program Multiple Data) design pattern
 Run the same program on P processing elements, where P can be arbitrarily large.
 Use the rank (an ID ranging from 0 to P-1) to select between a set of tasks and to manage any shared data structures.
 This pattern is very general and has been used to support most (if
not all) the algorithm strategy patterns.
 MPI programs almost always use this pattern
 it is probably the most commonly used pattern in the history of
parallel programming.
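A minimal OpenMP sketch of the pattern; process_item() and N are hypothetical placeholders:

#include <omp.h>

#define N 1000
void process_item(int i);                  /* hypothetical per-item work, not defined here */

void spmd_example(void)
{
    #pragma omp parallel
    {
        int rank = omp_get_thread_num();   /* ID ranging from 0 to P-1 */
        int P    = omp_get_num_threads();
        /* Every thread runs the same code; the rank selects its share of the
           work (a cyclic distribution, as in the pi program above). */
        for (int i = rank; i < N; i += P)
            process_item(i);
    }
}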
Synchronization

 Synchronization: bringing one or more threads to a well


defined and known point in their execution.
 Synchronization is used to impose order constraints and to protect access to shared data.
 The two most common forms of synchronization are:
 Barrier: each thread waits at the barrier until all threads arrive.
 Mutual exclusion: define a block of code that only one thread at a time can execute.
Synchronization: Barrier

Barrier: Each thread waits until all threads arrive.

#pragma omp parallel
{
    int id = omp_get_thread_num();
    A[id] = big_calc1(id);
    #pragma omp barrier          // B[] will not be calculated until all threads complete the A[] calculations
    B[id] = big_calc2(id, A);
}
Synchronization: Mutual exclusion

Mutual exclusion: Only one thread at a time can enter the critical region.

float res;
#pragma omp parallel
{
    float B; int i, id, nthrds;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    for (i=id; i<niters; i+=nthrds) {
        B = big_job(i);
        #pragma omp critical     // threads wait their turn: only one at a time calls consume()
        res += consume(B);
    }
}
Synchronization: Atomic

Atomic: provides mutual exclusion, but it only applies to the update of a memory location (the update of X in the following example).

#pragma omp parallel
{
    double tmp, B;
    B = DOIT();
    tmp = big_ugly(B);
    #pragma omp atomic           // atomic only protects the read/update of X
    X += big_ugly(B);
}
SPMD vs. worksharing

 A parallel construct by itself creates an SPMD or “Single Program


Multiple Data” program … i.e., each thread redundantly executes
the same code.
 How do you split up pathways through the code between threads
within a team?
 This is called worksharing:
 Loop construct
 Sections/section constructs
 Single construct
 Task construct
(only the loop construct is covered in this lecture; the other worksharing constructs are out of our scope)
The loop worksharing Constructs

The loop worksharing construct splits up loop iterations among the threads in a team.

#pragma omp parallel
{
    #pragma omp for
    for (I=0; I<N; I++)
    {
        NEAT_STUFF(I);       // the loop variable I is made "private" to each thread by default
    }
}
The loop worksharing Constructs

Sequential code:

    for (i=0; i<N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region (block distribution of the loop iterations):

    #pragma omp parallel
    {
        int id, i, Nthrds, Step, istart, iend;
        id = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        Step = N / Nthrds;
        istart = id * Step;
        iend = (id+1) * Step;
        if (id == Nthrds-1) iend = N;   // last thread takes the remainder
        for (i=istart; i<iend; i++)
        {
            a[i] = a[i] + b[i];
        }
    }
The loop worksharing Constructs

Sequential code:

    for (i=0; i<N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region and a worksharing for construct:

    #pragma omp parallel
    #pragma omp for
    for (i=0; i<N; i++) { a[i] = a[i] + b[i]; }
Combined parallel/worksharing construct

 OpenMP shortcut: put the “parallel” and the worksharing directives on the same line, as in the sketch below.
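A minimal sketch of the shortcut, equivalent to the separate parallel and for directives on the previous slide (i, N, a, and b are assumed to be declared as before):

    #pragma omp parallel for
    for (i=0; i<N; i++) { a[i] = a[i] + b[i]; }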
loop worksharing constructs:
The schedule clause

 The schedule clause affects how loop iterations are mapped onto threads
 schedule(static [,chunk])
 Deal-out blocks of iterations of size “chunk” to each thread.
 schedule(dynamic[,chunk])
 Each thread grabs “chunk” iterations off a queue until all iterations have been handled.
 schedule(guided[,chunk])
 Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down
to size “chunk” as the calculation proceeds.
 schedule(runtime)
 Schedule and chunk size taken from the OMP_SCHEDULE environment variable (or the runtime
library).
 schedule(auto) – Schedule is left up to the runtime to choose (does not have to be any of
the above)
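A brief sketch of the clause in use; the loop bounds and work() are placeholders:

    /* iterations divided into chunks of 8 and assigned to the threads in round-robin order */
    #pragma omp parallel for schedule(static, 8)
    for (i = 0; i < N; i++)
        a[i] = work(i);

    /* each thread grabs the next chunk of 8 iterations off a queue as it becomes free */
    #pragma omp parallel for schedule(dynamic, 8)
    for (i = 0; i < N; i++)
        a[i] = work(i);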
Assignment 1: Parallel Matrix Addition Using OpenMP

 You will implement a parallel program to perform matrix addition using OpenMP.
 This exercise will help you understand how to utilize parallel computing to enhance performance.
 The assignment is due in two weeks.
