Parallel Architecture and Execution
Lecture 2
January 8, 2025
Parallel Application Performance
Depends on
• Parallel architecture
• Algorithm
Saturation – Example 1
Saturation – Example 2
Saturation – Example 3
(Figures: saturation plots; x-axis: problem size)
Execution Profile
Scaling Deep Learning Models
(Figure: scaling with problem size and number of PEs)
Source: "Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes"
• Speedup
• Efficiency
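These follow the standard definitions (stated here for reference, with T1 the serial execution time and TP the time on P processing elements), consistent with the formulas on the slides that follow:

$$S = \frac{T_1}{T_P}, \qquad E = \frac{S}{P} = \frac{T_1}{P\,T_P}$$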
Sum of Numbers Speedup
Parallel Sum (Optimized)
$$S_1 = \frac{N}{\frac{N}{P} + 2P}, \qquad E_1 = \;?$$

$$S_2 = \frac{N}{\frac{N}{P} + 2\log P}, \qquad E_2 = \frac{1}{1 + \frac{2P\log P}{N}}$$
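Reading the denominators as parallel run times (an inference; the slide's accompanying figure did not survive extraction): each process first adds its N/P local numbers; the partial sums are then combined either one at a time at a single process, about P communicate-and-add steps (the 2P term), or pairwise along a binary tree, log P such steps (the 2 log P term):

$$T_P^{\text{gather}} = \frac{N}{P} + 2P, \qquad T_P^{\text{tree}} = \frac{N}{P} + 2\log P, \qquad S = \frac{T_1}{T_P} = \frac{N}{T_P}$$

E2 then follows as S2 / P.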
Efficiency (Adding numbers)
Homework: Analyze the measured efficiency based on your derivation
Amdahl's Law

$$S = \frac{1}{(1 - f) + \frac{f}{P}}$$

where f is the fraction of the computation that can be parallelized and P is the number of processing elements.
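A quick worked instance (numbers mine, not the slide's): with f = 0.9 and P = 10,

$$S = \frac{1}{0.1 + 0.09} \approx 5.3,$$

and even as P → ∞ the speedup is bounded by 1/(1 − f) = 10.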
Parallel Architecture
System Components
• Processor
• Memory
• Network
• Storage
NUMA
Source: https://www.sciencedirect.com/topics/computer-science/non-uniform-memory-access
Memory Hierarchy
NUMA Nodes
Utility: lstopo (hwloc package)
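Two quick ways to inspect the NUMA layout of a Linux machine (assuming the hwloc and numactl packages are installed):

lstopo                # renders the machine topology: packages, NUMA nodes, caches, cores
numactl --hardware    # lists NUMA nodes, their CPUs, and per-node total/free memory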
Memory Placement
Lepers et al., “Thread and Memory Placement on NUMA Systems: Asymmetry Matters”, USENIX ATC 2015.
Connect Multiple Compute Nodes
Interconnect
Source: hector.ac.uk
Parallel Programming Models
• Shared memory
• Distributed memory
Shared Memory
• Shared address space
• Time to access some memory words is longer than others (NUMA)
• Need to worry about concurrent access
• Programming paradigms – Pthreads, OpenMP
(Figure: Thread 0 and Thread 1 sharing one address space)
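A minimal OpenMP sketch of this model (example mine; the slide only names the paradigms). Both threads update the same counter through the shared address space, so the increment must be protected against concurrent access:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int counter = 0;                  /* shared: a single copy visible to all threads */
    #pragma omp parallel num_threads(2)
    {
        #pragma omp atomic            /* guard the concurrent update */
        counter++;
    }
    printf("final counter = %d\n", counter);   /* always 2, thanks to the atomic */
    return 0;
}

Compile with gcc -fopenmp; without the atomic, the two increments could race and lose an update.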
Intel Processors (Latest)
https://www.intel.com/content/www/us/en/products/docs/processors/xeon/5th-gen-xeon-scalable-processors.html
Cluster of Compute Nodes
Message Passing
• Distinct address space per process
• Multiple processing nodes
• Basic operations are send and receive
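A sketch of the two basic operations in C with MPI (example mine, anticipating the MPI material below): process 0 sends an integer that exists only in its own address space to process 1.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;                                        /* only rank 0 holds 42 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}

Run with at least two processes (e.g. mpirun -np 2 ./program.x), or the send has no matching receiver.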
Interprocess Communication
Our Parallel World
(Figure: each process runs on a core with its own memory)

Process 0        Process 1
x = 1, y = 2     x = 10, y = 20
...              ...
x++              x++
...              ...
print x, y       print x, y
2, 2             11, 20
Distinct Process Address Space
Process 0        Process 1
x = 1, y = 2     x = 1, y = 2
...              ...
x++; y++         y++
...              ...
print x, y       print x, y
2, 3             1, 3
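A runnable MPI version of this example (construction mine, mapping the two processes to ranks 0 and 1):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    int x = 1, y = 2;                 /* each process owns a private x and y */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) { x++; y++; }      /* modifies rank 0's copies only */
    else if (rank == 1) { y++; }
    printf("process %d: x = %d, y = %d\n", rank, x, y);   /* prints 2, 3 and 1, 3 */
    MPI_Finalize();
    return 0;
}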
Adapted from Neha Karanjkar’s slides
Our Parallel World
(Figure: cores and processes)
NO centralized server/master
Message Passing Interface (MPI)
MPI Installation – Laptop
• Linux or Linux VM on Windows
• apt/snap/yum/brew
• Windows
• No support
• https://www.mpich.org/documentation/guides/
MPI
• Standard for message passing
• Explicit communications
• Medium programming complexity
• Requires specifying the communication scope (communicators)
Simple MPI Code
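The slide's listing did not survive extraction; a minimal hello-world consistent with the compile/run steps and the output slide below would be:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);                    /* enter the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of processes */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();                            /* leave the MPI environment */
    return 0;
}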
MPI Code Execution Steps
• Compile
• mpicc -o program.x program.c
• Execute
• mpirun -np 1 ./program.x (mpiexec -np 1 ./program.x)
• Runs 1 process on the launch/login node
• mpirun -np 6 ./program.x
• Runs 6 processes on the launch/login node
Output – Hello World
mpirun -np 20 ./program.x
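This prints one line per process, e.g. "Hello from process 3 of 20" (given a hello-world like the sketch above); the ordering of the 20 lines varies across runs because the processes print concurrently.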