
Parallel Architecture and Execution

Lecture 2
January 8, 2025
Parallel Application Performance
Depends on
• Parallel architecture
• Algorithm

2
Saturation – Example 1

#Processes   Time (sec)   Speedup   Efficiency
    1          0.025        1          1.00
    2          0.013        1.9        0.95
    4          0.010        2.5        0.63
    8          0.009        2.8        0.35
   12          0.007        3.6        0.30
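
The speedup and efficiency columns in the table above follow from the usual definitions $S = T_1 / T_P$ and $E = S / P$; for example, for the $P = 4$ row:

$S = \dfrac{0.025}{0.010} = 2.5, \qquad E = \dfrac{2.5}{4} \approx 0.63$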

3
Saturation – Example 2

4
Saturation – Example 3

Source: GGKK Chapter 5
5


Efficiency (Adding numbers)

[Figure: efficiency vs. number of processing elements, for several problem sizes]

Source: GGKK Chapter 5
6


Limitations of Parallelization
• Overhead
  • E.g. communication
• Over-decomposition
  • Work per process/core
• Idling
  • Load imbalance
  • Synchronization
• Serialization

7
Execution Profile

Execution Profile of a Hypothetical Parallel Program


Source: GGKK Chapter 5
8
Performance
• Sequential
  • Input size
• Parallel
  • Input size
  • Number of processing elements (PEs)
  • Communication speed
  • …

9
Scaling Deep Learning Models
[Figure: scaling the problem size together with the number of PEs]

Source: "Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes"

"The fastest known sequential algorithm for a problem may be difficult or impossible to parallelize." – GGKK
10
Parallelization

• Speedup
• Efficiency
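
For reference, the definitions of the two quantities above (following GGKK Chapter 5), with serial time $T_S$ and parallel time $T_P$ on $P$ processing elements:

$S = \dfrac{T_S}{T_P}, \qquad E = \dfrac{S}{P}, \qquad T_o = P\,T_P - T_S \ \ \text{(total overhead)}$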

11
Sum of Numbers Speedup

Naïve parallelization method:
• Compute partial sums in parallel
• Send partial results to one process (all-to-one)
• Compute the final result

Speedup = $\dfrac{N}{N/P + P + P}$

12
Parallel Sum (Optimized)

Source: GGKK Chapter 5
13


Sum of Numbers Speedup (Optimized)

Naïve (all-to-one):
$S_1 = \dfrac{N}{N/P + 2P}$,   $E_1 = ?$

Optimized (tree-based reduction):
$S_2 = \dfrac{N}{N/P + 2\log P}$,   $E_2 = \dfrac{1}{1 + 2P\log P/N}$
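
A brief justification of $S_2$, under the unit-cost model the slides appear to assume (one time unit per addition and per word communicated): after the $N/P$ local additions, the tree-based reduction needs $\log P$ communication steps and $\log P$ addition steps, so

$T_P = \dfrac{N}{P} + \log P + \log P = \dfrac{N}{P} + 2\log P, \qquad S_2 = \dfrac{T_1}{T_P} = \dfrac{N}{N/P + 2\log P}$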

14
Efficiency (Adding numbers)
Homework: Analyze the measured efficiency based on your derivation

Source: GGKK Chapter 5
15


A Limitation of Parallel Computing
With $f$ the fraction of code that is parallelizable:

Speedup $S = \dfrac{1}{(1 - f) + f/P}$

Amdahl’s Law
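
A quick numeric check: with $f = 0.9$ and $P = 10$,

$S = \dfrac{1}{0.1 + 0.9/10} = \dfrac{1}{0.19} \approx 5.3$

and even with unlimited processors the speedup is bounded by $1/(1-f) = 10$.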

16
Parallel Architecture

17
System Components
• Processor
• Memory
• Network
• Storage
NUMA

Source: https://www.sciencedirect.com/topics/computer-science/non-uniform-memory-access
18
Memory Hierarchy

A multicore SMP architecture


Image Source: The Art of Multiprocessor Programming – Herlihy, Shavit
19
Memory Access Times

Source: MIT CSAIL


20
Processor vs. Memory

“While clock rates of high-end processors have increased at roughly 40% per year over the last decade, DRAM access times have only improved at the rate of roughly 10% per year over this interval.”

– Introduction to Parallel Computing, Ananth Grama et al. (GGKK)

21
NUMA Nodes
Utility: lstopo (hwloc package)

AMD Bulldozer Memory Topology (Source: Wikipedia)


22
NUMA Node (Zoomed)
Utility: lstopo (hwloc package)

AMD Bulldozer Memory Topology (Source: Wikipedia)
23


Effective Memory Access Times
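
The slide’s figure is not reproduced here. As a point of reference (an assumption, not taken from the slide), a simple two-level model of effective access time is

$T_{\text{eff}} = h \cdot t_{\text{cache}} + (1 - h) \cdot t_{\text{memory}}$

where $h$ is the cache hit rate.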

24
Memory Placement

Lepers et al., “Thread and Memory Placement on NUMA Systems: Asymmetry Matters”, USENIX ATC 2015.

25
Connect Multiple Compute Nodes
Interconnect

Source: hector.ac.uk
26
Parallel Programming Models
• Shared memory
• Distributed memory

27
Shared Memory
• Shared address space
• Time taken to access certain memory words is
longer (NUMA)
• Need to worry about concurrent access
• Programming paradigms – Pthreads, OpenMP

[Figure: Thread 0 and Thread 1 sharing a single address space]
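
A minimal sketch (not from the slides) of shared-memory concurrent access in OpenMP: two threads update one shared counter, and the atomic directive is what keeps the concurrent updates correct. The file name and iteration count are illustrative.

/* race.c — build with: gcc -fopenmp race.c */
#include <stdio.h>
#include <omp.h>

int main(void) {
    long counter = 0;                      /* shared: single address space */

    #pragma omp parallel num_threads(2)
    {
        for (int i = 0; i < 1000000; i++) {
            #pragma omp atomic             /* without this, the updates race */
            counter++;
        }
    }

    printf("counter = %ld\n", counter);    /* 2000000 with the atomic update */
    return 0;
}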

28
Intel Processors (Latest)

https://www.intel.com/content/www/us/en/products/docs/processors/xeon/5th-gen-
xeon-scalable-processors.html
29
Cluster of Compute Nodes

30
Message Passing
• Distinct address space per process
• Multiple processing nodes
• Basic operations are send and receive

31
Interprocess Communication

32
Our Parallel World
Core
Process
Memory

Distributed memory programming


• Distinct address space
• Explicit communication
33
Distinct Process Address Space
Process 0 Process 1

x = 1, y = 2 x = 10, y = 20
... ...
x++ x++
... ...
print x, y print x, y

2, 2 11, 20
34
Distinct Process Address Space
Process 0 Process 1

x = 1, y = 2 x = 1, y = 2
... ...
x++; y++ y++
... ...
print x, y print x, y

2, 3 1, 3
35
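
A minimal sketch (not from the slides) of distinct address spaces using plain OS processes: after fork(), each process updates its own copies of x and y, mirroring the second example above.

/* distinct.c — build with: gcc distinct.c */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int x = 1, y = 2;
    pid_t pid = fork();                    /* two processes, separate address spaces */

    if (pid == 0) {                        /* "Process 1": only y changes here */
        y++;
        printf("child:  x=%d y=%d\n", x, y);   /* 1 3 */
    } else {                               /* "Process 0": x and y change here */
        x++; y++;
        printf("parent: x=%d y=%d\n", x, y);   /* 2 3 */
        wait(NULL);
    }
    return 0;
}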
Adapted from Neha Karanjkar’s slides

36
Our Parallel World
Core
Process

NO centralized server/master
37
Message Passing Interface
Message Passing Interface (MPI)

• Efforts began in 1991 by Jack Dongarra, Tony Hey, and David W. Walker
• Standard for message passing in a distributed
memory environment
• MPI Forum in 1993
• Version 1.0: 1994
• Version 2.0: 1997
• Version 3.0: 2012
• Version 4.0: 2021
• Version 5.0 (under discussion)
39
MPI Implementations
“The MPI standard includes point-to-point message-passing,
collective communications, group and communicator concepts,
process topologies, environmental management, process
creation and management, one-sided communications,
extended collective operations, external interfaces, I/O, some
miscellaneous topics, and a profiling interface.” – MPI report
• MPICH (ANL)
• MVAPICH (OSU)
• OpenMPI
• Intel MPI
• Cray MPI
40
Programming Environment
• Shell scripts (e.g. bash)
• ssh basics
  • E.g. ssh -X
  • …
• Mostly in C/C++
• Compilation, Makefiles, ...
• Linux environment variables
  • PATH
  • LD_LIBRARY_PATH
  • …

41
MPI Installation – Laptop
• Linux, or a Linux VM on Windows
  • apt/snap/yum/brew
• Windows
  • No support

• https://www.mpich.org/documentation/guides/

42
MPI
• Standard for message passing
• Explicit communications
• Medium programming complexity
• Requires communication scope

43
Simple MPI Code
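
The original listing is not reproduced here; below is a minimal hello-world sketch consistent with the compile-and-run steps on the next slide (the file name program.c is taken from that slide).

/* program.c */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);                   /* start the MPI runtime         */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* this process's id (0..size-1) */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* total number of processes     */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                           /* clean shutdown                */
    return 0;
}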

44
MPI Code Execution Steps

• Compile
• mpicc -o program.x program.c

• Execute
• mpirun -np 1 ./program.x (mpiexec -np 1 ./program.x)
• Runs 1 process on the launch/login node
• mpirun -np 6 ./program.x
• Runs 6 processes on the launch/login node

45
Output – Hello World
mpirun -np 20 ./program.x
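
Assuming the hello-world sketch shown earlier, this prints one line per rank; the ordering varies from run to run, e.g.:

Hello from rank 3 of 20
Hello from rank 0 of 20
Hello from rank 17 of 20
...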

46
