Parallel Architecture and Execution
Lecture 2
January 8, 2025
Parallel Application Performance
Depends on
• Parallel architecture
• Algorithm
Saturation – Example 1
Saturation – Example 2
Saturation – Example 3
(Figures: saturation plots; x-axis: problem size)
Execution Profile
Scaling Deep Learning Models
(Figure: scaling with problem size and number of PEs)
Source: "Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes"
• Speedup
• Efficiency
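These follow the standard definitions (stated here for reference, with T1 the serial execution time and TP the time on P processing elements), consistent with the formulas on the slides that follow:

$$S = \frac{T_1}{T_P}, \qquad E = \frac{S}{P} = \frac{T_1}{P\,T_P}$$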
Sum of Numbers Speedup
Parallel Sum (Optimized)
$$S_1 = \frac{N}{\frac{N}{P} + 2P}, \qquad E_1 = \;?$$

$$S_2 = \frac{N}{\frac{N}{P} + 2\log P}, \qquad E_2 = \frac{1}{1 + \frac{2P\log P}{N}}$$
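Reading the denominators as parallel run times (an inference; the slide's accompanying figure did not survive extraction): each process first adds its N/P local numbers; the partial sums are then combined either one at a time at a single process, about P communicate-and-add steps (the 2P term), or pairwise along a binary tree, log P such steps (the 2 log P term):

$$T_P^{\text{gather}} = \frac{N}{P} + 2P, \qquad T_P^{\text{tree}} = \frac{N}{P} + 2\log P, \qquad S = \frac{T_1}{T_P} = \frac{N}{T_P}$$

E2 then follows as S2 / P.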
Efficiency (Adding numbers)
Homework: Analyze the measured efficiency based on your derivation
Amdahl's Law

$$S = \frac{1}{(1 - f) + \frac{f}{P}}$$

where f is the fraction of the computation that can be parallelized and P is the number of processing elements.
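A quick worked instance (numbers mine, not the slide's): with f = 0.9 and P = 10,

$$S = \frac{1}{0.1 + 0.09} \approx 5.3,$$

and even as P → ∞ the speedup is bounded by 1/(1 − f) = 10.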
Parallel Architecture
System Components
• Processor
• Memory
• Network
• Storage
NUMA
Source: https://www.sciencedirect.com/topics/computer-science/non-uniform-memory-access
Memory Hierarchy
NUMA Nodes
Utility: lstopo (hwloc package)
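Two quick ways to inspect the NUMA layout of a Linux machine (assuming the hwloc and numactl packages are installed):

lstopo                # renders the machine topology: packages, NUMA nodes, caches, cores
numactl --hardware    # lists NUMA nodes, their CPUs, and per-node total/free memory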
Memory Placement
Lepers et al., “Thread and Memory Placement on NUMA Systems: Asymmetry Matters”, USENIX ATC 2015.
Connect Multiple Compute Nodes
Interconnect
Source: hector.ac.uk
Parallel Programming Models
• Shared memory
• Distributed memory
Shared Memory
• Shared address space
• Time to access some memory words is longer than others (NUMA)
• Need to worry about concurrent access
• Programming paradigms – Pthreads, OpenMP
(Figure: Thread 0 and Thread 1 sharing one address space)
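A minimal OpenMP sketch of this model (example mine; the slide only names the paradigms). Both threads update the same counter through the shared address space, so the increment must be protected against concurrent access:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int counter = 0;                  /* shared: a single copy visible to all threads */
    #pragma omp parallel num_threads(2)
    {
        #pragma omp atomic            /* guard the concurrent update */
        counter++;
    }
    printf("final counter = %d\n", counter);   /* always 2, thanks to the atomic */
    return 0;
}

Compile with gcc -fopenmp; without the atomic, the two increments could race and lose an update.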
Intel Processors (Latest)
https://www.intel.com/content/www/us/en/products/docs/processors/xeon/5th-gen-xeon-scalable-processors.html
Cluster of Compute Nodes
Message Passing
• Distinct address space per process
• Multiple processing nodes
• Basic operations are send and receive
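A sketch of the two basic operations in C with MPI (example mine, anticipating the MPI material below): process 0 sends an integer that exists only in its own address space to process 1.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;                                        /* only rank 0 holds 42 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}

Run with at least two processes (e.g. mpirun -np 2 ./program.x), or the send has no matching receiver.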
Interprocess Communication
Our Parallel World
(Figure: each process runs on a core with its own memory)

Process 0        Process 1
x = 1, y = 2     x = 10, y = 20
...              ...
x++              x++
...              ...
print x, y       print x, y
2, 2             11, 20
Distinct Process Address Space
Process 0        Process 1
x = 1, y = 2     x = 1, y = 2
...              ...
x++; y++         y++
...              ...
print x, y       print x, y
2, 3             1, 3
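A runnable MPI version of this example (construction mine, mapping the two processes to ranks 0 and 1):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    int x = 1, y = 2;                 /* each process owns a private x and y */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) { x++; y++; }      /* modifies rank 0's copies only */
    else if (rank == 1) { y++; }
    printf("process %d: x = %d, y = %d\n", rank, x, y);   /* prints 2, 3 and 1, 3 */
    MPI_Finalize();
    return 0;
}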
Adapted from Neha Karanjkar’s slides
Our Parallel World
(Figure: cores and processes)
NO centralized server/master
Message Passing Interface (MPI)
MPI Installation – Laptop
• Linux or Linux VM on Windows
• apt/snap/yum/brew
• Windows
• No support
• https://www.mpich.org/documentation/guides/
MPI
• Standard for message passing
• Explicit communications
• Medium programming complexity
• Requires specifying the communication scope (communicators)
Simple MPI Code
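The slide's listing did not survive extraction; a minimal hello-world consistent with the compile/run steps and the output slide below would be:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);                    /* enter the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of processes */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();                            /* leave the MPI environment */
    return 0;
}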
MPI Code Execution Steps
• Compile
• mpicc -o program.x program.c
• Execute
• mpirun -np 1 ./program.x (mpiexec -np 1 ./program.x)
• Runs 1 process on the launch/login node
• mpirun -np 6 ./program.x
• Runs 6 processes on the launch/login node
Output – Hello World
mpirun -np 20 ./program.x
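This prints one line per process, e.g. "Hello from process 3 of 20" (given a hello-world like the sketch above); the ordering of the 20 lines varies across runs because the processes print concurrently.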