Parallelization (Revision)
Lecture 12
February 14, 2024
Broadcast Algorithms in MPICH
• Short messages
• < MPIR_CVAR_BCAST_SHORT_MSG_SIZE
• Binomial
• Medium messages
• Between MPIR_CVAR_BCAST_SHORT_MSG_SIZE and MPIR_CVAR_BCAST_LONG_MSG_SIZE
• Scatter + Allgather (Recursive doubling)
• Large messages
• > MPIR_CVAR_BCAST_LONG_MSG_SIZE
• Scatter + Allgather (Ring)
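For the short-message case, a minimal sketch of a binomial-tree broadcast from root 0 built from plain point-to-point calls is shown below. It illustrates the algorithm, not MPICH's actual code; the name binomial_bcast is hypothetical.

```c
/* Sketch of a binomial-tree broadcast from root 0 (not MPICH's code).
 * Each rank first receives from the rank that differs in its lowest set
 * bit, then forwards the data to the ranks below it in the tree. */
#include <mpi.h>

void binomial_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, size, mask = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Receive phase: stop at the lowest set bit of rank. */
    while (mask < size) {
        if (rank & mask) {
            MPI_Recv(buf, count, type, rank - mask, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }
    /* Send phase: forward to rank + mask/2, rank + mask/4, ... */
    mask >>= 1;
    while (mask > 0) {
        if (rank + mask < size)
            MPI_Send(buf, count, type, rank + mask, 0, comm);
        mask >>= 1;
    }
}
```

The critical path is ⌈log₂ p⌉ messages, which is why the binomial tree wins for short messages where latency dominates.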
Old vs. New MPI_Bcast
[Figure: performance of the old MPI_Bcast vs. the new Van de Geijn (scatter + allgather) algorithm]
Reduce on 64 nodes
[Figure: MPI_Reduce performance on 64 nodes]
Allgather – Ring Algorithm
• Every process sends to and receives from everyone else
• Assume p processes and n total bytes, so each process contributes n/p bytes
• In each step, every process sends n/p bytes to one neighbor and receives n/p bytes from the other
• Time: (p – 1) * (L + (n/p) / B), where L is the latency and B the bandwidth
• How can we improve?
[Figure: four processes passing n/p-byte blocks around a ring]
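A minimal sketch of the ring algorithm above, assuming each process contributes the same number of bytes and combining the send and receive of each of the p − 1 steps into one MPI_Sendrecv; ring_allgather and blockbytes are illustrative names rather than MPICH internals.

```c
/* Sketch of the ring allgather: each process places its own block in the
 * result buffer, then for p-1 steps forwards the most recently received
 * block to its right neighbor while receiving a new one from the left. */
#include <mpi.h>
#include <string.h>

void ring_allgather(const void *sendbuf, int blockbytes, void *recvbuf, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    char *rbuf  = (char *)recvbuf;
    int   left  = (rank - 1 + p) % p;
    int   right = (rank + 1) % p;

    /* Own block goes into its slot of the result buffer. */
    memcpy(rbuf + rank * blockbytes, sendbuf, blockbytes);

    for (int i = 0; i < p - 1; i++) {
        int send_block = (rank - i + p) % p;      /* block received in step i-1 */
        int recv_block = (rank - i - 1 + p) % p;  /* block arriving this step   */
        MPI_Sendrecv(rbuf + send_block * blockbytes, blockbytes, MPI_BYTE, right, 0,
                     rbuf + recv_block * blockbytes, blockbytes, MPI_BYTE, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```

Each step moves n/p bytes, so the total cost matches the (p – 1) * (L + (n/p)/B) estimate above.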
Non-blocking Point-to-Point
Many-to-one Non-blocking P2P
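A hedged sketch of the many-to-one pattern: rank 0 posts one MPI_Irecv per sender and completes them with a single MPI_Waitall, so senders can arrive in any order without serializing at the root; many_to_one is an illustrative name.

```c
/* Sketch: rank 0 collects one message from every other rank using
 * non-blocking receives, completed together by MPI_Waitall. */
#include <mpi.h>
#include <stdlib.h>

void many_to_one(int *mydata, int count, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    if (rank == 0) {
        int *all = malloc((size_t)p * count * sizeof(int));
        MPI_Request *reqs = malloc((size_t)(p - 1) * sizeof(MPI_Request));

        for (int i = 0; i < count; i++)        /* rank 0's own contribution */
            all[i] = mydata[i];
        for (int src = 1; src < p; src++)      /* one pending receive per sender */
            MPI_Irecv(all + src * count, count, MPI_INT, src, 0, comm,
                      &reqs[src - 1]);
        MPI_Waitall(p - 1, reqs, MPI_STATUSES_IGNORE);

        /* ... use 'all' ... */
        free(reqs);
        free(all);
    } else {
        MPI_Send(mydata, count, MPI_INT, 0, 0, comm);
    }
}
```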
Non-blocking Performance
• The MPI standard does not require overlapping communication and computation
• An implementation may use a thread to move data in parallel
• An implementation can delay the initiation of the data transfer until the Wait call
• MPI_Test – non-blocking, tests for completion, and drives progress
• MPIR_CVAR_ASYNC_PROGRESS enables asynchronous progress in MPICH
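Because an implementation is free to delay the transfer until the Wait call, explicit calls to MPI_Test (or enabling MPIR_CVAR_ASYNC_PROGRESS) are what keep data moving while the application computes. Below is a minimal sketch of that pattern; overlap_example and independent_work are placeholder names.

```c
/* Sketch of overlapping a pending receive with independent computation:
 * post MPI_Irecv, do work that does not touch the buffer, and poll
 * MPI_Test so the library can make progress on the transfer. */
#include <mpi.h>

void overlap_example(double *halo, int count, int neighbor, MPI_Comm comm,
                     void (*independent_work)(void))
{
    MPI_Request req;
    int done = 0;

    MPI_Irecv(halo, count, MPI_DOUBLE, neighbor, 0, comm, &req);

    while (!done) {
        independent_work();                         /* must not read 'halo' */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);   /* also drives progress */
    }
    /* 'halo' is now safe to use. */
}
```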
Non-blocking Point-to-Point Safety
• MPI_Isend (buf, count, datatype, dest, tag, comm, request)
• MPI_Irecv (buf, count, datatype, source, tag, comm, request)
• MPI_Wait (request, status)
Rank 0: MPI_Isend, then MPI_Recv
Rank 1: MPI_Isend, then MPI_Recv
• Safe – the non-blocking sends return immediately, so the matching receives cannot deadlock
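A minimal sketch of this safe exchange: because MPI_Isend returns without waiting for the matching receive, both ranks can post their send first and then block in MPI_Recv without risk of deadlock; safe_exchange is an illustrative name.

```c
/* Sketch: safe pairwise exchange with a non-blocking send and a
 * blocking receive, completed by MPI_Wait on the send request. */
#include <mpi.h>

void safe_exchange(double *sendbuf, double *recvbuf, int count,
                   int partner, MPI_Comm comm)
{
    MPI_Request req;

    MPI_Isend(sendbuf, count, MPI_DOUBLE, partner, 0, comm, &req);
    MPI_Recv(recvbuf, count, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* send buffer reusable after this */
}
```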
Mesh Interconnect
• Diameter 2(√p – 1)
• Bisection width √p
• Cost 2(p – √p)
Torus Interconnect
• Diameter 2(√p/2)
• Bisection width 2√p
• Cost 2p
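For example, with p = 64 (an 8 × 8 layout) the formulas above give: diameter 2(8 – 1) = 14 for the mesh vs. 2(8/2) = 8 for the torus; bisection width 8 vs. 16; cost 2(64 – 8) = 112 links vs. 2 · 64 = 128 links. The wrap-around links roughly halve the diameter and double the bisection width for a modest increase in cost.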
Parallelization
Parallelization Steps
1. Decomposition of computation into tasks
• Identifying portions of the work that can be performed concurrently
2. Assignment of tasks to processes
• Assigning concurrent pieces of work onto multiple processes running in parallel
3. Orchestration of data access, communication and synchronization among processes
• Distributing the data associated with the program
• Managing access to data shared by multiple processes
• Synchronizing at various stages of the parallel program execution
4. Mapping of processes to processors
• Placement of processes in the physical processor topology
Illustration of Parallelization Steps
[Figure: the four parallelization steps applied to a computation; the decomposition must expose enough concurrency]
Matrix Vector Multiplication – Decomposition (P = 3)
• Decomposition: identifying the portions of the work that can be performed concurrently – here, groups of rows whose dot products are independent
• Assignment: one block of rows per process (P1, P2, P3)
[Figure: matrix rows partitioned into three blocks, one per process]
Matrix Vector Multiplication – Orchestration (P = 3)
• Decomposition and assignment as before: one block of rows per process (P1, P2, P3)
• Orchestration:
• Initial communication: distribute the matrix rows (Scatter) and the vector (Bcast or Allgather) – either process 0 reads the data and distributes it, or the processes read in parallel
• Final communication: Gather the partial results (a sketch follows below)
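A minimal sketch of this row-wise scheme, assuming N is divisible by the number of processes, that process 0 holds A and x, and that x is allocated with length N on every rank; matvec_rowwise is an illustrative name.

```c
/* Sketch of row-wise matrix-vector multiplication: scatter row blocks of A,
 * broadcast x, compute the local slice of y, and gather y on process 0. */
#include <mpi.h>
#include <stdlib.h>

void matvec_rowwise(double *A, double *x, double *y, int N, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int rows = N / p;                        /* rows owned by each process */

    double *Aloc = malloc((size_t)rows * N * sizeof(double));
    double *yloc = malloc((size_t)rows * sizeof(double));

    /* Initial communication: distribute A and x from process 0. */
    MPI_Scatter(A, rows * N, MPI_DOUBLE, Aloc, rows * N, MPI_DOUBLE, 0, comm);
    MPI_Bcast(x, N, MPI_DOUBLE, 0, comm);

    /* Local computation on the assigned rows. */
    for (int i = 0; i < rows; i++) {
        yloc[i] = 0.0;
        for (int j = 0; j < N; j++)
            yloc[i] += Aloc[i * N + j] * x[j];
    }

    /* Final communication: collect the result on process 0. */
    MPI_Gather(yloc, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, 0, comm);

    free(Aloc);
    free(yloc);
}
```

Scatter distributes disjoint row blocks, Bcast replicates the vector, and Gather reassembles y on process 0, matching the initial and final communication listed above.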
Distribute using Bcast vs. Allgather
[Figures: comparison of Bcast vs. Allgather for the initial distribution]
Matrix Vector Multiplication – Column-wise Decomposition
• Decomposition and assignment: one block of columns per process (P1, P2, P3)
• Orchestration: each process computes a partial result vector; a Reduce combines them (a sketch follows below)
[Figure: matrix columns partitioned into three blocks, one per process]
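A minimal sketch of the column-wise variant, assuming N is divisible by P and that each process already holds its column block (stored column-major) and the matching slice of x; matvec_colwise and the buffer names are illustrative.

```c
/* Sketch of column-wise matrix-vector multiplication: each process forms a
 * full-length partial y from its columns, then MPI_Reduce sums the partials
 * onto process 0. */
#include <mpi.h>
#include <stdlib.h>

void matvec_colwise(double *Acols, double *xloc, double *y, int N, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int cols = N / p;                     /* columns owned by this process */

    double *ypart = calloc((size_t)N, sizeof(double));

    /* Acols is N x cols, column-major: column j starts at Acols + j*N. */
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < N; i++)
            ypart[i] += Acols[j * N + i] * xloc[j];

    /* Orchestration: one Reduce combines the partial sums on process 0. */
    MPI_Reduce(ypart, y, N, MPI_DOUBLE, MPI_SUM, 0, comm);
    free(ypart);
}
```

Unlike the row-wise version, every process produces a full-length partial y, so the final communication is a Reduce (sum) rather than a Gather.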
1D Domain Decomposition – 1D Domain
• N grid points, P processes, N/P points per process
• Nearest-neighbor communication: each process performs 2 sends() and 2 recvs()
• #Communications per process: 2
• #Computations per process: N/P
• Communication-to-computation ratio = 2 / (N/P) = 2P/N
[Figure: a 1D row of grid points split among P1–P4; the halo exchange is sketched below]
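A minimal sketch of this nearest-neighbor exchange using ghost cells, with MPI_Sendrecv pairs and MPI_PROC_NULL at the domain ends; exchange_ghosts_1d and the buffer layout (one ghost cell at each end of u) are illustrative assumptions.

```c
/* Sketch of the 1D halo exchange: u[1..local_n] are owned points, u[0] and
 * u[local_n+1] are ghost cells filled from the left and right neighbors. */
#include <mpi.h>

void exchange_ghosts_1d(double *u, int local_n, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int left  = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my leftmost point left, receive right neighbor's leftmost point. */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* Send my rightmost point right, receive left neighbor's rightmost point. */
    MPI_Sendrecv(&u[local_n],     1, MPI_DOUBLE, right, 1,
                 &u[0],           1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}
```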
1D Domain Decomposition – 2D Domain
• N grid points (√N × √N grid), P processes, N/P points per process
• #Communications per process: 2√N (assuming a square grid)
• #Computations per process: N/P (assuming a square grid)
• Communication-to-computation ratio = 2√N / (N/P) = 2P/√N
2D Domain Decomposition
• N grid points (√N × √N grid), P processes (√P × √P process grid), N/P points per process
• Five-point stencil: 2 sends() and 2 recvs() per dimension
• #Communications per process: 2√N/√P in one dimension (assuming a square grid)
• #Computations per process: N/P (assuming a square grid)
[Figure: 2D grid split into √P × √P blocks; each point depends on its four neighbors]
2D Domain Decomposition
• Counting both dimensions of the five-point stencil:
• #Communications per process: 4√N/√P (assuming a square grid)
• #Computations per process: N/P (assuming a square grid)
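Putting the two decompositions side by side with the counts above: a 1D decomposition of the 2D domain has communication-to-computation ratio 2√N / (N/P) = 2P/√N, while the 2D decomposition gives (4√N/√P) / (N/P) = 4√P/√N. For example, with N = 10^6 points and P = 100 processes these ratios are 0.2 and 0.04 respectively, so the 2D decomposition communicates √P/2 = 5 times less per unit of computation.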
Send / Recv
[Figures: elements 0–7, laid out as a 2 × 4 block, transferred with MPI_Send / MPI_Recv into a contiguous buffer of 8 elements]
MPI_Pack