Parallelization (Revision)
Lecture 12
February 14, 2024
Broadcast Algorithms in MPICH
• Short messages
• < MPIR_CVAR_BCAST_SHORT_MSG_SIZE
• Binomial
• Medium messages
• Between MPIR_CVAR_BCAST_SHORT_MSG_SIZE and MPIR_CVAR_BCAST_LONG_MSG_SIZE
• Scatter + Allgather (Recursive doubling)
• Large messages
• > MPIR_CVAR_BCAST_LONG_MSG_SIZE
• Scatter + Allgather (Ring)
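For the short-message case, a minimal sketch of a binomial-tree broadcast from root 0 built from plain point-to-point calls is shown below. It illustrates the algorithm, not MPICH's actual code; the name binomial_bcast is hypothetical.

```c
/* Sketch of a binomial-tree broadcast from root 0 (not MPICH's code).
 * Each rank first receives from the rank that differs in its lowest set
 * bit, then forwards the data to the ranks below it in the tree. */
#include <mpi.h>

void binomial_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, size, mask = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Receive phase: stop at the lowest set bit of rank. */
    while (mask < size) {
        if (rank & mask) {
            MPI_Recv(buf, count, type, rank - mask, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }
    /* Send phase: forward to rank + mask/2, rank + mask/4, ... */
    mask >>= 1;
    while (mask > 0) {
        if (rank + mask < size)
            MPI_Send(buf, count, type, rank + mask, 0, comm);
        mask >>= 1;
    }
}
```

The critical path is ⌈log₂ p⌉ messages, which is why the binomial tree wins for short messages where latency dominates.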
Old vs. New MPI_Bcast
[Figure: performance of the old MPI_Bcast vs. the new Van de Geijn (scatter + allgather) algorithm]
Reduce on 64 nodes
[Figure: MPI_Reduce performance on 64 nodes]
Allgather – Ring Algorithm
• Every process sends to and receives from everyone else
• Assume p processes and n total bytes, so each process contributes n/p bytes
• In each step, every process sends n/p bytes to one neighbor and receives n/p bytes from the other
• Time: (p – 1) * (L + (n/p) / B), where L is the latency and B the bandwidth
• How can we improve?
[Figure: four processes passing n/p-byte blocks around a ring]
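A minimal sketch of the ring algorithm above, assuming each process contributes the same number of bytes and combining the send and receive of each of the p − 1 steps into one MPI_Sendrecv; ring_allgather and blockbytes are illustrative names rather than MPICH internals.

```c
/* Sketch of the ring allgather: each process places its own block in the
 * result buffer, then for p-1 steps forwards the most recently received
 * block to its right neighbor while receiving a new one from the left. */
#include <mpi.h>
#include <string.h>

void ring_allgather(const void *sendbuf, int blockbytes, void *recvbuf, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    char *rbuf  = (char *)recvbuf;
    int   left  = (rank - 1 + p) % p;
    int   right = (rank + 1) % p;

    /* Own block goes into its slot of the result buffer. */
    memcpy(rbuf + rank * blockbytes, sendbuf, blockbytes);

    for (int i = 0; i < p - 1; i++) {
        int send_block = (rank - i + p) % p;      /* block received in step i-1 */
        int recv_block = (rank - i - 1 + p) % p;  /* block arriving this step   */
        MPI_Sendrecv(rbuf + send_block * blockbytes, blockbytes, MPI_BYTE, right, 0,
                     rbuf + recv_block * blockbytes, blockbytes, MPI_BYTE, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```

Each step moves n/p bytes, so the total cost matches the (p – 1) * (L + (n/p)/B) estimate above.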
Non-blocking Point-to-Point
Many-to-one Non-blocking P2P
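A hedged sketch of the many-to-one pattern: rank 0 posts one MPI_Irecv per sender and completes them with a single MPI_Waitall, so senders can arrive in any order without serializing at the root; many_to_one is an illustrative name.

```c
/* Sketch: rank 0 collects one message from every other rank using
 * non-blocking receives, completed together by MPI_Waitall. */
#include <mpi.h>
#include <stdlib.h>

void many_to_one(int *mydata, int count, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    if (rank == 0) {
        int *all = malloc((size_t)p * count * sizeof(int));
        MPI_Request *reqs = malloc((size_t)(p - 1) * sizeof(MPI_Request));

        for (int i = 0; i < count; i++)        /* rank 0's own contribution */
            all[i] = mydata[i];
        for (int src = 1; src < p; src++)      /* one pending receive per sender */
            MPI_Irecv(all + src * count, count, MPI_INT, src, 0, comm,
                      &reqs[src - 1]);
        MPI_Waitall(p - 1, reqs, MPI_STATUSES_IGNORE);

        /* ... use 'all' ... */
        free(reqs);
        free(all);
    } else {
        MPI_Send(mydata, count, MPI_INT, 0, 0, comm);
    }
}
```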
Non-blocking Performance
• The MPI standard does not require overlapping communication and computation
• An implementation may use a thread to move data in parallel
• An implementation can delay the initiation of the data transfer until the Wait call
• MPI_Test – non-blocking, tests for completion, and drives progress
• MPIR_CVAR_ASYNC_PROGRESS enables asynchronous progress in MPICH
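Because an implementation is free to delay the transfer until the Wait call, explicit calls to MPI_Test (or enabling MPIR_CVAR_ASYNC_PROGRESS) are what keep data moving while the application computes. Below is a minimal sketch of that pattern; overlap_example and independent_work are placeholder names.

```c
/* Sketch of overlapping a pending receive with independent computation:
 * post MPI_Irecv, do work that does not touch the buffer, and poll
 * MPI_Test so the library can make progress on the transfer. */
#include <mpi.h>

void overlap_example(double *halo, int count, int neighbor, MPI_Comm comm,
                     void (*independent_work)(void))
{
    MPI_Request req;
    int done = 0;

    MPI_Irecv(halo, count, MPI_DOUBLE, neighbor, 0, comm, &req);

    while (!done) {
        independent_work();                         /* must not read 'halo' */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);   /* also drives progress */
    }
    /* 'halo' is now safe to use. */
}
```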
Non-blocking Point-to-Point Safety
• MPI_Isend (buf, count, datatype, dest, tag, comm, request)
• MPI_Irecv (buf, count, datatype, source, tag, comm, request)
• MPI_Wait (request, status)
Rank 0: MPI_Isend, then MPI_Recv
Rank 1: MPI_Isend, then MPI_Recv
• Safe – the non-blocking sends return immediately, so the matching receives cannot deadlock
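A minimal sketch of this safe exchange: because MPI_Isend returns without waiting for the matching receive, both ranks can post their send first and then block in MPI_Recv without risk of deadlock; safe_exchange is an illustrative name.

```c
/* Sketch: safe pairwise exchange with a non-blocking send and a
 * blocking receive, completed by MPI_Wait on the send request. */
#include <mpi.h>

void safe_exchange(double *sendbuf, double *recvbuf, int count,
                   int partner, MPI_Comm comm)
{
    MPI_Request req;

    MPI_Isend(sendbuf, count, MPI_DOUBLE, partner, 0, comm, &req);
    MPI_Recv(recvbuf, count, MPI_DOUBLE, partner, 0, comm, MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* send buffer reusable after this */
}
```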
Mesh Interconnect
• Diameter 2(√p – 1)
• Bisection width √p
• Cost 2(p – √p)
Torus Interconnect
• Diameter 2(√p/2)
• Bisection width 2√p
• Cost 2p
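For example, with p = 64 (an 8 × 8 layout) the formulas above give: diameter 2(8 – 1) = 14 for the mesh vs. 2(8/2) = 8 for the torus; bisection width 8 vs. 16; cost 2(64 – 8) = 112 links vs. 2 · 64 = 128 links. The wrap-around links roughly halve the diameter and double the bisection width for a modest increase in cost.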
Parallelization
Parallelization Steps
1. Decomposition of computation into tasks
• Identifying portions of the work that can be performed concurrently
2. Assignment of tasks to processes
• Assigning concurrent pieces of work onto multiple processes running in parallel
3. Orchestration of data access, communication and synchronization among processes
• Distributing the data associated with the program
• Managing access to data shared by multiple processes
• Synchronizing at various stages of the parallel program execution
4. Mapping of processes to processors
• Placement of processes in the physical processor topology
Illustration of Parallelization Steps
[Figure: the four parallelization steps applied to a computation; the decomposition must expose enough concurrency]
Matrix Vector Multiplication – Decomposition (P = 3)
• Decomposition: identifying the portions of the work that can be performed concurrently – here, groups of rows whose dot products are independent
• Assignment: one block of rows per process (P1, P2, P3)
[Figure: matrix rows partitioned into three blocks, one per process]
Matrix Vector Multiplication – Orchestration (P = 3)
• Decomposition and assignment as before: one block of rows per process (P1, P2, P3)
• Orchestration:
• Initial communication: distribute the matrix rows (Scatter) and the vector (Bcast or Allgather) – either process 0 reads the data and distributes it, or the processes read in parallel
• Final communication: Gather the partial results (a sketch follows below)
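A minimal sketch of this row-wise scheme, assuming N is divisible by the number of processes, that process 0 holds A and x, and that x is allocated with length N on every rank; matvec_rowwise is an illustrative name.

```c
/* Sketch of row-wise matrix-vector multiplication: scatter row blocks of A,
 * broadcast x, compute the local slice of y, and gather y on process 0. */
#include <mpi.h>
#include <stdlib.h>

void matvec_rowwise(double *A, double *x, double *y, int N, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int rows = N / p;                        /* rows owned by each process */

    double *Aloc = malloc((size_t)rows * N * sizeof(double));
    double *yloc = malloc((size_t)rows * sizeof(double));

    /* Initial communication: distribute A and x from process 0. */
    MPI_Scatter(A, rows * N, MPI_DOUBLE, Aloc, rows * N, MPI_DOUBLE, 0, comm);
    MPI_Bcast(x, N, MPI_DOUBLE, 0, comm);

    /* Local computation on the assigned rows. */
    for (int i = 0; i < rows; i++) {
        yloc[i] = 0.0;
        for (int j = 0; j < N; j++)
            yloc[i] += Aloc[i * N + j] * x[j];
    }

    /* Final communication: collect the result on process 0. */
    MPI_Gather(yloc, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, 0, comm);

    free(Aloc);
    free(yloc);
}
```

Scatter distributes disjoint row blocks, Bcast replicates the vector, and Gather reassembles y on process 0, matching the initial and final communication listed above.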
Distribute using Bcast vs. Allgather
[Figures: comparison of Bcast vs. Allgather for the initial distribution]
Matrix Vector Multiplication – Column-wise Decomposition
• Decomposition and assignment: one block of columns per process (P1, P2, P3)
• Orchestration: each process computes a partial result vector; a Reduce combines them (a sketch follows below)
[Figure: matrix columns partitioned into three blocks, one per process]
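A minimal sketch of the column-wise variant, assuming N is divisible by P and that each process already holds its column block (stored column-major) and the matching slice of x; matvec_colwise and the buffer names are illustrative.

```c
/* Sketch of column-wise matrix-vector multiplication: each process forms a
 * full-length partial y from its columns, then MPI_Reduce sums the partials
 * onto process 0. */
#include <mpi.h>
#include <stdlib.h>

void matvec_colwise(double *Acols, double *xloc, double *y, int N, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int cols = N / p;                     /* columns owned by this process */

    double *ypart = calloc((size_t)N, sizeof(double));

    /* Acols is N x cols, column-major: column j starts at Acols + j*N. */
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < N; i++)
            ypart[i] += Acols[j * N + i] * xloc[j];

    /* Orchestration: one Reduce combines the partial sums on process 0. */
    MPI_Reduce(ypart, y, N, MPI_DOUBLE, MPI_SUM, 0, comm);
    free(ypart);
}
```

Unlike the row-wise version, every process produces a full-length partial y, so the final communication is a Reduce (sum) rather than a Gather.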
1D Domain Decomposition – 1D Domain
• N grid points, P processes, N/P points per process
• Nearest-neighbor communication: each process performs 2 sends() and 2 recvs()
• #Communications per process: 2
• #Computations per process: N/P
• Communication-to-computation ratio = 2 / (N/P) = 2P/N
[Figure: a 1D row of grid points split among P1–P4; the halo exchange is sketched below]
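A minimal sketch of this nearest-neighbor exchange using ghost cells, with MPI_Sendrecv pairs and MPI_PROC_NULL at the domain ends; exchange_ghosts_1d and the buffer layout (one ghost cell at each end of u) are illustrative assumptions.

```c
/* Sketch of the 1D halo exchange: u[1..local_n] are owned points, u[0] and
 * u[local_n+1] are ghost cells filled from the left and right neighbors. */
#include <mpi.h>

void exchange_ghosts_1d(double *u, int local_n, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int left  = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my leftmost point left, receive right neighbor's leftmost point. */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* Send my rightmost point right, receive left neighbor's rightmost point. */
    MPI_Sendrecv(&u[local_n],     1, MPI_DOUBLE, right, 1,
                 &u[0],           1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}
```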
1D Domain Decomposition – 2D Domain
• N grid points (√N × √N grid), P processes, N/P points per process
• #Communications per process: 2√N (assuming a square grid)
• #Computations per process: N/P (assuming a square grid)
• Communication-to-computation ratio = 2√N / (N/P) = 2P/√N
2D Domain Decomposition
• N grid points (√N × √N grid), P processes (√P × √P process grid), N/P points per process
• Five-point stencil: 2 sends() and 2 recvs() per dimension
• #Communications per process: 2√N/√P in one dimension (assuming a square grid)
• #Computations per process: N/P (assuming a square grid)
[Figure: 2D grid split into √P × √P blocks; each point depends on its four neighbors]
2D Domain Decomposition
• Counting both dimensions of the five-point stencil:
• #Communications per process: 4√N/√P (assuming a square grid)
• #Computations per process: N/P (assuming a square grid)
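Putting the two decompositions side by side with the counts above: a 1D decomposition of the 2D domain has communication-to-computation ratio 2√N / (N/P) = 2P/√N, while the 2D decomposition gives (4√N/√P) / (N/P) = 4√P/√N. For example, with N = 10^6 points and P = 100 processes these ratios are 0.2 and 0.04 respectively, so the 2D decomposition communicates √P/2 = 5 times less per unit of computation.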
Send / Recv
[Figures: elements 0–7, laid out as a 2 × 4 block, transferred with MPI_Send / MPI_Recv into a contiguous buffer of 8 elements]
MPI_Pack