
Revision

Lecture 12
February 14, 2024
Broadcast Algorithms in MPICH
• Short messages
• < MPIR_CVAR_BCAST_SHORT_MSG_SIZE
• Binomial
• Medium messages
• Scatter + Allgather (Recursive doubling)
• Large messages
• > MPIR_CVAR_BCAST_LONG_MSG_SIZE
• Scatter + Allgather (Ring)

2
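
The call site itself does not change with message size; below is a minimal sketch (the array size and the CVAR value shown are illustrative assumptions) of how MPICH's selection is driven purely by the buffer length, with the thresholds optionally overridden through environment variables.

/* Minimal sketch: the same MPI_Bcast call is used for all message sizes;
 * MPICH selects binomial vs. scatter+allgather internally. Thresholds can
 * be overridden via CVAR environment variables, e.g.
 *   export MPIR_CVAR_BCAST_SHORT_MSG_SIZE=16384   (example value only) */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int n = 1 << 20;                      /* ~1M ints: a "large" message */
    int *buf = malloc(n * sizeof(int));
    if (rank == 0)
        for (int i = 0; i < n; i++) buf[i] = i;

    MPI_Bcast(buf, n, MPI_INT, 0, MPI_COMM_WORLD);   /* root = 0 */

    free(buf);
    MPI_Finalize();
    return 0;
}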
Old vs. New MPI_Bcast

Van de Geijn

3
Reduce on 64 nodes

4
Allgather – Ring Algorithm
• Every process sends to and receives from everyone else
• Assume p processes and total n bytes
• Every process sends and receives n/p bytes
• Time
• (p – 1) * (L + n/p*(1/B))
• How can we improve?
[Figure: four processes (0–3), each holding an n/p-sized block, passing blocks around the ring]

5
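
To see where the (p – 1) * (L + n/p*(1/B)) term comes from, here is a sketch of the ring algorithm written with MPI_Sendrecv; the names (ring_allgather, myblock, allblocks, blk) are illustrative, and blk is the per-process element count.

/* Ring allgather sketch: p-1 steps, each step moves one block one hop
 * around the ring, so the total time is (p-1) * (L + (n/p)/B). */
#include <mpi.h>
#include <string.h>

void ring_allgather(const double *myblock, double *allblocks,
                    int blk, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int right = (rank + 1) % p;
    int left  = (rank - 1 + p) % p;

    /* Place our own block in the output first. */
    memcpy(allblocks + rank * blk, myblock, blk * sizeof(double));

    int send_idx = rank;                   /* block forwarded this step */
    for (int step = 0; step < p - 1; step++) {
        int recv_idx = (rank - 1 - step + 2 * p) % p;  /* block arriving */
        MPI_Sendrecv(allblocks + send_idx * blk, blk, MPI_DOUBLE, right, 0,
                     allblocks + recv_idx * blk, blk, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
        send_idx = recv_idx;               /* forward what just arrived */
    }
}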
Non-blocking Point-to-Point

• MPI_Isend (buf, count, datatype, dest, tag, comm, request)


• MPI_Irecv (buf, count, datatype, source, tag, comm, request)

• MPI_Wait (request, status)


• MPI_Waitall (count, request, status)

6
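
A minimal usage sketch of these calls, assuming a neighbor rank and a double buffer; the receive is posted before the send and both are completed with one MPI_Waitall.

/* Exchange one message with a neighbor using non-blocking point-to-point.
 * Both requests are started first, then completed together. */
#include <mpi.h>

void exchange(double *sendbuf, double *recvbuf, int count,
              int neighbor, MPI_Comm comm)
{
    MPI_Request req[2];

    MPI_Irecv(recvbuf, count, MPI_DOUBLE, neighbor, 0, comm, &req[0]);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, neighbor, 0, comm, &req[1]);

    /* ... independent computation could go here ... */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}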
Many-to-one Non-blocking P2P

7
Non-blocking Performance
• Standard does not require overlapping communication and
computation
• Implementation may use a thread to move data in parallel
• Implementation can delay the initiation of data transfer until “Wait”
• MPI_Test – non-blocking, tests completion, starts progress
• MPIR_CVAR_ASYNC_PROGRESS (MPICH)

8
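
A sketch of the MPI_Test pattern described above, with a placeholder compute_chunk standing in for useful work; actual overlap still depends on the implementation (or on enabling MPIR_CVAR_ASYNC_PROGRESS=1 in MPICH).

/* Poke the progress engine with MPI_Test while computing, since the
 * standard does not guarantee asynchronous data movement. */
#include <mpi.h>

static void compute_chunk(int i) { (void)i; /* placeholder for useful work */ }

void overlap(double *recvbuf, int count, int src, MPI_Comm comm, int nchunks)
{
    MPI_Request req;
    int done = 0;

    MPI_Irecv(recvbuf, count, MPI_DOUBLE, src, 0, comm, &req);

    for (int i = 0; i < nchunks; i++) {
        compute_chunk(i);                               /* useful work    */
        if (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);   /* drive progress */
    }
    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);
}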
Non-blocking Point-to-Point Safety
• MPI_Isend (buf, count, datatype, dest, tag, comm, request)
• MPI_Irecv (buf, count, datatype, source, tag, comm, request)
• MPI_Wait (request, status)

Rank 0: MPI_Isend, then MPI_Recv
Rank 1: MPI_Isend, then MPI_Recv
Safe

9
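
A sketch of the safe pattern above, assuming both ranks call the same routine with each other as partner; because MPI_Isend returns without waiting for a matching receive, both MPI_Recv calls are reached and the exchange cannot deadlock.

#include <mpi.h>

void safe_exchange(double *sendbuf, double *recvbuf, int count,
                   int partner, MPI_Comm comm)
{
    MPI_Request req;

    MPI_Isend(sendbuf, count, MPI_DOUBLE, partner, 0, comm, &req);
    MPI_Recv (recvbuf, count, MPI_DOUBLE, partner, 0, comm,
              MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);       /* complete the send */
}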
Mesh Interconnect

• Diameter 2(√p – 1)
• Bisection width √p
• Cost 2(p – √p)
10
Torus Interconnect

• Diameter 2(√p/2)
• Bisection width 2√p
• Cost 2p
11
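
As a quick worked check of these formulas, take p = 64 laid out as 8 × 8: the mesh has diameter 2(8 – 1) = 14, bisection width 8, and cost 2(64 – 8) = 112 links, while the torus has diameter 2(8/2) = 8, bisection width 2 · 8 = 16, and cost 2 · 64 = 128 links. The wraparound links roughly halve the diameter and double the bisection width for a modest increase in link count.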
Parallelization
Parallelization Steps
1. Decomposition of computation into tasks
• Identifying portions of the work that can be performed concurrently
2. Assignment of tasks to processes
• Assigning concurrent pieces of work onto multiple processes running in parallel
3. Orchestration of data access, communication and synchronization among processes
• Distributing the data associated with the program
• Managing access to data shared by multiple processes
• Synchronizing at various stages of the parallel program execution
4. Mapping of processes to processors
• Placement of processes in the physical processor topology

13
Illustration of Parallelization Steps
[Figure: the parallelization steps, annotated "Expose enough concurrency"]
Source: Culler et al. book


14
Performance Goals
• Expose concurrency
• Reduce inter-process communications
• Load-balance
• Reduce synchronization
• Reduce idling
• Reduce management overhead
• Preserve data locality
• Exploit network topology

15
Matrix Vector Multiplication – Decomposition
P = 3?
[Figure: matrix rows split into three blocks, one per process P1, P2, P3]
Decomposition
• Identifying portions of the work that can be performed concurrently
Assignment

16
Matrix Vector Multiplication – Orchestration
P = 3
[Figure: matrix rows split into three blocks, one per process P1, P2, P3]
Decomposition
Assignment
Orchestration
• Allgather/Bcast
• Scatter
• Gather
• Initial communication
• Distribute (read by process 0) or parallel reads
• Final communication

17
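
A sketch of this orchestration in MPI, assuming n is divisible by p and that A, x, and y are only significant on rank 0 (the helper name matvec_rowwise is hypothetical): scatter the rows, broadcast x, compute locally, gather y.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void matvec_rowwise(const double *A, const double *x, double *y,
                    int n, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int rows = n / p;                                /* rows per process */

    double *Aloc = malloc((size_t)rows * n * sizeof(double));
    double *xloc = malloc((size_t)n * sizeof(double));
    double *yloc = malloc((size_t)rows * sizeof(double));

    /* Initial communication: scatter the rows of A, replicate x. */
    MPI_Scatter(A, rows * n, MPI_DOUBLE, Aloc, rows * n, MPI_DOUBLE, 0, comm);
    if (rank == 0) memcpy(xloc, x, n * sizeof(double));
    MPI_Bcast(xloc, n, MPI_DOUBLE, 0, comm);

    /* Local computation on the owned block of rows. */
    for (int i = 0; i < rows; i++) {
        yloc[i] = 0.0;
        for (int j = 0; j < n; j++)
            yloc[i] += Aloc[(size_t)i * n + j] * xloc[j];
    }

    /* Final communication: collect the result on rank 0. */
    MPI_Gather(yloc, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, 0, comm);

    free(Aloc); free(xloc); free(yloc);
}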
Distribute using Bcast vs. Allgather

18
Bcast vs. Allgather

19
Bcast vs. Allgather

20
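
The figures on these slides are not reproduced here; as an illustration of the two options being compared (assuming the vector has length n with n divisible by p), the Bcast variant ships the whole vector from rank 0, while the Allgather variant assembles it from per-process pieces.

#include <mpi.h>

/* Option 1: rank 0 holds all of x and broadcasts the whole vector. */
void distribute_bcast(double *x, int n, MPI_Comm comm)
{
    MPI_Bcast(x, n, MPI_DOUBLE, 0, comm);
}

/* Option 2: every rank holds its own n/p piece (e.g. from a parallel
 * read), and the full vector is assembled everywhere with an allgather. */
void distribute_allgather(const double *xloc, double *x, int n, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    MPI_Allgather(xloc, n / p, MPI_DOUBLE, x, n / p, MPI_DOUBLE, comm);
}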
Matrix Vector Multiplication – Column-wise Decomposition
[Figure: matrix columns split into three blocks, one per process P1, P2, P3]
Decomposition
Assignment
Orchestration
• Reduce

Row-wise vs. column-wise partitioning?


21
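
A sketch of the column-wise variant, assuming each process already holds an n x (n/p) column block of A and the matching n/p entries of x (the names matvec_colwise, Acols, xloc are illustrative); every process produces a full-length partial result that MPI_Reduce sums onto rank 0.

#include <mpi.h>
#include <stdlib.h>

void matvec_colwise(const double *Acols,  /* n x (n/p) local columns */
                    const double *xloc,   /* local n/p entries of x  */
                    double *y,            /* full result on rank 0   */
                    int n, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int cols = n / p;

    double *ypart = calloc(n, sizeof(double));   /* zero-initialized */

    /* Partial product: contribution of the local columns to every y[i]. */
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < n; i++)
            ypart[i] += Acols[(size_t)i * cols + j] * xloc[j];

    /* Sum the p partial vectors element-wise onto rank 0. */
    MPI_Reduce(ypart, y, n, MPI_DOUBLE, MPI_SUM, 0, comm);

    free(ypart);
}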
1D Domain Decomposition
[Figure: 1D domain of grid points split into four chunks, one per process P1–P4]
N grid points
P processes
N/P points per process
Nearest neighbor communications: 2 sends(), 2 recvs()
#Communications? 2
#Computations? N/P
Communication to computation ratio = 2P/N
22
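
A sketch of the 2-send / 2-receive exchange counted above, assuming each process stores its nloc points in an array u with one ghost cell at each end; MPI_PROC_NULL turns the boundary messages into no-ops.

#include <mpi.h>

void halo_exchange_1d(double *u, int nloc, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int left  = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send left boundary point, receive right ghost point ... */
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                 &u[nloc + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* ... and send right boundary point, receive left ghost point. */
    MPI_Sendrecv(&u[nloc],     1, MPI_DOUBLE, right, 1,
                 &u[0],        1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}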
1D Domain Decomposition
[Figure: 2D domain (square grid) split into strips, one per process]
N grid points
P processes
N/P points per process
#Communications? 2√N (assuming square grid)
#Computations? N/P (assuming square grid)
Communication to computation ratio = ?
23
2D Domain decomposition
[Figure: square grid of points split into √P x √P blocks]
N grid points (√N x √N grid)
P processes (√P x √P grid)
N/P points per process
+ Several parallel communications
+ Lower communication volume/process
24
2D Domain decomposition
[Figure: one block exchanging data with two neighbors: 2 Sends(), 2 Recvs()]
#Communications? 2√N/√P (assuming square grid)
#Computations? N/P (assuming square grid)
Communication to computation ratio = ?


25
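
Filling in the ratio questions on this and the previous slides: dividing per-process communication by the per-process computation N/P gives 2√N / (N/P) = 2P/√N for the strip (1D) decomposition of the square grid and (2√N/√P) / (N/P) = 2√P/√N for the block (2D) decomposition, so the 2D decomposition improves the ratio by a factor of √P; this is the "lower communication volume per process" advantage noted above.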
Stencils

Five-point stencil

26
2D Domain decomposition
[Figure: one block exchanging a face with each of its four neighbors: 4 Sends(), 4 Recvs()]
N grid points (√N x √N grid)
P processes (√P x √P grid)
N/P points per process
#Communications? 4√N/√P (assuming square grid)
#Computations? N/P (assuming square grid)
Communication to computation ratio = ?

27
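
A sketch of the 4-send / 4-receive exchange for the five-point stencil, assuming the ranks are arranged with MPI_Cart_create into a √P x √P grid and each process stores an (m+2) x (m+2) block (m = √N/√P interior points per side) with one ghost layer; rows are contiguous, columns use a strided MPI_Type_vector.

#include <mpi.h>

void halo_exchange_2d(double *u, int m, MPI_Comm cart)
{
    /* u is an (m+2) x (m+2) row-major array with one ghost layer. */
    int ld = m + 2;
    int up, down, left, right;
    MPI_Cart_shift(cart, 0, 1, &up, &down);      /* neighbors along rows    */
    MPI_Cart_shift(cart, 1, 1, &left, &right);   /* neighbors along columns */

    MPI_Datatype col;
    MPI_Type_vector(m, 1, ld, MPI_DOUBLE, &col); /* one strided column */
    MPI_Type_commit(&col);

    /* Top interior row -> up neighbor; fill bottom ghost row from below. */
    MPI_Sendrecv(&u[1*ld + 1],     m, MPI_DOUBLE, up,   0,
                 &u[(m+1)*ld + 1], m, MPI_DOUBLE, down, 0,
                 cart, MPI_STATUS_IGNORE);
    /* Bottom interior row -> down neighbor; fill top ghost row from above. */
    MPI_Sendrecv(&u[m*ld + 1],     m, MPI_DOUBLE, down, 1,
                 &u[0*ld + 1],     m, MPI_DOUBLE, up,   1,
                 cart, MPI_STATUS_IGNORE);
    /* Left interior column -> left neighbor; fill right ghost column. */
    MPI_Sendrecv(&u[1*ld + 1],     1, col, left,  2,
                 &u[1*ld + m + 1], 1, col, right, 2,
                 cart, MPI_STATUS_IGNORE);
    /* Right interior column -> right neighbor; fill left ghost column. */
    MPI_Sendrecv(&u[1*ld + m],     1, col, right, 3,
                 &u[1*ld + 0],     1, col, left,  3,
                 cart, MPI_STATUS_IGNORE);

    MPI_Type_free(&col);
}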
Send / Recv
[Figure: data laid out as two rows (0 1 2 3 and 4 5 6 7) on the sender, transferred with MPI_Send / MPI_Recv into one contiguous buffer 0–7 on the receiver]

28
Send / Recv
[Figure: the same data assembled with MPI_Pack (buf), sent with MPI_Send (buf), received with MPI_Recv (buf), and unpacked with MPI_Unpack (buf) into the contiguous buffer 0–7]

29
MPI_Pack

int MPI_Pack (const void *inbuf, int incount,
              MPI_Datatype datatype,
              void *outbuf, int outsize,
              int *position, MPI_Comm comm)

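
A sketch of the Pack/Unpack pattern from the previous slides, assuming the sender packs one column of an n x n row-major matrix (non-contiguous in memory) and the receiver unpacks it into a contiguous array; MPI_Pack_size provides an upper bound on the buffer size, and the helper names send_column / recv_column are illustrative.

#include <mpi.h>
#include <stdlib.h>

void send_column(const double *A, int n, int col, int dest, MPI_Comm comm)
{
    int size, pos = 0;
    MPI_Pack_size(n, MPI_DOUBLE, comm, &size);   /* upper bound in bytes */
    char *buf = malloc(size);

    for (int i = 0; i < n; i++)                  /* pack A[i][col]       */
        MPI_Pack(&A[(size_t)i * n + col], 1, MPI_DOUBLE,
                 buf, size, &pos, comm);

    MPI_Send(buf, pos, MPI_PACKED, dest, 0, comm);
    free(buf);
}

void recv_column(double *colbuf, int n, int src, MPI_Comm comm)
{
    int size, pos = 0;
    MPI_Pack_size(n, MPI_DOUBLE, comm, &size);
    char *buf = malloc(size);

    MPI_Recv(buf, size, MPI_PACKED, src, 0, comm, MPI_STATUS_IGNORE);
    for (int i = 0; i < n; i++)                  /* mirror the pack loop */
        MPI_Unpack(buf, size, &pos, &colbuf[i], 1, MPI_DOUBLE, comm);
    free(buf);
}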