PDC - CO1: Basic Operations & Cost Analysis
BASIC COMMUNICATION OPERATIONS
• Proper implementation of these basic communication operations on various parallel architectures is key to the efficient execution of the parallel algorithms that use them.
• The following basic communication operations are commonly used on various parallel architectures:
• One-to-all broadcast and all-to-one reduction
• All-to-all broadcast and reduction
• All-reduce operations
• Prefix-sum operations
• Scatter and gather
• All-to-all personalized communication
ONE-TO-ALL BROADCAST AND ALL-TO-ONE REDUCTION
• In one-to-all broadcast, one processor has a piece of data (of size m) that it needs to send to every other processor.
• The dual of one-to-all broadcast is all-to-one reduction.
• In all-to-one reduction, each processor has m units of data. These data items must be combined piece-wise (using some associative operator, such as addition or min), and the result made available at a target processor (see the sketch below).
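In MPI, these two dual operations map directly onto the library calls MPI_Bcast and MPI_Reduce. A minimal sketch (the payload 42, the sum operator, and root 0 are arbitrary illustrative choices):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* One-to-all broadcast: node 0 sends its value to every node. */
    int data = (rank == 0) ? 42 : 0;    /* 42 is an arbitrary payload */
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* All-to-one reduction (the dual): every node contributes one value;
       the piece-wise sums accumulate at target node 0. */
    int sum = 0;
    MPI_Reduce(&data, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("reduced sum = %d\n", sum);   /* 42 * p */
    MPI_Finalize();
    return 0;
}
```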
ONE-TO-ALL BROADCAST AND ALL-TO-ONE REDUCTION ON RINGS
• The simplest way is to send p-1 messages from the source to the other p-1 processors, but this is not very efficient.
• Use recursive doubling: the source sends the message to a selected processor, leaving two independent subproblems defined over the two halves of the machine (see the sketch after this list).
• Reduction can be performed in an identical fashion by inverting the process.
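A minimal MPI sketch of recursive doubling using point-to-point sends; the payload 99 is an arbitrary illustrative choice, and in practice library broadcasts such as MPI_Bcast implement this (and further optimizations) internally:

```c
#include <mpi.h>
#include <stdio.h>

/* One-to-all broadcast by recursive doubling (source = node 0).
   After step i, 2^(i+1) nodes hold the message; the guard on
   rank + mask handles process counts that are not powers of two. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int data = (rank == 0) ? 99 : -1;   /* only the source holds the message */

    for (int mask = 1; mask < p; mask <<= 1) {
        if (rank < mask) {
            /* I already have the data: pass it to my partner, which
               splits the remaining nodes into two independent halves. */
            if (rank + mask < p)
                MPI_Send(&data, 1, MPI_INT, rank + mask, 0, MPI_COMM_WORLD);
        } else if (rank < 2 * mask) {
            MPI_Recv(&data, 1, MPI_INT, rank - mask, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
    printf("node %d has %d\n", rank, data);
    MPI_Finalize();
    return 0;
}
```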
ONE-TO-ALL BROADCAST
ALL-TO-ONE REDUCTION
Reduction on an eight-node ring with node 0 as the destination of the reduction.
ALL-TO-ALL BROADCAST ON A MESH
All-to-all broadcast on a 3 x 3 mesh. The groups of nodes communicating with each other in each phase are enclosed by dotted boundaries. By the end of the second phase, all nodes get (0,1,2,3,4,5,6,7,8) (that is, a message from each of the nine nodes).
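MPI does not expose the mesh phases directly, but MPI_Allgather delivers the same end result: every node ends up with a message from every node. A minimal sketch, where each node's "message" is simply its own rank:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int mine = rank;        /* each node's "message" is its own rank */
    int all[64];            /* assumes p <= 64 for this sketch */

    /* All-to-all broadcast: afterwards every node's all[] holds
       (0, 1, ..., p-1), i.e. one message from each node. */
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < p; i++) printf("%d ", all[i]);
        printf("\n");
    }
    MPI_Finalize();
    return 0;
}
```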
ALL-TO-ALL REDUCTION
ALL-TO-ALL BROADCAST AND REDUCTION ON A RING
BROADCAST AND REDUCTION ON A MESH
• We can view each row and column of a square mesh of p nodes as a linear array of √p nodes.
• Broadcast and reduction operations can be performed in two steps: the first step does the operation along a row, and the second step does it along each column concurrently (see the sketch below).
• This process generalizes to higher dimensions as well.
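A minimal MPI sketch of the two-step mesh broadcast, assuming p is a perfect square and node 0 is the source; MPI_Comm_split builds one communicator per row and one per column:

```c
#include <mpi.h>
#include <stdio.h>

/* Two-step broadcast on a sqrt(p) x sqrt(p) mesh: along the source's
   row first, then down every column concurrently. A sketch assuming
   p is a perfect square and node 0 is the source. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int side = 1;
    while (side * side < p) side++;      /* side = sqrt(p) */
    int row = rank / side, col = rank % side;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);  /* one comm per row */
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);  /* one comm per column */

    int data = (rank == 0) ? 7 : 0;      /* arbitrary payload at the source */
    if (row == 0)
        MPI_Bcast(&data, 1, MPI_INT, 0, row_comm);   /* step 1: along row 0 */
    MPI_Bcast(&data, 1, MPI_INT, 0, col_comm);       /* step 2: down each column */

    printf("node %d has %d\n", rank, data);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```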
BROADCAST AND REDUCTION ON A BALANCED BINARY TREE
• Send/Receive:
• A process sends a message directly to another process, which posts a matching receive.
• Used for direct communication between two processes.
• Example: MPI_Send and MPI_Recv in MPI (Message Passing Interface); see the sketch below.
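A minimal MPI_Send/MPI_Recv sketch; it assumes at least two processes, and the payload 123 and tag 0 are arbitrary:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int msg;    /* run with at least two processes */
    if (rank == 0) {
        msg = 123;  /* arbitrary payload */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", msg);
    }
    MPI_Finalize();
    return 0;
}
```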
BROADCAST
SCATTER & GATHER
• Scatter
• A process divides data into chunks and sends each chunk to a different process.
• Used when a large dataset needs to be distributed across processes.
• Example: MPI MPI_Scatter.
• Gather
• A process collects data from all other processes into a single process.
• Example: MPI MPI_Gather (both calls appear in the sketch below).
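A minimal MPI_Scatter/MPI_Gather sketch: root 0 splits an array across processes, each process doubles its chunk, and root 0 collects the results (the data values are arbitrary):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int chunks[64], mine, results[64];   /* assumes p <= 64 for this sketch */
    if (rank == 0)
        for (int i = 0; i < p; i++) chunks[i] = i * 10;   /* arbitrary data */

    /* Scatter: root 0 splits chunks[] and sends one element to each process. */
    MPI_Scatter(chunks, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);

    mine *= 2;   /* each process works on its own chunk */

    /* Gather: root 0 collects the processed chunks into results[]. */
    MPI_Gather(&mine, 1, MPI_INT, results, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < p; i++) printf("%d ", results[i]);
        printf("\n");
    }
    MPI_Finalize();
    return 0;
}
```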
ALL-TO-ALL COMMUNICATION
REDUCE
COST ANALYSIS IN COMMUNICATION
• Parallel cost analysis
• This static cost analysis method uses three phases to estimate the cost of parallel execution:
• Block-level analysis: estimates the serial costs of the blocks between synchronization points
• Distributed flow graph (DFG) construction: captures the parallelism, waiting, and idle times in the distributed system
• Parallel cost calculation: the parallel cost is the path with the highest cost in the DFG
COST ANALYSIS IN COMMUNICATION
• Optimizing cost
WORKING PROCESS OF COST ANALYSIS
• Parallel cost analysis works in three phases:
• (1) it performs a block-level analysis to estimate the serial costs of the blocks between synchronization points in the program;
• (2) it then constructs a distributed flow graph (DFG) to capture the parallelism, the waiting, and the idle times at the locations of the distributed system;
• (3) the parallel cost is finally obtained as the path of maximal cost in the DFG (see the sketch below). The correctness of this analysis has been proved, and a prototype implementation supports an experimental evaluation of its accuracy and feasibility.
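To make phase (3) concrete, here is a toy sketch (not the analysis's actual implementation): once block costs and synchronization edges are known, the parallel cost is the maximum-cost path through the DFG, computable in a single pass over a topologically ordered DAG. The graph and costs below are invented for illustration:

```c
#include <stdio.h>

#define N 5

int main(void) {
    /* Hypothetical DFG: node i has cost[i] (from the block-level analysis);
       edge[j][i] = 1 means block j must finish before block i (a
       synchronization edge). Nodes are listed in topological order. */
    int cost[N] = {3, 5, 2, 4, 1};
    int edge[N][N] = {
        {0, 1, 1, 0, 0},   /* 0 -> 1, 0 -> 2 : fork onto two locations */
        {0, 0, 0, 1, 0},   /* 1 -> 3 */
        {0, 0, 0, 1, 0},   /* 2 -> 3 : join (waiting time shows up here) */
        {0, 0, 0, 0, 1},   /* 3 -> 4 */
        {0, 0, 0, 0, 0}
    };

    int finish[N];                 /* cost of the max-cost path ending at i */
    int parallel_cost = 0;
    for (int i = 0; i < N; i++) {
        finish[i] = cost[i];
        for (int j = 0; j < i; j++)
            if (edge[j][i] && finish[j] + cost[i] > finish[i])
                finish[i] = finish[j] + cost[i];
        if (finish[i] > parallel_cost)
            parallel_cost = finish[i];
    }
    printf("parallel cost = %d\n", parallel_cost);   /* 3+5+4+1 = 13 */
    return 0;
}
```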
THANK YOU