Unit 3 HPC
Prof. B. J. Dange
Assistant Professor
E-mail :
Contact No: 91301 91301 Ext: 145, 9604146122
Contents
• Basic Communication Operations: One-to-All Broadcast and All-to-One Reduction
• All-to-All Broadcast and Reduction
• All-Reduce and Prefix-Sum Operations
• Scatter and Gather
• All-to-All Personalized Communication
• Circular Shift
• Efficient implementations of these operations must exploit the underlying architecture. For this
reason, we refer to specific architectures (ring, mesh, hypercube) here.
• Recall from our discussion of architectures that communicating a message of size m over an
uncongested network takes time ts + twm.
• We use this as the basis for our analyses. Where necessary, we take congestion into account
explicitly by scaling the tw term.
One-to-All Broadcast and All-to-One Reduction
• In a one-to-all broadcast, one processor has a piece of data (of size m) that it needs to send to
all other processors.
• In all-to-one reduction, each processor has m units of data. These data items must be
combined piece-wise (using some associative operator, such as addition or min), and the
result made available at a target processor.
• The simplest way is to send p – 1 separate messages from the source to the other p – 1
processors; this is not very efficient, since the source serializes all the sends.
• Better: use recursive doubling. The source sends the message to a selected processor; we now
have two independent broadcast problems, each over one half of the machine.
One-to-all broadcast on an eight-node ring. Node 0 is the source of the broadcast. Each message transfer
step is shown by a numbered, dotted arrow from the source of the message to its destination. The
number on an arrow indicates the time step during which the message is transferred.
• Example: consider multiplying an n × n matrix, mapped onto an n × n grid of processors, with
an n × 1 vector whose elements initially reside on the first row of processors.
• The first step of the product requires a one-to-all broadcast of each vector element along the
corresponding column of processors. This can be done concurrently for all n columns.
• The processors then compute the local product of the vector element and the local matrix entry.
• In the final step, the results of these products are accumulated on the first row using n
concurrent all-to-one reduction operations along the columns (using the sum operation), as
sketched below.
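A minimal sketch of this broadcast-multiply-reduce pattern with MPI collectives. The grid setup, the column communicators, and the stand-in values for the matrix and vector entries are our own scaffolding (one entry per process); run it with n*n processes.

#include <mpi.h>
#include <math.h>
#include <stdio.h>

/* Broadcast-multiply-reduce pattern on an n x n process grid.
 * Process (row, col) holds one matrix entry; the vector element
 * for each column starts on the first row of the grid. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int n = (int)(sqrt((double)size) + 0.5);   /* grid side; assumes size == n*n */
    int row = rank / n, col = rank % n;

    /* one communicator per grid column; the rank within it equals row */
    MPI_Comm col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

    double a = 1.0 + row * n + col;            /* stand-in matrix entry        */
    double b = (row == 0) ? 1.0 + col : 0.0;   /* vector element, lives on row 0 */

    /* step 1: one-to-all broadcast of the vector element down each column */
    MPI_Bcast(&b, 1, MPI_DOUBLE, 0, col_comm);

    /* step 2: local product of vector element and matrix entry */
    double prod = a * b;

    /* step 3: n concurrent all-to-one sum reductions along the columns,
     * leaving the accumulated results on the first row */
    double acc = 0.0;
    MPI_Reduce(&prod, &acc, 1, MPI_DOUBLE, MPI_SUM, 0, col_comm);
    if (row == 0) printf("column %d accumulated %g\n", col, acc);

    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}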
• On a two-dimensional square mesh, broadcast and reduction can be performed in two steps:
the first step does the operation along a row, and the second step does it along each column
concurrently.
• A hypercube with 2^d nodes can be regarded as a d-dimensional mesh with two nodes in
each dimension.
• The mesh algorithm can be generalized to a hypercube and the operation is carried out in d
(= log p) steps.
• A balanced binary tree, in which processors sit (logically) at the leaves and internal nodes
serve only as routing elements, admits the same approach. Assume the source processor is
the root of this tree. In the first step, the source sends the data to the right child (assuming
the source is also the left child). The problem has now been decomposed into two problems
with half the number of processors.
• We illustrate the algorithm for a hypercube, but, as seen above, the algorithm adapts to
other architectures.
• The hypercube has 2^d nodes, and my_id is the label of a node; a sketch follows.
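A sketch of this procedure (essentially the ONE_TO_ALL_BC pseudocode of Grama et al.) using MPI point-to-point calls. It assumes the source is node 0 and the communicator has exactly 2^d processes; the function and variable names are ours.

#include <mpi.h>

/* One-to-all broadcast from node 0 on a 2^d-node hypercube.  In step i
 * (dimension d-1 down to 0), only nodes whose i low-order bits are zero
 * take part; each active pair communicates across dimension i. */
void one_to_all_bc(double *x, int count, int my_id, int d, MPI_Comm comm) {
    int mask = (1 << d) - 1;                 /* all d bits set          */
    for (int i = d - 1; i >= 0; i--) {
        mask ^= (1 << i);                    /* clear bit i             */
        if ((my_id & mask) == 0) {           /* active in this step?    */
            int partner = my_id ^ (1 << i);
            if ((my_id & (1 << i)) == 0)     /* lower half sends ...    */
                MPI_Send(x, count, MPI_DOUBLE, partner, 0, comm);
            else                             /* ... upper half receives */
                MPI_Recv(x, count, MPI_DOUBLE, partner, 0, comm,
                         MPI_STATUS_IGNORE);
        }
    }
}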
All-to-All Broadcast and Reduction
• In all-to-all broadcast, each process sends the same m-word message to every other process,
but different processes may broadcast different messages.
• Simplest approach: perform p one-to-all broadcasts. This is not the most efficient way,
though.
• On a ring, each node first sends the data it needs to broadcast to one of its neighbors.
• In subsequent steps, it forwards the data it received from one neighbor to its other
neighbor (see the sketch below).
All-to-all broadcast on an eight-node ring.
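A sketch of the ring procedure with MPI: p processes, m doubles of broadcast data per node; the function and variable names are ours.

#include <mpi.h>
#include <string.h>

/* All-to-all broadcast on a p-node ring: in each of the p-1 steps, every
 * node forwards to its right neighbor the block it most recently received
 * from its left neighbor.  result must hold p*m doubles. */
void all_to_all_bc_ring(const double *my_msg, double *result, int m,
                        int my_id, int p, MPI_Comm comm) {
    int right = (my_id + 1) % p;
    int left  = (my_id - 1 + p) % p;
    memcpy(result + my_id * m, my_msg, m * sizeof(double));
    int cur = my_id;                       /* block obtained most recently */
    for (int step = 1; step < p; step++) {
        int nxt = (cur - 1 + p) % p;       /* label of the incoming block  */
        MPI_Sendrecv(result + cur * m, m, MPI_DOUBLE, right, 0,
                     result + nxt * m, m, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
        cur = nxt;
    }
}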
All-to-All Broadcast and Reduction on a Ring
All-to-All Broadcast on a Mesh
• The operation proceeds in two phases. In the first phase, each row of the mesh performs an
all-to-all broadcast using the ring procedure. In this phase, all nodes collect √p messages
corresponding to the √p nodes of their respective rows. Each node consolidates this
information into a single message of size m√p.
• In the second phase, each column performs an all-to-all broadcast using these consolidated
messages.
All-to-All Reduction
• All-to-all reduction uses the same communication pattern in reverse order. On receiving a
message, a node must combine it with the local copy of the message that has the same
destination as the received message before forwarding the combined message to the next
neighbor.
• On a hypercube, the operation takes log p steps, exchanging messages across a different
dimension in each step, with the message size doubling every step; the total time is derived
below.
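Since the message size doubles in each of the log p steps, the total time (the standard result, in the notation above) is:

T = \sum_{i=1}^{\log p} \bigl( t_s + 2^{i-1} t_w m \bigr) = t_s \log p + t_w m (p - 1)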
All-Reduce and Prefix-Sum Operations
• In the all-reduce operation, each node starts with a buffer of size m, and every node ends
with an identical buffer of size m obtained by combining the original p buffers with an
associative operator.
• This is different from all-to-all reduction, in which p simultaneous all-to-one reductions
take place, each with a different destination for the result.
• In the prefix-sum (scan) operation, we are given p numbers n_0, n_1, ..., n_{p–1}, one on each
node, and must compute the partial sums S_k = n_0 + n_1 + ... + n_k for 0 ≤ k ≤ p – 1.
• Initially, n_k resides on the node labeled k, and at the end of the procedure, the same node
holds S_k.
Computing prefix sums on an eight-node hypercube. At each node, square brackets show the local prefix sum
accumulated in the result buffer and parentheses enclose the contents of the outgoing message buffer for the
next step.
The Prefix-Sum Operation
• The operation can be implemented using the all-to-all broadcast kernel.
• We must account for the fact that in prefix sums, the node with label k uses information
only from the nodes whose labels are less than or equal to k.
• This is implemented using an additional result buffer. The content of an incoming message
is added to the result buffer only if the message comes from a node with a smaller label
than the recipient node.
• The contents of the outgoing message (denoted by parentheses in the figure) are updated
with every incoming message, as in the sketch below.
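A minimal sketch of this two-buffer scheme on a 2^d-node hypercube, with one double per node for brevity; the function and variable names are ours.

#include <mpi.h>

/* Prefix sums on a 2^d-node hypercube.  result accumulates only the
 * contributions of smaller-labeled nodes; msg, the outgoing buffer,
 * accumulates everything seen so far and is what gets forwarded. */
double prefix_sum_hcube(double my_number, int my_id, int d, MPI_Comm comm) {
    double result = my_number;
    double msg = my_number;
    for (int i = 0; i < d; i++) {
        int partner = my_id ^ (1 << i);
        double incoming;
        MPI_Sendrecv(&msg, 1, MPI_DOUBLE, partner, 0,
                     &incoming, 1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        msg += incoming;            /* outgoing buffer always absorbs it */
        if (partner < my_id)        /* result buffer only takes messages */
            result += incoming;     /* from smaller-labeled nodes        */
    }
    return result;
}

MPI offers the same operation directly as the single collective MPI_Scan.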
Scatter and Gather
• In the scatter operation, a single node sends a unique message of size m to every other node
(also called one-to-all personalized communication). In the gather operation, a single node
collects a unique message from each node.
• While the scatter operation is fundamentally different from broadcast, the algorithmic
structure is similar, except for differences in message sizes (messages get smaller in scatter
and stay constant in broadcast).
• The gather operation is exactly the inverse of the scatter operation and can be executed as
such (see the sketch below).
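In MPI, both operations are single collectives; a minimal usage sketch with root node 0 (the block size M and the fill values are our choices):

#include <mpi.h>
#include <stdlib.h>

enum { M = 4 };                    /* words per node; value is ours */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *all = NULL;
    if (rank == 0) {               /* only the root holds the full array */
        all = malloc(p * M * sizeof(double));
        for (int i = 0; i < p * M; i++) all[i] = i;
    }
    double mine[M];

    /* scatter: root sends a distinct M-word block to every node */
    MPI_Scatter(all, M, MPI_DOUBLE, mine, M, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    /* gather: the exact inverse; root collects one block per node */
    MPI_Gather(mine, M, MPI_DOUBLE, all, M, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(all);
    MPI_Finalize();
    return 0;
}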
All-to-All Personalized Communication
• In all-to-all personalized communication, each node has a distinct message of size m for
every other node.
• This is unlike all-to-all broadcast, in which each node sends the same message to all other
nodes.
• On a ring, each node first sends all p – 1 pieces destined for other nodes as one consolidated
message of size m(p – 1) to one of its neighbors.
• In each subsequent step, a node extracts the piece meant for it from the data received and
forwards the remaining pieces, each of size m, to the next node (p – 2 pieces after the first
step); the forwarded message shrinks by m at every step.
All-to-all personalized communication on a six-node ring. The label of each message is of the form {x,y},
where x is the label of the node that originally owned the message, and y is the label of the node that is
its final destination. A label of the form ({x1,y1}, {x2,y2}, …, {xn,yn}) indicates a message that is
formed by concatenating n individual messages.
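In MPI, the entire pattern (on any topology) is the single collective MPI_Alltoall; a minimal usage sketch with M doubles per source-destination pair (M and the fill values are our choices):

#include <mpi.h>
#include <stdlib.h>

enum { M = 4 };  /* words per {source, destination} pair; value is ours */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double *sendbuf = malloc(p * M * sizeof(double));
    double *recvbuf = malloc(p * M * sizeof(double));
    for (int j = 0; j < p * M; j++)
        sendbuf[j] = rank;         /* block j/M is the message {rank, j/M} */
    /* block j of sendbuf goes to process j; block j of recvbuf arrives
     * from process j */
    MPI_Alltoall(sendbuf, M, MPI_DOUBLE, recvbuf, M, MPI_DOUBLE,
                 MPI_COMM_WORLD);
    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}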
All-to-All Personalized Communication on a Ring: Cost
• We have p – 1 steps in all.
• In step i, the message size is m(p – i).
• The total time is given by: T = Σ_{i=1}^{p–1} (ts + twm(p – i)) = (ts + twmp/2)(p – 1).
• On a mesh, each node first groups its p messages according to the columns of their
destination nodes, and the √p nodes of each row perform an all-to-all personalized
communication with consolidated messages of size m√p.
• The messages in each node are then sorted again, this time according to the rows of their
destination nodes, and the columns repeat the procedure.
The distribution of messages at the beginning of each phase of all-to-all personalized communication on a
3 x 3 mesh. At the end of the second phase, node i has messages ({0,i},…,{8,i}), where 0 ≤ i ≤ 8. The
groups of nodes communicating together in each phase are enclosed in dotted boundaries.
All-to-All Personalized Communication on a Mesh: Cost
• Time for the first phase is identical to that in a ring with √p processors, i.e., (ts +
twmp/2)(√p – 1).
• The time for the second phase is identical to the first. The total time is therefore twice this,
i.e., T = (2ts + twmp)(√p – 1).
• It can be shown that the time for the local rearrangement of messages is much less than this
communication time.
All-to-All Personalized Communication on a Hypercube
• At any stage in all-to-all personalized communication on a hypercube, every node holds p
packets of size m each.
• While communicating in a particular dimension, every node sends p/2 of these packets
(consolidated as one message).
• A node must rearrange its messages locally before each of the log p communication steps.
• A node must choose its communication partner in each step so that the hypercube links do
not suffer congestion.
• An optimal schedule instead uses p – 1 direct pairwise exchanges: in the jth communication
step, node i exchanges data with node (i XOR j), as sketched below.
• With E-cube routing, all paths in every communication step of this schedule are
congestion-free, and none of the bidirectional links carries more than one message in the
same direction.
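A sketch of this pairwise schedule with MPI point-to-point calls, using the same buffer layout as the MPI_Alltoall example above (block j of sendbuf is destined for node j); the function and variable names are ours.

#include <mpi.h>

/* All-to-all personalized communication on p = 2^d nodes via pairwise
 * exchange: in step j, node i swaps m-word blocks with node (i XOR j). */
void alltoall_pairwise(const double *sendbuf, double *recvbuf, int m,
                       int rank, int p, MPI_Comm comm) {
    for (int k = 0; k < m; k++)            /* keep our own block */
        recvbuf[rank * m + k] = sendbuf[rank * m + k];
    for (int j = 1; j < p; j++) {
        int partner = rank ^ j;            /* step j: partner is rank XOR j */
        MPI_Sendrecv(sendbuf + partner * m, m, MPI_DOUBLE, partner, 0,
                     recvbuf + partner * m, m, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}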
Circular Shift
• A circular q-shift is a permutation in which node i sends a data packet to node
(i + q) mod p in a p-node ensemble, for 0 < q < p.
• Mesh algorithms follow from the ring algorithm: all processors shift along one dimension
first, and then along the next dimension; a sketch follows the figure.
The mapping of an eight-node linear array onto a three-dimensional hypercube to perform a circular 5-shift as
a combination of a 4-shift and a 1-shift.
Circular Shift on a Hypercube
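A minimal sketch of the q-shift as the figure's combination of power-of-two shifts, using MPI with one data word per node for brevity (the routine itself works for any node count; names are ours):

#include <mpi.h>

/* Circular q-shift on p nodes, decomposed into shifts by powers of two:
 * for every set bit 2^j of q, each node forwards its current value to
 * node (rank + 2^j) mod p, as in the 5-shift = 4-shift + 1-shift figure. */
double circular_shift(double value, int q, int rank, int p, MPI_Comm comm) {
    for (int j = 0; (1 << j) <= q; j++) {
        if (q & (1 << j)) {
            int dest = (rank + (1 << j)) % p;
            int src  = (rank - (1 << j) + p) % p;
            MPI_Sendrecv_replace(&value, 1, MPI_DOUBLE, dest, 0,
                                 src, 0, comm, MPI_STATUS_IGNORE);
        }
    }
    return value;   /* now holds the value from node (rank - q) mod p */
}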
The All-Reduce Operation
• An all-reduce can be performed as an all-to-one reduction followed by a one-to-all broadcast.
The reduction can in turn be implemented as an all-to-all reduction followed by a gather, and
the broadcast as a scatter followed by an all-to-all broadcast.
• The intervening gather and scatter operations cancel each other. Therefore, an all-reduce
operation requires only an all-to-all reduction and an all-to-all broadcast; a usage sketch
follows.
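In MPI the whole operation is the single collective MPI_Allreduce; a minimal usage sketch (the sum operator and the stand-in local values are our choices):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double local = rank + 1.0, global;   /* stand-in local contribution */
    /* combine with + and leave the identical result on every process */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    printf("node %d: global sum = %g\n", rank, global);  /* p*(p+1)/2 */
    MPI_Finalize();
    return 0;
}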
References
• Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, "Introduction to
Parallel Computing", 2nd edition, Addison-Wesley, 2003. ISBN 0-201-64865-2.