Figure 4: Performance of allgather for short messages (64 nodes). The size on the x-axis is the total amount of data gathered on each process.

[Plot residue from adjacent figures: time (microsec.) vs. message length (KB/MB); IBM SP panel comparing recursive doubling and ring.]
Figure 6: Performance of long-message broadcast (64 nodes). [Surviving panel labels: Myrinet Cluster (MPICH Old, MPICH New); IBM SP (IBM MPI, MPICH New); axes: time (microsec.) vs. message length (MB).]

isend, each process calculates the source or destination as (rank + i) % p, which results in a scattering of the sources and destinations among the processes. If the loop index were directly used as the source or target rank, all processes would try to communicate with rank 0 first, then with rank 1, and so on, resulting in a bottleneck.
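As a rough sketch of this posting order (illustrative only, not the MPICH source; the block layout, the MPI_BYTE type, and the function name are assumptions), each process offsets the loop index by its own rank before issuing the nonblocking operations:

#include <mpi.h>
#include <stdlib.h>

/* Sketch: post the nonblocking receives and sends of an isend/irecv all-to-all
 * with sources and destinations rotated by the caller's rank, so that the
 * processes do not all target rank 0 in the same iteration.
 * Block i of sendbuf goes to rank i; block j of recvbuf comes from rank j. */
static void alltoall_isend_irecv(char *sendbuf, char *recvbuf,
                                 int blocksize, MPI_Comm comm)
{
    int rank, p, i;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    MPI_Request *reqs = (MPI_Request *) malloc(2 * (size_t) p * sizeof(MPI_Request));

    for (i = 0; i < p; i++) {
        int src = (rank + i) % p;            /* scattered source order */
        MPI_Irecv(recvbuf + (size_t) src * blocksize, blocksize, MPI_BYTE,
                  src, 0, comm, &reqs[i]);
    }
    for (i = 0; i < p; i++) {
        int dst = (rank + i) % p;            /* scattered destination order */
        MPI_Isend(sendbuf + (size_t) dst * blocksize, blocksize, MPI_BYTE,
                  dst, 0, comm, &reqs[p + i]);
    }
    MPI_Waitall(2 * p, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}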
The new all-to-all in MPICH uses four different algorithms depending on the message size. For short messages (≤ 256 bytes per message), we use the index algorithm by Bruck et al. [7]. It is a store-and-forward algorithm that takes ⌈lg p⌉ steps at the expense of some extra data communication ((n/2) lg p β instead of nβ, where n is the total amount of data to be sent or received by any process). Therefore, it is a good algorithm for very short messages where latency is an issue.

Figure 7 illustrates the Bruck algorithm for an example with six processes. The algorithm begins by doing a local copy and "upward" shift of the data blocks from the input buffer to the output buffer such that the data block to be sent by each process to itself is at the top of the output buffer. To achieve this, process i must rotate its data up by i blocks. In each communication step k (0 ≤ k < ⌈lg p⌉), process i sends to rank (i + 2^k) (with wrap-around) all those data blocks whose kth bit is 1, receives data from rank (i − 2^k), and stores the incoming data into blocks whose kth bit is 1 (that is, overwriting the data that was just sent). In other words, in step 0, all the data blocks whose least significant bit is 1 are sent and received (blocks 1, 3, and 5 in our example). In step 1, all the data blocks whose second bit is 1 are sent and received, namely, blocks 2 and 3. After a total of ⌈lg p⌉ steps, all the data gets routed to the right destination process, but the data blocks are not in the right order in the output buffer. A final step in which each process does a local inverse shift of the blocks (memory copies) places the data in the right order.

The beauty of the Bruck algorithm is that it is a logarithmic algorithm for short-message all-to-all that does not need any extra bookkeeping or control information for routing the right data to the right process; that is taken care of by the mathematics of the algorithm. It does need a memory permutation in the beginning and another at the end, but for short messages, where communication latency dominates, the performance penalty of memory copying is small.

If n is the total amount of data a process needs to send to or receive from all other processes, the time taken by the Bruck algorithm can be calculated as follows. If the number of processes is a power of two, each process sends and receives n/2 amount of data in each step, for a total of lg p steps. Therefore, the time taken by the algorithm is T_bruck = lg p α + (n/2) lg p β. If the number of processes is not a power of two, in the final step each process must communicate (n/p)(p − 2^⌊lg p⌋) data. Therefore, the time taken in the non-power-of-two case is T_bruck = ⌈lg p⌉ α + ((n/2) lg p + (n/p)(p − 2^⌊lg p⌋)) β.

Figure 8 shows the performance of the Bruck algorithm versus the old algorithm in MPICH (isend-irecv) for short messages. The Bruck algorithm performs significantly better because of its logarithmic latency term. As the message size is increased, however, latency becomes less of an issue, and the extra bandwidth cost of the Bruck algorithm begins to show. Beyond a per-process message size of about 256 bytes, the isend-irecv algorithm performs better. Therefore, for medium-sized messages (256 bytes to 32 KB per message), we use the irecv-isend algorithm, which works well in this range.
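Because the routing of the Bruck algorithm falls out of the block indices, the schedule for one process can be written down directly. The following standalone sketch (illustrative values of p and the rank; the initial rotation, the data movement, and the final inverse rotation are omitted) prints which blocks move in each step:

#include <stdio.h>

/* Bruck all-to-all schedule for one process i out of p: in step k, the blocks
 * whose kth bit is 1 are sent to rank (i + 2^k) mod p, and the same block
 * positions are overwritten with data received from rank (i - 2^k) mod p. */
int main(void)
{
    int p = 6, i = 2;                        /* example: 6 processes, rank 2 */
    for (int k = 0; (1 << k) < p; k++) {
        int dst = (i + (1 << k)) % p;
        int src = (i - (1 << k) + p) % p;
        printf("step %d: send to %d, recv from %d, blocks:", k, dst, src);
        for (int b = 1; b < p; b++)          /* block 0 (own data) never moves */
            if (b & (1 << k))
                printf(" %d", b);
        printf("\n");
    }
    return 0;
}

For p = 6 this prints three steps moving blocks {1, 3, 5}, {2, 3}, and {4, 5}, matching the example above.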
Figure 7: Bruck algorithm for all-to-all. The number ij in each box represents the data to be sent from process i to process j. The shaded boxes indicate the data to be communicated in the next step. [Panels include: after communication step 1, after communication step 2, after local inverse rotation.]
Figure 8: Performance of Bruck all-to-all versus the old algorithm in MPICH (isend-irecv) for short messages. The size on the x-axis is the amount of data sent by each process to every other process.

For long messages and a power-of-two number of processes, we use a pairwise-exchange algorithm, which takes p − 1 steps. In each step k, 1 ≤ k < p, each process calculates its target process as (rank ⊕ k) (exclusive-or operation) and exchanges data directly with that process. This algorithm, however, does not work if the number of processes is not a power of two. For the non-power-of-two case, we use an algorithm in which, in step k, each process receives data from rank − k and sends data to rank + k. In both these algorithms, data is directly communicated from source to destination, with no intermediate steps. The time taken by these algorithms is given by T_long = (p − 1)α + nβ.
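The partner computation for both cases can be sketched in a few lines (illustrative example values; the data exchange itself is omitted):

#include <stdio.h>

/* Partner schedule for the long-message all-to-all.  For a power-of-two p,
 * step k pairs each process with (rank XOR k); otherwise each process sends
 * to (rank + k) mod p and receives from (rank - k) mod p. */
int main(void)
{
    int p = 8, rank = 3;                     /* example values */
    int pof2 = ((p & (p - 1)) == 0);
    for (int k = 1; k < p; k++) {
        if (pof2)
            printf("step %d: exchange with %d\n", k, rank ^ k);
        else
            printf("step %d: send to %d, recv from %d\n",
                   k, (rank + k) % p, (rank - k + p) % p);
    }
    return 0;
}

The exclusive-or pairing is symmetric (applying rank ⊕ k twice returns rank), so both partners in a step select each other and can exchange directly.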
4.4 Reduce-Scatter

MPI_Reduce_scatter performs a global reduction and scatters the result among all the processes. The old algorithm in MPICH implements reduce-scatter by doing a binomial tree reduce to rank 0 followed by a linear scatterv. This algorithm takes lg p + p − 1 steps, and the bandwidth term is (lg p + (p − 1)/p)nβ. Therefore, the time taken by this algorithm is T_old = (lg p + p − 1)α + (lg p + (p − 1)/p)nβ + n lg p γ.

In our new implementation of reduce-scatter, for short messages, we use different algorithms depending on whether the reduction operation is commutative or noncommutative. The commutative case occurs most commonly because all the predefined reduction operations in MPI (such as MPI_SUM, MPI_MAX) are commutative.

For commutative operations, we use a recursive-halving algorithm, which is analogous to the recursive-doubling algorithm used for allgather (see Figure 9). In the first step, each process exchanges data with a process that is a distance p/2 away: each process sends the data needed by all processes in the other half, receives the data needed by all processes in its own half, and performs the reduction operation on the received data. The reduction can be done because the operation is commutative.
Figure 9: Recursive halving for commutative reduce-scatter (example with eight processes, steps 1–3).
In the second step, each process exchanges data with a process that is a distance p/4 away. This procedure continues recursively, halving the data communicated at each step, for a total of lg p steps. Therefore, if p is a power of two, the time taken by this algorithm is T_rec_half = lg p α + ((p − 1)/p)nβ + ((p − 1)/p)nγ. We use this algorithm for messages up to 512 KB.

If p is not a power of two, we first reduce the number of processes to the nearest lower power of two by having the first few even-numbered processes send their data to the neighboring odd-numbered process (rank + 1). These odd-numbered processes do a reduce on the received data, compute the result for their left neighbors as well as for themselves, and send the neighbors' portion of the scattered result back to them at the end. This adds two steps and roughly nβ of extra communication and nγ of extra computation, so the time taken is approximately T_rec_half = (⌊lg p⌋ + 2)α + 2nβ + n(1 + (p − 1)/p)γ. This cost is approximate because some imbalance exists in the amount of work each process does, since some processes do the work of their neighbors as well.
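A compact sketch of the power-of-two case is given below (hypothetical helper, not the MPICH source; it assumes MPI_SUM on doubles and p equal-sized blocks per process). Each process halves its working block range at every step, exchanges the half it gives up with a partner at the matching distance, and reduces the received half into the half it keeps:

#include <mpi.h>
#include <stdlib.h>

/* Recursive halving for a commutative reduce-scatter (sum), assuming the
 * number of processes is a power of two and 'work' holds p blocks of
 * 'blockcount' doubles.  On return, block 'rank' of 'work' holds the fully
 * reduced result for this process.  Illustrative only. */
static void reduce_scatter_rec_halving(double *work, int blockcount, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);                 /* assumed to be a power of two */

    double *tmp = (double *) malloc((size_t)(p / 2) * blockcount * sizeof(double));

    int lo = 0, hi = p;                      /* block range this process still owns */
    while (hi - lo > 1) {
        int half = (hi - lo) / 2;
        int mid  = lo + half;
        int partner, send_lo, keep_lo;

        if (rank < mid) {                    /* keep [lo,mid), give away [mid,hi) */
            partner = rank + half;
            send_lo = mid;  keep_lo = lo;  hi = mid;
        } else {                             /* keep [mid,hi), give away [lo,mid) */
            partner = rank - half;
            send_lo = lo;   keep_lo = mid; lo = mid;
        }

        int count = half * blockcount;
        MPI_Sendrecv(work + (size_t) send_lo * blockcount, count, MPI_DOUBLE,
                     partner, 0,
                     tmp, count, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);

        for (int j = 0; j < count; j++)      /* reduce the partner's contribution */
            work[(size_t) keep_lo * blockcount + j] += tmp[j];
    }
    free(tmp);
}

The data exchanged, (p/2 + p/4 + ... + 1) blocks, sums to (p − 1)/p of the vector, which is where the ((p − 1)/p)nβ and ((p − 1)/p)nγ terms above come from.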
If the reduction operation is not commutative, recursive halving will not work (unless the data is permuted suitably [29]). Instead, we use a recursive-doubling algorithm similar to the one in allgather. In the first step, pairs of neighboring processes exchange data; in the second step, pairs of processes at distance 2 apart exchange data; in the third step, processes at distance 4 apart exchange data; and so forth. However, more data is communicated than in allgather. In step 1, processes exchange all the data except the data needed for their own result (n − n/p); in step 2, processes exchange all data except the data needed by themselves and by the processes they communicated with in the previous step (n − 2n/p); in step 3, it is (n − 4n/p); and so forth. Therefore, the time taken by this algorithm is T_short = lg p α + n(lg p − (p − 1)/p)β + n(lg p − (p − 1)/p)γ. We use this algorithm for very short messages (< 512 bytes).

For long messages (≥ 512 KB in the case of commutative operations and ≥ 512 bytes in the case of noncommutative operations), we use a pairwise-exchange algorithm that takes p − 1 steps. In step i, each process sends data to (rank + i), receives data from (rank − i), and performs the local reduction. The data exchanged is only the data needed for the scattered result on the process (n/p). The time taken by this algorithm is T_long = (p − 1)α + ((p − 1)/p)nβ + ((p − 1)/p)nγ. Note that this algorithm has the same bandwidth requirement as the recursive-halving algorithm. Nonetheless, we use this algorithm for long messages because it performs much better than recursive halving (similar to the results for recursive doubling versus the ring algorithm for long-message allgather).

Figure 10: Performance of reduce-scatter for short messages on the IBM SP (64 nodes) and for long messages on the Myrinet cluster (32 nodes). [Panels: IBM SP, IBM MPI vs. MPICH New; Myrinet cluster, MPICH Old vs. MPICH New.]

The SKaMPI benchmark, by default, uses a noncommutative user-defined reduction operation. Since commutative operations are more commonly used, we modified the benchmark to use a commutative operation, namely, MPI_SUM. Figure 10 shows the performance of the new algorithm for short messages on the IBM SP and on the Myrinet cluster. The performance is significantly better than that of the algorithm used in IBM's MPI on the SP and several times better than the old algorithm (reduce + scatterv) used in MPICH on the Myrinet cluster.

The above algorithms will also work for irregular reduce-scatter operations, but they are not specifically optimized for that case.
4.5 Reduce and Allreduce

MPI_Reduce performs a global reduction operation and returns the result to the specified root, whereas MPI_Allreduce returns the result of the reduction to all processes.

The old algorithm for reduce in MPICH uses a binomial tree, which takes lg p steps, and the data communicated at each step is n. Therefore, the time taken by this algorithm is T_tree = ⌈lg p⌉(α + nβ + nγ). The old algorithm for allreduce simply does a reduce to rank 0 followed by a broadcast.
5.1 Vector Halving and Distance Doubling Algorithm

This algorithm is a combination of a reduce-scatter implemented with recursive vector halving and distance doubling, followed either by a binomial-tree gather (for reduce) or by an allgather implemented with recursive vector doubling and distance halving (for allreduce).

Since these recursive algorithms require a power-of-two number of processes, if the number of processes is not a power of two, we first reduce it to the nearest lower power of two (p′ = 2^⌊lg p⌋) by removing r = p − p′ extra processes as follows. In the first 2r processes (ranks 0 to 2r − 1), all the even ranks send the second half of the input vector to their right neighbor (rank + 1), and all the odd ranks send the first half of the input vector to their left neighbor (rank − 1), as illustrated in Figure 12. The even ranks compute the reduction on the first half of the vector and the odd ranks compute the reduction on the second half. The odd ranks then send the result to their left neighbors (the even ranks). As a result, the even ranks among the first 2r processes now contain the reduction with the input vector on their right neighbors (the odd ranks). These odd ranks do not participate in the rest of the algorithm, which leaves behind a power-of-two number of processes. The first r even-ranked processes and the last p − 2r processes are now renumbered from 0 to p′ − 1, p′ being a power of two.
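A small standalone sketch (example value p = 13, matching Figure 12; illustrative only) shows which ranks drop out and how the survivors are renumbered:

#include <stdio.h>

/* Preprocessing for a non-power-of-two process count: p' = 2^floor(lg p)
 * processes remain, the r = p - p' odd ranks among the first 2r processes
 * drop out after the first reduction, and the survivors are renumbered
 * 0 .. p'-1. */
int main(void)
{
    int p = 13;                              /* example from Figure 12 */
    int pprime = 1;
    while (pprime * 2 <= p)
        pprime *= 2;                         /* largest power of two <= p */
    int r = p - pprime;

    printf("p = %d, p' = %d, r = %d\n", p, pprime, r);
    for (int rank = 0; rank < p; rank++) {
        if (rank < 2 * r && rank % 2 == 1)
            printf("rank %2d drops out\n", rank);
        else
            printf("rank %2d continues as %d\n",
                   rank, rank < 2 * r ? rank / 2 : rank - r);
    }
    return 0;
}

For p = 13 this reports p′ = 8 and r = 5, with ranks 1, 3, 5, 7, and 9 dropping out and the remaining processes renumbered 0–7, as in the example discussed next.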
Figure 12 illustrates the algorithm for an example on 13 processes. The input vectors and all reduction results are divided into 8 parts (A, B, ..., H), where 8 is the largest power of two less than 13, and denoted as A–H with the contributing ranks as a subscript. After the first reduction, process P0 has computed A–D_{0–1}, which is the reduction result of the first half (A–D) of the input vector from processes 0 and 1. Similarly, P1 has computed E–H_{0–1}, P2 has computed A–D_{2–3}, and so forth. The odd ranks then send their half to the even ranks on their left: P1 sends E–H_{0–1} to P0, P3 sends E–H_{2–3} to P2, and so forth. This completes the first step, which takes (1 + f_α)α + (n/2)(1 + f_β)β + (n/2)γ time. P1, P3, P5, P7, and P9 do not participate in the remainder of the algorithm, and the remaining processes are renumbered from 0–7.

The remaining processes now perform a reduce-scatter by using recursive vector halving and distance doubling. The even-ranked processes send the second half of their buffer to rank′ + 1 and the odd-ranked processes send the first half of their buffer to rank′ − 1. All processes then compute the reduction between the local buffer and the received buffer. In the next lg p′ − 1 steps, the buffers are recursively halved, and the distance is doubled. At the end, each of the p′ processes has 1/p′ of the total reduction result. All these recursive steps take lg p′ α + ((p′ − 1)/p′)(nβ + nγ) time. The next part of the algorithm is either an allgather or a gather, depending on whether the operation to be implemented is an allreduce or a reduce.

Allreduce: To implement allreduce, we do an allgather using recursive vector doubling and distance halving. In the first step, process pairs exchange 1/p′ of the buffer to achieve 2/p′ of the result vector; in the next step 2/p′ of the buffer is exchanged to get 4/p′ of the result, and so forth. After lg p′ steps, the p′ processes have the total reduction result. This allgather part costs lg p′ α + ((p′ − 1)/p′)nβ. If the number of processes is not a power of two, the total result vector must be sent to the r processes that were removed in the first step, which results in an additional overhead of α_uni + nβ_uni. The total allreduce operation therefore takes the following time:

• If p is a power of two: T_all,h&d,p=2^exp = 2 lg p α + 2nβ + nγ − (1/p)(2nβ + nγ) ≈ 2 lg p α + 2nβ + nγ
• If p is not a power of two: T_all,h&d,p≠2^exp = (2 lg p′ + 1 + 2f_α)α + (2 + (1 + 3f_β)/2)nβ + (3/2)nγ − (1/p′)(2nβ + nγ) ≈ (3 + 2⌊lg p⌋)α + 4nβ + (3/2)nγ

This algorithm is good for long vectors and power-of-two numbers of processes. For non-power-of-two numbers of processes, the data transfer overhead is doubled, and the computation overhead is increased by a factor of 3/2. The binary blocks algorithm described in Section 5.2 can reduce this overhead in many cases.

Reduce: For reduce, a binomial-tree gather is performed by using recursive vector doubling and distance halving, which takes lg p′ α_uni + ((p′ − 1)/p′)nβ_uni time. In the non-power-of-two case, if the root happens to be one of those odd-ranked processes that would normally be removed in the first step, then the roles of this process and its partner in the first step are interchanged after the first reduction in the reduce-scatter phase, which causes no additional overhead. The total reduce operation therefore takes the following time:

• If p is a power of two: T_red,h&d,p=2^exp = lg p(1 + f_α)α + (1 + f_β)nβ + nγ − (1/p)((1 + f_β)nβ + nγ) ≈ 2 lg p α + 2nβ + nγ
• If p is not a power of two: T_red,h&d,p≠2^exp = lg p′(1 + f_α)α + (1 + f_α)α + (1 + (1 + f_β)/2 + f_β)nβ + (3/2)nγ − (1/p′)((1 + f_β)nβ + nγ) ≈ (2 + 2⌊lg p⌋)α + 3nβ + (3/2)nγ
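To make this concrete, plugging numbers into the allreduce approximations above (arithmetic on the stated bounds only, not a new result):

  p = 16 (power of two):   T ≈ 2 lg(16) α + 2nβ + nγ = 8α + 2nβ + nγ
  p = 13 (⌊lg 13⌋ = 3):    T ≈ (3 + 2·3)α + 4nβ + (3/2)nγ = 9α + 4nβ + (3/2)nγ

The β term doubles and the γ term grows by the factor 3/2, which is exactly the overhead the binary blocks algorithm of Section 5.2 tries to reduce.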
5.2 Binary Blocks Algorithm

This algorithm reduces some of the load imbalance in the recursive halving and doubling algorithm when the number of processes is not a power of two. The algorithm starts with a binary-block decomposition of all processes in blocks with power-of-two numbers of processes (see the example in Figure 13). Each block executes its own reduce-scatter with the recursive vector halving and distance doubling algorithm described above. Then, starting with the smallest block, the intermediate result (or the input vector in the case of a 2^0 block) is split into the segments of the intermediate result in the next higher block and sent to the processes in that block, and those processes compute the reduction on the segment. This does cause a load imbalance in computation and communication compared with the execution in the larger blocks. For example, in the third exchange step in the 2^3 block, each process sends one segment, receives one segment, and computes the reduction of one segment (P0 sends B, receives A, and computes the reduction on A). The load imbalance is introduced by the smaller blocks 2^2 and 2^0: in the 2^2 block, each process receives and reduces two segments (for example, A–B on P8), whereas in the 2^0 block (P12), each process has to send as many messages as the ratio of the two block sizes (here 2^2/2^0). At the end of the first part, the highest block must be recombined with the next smaller block, and the ratio of the block sizes again determines the overhead.

We see that the maximum ratio of the sizes of two successive blocks (equivalently, the maximum difference of their exponents), especially in the low range of exponents, determines the load imbalance. Let us define δ_expo,max as the maximal difference of two consecutive exponents in the binary representation of the number of processes. For example, for 100 = 2^6 + 2^5 + 2^2, δ_expo,max = max(6 − 5, 5 − 2) = 3. If δ_expo,max is small, the binary blocks algorithm can perform well.
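The binary-block decomposition and δ_expo,max follow directly from the binary representation of p; a standalone sketch (example value taken from the text) is:

#include <stdio.h>

/* Binary-block decomposition of p and delta_expo_max, the largest gap between
 * consecutive exponents in the binary representation of p. */
int main(void)
{
    int p = 100;                             /* example: 100 = 2^6 + 2^5 + 2^2 */
    int prev = -1, delta_max = 0;

    printf("blocks of p = %d:", p);
    for (int e = 30; e >= 0; e--)            /* highest exponent first */
        if (p & (1 << e)) {
            printf(" 2^%d", e);
            if (prev >= 0 && prev - e > delta_max)
                delta_max = prev - e;
            prev = e;
        }
    printf("\ndelta_expo_max = %d\n", delta_max);
    return 0;
}

For p = 100 this prints the blocks 2^6, 2^5, and 2^2 and δ_expo,max = 3, as in the example above.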
Allreduce: For allreduce, the second part is an allgather implemented with recursive vector doubling and distance halving in each block. For this purpose, data must be provided to the processes in the smaller blocks with a pair of messages from processes of the next larger block, as shown in Figure 13.

Reduce: For reduce, if the root is outside the largest block, then the intermediate result segment of rank 0 is sent to the root, and the root plays the role of rank 0. A binomial tree is used to gather the result segments to the root process.

We note that if the number of processes is a power of two, the binary blocks algorithm is identical to the recursive halving and doubling algorithm.

5.3 Ring Algorithm

This algorithm uses a pairwise-exchange algorithm for the reduce-scatter phase (see Section 4.4). For allreduce, it uses a ring algorithm to do the allgather, and, for reduce, all processes directly send their result segment to the root. This algorithm makes good use of bandwidth when the number of processes is not a power of two, but its latency scales with the number of processes. Therefore, this algorithm should be used only for a small or medium number of processes or for large vectors. The time taken is T_all,ring = 2(p − 1)α + 2nβ + nγ − (1/p)(2nβ + nγ) for allreduce and T_red,ring = (p − 1)(α + α_uni) + n(β + β_uni) + nγ − (1/p)(n(β + β_uni) + nγ) for reduce.

5.4 Choosing the Fastest Algorithm

Based on the number of processes and the buffer size, the reduction routine must decide which algorithm to use. This decision is not easy and depends on a number of factors. We experimentally determined which algorithm works best for different buffer sizes and numbers of processes on the Cray T3E 900. The results for allreduce are shown in Figure 14. The figure indicates which is the fastest allreduce algorithm for each parameter pair (number of processes, buffer size) and for the operation MPI_SUM with datatype MPI_DOUBLE. For buffer sizes less than or equal to 32 bytes, recursive doubling is the best; for buffer sizes less than or equal to 1 KB, the vendor's algorithm (for power-of-two numbers of processes) and binomial tree (for non-power-of-two) are the best, but not much better than recursive doubling; for longer buffers, the ring algorithm is good for some buffer sizes and some numbers of processes less than 32. In general, on the Cray T3E 900, the binary blocks algorithm is faster if δ_expo,max < lg(vector length in bytes)/2.0 − 2.5, the vector size is at least 16 KB, and more than 32 processes are used. In a few cases, for example, 33 processes and less than 32 KB, recursive halving and doubling is the best.

Figure 15 shows the bandwidths obtained by the various algorithms for a 32 KB buffer on the T3E. For this buffer size, the new algorithms are clearly better than the vendor's algorithm (Cray MPT.1.4.0.4) and the binomial tree algorithm for all numbers of processes. We observe that the bandwidth of the binary blocks algorithm depends strongly on δ_expo,max and that recursive halving and doubling is faster on 33, 65, 66, 97, and 128–131 processes. The ring algorithm is faster on 3, 5, 7, 9–11, and 17 processes.
Figure 12: Allreduce using the recursive halving and doubling algorithm. The intermediate results after each
communication step, including the reduction operation in the reduce-scatter phase, are shown. The dotted
frames show the additional overhead caused by a non-power-of-two number of processes.
Figure 15: Bandwidth comparison for allreduce (MPI_DOUBLE, MPI_SUM) with 32 KB vectors on a Cray T3E 900. [x-axis: number of MPI processes, 2–256.]

6 Conclusions and Future Work

Our results demonstrate that optimized algorithms for collective communication can provide substantial performance benefits and that, to achieve the best performance, one needs to use a number of different algorithms and select the right algorithm for a particular message size and number of processes.
Figure 16: Ratio of the bandwidth of the fastest of the new algorithms (not including recursive doubling) and the vendor's allreduce on the IBM SP at SDSC with 1 MPI process per CPU (left) and per SMP node (right).

Figure 17: Ratio of the bandwidth of the fastest of the new algorithms (not including recursive doubling) and the old MPICH-1 algorithm on a Myrinet cluster with dual-CPU PCs (HELICS cluster, University of Heidelberg) and 1 MPI process per CPU (left) and per SMP node (right).

Figure 18: Ratio of the bandwidth of the fastest of the new algorithms and the vendor's algorithm for allreduce (left) and reduce (right) with operation MPI_SUM (first row) and MPI_MAXLOC (second row) on a Cray T3E 900.

Figure 19: Benefit of new allreduce and reduce algorithms optimized for long vectors on the Cray T3E.
Determining the right cutoff points for switching between the different algorithms is tricky, however, and they may be different for different machines and networks. At present, we use experimentally determined cutoff points. In the future, we intend to determine the cutoff points automatically based on system parameters.

MPI also defines irregular ("v") versions of many of the collectives, where the operation counts may be different on different processes. For these operations, we currently use the same techniques as for the regular versions described in this paper. Further optimization of the irregular collectives is possible, and we plan to optimize them in the future.

In this work, we assume a flat communication model in which any pair of processes can communicate at the same cost. Although these algorithms will work even on hierarchical networks, they may not be optimized for such networks. We plan to extend this work to hierarchical networks and develop algorithms that are optimized for architectures comprising clusters of SMPs and clusters distributed over a wide area, such as the TeraGrid [26]. We also plan to explore the use of one-sided communication to improve the performance of collective operations.

The source code for the algorithms in Section 4 is available in MPICH-1.2.6 and MPICH2 1.0. Both MPICH-1 and MPICH2 can be downloaded from www.mcs.anl.gov/mpi/mpich.

Acknowledgments

This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract W-31-109-ENG-38. The authors would like to acknowledge their colleagues and others who provided suggestions and helpful comments. They would especially like to thank Jesper Larsson Träff for helpful discussions on optimized reduction algorithms and Gerhard Wellein, Thomas Ludwig, and Ana Kovatcheva for their benchmarking support. We also thank the reviewers for their detailed comments.

Biographies

Rajeev Thakur is a Computer Scientist in the Mathematics and Computer Science Division at Argonne National Laboratory. He received a B.E. in Computer Engineering from the University of Bombay, India, in 1990, an M.S. in Computer Engineering from Syracuse University in 1992, and a Ph.D. in Computer Engineering from Syracuse University in 1995. His research interests are in the area of high-performance computing in general and high-performance communication and I/O in particular. He was a member of the MPI Forum and participated actively in the definition of the I/O part of the MPI-2 standard. He is also the author of a widely used, portable implementation of MPI-IO, called ROMIO. He is currently involved in the development of MPICH-2, a new portable implementation of MPI-2. Rajeev is a co-author of the book "Using MPI-2: Advanced Features of the Message Passing Interface" published by MIT Press. He is an associate editor of IEEE Transactions on Parallel and Distributed Systems, has served on the program committees of several conferences, and has also served as a co-guest editor for a special issue of the Int'l Journal of High-Performance Computing Applications on "I/O in Parallel Applications."

Rolf Rabenseifner studied mathematics and physics at the University of Stuttgart. He is head of the Department of Parallel Computing at the High-Performance Computing Center Stuttgart (HLRS). He led the projects DFN-RPC, a remote procedure call tool, and MPI-GLUE, the first metacomputing MPI combining different vendors' MPIs without losing the full MPI interface. In his dissertation work at the University of Stuttgart, he developed a controlled logical clock as global time for trace-based profiling of parallel and distributed applications. He is an active member of the MPI-2 Forum. In 1999, he was an invited researcher at the Center for High-Performance Computing at Dresden University of Technology. His current research interests include MPI profiling, benchmarking, and optimization. Each year he teaches parallel programming models in a workshop format at many universities and labs in Germany (http://www.hlrs.de/people/rabenseifner/).

William Gropp received his B.S. in Mathematics from Case Western Reserve University in 1977, an M.S. in Physics from the University of Washington in 1978, and a Ph.D. in Computer Science from Stanford in 1982. He held the positions of assistant (1982–1988) and associate (1988–1990) professor in the Computer Science Department at Yale University. In 1990, he joined the Numerical Analysis group at Argonne, where he is a Senior Computer Scientist and Associate Director of the Mathematics and Computer Science Division, a Senior Scientist in the Department of Computer Science at the University of Chicago, and a Senior Fellow in the Argonne-Chicago Computation Institute. His research interests are in parallel computing, software for scientific computing, and numerical methods for partial differential equations. He has played a major role in the development of the MPI message-passing standard.
He is co-author of the most widely used implementation of MPI, MPICH, and was involved in the MPI Forum as a chapter author for both MPI-1 and MPI-2. He has written many books and papers on MPI including "Using MPI" and "Using MPI-2". He is also one of the designers of the PETSc parallel numerical library, and has developed efficient and scalable parallel algorithms for the solution of linear and nonlinear equations.

References

[1] Albert Alexandrov, Mihai F. Ionescu, Klaus E. Schauser, and Chris Scheiman. LogGP: Incorporating long messages into the LogP model for parallel computation. Journal of Parallel and Distributed Computing, 44(1):71–79, 1997.

[2] M. Barnett, S. Gupta, D. Payne, L. Shuler, R. van de Geijn, and J. Watts. Interprocessor collective communication library (InterCom). In Proceedings of Supercomputing '94, November 1994.

[3] M. Barnett, R. Littlefield, D. Payne, and R. van de Geijn. Global combine on mesh architectures with wormhole routing. In Proceedings of the 7th International Parallel Processing Symposium, April 1993.

[4] Gregory D. Benson, Cho-Wai Chu, Qing Huang, and Sadik G. Caglar. A comparison of MPICH allgather algorithms on switched networks. In Jack Dongarra, Domenico Laforenza, and Salvatore Orlando, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 10th European PVM/MPI Users' Group Meeting, pages 335–343. Lecture Notes in Computer Science 2840, Springer, September 2003.

[5] S. Bokhari. Complete exchange on the iPSC/860. Technical Report 91-4, ICASE, NASA Langley Research Center, 1991.

[6] S. Bokhari and H. Berryman. Complete exchange on a circuit switched mesh. In Proceedings of the Scalable High Performance Computing Conference, pages 300–306, 1992.

[7] Jehoshua Bruck, Ching-Tien Ho, Shlomo Kipnis, Eli Upfal, and Derrick Weathersby. Efficient algorithms for all-to-all communications in multiport message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 8(11):1143–1156, November 1997.

[8] Ernie W. Chan, Marcel F. Heimlich, Avi Purakayastha, and Robert A. van de Geijn. On optimizing collective communication. In Proceedings of the 2004 IEEE International Conference on Cluster Computing, September 2004.

[9] David E. Culler, Richard M. Karp, David A. Patterson, Abhijit Sahay, Klaus E. Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards a realistic model of parallel computation. In Principles and Practice of Parallel Programming, pages 1–12, 1993.

[10] Debra Hensgen, Raphael Finkel, and Udi Manber. Two algorithms for barrier synchronization. International Journal of Parallel Programming, 17(1):1–17, 1988.

[11] Roger W. Hockney. The communication challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Computing, 20(3):389–398, March 1994.

[12] Giulio Iannello. Efficient algorithms for the reduce-scatter operation in LogGP. IEEE Transactions on Parallel and Distributed Systems, 8(9):970–982, September 1997.

[13] L. V. Kale, Sameer Kumar, and Krishnan Vardarajan. A framework for collective personalized communication. In Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS '03), 2003.

[14] N. Karonis, B. de Supinski, I. Foster, W. Gropp, E. Lusk, and J. Bresnahan. Exploiting hierarchy in parallel computer networks to optimize collective operation performance. In Proceedings of the Fourteenth International Parallel and Distributed Processing Symposium (IPDPS '00), pages 377–384, 2000.

[15] T. Kielmann, R. F. H. Hofman, H. E. Bal, A. Plaat, and R. A. F. Bhoedjang. MagPIe: MPI's collective communication operations for clustered wide area systems. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '99), pages 131–140. ACM, May 1999.

[16] P. Mitra, D. Payne, L. Shuler, R. van de Geijn, and J. Watts. Fast collective communication libraries, please. In Proceedings of the Intel Supercomputing Users' Group Meeting, June 1995.

[17] MPICH – A portable implementation of MPI. http://www.mcs.anl.gov/mpi/mpich.

[18] Rolf Rabenseifner. Effective bandwidth (b_eff) benchmark. http://www.hlrs.de/mpi/b_eff.

[19] Rolf Rabenseifner. New optimized MPI reduce algorithm. http://www.hlrs.de/organization/par/services/models/mpi/myreduce.html.

[20] Rolf Rabenseifner. Automatic MPI counter profiling of all users: First results on a CRAY T3E 900-512. In Proceedings of the Message Passing Interface Developer's and User's Conference 1999 (MPIDC '99), pages 77–85, March 1999.

[21] Rolf Rabenseifner and Gerhard Wellein. Communication and optimization aspects of parallel programming models on hybrid architectures. International Journal of High Performance Computing Applications, 17(1):49–62, 2003.

[22] Peter Sanders and Jesper Larsson Träff. The hierarchical factor algorithm for all-to-all communication.