To be published in the International Journal of High Performance Computing Applications, 2005. © Sage Publications.

Optimization of Collective Communication Operations in MPICH


Rajeev Thakur∗ Rolf Rabenseifner† William Gropp∗

∗ Mathematics and Computer Science Division, Argonne National Laboratory, 9700 S. Cass Avenue, Argonne, IL 60439, USA. {thakur, gropp}@mcs.anl.gov
† High Performance Computing Center (HLRS), University of Stuttgart, Allmandring 30, D-70550 Stuttgart, Germany. [email protected], www.hlrs.de/people/rabenseifner/

Abstract

We describe our work on improving the performance of collective communication operations in MPICH for clusters connected by switched networks. For each collective operation, we use multiple algorithms depending on the message size, with the goal of minimizing latency for short messages and minimizing bandwidth use for long messages. Although we have implemented new algorithms for all MPI collective operations, because of limited space we describe only the algorithms for allgather, broadcast, all-to-all, reduce-scatter, reduce, and allreduce. Performance results on a Myrinet-connected Linux cluster and an IBM SP indicate that, in all cases, the new algorithms significantly outperform the old algorithms used in MPICH on the Myrinet cluster, and, in many cases, they outperform the algorithms used in IBM's MPI on the SP. We also explore in further detail the optimization of two of the most commonly used collective operations, allreduce and reduce, particularly for long messages and non-power-of-two numbers of processes. The optimized algorithms for these operations perform several times better than the native algorithms on a Myrinet cluster, IBM SP, and Cray T3E. Our results indicate that to achieve the best performance for a collective communication operation, one needs to use a number of different algorithms and select the right algorithm for a particular message size and number of processes.

1 Introduction

Collective communication is an important and frequently used component of MPI and offers implementations considerable room for optimization. MPICH [17], although widely used as an MPI implementation, has until recently had fairly rudimentary implementations of the collective operations. This paper describes our efforts at improving the performance of collective operations in MPICH. Our initial target architecture is one that is very popular among our users, namely, clusters of machines connected by a switch, such as Myrinet or the IBM SP switch. Our approach has been to identify the best algorithms known in the literature, improve on them or develop new algorithms where necessary, and implement them efficiently. For each collective operation, we use multiple algorithms based on message size: The short-message algorithms aim to minimize latency, and the long-message algorithms aim to minimize bandwidth use. We use experimentally determined cutoff points to switch between different algorithms depending on the message size and number of processes. We have implemented new algorithms in MPICH (MPICH 1.2.6 and MPICH2 0.971) for all the MPI collective operations, namely, scatter, gather, allgather, broadcast, all-to-all, reduce, allreduce, reduce-scatter, scan, barrier, and their variants. Because of limited space, however, we describe only the new algorithms for allgather, broadcast, all-to-all, reduce-scatter, reduce, and allreduce.

A five-year profiling study of applications running in production mode on the Cray T3E 900 at the University of Stuttgart revealed that more than 40% of the time spent in MPI functions was spent in the two functions MPI_Allreduce and MPI_Reduce and that 25% of all execution time was spent on program runs that involved a non-power-of-two number of processes [20]. We therefore investigated in further detail how to optimize allreduce and reduce. We present a detailed study of different ways of optimizing allreduce and reduce, particularly for long messages and non-power-of-two numbers of processes, both of which occur frequently according to the profiling study.

The rest of this paper is organized as follows. In Section 2, we describe related work in the area of collective communication. In Section 3, we describe the cost model used to guide the selection of algorithms. In Section 4, we describe the new algorithms in MPICH and their performance. In Section 5, we investigate in further detail the optimization of reduce and allreduce. In Section 6, we conclude with a brief discussion of future work.

2 Related Work

Early work on collective communication focused on developing optimized algorithms for particular architectures, such as hypercube, mesh, or fat tree, with an emphasis on minimizing link contention, node contention, or the distance between communicating nodes [3, 5, 6, 23]. More recently, Vadhiyar et al. have developed automatically tuned collective communication algorithms [30]. They run experiments to measure the performance of different algorithms for a collective communication operation under different conditions (message size, number of processes) and then use the best algorithm for a given set of conditions. Researchers in Holland and at Argonne have optimized MPI collective communication for wide-area distributed environments [14, 15]. In such environments, the goal is to minimize communication over slow wide-area links at the expense of more communication over faster local-area connections. Researchers have also developed collective communication algorithms for clusters of SMPs [22, 25, 27, 28], where communication within an SMP is done differently from communication across a cluster. Some efforts have focused on using different algorithms for different message sizes, such as the work by Van de Geijn et al. [2, 8, 16, 24], by Rabenseifner on reduce and allreduce [19], and by Kale et al. on all-to-all communication [13]. Benson et al. studied the performance of the allgather operation in MPICH on Myrinet and TCP networks and developed a dissemination allgather based on the dissemination barrier algorithm [4]. Bruck et al. proposed algorithms for allgather and all-to-all that are particularly efficient for short messages [7]. Iannello developed efficient algorithms for the reduce-scatter operation in the LogGP model [12].

3 Cost Model

We use a simple model to estimate the cost of the collective communication algorithms in terms of latency and bandwidth use, and to guide the selection of algorithms for a particular collective communication operation. This model is similar to the one used by Van de Geijn [2, 16, 24], Hockney [11], and others. Although more sophisticated models such as LogP [9] and LogGP [1] exist, this model is sufficient for our needs.

We assume that the time taken to send a message between any two nodes can be modeled as α + nβ, where α is the latency (or startup time) per message, independent of message size, β is the transfer time per byte, and n is the number of bytes transferred. We assume further that the time taken is independent of how many pairs of processes are communicating with each other, independent of the distance between the communicating nodes, and that the communication links are bidirectional (that is, a message can be transferred in both directions on the link in the same time as in one direction). The node's network interface is assumed to be single ported; that is, at most one message can be sent and one message can be received simultaneously. In the case of reduction operations, we assume that γ is the computation cost per byte for performing the reduction operation locally on any process.

This cost model assumes that all processes can send and receive one message at the same time, regardless of the source and destination. Although this is a good approximation, many networks are faster if pairs of processes exchange data with each other, rather than if a process sends to and receives from different processes [4]. Therefore, for the further optimization of reduction operations (Section 5), we refine the cost model by defining two costs: α + nβ is the time taken for bidirectional communication between a pair of processes, and α_uni + nβ_uni is the time taken for unidirectional communication from one process to another. We also define the ratios f_α = α_uni/α and f_β = β_uni/β. These ratios are normally in the range 0.5 (simplex network) to 1.0 (full-duplex network).
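To make the model concrete, the following small C program evaluates the two refined costs for a given message size. It is an illustration only; the values of α, β, and the ratios are assumed placeholders, not measured parameters of any particular network.

#include <stdio.h>

/* Point-to-point cost model of Section 3 (illustrative sketch).
 * alpha: latency per message; beta: transfer time per byte. */
static double t_bidirectional(double n, double alpha, double beta)
{
    return alpha + n * beta;            /* pairwise exchange between two processes */
}

static double t_unidirectional(double n, double alpha, double beta,
                               double f_alpha, double f_beta)
{
    /* alpha_uni = f_alpha * alpha, beta_uni = f_beta * beta */
    return f_alpha * alpha + n * f_beta * beta;
}

int main(void)
{
    double alpha = 50e-6, beta = 5e-9;  /* assumed: 50 us latency, 5 ns per byte */
    double f_alpha = 0.8, f_beta = 0.7; /* assumed ratios between 0.5 and 1.0 */
    double n = 1 << 20;                 /* 1 MB message */
    printf("bidirectional: %g s, unidirectional: %g s\n",
           t_bidirectional(n, alpha, beta),
           t_unidirectional(n, alpha, beta, f_alpha, f_beta));
    return 0;
}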

4 Algorithms

In this section we describe the new algorithms and their performance. We measured performance by using the SKaMPI benchmark [31] on two platforms: a Linux cluster at Argonne connected with Myrinet 2000 and the IBM SP at the San Diego Supercomputer Center. On the Myrinet cluster we used MPICH-GM and compared the performance of the new algorithms with the old algorithms in MPICH-GM. On the IBM SP, we used IBM's MPI and compared the performance of the new algorithms with the algorithms used in IBM's MPI. On both systems, we ran one MPI process per node. We implemented the new algorithms as functions on top of MPI point-to-point operations, so that we could compare performance simply by linking or not linking the new functions.

4.1 Allgather

MPI_Allgather is a gather operation in which the data contributed by each process is gathered on all processes, instead of just the root process as in MPI_Gather. The old algorithm for allgather in MPICH uses a ring method in which the data from each process is sent around a virtual ring of processes. In the first step, each process i sends its contribution to process i + 1 and receives the contribution from process i − 1 (with wrap-around). From the second step onward, each process i forwards to process i + 1 the data it received from process i − 1 in the previous step. If p is the number of processes, the entire algorithm takes p − 1 steps. If n is the total amount of data to be gathered on each process, then at every step each process sends and receives n/p amount of data. Therefore, the time taken by this algorithm is given by T_ring = (p − 1)α + ((p − 1)/p)nβ. Note that the bandwidth term cannot be reduced further because each process must receive n/p data from p − 1 other processes. The latency term, however, can be reduced by using an algorithm that takes lg p steps. We consider two such algorithms: recursive doubling and the Bruck algorithm [7].

Figure 1: Recursive doubling for allgather

4.1.1 Recursive Doubling

Figure 1 illustrates how recursive doubling works. In the first step, processes that are a distance 1 apart exchange their data. In the second step, processes that are a distance 2 apart exchange their own data as well as the data they received in the previous step. In the third step, processes that are a distance 4 apart exchange their own data as well as the data they received in the previous two steps. In this way, for a power-of-two number of processes, all processes get all the data in lg p steps. The amount of data exchanged by each process is n/p in the first step, 2n/p in the second step, and so forth, up to (2^(lg p − 1))n/p in the last step. Therefore, the total time taken by this algorithm is T_rec_dbl = lg p α + ((p − 1)/p)nβ.

Recursive doubling works very well for a power-of-two number of processes but is tricky to get right for a non-power-of-two number of processes. We have implemented the non-power-of-two case as follows. At each step of recursive doubling, if any set of exchanging processes is not a power of two, we do additional communication in the peer (power-of-two) set in a logarithmic fashion to ensure that all processes get the data they would have gotten had the number of processes been a power of two. This extra communication is necessary for the subsequent steps of recursive doubling to work correctly. The total number of steps for the non-power-of-two case is bounded by 2⌊lg p⌋.

4.1.2 Bruck Algorithm

The Bruck algorithm for allgather [7] (referred to as concatenation) is a variant of the dissemination algorithm for barrier, described in [10]. Both algorithms take ⌈lg p⌉ steps in all cases, even for non-power-of-two numbers of processes. In the dissemination algorithm for barrier, in each step k (0 ≤ k < ⌈lg p⌉), process i sends a (zero-byte) message to process (i + 2^k) and receives a (zero-byte) message from process (i − 2^k) (with wrap-around). If the same order were used to perform an allgather, it would require communicating noncontiguous data in each step in order to get the right data to the right process (see [4] for details). The Bruck algorithm avoids this problem nicely by a simple modification to the dissemination algorithm in which, in each step k, process i sends data to process (i − 2^k) and receives data from process (i + 2^k), instead of the other way around. The result is that all communication is contiguous, except that at the end, the blocks in the output buffer must be shifted locally to place them in the right order, which is a local memory-copy operation.

Figure 2 illustrates the Bruck algorithm for an example with six processes. The algorithm begins by copying the input data on each process to the top of the output buffer. In each step k, process i sends to the destination (i − 2^k) all the data it has so far and stores the data it receives (from rank (i + 2^k)) at the end of the data it currently has. This procedure continues for ⌊lg p⌋ steps. If the number of processes is not a power of two, an additional step is needed in which each process sends the first (p − 2^⌊lg p⌋) blocks from the top of its output buffer to the destination and appends the data it receives to the data it already has. Each process now has all the data it needs, but the data is not in the right order in the output buffer: The data on process i is shifted "up" by i blocks. Therefore, a simple local shift of the blocks downwards by i blocks brings the data into the desired order. The total time taken by this algorithm is T_bruck = ⌈lg p⌉ α + ((p − 1)/p)nβ.

Figure 2: Bruck allgather
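As an illustration of the communication pattern just described, the following C sketch implements the Bruck allgather for one block of blocksize bytes per process. The function name and the use of MPI_BYTE blocks are assumptions made for this example; it is not the MPICH source code.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void bruck_allgather(const void *sendbuf, void *recvbuf,
                     int blocksize, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    char *tmp = malloc((size_t)p * blocksize);
    /* Step 0 of the algorithm: copy my own block to the top of the buffer. */
    memcpy(tmp, sendbuf, blocksize);

    int nblocks = 1;                      /* blocks accumulated so far */
    for (int pof2 = 1; pof2 < p; pof2 *= 2) {
        int src = (rank + pof2) % p;      /* receive from rank + 2^k */
        int dst = (rank - pof2 + p) % p;  /* send to rank - 2^k */
        /* Last step may be partial when p is not a power of two. */
        int count = (pof2 <= p - nblocks) ? pof2 : p - nblocks;
        MPI_Sendrecv(tmp, count * blocksize, MPI_BYTE, dst, 0,
                     tmp + (size_t)nblocks * blocksize, count * blocksize,
                     MPI_BYTE, src, 0, comm, MPI_STATUS_IGNORE);
        nblocks += count;
    }

    /* Undo the implicit rotation: block j in tmp holds the contribution of
     * rank (rank + j) mod p, so shift everything down by rank blocks. */
    for (int j = 0; j < p; j++)
        memcpy((char *)recvbuf + (size_t)((rank + j) % p) * blocksize,
               tmp + (size_t)j * blocksize, blocksize);
    free(tmp);
}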

4.1.3 Performance

The Bruck algorithm has lower latency than recursive doubling for non-power-of-two numbers of processes. For power-of-two numbers of processes, however, the Bruck algorithm requires local memory permutation at the end, whereas recursive doubling does not. In practice, we find that the Bruck algorithm is best for short messages and non-power-of-two numbers of processes; recursive doubling is best for power-of-two numbers of processes and short or medium-sized messages; and the ring algorithm is best for long messages and any number of processes and also for medium-sized messages and non-power-of-two numbers of processes.

Figure 3 shows the advantage of the Bruck algorithm over recursive doubling for short messages and non-power-of-two numbers of processes because it takes fewer steps. For power-of-two numbers of processes, however, recursive doubling performs better because of the pairwise nature of its communication pattern and because it does not need any memory permutation. As the message size increases, the Bruck algorithm suffers because of the memory copies. In MPICH, therefore, we use the Bruck algorithm for short messages (< 80 KB total data gathered) and non-power-of-two numbers of processes, and recursive doubling for power-of-two numbers of processes and short or medium-sized messages (< 512 KB total data gathered). For short messages, the new allgather performs significantly better than the old allgather in MPICH, as shown in Figure 4.

Figure 3: Performance of recursive doubling versus Bruck allgather for power-of-two and non-power-of-two numbers of processes (message size 16 bytes per process).

For long messages, the ring algorithm performs better than recursive doubling (see Figure 5). We believe this is because it uses a nearest-neighbor communication pattern, whereas in recursive doubling, processes that are much farther apart communicate. To confirm this hypothesis, we used the b_eff MPI benchmark [18], which measures the performance of about 48 different communication patterns, and found that, for long messages on both the Myrinet cluster and the IBM SP, some communication patterns (particularly nearest neighbor) achieve more than twice the bandwidth of other communication patterns. In MPICH, therefore, for long messages (≥ 512 KB total data gathered) and any number of processes and also for medium-sized messages (≥ 80 KB and < 512 KB total data gathered) and non-power-of-two numbers of processes, we use the ring algorithm.
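The cutoff logic just described can be summarized in a small selection routine. The sketch below uses the 80 KB and 512 KB thresholds quoted in the text; the enum and function names are hypothetical, not the identifiers used in MPICH.

#include <stddef.h>

enum allgather_alg { ALLGATHER_BRUCK, ALLGATHER_REC_DBL, ALLGATHER_RING };

/* total_bytes is the total amount of data gathered on each process. */
static enum allgather_alg choose_allgather(size_t total_bytes, int p)
{
    int pof2 = (p & (p - 1)) == 0;      /* is p a power of two? */
    if (total_bytes < 80 * 1024 && !pof2)
        return ALLGATHER_BRUCK;         /* short messages, non-power-of-two p */
    if (total_bytes < 512 * 1024 && pof2)
        return ALLGATHER_REC_DBL;       /* short/medium messages, power-of-two p */
    return ALLGATHER_RING;              /* long messages; or medium-sized and non-power-of-two p */
}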

Figure 4: Performance of allgather for short messages (64 nodes). The size on the x-axis is the total amount of data gathered on each process.

Figure 5: Ring algorithm versus recursive doubling for long-message allgather (64 nodes). The size on the x-axis is the total amount of data gathered on each process.

4.2 Broadcast

The old algorithm for broadcast in MPICH is the commonly used binomial tree algorithm. In the first step, the root sends data to process (root + p/2). This process and the root then act as new roots within their own subtrees and recursively continue this algorithm. This communication takes a total of ⌈lg p⌉ steps. The amount of data communicated by a process at any step is n. Therefore, the time taken by this algorithm is T_tree = ⌈lg p⌉(α + nβ).

This algorithm is good for short messages because it has a logarithmic latency term. For long messages, however, a better algorithm has been proposed by Van de Geijn et al. that has a lower bandwidth term [2, 24]. In this algorithm, the message to be broadcast is first divided up and scattered among the processes, similar to an MPI_Scatter. The scattered data is then collected back to all processes, similar to an MPI_Allgather. The time taken by this algorithm is the sum of the times taken by the scatter, which is lg p α + ((p − 1)/p)nβ for a binomial tree algorithm, and the allgather, for which we use either recursive doubling or the ring algorithm depending on the message size. Therefore, for very long messages where we use the ring allgather, the time taken by the broadcast is T_vandegeijn = (lg p + p − 1)α + 2((p − 1)/p)nβ.

Comparing this time with that for the binomial tree algorithm, we see that for long messages (where the latency term can be ignored) and when lg p > 2 (or p > 4), the Van de Geijn algorithm is better than binomial tree. The maximum improvement in performance that can be expected is (lg p)/2. In other words, the larger the number of processes, the greater the expected improvement in performance. Figure 6 shows the performance for long messages of the new algorithm versus the old binomial tree algorithm in MPICH as well as the algorithm used by IBM's MPI on the SP. In both cases, the new algorithm performs significantly better. In MPICH, therefore, we use the binomial tree algorithm for short messages (< 12 KB) or when the number of processes is less than 8, and the Van de Geijn algorithm otherwise (long messages and number of processes ≥ 8).
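For illustration, the scatter-followed-by-allgather structure of the Van de Geijn broadcast can be sketched with the standard MPI collectives, as below. This is a simplified sketch (it assumes the message size n is divisible by p) and not the actual MPICH implementation, which uses its own scatter and allgather code.

#include <mpi.h>

void scatter_allgather_bcast(char *buf, int n, int root, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int chunk = n / p;   /* assume n is divisible by p for simplicity */

    /* Phase 1: the root scatters one chunk to each process; process i
     * receives its chunk directly into position i*chunk of its buffer. */
    if (rank == root)
        MPI_Scatter(buf, chunk, MPI_BYTE, MPI_IN_PLACE, chunk, MPI_BYTE,
                    root, comm);
    else
        MPI_Scatter(NULL, chunk, MPI_BYTE, buf + rank * chunk, chunk,
                    MPI_BYTE, root, comm);

    /* Phase 2: an allgather (recursive doubling or ring inside MPICH)
     * collects all chunks on all processes, completing the broadcast. */
    MPI_Allgather(MPI_IN_PLACE, chunk, MPI_BYTE, buf, chunk, MPI_BYTE, comm);
}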

Figure 6: Performance of long-message broadcast (64 nodes)

4.3 All-to-All

All-to-all communication is a collective operation in which each process has unique data to be sent to every other process. The old algorithm for all-to-all in MPICH does not attempt to schedule communication. Instead, each process posts all the MPI_Irecvs in a loop, then all the MPI_Isends in a loop, followed by an MPI_Waitall. Instead of using the loop index i as the source or destination process for the irecv or isend, each process calculates the source or destination as (rank + i) % p, which results in a scattering of the sources and destinations among the processes. If the loop index were directly used as the source or target rank, all processes would try to communicate with rank 0 first, then with rank 1, and so on, resulting in a bottleneck.

The new all-to-all in MPICH uses four different algorithms depending on the message size. For short messages (≤ 256 bytes per message), we use the index algorithm by Bruck et al. [7]. It is a store-and-forward algorithm that takes ⌈lg p⌉ steps at the expense of some extra data communication ((n/2) lg p β instead of nβ, where n is the total amount of data to be sent or received by any process). Therefore, it is a good algorithm for very short messages where latency is an issue.

Figure 7 illustrates the Bruck algorithm for an example with six processes. The algorithm begins by doing a local copy and "upward" shift of the data blocks from the input buffer to the output buffer such that the data block to be sent by each process to itself is at the top of the output buffer. To achieve this, process i must rotate its data up by i blocks. In each communication step k (0 ≤ k < ⌈lg p⌉), process i sends to rank (i + 2^k) (with wrap-around) all those data blocks whose kth bit is 1, receives data from rank (i − 2^k), and stores the incoming data into blocks whose kth bit is 1 (that is, overwriting the data that was just sent). In other words, in step 0, all the data blocks whose least significant bit is 1 are sent and received (blocks 1, 3, and 5 in our example). In step 1, all the data blocks whose second bit is 1 are sent and received, namely, blocks 2 and 3. After a total of ⌈lg p⌉ steps, all the data gets routed to the right destination process, but the data blocks are not in the right order in the output buffer. A final step in which each process does a local inverse shift of the blocks (memory copies) places the data in the right order.

The beauty of the Bruck algorithm is that it is a logarithmic algorithm for short-message all-to-all that does not need any extra bookkeeping or control information for routing the right data to the right process; that is taken care of by the mathematics of the algorithm. It does need a memory permutation in the beginning and another at the end, but for short messages, where communication latency dominates, the performance penalty of memory copying is small. If n is the total amount of data a process needs to send to or receive from all other processes, the time taken by the Bruck algorithm can be calculated as follows. If the number of processes is a power of two, each process sends and receives n/2 amount of data in each step, for a total of lg p steps. Therefore, the time taken by the algorithm is T_bruck = lg p α + (n/2) lg p β. If the number of processes is not a power of two, in the final step, each process must communicate (n/p)(p − 2^⌊lg p⌋) data. Therefore, the time taken in the non-power-of-two case is T_bruck = ⌈lg p⌉α + ((n/2) lg p + (n/p)(p − 2^⌊lg p⌋))β.

Figure 8 shows the performance of the Bruck algorithm versus the old algorithm in MPICH (isend-irecv) for short messages. The Bruck algorithm performs significantly better because of its logarithmic latency term. As the message size is increased, however, latency becomes less of an issue, and the extra bandwidth cost of the Bruck algorithm begins to show. Beyond a per process message size of about 256 bytes, the isend-irecv algorithm performs better. Therefore, for medium-sized messages (256 bytes to 32 KB per message), we use the irecv-isend algorithm, which works well in this range.
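The staggered ordering used by the medium-message irecv-isend algorithm looks roughly like the following sketch. The helper function is hypothetical; each process is assumed to contribute one blocksize-byte block per destination.

#include <mpi.h>
#include <stdlib.h>

void staggered_alltoall(char *sendbuf, char *recvbuf,
                        int blocksize, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    MPI_Request *reqs = malloc(2 * (size_t)p * sizeof(MPI_Request));

    for (int i = 0; i < p; i++) {
        int src = (rank + i) % p;     /* staggered source */
        MPI_Irecv(recvbuf + (size_t)src * blocksize, blocksize, MPI_BYTE,
                  src, 0, comm, &reqs[i]);
    }
    for (int i = 0; i < p; i++) {
        int dst = (rank + i) % p;     /* staggered destination */
        MPI_Isend(sendbuf + (size_t)dst * blocksize, blocksize, MPI_BYTE,
                  dst, 0, comm, &reqs[p + i]);
    }
    MPI_Waitall(2 * p, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}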

Figure 7: Bruck algorithm for all-to-all. The number ij in each box represents the data to be sent from process i to process j. The shaded boxes indicate the data to be communicated in the next step.

Figure 8: Performance of Bruck all-to-all versus the old algorithm in MPICH (isend-irecv) for short messages. The size on the x-axis is the amount of data sent by each process to every other process.

For long messages and power-of-two numbers of processes, we use a pairwise-exchange algorithm, which takes p − 1 steps. In each step k, 1 ≤ k < p, each process calculates its target process as (rank ⊕ k) (exclusive-or operation) and exchanges data directly with that process. This algorithm, however, does not work if the number of processes is not a power of two. For the non-power-of-two case, we use an algorithm in which, in step k, each process receives data from rank − k and sends data to rank + k. In both these algorithms, data is directly communicated from source to destination, with no intermediate steps. The time taken by these algorithms is given by T_long = (p − 1)α + nβ.
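A sketch of these two long-message schedules follows, again assuming one block of blocksize bytes per destination. The helper function is hypothetical and not the MPICH source.

#include <mpi.h>
#include <string.h>

void longmsg_alltoall(char *sendbuf, char *recvbuf, int blocksize,
                      MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int is_pof2 = (p & (p - 1)) == 0;

    /* Step 0: copy the block addressed to myself. */
    memcpy(recvbuf + (size_t)rank * blocksize,
           sendbuf + (size_t)rank * blocksize, blocksize);

    for (int k = 1; k < p; k++) {
        int dst, src;
        if (is_pof2) {
            dst = src = rank ^ k;         /* pairwise exchange */
        } else {
            dst = (rank + k) % p;         /* send to rank + k */
            src = (rank - k + p) % p;     /* receive from rank - k */
        }
        MPI_Sendrecv(sendbuf + (size_t)dst * blocksize, blocksize, MPI_BYTE,
                     dst, 0,
                     recvbuf + (size_t)src * blocksize, blocksize, MPI_BYTE,
                     src, 0, comm, MPI_STATUS_IGNORE);
    }
}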

4.4 Reduce-Scatter

Reduce-scatter is a variant of reduce in which the result, instead of being stored at the root, is scattered among all processes. It is an irregular primitive: The scatter in it is a scatterv. The old algorithm in MPICH implements reduce-scatter by doing a binomial tree reduce to rank 0 followed by a linear scatterv. This algorithm takes lg p + p − 1 steps, and the bandwidth term is (lg p + (p − 1)/p)nβ. Therefore, the time taken by this algorithm is T_old = (lg p + p − 1)α + (lg p + (p − 1)/p)nβ + n lg p γ.

In our new implementation of reduce-scatter, for short messages, we use different algorithms depending on whether the reduction operation is commutative or noncommutative. The commutative case occurs most commonly because all the predefined reduction operations in MPI (such as MPI_SUM, MPI_MAX) are commutative.

For commutative operations, we use a recursive-halving algorithm, which is analogous to the recursive-doubling algorithm used for allgather (see Figure 9). In the first step, each process exchanges data with a process that is a distance p/2 away: Each process sends the data needed by all processes in the other half, receives the data needed by all processes in its own half, and performs the reduction operation on the received data. The reduction can be done because the operation is commutative. In the second step, each process exchanges data with a process that is a distance p/4 away. This procedure continues recursively, halving the data communicated at each step, for a total of lg p steps. Therefore, if p is a power of two, the time taken by this algorithm is T_rec_half = lg p α + ((p − 1)/p)nβ + ((p − 1)/p)nγ. We use this algorithm for messages up to 512 KB.

Figure 9: Recursive halving for commutative reduce-scatter

If p is not a power of two, we first reduce the number of processes to the nearest lower power of two by having the first few even-numbered processes send their data to the neighboring odd-numbered process (rank + 1). These odd-numbered processes do a reduce on the received data, compute the result for themselves and their left neighbor during the recursive halving algorithm, and, at the end, send the result back to the left neighbor. Therefore, if p is not a power of two, the time taken by the algorithm is T_rec_half = (⌊lg p⌋ + 2)α + 2nβ + n(1 + (p − 1)/p)γ. This cost is approximate because some imbalance exists in the amount of work each process does, since some processes do the work of their neighbors as well.

If the reduction operation is not commutative, recursive halving will not work (unless the data is permuted suitably [29]). Instead, we use a recursive-doubling algorithm similar to the one in allgather. In the first step, pairs of neighboring processes exchange data; in the second step, pairs of processes at distance 2 apart exchange data; in the third step, processes at distance 4 apart exchange data; and so forth. However, more data is communicated than in allgather. In step 1, processes exchange all the data except the data needed for their own result (n − n/p); in step 2, processes exchange all data except the data needed by themselves and by the processes they communicated with in the previous step (n − 2n/p); in step 3, it is (n − 4n/p); and so forth. Therefore, the time taken by this algorithm is T_short = lg p α + n(lg p − (p − 1)/p)β + n(lg p − (p − 1)/p)γ. We use this algorithm for very short messages (< 512 bytes).

For long messages (≥ 512 KB in the case of commutative operations and ≥ 512 bytes in the case of noncommutative operations), we use a pairwise exchange algorithm that takes p − 1 steps. In step i, each process sends data to (rank + i), receives data from (rank − i), and performs the local reduction. The data exchanged is only the data needed for the scattered result on the process (n/p). The time taken by this algorithm is T_long = (p − 1)α + ((p − 1)/p)nβ + ((p − 1)/p)nγ. Note that this algorithm has the same bandwidth requirement as the recursive halving algorithm. Nonetheless, we use this algorithm for long messages because it performs much better than recursive halving (similar to the results for recursive doubling versus the ring algorithm for long-message allgather).

The SKaMPI benchmark, by default, uses a noncommutative user-defined reduction operation. Since commutative operations are more commonly used, we modified the benchmark to use a commutative operation, namely, MPI_SUM. Figure 10 shows the performance of the new algorithm for short messages on the IBM SP and on the Myrinet cluster. The performance is significantly better than that of the algorithm used in IBM's MPI on the SP and several times better than the old algorithm (reduce + scatterv) used in MPICH on the Myrinet cluster.

Figure 10: Performance of reduce-scatter for short messages on the IBM SP (64 nodes) and for long messages on the Myrinet cluster (32 nodes)

The above algorithms will also work for irregular reduce-scatter operations, but they are not specifically optimized for that case.
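The commutative, power-of-two case of the recursive-halving reduce-scatter can be sketched as follows. This is a simplified illustration under assumed conditions (MPI_SUM on doubles, equal block sizes of count elements per destination), not the MPICH code.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void rec_halving_reduce_scatter(const double *sendbuf, double *recvbuf,
                                int count, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    double *work = malloc((size_t)p * count * sizeof(double));
    double *tmp  = malloc((size_t)(p / 2) * count * sizeof(double));
    memcpy(work, sendbuf, (size_t)p * count * sizeof(double));

    int lo = 0, hi = p;                       /* blocks this process still reduces */
    for (int dist = p / 2; dist >= 1; dist /= 2) {
        int partner = rank ^ dist;
        int mid = lo + (hi - lo) / 2;
        int keep_lo, send_lo;
        if (rank < mid) { keep_lo = lo;  send_lo = mid; hi = mid; }
        else            { keep_lo = mid; send_lo = lo;  lo = mid; }

        /* Send the half my partner needs, receive the half I need, reduce. */
        MPI_Sendrecv(work + (size_t)send_lo * count, dist * count, MPI_DOUBLE,
                     partner, 0,
                     tmp, dist * count, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < dist * count; i++)
            work[(size_t)keep_lo * count + i] += tmp[i];
    }
    /* After lg p steps the surviving range is my own block. */
    memcpy(recvbuf, work + (size_t)rank * count, count * sizeof(double));
    free(work);
    free(tmp);
}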

Myrinet Cluster
4.5 Reduce and Allreduce 450000
MPICH Old
MPICH New
400000
MPI Reduce performs a global reduction operation
and returns the result to the specified root, whereas 350000

MPI Allreduce returns the result on all processes. 300000

time (microsec.)
The old algorithm for reduce in MPICH uses a bi- 250000
nomial tree, which takes lg p steps, and the data com- 200000
municated at each step is n. Therefore, the time taken
150000
by this algorithm is Ttree = dlg pe(α + nβ + nγ). The
100000
old algorithm for allreduce simply does a reduce to
rank 0 followed by a broadcast. 50000

The binomial tree algorithm for reduce is a good 0


0 1 2 3 4 5 6 7 8
algorithm for short messages because of the lg p num- message length (MB)
ber of steps. For long messages, however, a better
algorithm exists, proposed by Rabenseifner [19]. The Figure 11: Performance of reduce (64 nodes)
principle behind Rabenseifner’s algorithm is similar to
that behind Van de Geijn’s algorithm for long-message
broadcast. Van de Geijn implements the broadcast as Therefore, the total cost is Trabenseif ner = 2 lg p α +
a scatter followed by an allgather, which reduces the 2 p−1 p−1
p nβ + p nγ.
n lg pβ bandwidth term in the binomial tree algorithm
to a 2nβ term. Rabenseifner’s algorithm implements
a long-message reduce effectively as a reduce-scatter
followed by a gather to the root, which has the same 5 Further Optimization of
effect of reducing the bandwidth term from n lg p β Allreduce and Reduce
to 2nβ. The time taken by Rabenseifner’s algorithm
is the sum of the times taken by reduce-scatter (re- As the profiling study in [20] indicated that allreduce
cursive halving) and gather (binomial tree), which is and reduce are the most commonly used collective op-
Trabenseif ner = 2 lg p α + 2 p−1 p−1
p nβ + p nγ. erations, we investigated in further detail how to op-
For reduce, in the case of predefined reduction oper- timize these operations. We consider five different al-
ations, we use Rabenseifner’s algorithm for long mes- gorithms for implementing allreduce and reduce. The
sages (> 2 KB) and the binomial tree algorithm for first two algorithms are binomial tree and recursive
short messages (≤ 2 KB). In the case of user-defined doubling, which were explained above. Binomial tree
reduction operations, we use the binomial tree algo- for reduce is well known. For allreduce, it involves
rithm for all message sizes because, unlike with prede- doing a binomial-tree reduce to rank 0 followed by a
fined reduction operations, the user may pass derived binomial-tree broadcast. Recursive doubling is used
datatypes, and breaking up derived datatypes to do for allreduce only. The other three algorithms are re-
the reduce-scatter is tricky. Figure 11 shows the per- cursive halving and doubling, binary blocks, and ring.
formance of reduce for long messages on the Myrinet For explaining these algorithms, we define the follow-
cluster. The new algorithm is more than twice as fast ing terms:
as the old algorithm in some cases. • Recursive vector halving: The vector to be reduced
For allreduce, we use a recursive doubling algorithm is recursively halved in each step.
for short messages and for long messages with user- • Recursive vector doubling: Small pieces of the vector
defined reduction operations. This algorithm is sim- scattered across processes are recursively gathered
ilar to the recursive doubling algorithm used in all- or combined to form the large vector
gather, except that each communication step also in- • Recursive distance halving: The distance over which
volves a local reduction. The time taken by this algo- processes communicate is recursively halved at each
rithm is Trec−dbl = lg p α + n lg p β + n lg p γ. step ( p2 , p4 , . . . , 1).
For long messages and predefined reduction op- • Recursive distance doubling: The distance over
erations, we use Rabenseifner’s algorithm for allre- which processes communicate is recursively doubled
duce [19], which does a reduce-scatter followed by an at each step (1, 2, 4, . . . , p2 ).
allgather. If the number of processes is a power of All algorithms in this section can be implemented
two, the cost for the reduce-scatter is lg p α + p−1
p nβ + without local copying of data, except if user-defined
p−1 p−1
p nγ. The cost for the allgather is lg p α + p nβ. noncommutative operations are used.
10 COMPUTING APPLICATIONS
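The reduce-scatter + allgather structure of Rabenseifner's allreduce (Section 4.5) can be expressed with standard MPI collectives, as in the sketch below. This is an illustration under simplifying assumptions (count divisible by p, MPI_SUM on doubles, MPI_Reduce_scatter_block available, i.e., MPI-2.2 or later); MPICH implements both phases directly with recursive halving and doubling.

#include <mpi.h>

void rabenseifner_style_allreduce(double *sendbuf, double *recvbuf,
                                  int count, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int chunk = count / p;   /* assume count divisible by p */

    /* Phase 1: reduce-scatter leaves each process with the reduced
     * chunk it owns (recursive halving inside MPICH). */
    MPI_Reduce_scatter_block(sendbuf, recvbuf + rank * chunk, chunk,
                             MPI_DOUBLE, MPI_SUM, comm);

    /* Phase 2: allgather the reduced chunks so that every process ends
     * up with the complete result vector. */
    MPI_Allgather(MPI_IN_PLACE, chunk, MPI_DOUBLE,
                  recvbuf, chunk, MPI_DOUBLE, comm);
}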

5.1 Vector Halving and Distance Doubling Algorithm

This algorithm is a combination of a reduce-scatter implemented with recursive vector halving and distance doubling, followed either by a binomial-tree gather (for reduce) or by an allgather implemented with recursive vector doubling and distance halving (for allreduce).

Since these recursive algorithms require a power-of-two number of processes, if the number of processes is not a power of two, we first reduce it to the nearest lower power of two (p′ = 2^⌊lg p⌋) by removing r = p − p′ extra processes as follows. In the first 2r processes (ranks 0 to 2r − 1), all the even ranks send the second half of the input vector to their right neighbor (rank + 1), and all the odd ranks send the first half of the input vector to their left neighbor (rank − 1), as illustrated in Figure 12. The even ranks compute the reduction on the first half of the vector and the odd ranks compute the reduction on the second half. The odd ranks then send the result to their left neighbors (the even ranks). As a result, the even ranks among the first 2r processes now contain the reduction with the input vector on their right neighbors (the odd ranks). These odd ranks do not participate in the rest of the algorithm, which leaves behind a power-of-two number of processes. The first r even-ranked processes and the last p − 2r processes are now renumbered from 0 to p′ − 1, p′ being a power of two.

Figure 12 illustrates the algorithm for an example on 13 processes. The input vectors and all reduction results are divided into 8 parts (A, B, . . . , H), where 8 is the largest power of two less than 13, and denoted as A–H_ranks. After the first reduction, process P0 has computed A–D_0−1, which is the reduction result of the first half (A–D) of the input vector from processes 0 and 1. Similarly, P1 has computed E–H_0−1, P2 has computed A–D_2−3, and so forth. The odd ranks then send their half to the even ranks on their left: P1 sends E–H_0−1 to P0, P3 sends E–H_2−3 to P2, and so forth. This completes the first step, which takes (1 + f_α)α + (n/2)(1 + f_β)β + (n/2)γ time. P1, P3, P5, P7, and P9 do not participate in the remainder of the algorithm, and the remaining processes are renumbered from 0–7.

The remaining processes now perform a reduce-scatter by using recursive vector halving and distance doubling. The even-ranked processes send the second half of their buffer to rank′ + 1 and the odd-ranked processes send the first half of their buffer to rank′ − 1. All processes then compute the reduction between the local buffer and the received buffer. In the next lg p′ − 1 steps, the buffers are recursively halved, and the distance is doubled. At the end, each of the p′ processes has 1/p′ of the total reduction result. All these recursive steps take lg p′ α + ((p′ − 1)/p′)(nβ + nγ) time. The next part of the algorithm is either an allgather or gather depending on whether the operation to be implemented is an allreduce or reduce.

Allreduce: To implement allreduce, we do an allgather using recursive vector doubling and distance halving. In the first step, process pairs exchange 1/p′ of the buffer to achieve 2/p′ of the result vector, in the next step 2/p′ of the buffer is exchanged to get 4/p′ of the result, and so forth. After lg p′ steps, the p′ processes receive the total reduction result. This allgather part costs lg p′ α + ((p′ − 1)/p′)nβ. If the number of processes is not a power of two, the total result vector must be sent to the r processes that were removed in the first step, which results in additional overhead of α_uni + nβ_uni. The total allreduce operation therefore takes the following time:
• If p is a power of two: T_all,h&d,p=2^exp = 2 lg p α + 2nβ + nγ − (1/p)(2nβ + nγ) ≈ 2 lg p α + 2nβ + nγ
• If p is not a power of two: T_all,h&d,p≠2^exp = (2 lg p′ + 1 + 2f_α)α + (2 + (1 + 3f_β)/2)nβ + (3/2)nγ − (1/p′)(2nβ + nγ) ≈ (3 + 2⌊lg p⌋)α + 4nβ + (3/2)nγ

This algorithm is good for long vectors and power-of-two numbers of processes. For non-power-of-two numbers of processes, the data transfer overhead is doubled, and the computation overhead is increased by 3/2. The binary blocks algorithm described in Section 5.2 can reduce this overhead in many cases.

Reduce: For reduce, a binomial tree gather is performed by using recursive vector doubling and distance halving, which takes lg p′ α_uni + ((p′ − 1)/p′)nβ_uni time. In the non-power-of-two case, if the root happens to be one of those odd-ranked processes that would normally be removed in the first step, then the roles of this process and its partner in the first step are interchanged after the first reduction in the reduce-scatter phase, which causes no additional overhead. The total reduce operation therefore takes the following time:
• If p is a power of two: T_red,h&d,p=2^exp = lg p(1 + f_α)α + (1 + f_β)nβ + nγ − (1/p)((1 + f_β)nβ + nγ) ≈ 2 lg p α + 2nβ + nγ
• If p is not a power of two: T_red,h&d,p≠2^exp = lg p′(1 + f_α)α + (1 + f_α)α + (1 + (1 + f_β)/2 + f_β)nβ + (3/2)nγ − (1/p′)((1 + f_β)nβ + nγ) ≈ (2 + 2⌊lg p⌋)α + 3nβ + (3/2)nγ

5.2 Binary Blocks Algorithm

This algorithm reduces some of the load imbalance in the recursive halving and doubling algorithm when the number of processes is not a power of two. The algorithm starts with a binary-block decomposition of all processes in blocks with power-of-two numbers of processes (see the example in Figure 13). Each block executes its own reduce-scatter with the recursive vector halving and distance doubling algorithm described above. Then, starting with the smallest block, the intermediate result (or the input vector in the case of a 2^0 block) is split into the segments of the intermediate result in the next higher block and sent to the processes in that block, and those processes compute the reduction on the segment. This does cause a load imbalance in computation and communication compared with the execution in the larger blocks. For example, in the third exchange step in the 2^3 block, each process sends one segment, receives one segment, and computes the reduction of one segment (P0 sends B, receives A, and computes the reduction on A). The load imbalance is introduced by the smaller blocks 2^2 and 2^0: In the 2^2 block, each process receives and reduces two segments (for example, A–B on P8), whereas in the 2^0 block (P12), each process has to send as many messages as the ratio of the two block sizes (here 2^2/2^0). At the end of the first part, the highest block must be recombined with the next smaller block, and the ratio of the block sizes again determines the overhead.

We see that the maximum difference between the ratio of two successive blocks, especially in the low range of exponents, determines the load imbalance. Let us define δ_expo,max as the maximal difference of two consecutive exponents in the binary representation of the number of processes. For example, for 100 = 2^6 + 2^5 + 2^2, δ_expo,max = max(6 − 5, 5 − 2) = 3. If δ_expo,max is small, the binary blocks algorithm can perform well.

Allreduce: For allreduce, the second part is an allgather implemented with recursive vector doubling and distance halving in each block. For this purpose, data must be provided to the processes in the smaller blocks with a pair of messages from processes of the next larger block, as shown in Figure 13.

Reduce: For reduce, if the root is outside the largest block, then the intermediate result segment of rank 0 is sent to the root, and the root plays the role of rank 0. A binomial tree is used to gather the result segments to the root process.

We note that if the number of processes is a power of two, the binary blocks algorithm is identical to the recursive halving and doubling algorithm.

5.3 Ring Algorithm

This algorithm uses a pairwise-exchange algorithm for the reduce-scatter phase (see Section 4.4). For allreduce, it uses a ring algorithm to do the allgather, and, for reduce, all processes directly send their result segment to the root. This algorithm is good in bandwidth use when the number of processes is not a power of two, but the latency scales with the number of processes. Therefore this algorithm should be used only for a small or medium number of processes or for large vectors. The time taken is T_all,ring = 2(p − 1)α + 2nβ + nγ − (1/p)(2nβ + nγ) for allreduce and T_red,ring = (p − 1)(α + α_uni) + n(β + β_uni) + nγ − (1/p)(n(β + β_uni) + nγ) for reduce.

5.4 Choosing the Fastest Algorithm

Based on the number of processes and the buffer size, the reduction routine must decide which algorithm to use. This decision is not easy and depends on a number of factors. We experimentally determined which algorithm works best for different buffer sizes and numbers of processes on the Cray T3E 900. The results for allreduce are shown in Figure 14. The figure indicates which is the fastest allreduce algorithm for each parameter pair (number of processes, buffer size) and for the operation MPI_SUM with datatype MPI_DOUBLE. For buffer sizes less than or equal to 32 bytes, recursive doubling is the best; for buffer sizes less than or equal to 1 KB, the vendor's algorithm (for power-of-two) and binomial tree (for non-power-of-two) are the best, but not much better than recursive doubling; for longer buffer sizes, the ring algorithm is good for some buffer sizes and some numbers of processes less than 32. In general, on a Cray T3E 900, the binary blocks algorithm is faster if δ_expo,max < lg(vector length in bytes)/2.0 − 2.5, the vector size is ≥ 16 KB, and more than 32 processes are used. In a few cases, for example, 33 processes and less than 32 KB, recursive halving and doubling is the best.

Figure 15 shows the bandwidths obtained by the various algorithms for a 32 KB buffer on the T3E. For this buffer size, the new algorithms are clearly better than the vendor's algorithm (Cray MPT.1.4.0.4) and the binomial tree algorithm for all numbers of processes. We observe that the bandwidth of the binary blocks algorithm depends strongly on δ_expo,max and that recursive halving and doubling is faster on 33, 65, 66, 97, and 128–131 processes. The ring algorithm is faster on 3, 5, 7, 9–11, and 17 processes.
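The quantity δ_expo,max and the T3E selection rule quoted above can be computed as in the following small sketch; the helper function names are hypothetical.

#include <stdio.h>
#include <math.h>

/* Maximal difference of two consecutive exponents in the binary
 * representation of p; for p = 100 = 2^6 + 2^5 + 2^2 this is max(1, 3) = 3. */
static int delta_expo_max(int p)
{
    int prev = -1, delta = 0;
    for (int bit = 30; bit >= 0; bit--) {
        if (p & (1 << bit)) {
            if (prev >= 0 && prev - bit > delta)
                delta = prev - bit;
            prev = bit;
        }
    }
    return delta;   /* 0 when p is a power of two */
}

/* Selection rule observed for allreduce on the Cray T3E 900. */
static int use_binary_blocks(int p, double vector_bytes)
{
    return delta_expo_max(p) < log2(vector_bytes) / 2.0 - 2.5
           && vector_bytes >= 16 * 1024
           && p > 32;
}

int main(void)
{
    printf("delta_expo_max(100) = %d, use_binary_blocks(100, 32768) = %d\n",
           delta_expo_max(100), use_binary_blocks(100, 32768));
    return 0;
}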

Figure 12: Allreduce using the recursive halving and doubling algorithm. The intermediate results after each
communication step, including the reduction operation in the reduce-scatter phase, are shown. The dotted
frames show the additional overhead caused by a non-power-of-two number of processes.

Figure 13: Allreduce using the binary blocks algorithm



Figure 14: The fastest algorithm for allreduce (MPI_DOUBLE, MPI_SUM) on a Cray T3E 900

Figure 15: Bandwidth comparison for allreduce (MPI_DOUBLE, MPI_SUM) with 32 KB vectors on a Cray T3E 900.

5.5 Comparison with Vendor's MPI

We also ran some experiments to compare the performance of the best of the new algorithms with the algorithm in the native MPI implementations on the IBM SP at the San Diego Supercomputer Center, a Myrinet cluster at the University of Heidelberg, and the Cray T3E. Figures 16–18 show the improvement achieved compared with the allreduce/reduce algorithm in the native (vendor's) MPI library. Each symbol in these figures indicates how many times faster the best algorithm is compared with the native vendor's algorithm.

Figure 16 compares the algorithm based on two different application programming models on a cluster of SMP nodes. The left graph shows that with a pure MPI programming model (1 MPI process per CPU) on the IBM SP, the fastest algorithm performs about 1.5 times better than the vendor's algorithm for buffer sizes of 8–64 KB and 2–5 times better for larger buffers. In the right graph, a hybrid programming model comprising one MPI process per SMP node is used, where each MPI process is itself SMP-parallelized (with OpenMP, for example) and only the master thread calls MPI functions (the master-only style in [21]). The performance is about 1.5–3 times better than the vendor's MPI for buffer sizes 4–128 KB and more than 4 processes.

Figure 17 compares the best of the new algorithms with the old MPICH-1 algorithm on the Heidelberg Myrinet cluster. The new algorithms show a performance benefit of 3–7 times with pure MPI and 2–5 times with the hybrid model. Figure 18 shows that on the T3E, the new algorithms are 3–5 times faster than the vendor's algorithm for the operation MPI_SUM and, because of the very slow implementation of structured derived datatypes in Cray's MPI, up to 100 times faster for MPI_MAXLOC.

We ran the best-performing algorithms for the usage scenarios indicated by the profiling study in [20] and found that the new algorithms improve the performance of allreduce by up to 20% and that of reduce by up to 54%, compared to the vendor's implementation on the T3E, as shown in Figure 19.

6 Conclusions and Future Work

Our results demonstrate that optimized algorithms for collective communication can provide substantial performance benefits and, to achieve the best performance, one needs to use a number of different algorithms and select the right algorithm for a particular message size and number of processes.

Figure 16: Ratio of the bandwidth of the fastest of the new algorithms (not including recursive doubling) and the vendor's allreduce on the IBM SP at SDSC with 1 MPI process per CPU (left) and per SMP node (right)

Figure 17: Ratio of the bandwidth of the fastest of the new algorithms (not including recursive doubling) and the old MPICH-1 algorithm on a Myrinet cluster with dual-CPU PCs (HELICS cluster, University of Heidelberg) and 1 MPI process per CPU (left) and per SMP node (right)

Figure 18: Ratio of the bandwidth of the fastest of the new algorithms and the vendor's algorithm for allreduce (left) and reduce (right) with operation MPI_SUM (first row) and MPI_MAXLOC (second row) on a Cray T3E 900

Figure 19: Benefit of new allreduce and reduce algorithms optimized for long vectors on the Cray T3E

the different algorithms is tricky, however, and they computing in general and high-performance communi-
may be different for different machines and networks. cation and I/O in particular. He was a member of the
At present, we use experimentally determined cutoff MPI Forum and participated actively in the definition
points. In the future, we intend to determine the cut- of the I/O part of the MPI-2 standard. He is also the
off points automatically based on system parameters. the author of a widely used, portable implementation
MPI also defines irregular ("v") versions of many of the collectives, in which the counts may differ from process to process. For these operations, we currently use the same techniques as for the regular versions described in this paper. Further optimization of the irregular collectives is possible, and we plan to pursue it in the future.
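For reference, the short example below calls one such irregular collective, MPI_Allgatherv; the per-process counts are chosen arbitrarily for illustration only.

/* Each process i contributes i+1 integers; MPI_Allgatherv collects the
   variable-length contributions on every process. Counts are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, total = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendcount = rank + 1;                /* irregular: differs per process */
    int *sendbuf = malloc(sendcount * sizeof(int));
    for (int i = 0; i < sendcount; i++) sendbuf[i] = rank;

    int *recvcounts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) {
        recvcounts[i] = i + 1;               /* how much each rank contributes */
        displs[i] = total;                   /* where it lands in recvbuf */
        total += recvcounts[i];
    }
    int *recvbuf = malloc(total * sizeof(int));

    MPI_Allgatherv(sendbuf, sendcount, MPI_INT,
                   recvbuf, recvcounts, displs, MPI_INT, MPI_COMM_WORLD);

    free(sendbuf); free(recvbuf); free(recvcounts); free(displs);
    MPI_Finalize();
    return 0;
}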
In this work, we assume a flat communication model in which any pair of processes can communicate at the same cost. Although these algorithms will work even on hierarchical networks, they may not be optimal for such networks. We plan to extend this work to hierarchical networks and to develop algorithms optimized for architectures comprising clusters of SMPs and clusters distributed over a wide area, such as the TeraGrid [26]. We also plan to explore the use of one-sided communication to improve the performance of collective operations.
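To indicate the direction of the planned hierarchical optimization, the sketch below performs an allreduce in three phases: a reduce within each SMP node, an allreduce among the node leaders, and a broadcast within each node. The communicators node_comm and leader_comm are assumed to be constructed by the caller (one leader, local rank 0, per node, with leader_comm equal to MPI_COMM_NULL on non-leaders), and a commutative reduction operation is assumed; this is an illustration, not part of the current implementation.

/* Sketch of an SMP-aware allreduce for hierarchical networks.
   Assumes: node_comm groups the processes of one node, leader_comm
   contains local rank 0 of every node (MPI_COMM_NULL elsewhere), and
   op is commutative. Illustrative only, not the MPICH code. */
#include <mpi.h>

int hierarchical_allreduce(void *sendbuf, void *recvbuf, int count,
                           MPI_Datatype datatype, MPI_Op op,
                           MPI_Comm node_comm, MPI_Comm leader_comm)
{
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Phase 1: reduce onto the local leader (rank 0 within the node). */
    MPI_Reduce(sendbuf, recvbuf, count, datatype, op, 0, node_comm);

    /* Phase 2: combine the per-node results across the node leaders. */
    if (node_rank == 0 && leader_comm != MPI_COMM_NULL)
        MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, datatype, op, leader_comm);

    /* Phase 3: make the result available to every process on the node. */
    return MPI_Bcast(recvbuf, count, datatype, 0, node_comm);
}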
The source code for the algorithms in Section 4 is available in MPICH-1.2.6 and MPICH2 1.0. Both MPICH-1 and MPICH2 can be downloaded from www.mcs.anl.gov/mpi/mpich.

Acknowledgments

This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract W-31-109-ENG-38. The authors would like to acknowledge their colleagues and others who provided suggestions and helpful comments. They would especially like to thank Jesper Larsson Träff for helpful discussions on optimized reduction algorithms and Gerhard Wellein, Thomas Ludwig, and Ana Kovatcheva for their benchmarking support. We also thank the reviewers for their detailed comments.

Biographies

Rajeev Thakur is a Computer Scientist in the Mathematics and Computer Science Division at Argonne National Laboratory. He received a B.E. in Computer Engineering from the University of Bombay, India, in 1990, an M.S. in Computer Engineering from Syracuse University in 1992, and a Ph.D. in Computer Engineering from Syracuse University in 1995. His research interests are in the area of high-performance computing in general and high-performance communication and I/O in particular. He was a member of the MPI Forum and participated actively in the definition of the I/O part of the MPI-2 standard. He is also the author of a widely used, portable implementation of MPI-IO, called ROMIO. He is currently involved in the development of MPICH-2, a new portable implementation of MPI-2. Rajeev is a co-author of the book "Using MPI-2: Advanced Features of the Message Passing Interface" published by MIT Press. He is an associate editor of IEEE Transactions on Parallel and Distributed Systems, has served on the program committees of several conferences, and has also served as a co-guest editor for a special issue of the Int'l Journal of High-Performance Computing Applications on "I/O in Parallel Applications."

Rolf Rabenseifner studied mathematics and physics at the University of Stuttgart. He is head of the Department of Parallel Computing at the High-Performance Computing Center Stuttgart (HLRS). He led the projects DFN-RPC, a remote procedure call tool, and MPI-GLUE, the first metacomputing MPI combining different vendors' MPIs without losing the full MPI interface. In his dissertation work at the University of Stuttgart, he developed a controlled logical clock as global time for trace-based profiling of parallel and distributed applications. He is an active member of the MPI-2 Forum. In 1999, he was an invited researcher at the Center for High-Performance Computing at Dresden University of Technology. His current research interests include MPI profiling, benchmarking, and optimization. Each year he teaches parallel programming models in a workshop format at many universities and labs in Germany (http://www.hlrs.de/people/rabenseifner/).

William Gropp received his B.S. in Mathematics from Case Western Reserve University in 1977, an M.S. in Physics from the University of Washington in 1978, and a Ph.D. in Computer Science from Stanford in 1982. He held the positions of assistant (1982-1988) and associate (1988-1990) professor in the Computer Science Department at Yale University. In 1990, he joined the Numerical Analysis group at Argonne, where he is a Senior Computer Scientist and Associate Director of the Mathematics and Computer Science Division, a Senior Scientist in the Department of Computer Science at the University of Chicago, and a Senior Fellow in the Argonne-Chicago Computation Institute. His research interests are in parallel computing, software for scientific computing, and numerical methods for partial differential equations. He has played a major role in the development of the MPI message-passing standard. He is co-author of the most widely used implementation of MPI, MPICH, and was involved in the MPI Forum as a chapter author for both MPI-1 and MPI-2. He has written many books and papers on MPI, including "Using MPI" and "Using MPI-2". He is also one of the designers of the PETSc parallel numerical library and has developed efficient and scalable parallel algorithms for the solution of linear and nonlinear equations.

References

[1] Albert Alexandrov, Mihai F. Ionescu, Klaus E. Schauser, and Chris Scheiman. LogGP: Incorporating long messages into the LogP model for parallel computation. Journal of Parallel and Distributed Computing, 44(1):71–79, 1997.

[2] M. Barnett, S. Gupta, D. Payne, L. Shuler, R. van de Geijn, and J. Watts. Interprocessor collective communication library (InterCom). In Proceedings of Supercomputing '94, November 1994.

[3] M. Barnett, R. Littlefield, D. Payne, and R. van de Geijn. Global combine on mesh architectures with wormhole routing. In Proceedings of the 7th International Parallel Processing Symposium, April 1993.

[4] Gregory D. Benson, Cho-Wai Chu, Qing Huang, and Sadik G. Caglar. A comparison of MPICH allgather algorithms on switched networks. In Jack Dongarra, Domenico Laforenza, and Salvatore Orlando, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 10th European PVM/MPI Users' Group Meeting, pages 335–343. Lecture Notes in Computer Science 2840, Springer, September 2003.

[5] S. Bokhari. Complete exchange on the iPSC/860. Technical Report 91-4, ICASE, NASA Langley Research Center, 1991.

[6] S. Bokhari and H. Berryman. Complete exchange on a circuit switched mesh. In Proceedings of the Scalable High Performance Computing Conference, pages 300–306, 1992.

[7] Jehoshua Bruck, Ching-Tien Ho, Shlomo Kipnis, Eli Upfal, and Derrick Weathersby. Efficient algorithms for all-to-all communications in multiport message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 8(11):1143–1156, November 1997.

[8] Ernie W. Chan, Marcel F. Heimlich, Avi Purakayastha, and Robert A. van de Geijn. On optimizing collective communication. In Proceedings of the 2004 IEEE International Conference on Cluster Computing, September 2004.

[9] David E. Culler, Richard M. Karp, David A. Patterson, Abhijit Sahay, Klaus E. Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards a realistic model of parallel computation. In Principles and Practice of Parallel Programming, pages 1–12, 1993.

[10] Debra Hensgen, Raphael Finkel, and Udi Manber. Two algorithms for barrier synchronization. International Journal of Parallel Programming, 17(1):1–17, 1988.

[11] Roger W. Hockney. The communication challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Computing, 20(3):389–398, March 1994.

[12] Giulio Iannello. Efficient algorithms for the reduce-scatter operation in LogGP. IEEE Transactions on Parallel and Distributed Systems, 8(9):970–982, September 1997.

[13] L. V. Kale, Sameer Kumar, and Krishnan Vardarajan. A framework for collective personalized communication. In Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS '03), 2003.

[14] N. Karonis, B. de Supinski, I. Foster, W. Gropp, E. Lusk, and J. Bresnahan. Exploiting hierarchy in parallel computer networks to optimize collective operation performance. In Proceedings of the Fourteenth International Parallel and Distributed Processing Symposium (IPDPS '00), pages 377–384, 2000.

[15] T. Kielmann, R. F. H. Hofman, H. E. Bal, A. Plaat, and R. A. F. Bhoedjang. MagPIe: MPI's collective communication operations for clustered wide area systems. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'99), pages 131–140. ACM, May 1999.

[16] P. Mitra, D. Payne, L. Shuler, R. van de Geijn, and J. Watts. Fast collective communication libraries, please. In Proceedings of the Intel Supercomputing Users' Group Meeting, June 1995.

[17] MPICH – A portable implementation of MPI. http://www.mcs.anl.gov/mpi/mpich.

[18] Rolf Rabenseifner. Effective bandwidth (b_eff) benchmark. http://www.hlrs.de/mpi/b_eff.

[19] Rolf Rabenseifner. New optimized MPI reduce algorithm. http://www.hlrs.de/organization/par/services/models/mpi/myreduce.html.

[20] Rolf Rabenseifner. Automatic MPI counter profiling of all users: First results on a CRAY T3E 900-512. In Proceedings of the Message Passing Interface Developer's and User's Conference 1999 (MPIDC '99), pages 77–85, March 1999.

[21] Rolf Rabenseifner and Gerhard Wellein. Communication and optimization aspects of parallel programming models on hybrid architectures. International Journal of High Performance Computing Applications, 17(1):49–62, 2003.

[22] Peter Sanders and Jesper Larsson Träff. The hierarchical factor algorithm for all-to-all communication. In B. Monien and R. Feldman, editors, Euro-Par 2002 Parallel Processing, pages 799–803. Lecture Notes in Computer Science 2400, Springer, August 2002.

[23] D. Scott. Efficient all-to-all communication patterns in hypercube and mesh topologies. In Proceedings of the 6th Distributed Memory Computing Conference, pages 398–403, 1991.

[24] Mohak Shroff and Robert A. van de Geijn. CollMark: MPI collective communication benchmark. Technical report, Dept. of Computer Sciences, University of Texas at Austin, December 1999.

[25] Steve Sistare, Rolf vandeVaart, and Eugene Loh. Optimization of MPI collectives on clusters of large-scale SMPs. In Proceedings of SC99: High Performance Networking and Computing, November 1999.

[26] TeraGrid. http://www.teragrid.org.

[27] V. Tipparaju, J. Nieplocha, and D. K. Panda. Fast collective operations using shared and remote memory access protocols on clusters. In Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS '03), 2003.

[28] Jesper Larsson Träff. Improved MPI all-to-all communication on a Giganet SMP cluster. In Dieter Kranzlmüller, Peter Kacsuk, Jack Dongarra, and Jens Volkert, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 9th European PVM/MPI Users' Group Meeting, pages 392–400. Lecture Notes in Computer Science 2474, Springer, September 2002.

[29] Jesper Larsson Träff. Personal communication, 2004.

[30] Sathish S. Vadhiyar, Graham E. Fagg, and Jack Dongarra. Automatically tuned collective communications. In Proceedings of SC99: High Performance Networking and Computing, November 1999.

[31] Thomas Worsch, Ralf Reussner, and Werner Augustin. On benchmarking collective MPI operations. In Dieter Kranzlmüller, Peter Kacsuk, Jack Dongarra, and Jens Volkert, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 9th European PVM/MPI Users' Group Meeting, pages 271–279. Lecture Notes in Computer Science 2474, Springer, September 2002.
