Sorting On A Mesh-Connected Parallel Computer
March 1976
This research was supported in part by the National Science Foundation under Grant MCS75-222-55 and the Office of Naval Research under Contract N00014-76-C-0370, NR 044-422.
1. Introduction
In the course of a parallel computation, individual processors will need to
distribute their results to other processors and complicated data flow problems may
arise. One way to handle this problem is by sorting "destination tags" attached to each
data element, as discussed in Batcher [1968]. Hence efficient sorting algorithms for
parallel machines with some fixed processor interconnection pattern are relevant to
almost any use of these machines.
In this paper we present two algorithms for sorting N = n^2 elements on an n×n
mesh-type processor array that require O(n) unit-distance routing steps and O(n)
comparison steps (n is assumed to be a power of 2). The best previous algorithm
takes time O(n log n) (Orcutt [1974]). One of our algorithms, the s^2-way merge sort, is
shown optimal within a factor of 2 in time for sufficiently large n, if one comparison
step takes no more than twice the time of a routing step. Our other O(n) algorithm, an
adaptation of Batcher's bitonic merge sort, is much less complex but optimal under the
same assumption to within a factor of 4.5 for all n, and is more efficient for moderate
n.
We believe that the algorithms of this paper will give the most efficient sorting
algorithms for ILLIAC IV-type parallel computers.
Our algorithms can be generalized to higher-dimensional array interconnection
patterns. For example, our second algorithm can be modified to sort N elements on a
j-dimensionally mesh-connected N-processor computer in O(N^(1/j)) time, which is optimal
within a small constant factor.
Efficient sorting algorithms have been developed for interconnection patterns
other than the "mesh" considered in this paper. Stone [1971] maps Batcher's bitonic
merge sort onto the "perfect shuffle" interconnection scheme, obtaining an N-element
sort time of Oflog^N) on N processors. The odd-even transposition sort (see
Appendix) requires an optimal 0{N) time on a linearly connected N-processor computer.
Sorting time is thus seen to be strongly dependent on the interconnection pattern.
Exploration of this dependence for a given problem is of interest from both an
architectural and an algorithmic point of view.
In Section 2 we give the model of computation. The sorting problem is defined
precisely in Section 3. A lower bound on the sorting time is given in Section 4.
Batcher's 2-way odd-even merge is mapped on our 2-dimensional mesh-connected
processor array in the next section. Generalizing the 2-way odd-even merge, we
introduce a 2s-way merge algorithm in Section 6. This is further generalized to an
s^2-way merge in Section 7, from which our most efficient sorting algorithm for large n is
developed. Section 8 shows that Batcher's bitonic sort can be performed efficiently on
our model by choosing an appropriate processor indexing scheme. Some extensions
and implications of our results are discussed in Section 9. The Appendix contains a
description of the odd-even transposition sort.
2. Model of Computation
We assume a parallel computer with N = n×n identical processors. The
architecture of the machine is similar to that of the ILLIAC IV (Barnes, et al. [1968]).
The major assumptions are as follows:
(i) The interconnections between the processors are a subset of those on the ILLIAC
IV, and are defined by the following two-dimensional array:
[Figure of the n×n processor array omitted.]
where the p's denote the processors. That is, each processor is connected to all
its neighbors. Processors at the perimeter have two or three rather than four
neighbors; there are no "wrap-around" connections as found on the ILLIAC IV.
The bounds obtained in this paper would be affected at most by a factor of
2 if "wrap-around" connections were included, but we feel that this addition would
obscure the ideas of this paper without substantially strengthening the results.
(ii) It is a SIMD (Single Instruction stream, Multiple Data stream) machine (Flynn [1966]).
During each time unit, a single instruction is broadcast to all processors, but only
executed by the set of processors specified in the instruction. For the purpose of
the paper, only two instruction types are needed: the routing instruction for
interprocessor data moves, and the comparison instruction on two data elements in
each processor. The comparison instruction is a conditional interchange on the
contents of two registers in each processor. Actually, we need both "types" of
such comparison instructions to allow either register to receive the minimum;
normally both types will be issued during "one comparison step".
(iii) Define
t_R = time required for one unit-distance routing step, i.e., moving one item
from a processor to one of its neighbors, and
t_C = time required for one comparison step.
3. The Sorting Problem

The sorting problem is considered with respect to three indexing schemes for the processors.

(i) Row-Major Indexing: After sorting we have
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

(ii) Shuffled Row-Major Indexing: After sorting we have
 0  1  4  5
 2  3  6  7
 8  9 12 13
10 11 14 15
Note that this indexing is obtained by shuffling the binary representation of the
row-major index. For example, the row-major index 5 has the binary
representation 0101. Shuffling the bits gives 0011, which is 3. (In general,
shuffling the binary number "abcdefgh" gives "aebfcgdh".)
(iii) Snake-Like Row-Major Indexing: After sorting we have
ROW 1:   0  1  2  3
ROW 2:   7  6  5  4
ROW 3:   8  9 10 11
ROW 4:  15 14 13 12
This indexing is obtained from the row-major indexing by reversing the ordering
in even rows.
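Both derived indexings can be sketched as small helper functions (hypothetical names, not part of this report; rows and columns are numbered from 0 here, so the reversed "even" rows become the odd-numbered ones):

```python
def shuffled_index(row_major, bits):
    """Shuffle the binary representation of a 2*bits-bit row-major
    index: interleave the row half with the column half, so that
    "abcdefgh" becomes "aebfcgdh"."""
    hi = row_major >> bits               # row bits
    lo = row_major & ((1 << bits) - 1)   # column bits
    result = 0
    for i in range(bits - 1, -1, -1):
        result = (result << 1) | ((hi >> i) & 1)  # next row bit
        result = (result << 1) | ((lo >> i) & 1)  # next column bit
    return result

def snake_index(row, col, n):
    """Snake-like row-major index in an n-wide array: every other
    row runs right-to-left."""
    if row % 2 == 0:
        return row * n + col
    return row * n + (n - 1 - col)

print(shuffled_index(5, 2))  # the report's example: 0101 -> 0011, i.e. 3
print(snake_index(1, 0, 4))  # first element of a reversed row: 7
```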
The choice of a particular indexing scheme depends upon how the sorted
elements will be used (or accessed), and upon which sorting algorithm is to be used.
For example, we found that the row-major indexing is poor for merge-sorting.
It is clear that the sorting problem with respect to any index scheme can be
solved by using the routing and comparison steps. We are interested in designing
algorithms which minimize the time spent in routing and comparing.
4. A Lower Bound
Observe that for any index scheme there are situations where the two elements
initially loaded at the opposite corner processors have to be transposed during the
sorting. Each of these two elements must travel at least 2(n-1) unit distances, and the
two journeys proceed in opposite directions, so under our SIMD model no routing step
can serve both. Hence at least 4(n-1) routing steps are required: no algorithm can
sort n^2 elements in less than 4(n-1)t_R time.
5. The 2-Way Odd-Even Merge

Batcher's odd-even merge of two sorted sequences may be viewed as the following
recursive procedure:
L1. Unshuffle.
L2. Merge the "odd sequences" and the "even sequences".
L3. Shuffle.
L4. Comparison-interchange of adjacent elements.
Step L3 above is the "perfect shuffle" (Stone [1971]) and step L1 is its inverse,
the "unshuffle". Note that the perfect shuffle can be achieved by using the triangular
interchange pattern below:
[Figure of the triangular interchange pattern omitted.]
where the double arrows indicate interchanges. Similarly, an inverted triangular interchange
pattern will do the unshuffle. Therefore, both the perfect shuffle and unshuffle can be
done in k-1 interchanges (i.e., 2k-2 routing steps) when performed on a row of
length 2k in our model.
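The shuffle and unshuffle operations themselves can be sketched as follows (hypothetical helpers that work on a plain list; the triangular interchange pattern is how the mesh realizes them in 2k-2 routing steps):

```python
def perfect_shuffle(xs):
    """Perfect shuffle of a row of even length 2k: interleave the
    two halves (Stone [1971])."""
    k = len(xs) // 2
    out = []
    for a, b in zip(xs[:k], xs[k:]):
        out.extend([a, b])
    return out

def unshuffle(xs):
    """Inverse of the perfect shuffle: even positions go to the
    left half, odd positions to the right half."""
    return xs[0::2] + xs[1::2]

row = list(range(8))
print(perfect_shuffle(row))              # -> [0, 4, 1, 5, 2, 6, 3, 7]
assert unshuffle(perfect_shuffle(row)) == row
```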
We now give an implementation of the odd-even merge on a rectangular region
of our model. Let M(j,k) denote our algorithm for merging two j by k/2 sorted adjacent
subarrays to form a sorted j by k array, where j, k are powers of 2, k > 1, and all the
arrays are arranged in the snake-like row-major ordering. We first consider the case
where k = 2. If j = 1, a single comparison-interchange step suffices to sort the two unit
"subarrays". Given two sorted columns of length j > 1, M(j,2) consists of the following
steps:
J1. Move all odds to the left column and all evens to the right. Time: 2t_R.
J2. Use the "odd-even transposition sort" (see Appendix) to sort each column. Time:
2j t_R + j t_C.
J3. Interchange on even rows. Time: 2t_R.
J4. One step of comparison-interchange between adjacent elements in the snake.
Time: 2t_R + t_C.
For k>2, M(j,k) is defined recursively in the following way. Steps M1 and M2
unshuffle the elements, step M3 recursively merges the "odd sequences" and the "even
sequences", steps M4 and M5 shuffle the "odds" and "evens" together, and step M6
performs the final comparison-interchange. The accompanying diagrams illustrate the
algorithm M(4,4), where the two given sorted 4 by 2 subarrays are initially stored in
16 processors as follows:
M1. Single interchange step on even rows if j>2, so that columns contain either all
evens or all odds. If j=2, do nothing; the columns are already segregated. Time:
2t_R.
[Diagrams illustrating the algorithm M(4,4) omitted.]
Let T(j,k) be the time required by M(j,k). Then
T(j,2) = (2j + 6)t_R + (j + 1)t_C,
and for k > 2,
T(j,k) = (2k + 4)t_R + t_C + T(j,k/2).
6. The 2s-Way Merge

M1'. Single interchange step on even rows if j > s, so that columns contain either all
evens or all odds. If j = s, do nothing: the columns are already segregated. Time:
2t_R.
M6'. Perform the first 2s-1 steps of an odd-even transposition sort along the snake.
Time: (2s-1)(2t_R + t_C).
Note that the original step M6 is just the first step of an odd-even transposition sort.
Thus the 2-way merge is seen to be a special case of the 2s-way merge.
Similarly, for M'(j,2,s), j > s, J4 is replaced by M6', which takes time
(2s-1)(2t_R + t_C).
M'(s,2,s) is a special case analogous to M(1,2), and may be performed by the odd-even
transposition sort (see Appendix) in time 4s t_R + 2s t_C.
The validity of this algorithm may be demonstrated by use of the 0-1 principle
(Knuth [1973], p. 224): if a network sorts all sequences of 0's and 1's, then it will sort
any arbitrary sequence of elements chosen from a linearly ordered set. Thus, we may
assume that the inputs are 0's and 1's. It is easy to check that there may be as many
as 2s more zeros on the left than on the right after unshuffling (i.e., after step J1 or
step M2). After the shuffling, the first 2s-1 steps of an odd-even transposition sort
suffice to sort the resulting array.
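The 0-1 principle check can be carried out mechanically. The sketch below is a hypothetical sequential simulation of Batcher's odd-even merge (ignoring all routing costs) that verifies the network on every 0-1 input of a fixed length:

```python
from itertools import product

def odd_even_merge(a, b):
    """Batcher's odd-even merge of two sorted lists of equal
    power-of-2 length: merge the evens, merge the odds, then one
    pass of comparison-interchanges knits the results together."""
    if len(a) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    evens = odd_even_merge(a[0::2], b[0::2])
    odds = odd_even_merge(a[1::2], b[1::2])
    out = [evens[0]]
    for o, e in zip(odds[:-1], evens[1:]):
        out += [min(o, e), max(o, e)]   # final comparison-interchange
    out.append(odds[-1])
    return out

# By the 0-1 principle, checking all 0-1 inputs suffices to show
# the network sorts arbitrary inputs of this length.
assert all(
    odd_even_merge(sorted(a), sorted(b)) == sorted(a + b)
    for a in product([0, 1], repeat=4)
    for b in product([0, 1], repeat=4)
)
```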
Let T'(j,k,s) be the time required by the algorithm M'(j,k,s). Then we have that
T'(j,2,s) = (2j + 4s + 2)t_R + (j + 2s - 1)t_C,
and for general k the comparison time of M'(j,k,s) is
(j + 2s log k + O(s + log k))t_C.
For s = 2, a merge sort may be derived that has the following time behavior:
S'(n,n) = S'(n/2,n/2) + T'(n,n,2).
Thus,
S'(n,n) = (12n + O(log n))t_R + (2n + O(log^2 n))t_C.
7. The s^2-Way Merge Sort

The 2s-way merge generalizes further to an s^2-way merge M''(j,k,s), in which step
M1' is modified and step M6' is replaced by a step M6'' taking time
(s/2)(4t_R + t_C) + (s/2 - 1)(2t_R + t_C) = (3s - 2)t_R + (s - 1)t_C.
The motivation for this new step comes from the realization that when the inputs are
0's and 1's, there may be as many as s^2 more zeros on the left half than on the right after
unshuffling.
Therefore,
T''(j,k,s) = (4k + 2j + 3s log(k/s) + O((s + j/s) log s + log k))t_R
           + (j + O((s + j/s) log s + log k))t_C.
A sorting algorithm may be developed from the s^2-way merge; a good value for
s is approximately n^(1/3) (remember that s must be a power of 2). Then the time for
sorting the n×n array satisfies
S''(n,n) = S''(n/s, n/s) + T''(n, n, s).
Theorem 7.1
If the snake-like row-major indexing is used, the sorting problem can be done in
time
(6n + O(n^(2/3) log n))t_R + (n + O(n^(2/3) log n))t_C.
8. The Bitonic Merge Sort
In this section we shall show that Batcher's bitonic merge algorithm (Batcher
[1968], Knuth [1973, pp. 232-237]) lends itself well to sorting on a mesh-connected
parallel computer, once the proper indexing scheme has been selected. Two indexing
schemes will be considered, the "row-major" and the "shuffled row-major" indexing
schemes defined in Section 3.
The bitonic merge of two sections of a bitonic array of j/2 elements each takes
log j passes, where pass i consists of a comparison-interchange between processors
with indices differing only in the i-th bit of their binary representations. (This
operation will be termed "comparison-interchange on the i-th bit".) Sorting an entire
array of 2^k elements by the bitonic method requires k comparison-interchanges on the
0th bit (the least significant bit), k-1 comparison-interchanges on the first bit, ..., k-i
comparison-interchanges on the i-th bit, ..., and 1 comparison-interchange on the most
significant bit. For any fixed indexing scheme, in general a comparison-interchange on
the i-th bit will take a different amount of time than when done on the j-th bit: an
optimal processor indexing scheme for the bitonic algorithm minimizes the time spent
on comparison-interchange steps. A necessary condition for optimality is that a
comparison-interchange on the j-th bit be no more expensive than one on the (j+1)-th
bit for all j. If this were not the case for some j, then a better indexing scheme could
immediately be derived from the supposedly optimal one by interchanging the j-th and
the (j+1)-th bits of all processor indices (since more comparison-interchanges will be
done on the original j-th bit than on the (j+1)-th bit).
The bitonic algorithm has been analyzed for the row-major indexing scheme: it
takes
O(n log n)t_R + O(log^2 n)t_C
(Orcutt [1974]). The shuffled row-major indexing does better because it keeps the
sub-arrays merged by the bitonic sort nearly square, so that routing distances are never
excessive in either direction. The standard row-major indexing causes the bitonic sort
to contend with sub-arrays that are always at least as wide as they are tall; the
aspect ratio can be as high as n on an n×n processor array.
Programming the bitonic sort would be a little tricky, as the "direction" of a
comparison-interchange step depends on the processor index. Orcutt [1974] covers
these gory details for row-major indexing; his algorithm may easily be modified to
handle the shuffled row-major indexing scheme. Here is an example of the bitonic
merge sort on a 4×4 processor array for the shuffled row-major indexing; the
comparison "directions" were derived from the following diagram (Knuth [1973], p.
237):
[Diagram of the 16-element bitonic sorting network (Knuth [1973], p. 237), and the
first diagrams of the example, omitted.]
Stage 2. Merge pairs of 1×2 matrices; note that one member of a pair is sorted in
ascending order, the other in descending order. This will always be the case in
any bitonic merge. Time: 4t_R + 2t_C.
Stage 3. Merge pairs of 2×2 matrices. Time: 8t_R + 3t_C.
[Stage 3 and Stage 4 diagrams omitted.]
Let T'''(2^i) be the time to merge the bitonically sorted elements in processors 0
through 2^i - 1, where the shuffled row-major indexing is used. Then after one pass of
comparison-interchange, which takes time 2^⌈i/2⌉ t_R + t_C, the problem is reduced to the
bitonic merge of the elements in processors 0 through 2^(i-1) - 1, and that of the elements in
processors 2^(i-1) through 2^i - 1. It may be observed that the latter two merges can be done
concurrently. Thus we have
T'''(1) = 0,
T'''(2^i) = T'''(2^(i-1)) + 2^⌈i/2⌉ t_R + t_C.
Solving this recurrence yields
T'''(2^i) = (3·2^((i+1)/2) - 4)t_R + i t_C, if i is odd, and
T'''(2^i) = (4·2^(i/2) - 4)t_R + i t_C, if i is even.
Let S'''(2^i) be the time taken by the corresponding sorting algorithm (for a square
array). Then
S'''(1) = 0,
S'''(2^i) = S'''(2^(i-1)) + T'''(2^i).
Hence, for n = 2^j,
S'''(n^2) = S'''(2^(2j)) = (14(2^j - 1) - 8j)t_R + (2j^2 + j)t_C,
which establishes the following theorem.
Theorem 8.1
If the shuffled row-major indexing is used, the bitonic sort can be done in time
(14(n-1) - 8 log n)t_R + (2 log^2 n + log n)t_C.
If t_C ≤ 2t_R, it may be seen that the bitonic merge sort algorithm is optimal to
within a factor of 4.5 for all n (since 4(n-1)t_R time is necessary, as shown in Section
4). Preliminary investigation indicates that the bitonic merge sort is faster than the
s^2-way odd-even merge sort for n < 512, under the assumption that t_C = 2t_R.
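For reference, the bitonic merge sort can be sketched sequentially (a hypothetical simulation that ignores the mesh and its routing costs; the network above distributes exactly these comparison-interchanges over the processor array):

```python
def bitonic_merge(xs, ascending):
    """Merge a bitonic sequence: one pass of comparison-interchange
    on the most significant bit, then two half-merges that the mesh
    version would run concurrently."""
    if len(xs) <= 1:
        return list(xs)
    half = len(xs) // 2
    xs = list(xs)
    for i in range(half):
        if (xs[i] > xs[i + half]) == ascending:
            xs[i], xs[i + half] = xs[i + half], xs[i]
    return (bitonic_merge(xs[:half], ascending)
            + bitonic_merge(xs[half:], ascending))

def bitonic_sort(xs, ascending=True):
    """Sort the two halves in opposite directions (yielding a
    bitonic sequence), then bitonically merge them."""
    if len(xs) <= 1:
        return list(xs)
    half = len(xs) // 2
    return bitonic_merge(bitonic_sort(xs[:half], True)
                         + bitonic_sort(xs[half:], False), ascending)

print(bitonic_sort([15, 3, 8, 0, 12, 7, 1, 9]))  # -> [0, 1, 3, 7, 8, 9, 12, 15]
```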
9. Extensions and Implications
Lemma 9.1
If N = n×n elements have already been sorted with respect to some index function,
and if each processor can store n elements, then the N elements can be sorted
with respect to any other index function by using an additional 4(n-1)t_R units of
time.
The proof follows from the fact that all elements can be moved to their
destinations by four sweeps of n-1 routing steps, one in each of the four directions.
(ii) If the processors are connected in a k×m rectangular array, instead of a square
array, similar results can still be obtained. For example, corresponding to
Theorem 7.1, we have
Theorem 9.1
If the snake-like row-major indexing is used, the sorting problem for a k×m
processor array (k, m powers of 2) can be done in time
(4m + 2k + O(h^(2/3) log h))t_R + (k + O(h^(2/3) log h))t_C.
(iii) The number of elements to be sorted could be larger than N, the number of
processors. An efficient means of handling this situation is to distribute an
approximately equal number of elements to each processor initially and to use a
merge-splitting operation for each comparison-interchange operation. This idea is
discussed by Knuth [1973, Exercise 5.3.4-38], and used by Baudet and Stevenson
[1975]. Baudet and Stevenson's results will be immediately improved if the
algorithms of this paper are used, since they used Orcutt's 0(n log n) algorithm.
(iv) Our algorithms can be generalized to j-dimensional array interconnection patterns
by adapting Batcher's bitonic merge sort algorithm to the "j-way shuffled row-major
ordering". This new ordering is derived from the binary representation of
the row-major indexing by a j-way bit shuffle. For example, if j = 3 and
Z2 Z1 Z0 Y2 Y1 Y0 X2 X1 X0
is the row-major index, then the j-way shuffled index is
Z2 Y2 X2 Z1 Y1 X1 Z0 Y0 X0.
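The j-way bit shuffle can be sketched as follows (hypothetical helper, not from the report; for j = 2 it reduces to the shuffled row-major indexing of Section 3):

```python
def j_way_shuffle(index, j, digits):
    """j-way shuffle of a (j*digits)-bit row-major index.

    The index is viewed as j fields of `digits` bits each
    (e.g. Z2Z1Z0 Y2Y1Y0 X2X1X0 for j = 3, digits = 3); the shuffled
    index interleaves one bit from each field in turn
    (Z2Y2X2 Z1Y1X1 Z0Y0X0)."""
    mask = (1 << digits) - 1
    # Split the index into j fields, most significant field first.
    fields = [(index >> (digits * (j - 1 - f))) & mask for f in range(j)]
    result = 0
    for b in range(digits - 1, -1, -1):   # from high bit to low
        for f in fields:                  # Z, then Y, then X, ...
            result = (result << 1) | ((f >> b) & 1)
    return result

# For j = 2 this is the ordinary bit shuffle: 0101 -> 0011.
print(j_way_shuffle(0b0101, 2, 2))  # -> 3
```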
These bounds may be derived in the following way. The t_C term is not
dimension-dependent: the same number of comparisons is performed in any mapping of the
bitonic sort onto an N-processor array. The t_R term is the solution of
Σ_{1≤i≤log n} 2^i Σ_{1≤k≤j} (log N - (ij + k)).
Theorem 9.2
If N processors are j-dimensionally mesh-connected, then the bitonic sort
can be performed in time O(N^(1/j)), using the j-way shuffled row-major index
scheme.
By using the argument of Section 4, one can easily check that the bound in the
theorem is asymptotically optimal for large N.
Appendix: The Odd-Even Transposition Sort

The odd-even transposition sort (Knuth [1973], p. 241) may be mapped onto our
2-dimensional arrays with snake-like row-major ordering in the following way. Given N
processors, each initially loaded with a data value, repeat N/2 times:
O1. Comparison-interchange the contents of each odd-indexed processor with those of
its successor in the snake.
O2. Comparison-interchange the contents of each even-indexed processor with those of
its successor in the snake.
If the jk elements of a j×k array are sorted, one per processor, by the odd-even
transposition sort into the snake-like row-major ordering, then the time required is
T_OE(j,k) = 0,                 if jk = 1,
          = 2t_R + t_C,        if jk = 2,
          = jk(2t_R + t_C),    if j = 1 or k = 2,
          = jk(3t_R + t_C),    otherwise.
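The appendix's procedure can be sketched as a hypothetical sequential simulation on a j×k grid (each repetition performs one O1 pass and one O2 pass along the snake; routing costs are ignored):

```python
def odd_even_transposition_sort(grid):
    """Odd-even transposition sort along the snake-like row-major
    order of a j x k grid: N/2 repetitions, each consisting of two
    passes of comparison-interchanges between snake-neighbours."""
    j, k = len(grid), len(grid[0])
    # Grid cells listed in snake order: every other row reversed.
    snake = [(r, c if r % 2 == 0 else k - 1 - c)
             for r in range(j) for c in range(k)]
    n = j * k
    for _ in range((n + 1) // 2):
        for start in (0, 1):                 # the two alternating passes
            for i in range(start, n - 1, 2):
                (r1, c1), (r2, c2) = snake[i], snake[i + 1]
                if grid[r1][c1] > grid[r2][c2]:
                    grid[r1][c1], grid[r2][c2] = grid[r2][c2], grid[r1][c1]
    return grid

# A 4x4 example: the result reads 0..15 along the snake.
print(odd_even_transposition_sort(
    [[15, 3, 8, 0], [12, 7, 1, 9], [14, 2, 6, 11], [5, 10, 13, 4]]))
```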
References
Barnes, G. H., et al. [1968]. "The ILLIAC IV Computer," IEEE Trans. on Computers,
Vol. C-17, pp. 746-757.
Batcher, K. E. [1968]. "Sorting Networks and Their Applications," Proc. AFIPS Spring
Joint Computer Conference, Vol. 32, pp. 307-314.
Baudet, G., and Stevenson, D. [1975]. "Optimal Sorting Algorithms for Parallel
Computers," Computer Science Department report, Carnegie-Mellon University,
May 1975.
Flynn, M. J. [1966]. "Very High-Speed Computing Systems," Proc. of the IEEE, Vol. 54,
pp. 1901-1909.
Knuth, D. E. [1973]. The Art of Computer Programming: Vol. 3, Sorting and Searching.
Reading, Mass.: Addison-Wesley.
Orcutt, S. E. [1974]. Computer Organization and Algorithms for Very-High Speed
Computations. Ph.D. Thesis, Stanford University, September 1974, Chap. 2, pp.
20-23.
Stone, H. S. [1971]. "Parallel Processing with the Perfect Shuffle," IEEE Trans. on
Computers, Vol. C-20, pp. 153-161.