
Carnegie Mellon University
Research Showcase @ CMU

Computer Science Department, School of Computer Science
1976

Sorting on a mesh-connected parallel computer

C. D. Thompson, Carnegie Mellon University
H. T. Kung

Follow this and additional works at: http://repository.cmu.edu/compsci

This Technical Report is brought to you for free and open access by the School of Computer Science at Research Showcase @ CMU. It has been accepted for inclusion in Computer Science Department by an authorized administrator of Research Showcase @ CMU. For more information, please contact [email protected].


SORTING ON A MESH-CONNECTED PARALLEL COMPUTER

C. D. Thompson and H. T. Kung

March 1976

Department of Computer Science
Carnegie-Mellon University
Pittsburgh, PA 15213

ABSTRACT

Two algorithms are presented for sorting n^2 elements on an n x n mesh-connected processor array that require O(n) routing and comparison steps. The best previous algorithm takes time O(n log n). The algorithms of this paper are shown to be optimal in time within small constant factors. Extensions to higher-dimensional mesh-connected processor arrays are also given.

This research was supported in part by the National Science Foundation under Grant MCS75-222-55 and the Office of Naval Research under Contract N00014-76-C-0370, NR 044-422.

1. Introduction

In the course of a parallel computation, individual processors will need to distribute their results to other processors, and complicated data flow problems may arise. One way to handle this problem is by sorting "destination tags" attached to each data element, as discussed in Batcher [1968]. Hence efficient sorting algorithms for parallel machines with some fixed processor interconnection pattern are relevant to almost any use of these machines.

In this paper we present two algorithms for sorting N = n^2 elements on an n x n mesh-type processor array that require O(n) unit-distance routing steps and O(n) comparison steps (n is assumed to be a power of 2). The best previous algorithm takes time O(n log n) (Orcutt [1974]). One of our algorithms, the s^2-way merge sort, is shown optimal within a factor of 2 in time for sufficiently large n, if one comparison step takes no more than twice the time of a routing step. Our other O(n) algorithm, an adaptation of Batcher's bitonic merge sort, is much less complex but optimal under the same assumption to within a factor of 4.5 for all n, and is more efficient for moderate n.

We believe that the algorithms of this paper will give the most efficient sorting algorithms for ILLIAC IV-type parallel computers.

Our algorithms can be generalized to higher-dimensional array interconnection patterns. For example, our second algorithm can be modified to sort N elements on a j-dimensionally mesh-connected N-processor computer in O(N^(1/j)) time, which is optimal within a small constant factor.

Efficient sorting algorithms have been developed for interconnection patterns other than the "mesh" considered in this paper. Stone [1971] maps Batcher's bitonic merge sort onto the "perfect shuffle" interconnection scheme, obtaining an N-element sort time of O(log^2 N) on N processors. The odd-even transposition sort (see Appendix) requires an optimal O(N) time on a linearly connected N-processor computer. Sorting time is thus seen to be strongly dependent on the interconnection pattern. Exploration of this dependence for a given problem is of interest from both an architectural and an algorithmic point of view.

In Section 2 we give the model of computation. The sorting problem is defined precisely in Section 3. A lower bound on the sorting time is given in Section 4. Batcher's 2-way odd-even merge is mapped onto our 2-dimensional mesh-connected processor array in the next section. Generalizing the 2-way odd-even merge, we introduce a 2s-way merge algorithm in Section 6. This is further generalized to an s^2-way merge in Section 7, from which our most efficient sorting algorithm for large n is developed. Section 8 shows that Batcher's bitonic sort can be performed efficiently on our model by choosing an appropriate processor indexing scheme. Some extensions and implications of our results are discussed in Section 9. The Appendix contains a description of the odd-even transposition sort.

2. Model of Computation

We assume a parallel computer with N = n x n identical processors. The architecture of the machine is similar to that of the ILLIAC IV (Barnes et al. [1968]). The major assumptions are as follows:

(i) The interconnections between the processors are a subset of those on the ILLIAC IV, and are defined by a two-dimensional array of processors: each processor is connected to all its neighbors. Processors at the perimeter have two or three rather than four neighbors; there are no "wrap-around" connections as found on the ILLIAC IV. The bounds obtained in this paper would be affected at most by a factor of 2 if "wrap-around" connections were included, but we feel that this addition would obscure the ideas of this paper without substantially strengthening the results.

(ii) It is a SIMD (Single Instruction stream Multiple Data stream) machine (Flynn [1966]). During each time unit, a single instruction is broadcast to all processors, but executed only by the set of processors specified in the instruction. For the purpose of this paper, only two instruction types are needed: the routing instruction for interprocessor data moves, and the comparison instruction on two data elements in each processor. The comparison instruction is a conditional interchange on the contents of two registers in each processor. Actually, we need both "types" of such comparison instructions, to allow either register to receive the minimum; normally both types will be issued during "one comparison step".

(iii) Define

    tR = time required for one unit-distance routing step, i.e., moving one item from a processor to one of its neighbors;
    tC = time required for one comparison step.

For example, a comparison-interchange step between two items in adjacent processors can be done in time 2tR + tC. Of course, concurrent data movement is allowed, so long as it is all in the same direction; and any number (up to N) of concurrent comparisons may be performed simultaneously.

3. The Sorting Problem

The processors may be indexed by any function that is a one-to-one mapping from {1,2,...,n} x {1,2,...,n} onto {0,1,...,N-1}. Assume that N elements from a linearly ordered set are initially loaded in the N processors, each receiving exactly one element. With respect to any index function, the sorting problem is defined to be the problem of moving the jth smallest element to the processor indexed by j, for all j = 0, 1, ..., N-1.

Example 3.1.

Suppose that n = 4 (hence N = 16) and that we want to sort 16 elements initially loaded as follows:
[4 x 4 grid of the 16 input elements; the figure did not survive scanning.]

Three ways of indexing the processors will be considered in this paper.

(i) Row-Major Indexing: After sorting we have

     0  1  2  3
     4  5  6  7
     8  9 10 11
    12 13 14 15
(ii) Shuffled Row-Major Indexing: After sorting we have

     0  1  4  5
     2  3  6  7
     8  9 12 13
    10 11 14 15

Note that this indexing is obtained by shuffling the binary representation of the row-major index. For example, the row-major index 5 has the binary representation 0101. Shuffling the bits gives 0011, which is 3. (In general, the shuffle of a binary number, say "abcdefgh", is "aebfcgdh".)
(iii) Snake-Like Row-Major Indexing: After sorting we have

     0  1  2  3    (row 1)
     7  6  5  4    (row 2)
     8  9 10 11    (row 3)
    15 14 13 12    (row 4)

This indexing is obtained from the row-major indexing by reversing the ordering in even rows.
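The three index functions can be sketched in Python (a modern illustration, not part of the original report; the function names are ours, and rows and columns are 0-based):

```python
def row_major(r, c, n):
    # Index = row number followed by column number.
    return r * n + c

def shuffled_row_major(r, c, n):
    # Interleave the bits of the row and column numbers, row bit first,
    # from most significant down: "abcd"/"efgh" -> "aebfcgdh".
    bits = n.bit_length() - 1          # log2(n)
    idx = 0
    for b in range(bits - 1, -1, -1):
        idx = (idx << 1) | ((r >> b) & 1)
        idx = (idx << 1) | ((c >> b) & 1)
    return idx

def snake_like_row_major(r, c, n):
    # Reverse the ordering in every other row.
    return r * n + (c if r % 2 == 0 else n - 1 - c)
```

For instance, the processor at row 1, column 1 of a 4 x 4 array has row-major index 5 and shuffled row-major index 3, matching the text's example.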

The choice of a particular indexing scheme depends upon how the sorted
elements will be used (or accessed), and upon which sorting algorithm is to be used.
For example, we found that the row-major indexing is poor for merge-sorting.
It is clear that the sorting problem with respect to any index scheme can be
solved by using the routing and comparison steps. We are interested in designing
algorithms which minimize the time spent in routing and comparing.

4. A Lower Bound

Observe that for any index scheme there are situations where the two elements initially loaded at the opposite corner processors have to be transposed during the sorting. It is easy to argue that even for this simple transposition we need at least 4(n-1) unit-distance routing steps. This implies that no algorithm can sort n^2 elements in time less than O(n). In this paper, we shall show two algorithms which can sort n^2 elements in time O(n). One will be developed in Sections 5 through 7, the other in Section 8.

5. The 2-Way Odd-Even Merge

Batcher's odd-even merge (Batcher [1968], Knuth [1973, pp. 224-226]) of two sorted sequences {u(i)} and {v(i)} is performed in two stages. First, the "odd sequences" {u(1),u(3),u(5),...,u(2j+1),...} and {v(1),v(3),...,v(2j+1),...} are merged concurrently with the merging of the "even sequences" {u(2),u(4),...,u(2j),...} and {v(2),v(4),...,v(2j),...}. Then the two merged sequences are interleaved, and a single parallel comparison-interchange step produces the sorted result. The merges in the first stage are done in the same manner (i.e., recursively).

We first illustrate how the odd-even merge can be performed efficiently on linearly connected processors; the idea is then generalized to 2-dimensionally connected arrays. If two sorted sequences {1, 3, 4, 6} and {0, 2, 5, 7} are initially loaded in 8 linearly-connected processors, then Batcher's odd-even merge can be diagrammed as:

L1. Unshuffle: odd-indexed elements to the left, even-indexed elements to the right.

L2. Merge the "odd sequences" and the "even sequences".

L3. Shuffle.

L4. Comparison-interchange (the C's indicate comparison-interchanges).

[The step-by-step diagrams did not survive scanning.]

Step L3 above is the "perfect shuffle" (Stone [1971]) and step L1 is its inverse, the "unshuffle". Note that the perfect shuffle can be achieved by using a triangular interchange pattern (diagram lost in scanning), where the <-> indicate interchanges. Similarly, an inverted triangular interchange pattern will do the unshuffle. Therefore, both the perfect shuffle and the unshuffle can be done in k-1 interchanges (i.e., 2k-2 routing steps) when performed on a row of length 2k in our model.
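Steps L1 through L4 can be sketched in Python (our own illustration, with plain lists standing in for the row of processors; the function name is hypothetical):

```python
def odd_even_merge(a, b):
    # Batcher's odd-even merge of two sorted lists of equal power-of-2 length.
    if len(a) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    # L1/L2. Unshuffle, then recursively merge the "odd sequences"
    # (u(1),u(3),... in the text's 1-based notation) and "even sequences".
    odd = odd_even_merge(a[0::2], b[0::2])
    even = odd_even_merge(a[1::2], b[1::2])
    # L3. Shuffle (interleave) the two merged halves.
    merged = [x for pair in zip(odd, even) for x in pair]
    # L4. One parallel comparison-interchange step on adjacent pairs.
    for i in range(1, len(merged) - 1, 2):
        if merged[i] > merged[i + 1]:
            merged[i], merged[i + 1] = merged[i + 1], merged[i]
    return merged
```

With the text's example, `odd_even_merge([1, 3, 4, 6], [0, 2, 5, 7])` yields the sorted sequence 0 through 7.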
We now give an implementation of the odd-even merge on a rectangular region of our model. Let M(j,k) denote our algorithm for merging two j by k/2 sorted adjacent subarrays to form a sorted j by k array, where j, k are powers of 2, k > 1, and all the arrays are arranged in the snake-like row-major ordering. We first consider the case where k = 2. If j = 1, a single comparison-interchange step suffices to sort the two unit "subarrays". Given two sorted columns of length j > 1, M(j,2) consists of the following steps:

J1. Move all odds to the left column and all evens to the right. Time: 2tR.

J2. Use the "odd-even transposition sort" (see Appendix) to sort each column. Time: 2jtR + jtC.

J3. Interchange on even rows. Time: 2tR.

J4. One step of comparison-interchange (every "even" with the next "odd"). Time: 2tR + tC.

[The diagram illustrating the algorithm M(j,2) did not survive scanning.]

For k > 2, M(j,k) is defined recursively in the following way. Steps M1 and M2 unshuffle the elements, step M3 recursively merges the "odd sequences" and the "even sequences", steps M4 and M5 shuffle the "odds" and "evens" together, and step M6 performs the final comparison-interchange. The accompanying diagrams (lost in scanning) illustrate the algorithm M(4,4), where the two given sorted 4 by 2 subarrays are initially stored in 16 processors.

M1. Single interchange step on even rows if j > 2, so that columns contain either all evens or all odds. If j = 2, do nothing; the columns are already segregated. Time: 2tR.

M2. Unshuffle each row. Time: (k-2)tR.

M3. Merge by calling M(j,k/2) on each half. Time: T(j,k/2).

M4. Shuffle each row. Time: (k-2)tR.

M5. Interchange on even rows. Time: 2tR.

M6. One step of comparison-interchange. Time: 2tR + tC.

Let T(j,k) be the time needed by M(j,k). Then we have

    T(j,2) = (2j + 6)tR + (j + 1)tC,

and for k > 2,

    T(j,k) = (2k + 4)tR + tC + T(j,k/2).

These imply that

    T(j,k) <= (2j + 4k + 4 log k)tR + (j + log k)tC.

(All logarithms in this paper are taken to base 2.)
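As a quick numerical sanity check (our own, not in the original report; tR = tC = 1 is assumed), the recurrence can be compared against the closed-form bound:

```python
from math import log2

def T(j, k):
    # Recurrence for the merge time T(j,k), with tR = tC = 1.
    if k == 2:
        return (2 * j + 6) + (j + 1)        # (2j+6)tR + (j+1)tC
    return (2 * k + 4) + 1 + T(j, k // 2)   # (2k+4)tR + tC + T(j,k/2)

def bound(j, k):
    # Closed-form bound (2j + 4k + 4 log k)tR + (j + log k)tC.
    return (2 * j + 4 * k + 4 * log2(k)) + (j + log2(k))

# The recurrence never exceeds the bound.
assert all(T(j, k) <= bound(j, k)
           for j in (2, 4, 16, 64) for k in (2, 8, 64))
```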


An n x n sort may be composed of M(j,k) by sorting all columns in O(n) routes and compares (by, say, the odd-even transposition sort), then using M(n,2), M(n,4), M(n,8), ..., M(n,n), for a total of O(n log n) routes and compares. This poor performance may be assigned to two inefficiencies in the algorithm. First, the recursive subproblems (M(n,n/2), M(n,n/4), ..., M(n,2)) generated by M(n,n) are not decreasing in size along both dimensions: they are all O(n) in complexity. Second, the method is extremely "local" in the sense that no comparisons are made between elements initially in different halves of the array until the last possible moment, when each half has already been independently sorted.

The first inefficiency can be attacked by designing an "upwards" merge to complement the "sideways" merge just described. Even more powerful is the idea of combining many upwards merges with a sideways one. This idea is used in the next section.


6. The 2s-Way Merge

In this section we give an algorithm M'(j,k,s) for merging 2s arrays of size j/s by k/2 in a j by k region of our processors, where j, k, s are powers of 2, s > 1, and the arrays are in the snake-like row-major ordering. The algorithm M'(j,k,s) is almost the same as the algorithm M(j,k) described in the previous section, except that M'(j,k,s) requires a few more comparison-interchanges during step M6. These steps are exactly those performed in the initial portion of the odd-even transposition sort mapped onto our "snake" (see Appendix). More precisely, for k > 2, M1 and M6 are replaced by

M1'. Single interchange step on even rows if j > s, so that columns contain either all evens or all odds. If j = s, do nothing: the columns are already segregated. Time: 2tR.

M6'. Perform the first 2s-1 parallel comparison-interchange steps of the odd-even transposition sort on the "snake". It is not difficult to see that the time needed is at most

    s(4tR + tC) + (s-1)(2tR + tC) = (6s-2)tR + (2s-1)tC.

Note that the original step M6 is just the first step of an odd-even transposition sort. Thus the 2-way merge is seen to be a special case of the 2s-way merge.

Similarly, for M'(j,2,s), j > s, J4 is replaced by M6', which takes time (2s-1)(2tR + tC). M'(s,2,s) is a special case analogous to M(1,2), and may be performed by the odd-even transposition sort (see Appendix) in time 4stR + 2stC.

The validity of this algorithm may be demonstrated by use of the 0-1 principle (Knuth [1973], p. 224): if a network sorts all sequences of 0's and 1's, then it will sort any arbitrary sequence of elements chosen from a linearly ordered set. Thus, we may assume that the inputs are 0's and 1's. It is easy to check that there may be as many as 2s more zeros on the left than on the right after unshuffling (i.e., after step J1 or step M2). After the shuffling, the first 2s-1 steps of an odd-even transposition sort suffice to sort the resulting array.
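The 0-1 principle can be demonstrated by brute force on a small comparator network (our own example, a 4-element odd-even transposition network, rather than a network from the text):

```python
from itertools import product, permutations

# Comparator pairs per parallel step of a 4-element
# odd-even transposition network.
steps = [[(0, 1), (2, 3)], [(1, 2)], [(0, 1), (2, 3)], [(1, 2)]]

def run(network, data):
    # Apply each comparator as a conditional interchange.
    data = list(data)
    for step in network:
        for i, j in step:
            if data[i] > data[j]:
                data[i], data[j] = data[j], data[i]
    return data

# The network sorts every 0-1 input ...
assert all(run(steps, bits) == sorted(bits)
           for bits in product((0, 1), repeat=4))
# ... and therefore, as the 0-1 principle guarantees, every input.
assert all(run(steps, p) == sorted(p) for p in permutations(range(4)))
```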
Let T'(j,k,s) be the time required by the algorithm M'(j,k,s). Then we have that

    T'(j,2,s) = (2j + 4s + 2)tR + (j + 2s - 1)tC,

and that for k > 2,

    T'(j,k,s) <= (2k + 6s - 2)tR + (2s - 1)tC + T'(j,k/2,s).

These imply that

    T'(j,k,s) = (2j + 4k + (6s) log k + O(s + log k))tR + (j + (2s) log k + O(s + log k))tC.

For s = 2, a merge sort may be derived that has the following time behavior:

    S'(n,n) = S'(n/2,n/2) + T'(n,n,2).

Thus,

    S'(n,n) = (12n + O(log n))tR + (2n + O(log^2 n))tC.

Suddenly, we have an algorithm that sorts in linear time. In the following section, the constants will be reduced by a factor of 2 by the use of a more complicated multi-way merge algorithm.

7. The s^2-Way Merge

The s^2-way merge M''(j,k,s), to be introduced in this section, is a generalization of the 2-way merge M(j,k). Input to M''(j,k,s) is s^2 sorted j/s by k/s arrays in a j by k region of our processors, where j, k, s are powers of 2 and s > 1. Steps M1 and M2 still suffice to move odd-indexed elements to the left and evens to the right so long as j > s and k > s; M''(j,s,s) is a special case analogous to M(j,2) of the 2-way merge. Steps M1 and M6 are now replaced by

M1''. Single interchange step on even rows if j > s, so that columns contain either all evens or all odds. If j = s, do nothing: the columns are already segregated. Time: 2tR.

M6''. Perform the first s^2 - 1 parallel comparison-interchange steps of the odd-even transposition sort on the "snake" (see Appendix). The time required for this is

    (s^2/2)(4tR + tC) + (s^2/2 - 1)(2tR + tC) = (3s^2 - 2)tR + (s^2 - 1)tC.

The motivation for this new step comes from the realization that when the inputs are 0's and 1's, there may be as many as s^2 more zeros on the left half than on the right after unshuffling.

M''(j,s,s), j >= s, can be performed in the following way:

N1. log(s/2) successive 2-way merges: M(j/s,2), M(j/s,4), ..., M(j/s,s/2).

N2. A single 2s-way merge: M'(j,s,s).

If T''(j,k,s) is the time taken by M''(j,k,s), we have for k = s

    T''(j,s,s) = (2j + O((s + j/s) log s))tR + (j + O((s + j/s) log s))tC,

and for k > s,

    T''(j,k,s) = (2k + 3s^2 + O(1))tR + (s^2 + O(1))tC + T''(j,k/2,s).

Therefore,

    T''(j,k,s) = (4k + 2j + 3s^2 log(k/s) + O((s + j/s) log s + log k))tR
               + (j + s^2 log(k/s) + O((s + j/s) log s + log k))tC.

A sorting algorithm may be developed from the s^2-way merge; a good value for s is approximately n^(1/3) (remember that s must be a power of 2). Then the time of sorting n x n elements satisfies

    S''(n,n) = S''(n/s,n/s) + T''(n,n,s).

This leads immediately to the following result.

Theorem 7.1
If the snake-like row-major indexing is used, the sorting problem can be done in time

    (6n + O(n^(2/3) log n))tR + (n + O(n^(2/3) log n))tC.

If tC <= 2tR, Theorem 7.1 implies that (6n + 2n + O(n^(2/3) log n))tR is sufficient time for sorting. In Section 4, we showed that 4(n-1)tR time is necessary. Thus, for large enough n, the s^2-way algorithm is optimal to within a factor of 2. Preliminary investigation indicates that a careful implementation of the s^2-way merge sort is optimal within a factor of 7 for all n, under the assumption that tC <= 2tR.

8. The Bitonic Merge

In this section we shall show that Batcher's bitonic merge algorithm (Batcher [1968], Knuth [1973, pp. 232-237]) lends itself well to sorting on a mesh-connected parallel computer, once the proper indexing scheme has been selected. Two indexing schemes will be considered, the "row-major" and the "shuffled row-major" indexing schemes defined in Section 3.

The bitonic merge of two sections of a bitonic array of j/2 elements each takes log j passes, where pass i consists of a comparison-interchange between processors with indices differing only in the ith bit of their binary representations. (This operation will be termed "comparison-interchange on the ith bit".) Sorting an entire array of 2^k elements by the bitonic method requires k comparison-interchanges on the 0th bit (the least significant bit), k-1 comparison-interchanges on the first bit, ..., (k-i) comparison-interchanges on the ith bit, ..., and 1 comparison-interchange on the most significant bit. For any fixed indexing scheme, a comparison-interchange on the ith bit will in general take a different amount of time than one on the jth bit: an optimal processor indexing scheme for the bitonic algorithm minimizes the time spent on comparison-interchange steps. A necessary condition for optimality is that a comparison-interchange on the jth bit be no more expensive than one on the (j+1)st bit, for all j. If this were not the case for some j, then a better indexing scheme could immediately be derived from the supposedly optimal one by interchanging the jth and the (j+1)st bits of all processor indices (since more comparison-interchanges are done on the jth bit than on the (j+1)st bit).

The bitonic algorithm has been analyzed for the row-major indexing scheme: it takes

    O(n log n)tR + O(log^2 n)tC

time to sort n^2 elements on n^2 processors (Orcutt [1974]). However, the row-major indexing scheme is decidedly non-optimal. For the case N = 64 (n = 8), processor indices have six bits. A comparison-interchange on bit 0 takes just 2tR + tC, for the processors are horizontally adjacent. A comparison-interchange on bit 1 takes 4tR + tC, since the processors are two units apart. Similarly, a comparison-interchange on bit 2 takes 8tR + tC, but a comparison-interchange on bit 3 takes only 2tR + tC, because the processors are vertically adjacent. This phenomenon may be analyzed by considering the row-major index as the concatenation of a Y and an X binary vector: in the case N = 64, the index is Y2 Y1 Y0 X2 X1 X0. A comparison-interchange on Xi takes more time than one on Xj when i > j; however, a comparison-interchange on Yi takes exactly the same time as one on Xi. Thus a better indexing scheme may be derived by "shuffling" the X and Y vectors, obtaining (in the case N = 64) Y2 X2 Y1 X1 Y0 X0; this "shuffled row-major" indexing scheme satisfies our condition for optimality.

Geometrically, "shuffling" the X and Y vectors ensures that all arrays encountered in the merging process are nearly square, so that routing time will not be excessive in either direction. The standard row-major indexing causes the bitonic sort to contend with sub-arrays that are always at least as wide as they are tall; the aspect ratio can be as high as n on an n x n processor array.
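The per-bit routing costs above can be checked with a short script (a modern sketch of ours, not part of the report; the helper names are hypothetical). It recovers the grid position of a processor from its index and measures the distance spanned by flipping each index bit:

```python
def grid_position(idx, n, shuffled):
    # Recover the (row, col) of the processor assigned index idx.
    bits = n.bit_length() - 1                 # log2(n)
    if not shuffled:
        return divmod(idx, n)                 # row-major: idx = row*n + col
    r = c = 0
    for b in range(bits):                     # shuffled index ...Y1X1Y0X0
        c |= ((idx >> (2 * b)) & 1) << b
        r |= ((idx >> (2 * b + 1)) & 1) << b
    return r, c

def route_distance(bit, n, shuffled):
    # Grid distance spanned by a comparison-interchange on one index bit;
    # such a step costs 2*distance*tR + tC.
    r, c = grid_position(1 << bit, n, shuffled)
    return r + c

# N = 64 (n = 8) example from the text: row-major distances are not
# monotone in the bit position, shuffled row-major distances are.
assert [route_distance(b, 8, False) for b in range(6)] == [1, 2, 4, 1, 2, 4]
assert [route_distance(b, 8, True) for b in range(6)] == [1, 1, 2, 2, 4, 4]
```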
Programming the bitonic sort would be a little tricky, as the "direction" of a comparison-interchange step depends on the processor index. Orcutt [1974] covers these gory details for row-major indexing; his algorithm may easily be modified to handle the shuffled row-major indexing scheme. Here is an example of the bitonic merge sort on a 4 x 4 processor array for the shuffled row-major indexing; the comparison "directions" were derived from the diagram of the 16-element bitonic sorting network in Knuth [1973, p. 237] (the network diagram did not survive scanning).
Initial data configuration: [4 x 4 grid; the figure did not survive scanning.]

Stage 1. Merge pairs of adjacent 1 x 1 matrices by the comparison-interchanges indicated. Time: 2tR + tC.

Stage 2. Merge pairs of 1 x 2 matrices; note that one member of a pair is sorted in ascending order, the other in descending order. This will always be the case in any bitonic merge. Time: 4tR + 2tC.

Stage 3. Merge pairs of 2 x 2 matrices. Time: 8tR + 3tC.

Stage 4. Merge the two 2 x 4 matrices. Time: 12tR + 4tC.

[The intermediate data configurations between stages did not survive scanning.]

Let T'''(2^i) be the time to merge the bitonically sorted elements in processors #0 through #(2^i - 1), where the shuffled row-major indexing is used. Then after one pass of comparison-interchange, which takes time 2^ceil(i/2) tR + tC, the problem is reduced to the bitonic merge of the elements in processors #0 through #(2^(i-1) - 1), and that of the elements in processors #2^(i-1) through #(2^i - 1). It may be observed that the latter two merges can be done concurrently. Thus we have

    T'''(1) = 0,
    T'''(2^i) = T'''(2^(i-1)) + 2^ceil(i/2) tR + tC.

Hence

    T'''(2^i) = (3*2^((i+1)/2) - 4)tR + i*tC, if i is odd, and
    T'''(2^i) = (4*2^(i/2) - 4)tR + i*tC, if i is even.

Let S'''(2^(2j)) be the time taken by the corresponding sorting algorithm (for a square array). Then

    S'''(1) = 0,
    S'''(2^(2j)) = S'''(2^(2j-1)) + T'''(2^(2j))
                 = S'''(2^(2(j-1))) + T'''(2^(2j-1)) + T'''(2^(2j)).

Hence

    S'''(2^(2j)) = (14(2^j - 1) - 8j)tR + (2j^2 + j)tC.

In our model, we have 2^(2j) = N = n^2 processors, leading to the following theorem.
Theorem 8.1
If the shuffled row-major indexing is used, the bitonic sort can be done in time

    (14(n-1) - 8 log n)tR + (2 log^2 n + log n)tC.

If tC <= 2tR, it may be seen that the bitonic merge sort algorithm is optimal to within a factor of 4.5 for all n (since 4(n-1)tR time is necessary, as shown in Section 4). Preliminary investigation indicates that the bitonic merge sort is faster than the s^2-way odd-even merge sort for n < 512, under the assumption that tC = 2tR.
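The routing-step count in Theorem 8.1 can be re-derived numerically (a check of ours, not in the original; tR = 1) by summing the routing cost of the merges T'''(2^i) over i = 1, ..., 2j:

```python
def merge_routing_steps(i):
    # Routing steps of the bitonic merge of 2^i elements under
    # shuffled row-major indexing (the two cases of T''').
    if i % 2 == 1:
        return 3 * 2 ** ((i + 1) // 2) - 4
    return 4 * 2 ** (i // 2) - 4

# Summing over all merges of the sort reproduces 14(2^j - 1) - 8j.
for j in range(1, 11):
    n = 2 ** j                      # n x n array, N = 2^(2j)
    total = sum(merge_routing_steps(i) for i in range(1, 2 * j + 1))
    assert total == 14 * (n - 1) - 8 * j
```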

9. Extensions and Implications

(i) By Theorem 7.1 or 8.1, the elements may be sorted into the snake-like row-major ordering or into the shuffled row-major ordering in O(n) time. By the following lemma we know that they can be rearranged to obey any other index function with relatively insignificant extra costs, provided each processor has sufficient memory size.

Lemma 9.1
If N = n^2 elements have already been sorted with respect to some index function, and if each processor can store n elements, then the N elements can be sorted with respect to any other index function by using an additional 4(n-1)tR units of time.

The proof follows from the fact that all elements can be moved to their destinations by four sweeps of n-1 routing steps, one in each of the four directions.
(ii) If the processors are connected in a k x m rectangular array instead of a square array, similar results can still be obtained. For example, corresponding to Theorem 7.1, we have

Theorem 9.1
If the snake-like row-major indexing is used, the sorting problem for a k x m processor array (k, m powers of 2) can be done in time

    (4m + 2k + O(h^(2/3) log h))tR + (k + O(h^(2/3) log h))tC,

where h = min(k,m), by using the s^2-way merge sort with s = O(h^(1/3)).

(iii) The number of elements to be sorted could be larger than N, the number of processors. An efficient means of handling this situation is to distribute an approximately equal number of elements to each processor initially, and to use a merge-splitting operation for each comparison-interchange operation. This idea is discussed by Knuth [1973, Exercise 5.3.4-38], and used by Baudet and Stevenson [1975]. Baudet and Stevenson's results will be immediately improved if the algorithms of this paper are used, since they used Orcutt's O(n log n) algorithm.

(iv) Higher-dimensional array interconnection patterns, i.e., N = n^j processors each connected to its 2j nearest neighbors, may be sorted by algorithms generalized from those presented in this paper. For example, N = n^j elements may be sorted in time

    ((3j^2 + j)(n-1) - 2j log N)tR + (1/2)(log^2 N + log N)tC,

by adapting Batcher's bitonic merge sort algorithm to the "j-way shuffled row-major ordering". This new ordering is derived from the binary representation of the row-major indexing by a j-way bit shuffle. If n = 2^3, j = 3, and Z2 Z1 Z0 Y2 Y1 Y0 X2 X1 X0 is the row-major index, then the j-way shuffled index is Z2 Y2 X2 Z1 Y1 X1 Z0 Y0 X0.

This formula may be derived in the following way. The tC term is not dimension-dependent: the same number of comparisons is performed in any mapping of the bitonic sort onto an N-processor array. The tR term is the solution of

    sum over 1 <= i <= log n of  2^i * ( sum over 1 <= k <= j of ((log N) - ij + k) ),

where the 2^i term is the cost of a comparison-interchange on the (i-1)st bit of any of the "kth-dimension indices" (i.e., Z(i-1), Y(i-1), and X(i-1) when j = 3 as in the example above), and the ((log N) - ij + k) term is the number of times a comparison-interchange is performed on the (ij-k)th bit of the j-way shuffled row-major index during the bitonic sort. Therefore we have the following theorem.
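The double sum above can be confirmed numerically to have the closed form (3j^2 + j)(n-1) - 2j log N (our own check, routing steps only with tR = 1; for j = 2 it reduces to the 14(n-1) - 8 log n count of Theorem 8.1):

```python
# Evaluate the double sum for several dimensions j and mesh sides n,
# and compare against the closed form, with N = n^j.
for j in (1, 2, 3, 4):
    for logn in (1, 2, 4, 6):
        n = 2 ** logn
        logN = j * logn
        total = sum(2 ** i * (logN - i * j + k)
                    for i in range(1, logn + 1)
                    for k in range(1, j + 1))
        assert total == (3 * j ** 2 + j) * (n - 1) - 2 * j * logN
```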
Theorem 9.2
If N processors are j-dimensionally mesh-connected, then the bitonic sort can be performed in time O(N^(1/j)), using the j-way shuffled row-major index scheme.

By using the argument of Section 4, one can easily check that the bound in the theorem is asymptotically optimal for large N.

Appendix. The Odd-Even Transposition Sort

The odd-even transposition sort (Knuth [1973], p. 241) may be mapped onto our 2-dimensional arrays with snake-like row-major ordering in the following way. Given N processors, each initially loaded with a data value, repeat N/2 times:

O1. "Expensive comparison-interchange" of processors #(2i+1) with processors #(2i+2), 0 <= i <= N/2 - 2. Time: 4tR + tC if the processor array has more than two columns and more than one row; 0 if N = 2; and 2tR + tC otherwise.

O2. "Cheap comparison-interchange" of processors #(2i) with processors #(2i+1), 0 <= i <= N/2 - 1. Time: 2tR + tC.

If Toe(j,k) is the time required to sort jk elements in a j x k region of our processors by the odd-even transposition sort into the snake-like row-major ordering, then

    Toe(j,k) = 0,              if jk = 1;
               2tR + tC,       if jk = 2;
               jk(2tR + tC),   if j = 1 or k <= 2;
               jk(3tR + tC),   otherwise.

Step J2 of the 2-way odd-even merge (Section 5) cannot be performed by the version of the odd-even transposition sort indicated above. Since N is even there (N = 2j), step O2 may be placed before step O1 in the algorithm description above (see Knuth [1973]). Now step O2 may be performed in the normal time of 2tR + tC, even starting from the non-standard initial configuration depicted in Section 5 as the result of step J1.
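Along the snake ordering, the sort reduces to the classical 1-D odd-even transposition sort, which can be sketched as follows (a Python illustration of ours; the function name is hypothetical):

```python
def odd_even_transposition_sort(data):
    # N/2 repetitions of step O1 (odd pairs) followed by step O2 (even
    # pairs), i.e., N alternating passes in total, which always suffice.
    data = list(data)
    N = len(data)
    for _ in range((N + 1) // 2):
        for start in (1, 0):          # start=1: O1 pairs; start=0: O2 pairs
            for i in range(start, N - 1, 2):
                if data[i] > data[i + 1]:
                    data[i], data[i + 1] = data[i + 1], data[i]
    return data
```

On the mesh, each inner loop iteration is one parallel comparison-interchange step; the cheap/expensive distinction of O1 and O2 comes only from where the paired processors sit on the snake.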


References

Barnes, G. H., et al. [1968]. "The ILLIAC IV Computer," IEEE Trans. on Computers, C-17, pp. 746-757.

Batcher, K. E. [1968]. "Sorting Networks and Their Applications," Proc. AFIPS Spring Joint Computer Conference, Vol. 32, pp. 307-314.

Baudet, G., and Stevenson, D. [1975]. "Optimal Sorting Algorithms for Parallel Computers," Computer Science Department report, Carnegie-Mellon University, May 1975.

Flynn, M. J. [1966]. "Very High-Speed Computing Systems," Proc. IEEE, Vol. 54, pp. 1901-1909.

Knuth, D. E. [1973]. The Art of Computer Programming: Vol. 3, Sorting and Searching. Reading, Mass.: Addison-Wesley.

Orcutt, S. E. [1974]. Computer Organization and Algorithms for Very-High Speed Computations. Ph.D. Thesis, Stanford University, September 1974, Chap. 2, pp. 20-23.

Stone, H. S. [1971]. "Parallel Processing with the Perfect Shuffle," IEEE Trans. on Computers, Vol. C-20, pp. 153-161.
