Sorting On A Mesh-Connected Parallel Computer
March 1976
This research was supported in part by the National Science Foundation under Grant MCS75-222-55 and the Office of Naval Research under Contract N00014-76-C-0370, NR 044-422.
1. Introduction
In the course of a parallel computation, individual processors will need to
distribute their results to other processors and complicated data flow problems may
arise. One way to handle this problem is by sorting "destination tags" attached to each
data element, as discussed in Batcher [1968]. Hence efficient sorting algorithms for
parallel machines with some fixed processor interconnection pattern are relevant to
almost any use of these machines.
In this paper we present two algorithms for sorting N = n^2 elements on an n×n
mesh-type processor array that require O(n) unit-distance routing steps and O(n)
comparison steps (n is assumed to be a power of 2). The best previous algorithm
takes time O(n log n) (Orcutt [1974]). One of our algorithms, the s^2-way merge sort, is
shown optimal within a factor of 2 in time for sufficiently large n, if one comparison
step takes no more than twice the time of a routing step. Our other O(n) algorithm, an
adaptation of Batcher's bitonic merge sort, is much less complex but optimal under the
same assumption to within a factor of 4.5 for all n, and is more efficient for moderate
n.
We believe that the algorithms of this paper will give the most efficient sorting
algorithms for ILLIAC IV-type parallel computers.
Our algorithms can be generalized to higher-dimensional array interconnection
patterns. For example, our second algorithm can be modified to sort N elements on a
j-dimensionally mesh-connected N-processor computer in O(N^(1/j)) time, which is optimal
within a small constant factor.
Efficient sorting algorithms have been developed for interconnection patterns
other than the "mesh" considered in this paper. Stone [1971] maps Batcher's bitonic
merge sort onto the "perfect shuffle" interconnection scheme, obtaining an N-element
sort time of Oflog^N) on N processors. The odd-even transposition sort (see
Appendix) requires an optimal 0{N) time on a linearly connected N-processor computer.
Sorting time is thus seen to be strongly dependent on the interconnection pattern.
Exploration of this dependence for a given problem is of interest from both an
architectural and an algorithmic point of view.
In Section 2 we give the model of computation. The sorting problem is defined
precisely in Section 3. A lower bound on the sorting time is given in Section 4.
Batcher's 2-way odd-even merge is mapped on our 2-dimensional mesh-connected
processor array in the next section. Generalizing the 2-way odd-even merge, we
introduce a 2s-way merge algorithm in Section 6. This is further generalized to an
s^2-way merge in Section 7, from which our most efficient sorting algorithm for large n is
developed. Section 8 shows that Batcher's bitonic sort can be performed efficiently on
our model by choosing an appropriate processor indexing scheme. Some extensions
and implications of our results are discussed in Section 9. The Appendix contains a
description of the odd-even transposition sort.
2. Model of Computation
We assume a parallel computer with N = n×n identical processors. The
architecture of the machine is similar to that of the ILLIAC IV (Barnes, et al. [1968]).
The major assumptions are as follows:
(i) The interconnections between the processors are a subset of those on the ILLIAC
IV, and are defined by the following two-dimensional array:
[Figure of the n×n processor array omitted.]
where the p's denote the processors. That is, each processor is connected to all
its neighbors. Processors at the perimeter have two or three rather than four
neighbors; there are no "wrap-around" connections as found on the ILLIAC IV.
The bounds obtained in this paper would be affected at most by a factor of
2 if "wrap-around" connections were included, but we feel that this addition would
obscure the ideas of this paper without substantially strengthening the results.
(ii) It is a SIMD (Single Instruction stream, Multiple Data stream) machine (Flynn [1966]).
During each time unit, a single instruction is broadcast to all processors, but only
executed by the set of processors specified in the instruction. For the purpose of
the paper, only two instruction types are needed: the routing instruction for
interprocessor data moves, and the comparison instruction on two data elements in
each processor. The comparison instruction is a conditional interchange on the
contents of two registers in each processor. Actually, we need both "types" of
such comparison instructions to allow either register to receive the minimum;
normally both types will be issued during "one comparison step".
(iii) Define
t_R = time required for one unit-distance routing step, i.e., moving one item
from a processor to one of its neighbors, and
t_C = time required for one comparison step.
3. The Sorting Problem

The sorting problem is considered with respect to three indexing schemes for the processors.

(i) Row-Major Indexing: After sorting we have
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

(ii) Shuffled Row-Major Indexing: After sorting we have
 0  1  4  5
 2  3  6  7
 8  9 12 13
10 11 14 15
Note that this indexing is obtained by shuffling the binary representation of the
row-major index. For example, the row-major index 5 has the binary
representation 0101. Shuffling the bits gives 0011, which is 3. (In general,
shuffling the binary number "abcdefgh" gives "aebfcgdh".)
(iii) Snake-Like Row-Major Indexing: After sorting we have
ROW 1:   0  1  2  3
ROW 2:   7  6  5  4
ROW 3:   8  9 10 11
ROW 4:  15 14 13 12
This indexing is obtained from the row-major indexing by reversing the ordering
in even rows.
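Both derived indexings can be sketched as small helper functions (hypothetical names, not part of this report; rows and columns are numbered from 0 here, so the reversed "even" rows become the odd-numbered ones):

```python
def shuffled_index(row_major, bits):
    """Shuffle the binary representation of a 2*bits-bit row-major
    index: interleave the row half with the column half, so that
    "abcdefgh" becomes "aebfcgdh"."""
    hi = row_major >> bits               # row bits
    lo = row_major & ((1 << bits) - 1)   # column bits
    result = 0
    for i in range(bits - 1, -1, -1):
        result = (result << 1) | ((hi >> i) & 1)  # next row bit
        result = (result << 1) | ((lo >> i) & 1)  # next column bit
    return result

def snake_index(row, col, n):
    """Snake-like row-major index in an n-wide array: every other
    row runs right-to-left."""
    if row % 2 == 0:
        return row * n + col
    return row * n + (n - 1 - col)

print(shuffled_index(5, 2))  # the report's example: 0101 -> 0011, i.e. 3
print(snake_index(1, 0, 4))  # first element of a reversed row: 7
```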
The choice of a particular indexing scheme depends upon how the sorted
elements will be used (or accessed), and upon which sorting algorithm is to be used.
For example, we found that the row-major indexing is poor for merge-sorting.
It is clear that the sorting problem with respect to any index scheme can be
solved by using the routing and comparison steps. We are interested in designing
algorithms which minimize the time spent in routing and comparing.
4. A Lower Bound
Observe that for any index scheme there are situations where the two elements
initially loaded at the opposite corner processors have to be transposed during the
sorting. Each of these two elements must travel at least 2(n-1) unit distances, and the
two journeys proceed in opposite directions, so under our SIMD model no routing step
can serve both. Hence at least 4(n-1) routing steps are required: no algorithm can
sort n^2 elements in less than 4(n-1)t_R time.
5. The 2-Way Odd-Even Merge

Batcher's odd-even merge of two sorted sequences may be viewed as the following
recursive procedure:
L1. Unshuffle.
L2. Merge the "odd sequences" and the "even sequences".
L3. Shuffle.
L4. Comparison-interchange of adjacent elements.
Step L3 above is the "perfect shuffle" (Stone [1971]) and step L1 is its inverse,
the "unshuffle". Note that the perfect shuffle can be achieved by using the triangular
interchange pattern below:
[Figure of the triangular interchange pattern omitted.]
where the double arrows indicate interchanges. Similarly, an inverted triangular interchange
pattern will do the unshuffle. Therefore, both the perfect shuffle and unshuffle can be
done in k-1 interchanges (i.e., 2k-2 routing steps) when performed on a row of
length 2k in our model.
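The shuffle and unshuffle operations themselves can be sketched as follows (hypothetical helpers that work on a plain list; the triangular interchange pattern is how the mesh realizes them in 2k-2 routing steps):

```python
def perfect_shuffle(xs):
    """Perfect shuffle of a row of even length 2k: interleave the
    two halves (Stone [1971])."""
    k = len(xs) // 2
    out = []
    for a, b in zip(xs[:k], xs[k:]):
        out.extend([a, b])
    return out

def unshuffle(xs):
    """Inverse of the perfect shuffle: even positions go to the
    left half, odd positions to the right half."""
    return xs[0::2] + xs[1::2]

row = list(range(8))
print(perfect_shuffle(row))              # -> [0, 4, 1, 5, 2, 6, 3, 7]
assert unshuffle(perfect_shuffle(row)) == row
```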
We now give an implementation of the odd-even merge on a rectangular region
of our model. Let M(j,k) denote our algorithm for merging two j by k/2 sorted adjacent
subarrays to form a sorted j by k array, where j, k are powers of 2, k > 1, and all the
arrays are arranged in the snake-like row-major ordering. We first consider the case
where k = 2. If j = 1, a single comparison-interchange step suffices to sort the two unit
"subarrays". Given two sorted columns of length j > 1, M(j,2) consists of the following
steps:
J1. Move all odds to the left column and all evens to the right. Time: 2t_R.
J2. Use the "odd-even transposition sort" (see Appendix) to sort each column. Time:
2j t_R + j t_C.
J3. Interchange on even rows. Time: 2t_R.
J4. One step of comparison-interchange between adjacent elements in the snake.
Time: 2t_R + t_C.
For k>2, M(j,k) is defined recursively in the following way. Steps M1 and M2
unshuffle the elements, step M3 recursively merges the "odd sequences" and the "even
sequences", steps M4 and M5 shuffle the "odds" and "evens" together, and step M6
performs the final comparison-interchange. The accompanying diagrams illustrate the
algorithm M(4,4), where the two given sorted 4 by 2 subarrays are initially stored in
16 processors as follows:
M1. Single interchange step on even rows if j>2, so that columns contain either all
evens or all odds. If j=2, do nothing; the columns are already segregated. Time:
2t_R.
[Diagrams illustrating the algorithm M(4,4) omitted.]
Let T(j,k) be the time required by M(j,k). Then
T(j,2) = (2j + 6)t_R + (j + 1)t_C,
and for k > 2,
T(j,k) = (2k + 4)t_R + t_C + T(j,k/2).
6. The 2s-Way Merge

M1'. Single interchange step on even rows if j > s, so that columns contain either all
evens or all odds. If j = s, do nothing: the columns are already segregated. Time:
2t_R.
M6'. Perform the first 2s-1 steps of an odd-even transposition sort along the snake.
Time: (2s-1)(2t_R + t_C).
Note that the original step M6 is just the first step of an odd-even transposition sort.
Thus the 2-way merge is seen to be a special case of the 2s-way merge.
Similarly, for M'(j,2,s), j > s, J4 is replaced by M6', which takes time
(2s-1)(2t_R + t_C).
M'(s,2,s) is a special case analogous to M(1,2), and may be performed by the odd-even
transposition sort (see Appendix) in time 4s t_R + 2s t_C.
The validity of this algorithm may be demonstrated by use of the 0-1 principle
(Knuth [1973], p. 224): if a network sorts all sequences of 0's and 1's, then it will sort
any arbitrary sequence of elements chosen from a linearly ordered set. Thus, we may
assume that the inputs are 0's and 1's. It is easy to check that there may be as many
as 2s more zeros on the left than on the right after unshuffling (i.e., after step J1 or
step M2). After the shuffling, the first 2s-1 steps of an odd-even transposition sort
suffice to sort the resulting array.
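The 0-1 principle check can be carried out mechanically. The sketch below is a hypothetical sequential simulation of Batcher's odd-even merge (ignoring all routing costs) that verifies the network on every 0-1 input of a fixed length:

```python
from itertools import product

def odd_even_merge(a, b):
    """Batcher's odd-even merge of two sorted lists of equal
    power-of-2 length: merge the evens, merge the odds, then one
    pass of comparison-interchanges knits the results together."""
    if len(a) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    evens = odd_even_merge(a[0::2], b[0::2])
    odds = odd_even_merge(a[1::2], b[1::2])
    out = [evens[0]]
    for o, e in zip(odds[:-1], evens[1:]):
        out += [min(o, e), max(o, e)]   # final comparison-interchange
    out.append(odds[-1])
    return out

# By the 0-1 principle, checking all 0-1 inputs suffices to show
# the network sorts arbitrary inputs of this length.
assert all(
    odd_even_merge(sorted(a), sorted(b)) == sorted(a + b)
    for a in product([0, 1], repeat=4)
    for b in product([0, 1], repeat=4)
)
```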
Let T'(j,k,s) be the time required by the algorithm M'(j,k,s). Then we have that
T'(j,2,s) = (2j + 4s + 2)t_R + (j + 2s - 1)t_C,
and for general k the comparison time of M'(j,k,s) is
(j + 2s log k + O(s + log k))t_C.
For s = 2, a merge sort may be derived that has the following time behavior:
S'(n,n) = S'(n/2,n/2) + T'(n,n,2).
Thus,
S'(n,n) = (12n + O(log n))t_R + (2n + O(log^2 n))t_C.
7. The s^2-Way Merge Sort

The 2s-way merge generalizes further to an s^2-way merge M''(j,k,s), in which step
M1' is modified and step M6' is replaced by a step M6'' taking time
(s/2)(4t_R + t_C) + (s/2 - 1)(2t_R + t_C) = (3s - 2)t_R + (s - 1)t_C.
The motivation for this new step comes from the realization that when the inputs are
0's and 1's, there may be as many as s^2 more zeros on the left half than on the right after
unshuffling.
Therefore,
T''(j,k,s) = (4k + 2j + 3s log(k/s) + O((s + j/s) log s + log k))t_R
           + (j + O((s + j/s) log s + log k))t_C.
A sorting algorithm may be developed from the s^2-way merge; a good value for
s is approximately n^(1/3) (remember that s must be a power of 2). Then the time for
sorting the n×n array satisfies
S''(n,n) = S''(n/s, n/s) + T''(n, n, s).
Theorem 7.1
If the snake-like row-major indexing is used, the sorting problem can be done in
time
(6n + O(n^(2/3) log n))t_R + (n + O(n^(2/3) log n))t_C.
8. The Bitonic Merge Sort
In this section we shall show that Batcher's bitonic merge algorithm (Batcher
[1968], Knuth [1973, pp. 232-237]) lends itself well to sorting on a mesh-connected
parallel computer, once the proper indexing scheme has been selected. Two indexing
schemes will be considered, the "row-major" and the "shuffled row-major" indexing
schemes defined in Section 3.
The bitonic merge of two sections of a bitonic array of j/2 elements each takes
log j passes, where pass i consists of a comparison-interchange between processors
with indices differing only in the i-th bit of their binary representations. (This
operation will be termed "comparison-interchange on the i-th bit".) Sorting an entire
array of 2^k elements by the bitonic method requires k comparison-interchanges on the
0th bit (the least significant bit), k-1 comparison-interchanges on the first bit, ..., k-i
comparison-interchanges on the i-th bit, ..., and 1 comparison-interchange on the most
significant bit. For any fixed indexing scheme, in general a comparison-interchange on
the i-th bit will take a different amount of time than when done on the j-th bit: an
optimal processor indexing scheme for the bitonic algorithm minimizes the time spent
on comparison-interchange steps. A necessary condition for optimality is that a
comparison-interchange on the j-th bit be no more expensive than one on the (j+1)-th
bit for all j. If this were not the case for some j, then a better indexing scheme could
immediately be derived from the supposedly optimal one by interchanging the j-th and
the (j+1)-th bits of all processor indices (since more comparison-interchanges will be
done on the original j-th bit than on the (j+1)-th bit).
The bitonic algorithm has been analyzed for the row-major indexing scheme: it
takes
O(n log n)t_R + O(log^2 n)t_C
(Orcutt [1974]). The shuffled row-major indexing does better because it keeps the
sub-arrays merged by the bitonic sort nearly square, so that routing distances are never
excessive in either direction. The standard row-major indexing causes the bitonic sort
to contend with sub-arrays that are always at least as wide as they are tall; the
aspect ratio can be as high as n on an n×n processor array.
Programming the bitonic sort would be a little tricky, as the "direction" of a
comparison-interchange step depends on the processor index. Orcutt [1974] covers
these gory details for row-major indexing; his algorithm may easily be modified to
handle the shuffled row-major indexing scheme. Here is an example of the bitonic
merge sort on a 4×4 processor array for the shuffled row-major indexing; the
comparison "directions" were derived from the following diagram (Knuth [1973], p.
237):
[Diagram of the 16-element bitonic sorting network (Knuth [1973], p. 237), and the
first diagrams of the example, omitted.]
Stage 2. Merge pairs of 1×2 matrices; note that one member of a pair is sorted in
ascending order, the other in descending order. This will always be the case in
any bitonic merge. Time: 4t_R + 2t_C.
Stage 3. Merge pairs of 2×2 matrices. Time: 8t_R + 3t_C.
[Stage 3 and Stage 4 diagrams omitted.]
Let T'''(2^i) be the time to merge the bitonically sorted elements in processors 0
through 2^i - 1, where the shuffled row-major indexing is used. Then after one pass of
comparison-interchange, which takes time 2^⌈i/2⌉ t_R + t_C, the problem is reduced to the
bitonic merge of the elements in processors 0 through 2^(i-1) - 1, and that of the elements in
processors 2^(i-1) through 2^i - 1. It may be observed that the latter two merges can be done
concurrently. Thus we have
T'''(1) = 0,
T'''(2^i) = T'''(2^(i-1)) + 2^⌈i/2⌉ t_R + t_C.
Solving this recurrence yields
T'''(2^i) = (3·2^((i+1)/2) - 4)t_R + i t_C, if i is odd, and
T'''(2^i) = (4·2^(i/2) - 4)t_R + i t_C, if i is even.
Let S'''(2^i) be the time taken by the corresponding sorting algorithm (for a square
array). Then
S'''(1) = 0,
S'''(2^i) = S'''(2^(i-1)) + T'''(2^i).
Hence, for n = 2^j,
S'''(n^2) = S'''(2^(2j)) = (14(2^j - 1) - 8j)t_R + (2j^2 + j)t_C,
which establishes the following theorem.
Theorem 8.1
If the shuffled row-major indexing is used, the bitonic sort can be done in time
(14(n-1) - 8 log n)t_R + (2 log^2 n + log n)t_C.
If t_C ≤ 2t_R, it may be seen that the bitonic merge sort algorithm is optimal to
within a factor of 4.5 for all n (since 4(n-1)t_R time is necessary, as shown in Section
4). Preliminary investigation indicates that the bitonic merge sort is faster than the
s^2-way odd-even merge sort for n < 512, under the assumption that t_C = 2t_R.
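For reference, the bitonic merge sort can be sketched sequentially (a hypothetical simulation that ignores the mesh and its routing costs; the network above distributes exactly these comparison-interchanges over the processor array):

```python
def bitonic_merge(xs, ascending):
    """Merge a bitonic sequence: one pass of comparison-interchange
    on the most significant bit, then two half-merges that the mesh
    version would run concurrently."""
    if len(xs) <= 1:
        return list(xs)
    half = len(xs) // 2
    xs = list(xs)
    for i in range(half):
        if (xs[i] > xs[i + half]) == ascending:
            xs[i], xs[i + half] = xs[i + half], xs[i]
    return (bitonic_merge(xs[:half], ascending)
            + bitonic_merge(xs[half:], ascending))

def bitonic_sort(xs, ascending=True):
    """Sort the two halves in opposite directions (yielding a
    bitonic sequence), then bitonically merge them."""
    if len(xs) <= 1:
        return list(xs)
    half = len(xs) // 2
    return bitonic_merge(bitonic_sort(xs[:half], True)
                         + bitonic_sort(xs[half:], False), ascending)

print(bitonic_sort([15, 3, 8, 0, 12, 7, 1, 9]))  # -> [0, 1, 3, 7, 8, 9, 12, 15]
```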
9. Extensions and Implications
Lemma 9.1
If N = n×n elements have already been sorted with respect to some index function,
and if each processor can store n elements, then the N elements can be sorted
with respect to any other index function by using an additional 4(n-1)t_R units of
time.
The proof follows from the fact that all elements can be moved to their
destinations by four sweeps of n-1 routing steps, one in each of the four directions.
(ii) If the processors are connected in a k×m rectangular array, instead of a square
array, similar results can still be obtained. For example, corresponding to
Theorem 7.1, we have
Theorem 9.1
If the snake-like row-major indexing is used, the sorting problem for a k×m
processor array (k, m powers of 2) can be done in time
(4m + 2k + O(h^(2/3) log h))t_R + (k + O(h^(2/3) log h))t_C.
(iii) The number of elements to be sorted could be larger than N, the number of
processors. An efficient means of handling this situation is to distribute an
approximately equal number of elements to each processor initially and to use a
merge-splitting operation for each comparison-interchange operation. This idea is
discussed by Knuth [1973, Exercise 5.3.4-38], and used by Baudet and Stevenson
[1975]. Baudet and Stevenson's results will be immediately improved if the
algorithms of this paper are used, since they used Orcutt's 0(n log n) algorithm.
(iv) Our algorithms can be generalized to j-dimensional array interconnection patterns
by adapting Batcher's bitonic merge sort algorithm to the "j-way shuffled row-major
ordering". This new ordering is derived from the binary representation of
the row-major indexing by a j-way bit shuffle. For example, if j = 3 and
Z2 Z1 Z0 Y2 Y1 Y0 X2 X1 X0
is the row-major index, then the j-way shuffled index is
Z2 Y2 X2 Z1 Y1 X1 Z0 Y0 X0.
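The j-way bit shuffle can be sketched as follows (hypothetical helper, not from the report; for j = 2 it reduces to the shuffled row-major indexing of Section 3):

```python
def j_way_shuffle(index, j, digits):
    """j-way shuffle of a (j*digits)-bit row-major index.

    The index is viewed as j fields of `digits` bits each
    (e.g. Z2Z1Z0 Y2Y1Y0 X2X1X0 for j = 3, digits = 3); the shuffled
    index interleaves one bit from each field in turn
    (Z2Y2X2 Z1Y1X1 Z0Y0X0)."""
    mask = (1 << digits) - 1
    # Split the index into j fields, most significant field first.
    fields = [(index >> (digits * (j - 1 - f))) & mask for f in range(j)]
    result = 0
    for b in range(digits - 1, -1, -1):   # from high bit to low
        for f in fields:                  # Z, then Y, then X, ...
            result = (result << 1) | ((f >> b) & 1)
    return result

# For j = 2 this is the ordinary bit shuffle: 0101 -> 0011.
print(j_way_shuffle(0b0101, 2, 2))  # -> 3
```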
These bounds may be derived in the following way. The t_C term is not
dimension-dependent: the same number of comparisons is performed in any mapping of the
bitonic sort onto an N-processor array. The t_R term is the solution of
Σ_{1≤i≤log n} 2^i Σ_{1≤k≤j} (log N - (ij + k)).
Theorem 9.2
If N processors are j-dimensionally mesh-connected, then the bitonic sort
can be performed in time O(N^(1/j)), using the j-way shuffled row-major index
scheme.
By using the argument of Section 4, one can easily check that the bound in the
theorem is asymptotically optimal for large N.
Appendix: The Odd-Even Transposition Sort

The odd-even transposition sort (Knuth [1973], p. 241) may be mapped onto our
2-dimensional arrays with snake-like row-major ordering in the following way. Given N
processors, each initially loaded with a data value, repeat N/2 times:
O1. Comparison-interchange the contents of each odd-indexed processor with those of
its successor in the snake.
O2. Comparison-interchange the contents of each even-indexed processor with those of
its successor in the snake.
If the jk elements of a j×k array are sorted, one per processor, by the odd-even
transposition sort into the snake-like row-major ordering, then the time required is
T_OE(j,k) = 0,                 if jk = 1,
          = 2t_R + t_C,        if jk = 2,
          = jk(2t_R + t_C),    if j = 1 or k = 2,
          = jk(3t_R + t_C),    otherwise.
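The appendix's procedure can be sketched as a hypothetical sequential simulation on a j×k grid (each repetition performs one O1 pass and one O2 pass along the snake; routing costs are ignored):

```python
def odd_even_transposition_sort(grid):
    """Odd-even transposition sort along the snake-like row-major
    order of a j x k grid: N/2 repetitions, each consisting of two
    passes of comparison-interchanges between snake-neighbours."""
    j, k = len(grid), len(grid[0])
    # Grid cells listed in snake order: every other row reversed.
    snake = [(r, c if r % 2 == 0 else k - 1 - c)
             for r in range(j) for c in range(k)]
    n = j * k
    for _ in range((n + 1) // 2):
        for start in (0, 1):                 # the two alternating passes
            for i in range(start, n - 1, 2):
                (r1, c1), (r2, c2) = snake[i], snake[i + 1]
                if grid[r1][c1] > grid[r2][c2]:
                    grid[r1][c1], grid[r2][c2] = grid[r2][c2], grid[r1][c1]
    return grid

# A 4x4 example: the result reads 0..15 along the snake.
print(odd_even_transposition_sort(
    [[15, 3, 8, 0], [12, 7, 1, 9], [14, 2, 6, 11], [5, 10, 13, 4]]))
```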
References
Barnes, G. H., et al. [1968]. "The ILLIAC IV Computer," IEEE Trans. on Computers,
Vol. C-17, pp. 746-757.
Batcher, K. E. [1968]. "Sorting Networks and Their Applications," Proc. AFIPS Spring
Joint Computer Conference, Vol. 32, pp. 307-314.
Baudet, G., and Stevenson, D. [1975]. "Optimal Sorting Algorithms for Parallel
Computers," Computer Science Department report, Carnegie-Mellon University,
May 1975.
Flynn, M. J. [1966]. "Very High-Speed Computing Systems," Proc. of the IEEE, Vol. 54,
pp. 1901-1909.
Knuth, D. E. [1973]. The Art of Computer Programming: Vol. 3, Sorting and Searching.
Reading, Mass.: Addison-Wesley.
Orcutt, S. E. [1974]. Computer Organization and Algorithms for Very-High Speed
Computations. Ph.D. Thesis, Stanford University, September 1974, Chap. 2, pp.
20-23.
Stone, H. S. [1971]. "Parallel Processing with the Perfect Shuffle," IEEE Trans. on
Computers, Vol. C-20, pp. 153-161.