Parallel Sorting Algorithms
Potential Speedup
Sequential sorting: O(n log n)
With n processors, the optimal parallel time complexity is O(n log n) / n = O(log n)
Compare-and-Exchange
Sorting Algorithms
Form the basis of several, if not most, classical sequential sorting
algorithms.
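[Figure: compare-and-exchange between processors P0 and P1, which hold A and B; one keeps MIN(A, B), the other keeps MAX(A, B).]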
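A minimal sketch of a single compare-and-exchange, written in Python for illustration. The function name `compare_exchange` and the convention that P0 keeps the minimum are assumptions made here, not taken from the slide.

```python
def compare_exchange(a, b):
    """Return (low, high): the value kept by the 'low' side and by the 'high' side."""
    return (a, b) if a <= b else (b, a)


# Assumed convention: P0 holds A, P1 holds B, and P0 keeps the minimum.
p0, p1 = compare_exchange(7, 3)
```

In a message-passing setting each processor would send its value to its partner and evaluate the same comparison locally; the function above only models the outcome.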
Compare-and-Exchange Two Sublists
Odd-Even Transposition Sort - example
Each PE (processing element) gets n/p numbers. First, each PE sorts its n/p numbers
locally; the PEs then run the odd-even transposition algorithm, performing a
merge-split on 2n/p numbers in every step.
Step          P0          P1          P2          P3
(input)       13  7 12 |  8  5  4 |  6  1  3 |  9  2 10
Local sort     7 12 13 |  4  5  8 |  1  3  6 |  2  9 10
O-E            4  5  7 |  8 12 13 |  1  2  3 |  6  9 10
E-O            4  5  7 |  1  2  3 |  8 12 13 |  6  9 10
O-E            1  2  3 |  4  5  7 |  6  8  9 | 10 12 13
E-O            1  2  3 |  4  5  6 |  7  8  9 | 10 12 13   (SORTED)
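The example above can be simulated sequentially. The following Python sketch models the p PEs as a list of blocks; the names `merge_split` and `odd_even_block_sort` are illustrative, and the code assumes p divides the number of values.

```python
from heapq import merge


def merge_split(low_block, high_block):
    """Merge two sorted blocks; return (smaller half, larger half)."""
    merged = list(merge(low_block, high_block))
    k = len(low_block)
    return merged[:k], merged[k:]


def odd_even_block_sort(numbers, p):
    """Simulate p PEs, each holding len(numbers) // p values."""
    size = len(numbers) // p              # assumes p divides len(numbers)
    blocks = [sorted(numbers[i * size:(i + 1) * size]) for i in range(p)]
    for phase in range(p):                # p phases suffice for p PEs
        start = phase % 2                 # 0: pairs (P0,P1),(P2,P3); 1: pairs (P1,P2)
        for i in range(start, p - 1, 2):
            blocks[i], blocks[i + 1] = merge_split(blocks[i], blocks[i + 1])
    return blocks
```

Calling `odd_even_block_sort([13, 7, 12, 8, 5, 4, 6, 1, 3, 9, 2, 10], 4)` walks through the same four phases as the trace. On a real machine the inner loop iterations of one phase run concurrently, one per PE pair.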
Mergesort - Time complexity
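Sequential:
$$T_{seq} = 1 \cdot n + 2 \cdot \frac{n}{2} + 2^2 \cdot \frac{n}{2^2} + \dots + 2^{\log n} \cdot \frac{n}{2^{\log n}}$$
$$T_{seq} = O(n \log n)$$
Parallel:
$$T_{par} = 2\left(\frac{n}{2^0} + \frac{n}{2^1} + \frac{n}{2^2} + \dots + \frac{n}{2^k} + \dots + 2 + 1\right) = 2n\left(2^0 + 2^{-1} + 2^{-2} + \dots + 2^{-\log n}\right)$$
$$T_{par} = O(4n)$$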
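As a quick numerical check of the sequential and parallel sums in this derivation, the sketch below evaluates both for n a power of two. The function name `mergesort_step_counts` is made up for this illustration.

```python
from math import log2


def mergesort_step_counts(n):
    """Evaluate the sequential and parallel mergesort sums for n a power of two."""
    levels = int(log2(n))
    # Sequential: every level of the merge tree costs n steps in total.
    t_seq = sum(2**i * (n // 2**i) for i in range(levels + 1))
    # Parallel: only one merge per level lies on the critical path.
    t_par = 2 * sum(n // 2**i for i in range(levels + 1))
    return t_seq, t_par
```

For n = 1024 the parallel sum evaluates to 4n - 2, which agrees with the O(4n) bound, while the sequential sum is n(log n + 1).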
Bitonic Mergesort
Bitonic Sequence
A bitonic sequence is defined as a list with no more than one
LOCAL MAXIMUM and no more than one LOCAL MINIMUM.
(Endpoints must be considered: the list wraps around.)
[Figure: example list with 1 local maximum and 1 local minimum. This is OK: the list is bitonic.]
Binary Split
1. Divide the bitonic list into two equal halves.
2. Compare-exchange each item in the first half
with the corresponding item in the second half.
Result:
Two bitonic sequences where the numbers in one sequence are all less
than the numbers in the other sequence.
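The two steps of a binary split can be sketched in Python as follows. The function name `binary_split` is illustrative, and the list length is assumed to be even.

```python
def binary_split(bitonic):
    """One binary split of a bitonic list of even length."""
    half = len(bitonic) // 2
    first, second = bitonic[:half], bitonic[half:]
    # Compare-exchange item k of the first half with item k of the second half.
    low = [min(a, b) for a, b in zip(first, second)]
    high = [max(a, b) for a, b in zip(first, second)]
    return low, high
```

Each of the n/2 compare-exchange operations is independent of the others, so with enough processors a split takes a single parallel step.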
Repeated application of binary split
Bitonic list:
24 20 15 9 4 2 5 8 | 10 11 12 13 22 30 32 45
10 11 12 9 . 4 2 5 8 | 24 20 15 13 . 22 30 32 45
4 2 . 5 8 | 10 11 . 12 9 | 22 20 . 15 13 | 24 30 . 32 45
4 . 2 | 5 . 8 | 10 . 9 | 12 . 11 | 15 . 13 | 22 . 20 | 24 . 30 | 32 . 45
Sorting a bitonic sequence
Compare-and-exchange moves the smaller number of each pair to the left
and the larger number of each pair to the right.
Given a bitonic sequence,
recursively performing binary split will sort the list.
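A recursive sketch of this idea in Python: apply a binary split, then recurse on each half. The name `bitonic_merge` is an illustrative choice, and the input length is assumed to be a power of two.

```python
def bitonic_merge(bitonic):
    """Sort a bitonic list (length a power of two) into increasing order."""
    if len(bitonic) <= 1:
        return list(bitonic)
    half = len(bitonic) // 2
    pairs = list(zip(bitonic[:half], bitonic[half:]))
    low = [min(a, b) for a, b in pairs]     # smaller number of each pair goes left
    high = [max(a, b) for a, b in pairs]    # larger number of each pair goes right
    return bitonic_merge(low) + bitonic_merge(high)
```

The recursion depth is log n, matching the number of split rounds needed for a list of n numbers.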
Sorting an arbitrary sequence
In the final step, a single bitonic sequence is sorted into a single increasing
sequence.
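One common way to reach that final step is to sort the first half in increasing order and the second half in decreasing order, which yields a bitonic sequence, and then merge. The Python sketch below follows that scheme; the names `bitonic_merge_dir` and `bitonic_sort` are illustrative, and the length is assumed to be a power of two.

```python
def bitonic_merge_dir(seq, ascending):
    """Sort a bitonic list in the requested direction by repeated binary split."""
    if len(seq) <= 1:
        return list(seq)
    half = len(seq) // 2
    pairs = list(zip(seq[:half], seq[half:]))
    low = [min(a, b) for a, b in pairs]
    high = [max(a, b) for a, b in pairs]
    first, second = (low, high) if ascending else (high, low)
    return bitonic_merge_dir(first, ascending) + bitonic_merge_dir(second, ascending)


def bitonic_sort(seq, ascending=True):
    """Sort an arbitrary list whose length is a power of two."""
    if len(seq) <= 1:
        return list(seq)
    half = len(seq) // 2
    rising = bitonic_sort(seq[:half], True)      # increasing half
    falling = bitonic_sort(seq[half:], False)    # decreasing half
    return bitonic_merge_dir(rising + falling, ascending)
```

The concatenation `rising + falling` is bitonic by construction, so the last call to `bitonic_merge_dir` is exactly the final step described above.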
Bitonic Sort
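Step No.   Processor No.
           P0 P1 P2 P3 P4 P5 P6 P7
1          L  H  H  L  L  H  H  L
2          L  L  H  H  H  H  L  L
3          L  H  L  H  H  L  H  L
4          L  L  L  L  H  H  H  H
5          L  L  H  H  L  L  H  H
6          L  H  L  H  L  H  L  H

Example (L = processor keeps the low value, H = processor keeps the high value):

Input     K G J M C A N F
Step 1    L H H L L H H L   ->   G K M J A C N F
Step 2    L L H H H H L L   ->   G J M K N F A C
Step 3    L H L H H L H L   ->   G J K M N F C A
Step 4    L L L L H H H H   ->   G F C A N J K M
Step 5    L L H H L L H H   ->   C A G F K J N M
Step 6    L H L H L H L H   ->   A C F G J K M N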
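The L/H pattern can be generated from the processor numbers. The Python sketch below assumes the usual hypercube-style pairing, in which the partner of processor i at a step differs from i in exactly one bit; the names `bitonic_steps` and `bitonic_sort_network` are illustrative.

```python
def bitonic_steps(p):
    """Yield (bit, roles) for each step of a bitonic sort on p = 2^d processors."""
    d = p.bit_length() - 1
    for stage in range(1, d + 1):
        for bit in range(stage - 1, -1, -1):
            roles = []
            for i in range(p):
                ascending = (i >> stage) & 1 == 0   # direction of this sub-sort
                lower_id = (i >> bit) & 1 == 0      # lower-numbered member of the pair
                roles.append("L" if ascending == lower_id else "H")
            yield bit, roles


def bitonic_sort_network(values):
    """Run the steps with one value per processor; the partner is i XOR 2^bit."""
    vals = list(values)
    for bit, roles in bitonic_steps(len(vals)):
        vals = [
            min(v, vals[i ^ (1 << bit)]) if roles[i] == "L" else max(v, vals[i ^ (1 << bit)])
            for i, v in enumerate(vals)
        ]
    return vals
```

For p = 8 the generator yields six steps (1 + 2 + 3), and `bitonic_sort_network("KGJMCANF")` reproduces the example on this slide.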
Number of steps (P=n)
$$T^{bitonic}_{par} = \sum_{i=1}^{\log n} i = \frac{\log n(\log n + 1)}{2} = O(\log^2 n)$$
Bitonic sort (for N >> P)
            P0      | P1      | P2      | P3      | P4       | P5      | P6       | P7
            000     | 001     | 010     | 011     | 100      | 101     | 110      | 111
Input:      2 7 4   | 13 6 9  | 4 18 5  | 12 1 7  | 6 3 14   | 11 6 8  | 4 10 5   | 2 15 17
Local sort, then step 1 (L H H L L H H L):
            2 4 6   | 7 9 13  | 7 12 18 | 1 4 5   | 3 6 6    | 8 11 14 | 10 15 17 | 2 4 5
Step 2 (L L H H H H L L):
            2 4 6   | 1 4 5   | 7 12 18 | 7 9 13  | 10 15 17 | 8 11 14 | 3 6 6    | 2 4 5
Step 3 (L H L H H L H L):
            1 2 4   | 4 5 6   | 7 7 9   | 12 13 18 | 14 15 17 | 8 10 11 | 5 6 6   | 2 3 4
Step 4 (L L L L H H H H):
            1 2 4   | 4 5 6   | 5 6 6   | 2 3 4   | 14 15 17 | 8 10 11 | 7 7 9    | 12 13 18
Step 5 (L L H H L L H H):
            1 2 4   | 2 3 4   | 5 6 6   | 4 5 6   | 7 7 9    | 8 10 11 | 14 15 17 | 12 13 18
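For N >> P, each compare-exchange becomes a merge-split on two sorted blocks of N/P numbers. A Python sketch under that assumption follows; `block_bitonic_sort` is an illustrative name, P is assumed to be a power of two, and P is assumed to divide N.

```python
from heapq import merge


def block_bitonic_sort(numbers, p):
    """Simulate bitonic sort with p processors holding len(numbers) // p values each."""
    size = len(numbers) // p
    blocks = [sorted(numbers[i * size:(i + 1) * size]) for i in range(p)]  # local sort
    d = p.bit_length() - 1
    for stage in range(1, d + 1):
        for bit in range(stage - 1, -1, -1):
            new_blocks = list(blocks)
            for i in range(p):
                partner = i ^ (1 << bit)
                merged = list(merge(blocks[i], blocks[partner]))
                ascending = (i >> stage) & 1 == 0
                lower_id = (i >> bit) & 1 == 0
                keep_low = ascending == lower_id          # role L keeps the low half
                new_blocks[i] = merged[:size] if keep_low else merged[size:]
            blocks = new_blocks
    return blocks
```

Running it on the 24 numbers of this example with p = 8 goes through the same L/H roles as the trace and ends with the blocks in globally sorted order.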
Number of steps (for N >> P)
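$$T^{bitonic}_{par} = \text{Local Sort} + \text{Parallel Bitonic Merge}$$
$$= \frac{N}{P}\log\frac{N}{P} + 2\frac{N}{P}(1 + 2 + 3 + \dots + \log P)$$
$$= \frac{N}{P}\left\{\log\frac{N}{P} + 2\left(\frac{\log P\,(1 + \log P)}{2}\right)\right\}$$
$$= \frac{N}{P}\left(\log N - \log P + \log P + \log^2 P\right)$$
$$T^{bitonic}_{par} = \frac{N}{P}\left(\log N + \log^2 P\right)$$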
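The simplification can be checked numerically. The sketch below evaluates the local-sort-plus-merge form and the final closed form; both function names are made up for this illustration, and base-2 logarithms are assumed.

```python
from math import log2


def bitonic_cost_expanded(n, p):
    """Local sort cost plus 2(N/P) per merge step, summed over 1 + 2 + ... + log P."""
    local_sort = (n / p) * log2(n / p)
    merge_steps = sum(range(1, int(log2(p)) + 1))
    return local_sort + 2 * (n / p) * merge_steps


def bitonic_cost_closed(n, p):
    """Closed form (N/P)(log N + log^2 P)."""
    return (n / p) * (log2(n) + log2(p) ** 2)
```

For N = 2^20 and P = 8 both functions return the same value, since the -log P and +log P terms cancel.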
Parallel sorting - summary
Sorting on Specific Networks
Shearsort
Alternate row and column sorting until the list is fully sorted.
Alternate the row directions to get a snake-like ordering.
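A sequential Python sketch of shearsort on a mesh stored as a list of rows. The phase count of ceil(log2(rows)) + 1 is the commonly quoted bound and is an assumption here, since the slide does not state it; the names `shearsort` and `snake_order` are illustrative.

```python
from math import ceil, log2


def shearsort(mesh):
    """Sort a 2-D mesh (list of equal-length rows) into snake-like order."""
    rows = len(mesh)
    grid = [list(r) for r in mesh]

    def sort_rows():
        # Even rows increase left to right, odd rows decrease (snake pattern).
        for i, row in enumerate(grid):
            row.sort(reverse=(i % 2 == 1))

    def sort_columns():
        # Every column increases from top to bottom.
        for c in range(len(grid[0])):
            column = sorted(grid[r][c] for r in range(rows))
            for r in range(rows):
                grid[r][c] = column[r]

    for _ in range(ceil(log2(rows)) + 1):   # assumed phase bound
        sort_rows()
        sort_columns()
    sort_rows()                             # final row phase
    return grid


def snake_order(grid):
    """Read the mesh along the snake: left to right, then right to left, and so on."""
    out = []
    for i, row in enumerate(grid):
        out.extend(row if i % 2 == 0 else reversed(row))
    return out
```

On a mesh-connected machine each row or column would be sorted in parallel, for example with odd-even transposition sort along that row or column.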
Shearsort Time complexity
$$Speedup_{shearsort} = \frac{T_{seq}}{T_{par}} = O(n) \quad (\text{for } P = n^2)$$
However, efficiency $= \frac{1}{n}$
Rank Sort
First, a[0] is read and compared with each of the other numbers,
a[1] ... a[n-1], recording the number of elements less than a[0].
With x denoting this count, a[0] is copied into the final sorted list
b[0] ... b[n-1] at location b[x]. The same actions are repeated with the other numbers.
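A direct Python sketch of sequential rank sort. It assumes all values are distinct, as the description implies; with duplicates, several numbers would get the same location. The name `rank_sort` is illustrative.

```python
def rank_sort(a):
    """Sequential rank sort; assumes all values in a are distinct."""
    n = len(a)
    b = [None] * n
    for i in range(n):
        # x = number of elements less than a[i] = final position of a[i]
        x = sum(1 for j in range(n) if a[j] < a[i])
        b[x] = a[i]
    return b
```

Each of the n numbers is compared against the n - 1 others, so the sequential cost is O(n^2).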
Parallel Rank Sort (P=n)
One number is assigned to each processor.
Pi finds the final index of a[i] in O(n) steps.
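The P = n case can be imitated with one task per element. This Python sketch uses a thread pool only to mirror the structure (one worker per number); it is not meant to demonstrate real speedup. The names `rank_of` and `parallel_rank_sort` are illustrative, and distinct values are assumed.

```python
from concurrent.futures import ThreadPoolExecutor


def rank_of(a, i):
    """Work done by processor Pi: count the elements less than a[i] in O(n) steps."""
    return sum(1 for value in a if value < a[i])


def parallel_rank_sort(a):
    """One task per element; assumes all values are distinct."""
    n = len(a)
    b = [None] * n
    with ThreadPoolExecutor(max_workers=max(1, n)) as pool:
        ranks = list(pool.map(lambda i: rank_of(a, i), range(n)))
    for i, x in enumerate(ranks):
        b[x] = a[i]          # Pi writes a[i] to its final index
    return b
```

Since the n rank computations are independent and each takes O(n) steps, the parallel time with n processors is O(n).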
Parallel Rank Sort with P = n^2
Use n processors to find the rank of one element. The final count,
i.e. the rank of a[i], can be obtained using a binary addition operation
(a global sum, e.g. MPI_Reduce()).
Time complexity
(for P = n^2):
Tpar = O(log n)
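The O(log n) bound comes from adding the n comparison results pairwise in a tree. The Python sketch below simulates that reduction and counts its parallel steps; it stands in for a global sum and does not call MPI. The names `tree_sum` and `rank_sort_n_squared` are illustrative, and distinct values are assumed.

```python
def tree_sum(bits):
    """Pairwise (tree) summation; returns (total, number of parallel steps)."""
    values, steps = list(bits), 0
    while len(values) > 1:
        # All pairs on one level would be added at the same time.
        values = [sum(values[k:k + 2]) for k in range(0, len(values), 2)]
        steps += 1
    return values[0], steps


def rank_sort_n_squared(a):
    """Rank sort where n processors cooperate on each element (n^2 in total)."""
    n = len(a)
    b = [None] * n
    max_steps = 0
    for i in range(n):
        # One comparison per processor: n flags computed in a single step.
        flags = [1 if a[j] < a[i] else 0 for j in range(n)]
        rank, steps = tree_sum(flags)
        b[rank] = a[i]
        max_steps = max(max_steps, steps)
    return b, max_steps
```

For n = 8 the reduction needs 3 steps, i.e. log n, which dominates the single comparison step.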
Can we do it in O(1) ?