Arbitrary Long Digit Integer Sorter HW/SW Co-Design
Shun-Wen Cheng
Tamkang University, Taipei, TAIWAN
E-mail: [email protected]

Abstract

The arrival of the multimedia and information-security eras means that ever longer integer data must be processed. Previous sorting research has focused on the raw performance of sorting large quantities of numbers with a finite, fixed number of digits or bits. This paper discusses how to solve the arbitrarily long digit integer sorting problem effectively by HW/SW co-design under the Area-Time$^2$ ($AT^2$) price-performance constraint. The work proposes a multi-level (two-level) sort architecture to attain this objective: an existing fixed-digit ($k$-bit) hardware sorter implements the first, or basic, level of sorting, and a software-programmed radix-$2^k$ sort implements the second, or higher, level. Through Super Radix Sorting HW/SW co-design and reuse techniques, the work makes fixed-digit HW sorters more flexible and useful.
[Figure 1: compare-and-swap elements with inputs a, b and outputs max(a, b), min(a, b).]

[Figure 2: magnitude comparator schematics with inputs A, B and outputs A<B, A=B, A>B.]
I. INTRODUCTION
Sorting is one of the most important problems in computer science. Many fundamental processes in computing and communication systems require sorting of data. Sorting networks play a key role in parallel computing, multi-access memories, and multiprocessing [3], [4], [5], [6], [11], [13], [14], [19]. Compare-and-swap elements are vital for sorting, as depicted in Fig. 1. However, if very long digit integers must be sorted and a hardware sorter of the corresponding width is designed directly, the comparators and networks become very large. The circuit schematics of 1-, 2-, 4-, and 16-bit magnitude comparators are depicted in Fig. 2. The buses grow as well: in a design for 32-bit integers every bus is 32 lines wide, and in a design for 64-bit integers every bus is 64 lines wide, which doubles the wiring and its area. More importantly, the circuit cost and complexity of a $2k$-bit comparator are more than twice those of a $k$-bit comparator, as shown in Table 1. In addition, because the fan-out capability of CMOS circuits is limited, extra buffers must be added to the comparator circuits.
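To make the role of the comparator concrete, the following is a minimal behavioral sketch in Python, not the gate-level circuits of Fig. 2. It assumes a wide magnitude comparator is built by cascading two half-width comparators (the high half decides unless it reports equality), and it uses that comparator inside a compare-and-swap element. All function names are hypothetical and introduced only for this illustration.

```python
def compare_1bit(a: int, b: int) -> tuple[int, int, int]:
    """1-bit magnitude comparator: returns the flags (a<b, a==b, a>b)."""
    lt = (1 - a) & b
    gt = a & (1 - b)
    eq = 1 - (lt | gt)
    return lt, eq, gt


def cascade(high, low):
    """Combine a high-order and a low-order comparator result."""
    h_lt, h_eq, h_gt = high
    l_lt, l_eq, l_gt = low
    # The low half matters only when the high halves are equal.
    return h_lt | (h_eq & l_lt), h_eq & l_eq, h_gt | (h_eq & l_gt)


def compare_kbit(a: int, b: int, k: int) -> tuple[int, int, int]:
    """k-bit comparator built recursively from half-width ones (k a power of two)."""
    if k == 1:
        return compare_1bit(a & 1, b & 1)
    half = k // 2
    mask = (1 << half) - 1
    high = compare_kbit(a >> half, b >> half, half)
    low = compare_kbit(a & mask, b & mask, half)
    return cascade(high, low)


def compare_and_swap(a: int, b: int, k: int) -> tuple[int, int]:
    """Compare-and-swap element: returns (min(a, b), max(a, b))."""
    _, _, gt = compare_kbit(a, b, k)
    return (b, a) if gt else (a, b)
```

Each doubling of the width in this sketch needs two half-width comparators plus the combining logic in `cascade`, which is one way to see why the cost of a $2k$-bit comparator exceeds twice that of a $k$-bit one.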
Table 1. The circuit cost/complexity of a long bit/digit comparator is higher than that of a short bit/digit one.
| m-number single sorter chip design | Function blocks of one sorter cell | Cost of each functional block (CMOS transistor gate count) |
|---|---|---|
| 16-bit Enumeration Sorter [23] (1982) | Two 16-bit data registers | 256P + 256N |
| | One 16-bit comparator | 399P + 399N |
| | One 8-bit counter | 136P + 136N |
| 16-bit VLSI Sorter [16] (1983) | Two 16-bit data registers | 256P + 256N |
| | One 16-bit comparator | 399P + 399N |
| | Two 16-bit 2-way multiplexers | 64P + 64N |
| 16-bit Rebound Sorter [8] (1978 [8], 1989 [2]) | Two 8-bit data registers | 128P + 128N |
| | One 8-bit comparator | 195P + 195N |
| | Two 8-bit 2-way multiplexers | 32P + 32N |
| | One 1-bit comparator | 12P + 12N |
| | One 16-bit shift register | 452P + 452N |
| | Two 1-bit 2-way multiplexers | 4P + 4N |
| | Two 1-bit delay elements | 2P + 2N |
Number of cells
2N
m/2
4N
2N
N+1
In Table 2, some sorter chip designs show hardware-expandable properties [1], [2], [8], but they are not good enough for an arbitrarily long digit integer sorter design. The time performance of a fixed-digit ($k$-bit) hardware sorter is often better than that of a software sort program for the same digit length, as displayed in Table 3. However, a pure hardware sorter still has a higher area cost and some restrictions, so it is not yet common in commercial CPUs.

Based on these physical considerations, the author focuses on effectively solving the arbitrarily long digit integer sorting problem by HW/SW co-design under the Area-Time$^2$ ($AT^2$) cost-performance trade-off constraint [20], [21]. Several $AT^2$-optimal sorting networks under different word-length models have been proposed in [7], [9], [15], and [17]. For embedded systems, a uniprocessor software solution is often not applicable because of insufficient I/O and performance, while realizing multiprocessor sorting methods on parallel computers is far too expensive in area cost and power consumption. As data processing migrates from 32-bit to 64-bit, 128-bit, or some as yet unknown greater width, a fixed-digit pure HW sorter alone cannot meet the demand.

All of the sorting algorithms and circuits in this paper are based on commonly known algorithms and structures. Nevertheless, making an existing hardware sorter reusable [12], making a pure HW sorter more flexible, and balancing its cost and performance are valuable and necessary.

This paper is organized as follows. Section 2 briefly introduces the basic LSD radix sort algorithm. A cost-benefit balanced multi-level (two-level) HW/SW mixed sort architecture is then given and discussed in Section 3. Finally, the paper concludes with the major findings and outlines future work.
| Design | Area Perf. ($A$) | Time Perf. ($T_d$) |
|---|---|---|
| Uniprocessor Heapsort | $\log N$ | $N (\log_2 N)^2$ |
| $(1 + \log_2 N)$-processor Mergesort | $(\log_2 N)^2$ | $N \log_2 N$ |
| $(\log_2 N)^2$-processor Bitonic Sort | $(\log_2 N)^3$ | $N$ |
| $N$-processor Bitonic Sort on Mesh | $N (\log_2 N)^2$ | $\sqrt{N}$ |
| $N$-processor Bitonic Sort on Shuffle-Exchange Net [19] | $N^2 / (\log_2 N)^2$ | $(\log_2 N)^3$ |
| $N (\log_2 N)^2$-comparator Bitonic Sort | $N^2 / \log_2 N$ | $(\log_2 N)^2$ |
| $N^2$-comparator Bubble Sort | $N \log_2 N$ | $N$ |
Table 3. Area-Time Bounds for the finite and fixed bit/digit number sorting problem [21].
II. STRAIGHT RADIX SORT ALGORITHM

This approach begins with the least significant key and is known as LSD (Least Significant Digit) sort. After the sort on one key, the piles are put together into a single pile, which is then sorted on the next more significant key. This process continues until the pile has been sorted on the most significant key [13], at which point the sorted sequence is obtained.

Complexity

As shown in Fig. 3, it takes $n$ steps to put all the elements into the queue AUX and $d$ steps to initialize the queues Q[i]. The main loop of the algorithm, which is executed $m$ times (once per digit; $m$ corresponds to $k$ in Fig. 3), pops each element from AUX and pushes it into one of the Q[i]. It also concatenates all the Q[i]. The overall running time of the algorithm is therefore $O(mn)$. If $m$ is bounded or small, it can be treated as a constant, and the time complexity becomes $O(n)$; this is the usual situation on a common CPU.
Algorithm Straight_Radix_Sort (A[ ], n, k)
(* Input: A[ ] (an array of integers, each with k digits, indexed in the range 1 to n).
   Output: A[ ] (the array in sorted order). *)
begin
  Assume that all elements are initially in an auxiliary queue AUX;
  (* The use of AUX is for simplicity; it can be implemented by array A *)
  for i := 0 to d-1 do
    (* d is the number of possible digits; d = 10 in the case of decimal numbers *)
    Initialize queue Q[i] to be empty;
  for i := k downto 1 do
    (* digit k is the least significant digit, digit 1 the most significant *)
    while AUX is not empty do
      Pop x from AUX;
      t := the i-th digit of x;
      Insert x into Q[t];
    for j := 0 to d-1 do
      Insert Q[j] into AUX;
  for i := 1 to n do
    Pop A[i] from AUX;
end.
Figure 3. Basic straight radix sort algorithm and a radix-10 sorting example.
A Radix-10 Sorting Example:
Input: 232, 321, 213, 231, 111, 112, 132, 123, 221

Pass 1 (units digit):
  1s: 321, 231, 111, 221
  2s: 232, 112, 132
  3s: 213, 123
  Combined: 321, 231, 111, 221, 232, 112, 132, 213, 123

Pass 2 (tens digit):
  10s: 111, 112, 213
  20s: 321, 221, 123
  30s: 231, 232, 132
  Combined: 111, 112, 213, 321, 221, 123, 231, 232, 132

Pass 3 (hundreds digit):
  100s: 111, 112, 123, 132
  200s: 213, 221, 231, 232
  300s: 321

Result: 111, 112, 123, 132, 213, 221, 231, 232, 321
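The queue-based procedure of Fig. 3 can be sketched in Python as follows. This is an illustrative sketch, not the author's implementation: it assumes non-negative integers, numbers digits from 0 (least significant) instead of from k, and adds a hypothetical `trace` parameter so that the intermediate piles can be inspected.

```python
from collections import deque


def straight_radix_sort(values, k, d=10, trace=None):
    """LSD radix sort of non-negative integers with at most k base-d digits."""
    aux = deque(values)                      # auxiliary queue AUX
    queues = [deque() for _ in range(d)]     # one queue Q[t] per digit value
    for i in range(k):                       # i = 0 is the least significant digit
        while aux:
            x = aux.popleft()
            t = (x // d**i) % d              # digit of x examined in this pass
            queues[t].append(x)
        for q in queues:                     # concatenate Q[0..d-1] back into AUX
            aux.extend(q)
            q.clear()
        if trace is not None:
            trace.append(list(aux))          # snapshot after each pass
    return list(aux)


if __name__ == "__main__":
    passes = []
    data = [232, 321, 213, 231, 111, 112, 132, 123, 221]
    print(straight_radix_sort(data, k=3, trace=passes))
    for snapshot in passes:
        print(snapshot)
```

Run on the radix-10 example, the snapshots after the first and second passes match the combined piles listed in the example. Because each queue preserves arrival order, every pass is stable, which is what lets a later pass on a more significant digit keep the ordering established by earlier passes.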
multi-level mixed sort architecture can be considered. Of course, if the input sequence is also arbitrarily long, some special designs provide solutions [24]; alternatively, the sequence can be separated into several pieces that are sorted individually and then merged to obtain the overall result.

Because the numbers are very long, comparing two numbers and then swapping them directly is very inefficient [18]. A better idea is an indirect method that records only the swapped indices and holds them in cache.

If the system has only a common CPU, the longest number has $m$ bits, and the sort algorithm is radix sort, the average overall running time is $m \cdot O(N)$. If the system also has an existing fixed-digit ($k$-bit) hardware sorter, the overall running time of the proposed method becomes $\lceil m/k \rceil \cdot T_d$. If the HW sorter is an $N (\log_2 N)^2$-comparator bitonic sorter, the overall running time is $\lceil m/k \rceil \cdot O((\log_2 N)^2)$. Some comparisons are shown in Table 4.

The proposed HW/SW mixed super radix sorting architecture makes it easy to change the HW/SW partitioning ratio, as displayed in Fig. 6, to obtain a cost-benefit balanced, flexible HW/SW mixed design. The fixed-digit ($k$-bit) hardware sorter can be any existing design of the designer's choice or a custom design.
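As a worked instance of the running-time estimate above, consider illustrative values $m = 1024$ and $k = 256$, chosen here only for the example. The subscripts SW and mixed are labels introduced for this illustration, and reading the $m/k$ factor as a ceiling is an assumption made so that lengths that are not a multiple of $k$ are covered.

```latex
\begin{align*}
T_{\text{SW}}    &= m \cdot O(N) = 1024 \cdot O(N), \\
T_{\text{mixed}} &= \left\lceil \frac{m}{k} \right\rceil T_d
                  = \left\lceil \frac{1024}{256} \right\rceil T_d
                  = 4\,T_d
                  = 4 \cdot O\!\left((\log_2 N)^2\right).
\end{align*}
```

The last equality holds only under the stated condition that the HW sorter is an $N (\log_2 N)^2$-comparator bitonic sorter; for any other sorter, $T_d$ takes the corresponding time bound from Table 3.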
DATA SEGMENT
    NUM DB .............
    ...............
DATA ENDS
..........................
CODE SEGMENT
..........................
SORT_START:
    MOV ESI, OFFSET NUM    ; source address
    MOV EBX, 88D           ; digit of number = 88
    MOV EDX, 9D            ; there are 9 numbers to be sorted
    SORT
.........................
CODE ENDS
[Block diagram: a CPU annotated with $\lceil 88/32 \rceil = 3$ (CX = 3), a TAG cache, and a k-bit (32-bit) HW sorter attached to a high-speed system bus, with load-data and control paths.]
Figure 5. Why does a SORT instruction not appear in today's instruction sets?
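[Figure 6 diagram: HW/SW partitioning spectrum from Common CPU (all SW, slow) through 16-bit HW, 32-bit HW, and 64-bit HW sorters to Pure ASIC (all HW, fast). Software sorting processing (Super Radix Sort) increases toward the CPU end; the hardware sorter increases toward the ASIC end.]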
(c) Sort step 3 and complete the work. Figure 4. Two-level HW/SW mixed sort architecture.
Figure 7 demonstrates a super radix sort: an 88-digit (88-bit) integer SW LSD radix-4,294,967,296 (radix-$2^{32}$) sort mixed with a 32-bit HW sorter, which needs 3 steps. If the same 88-digit integers are processed by a SW LSD radix-65,536 (radix-$2^{16}$) sort mixed with a 16-bit HW sorter, 6 steps are needed. If the hardware sorter can easily be decomposed into several stages and then pipelined, it achieves higher hardware sharing and throughput, as Fig. 8 depicts.
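The step counts quoted above follow from cutting each long integer into k-bit words, one word per HW sorting pass. The sketch below shows that arithmetic in Python; the helper names `num_passes` and `split_words` are hypothetical, and the ceiling is an assumption that a final partial word still costs a full pass.

```python
def num_passes(m: int, k: int) -> int:
    """Number of k-bit words (and hence sorting passes) in an m-bit integer."""
    return -(-m // k)  # ceiling division without floating point


def split_words(x: int, m: int, k: int) -> list[int]:
    """Split an m-bit integer into k-bit words, least significant word first."""
    mask = (1 << k) - 1
    return [(x >> (k * i)) & mask for i in range(num_passes(m, k))]


if __name__ == "__main__":
    print(num_passes(88, 32))  # 32-bit HW sorter
    print(num_passes(88, 16))  # 16-bit HW sorter
    x = (0xAB << 64) | (0x12345678 << 32) | 0x9ABCDEF0
    print([hex(w) for w in split_words(x, 88, 32)])
```

With a 32-bit sorter the top word of an 88-bit number carries only 24 significant bits, so the last pass sorts on a shorter key but still occupies the sorter for a full step.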
Tag[X]  num[X][2]  num[X][1]  num[X][0]

Origin:
[1] 00000110 0100000010010000  1010000000010100 0010101000000000  0001000000101000 0001010101010000
[2] 10101000 0010101010101000  0000101010001010 0100000010101011  0000000000000010 0000000000000000
[3] 00000000 0000010111110000  0000000000000000 0000000010101010  0000010101000000 0101100000001100
[4] 00000011 0111100000000100  1100000011110000 0001110000010010  0000000000000000 0000000000000010
[5] 00000000 0000000000000000  0000000000000000 0000010100001000  0000000000000000 0000001100110011
[6] 00000000 0000000000000000  0000000000000001 0000100010000000  0000000000000100 0010100000010000
[7] 00000000 0000000000000001  0001010000001000 0000000100101000  0000000101000000 0000000000111000
[8] 00000000 0000000000000000  0000000100100000 0000001000010100  0010001000101000 0000001010000100
[9] 00000000 0001000010101000  0000100001110000 0000100000011000  0000000000001111 0000000001100000

Step 1: [4] [5] [2] [6] [9] [7] [3] [1] [8]
[4] 00000011 0111100000000100  1100000011110000 0001110000010010  0000000000000000 0000000000000010
[5] 00000000 0000000000000000  0000000000000000 0000010100001000  0000000000000000 0000001100110011
[2] 10101000 0010101010101000  0000101010001010 0100000010101011  0000000000000010 0000000000000000
[6] 00000000 0000000000000000  0000000000000001 0000100010000000  0000000000000100 0010100000010000
[9] 00000000 0001000010101000  0000100001110000 0000100000011000  0000000000001111 0000000001100000
[7] 00000000 0000000000000001  0001010000001000 0000000100101000  0000000101000000 0000000000111000
[3] 00000000 0000010111110000  0000000000000000 0000000010101010  0000010101000000 0101100000001100
[1] 00000110 0100000010010000  1010000000010100 0010101000000000  0001000000101000 0001010101010000
[8] 00000000 0000000000000000  0000000100100000 0000001000010100  0010001000101000 0000001010000100

Step 2: [3] [5] [6] [8] [9] [2] [7] [1] [4]
[3] 00000000 0000010111110000  0000000000000000 0000000010101010  0000010101000000 0101100000001100
[5] 00000000 0000000000000000  0000000000000000 0000010100001000  0000000000000000 0000001100110011
[6] 00000000 0000000000000000  0000000000000001 0000100010000000  0000000000000100 0010100000010000
[8] 00000000 0000000000000000  0000000100100000 0000001000010100  0010001000101000 0000001010000100
[9] 00000000 0001000010101000  0000100001110000 0000100000011000  0000000000001111 0000000001100000
[2] 10101000 0010101010101000  0000101010001010 0100000010101011  0000000000000010 0000000000000000
[7] 00000000 0000000000000001  0001010000001000 0000000100101000  0000000101000000 0000000000111000
[1] 00000110 0100000010010000  1010000000010100 0010101000000000  0001000000101000 0001010101010000
[4] 00000011 0111100000000100  1100000011110000 0001110000010010  0000000000000000 0000000000000010

Step 3: [5] [6] [8] [7] [3] [9] [4] [1] [2]
[5] 00000000 0000000000000000  0000000000000000 0000010100001000  0000000000000000 0000001100110011
[6] 00000000 0000000000000000  0000000000000001 0000100010000000  0000000000000100 0010100000010000
[8] 00000000 0000000000000000  0000000100100000 0000001000010100  0010001000101000 0000001010000100
[7] 00000000 0000000000000001  0001010000001000 0000000100101000  0000000101000000 0000000000111000
[3] 00000000 0000010111110000  0000000000000000 0000000010101010  0000010101000000 0101100000001100
[9] 00000000 0001000010101000  0000100001110000 0000100000011000  0000000000001111 0000000001100000
[4] 00000011 0111100000000100  1100000011110000 0001110000010010  0000000000000000 0000000000000010
[1] 00000110 0100000010010000  1010000000010100 0010101000000000  0001000000101000 0001010101010000
[2] 10101000 0010101010101000  0000101010001010 0100000010101011  0000000000000010 0000000000000000
Step 1:
  Input sequence: A[1][0], A[2][0], A[3][0], A[4][0], A[5][0], A[6][0], A[7][0], A[8][0], A[9][0]
  After HW sorting: A[4][0], A[5][0], A[2][0], A[6][0], A[9][0], A[7][0], A[3][0], A[1][0], A[8][0]
  Record ONLY the swapped index: 4, 5, 2, 6, 9, 7, 3, 1, 8.

Step 2:
  Input sequence: A[4][1], A[5][1], A[2][1], A[6][1], A[9][1], A[7][1], A[3][1], A[1][1], A[8][1]
  After HW sorting: A[3][1], A[5][1], A[6][1], A[8][1], A[9][1], A[2][1], A[7][1], A[1][1], A[4][1]
  Record ONLY the swapped index: 3, 5, 6, 8, 9, 2, 7, 1, 4.

Step 3:
  Input sequence: A[3][2], A[5][2], A[6][2], A[8][2], A[9][2], A[2][2], A[7][2], A[1][2], A[4][2]
  After HW sorting: A[5][2], A[6][2], A[8][2], A[7][2], A[3][2], A[9][2], A[4][2], A[1][2], A[2][2]
  Record ONLY the swapped index: 5, 6, 8, 7, 3, 9, 4, 1, 2.

The final index is the answer.
* Swapping the original whole numbers during these sorting steps is unnecessary.

Figure 7. Super Radix Sort: 88-digit integer SW LSD radix-4,294,967,296 (radix-$2^{32}$) sort mixed with a 32-bit HW sorter.
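The three steps of Figure 7 can be reproduced with a short Python simulation. This is a sketch under stated assumptions, not the author's implementation: the k-bit HW sorter is stood in for by Python's stable `sorted`, the HW sorter is assumed to preserve the input order of equal keys (which LSD passes require), and the names `hw_sort_indices` and `super_radix_sort` are hypothetical. Only the index list is permuted; the long numbers themselves never move.

```python
# The nine 88-bit operands of the example, tags [1]..[9], most significant bits first.
ROWS = [
    "00000110 0100000010010000 1010000000010100 0010101000000000 0001000000101000 0001010101010000",
    "10101000 0010101010101000 0000101010001010 0100000010101011 0000000000000010 0000000000000000",
    "00000000 0000010111110000 0000000000000000 0000000010101010 0000010101000000 0101100000001100",
    "00000011 0111100000000100 1100000011110000 0001110000010010 0000000000000000 0000000000000010",
    "00000000 0000000000000000 0000000000000000 0000010100001000 0000000000000000 0000001100110011",
    "00000000 0000000000000000 0000000000000001 0000100010000000 0000000000000100 0010100000010000",
    "00000000 0000000000000001 0001010000001000 0000000100101000 0000000101000000 0000000000111000",
    "00000000 0000000000000000 0000000100100000 0000001000010100 0010001000101000 0000001010000100",
    "00000000 0001000010101000 0000100001110000 0000100000011000 0000000000001111 0000000001100000",
]
NUMS = [int(row.replace(" ", ""), 2) for row in ROWS]


def hw_sort_indices(keys, order):
    """Software stand-in for the k-bit HW sorter (assumed stable).

    `order` is the current index sequence; the result is the same indices
    rearranged by ascending k-bit key.
    """
    return sorted(order, key=lambda idx: keys[idx])


def super_radix_sort(nums, m, k):
    """LSD radix-2**k sort of m-bit integers; returns the 1-based tag order after each pass."""
    passes = -(-m // k)                      # ceil(m / k) HW sorting steps
    mask = (1 << k) - 1
    order = list(range(len(nums)))           # only indices are ever rearranged
    history = []
    for p in range(passes):                  # p = 0 is the least significant word
        keys = [(x >> (k * p)) & mask for x in nums]
        order = hw_sort_indices(keys, order)
        history.append([i + 1 for i in order])
    return history


if __name__ == "__main__":
    for step, tags in enumerate(super_radix_sort(NUMS, m=88, k=32), start=1):
        print(f"Step {step}: {tags}")
```

Calling `super_radix_sort(NUMS, m=88, k=16)` instead models the 16-bit HW sorter case: it takes six passes and ends with the same final tag order.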
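| Architecture / sort method | Performance order |
|---|---|
| Pure Common CPU / Radix Sort | $m \cdot O(N)$ |
| Pure Common CPU / Quick Sort [11] | $m \cdot O(N \log N)$ |
| CPU & One 256-bit HW Sorter* / Super Radix Sort | $\lceil m/256 \rceil \cdot T_d$ |

* If the HW sorter is an $N (\log_2 N)^2$-comparator bitonic processor, the order of $T_d$ is $O((\log_2 N)^2)$.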
Table 4. The performance order comparison between original architectures and new mixed architectures.
[Figure 8 diagram: Level 1 (stage 1), Level 2 (stages 1 and 2), Level 3 sub-sorter (stages 1 to 3); the six stages together, stage 1 through stage 6, form a 6-stage pipeline.]
Figure 8. A three-level bitonic sorter. Pipelining this type of circuit yields higher hardware sharing and throughput.
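The level and stage structure of Figure 8 can be sketched as a software model of a bitonic sorting network. The code below is an illustrative Python sketch using the standard bitonic construction for a power-of-two number of inputs; it is an assumption that this matches the circuit in the figure, and the names `bitonic_stages` and `run_network` are hypothetical. Each stage is a set of independent compare-and-swap elements, so each stage could serve as one pipeline stage.

```python
def bitonic_stages(n):
    """Comparator stages of a bitonic sorting network for n = 2**p inputs.

    Returns a list of stages; each stage is a list of (i, j, ascending)
    triples describing independent compare-and-swap elements.
    """
    stages = []
    size = 2
    while size <= n:                 # one level per merge size: 2, 4, ..., n
        stride = size // 2
        while stride >= 1:           # one stage per stride within the level
            stage = []
            for i in range(n):
                j = i ^ stride       # partner wire of wire i in this stage
                if j > i:
                    ascending = (i & size) == 0
                    stage.append((i, j, ascending))
            stages.append(stage)
            stride //= 2
        size *= 2
    return stages


def run_network(data, stages):
    """Apply the comparator stages to a copy of data and return the result."""
    data = list(data)
    for stage in stages:
        for i, j, ascending in stage:
            out_of_order = data[i] > data[j] if ascending else data[i] < data[j]
            if out_of_order:
                data[i], data[j] = data[j], data[i]
    return data


if __name__ == "__main__":
    stages = bitonic_stages(8)
    print(len(stages), [len(s) for s in stages])
    print(run_network([7, 3, 0, 5, 6, 1, 4, 2], stages))
```

For 8 inputs the model has three levels with 1, 2, and 3 stages respectively, which gives the six stages of the pipeline. Once the pipeline is full, a new input set can enter stage 1 while earlier sets occupy the later stages.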
Parameter Range Layout cost Hardware reusing Old design Fixed k-digit 1 0 New design 2
32
k - digit
REFERENCES
[1] M. Afghahi, A 512 16-b Bit-serial Sorter Chip, IEEE J. Solid-State Circuits, vol. 26, pp. 1452-1457, Oct. 1991.
[2] B. Ahn and J. M. Murray, A Pipelined, Expandable VLSI Sorting Engine Implemented in CMOS Technology, in Proc. IEEE Intl. Symp. on Circuits and Systems, 1989, pp. 134-137.
[3] S. G. Akl, Parallel Sorting Algorithms. Reading, New York: Academic Press, 1985.
[4] K. E. Batcher, Sorting Networks and Their Applications, in Proc. AFIPS 1968 Spring Joint Computer Conference, pp. 307-314, Apr. 1968.
[5] G. Baudet and D. Stevenson, Optimal Sorting Algorithms for Parallel Computer, IEEE Trans. Computers, vol. 27, pp. 84-87, Jan. 1978.
[6] R. Beigel and J. Gill, Sorting n Objects with a k-sorter, IEEE Trans. Computers, vol. 39, pp. 714-716, May 1990.
[7] G. Bilardi and F. P. Preparata, A Minimum Area VLSI Network for O(log n) Time Sorting, IEEE Trans. Computers, vol. 34, pp. 336-343, May 1985.
[8] T. C. Chen, Vincent Y. Lum, and C. Tung, The Rebound Sorter: An Efficient Sort Engine for Large File, in IEEE Proc. 4th Intl Conf. on Very Large Data Bases, pp. 312-318, Sep. 1978.
[9] R. Cole and A. R. Seigel, Optimal VLSI Circuits for Sorting, JACM, vol. 35, pp. 777-809, 1988.
[10] Edward H. Friend, Sorting on Electronic Computer Systems, JACM, vol. 3, pp. 134-168, 1956.
[11] C. A. R. Hoare, Quicksort, Computing Journal, vol. 5, pp. 10-15, 1962.
[12] M. Keating and P. Bricaud, Reuse Methodology Manual. Reading: Kluwer, 1998.
[13] D. E. Knuth, The Art of Computer Programming, Vol 3: Sorting and Searching. Reading: Addison-Wesley, 1973.
[14] J.-G. Lee and B.-G. Lee, Realization of Large-scale Distributors Based on Batcher Sorters, IEEE Trans. Communications, vol. 47, pp. 1103-1110, July 1999.
[15] T. Leighton, Tight Bounds on The Complexity of Parallel Sorting, IEEE Trans. Computers, vol. 34, pp. 344-354, Apr. 1985.
[16] G. S. Miranker, Luong Tang, and Chak-Kuen Wong, A Zero-Time VLSI Sorter, IBM J. Research & Development, vol. 27, pp. 140-148, Mar. 1983.
[17] S. Olariu, M. C. Pinotti, and S. Q. Zheng, How to Sort N Items Using a Sorting Network of Fixed I/O Size, IEEE Trans. Parallel and Distributed Sys., vol. 10, pp. 487-499, May 1999.
[18] B. Parhami and D.-M. Kwai, Data-driven Control Scheme for Linear Arrays: Application to a Stable Insertion Sorter, IEEE Trans. Parallel and Distributed Sys., vol. 10, pp. 23-28, Jan. 1999.
[19] H. S. Stone, Parallel Processing with the Perfect Shuffle, IEEE Trans. Computers, vol. 20, pp. 153-161, Feb. 1971.
[20] C. D. Thompson, Area-Time Complexity for VLSI, in Proc. 11th Annual ACM Symp. on Theory of Comp., pp. 81-88, Apr. 1979.
[21] C. D. Thompson, The VLSI Complexity of Sorting, IEEE Trans. Computers, vol. 32, pp. 1171-1184, Dec. 1983.
[22] N. H. E. Weste and K. Eshraghian, Principle of CMOS VLSI Design, 2nd Ed., Reading: Addison-Wesley, 1993.
[23] H. Yasuura, N. Takagi, and S. Yajima, The Parallel Enumeration Sorting Scheme for VLSI, IEEE Trans. Computers, vol. 31, pp. 1192-1201, Dec. 1982.
[24] S. Q. Zheng, S. Olariu, and M. C. Pinotti, A Systolic Architecture for Sorting an Arbitrary Number of Elements, in Proc. 1997 3rd Int. Conf. Algorithms and Architectures for Parallel Processing, pp. 113-126, 1997.