A Hybrid Pipelined Architecture for High Performance Top-K Sorting on FPGA
Abstract—We present a hybrid pipelined sorting architecture capable of finding and producing as its output the K largest elements from an input sequence. The architecture consists of a bitonic sorter and L cascaded sorting units. The sorting unit is designed to output P elements during every cycle with the aim of increasing the throughput and lowering the latency. The function of the bitonic sorter is to generate a segmented ordered sequence. The sorting unit processes this sequence to identify and output the P largest elements. Hence, the K = PL largest elements are obtained after the segmented ordered sequence proceeds through L cascaded sorting units. Variable-length and continuous sequences are supported by the proposed sorting architecture. The implementation results show that the sorting architecture achieves a throughput of 22.88 GB/s with P = 16 on a state-of-the-art Field Programmable Gate Array (FPGA).

Index Terms—Field programmable gate array (FPGA), sorting architecture, high throughput, low latency.

I. INTRODUCTION

Sorting is one of the most important computing tasks in many applications such as database operations [1], signal processing, and statistical methodology. The arrival of Big Data has led researchers to believe that sorting operations consume an excessive number of CPU cycles in software implementations [2]. The acceleration of sorting operations has therefore become increasingly urgent, and sorting has been implemented on hardware platforms such as Field Programmable Gate Arrays (FPGAs) [3]–[5], GPUs [6], [7], and ASICs [8]–[10]. Among these platforms, the FPGA is attracting considerable attention in the field of hardware acceleration owing to its high degree of parallelism, low power consumption, and reconfiguration capability.

Several sorting algorithms have been proposed to speed up sorting operations. Bitonic sort is a typical parallel algorithm that offers great performance; however, its area cost can be extremely high when the problem size becomes large. Chen et al. [11] proposed an energy- and memory-efficient mapping methodology for implementing the bitonic sorting network. Merge sort can sort a list of N elements using a merge tree of depth log N; however, at the root of the tree no parallelism can be exploited because of the serial nature of the algorithm [12]. To increase the throughput, Srivastava et al. [13] proposed a hybrid design based on merge sort in which the final few stages of the merge sort network are replaced with folded bitonic merge networks. Merge sort can also combine two sorted sequences into one sorted sequence. Mashimo et al. [14] optimized the merge logic, transforming the merge network into a pipelined architecture that uses fewer feedback registers.

The algorithms mentioned above all aim to obtain a fully sorted sequence. However, in many applications, only the K largest of the N input elements are of interest. Farmahini-Farahani et al. [15] proposed a modular technique based on the bitonic sorting algorithm to design units that return only the M largest values in ascending or descending order. However, it has the same disadvantage as bitonic sort, namely a huge area cost for large problem sizes. Apart from this, when the length of the input sequence changes, the sorting architecture must also be adjusted. Matsumoto et al. [16] presented a FIFO-based parallel merge sorter, which is resource efficient and achieves great latency performance. The disadvantage of this merge sorter is that it requires the input sequence to have a fixed length; moreover, the length must be a power of two. A modular serial pipelined sorting architecture able to accept continuous and variable-length sequences was presented in [17]. The simplicity of its sorting cell architecture and its short critical path delay enable a high-frequency implementation. However, the sorting cells accept only one data item per cycle, which limits the throughput.

The objective of this brief is to sort the K largest elements from an input sequence. We present a hybrid pipelined sorting architecture that not only retains all the advantages of the existing architecture [17] but also increases the throughput and lowers the latency. In general terms, the main contributions of this brief are as follows:
• Inspired by [14], a sorting unit with data parallelism, which can output P data items every cycle, is designed. Compared with the previous sorting cells [17], the relative throughput¹ increases P-fold and the latency in clock cycles decreases to (1/2 + 1/(2P)) of its former value.
• A hybrid pipelined sorting architecture composed of a bitonic sorter and L cascaded sorting units is constructed. Through the data parallelism P, it outputs the PL largest elements of an input sequence in ascending order.

Manuscript received July 25, 2019; accepted August 23, 2019. Date of publication September 4, 2019; date of current version August 4, 2020. This brief was recommended by Associate Editor W. N. N. Hung. (Corresponding author: Feng Yu.)
W. Chen and F. Yu are with the Department of Instrument Science and Technology, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]; [email protected]).
W. Li is with the Ningbo Institute of Technology, Zhejiang University, Ningbo 315100, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSII.2019.2938892

¹The relative throughput is defined as the number of elements sorted per cycle.
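Functionally (setting aside the pipelined hardware), the top-K task this brief targets can be stated as a short software reference model. The sketch below is ours, not part of the brief; the function name is chosen for illustration, and the bounded-heap behavior only loosely mirrors the K = PL registers of the architecture.

```python
import heapq

def top_k_reference(stream, k):
    """Software reference model of the top-K task: return the k
    largest elements of an arbitrary-length input sequence, in
    ascending order, as the proposed architecture does."""
    # heapq.nlargest tracks only k candidates at any time, loosely
    # mirroring the bounded register storage of the hardware.
    return sorted(heapq.nlargest(k, stream))

# top_k_reference([5, 1, 9, 3, 7, 2, 8], 3) -> [7, 8, 9]
```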
1549-7747 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
1450 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 67, NO. 8, AUGUST 2020
CHEN et al.: HYBRID PIPELINED ARCHITECTURE FOR HIGH PERFORMANCE TOP-K SORTING ON FPGA 1451
Fig. 3. Example of the behavior of the sorting unit with P = 4 when two sequences enter.
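Since Section II and Fig. 3 itself are not reproduced in this excerpt, the following is only a behavioral approximation, written by us, of what a sorting unit does: it assumes each unit retains the P largest elements seen so far (modeling the Rm feedback registers) and forwards the remaining elements downstream as ascending blocks. Function names and the block-wise merge are illustrative, not the cycle-accurate hardware with its CI control signal.

```python
def sorting_unit(blocks, p):
    """Behavioral sketch of one sorting unit: consume ascending
    blocks of p elements, retain the p largest seen so far, and
    forward the rest downstream as ascending blocks."""
    keep = []                                 # models the Rm registers
    passed = []                               # blocks sent to the next unit
    for block in blocks:
        merged = sorted(keep + list(block))   # stand-in for the merge logic
        keep = merged[-p:]                    # p largest stay in the unit
        if len(merged) > p:
            passed.append(merged[:-p])        # remainder moves on
    return keep, passed

def cascade(blocks, p, l):
    """Cascade l units; concatenating their retained registers yields
    the k = p*l largest elements in ascending order overall."""
    out = []
    for _ in range(l):
        keep, blocks = sorting_unit(blocks, p)
        out = keep + out                      # earlier units hold larger values
    return out
```

For example, with P = 4 and the three ascending blocks [1,1,3,4], [2,5,6,9], [3,5,5,8], the first unit retains [5,6,8,9] and a two-unit cascade returns the eight largest values in ascending order.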
Fig. 5. Total latency of various top-K sorters based on different problem sizes.
TABLE I
Comparison of Various Top-K Sorters With L Cascaded Sorting Units

III. ANALYSIS

A. Increasing the Relative Throughput

Clearly, the relative throughput of the sorting architecture increases P-fold compared with [17], as P elements can be input every cycle when processing a sequence. In fact, the ability to accept P elements per cycle is only one of the two contributors to our relative throughput increase. The other is that the control signal CI and the corresponding muxes, whose selection signals are CI, enable the sorting unit to output the results of the previous sequence and initialize the Rm registers simultaneously. Therefore, sequences are input into the sorting architecture seamlessly. Previously [14], by contrast, additional elements with a complete bit set had to be added to flush out the elements stored in the feedback registers (the Rm registers in this case). Hence, if CI and the corresponding muxes were absent, PL such elements would have to stream into the proposed sorting architecture to obtain the results before any other sequence could be processed, and the relative throughput would be computed as below (N is the length of the input sequence):

Relative throughput = PN/(N + PL) < P.

B. Decreasing the Latency in Clock Cycles

Table I compares various top-K sorters, among which [17] is the only one that supports continuous and variable-length sequences and additionally provides the highest relative throughput apart from our proposed sorter. Next, we compare our top-K sorter with that of Chen et al. [17] in terms of latency in clock cycles. We define the latency of a single sorting unit as the clock cycles required from the time at which the first valid element is input to the time the first valid element is output. The total latency is defined as the cycles required from the time at which the first valid element is input until the time the first result is output. If Tp is the latency of the sorting unit in this brief and T1 is the latency of the sorting cell in [17], then

Tp = P + 1, T1 = 2.

As our sorting unit can hold the P largest elements, it is equivalent to P cascaded sorting cells of [17], whose combined latency is PT1 = 2P; the latency of our sorting unit therefore decreases to Tp/(PT1) = (P + 1)/(2P) = (1/2 + 1/(2P)) of theirs. The total latencies of the top-K sorters proposed in this brief and by Chen et al. [17] are listed in Table I, and their difference is

ΔT = N − N/P − ((log2 P + 1) log2 P)/2, where K = PL.

Because N is the problem size, its value can be of the order of thousands, millions, or even billions. In contrast, the size of P must be limited to avoid an increase in resource consumption; in our implementation, the value of P is taken from the set {2, 4, 8, 16}. Hence, ΔT ≫ 0, which means the total latency of our top-K sorter is sharply reduced.

Assume that the data parallelism P increases to its upper limit N. In this case, the cascaded sorting units are discarded and the sorting architecture becomes a bitonic sorting network. Conversely, if P decreases to its lower limit, the bitonic sorter becomes unnecessary and the sorting architecture is almost the same as that in [17]; the only difference is the order in which the input data are registered and compared. Therefore, the sorting architecture presented in [17] is merely one of the cases proposed in this brief, namely that with data parallelism 1.

IV. EXPERIMENTAL RESULTS

We implemented our proposed sorting architecture on a Xilinx Virtex-7 FPGA XC7VX485T FFG1157-2 using Xilinx Vivado 2017.4. The data parallelism is set to P = 2, 4, 8, 16 and the input elements are in 32-bit format. The corresponding implementation results are listed in Table II. According to Table II, the throughput of the proposed sorting architecture increases almost P-fold compared with [17], [19]. The total latency (in nanoseconds) consists of both fixed and incremental components, and the latter decreases almost to 1/P compared with [17], [19]. As the performance of our sorting architecture improves significantly, the resource consumption increases. However, our design is more resource efficient, as each of our top-K sorters has a larger throughput-to-resource ratio, labeled ratio1 in Table II (the resource is the sum of the registers and LUTs; '-' indicates that the corresponding architecture cannot be implemented because of limited resources). Power consumption is another essential metric in hardware design.
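The expressions in Sections III-A and III-B can be checked numerically. The sketch below follows the document's formulas; the function names, the symbol ΔT for the latency difference, and the sample values of N are our choices for illustration.

```python
from math import log2

def relative_throughput(n, p, l):
    """Elements sorted per cycle with seamless sequence input:
    PN/(N + PL), which is strictly less than P."""
    return p * n / (n + p * l)

def unit_latency_ratio(p):
    """Proposed unit latency Tp = P + 1 cycles, versus 2P cycles for
    P cascaded cells of [17] with T1 = 2: ratio = 1/2 + 1/(2P)."""
    return (p + 1) / (2 * p)

def latency_difference(n, p):
    """Total-latency difference (in cycles) between the top-K sorter
    of [17] and the proposed one, for K = PL."""
    return n - n / p - (log2(p) + 1) * log2(p) / 2

# For P = 16: the unit latency ratio is 17/32 = 0.53125, and with
# N = 2**20 the difference is 983030.0 cycles, i.e., Delta_T >> 0.
```

For P = 16 and N = 2²⁰, the dominant term N − N/P already saves roughly 15/16 of the incremental latency, which is consistent with the claim that ΔT ≫ 0 for realistic N.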
TABLE II
Comparison of the Implementation Results of Various Top-K Sorters

Our top-K sorters are also more energy efficient in terms of the throughput-to-power ratio, labeled ratio2 in Table II.

Fig. 5 shows the total latency of several kinds of top-K sorters. As the results show, increasing the data parallelism sharply decreases the total latency of the proposed top-K sorters. For instance, the total latency of our top-128 sorter is reduced by 2.61 s and 2.64 s compared with [17] and [19], respectively, for P = 16 when N is set to 1G.

V. CONCLUSION

This brief presents the construction of a hybrid pipelined sorting architecture consisting of a bitonic sorter and L cascaded sorting units. It not only supports continuous and variable-length sequences but also provides high throughput and low latency. Our theoretical analysis indicates that the sorting architecture presented in [17] is merely one of the cases proposed in this brief, i.e., the case in which the data parallelism is set to 1. The implementation results show that the proposed sorting architecture is both resource and energy efficient in terms of the throughput-to-resource ratio and the throughput-to-power ratio.

REFERENCES

[1] J. Casper and K. Olukotun, "Hardware acceleration of database operations," in Proc. ACM/SIGDA FPGA, Monterey, CA, USA, 2014, pp. 151–160.
[2] J. Chhugani et al., "Efficient implementation of sorting on multi-core SIMD CPU architecture," Proc. VLDB Endow., vol. 1, no. 2, pp. 1313–1324, 2008.
[3] R. Marcelino, H. C. Neto, and J. M. P. Cardoso, "Unbalanced FIFO sorting for FPGA-based systems," in Proc. 16th IEEE ICECS, 2009, pp. 431–434.
[4] D. Koch and J. Torresen, "FPGAsort: A high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting," in Proc. 19th ACM/SIGDA Int. Symp. FPGAs, 2011, pp. 45–54.
[5] R. Mueller, J. Teubner, and G. Alonso, "Sorting networks on FPGAs," VLDB J., vol. 21, no. 1, pp. 1–23, 2012.
[6] D. Merrill and A. Grimshaw, "High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing," Parallel Process. Lett., vol. 21, no. 2, pp. 245–272, 2011.
[7] A. Davidson, D. Tarjan, M. Garland, and J. D. Owens, "Efficient parallel merge sort for fixed and variable length keys," in Proc. Innov. Parallel Comput. (InPar), 2012, pp. 1–9.
[8] N. Tsuda, T. Satoh, and T. Kawada, "A pipeline sorting chip," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, vol. 30, 1987, pp. 270–271.
[9] B. Y. Kong, H. Yoo, and I.-C. Park, "Efficient sorting architecture for successive-cancellation-list decoding of polar codes," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 63, no. 7, pp. 673–677, Jul. 2016.
[10] G. Xiao, M. Martina, G. Masera, and G. Piccinini, "A parallel radix-sort-based VLSI architecture for finding the first W maximum/minimum values," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 11, pp. 890–894, Nov. 2014.
[11] R. Chen, S. Siriyal, and V. Prasanna, "Energy and memory efficient mapping of bitonic sorting on FPGA," in Proc. ACM/SIGDA Int. Symp. FPGAs, 2015, pp. 240–249.
[12] R. Chen and V. K. Prasanna, "Computer generation of high throughput and memory efficient sorting designs on FPGA," IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 11, pp. 3100–3113, Nov. 2017.
[13] A. Srivastava, R. Chen, V. K. Prasanna, and C. Chelmis, "A hybrid design for high performance large-scale sorting on FPGA," in Proc. IEEE Int. Conf. Reconfig. Comput. FPGAs, 2015, pp. 1–6.
[14] S. Mashimo, T. Van Chu, and K. Kise, "High-performance hardware merge sorter," in Proc. IEEE 25th Annu. Int. Symp. FCCM, 2017, pp. 1–8.
[15] A. Farmahini-Farahani, A. Gregerson, M. Schulte, and K. Compton, "Modular high-throughput and low-latency sorting units for FPGAs in the large hadron collider," in Proc. IEEE 9th SASP, 2011, pp. 38–45.
[16] N. Matsumoto, K. Nakano, and Y. Ito, "Optimal parallel hardware K-sorter and top K-sorter, with FPGA implementations," in Proc. IEEE 14th Int. Symp. Parallel Distrib. Comput., 2015, pp. 138–147.
[17] T. Chen, W. Li, F. Yu, and Q. Xing, "Modular serial pipelined sorting architecture for continuous variable-length sequences with a very simple control strategy," IEICE Trans. Fund. Elect., vol. 100, no. 4, pp. 1074–1078, 2017.
[18] S. Dong, X. Wang, and X. Wang, "A novel high-speed parallel scheme for data sorting algorithm based on FPGA," in Proc. IEEE 2nd Int. Congr. Image Signal Process., 2009, pp. 1–4.
[19] C.-S. Lin and B.-D. Liu, "Design of a pipelined and expandable sorting architecture with simple control scheme," in Proc. IEEE Int. Symp. Circuits Syst., vol. 4, 2002, pp. 217–220.