
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 67, NO. 8, AUGUST 2020

A Hybrid Pipelined Architecture for High Performance Top-K Sorting on FPGA

Weijie Chen, Weijun Li, and Feng Yu, Member, IEEE

Abstract—We present a hybrid pipelined sorting architecture capable of finding and producing as its output the K largest elements from an input sequence. The architecture consists of a bitonic sorter and L cascaded sorting units. The sorting unit is designed to output P elements during every cycle with the aim of increasing the throughput and lowering the latency. The function of the bitonic sorter is to generate a segmented ordered sequence. The sorting unit processes this sequence to identify and output the P largest elements. Hence, the K = PL largest elements are obtained after the segmented ordered sequence proceeds through L cascaded sorting units. Variable-length and continuous sequences are supported by the proposed sorting architecture. The results of the implementation show that the sorting architecture can achieve a throughput of 22.88 GB/s with P = 16 on a state-of-the-art Field Programmable Gate Array (FPGA).

Index Terms—Field programmable gate array (FPGA), sorting architecture, high throughput, low latency.

I. INTRODUCTION

SORTING is one of the most important computing tasks in many applications such as database operations [1], signal processing, and statistical methodology. The arrival of Big Data has led researchers to believe that sorting operations consume an excessive number of CPU cycles with a software implementation [2]. The acceleration of sorting operations has therefore become increasingly urgent and has been implemented on hardware platforms such as Field Programmable Gate Arrays (FPGAs) [3]–[5], GPUs [6], [7], and ASICs [8]–[10]. Among these platforms, the FPGA is attracting considerable attention in the field of hardware acceleration owing to its high degree of parallelism, low power consumption, and reconfiguration capability.

Several sorting algorithms have been proposed to speed up sorting operations. Bitonic sort is a typical parallel algorithm that offers great performance. However, the area cost can be extremely high when the problem size becomes large. Chen et al. [11] proposed an energy and memory efficient mapping methodology for implementing the bitonic sorting network. Merge sort can sort a list of N elements using a merge tree of depth log N; however, at the root of the tree no parallelism can be exploited because of the serial nature of the algorithm [12]. To increase the throughput, Srivastava et al. [13] proposed a hybrid design based on merge sort where the final few stages in the merge sort network are replaced with folded bitonic merge networks. Merge sort can also combine two sorted sequences into a sorted sequence. Mashimo et al. [14] optimized the merge logic, which transforms the merge network into a pipelined architecture and uses fewer feedback registers.

The algorithms mentioned above all aim to obtain a fully sorted sequence. However, in many applications, only the K largest ones from the N input elements are of interest. Farmahini-Farahani et al. [15] proposed a modular technique based on the bitonic sorting algorithm to design units that return only the M largest values in ascending or descending order. However, it has the same disadvantage as bitonic sort, namely that the area cost is huge in the case of a large-sized problem. Apart from this, when the length of the input sequence changes, it is also necessary to adjust the sorting architecture. Matsumoto et al. [16] presented a FIFO-based parallel merge sorter, which is resource efficient with great performance in terms of latency. The disadvantage of this merge sorter is that it requires the input sequence to have a fixed length; moreover, the length should be a power of two. A modular serial pipelined sorting architecture, which is able to accept continuous and variable-length sequences, was presented in [17]. The simplicity of the sorting cell architecture and the short critical path delay enabled a high-frequency implementation to be achieved. However, the sorting cells only accept one data item in every cycle, which limits the throughput.

The objective of this brief is to sort the K largest elements from an input sequence. We present a hybrid pipelined sorting architecture which not only has all the advantages of the existing architecture [17] but also increases the throughput and lowers the latency. In general terms, the main contributions of this brief are as follows:
• Inspired by [14], a sorting unit with data parallelism P, which can output P data every cycle, is designed. Compared with previous sorting cells [17], the relative throughput¹ increases to P times and the latency in clock cycles decreases to (1/2 + 1/2P).
• A hybrid pipelined sorting architecture, which is composed of a bitonic sorter and L cascaded sorting units, is constructed. The use of the data parallelism P makes it possible to output the PL largest elements from an input sequence in ascending order.

Manuscript received July 25, 2019; accepted August 23, 2019. Date of publication September 4, 2019; date of current version August 4, 2020. This brief was recommended by Associate Editor W. N. N. Hung. (Corresponding author: Feng Yu.)
W. Chen and F. Yu are with the Department of Instrument Science and Technology, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]; [email protected]).
W. Li is with the Ningbo Institute of Technology, Zhejiang University, Ningbo 315100, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSII.2019.2938892

¹The relative throughput is defined as the number of elements sorted per cycle.

1549-7747 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Indian Institute of Technology Palakkad. Downloaded on February 24,2025 at 11:32:07 UTC from IEEE Xplore. Restrictions apply.

• Theoretical analysis indicates that the architecture presented in [17] is simply a specific case of the proposed sorting architecture, which has a generic form.

II. ARCHITECTURE

A. Overview of the Sorting Unit

To increase the throughput and reduce the latency, a sorting unit with data parallelism P, which is a significant component of our sorting architecture, is designed. The P largest elements of a segmented ordered sequence (Sp) can be found when the entire sequence proceeds through the sorting unit.

Sp = {a1, a2, . . . , ap; b1, b2, . . . , bp; · · · }
a1 ≤ a2 ≤ · · · ≤ ap; b1 ≤ b2 ≤ · · · ≤ bp; · · ·

Fig. 1. Sorting unit with data parallelism P.

Fig. 1 provides an overview of the sorting unit, which is composed of P "comparing and reordering" parts (stages 1–p) and one output part. The control signal CI stays at high level when the first P elements of a sequence (Sp) appear on DI1–DIp. For each comparing and reordering part, a set of registers Rt exists. The purpose of these registers is to transfer the data stream (DI1–DIp) and the control signal. The sort logic module, of which the structure is shown in Fig. 2, is responsible for reordering the input data stream {DI1–DIp, Rmo} (Rmo is the output of the Rm register) and outputting the sorted result in ascending order {DO1–DOp, Rmi} (Rmi is the input of the Rm register). Under the action of the sort logic module, the largest value among {DI1–DIp, Rmo} updates the content of Rm and the remaining data stream is output to the next stage in every cycle. After the entire sequence goes through the sorting unit, the Rm registers hold the P largest elements and they satisfy:

Rm1 ≥ Rm2 ≥ · · · ≥ Rm(p−1) ≥ Rmp

Fig. 2. Sort logic module in Fig. 1.

B. Working Principle of the Sort Logic Module

We suppose the input elements are arranged in ascending order (DI1 ≤ DI2 ≤ · · · ≤ DIp−1 ≤ DIp). In this case, the output elements (DO1–DOp) are still in ascending order under the action of the sort logic module. P comparators are employed, each of which is used to compare the input element DIi (1 ≤ i ≤ p) with Rmo. When CI is low, the sort logic module operates in "comparing mode" and the results of the comparators are used to determine the output order of each input element. Therefore, the values of the selection signals (SEL0–SELp) depend on the results of the comparators. If C1–Cp represent the outputs of the comparators (Ci = DIi > Rmo, 1 ≤ i ≤ p), then

SEL0[0:0] = {Cp}
SEL1[0:0] = {C1}
SELi[1:0] = {Ci, Ci−1},  2 ≤ i ≤ p

Logically, Rmi should originate from DIp or Rmo, such that Cp is used to select the output of mux m0 from the two inputs. Likewise, the smallest data value should come from DI1 or Rmo, such that DO1 is output from mux m1 based on the selection signal SEL1, which has the value of C1. Each of the other outputs DOi should come from DIi, DIi−1, or Rmo (2 ≤ i ≤ p), such that the results of the two corresponding comparators are referred to when selecting the output from the three inputs.

When CI is high, the sort logic module operates in "initialization mode" for the arrival of a new sequence (Sp). In this case, the results of the comparators are ignored, as Rm either contains nothing or holds one of the elements of another sequence. Rmo should be considered the smallest among the input elements to ensure the correctness of the sorting results. In this way, the initialization mode can be regarded as a particular case of the comparing mode in which the outputs of all the comparators are high. Therefore, the results of the comparators are bypassed and the values of the selection signals are replaced with logic "1". Then

SEL0[0:0] = 1'b1
SEL1[0:0] = 1'b1
SELi[1:0] = 2'b11,  2 ≤ i ≤ p

DIp is selected to initialize the Rm register and Rmo is output from DO1. The other input elements DIi are output from DOi+1 (1 ≤ i ≤ p − 1).

C. Obtain and Output the P Largest Elements

When the first P elements of a sequence (Sp) enter the sorting unit, the sort logic module in stage 1 is in initialization mode. Then ap initializes Rm1 and the value stored in Rm1 is output from DO1. If the P largest elements of another sequence have been stored in {Rm1–Rmp}, which are {β1, β2, . . . , βp}, the output (DO1–DOp) of the sort logic module in stage 1 is {β1, a1, a2, . . . , ap−1}. In the next cycle, the output of the sort logic module in stage 2 is {β2, β1, a1, a2, . . . , ap−2}, as ap−1 initializes Rm2 and β2 is output from DO1 similarly. (P−1) cycles later, a1 initializes Rmp and the output of the sort logic


module in stage p is {βp, . . . , β2, β1}. It takes P cycles to initialize the Rm registers of the sorting unit, during which the P largest elements of another sequence that have been stored in {Rm1–Rmp} can also be obtained in the intended order. After the initialization of each Rm, it is compared with the subsequent elements of Sp. The largest element from {DI1–DIp, Rmo} updates the content of Rm and the remaining elements are output to the next stage in ascending order in every cycle under the action of the sort logic module. Therefore, the P largest elements can be obtained and stored in {Rm1–Rmp} in descending order when the entire sequence Sp is processed by the sorting unit. Besides, the output of the sorting unit is still a segmented ordered sequence, such that multiple sorting units can be cascaded to sort more elements, as described in Section II-E.

D. Example of the Working Process of the Sorting Unit

Illustrated with P = 4, we present a detailed discussion of the way in which the sorting unit identifies and outputs the P largest elements from a segmented ordered sequence. Assume there are two such sequences A and B, along with their control signals CIA and CIB:

A = {1, 3, 4, 6 | 2, 5, 7, 8},  CIA = {1, 0}
B = {10, 20, 21, 25 | 12, 15, 17, 18},  CIB = {1, 0}

Fig. 3. Example of the behavior of the sorting unit with P = 4 when two sequences enter.

Fig. 3 shows the behavior of the sorting unit when the two sequences enter. In the original state, the Rm registers are not initialized and have a value of zero. In cycle 1, the first four elements {1, 3, 4, 6} arrive at the sorting unit. With CI being high, {Rm1–Rm4} are initialized with {6, 4, 3, 1}, respectively, from cycle 2 to cycle 5. In cycle 2, the last four elements of sequence A, {2, 5, 7, 8}, enter the sorting unit. With CI being low, they are compared with the values of the Rm registers and the larger ones, which are {8, 7, 6, 5}, update the contents of {Rm1–Rm4}. In cycle 3, the first four elements of sequence B enter. Likewise, these elements reinitialize the Rm registers, as the previous largest values have been stored. At the same time, with CI being high, the previous largest values (marked in grey) are output from DO1 in each sort logic module from cycle 3 to cycle 6. Valid data begin to be output in cycle 5, and the four largest elements of sequence A, {5, 6, 7, 8}, are obtained in cycle 6. As the example shows, the sequences can be seamless, and the high level of CI not only enables the initialization of the Rm registers but also ensures that the largest values of the previous sequence are output at the same time. If an interval exists between two sequences, it is necessary to add an additional pulse of CI to obtain the results.

E. Output the K Largest Elements in Ascending Order

When the sequence Sp proceeds through the L cascaded sorting units (SU1–SUL), SU1 holds the P largest elements of Sp, SU2 holds the P largest elements of the sequence output from SU1, which is also a segmented ordered sequence, and so on. Therefore, the K = PL largest elements can be obtained after the sequence Sp is processed by the L cascaded sorting units, and the PL largest elements stored in the Rm registers meet the following relationship (j is the label of each sorting unit):

Rm1j ≥ Rm2j ≥ · · · ≥ Rmpj
Rm1(j+1) ≥ Rm2(j+1) ≥ · · · ≥ Rmp(j+1)
Rmpj ≥ Rm1(j+1),  1 ≤ j ≤ L − 1

When a logic high is applied to the CI port of SU1, the P elements stored in SU1 are output in ascending order as described in Section II-C, and they update the Rm registers of SU2. Then, the P elements stored in SU2 are output as the larger values arrive. They subsequently reside in the Rm registers of SU3, and the remainder of the procedure can be deduced in the same way. Finally, the P elements stored in SUL are output in ascending order. As the CI signal is output from Rc in SU1 and arrives at the CI port of SU2, the P largest elements stored in SU2, which are originally from SU1, are output again. After the CI signal has proceeded through the L cascaded sorting units, all the PL elements are output from SUL and they are arranged in ascending order.

Furthermore, to find and output the K = PL largest elements from a normal (unordered) sequence, we employ a bitonic sorter with an input size P, which can output P sorted elements per cycle. The proposed sorting architecture, which comprises a bitonic sorter and L cascaded sorting units, is shown in Fig. 4.

Fig. 4. Proposed hybrid pipelined sorting architecture comprising a bitonic sorter (BN-P) and L cascaded sorting units (SU-P).
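The behavior described in Sections II-C through II-E can be summarized in a functional software model. This is our sketch, deliberately not cycle-accurate: each sorting unit retains the P largest elements it has seen (the Rm registers, kept here in ascending rather than descending order for convenience) and passes everything else downstream as another segmented ordered sequence, so a cascade of L units accumulates the K = PL largest.

```python
def sorting_unit(blocks, p):
    """Functional model of one sorting unit (SU-P). `blocks` is a
    segmented ordered sequence: a list of blocks of p ascending
    elements. Returns (the p largest elements seen, the remaining
    elements as another segmented ordered sequence)."""
    rm = []      # contents of Rm1..Rmp (ascending here; descending in hardware)
    passed = []  # stream handed to the next cascaded unit
    for block in blocks:
        merged = sorted(rm + list(block))
        rm = merged[-p:]                  # the p largest stay in the unit
        if len(merged) > p:
            passed.append(merged[:-p])    # the rest flow downstream, still ascending
    return rm, passed

def top_k(blocks, p, l):
    """Cascade of l sorting units (Section II-E): unit j ends up holding
    the j-th group of p largest elements, K = p*l in total."""
    held = []
    stream = blocks
    for _ in range(l):
        rm, stream = sorting_unit(stream, p)
        held.extend(rm)
    return sorted(held)

# Sequence A from the Section II-D example (P = 4):
print(sorting_unit([[1, 3, 4, 6], [2, 5, 7, 8]], 4)[0])   # → [5, 6, 7, 8]
```

The model reproduces the result of the Fig. 3 walk-through: the unit retains {5, 6, 7, 8} from sequence A. What it abstracts away is exactly what the hardware contributes: the P-cycle initialization overlap and the seamless hand-over of sequences controlled by CI.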

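Section II-E only requires that the bitonic front-end (BN-P) turn each group of P raw inputs into an ascending block; the brief does not detail BN-P itself. For reference, here is the textbook bitonic network in software form, our sketch of what such a front-end computes; a hardware version would pipeline its compare-exchange stages so as to sustain P elements per cycle.

```python
def bitonic_sort(values):
    """Sort a block of n = len(values) elements, n a power of two,
    using the data-independent compare-exchange pattern of a
    bitonic sorting network."""
    a = list(values)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "network width must be a power of two"
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:           # compare-exchange distance within this stage
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    up = (i & k) == 0      # sort direction of this sub-sequence
                    if (up and a[i] > a[partner]) or \
                       (not up and a[i] < a[partner]):
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))   # → [1, 2, 3, 4, 5, 6, 7, 8]
```

Because the comparison pattern is fixed and independent of the data, each inner loop level maps naturally to one register stage of a hardware pipeline.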

Fig. 5. Total latency of various top-K sorters based on different problem sizes.

TABLE I
COMPARISON OF VARIOUS TOP-K SORTERS WITH L CASCADED SORTING UNITS

III. ANALYSIS

A. Increasing the Relative Throughput

Clearly, the relative throughput of the sorting architecture increases P times compared with [17], as P elements can be input every cycle when processing a sequence. Actually, the ability to accept P elements per cycle is one of the two contributors to our relative throughput increase. The other is that the control signal CI and the corresponding muxes, whose selection signals are CI, enable the sorting unit to output the results of the previous sequence and initialize the Rm registers simultaneously. Therefore, sequences are input into the sorting architecture seamlessly. In previous work [14], by contrast, additional elements with a complete bit set had to be added to obtain the elements stored in the feedback registers (the Rm registers in this case). Hence, PL such elements would be required to stream into the proposed sorting architecture to obtain the results before any other sequence could be processed if CI and the corresponding muxes were absent. In that case, the relative throughput would be computed as below (N is the length of the input sequence):

Relative throughput = PN/(N + PL) < P.

B. Decreasing the Latency in Clock Cycles

Table I compares various top-K sorters, among which [17] is the only one that supports continuous and variable-length sequences and additionally provides the highest relative throughput apart from our proposed sorter. Next, we compare our top-K sorter with that of Chen et al. [17] in terms of latency in clock cycles. We define the latency of a single sorting unit as the clock cycles required from the time at which the first valid element is input to the time the first valid element is output. The total latency is defined as the cycles required from the time at which the first valid element is input until the time the first result is output. If Tp is the latency of the sorting unit in this brief and T1 is the latency of the sorting cell in [17], then

Tp = P + 1
T1 = 2

As our sorting unit can hold the P largest elements, which is equivalent to P cascaded sorting cells of [17], the latency of our sorting unit decreases to (1/2 + 1/2P). The total latencies of the top-K sorters proposed in this brief and in that of Chen et al. [17] are listed in Table I, and their difference is

Δ = N − N/P − ((log2 P + 1) log2 P)/2,  (K = PL)

Because N is the problem size, its value can be of the order of thousands, millions, or even billions. Contrary to this, it is necessary to limit the size of P to avoid an increase in resource consumption. In our implementation, the value of P is in the set {2, 4, 8, 16}. Hence, Δ ≫ 0, which means the total latency of our top-K sorter is sharply reduced.

Assume that the data parallelism P increases to its upper limit N. In this case, the cascaded sorting units are discarded and the sorting architecture becomes a bitonic sorting network. Conversely, if P decreases to its lower limit, the bitonic sorter would be unnecessary and the sorting architecture would be almost the same as that in [17]. The only difference is the order in which the input data are registered and compared. Therefore, the sorting architecture presented in [17] is merely one of the cases proposed in this brief, the one in which the data parallelism is 1.

IV. EXPERIMENTAL RESULTS

We implemented our proposed sorting architecture on a Xilinx Virtex-7 FPGA XC7VX485T FFG1157-2 using Xilinx Vivado 2017.4. The data parallelism is set to P = 2, 4, 8, 16 and the input elements are in 32-bit format. The corresponding implementation results are listed in Table II. According to Table II, the throughput of the proposed sorting architecture increases to almost P times that of [17], [19]. The total latency (in nanoseconds) consists of both fixed and incremental components, and the latter decreases to almost 1/P of that of [17], [19]. As the performance of our sorting architecture is improved significantly, the resource consumption increases. However, our design is more resource efficient, as each of our top-K sorters has a larger throughput-to-resource ratio, which is labeled as ratio1 in Table II (the resource is the sum of the registers and LUTs; '-' indicates that the corresponding architecture cannot be implemented because of limited resources). Power consumption is another essential metric in hardware design. Our top-K sorters are also more energy efficient in


terms of the throughput-to-power ratio, which is labeled as ratio2.

TABLE II
COMPARISON OF THE IMPLEMENTATION RESULTS OF VARIOUS TOP-K SORTERS

Fig. 5 shows the total latency of several kinds of top-K sorters. As the results show, increasing the data parallelism has the effect of decreasing the total latency of the proposed top-K sorters sharply. For instance, the total latency of our top-128 sorter was reduced by 2.61 s and 2.64 s, respectively, compared with [17], [19] for P = 16 when N is set to 1G.

V. CONCLUSION

This brief presents the construction of a hybrid pipelined sorting architecture, which consists of a bitonic sorter and L cascaded sorting units. It not only supports continuous and variable-length sequences but also provides high throughput and low latency. Our theoretical analysis indicated that the sorting architecture presented in [17] is merely one of the cases proposed in this brief, i.e., it corresponds to the case when the data parallelism is set to 1. The results of the implementation showed that the proposed sorting architecture is both resource and energy efficient in terms of the throughput-to-resource ratio and the throughput-to-power ratio.

REFERENCES

[1] J. Casper and K. Olukotun, "Hardware acceleration of database operations," in Proc. ACM/SIGDA FPGA, Monterey, CA, USA, 2014, pp. 151–160.
[2] J. Chhugani et al., "Efficient implementation of sorting on multi-core SIMD CPU architecture," Proc. VLDB Endow., vol. 1, no. 2, pp. 1313–1324, 2008.
[3] R. Marcelino, H. C. Neto, and J. M. P. Cardoso, "Unbalanced FIFO sorting for FPGA-based systems," in Proc. 16th IEEE ICECS, 2009, pp. 431–434.
[4] D. Koch and J. Torresen, "FPGAsort: A high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting," in Proc. 19th ACM/SIGDA Int. Symp. FPGAs, 2011, pp. 45–54.
[5] R. Mueller, J. Teubner, and G. Alonso, "Sorting networks on FPGAs," VLDB J. Int. Very Large Data Bases, vol. 21, no. 1, pp. 1–23, 2012.
[6] D. Merrill and A. Grimshaw, "High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing," Parallel Process. Lett., vol. 21, no. 2, pp. 245–272, 2011.
[7] A. Davidson, D. Tarjan, M. Garland, and J. D. Owens, "Efficient parallel merge sort for fixed and variable length keys," in Proc. Innov. Parallel Comput. (InPar), 2012, pp. 1–9.
[8] N. Tsuda, T. Satoh, and T. Kawada, "A pipeline sorting chip," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, vol. 30, 1987, pp. 270–271.
[9] B. Y. Kong, H. Yoo, and I.-C. Park, "Efficient sorting architecture for successive-cancellation-list decoding of polar codes," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 63, no. 7, pp. 673–677, Jul. 2016.
[10] G. Xiao, M. Martina, G. Masera, and G. Piccinini, "A parallel radix-sort-based VLSI architecture for finding the first W maximum/minimum values," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 11, pp. 890–894, Nov. 2014.
[11] R. Chen, S. Siriyal, and V. Prasanna, "Energy and memory efficient mapping of bitonic sorting on FPGA," in Proc. ACM/SIGDA Int. Symp. FPGAs, 2015, pp. 240–249.
[12] R. Chen and V. K. Prasanna, "Computer generation of high throughput and memory efficient sorting designs on FPGA," IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 11, pp. 3100–3113, Nov. 2017.
[13] A. Srivastava, R. Chen, V. K. Prasanna, and C. Chelmis, "A hybrid design for high performance large-scale sorting on FPGA," in Proc. IEEE Int. Conf. Reconfig. Comput. FPGAs, 2015, pp. 1–6.
[14] S. Mashimo, T. Van Chu, and K. Kise, "High-performance hardware merge sorter," in Proc. IEEE 25th Annu. Int. Symp. FCCM, 2017, pp. 1–8.
[15] A. Farmahini-Farahani, A. Gregerson, M. Schulte, and K. Compton, "Modular high-throughput and low-latency sorting units for FPGAs in the large hadron collider," in Proc. IEEE 9th SASP, 2011, pp. 38–45.
[16] N. Matsumoto, K. Nakano, and Y. Ito, "Optimal parallel hardware K-sorter and top K-sorter, with FPGA implementations," in Proc. IEEE 14th Int. Symp. Parallel Distrib. Comput., 2015, pp. 138–147.
[17] T. Chen, W. Li, F. Yu, and Q. Xing, "Modular serial pipelined sorting architecture for continuous variable-length sequences with a very simple control strategy," IEICE Trans. Fund. Elect., vol. 100, no. 4, pp. 1074–1078, 2017.
[18] S. Dong, X. Wang, and X. Wang, "A novel high-speed parallel scheme for data sorting algorithm based on FPGA," in Proc. IEEE 2nd Int. Congr. Image Signal Process., 2009, pp. 1–4.
[19] C.-S. Lin and B.-D. Liu, "Design of a pipelined and expandable sorting architecture with simple control scheme," in Proc. IEEE Int. Symp. Circuits Syst., vol. 4, 2002, pp. 217–220.
