A Hybrid Pipelined Architecture for High Performance Top-K Sorting on FPGA
Abstract—We present a hybrid pipelined sorting architecture capable of finding and producing as its output the K largest elements from an input sequence. The architecture consists of a bitonic sorter and L cascaded sorting units. The sorting unit is designed to output P elements during every cycle with the aim of increasing the throughput and lowering the latency. The function of the bitonic sorter is to generate a segmented ordered sequence. The sorting unit processes this sequence to identify and output the P largest elements. Hence, the K = PL largest elements are obtained after the segmented ordered sequence proceeds through L cascaded sorting units. Variable-length and continuous sequences are supported by the proposed sorting architecture. The implementation results show that the sorting architecture achieves a throughput of 22.88 GB/s with P = 16 on a state-of-the-art Field Programmable Gate Array (FPGA).

Index Terms—Field programmable gate array (FPGA), sorting architecture, high throughput, low latency.

I. INTRODUCTION

Sorting is one of the most important computing tasks in many applications such as database operations [1], signal processing, and statistical methodology. The arrival of Big Data has led researchers to believe that sorting operations consume an excessive number of CPU cycles in software implementations [2]. The acceleration of sorting operations has therefore become increasingly urgent, and sorting has been implemented on hardware platforms such as Field Programmable Gate Arrays (FPGAs) [3]–[5], GPUs [6], [7], and ASICs [8]–[10]. Among these platforms, the FPGA is attracting considerable attention in the field of hardware acceleration owing to its high degree of parallelism, low power consumption, and reconfiguration capability.

Several sorting algorithms have been proposed to speed up sorting operations. Bitonic sort is a typical parallel algorithm that offers great performance; however, its area cost can be extremely high when the problem size becomes large. Chen et al. [11] proposed an energy- and memory-efficient mapping methodology for implementing the bitonic sorting network. Merge sort can sort a list of N elements using a merge tree of depth log N; however, at the root of the tree no parallelism can be exploited because of the serial nature of the algorithm [12]. To increase the throughput, Srivastava et al. [13] proposed a hybrid design based on merge sort in which the final few stages of the merge sort network are replaced with folded bitonic merge networks. Merge sort can also combine two sorted sequences into one sorted sequence. Mashimo et al. [14] optimized the merge logic, transforming the merge network into a pipelined architecture that uses fewer feedback registers.

The algorithms mentioned above all aim to obtain a fully sorted sequence. However, in many applications, only the K largest of the N input elements are of interest. Farmahini-Farahani et al. [15] proposed a modular technique based on the bitonic sorting algorithm to design units that return only the M largest values in ascending or descending order. However, it has the same disadvantage as bitonic sort, namely a huge area cost for large problem sizes. Apart from this, when the length of the input sequence changes, the sorting architecture must also be adjusted. Matsumoto et al. [16] presented a FIFO-based parallel merge sorter, which is resource efficient and achieves great latency performance. The disadvantage of this merge sorter is that it requires the input sequence to have a fixed length; moreover, the length must be a power of two. A modular serial pipelined sorting architecture able to accept continuous and variable-length sequences was presented in [17]. The simplicity of its sorting cell architecture and its short critical path delay enable a high-frequency implementation. However, the sorting cells accept only one data item per cycle, which limits the throughput.

The objective of this brief is to sort the K largest elements from an input sequence. We present a hybrid pipelined sorting architecture that not only retains all the advantages of the existing architecture [17] but also increases the throughput and lowers the latency. In general terms, the main contributions of this brief are as follows:
• Inspired by [14], a sorting unit with data parallelism, which can output P data items every cycle, is designed. Compared with the previous sorting cells [17], the relative throughput¹ increases P-fold and the latency in clock cycles decreases to (1/2 + 1/(2P)) of its former value.
• A hybrid pipelined sorting architecture composed of a bitonic sorter and L cascaded sorting units is constructed. Through the data parallelism P, it outputs the PL largest elements of an input sequence in ascending order.

Manuscript received July 25, 2019; accepted August 23, 2019. Date of publication September 4, 2019; date of current version August 4, 2020. This brief was recommended by Associate Editor W. N. N. Hung. (Corresponding author: Feng Yu.)
W. Chen and F. Yu are with the Department of Instrument Science and Technology, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]; [email protected]).
W. Li is with the Ningbo Institute of Technology, Zhejiang University, Ningbo 315100, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSII.2019.2938892

¹The relative throughput is defined as the number of elements sorted per cycle.
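Functionally (setting aside the pipelined hardware), the top-K task this brief targets can be stated as a short software reference model. The sketch below is ours, not part of the brief; the function name is chosen for illustration, and the bounded-heap behavior only loosely mirrors the K = PL registers of the architecture.

```python
import heapq

def top_k_reference(stream, k):
    """Software reference model of the top-K task: return the k
    largest elements of an arbitrary-length input sequence, in
    ascending order, as the proposed architecture does."""
    # heapq.nlargest tracks only k candidates at any time, loosely
    # mirroring the bounded register storage of the hardware.
    return sorted(heapq.nlargest(k, stream))

# top_k_reference([5, 1, 9, 3, 7, 2, 8], 3) -> [7, 8, 9]
```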
1549-7747 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
1450 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 67, NO. 8, AUGUST 2020
CHEN et al.: HYBRID PIPELINED ARCHITECTURE FOR HIGH PERFORMANCE TOP-K SORTING ON FPGA 1451
Fig. 3. Example of the behavior of the sorting unit with P = 4 when two sequences enter.
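Since Section II and Fig. 3 itself are not reproduced in this excerpt, the following is only a behavioral approximation, written by us, of what a sorting unit does: it assumes each unit retains the P largest elements seen so far (modeling the Rm feedback registers) and forwards the remaining elements downstream as ascending blocks. Function names and the block-wise merge are illustrative, not the cycle-accurate hardware with its CI control signal.

```python
def sorting_unit(blocks, p):
    """Behavioral sketch of one sorting unit: consume ascending
    blocks of p elements, retain the p largest seen so far, and
    forward the rest downstream as ascending blocks."""
    keep = []                                 # models the Rm registers
    passed = []                               # blocks sent to the next unit
    for block in blocks:
        merged = sorted(keep + list(block))   # stand-in for the merge logic
        keep = merged[-p:]                    # p largest stay in the unit
        if len(merged) > p:
            passed.append(merged[:-p])        # remainder moves on
    return keep, passed

def cascade(blocks, p, l):
    """Cascade l units; concatenating their retained registers yields
    the k = p*l largest elements in ascending order overall."""
    out = []
    for _ in range(l):
        keep, blocks = sorting_unit(blocks, p)
        out = keep + out                      # earlier units hold larger values
    return out
```

For example, with P = 4 and the three ascending blocks [1,1,3,4], [2,5,6,9], [3,5,5,8], the first unit retains [5,6,8,9] and a two-unit cascade returns the eight largest values in ascending order.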
Fig. 5. Total latency of various top-K sorters based on different problem sizes.
TABLE I
Comparison of Various Top-K Sorters With L Cascaded Sorting Units

III. ANALYSIS

A. Increasing the Relative Throughput

Clearly, the relative throughput of the sorting architecture increases P-fold compared with [17], as P elements can be input every cycle when processing a sequence. In fact, the ability to accept P elements per cycle is only one of the two contributors to our relative throughput increase. The other is that the control signal CI and the corresponding muxes, whose selection signals are CI, enable the sorting unit to output the results of the previous sequence and initialize the Rm registers simultaneously. Therefore, sequences are input into the sorting architecture seamlessly. Previously [14], by contrast, additional elements with a complete bit set had to be added to flush out the elements stored in the feedback registers (the Rm registers in this case). Hence, if CI and the corresponding muxes were absent, PL such elements would have to stream into the proposed sorting architecture to obtain the results before any other sequence could be processed, and the relative throughput would be computed as below (N is the length of the input sequence):

Relative throughput = PN/(N + PL) < P.

B. Decreasing the Latency in Clock Cycles

Table I compares various top-K sorters, among which [17] is the only one that supports continuous and variable-length sequences and additionally provides the highest relative throughput apart from our proposed sorter. Next, we compare our top-K sorter with that of Chen et al. [17] in terms of latency in clock cycles. We define the latency of a single sorting unit as the clock cycles required from the time at which the first valid element is input to the time the first valid element is output. The total latency is defined as the cycles required from the time at which the first valid element is input until the time the first result is output. If Tp is the latency of the sorting unit in this brief and T1 is the latency of the sorting cell in [17], then

Tp = P + 1, T1 = 2.

As our sorting unit can hold the P largest elements, it is equivalent to P cascaded sorting cells of [17], whose combined latency is PT1 = 2P; the latency of our sorting unit therefore decreases to Tp/(PT1) = (P + 1)/(2P) = (1/2 + 1/(2P)) of theirs. The total latencies of the top-K sorters proposed in this brief and by Chen et al. [17] are listed in Table I, and their difference is

ΔT = N − N/P − ((log2 P + 1) log2 P)/2, where K = PL.

Because N is the problem size, its value can be of the order of thousands, millions, or even billions. In contrast, the size of P must be limited to avoid an increase in resource consumption; in our implementation, the value of P is taken from the set {2, 4, 8, 16}. Hence, ΔT ≫ 0, which means the total latency of our top-K sorter is sharply reduced.

Assume that the data parallelism P increases to its upper limit N. In this case, the cascaded sorting units are discarded and the sorting architecture becomes a bitonic sorting network. Conversely, if P decreases to its lower limit, the bitonic sorter becomes unnecessary and the sorting architecture is almost the same as that in [17]; the only difference is the order in which the input data are registered and compared. Therefore, the sorting architecture presented in [17] is merely one of the cases proposed in this brief, namely that with data parallelism 1.

IV. EXPERIMENTAL RESULTS

We implemented our proposed sorting architecture on a Xilinx Virtex-7 FPGA XC7VX485T FFG1157-2 using Xilinx Vivado 2017.4. The data parallelism is set to P = 2, 4, 8, 16 and the input elements are in 32-bit format. The corresponding implementation results are listed in Table II. According to Table II, the throughput of the proposed sorting architecture increases almost P-fold compared with [17], [19]. The total latency (in nanoseconds) consists of both fixed and incremental components, and the latter decreases almost to 1/P compared with [17], [19]. As the performance of our sorting architecture improves significantly, the resource consumption increases. However, our design is more resource efficient, as each of our top-K sorters has a larger throughput-to-resource ratio, labeled ratio1 in Table II (the resource is the sum of the registers and LUTs; '-' indicates that the corresponding architecture cannot be implemented because of limited resources). Power consumption is another essential metric in hardware design.
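The expressions in Sections III-A and III-B can be checked numerically. The sketch below follows the document's formulas; the function names, the symbol ΔT for the latency difference, and the sample values of N are our choices for illustration.

```python
from math import log2

def relative_throughput(n, p, l):
    """Elements sorted per cycle with seamless sequence input:
    PN/(N + PL), which is strictly less than P."""
    return p * n / (n + p * l)

def unit_latency_ratio(p):
    """Proposed unit latency Tp = P + 1 cycles, versus 2P cycles for
    P cascaded cells of [17] with T1 = 2: ratio = 1/2 + 1/(2P)."""
    return (p + 1) / (2 * p)

def latency_difference(n, p):
    """Total-latency difference (in cycles) between the top-K sorter
    of [17] and the proposed one, for K = PL."""
    return n - n / p - (log2(p) + 1) * log2(p) / 2

# For P = 16: the unit latency ratio is 17/32 = 0.53125, and with
# N = 2**20 the difference is 983030.0 cycles, i.e., Delta_T >> 0.
```

For P = 16 and N = 2²⁰, the dominant term N − N/P already saves roughly 15/16 of the incremental latency, which is consistent with the claim that ΔT ≫ 0 for realistic N.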
TABLE II
Comparison of the Implementation Results of Various Top-K Sorters

Our top-K sorters are also more energy efficient in terms of the throughput-to-power ratio, labeled ratio2 in Table II.

Fig. 5 shows the total latency of several kinds of top-K sorters. As the results show, increasing the data parallelism sharply decreases the total latency of the proposed top-K sorters. For instance, the total latency of our top-128 sorter is reduced by 2.61 s and 2.64 s compared with [17] and [19], respectively, for P = 16 when N is set to 1G.

V. CONCLUSION

This brief presents the construction of a hybrid pipelined sorting architecture consisting of a bitonic sorter and L cascaded sorting units. It not only supports continuous and variable-length sequences but also provides high throughput and low latency. Our theoretical analysis indicates that the sorting architecture presented in [17] is merely one of the cases proposed in this brief, i.e., the case in which the data parallelism is set to 1. The implementation results show that the proposed sorting architecture is both resource and energy efficient in terms of the throughput-to-resource ratio and the throughput-to-power ratio.

REFERENCES

[1] J. Casper and K. Olukotun, "Hardware acceleration of database operations," in Proc. ACM/SIGDA FPGA, Monterey, CA, USA, 2014, pp. 151–160.
[2] J. Chhugani et al., "Efficient implementation of sorting on multi-core SIMD CPU architecture," Proc. VLDB Endow., vol. 1, no. 2, pp. 1313–1324, 2008.
[3] R. Marcelino, H. C. Neto, and J. M. P. Cardoso, "Unbalanced FIFO sorting for FPGA-based systems," in Proc. 16th IEEE ICECS, 2009, pp. 431–434.
[4] D. Koch and J. Torresen, "FPGAsort: A high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting," in Proc. 19th ACM/SIGDA Int. Symp. FPGAs, 2011, pp. 45–54.
[5] R. Mueller, J. Teubner, and G. Alonso, "Sorting networks on FPGAs," VLDB J., vol. 21, no. 1, pp. 1–23, 2012.
[6] D. Merrill and A. Grimshaw, "High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing," Parallel Process. Lett., vol. 21, no. 2, pp. 245–272, 2011.
[7] A. Davidson, D. Tarjan, M. Garland, and J. D. Owens, "Efficient parallel merge sort for fixed and variable length keys," in Proc. Innov. Parallel Comput. (InPar), 2012, pp. 1–9.
[8] N. Tsuda, T. Satoh, and T. Kawada, "A pipeline sorting chip," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, vol. 30, 1987, pp. 270–271.
[9] B. Y. Kong, H. Yoo, and I.-C. Park, "Efficient sorting architecture for successive-cancellation-list decoding of polar codes," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 63, no. 7, pp. 673–677, Jul. 2016.
[10] G. Xiao, M. Martina, G. Masera, and G. Piccinini, "A parallel radix-sort-based VLSI architecture for finding the first W maximum/minimum values," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 11, pp. 890–894, Nov. 2014.
[11] R. Chen, S. Siriyal, and V. Prasanna, "Energy and memory efficient mapping of bitonic sorting on FPGA," in Proc. ACM/SIGDA Int. Symp. FPGAs, 2015, pp. 240–249.
[12] R. Chen and V. K. Prasanna, "Computer generation of high throughput and memory efficient sorting designs on FPGA," IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 11, pp. 3100–3113, Nov. 2017.
[13] A. Srivastava, R. Chen, V. K. Prasanna, and C. Chelmis, "A hybrid design for high performance large-scale sorting on FPGA," in Proc. IEEE Int. Conf. Reconfig. Comput. FPGAs, 2015, pp. 1–6.
[14] S. Mashimo, T. Van Chu, and K. Kise, "High-performance hardware merge sorter," in Proc. IEEE 25th Annu. Int. Symp. FCCM, 2017, pp. 1–8.
[15] A. Farmahini-Farahani, A. Gregerson, M. Schulte, and K. Compton, "Modular high-throughput and low-latency sorting units for FPGAs in the large hadron collider," in Proc. IEEE 9th SASP, 2011, pp. 38–45.
[16] N. Matsumoto, K. Nakano, and Y. Ito, "Optimal parallel hardware K-sorter and top K-sorter, with FPGA implementations," in Proc. IEEE 14th Int. Symp. Parallel Distrib. Comput., 2015, pp. 138–147.
[17] T. Chen, W. Li, F. Yu, and Q. Xing, "Modular serial pipelined sorting architecture for continuous variable-length sequences with a very simple control strategy," IEICE Trans. Fund. Elect., vol. 100, no. 4, pp. 1074–1078, 2017.
[18] S. Dong, X. Wang, and X. Wang, "A novel high-speed parallel scheme for data sorting algorithm based on FPGA," in Proc. IEEE 2nd Int. Congr. Image Signal Process., 2009, pp. 1–4.
[19] C.-S. Lin and B.-D. Liu, "Design of a pipelined and expandable sorting architecture with simple control scheme," in Proc. IEEE Int. Symp. Circuits Syst., vol. 4, 2002, pp. 217–220.