
EAI Endorsed Transactions on Context-aware Systems and Applications
Research Article

Automatic FPGA-based Hardware Accelerator Design: A Case Study with Image Processing Applications
Cuong Pham-Quoc1,2*

1 Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet, District 10, Ho Chi Minh City, Vietnam
2 Vietnam National University – Ho Chi Minh City, Thu Duc District, Ho Chi Minh City, Vietnam

Abstract

We present a case study of automatic FPGA-based hardware accelerator design using our proposed framework in the image processing domain. With the framework, the resulting systems are optimized in both performance and energy consumption. Moreover, using the framework, designers can implement FPGA platforms without manually describing any hardware cores or the interconnect. The systems offer accelerated execution compared to traditional general-purpose processors and manually designed accelerator systems. We use two applications in the image processing domain as experiments to report our work: Canny edge detection and a jpeg converter. The experiments are conducted on both embedded and high-performance computing platforms. Results show that we achieve overall speed-ups of up to 3.15× and 2.87× when compared to baseline systems on embedded and high-performance platforms, respectively. Our systems consume up to 66.5% less energy than other FPGA-based systems.

Keywords: FPGA-based design framework, Hardware accelerator, Image processing

Received on 10 April 2020, accepted on 10 May 2020, published on 12 May 2020

Copyright © 2020 Cuong Pham-Quoc licensed to EAI. This is an open access article distributed under the terms of the Creative
Commons Attribution licence (http://creativecommons.org/licenses/by/3.0/), which permits unlimited use, distribution and
reproduction in any medium so long as the original work is properly cited.

doi: 10.4108/eai.12-5-2020.164497

* Corresponding author: [email protected]


1. Introduction

In recent years, FPGAs (field-programmable gate arrays) have been considered a promising approach to overcome many obstacles in SoC design, such as time-to-market, power consumption, and flexibility. However, FPGA designs still suffer from higher NRE (non-recurring engineering) costs than general-purpose processors, and they require knowledge of both hardware and software. These issues usually prevent the use of FPGAs in real application domains such as image processing, voice recognition, artificial intelligence, or machine learning. Although a number of toolchains, such as high-level synthesis tools, have been proposed to persuade designers to exploit the advantages of FPGAs, designers still need hardware knowledge and skills.

Our previous work presented in [1] proposed an automatic framework for designing and implementing FPGA-based hardware accelerator systems with an optimized hybrid interconnect to reduce data communication overhead. With the framework, designers can build FPGA accelerator systems without much hardware knowledge and skill. The proposed framework already solves a number of issues from which other similar toolchains suffer. For example, they can be used for only a dedicated application domain [2][3][4][5][6], or for different applications but without an optimized interconnect [7][8]. One of the most important contributions of our framework is to optimize the interconnect of the hardware cores, because data communication is one of the two main sources of overhead in multicore systems [9]. Each interconnection type, such as crossbar, bus, or network-on-chip, offers different advantages while suffering various disadvantages [10]. Therefore, a hybrid interconnect consisting of multiple interconnect architectures is an appropriate approach for further improving the performance of an FPGA-based hardware accelerator.

EAI Endorsed Transactions on Context-aware Systems and Applications | 11 2019 - 05 2020 | Volume 7 | Issue 20 | e5

In this paper, we report a case study for the image processing domain. We use the design framework to implement two different applications, the Canny edge detection and the jpeg image converter. We discuss the design steps and conduct experiments with the systems on both embedded and high-performance computing platforms. We analyze the experimental results of both systems in terms of execution time and energy consumption. We then compare our systems with general-purpose processors and traditional FPGA-based accelerators.

The main contributions of this paper are summarized as follows.

• We first briefly introduce a case study of the image processing domain with two applications when using our proposed framework;
• We analyze and compare results of the systems designed by the framework and other systems.

The rest of the paper is organized as follows. Section 2 briefly presents the framework designed and reported in our previous work. Section 3 discusses the steps in developing two applications in the image processing domain with our framework. We present our experimental results with two computing platforms in Section 4. Finally, the paper's contributions are concluded in Section 5.

2. The automatic design framework

In our previous work [1], we proposed an automatic framework for designing an FPGA-based hardware accelerator with a hybrid interconnect. The framework allows designers to develop a hardware accelerator system for a particular application without much knowledge of, or working effort at, the hardware level. Figure 1 depicts the design flow of the proposed framework. Although the design flow includes five automatic processing steps, designers are also able to intervene to manually make further improvements.

In this framework, a target application (developed in a high-level programming language) is profiled in Step 1 to collect the execution time of the functions in the application. This profiling step also creates a communication graph to represent the data communication of functions inside the application. Based on this data communication graph, a hybrid interconnect for the hardware kernels, as well as between the hardware device and the host processor, can be defined appropriately.

The application is then partitioned into hardware and software parts in Step 2. Computationally intensive functions are candidates for acceleration with hardware kernels in the FPGA. Moreover, based on the data communication graph, non-intensive functions may also be implemented in the FPGA to reduce off-chip data communication.

As stated above, the data communication graph is used to define the most suitable hybrid interconnect for the hardware kernels in Step 3. The hybrid interconnect may comprise a bus, crossbar, network-on-chip, or shared buffer, depending on the communication patterns of the kernels. The main purpose of this step is to reduce data communication overhead while keeping the hardware resources used for the interconnect minimized. That, in turn, saves energy, since the fewer resources are used, the more energy is saved.

The selected functions are then synthesized by high-level synthesis (HLS) toolchains in Step 4 (in this case, we use Vivado HLS from Xilinx) to create hardware kernels described in hardware description languages (in our work, we prefer Verilog-HDL). With the support of HLS tools, functions written in high-level programming languages can be compiled to Verilog-HDL automatically. This solves one of the most difficult issues that designers usually face.

Finally, in Step 5, the entire system is developed, synthesized, and mapped to hardware platforms by tools provided by the FPGA manufacturers (in our case, we use Xilinx Vivado, since Xilinx platforms are targeted). Based on the resource availability of the target platforms, computationally intensive kernels can be replicated to further improve overall performance.

As summarized above, designers do not need to interact with the framework much, because all steps can run automatically. However, in case designers want to involve themselves to modify some kernels or the interconnect, they are still able to do so.

Figure 1. The design flow of the framework proposed in [1]

3. Case study

In this section, we introduce the hardware accelerator systems developed for the two applications, Canny edge detection [11] and the jpeg image converter [12], using the framework.

3.1. The Canny edge detection

The main purpose of this application is to detect edges in images by applying a number of operators: a Gaussian

filter, gradient calculation, non-maximum suppression, and finally hysteresis thresholding. The application was first introduced in 1986 and coded in ANSI C. The profiling results indicate that three operators, the Gaussian filter, gradient calculation, and non-maximum suppression, are the most computationally intensive. The framework also generates the data communication graph shown in Figure 2.

Figure 2. Data communication graph of the Canny edge detection application (edges among input_functions, gaussian_smooth, derivative_x_y, magnitude_x_y, non_max_supp, and output_functions, annotated with the Bytes, UnMAs, and UnDVs transferred)

According to the profiling results, the three aforementioned operators (implemented as the functions gaussian_smooth, magnitude_x_y, and non_max_supp) are good candidates to be accelerated by hardware kernels. However, as illustrated in the data communication graph, the derivative_x_y function should also be accelerated by a hardware kernel, because a huge amount of data is transferred between this function and the other candidates. Therefore, all four functions are implemented as accelerators. The other procedures of the application are kept on the general-purpose processor.

During the hybrid interconnect generation step, the most suitable interconnect is designed for data communication among the four hardware kernels above. Note that the communication infrastructure for transferring data between the processor and the kernels is already defined by the target platform used for building the system. In this application, a shared buffer is used for the interconnect between the gaussian_smooth and derivative_x_y accelerators, while a network-on-chip (NoC) is used for transferring data among derivative_x_y and the remaining two kernels.

As discussed above, in this work we target Xilinx platforms for implementing our systems; we therefore use Xilinx Vivado HLS to generate the Verilog-HDL description modules for our hardware kernels. Figure 3 shows a part of the Verilog-HDL description generated by Xilinx Vivado HLS for the gaussian_smooth function.

Figure 3. Part of the gaussian kernel generated by Vivado HLS

In this work, we implement the applications on both an embedded system with the Xilinx ML510 board [13] and a high-performance computing system with the Micron HC-2ex [14]. Therefore, the final step needs to be performed for two different target platforms. For the embedded platform, because there exists only one FPGA device, we are able to build the system with 5 kernels; hence, only the gaussian_smooth kernel is duplicated. Meanwhile, the high-performance computing platform consists of 4 modern FPGA devices that can host more kernels. Consequently, we replicate the kernels up to 64 accelerators in total.

The system is finally synthesized and mapped to the FPGA devices by the Xilinx Vivado toolchain. This final step is technology dependent. Resources usage and other parameters are reported in detail in Section 4.

3.2. The jpeg image converter

The jpeg image converter is the second application used in our case study. The main purpose of this application is to encode bitmap images to the jpeg format. The application was implemented in ANSI C and reported in the benchmark in [12]. The main part of the application includes four computationally intensive functions: huff_dc_dec, huff_ac_dec, dquantz_lum, and j_rev_dct.

Figure 4. Data communication graph of the jpeg image converter application


Figure 4 illustrates the data communication graph, mainly focusing on the four mentioned functions. All those functions are accelerated by hardware kernels. Similar to the Canny edge detection application, a NoC should be used for data communication among the first three kernels, while a shared local buffer is involved in transferring data between the dquantz_lum and j_rev_dct accelerators. Similar to the systems for the Canny edge detection application, Vivado HLS is also used for generating the Verilog-HDL descriptions for these functions from the C code.

This application is also implemented on both of the above embedded and high-performance computing systems. With the embedded system, only the huff_ac_dec kernel is duplicated, while there are 64 kernels built in the high-performance computing system. Figure 5 illustrates the architecture of the jpeg application when implemented on the embedded platform.

Figure 5. The architecture for the jpeg application on the embedded platform

4. Experiments

In this section, we present our experiments to verify the FPGA-based accelerator systems reported above. Kernel performance as well as overall system performance are presented. Energy consumption compared to the baseline systems is also shown.

4.1 Experimental setup

As presented above, the embedded and high-performance computing systems used as our target platforms are the ML510 and the Micron HC-2ex, respectively. The ML510 board consists of only one Xilinx xc5vfx130t FPGA device. Inside the FPGA device, there exist two embedded hardwired PowerPC processors that are used as host processors to process the software part of the applications. Hardware accelerators are mapped into the reconfigurable area of the device. In this paper, we configure the processors to function at 400 MHz, while the hardware kernels can work at 100 MHz only, due to the huge amount of reconfigurable logic resources used. To compare performance, the applications are first executed on the host processor. They are then processed by the entire accelerator systems, i.e., hardware kernels process the computationally intensive functions while the software part is still executed on the host.

For the high-performance computing system, the Micron HC-2ex (formerly Convey) is used as our experiment platform. The system includes four Virtex-6 xc6vlx760 FPGA devices and one Intel Xeon X5670 processor. While the host processor can function at 2.93 GHz, the accelerators are set to work at 200 MHz. Similar to the embedded system, we first execute the applications on the host processor with full parallelization, i.e., all 12 cores of the processor are used to process the applications. They are then processed by the entire accelerator systems to compare execution time.

4.2 Experimental results

In this section, we discuss our experimental results with the two aforementioned systems when processing the two applications. Performance of the kernels and the entire systems, as well as energy consumption and resources usage, are analyzed.

Performance analysis

Table 1 depicts the speed-ups of the kernels and the overall accelerator systems when compared to their host processors (PowerPC for the embedded system and Intel Xeon for the high-performance computing system) in the third and fourth columns, respectively. The table also presents speed-ups compared to the baseline systems (the baseline systems are traditional accelerator systems without the help of our framework for optimizing data communication, but including replicated kernels for a fair comparison) in the fifth and sixth columns, respectively.

Table 1. Speed-ups comparison between our systems and others

                    w.r.t. host processors     w.r.t. baseline systems
Platform   App.     kernels     overall        kernels     overall
EMB(1)     Canny    3.88×       3.15×          2.12×       1.83×
           jpeg     2.55×       2.33×          3.08×       2.87×
HPC(2)     Canny    2.62×       2.61×          1.55×       1.54×
           jpeg     1.96×       1.45×          1.93×       1.42×
(1) EMB: Embedded system; (2) HPC: High-performance computing system

As shown in the table, compared to both the host processors and the baseline systems, our systems designed with the framework achieve better performance in terms of execution time; in other words, our systems outperform the others. Our systems process the applications up to 3.15× faster than the general-purpose processors and up to 2.87× faster than the baseline systems.


Table 2. Hardware resources usage for the systems

Platform          Application  Type  Baseline  Ours     Energy saved
Embedded          Canny        LUT   9,296     15,227   49.7%
                               Reg   12,707    18,865
                  jpeg         LUT   11,755    20,837   66.5%
                               Reg   11,910    20,900
High-performance  Canny        LUT   74,965    90,789   54.3%
computing                      Reg   48,994    54,849
                  jpeg         LUT   86,125    101,980  51.8%
                               Reg   64,716    71,527

Resources usage & energy consumption analysis

We use synthesis reports from the Xilinx tools and the XPower analyzer, also from Xilinx, to extract the hardware resources usage and power consumption of the systems. We compare our systems with the baseline systems in terms of resources utilization and the energy saved when using our systems instead of the baseline ones.

Table 2 presents the hardware resources utilization for one FPGA device in the different systems with the two applications. We report the two most important resource types: Look-up Tables (LUT) and Registers (Reg). Note that we only show the resources directly used for the processing of our applications, i.e., we do not take resources used for managing the systems into account, although they exist in the systems, such as the debugger module or I/O handling modules. According to the table, our systems on both platforms need more resources than the baseline ones for all the experiments. The main reason is the hybrid interconnect used.

The table also presents the energy reduction that our systems offer when compared to the baseline systems. Note that the energy consumption is estimated from the power consumption (reported by the Xilinx XPower analyzer) and the execution time. According to the table, although we need more resources than the baseline systems due to the hybrid interconnect, our systems use up to 66.5% less energy than the baseline ones.

5. Conclusions

In this paper, we summarize the automatic FPGA-based hardware accelerator design framework proposed in our previous work. We then present in detail the accelerator systems for two applications belonging to the image processing domain. The case study proves that the framework can help designers reduce the NRE cost and effort in developing FPGA-based systems. Moreover, with the support of the framework, we achieve better overall performance compared to both the host processors alone and the baseline systems. We also manage to save up to 66.5% of energy consumption when compared to the baseline systems. Although our systems need more resources, more energy can be saved. Saving energy is one of the most critical issues for green computing in this era.

Acknowledgements.
This research is funded by Ho Chi Minh City University of Technology - VNU-HCM under grant number To-KHMT-2018-03.

References

[1] C. Pham-Quoc, "Design Framework for FPGA-based Hardware Accelerator with Hybrid Interconnect," in Proceedings of the 2019 6th NAFOSTED Conference on Information and Computer Science (NICS), December 2019, Hanoi, Vietnam.
[2] C. Pham-Quoc, B. Kieu-Do, and T. Ngoc Thinh, "An FPGA-based seed extension IP core for BWA-MEM DNA alignment," in 2018 International Conference on Advanced Computing and Applications (ACOMP), Nov 2018, pp. 1–6.
[3] C. Pham-Quoc, B. Tran-Thanh, and T. N. Thinh, "A scalable FPGA-based floating-point Gaussian filtering architecture," in 2017 International Conference on Advanced Computing and Applications (ACOMP), Nov 2017, pp. 111–116.
[4] L. Song, J. Mao, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "HyPar: Towards hybrid parallelism for deep learning accelerator array," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2019, pp. 56–68.
[5] M. Torabzadehkashi, S. Rezaei, A. Heydarigorji, H. Bobarshad, V. Alves, and N. Bagherzadeh, "Catalina: In-storage processing acceleration for scalable big data analytics," in 2019 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Feb 2019, pp. 430–437.
[6] C. Pham-Quoc, B. Nguyen, and T. N. Thinh, "FPGA-based multicore architecture for integrating multiple DDoS defense mechanisms," SIGARCH Comput. Archit. News, vol. 44, no. 4, pp. 14–19, Jan. 2017.
[7] D. Pnevmatikatos, K. Papadimitriou, T. Becker, P. Baehm, A. Brokalakis, K. Bruneel, C. Ciobanu, T. Davidson, G. Gaydadjiev, K. Heyse, W. Luk, X. Niu, I. Papaefstathiou, D. Pau, O. Pell, C. Pilato, M. Santambrogio, D. Sciuto, D. Stroobandt, T. Todman, and E. Vansteenkiste, "FASTER: Facilitating analysis and synthesis technologies for effective reconfiguration," Microprocessors and Microsystems, vol. 39, no. 4, pp. 321–338, 2015.
[8] D. Glick, J. Grigg, B. Nelson, and M. Wirthlin, "Maverick: A standalone CAD flow for Xilinx 7-series FPGAs," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '19. New York, NY, USA: ACM, 2019, pp. 306–307.
[9] D. Sanchez, G. Michelogiannakis, and C. Kozyrakis, "An analysis of on-chip interconnection networks for large-scale chip multiprocessors," ACM Trans. Archit. Code Optim., vol. 7, no. 1, pp. 4:1–4:28, May 2010.
[10] C. Pham-Quoc, Z. Al-Ars, and K. Bertels, "Heterogeneous hardware accelerators interconnect: An overview," in 2013 NASA/ESA Conference on Adaptive Hardware and Systems (AHS-2013), June 2013, pp. 189–197.
[11] J. Canny, "A computational approach to edge detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 679–698, 1986.
[12] J. Scott, L. H. Lee, J. Arends, and B. Moyer, "Designing the low-power M·CORE architecture," in IEEE Power Driven Microarchitecture Workshop, 1998.
[13] Xilinx, "ML510 reference design," 2009.
[14] Micron, "Hybrid core computer," 2012.

