1 Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet, District 10, Ho Chi Minh City, Vietnam
2 Vietnam National University – Ho Chi Minh City, Thu Duc District, Ho Chi Minh City, Vietnam
Abstract
We present a case study of automatic FPGA-based hardware accelerator design using our proposed framework in the image processing domain. With the framework, the resulting systems are optimized in both performance and energy consumption. Moreover, using the framework, designers can implement FPGA platforms without manually describing any hardware cores or the interconnect. The systems offer accelerations in execution time compared to traditional general-purpose processors and manually designed accelerator systems. We use two applications in the image processing domain as experiments to report our work: Canny edge detection and a jpeg converter. The experiments are conducted on both embedded and high-performance computing platforms. Results show that we achieve overall speed-ups of up to 3.15× and 2.87× when compared to baseline systems on the embedded and high-performance platforms, respectively. Our systems consume up to 66.5% less energy than other FPGA-based systems.
Copyright © 2020 Cuong Pham-Quoc licensed to EAI. This is an open access article distributed under the terms of the Creative
Commons Attribution licence (http://creativecommons.org/licenses/by/3.0/), which permits unlimited use, distribution and
reproduction in any medium so long as the original work is properly cited.
doi: 10.4108/eai.12-5-2020.164497
In this paper, we report a case study for the image processing domain. We use the design framework to implement two different applications, the Canny edge detection and the jpeg image converter. We discuss the design steps and conduct experiments with the systems on both embedded and high-performance computing platforms. We analyze the experimental results of both systems in terms of execution time and energy consumption. We then compare our systems with general-purpose processors and traditional FPGA-based accelerators.
The main contributions of this paper are twofold.

• We first briefly introduce a case study of the image processing domain with two applications when using our proposed framework;
• We analyze and compare results of the systems designed by the framework and other systems.

The rest of the paper is organized as follows. Section 2 briefly presents the framework designed and reported in our previous work. Section 3 discusses the steps in developing two applications in the image processing domain with our framework. We present our experimental results with two computing platforms in Section 4. Finally, the paper is concluded in Section 5.

2. The automatic design framework

In our previous work [1], we proposed an automatic framework for designing an FPGA-based hardware accelerator with a hybrid interconnect. The framework allows designers to develop a hardware accelerator system for a particular application without much knowledge of, or working effort at, the hardware level. Figure 1 depicts the design flow of the proposed framework. Although the design flow includes five automatic processing steps, designers are also able to intervene and manually make further improvements.

In this framework, a target application (developed in a high-level programming language) is profiled in Step 1 to collect the execution time of the functions in the application. This profiling step also creates a communication graph to represent the data communication between functions inside the application. Based on this data communication graph, a hybrid interconnect for the hardware kernels, as well as between the hardware device and the host processor, can be defined appropriately.

The application is then partitioned into hardware and software parts in Step 2. Computationally intensive functions are candidates for acceleration with hardware kernels in the FPGA. Moreover, based on the data communication graph, non-intensive functions may also be implemented in the FPGA to reduce off-chip data communication.

As stated above, the data communication graph is used to define the most suitable hybrid interconnect for the hardware kernels in Step 3. The hybrid interconnect may comprise a bus, a crossbar, a network-on-chip, or a shared buffer, depending on the communication patterns of the kernels. The main purpose of this step is to reduce data communication overhead while keeping the hardware resources used by the interconnect minimized. That, in turn, saves energy, since the fewer resources are used, the less energy is consumed.

The selected functions are then synthesized by high-level synthesis (HLS) toolchains (in this case we use Vivado HLS from Xilinx) to create hardware kernels described in a hardware description language (in our work, we prefer Verilog-HDL). With the support of HLS tools, functions written in high-level programming languages can be compiled to Verilog-HDL automatically. This solves one of the most difficult issues that designers usually face.

Finally, the entire system is developed, synthesized, and mapped to hardware platforms by tools provided by FPGA manufacturers (in our case, we use Xilinx Vivado since Xilinx platforms are targeted). Based on the resource availability of the target platforms, computationally intensive kernels can be replicated to further improve overall performance.

As summarized above, designers do not need to intervene in the framework much because all steps can run automatically. However, if designers want to modify some kernels or the interconnect themselves, they are still able to do so.
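To make Steps 1 and 2 concrete, the sketch below shows one way the profiling output and the data communication graph could be represented and used to pick hardware candidates. It is a minimal illustration written for this discussion, not the framework's actual code: the structure fields, the threshold values, and the numbers in main are assumptions.

#include <stdio.h>

/* One node per application function, filled in by the profiling step (Step 1). */
typedef struct {
    const char *name;       /* function name                     */
    double      exec_share; /* fraction of total execution time  */
    int         to_hw;      /* partitioning decision (Step 2)    */
} FuncNode;

/* One edge per producer/consumer pair in the data communication graph. */
typedef struct {
    int  src, dst;          /* indices into the node array       */
    long bytes;             /* data volume transferred (bytes)   */
} CommEdge;

/* Step 2 (sketch): mark computationally intensive functions for hardware,
 * then also pull in cheap functions that exchange large data volumes with
 * hardware candidates, so that this traffic stays on-chip. */
static void partition(FuncNode *f, int nf, const CommEdge *e, int ne,
                      double time_thresh, long bytes_thresh)
{
    for (int i = 0; i < nf; i++)
        f[i].to_hw = (f[i].exec_share >= time_thresh);

    for (int k = 0; k < ne; k++)
        if (e[k].bytes >= bytes_thresh &&
            (f[e[k].src].to_hw || f[e[k].dst].to_hw)) {
            f[e[k].src].to_hw = 1;
            f[e[k].dst].to_hw = 1;
        }
}

int main(void)
{
    /* Toy numbers chosen for illustration only, not the measured profile. */
    FuncNode f[] = { {"gaussian_smooth", 0.45, 0}, {"derivative_x_y", 0.05, 0},
                     {"magnitude_x_y",   0.20, 0}, {"non_max_supp",   0.20, 0} };
    CommEdge e[] = { {0, 1, 1200000}, {1, 2, 130000}, {2, 3, 280000} };

    partition(f, 4, e, 3, 0.15, 100000);
    for (int i = 0; i < 4; i++)
        printf("%-16s -> %s\n", f[i].name, f[i].to_hw ? "hardware kernel" : "software");
    return 0;
}

With these toy numbers, derivative_x_y is not intensive enough on its own, but the large edge it shares with gaussian_smooth pulls it into hardware, which mirrors the reasoning applied to the Canny application in Section 3.1.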
3. Case study
In this section, we introduce our hardware accelerator systems developed with the framework for the two applications: Canny edge detection [11] and the jpeg image converter [12].
3.1. The Canny edge detection

The Canny edge detection application consists of four main operators: a Gaussian filter, gradient calculation, non-maximum suppression, and finally hysteresis thresholding. The application was first introduced in 1986 and coded in ANSI C. The profiling results indicate that three operators, the Gaussian filter, gradient calculation, and non-maximum suppression, are the most computationally intensive. The framework also generates the data communication graph shown in Figure 2.

[Figure 2: nodes input_functions, gaussian_smooth, derivative_x_y, magnitude_x_y, non_max_supp, and output_functions; each edge is labelled with the transferred data volume in bytes and with its UnMA and UnDV counts.]

Figure 2. Data communication graph of the Canny edge detection application
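The edge weights in Figure 2 come from whole-frame buffers handed from one stage of the chain to the next. As a rough, assumed illustration (the frame size and element types below are not taken from the paper), the per-edge byte counts can be estimated as follows:

#include <stdio.h>

/* Rough estimate of Figure 2 edge weights: every stage of the Canny chain
 * passes one or two whole-frame buffers to its successor. The 320x240 frame
 * and the element types are assumptions, not the profiled values. */
int main(void)
{
    long rows = 240, cols = 320;
    long px = rows * cols;

    printf("input -> gaussian_smooth         : %ld bytes\n", px * (long)sizeof(unsigned char));
    printf("gaussian_smooth -> derivative_x_y: %ld bytes\n", px * (long)sizeof(short));
    printf("derivative_x_y -> magnitude_x_y  : %ld bytes\n", 2 * px * (long)sizeof(short)); /* dx and dy */
    printf("magnitude_x_y -> non_max_supp    : %ld bytes\n", px * (long)sizeof(short));
    return 0;
}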
According to the profiling results, the three aforementioned operators (implemented as the functions gaussian_smooth, magnitude_x_y, and non_max_supp) are good candidates to be accelerated by hardware kernels. However, as illustrated in the data communication graph, the derivative_x_y function should also be accelerated by a hardware kernel because a huge amount of data is transferred between this function and the other candidates. Therefore, all four functions are implemented as accelerators. The other procedures of the application are kept on the general-purpose processor.

During the hybrid interconnect generation step, the most suitable interconnect is designed for the data communication among the four hardware kernels above. Please note that the communication infrastructure for transferring data between the processor and the kernels is already defined by the target platform used for building the system. In this application, a shared buffer is used for the interconnect of the gaussian_smooth and derivative_x_y accelerators, while a network-on-chip (NoC) is used for transferring data between the derivative_x_y kernel and the two remaining kernels.

As discussed above, in this work we target Xilinx platforms for implementing our systems; we therefore use Xilinx Vivado HLS to generate the Verilog-HDL description modules for our hardware kernels. Figure 3 shows a part of the Verilog-HDL description generated by Xilinx Vivado HLS for the gaussian_smooth function.
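The input of that HLS step is plain C. The fragment below is a simplified, assumed stand-in for a separable smoothing loop of the kind gaussian_smooth contains, together with the sort of Vivado HLS pragmas such a kernel typically carries; it is not the application's actual source, and the interface and fixed-point scaling are assumptions.

/* Simplified HLS-style smoothing kernel (assumed code, not the original
 * gaussian_smooth): a 1-D convolution pass over one image row. Vivado HLS
 * turns loops like this into Verilog-HDL modules such as the one excerpted
 * in Figure 3. */
#define KSIZE   5
#define MAXCOLS 1024

void smooth_row(const unsigned char in[MAXCOLS], short out[MAXCOLS],
                const short kernel[KSIZE], int cols)
{
#pragma HLS INTERFACE ap_memory port=in
#pragma HLS INTERFACE ap_memory port=out
    for (int c = 0; c < cols; c++) {
#pragma HLS PIPELINE II=1
        int acc = 0;
        for (int k = 0; k < KSIZE; k++) {
#pragma HLS UNROLL
            int idx = c + k - KSIZE / 2;
            if (idx >= 0 && idx < cols)
                acc += kernel[k] * in[idx];
        }
        out[c] = (short)(acc >> 8);   /* assumed fixed-point scaling */
    }
}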
In this work, we implement the applications on both an embedded system with the Xilinx ML510 board [13] and a high-performance computing system with the Micron HC-2ex [14]. Therefore, the final step needs to be performed for two different target platforms. For the embedded platform, because there is only one FPGA device, we are able to build the system with 5 kernels; hence, only the gaussian_smooth kernel is duplicated. Meanwhile, the high-performance computing platform consists of 4 modern FPGA devices that can host more kernels. Consequently, we replicate the kernels up to 64 accelerators in total.

The system is finally synthesized and mapped to the FPGA devices by the Xilinx Vivado toolchain. This final step is technology dependent. Resource usage and other parameters are reported in detail in Section 4.

3.2. The jpeg image converter

The jpeg image converter is the second application used in our case study. The main purpose of this application is to encode bitmap images to the jpeg format. The application was implemented in ANSI C and reported in the benchmark in [12]. The main part of the application includes four computationally intensive functions: huff_dc_dec, huff_ac_dec, dquantz_lum, and j_rev_dct.

Figure 4. Data communication graph of the jpeg image converter application
Figure 4 illustrates the data communication graph, focusing mainly on the four mentioned functions. All of those functions are accelerated by hardware kernels. Similar to the Canny edge detection application, a NoC is used for the data communication of the first three kernels, while a shared local buffer is involved in transferring data between the dquantz_lum and j_rev_dct accelerators. As with the systems for the Canny edge detection application, Vivado HLS is also used to generate the Verilog-HDL descriptions of these functions from the C code.
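The interconnect choice reflects the per-block data flow through these four functions. The stand-in code below is assumed and heavily simplified (the benchmark's real functions parse an actual bitstream and use full quantization tables and a 2-D inverse DCT); it only mirrors which stage hands data to which, which is the traffic carried by the NoC and the shared buffer.

#include <stdio.h>

#define BLK 64  /* one 8x8 coefficient block */

/* Placeholder stage bodies; only the per-block data handoff matters here. */
static void huff_dc_dec(const unsigned char **bits, int *coef)
{
    coef[0] = **bits; (*bits)++;
}
static void huff_ac_dec(const unsigned char **bits, int *coef)
{
    for (int i = 1; i < BLK; i++) { coef[i] = **bits; (*bits)++; }
}
static void dquantz_lum(const int *coef, const int *qtab, int *dq)
{
    for (int i = 0; i < BLK; i++) dq[i] = coef[i] * qtab[i];
}
static void j_rev_dct(const int *dq, short *pixels)
{
    for (int i = 0; i < BLK; i++) pixels[i] = (short)(dq[i] >> 3); /* stand-in for the 2-D IDCT */
}

int main(void)
{
    unsigned char stream[BLK] = {16};     /* toy "bitstream" for one block */
    const unsigned char *p = stream;
    int qtab[BLK];
    for (int i = 0; i < BLK; i++) qtab[i] = 1;

    int coef[BLK], dq[BLK];
    short px[BLK];
    huff_dc_dec(&p, coef);        /* NoC traffic: huff_dc_dec -> huff_ac_dec        */
    huff_ac_dec(&p, coef);        /* NoC traffic: huff_ac_dec -> dquantz_lum        */
    dquantz_lum(coef, qtab, dq);
    j_rev_dct(dq, px);            /* shared-buffer traffic: dquantz_lum -> j_rev_dct */

    printf("first pixel of reconstructed block: %d\n", px[0]);
    return 0;
}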
This application is also implemented on both of the above embedded and high-performance computing systems. On the embedded system, only the huff_ac_dec kernel is duplicated, while 64 kernels are built in the high-performance computing system. Figure 5 illustrates the architecture of the jpeg application when implemented on the embedded platform.

Figure 5. The architecture for the jpeg application on the embedded platform
4. Experiments

In this section, we present our experiments to verify the FPGA-based accelerator systems reported above. Kernel performance as well as overall system performance are presented. Energy consumption compared to the baseline systems is also shown.

4.1 Experimental setup

As presented above, the embedded and high-performance computing systems used as our target platforms are the ML510 and the Micron HC-2ex, respectively. The ML510 board consists of only one Xilinx xc5vfx130t FPGA device. Inside the FPGA device, there are two embedded hardwired PowerPC processors that are used as host processors to process the software part of the applications. Hardware accelerators are mapped into the reconfigurable area of the device. In this paper, we configure the processors to run at 400 MHz, while the hardware kernels can work at 100 MHz only, due to the huge amount of reconfigurable logic resources used. To compare performance, the applications are first executed on the host processor. They are then processed by the entire accelerator systems, i.e., the hardware kernels process the computationally intensive functions while the software part is still executed on the host.

For the high-performance computing system, the Micron HC-2ex (formerly Convey) is used as our experimental platform. The system includes four Virtex-6 xc6vlx760 FPGA devices and one Intel Xeon X5670 processor. While the host processor can run at 2.93 GHz, the accelerators are set to work at 200 MHz. Similar to the embedded system, we first execute the applications on the host processor with full parallelization, i.e., all 12 cores of the processor are used to process the applications. They are then processed by the entire accelerator systems to compare execution time.

4.2 Experimental results

In this section, we discuss our experimental results with the two aforementioned systems when processing the two applications. The performance of the kernels and of the entire systems, as well as energy consumption and resource usage, are analyzed.

Performance analysis

Table 1 depicts the speed-ups of the kernels and of the overall accelerator systems when compared to their host processors (PowerPC for the embedded system and Intel Xeon for the high-performance computing system) in the third and fourth columns, respectively. The table also presents speed-ups compared to the baseline systems (the baseline systems are traditional accelerator systems built without the help of our framework for optimizing data communication, but including replicated kernels for a fair comparison) in the fifth and sixth columns, respectively.

Table 1. Speed-ups comparison between our systems and others

Platform   App.     w.r.t. host processors     w.r.t. baseline systems
                    kernels     overall        kernels     overall
EMB(1)     Canny    3.88×       3.15×          2.12×       1.83×
           jpeg     2.55×       2.33×          3.08×       2.87×
HPC(2)     Canny    2.62×       2.61×          1.55×       1.54×
           jpeg     1.96×       1.45×          1.93×       1.42×
(1) EMB: Embedded system; (2) HPC: High-performance computing system

As shown in the table, compared to both the host processors and the baseline systems, our systems designed with the framework achieve better performance in terms of execution time; in other words, our systems outperform the others in terms of execution time. Our systems process the applications up to 3.15× faster than the general-purpose processors and up to 2.87× faster than the baseline systems.
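The gap between the kernel-level and the overall speed-ups is the usual Amdahl's-law effect of the software part that stays on the host. As an illustrative back-of-envelope check (our own calculation from the table, not a figure reported in the paper), the EMB/Canny row is consistent with roughly 92% of the original execution time being spent in the accelerated functions:

\[
S_{\mathrm{overall}} = \frac{1}{(1-f) + f / S_{\mathrm{kernels}}},
\qquad
3.15 \approx \frac{1}{(1-f) + f/3.88} \;\Rightarrow\; f \approx 0.92 .
\]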
Programmable Gate Arrays, ser. FPGA '19. New York, NY, USA: ACM, 2019, pp. 306–307.
[9] D. Sanchez, G. Michelogiannakis, and C. Kozyrakis, "An analysis of on-chip interconnection networks for large-scale chip multiprocessors," ACM Trans. Archit. Code Optim., vol. 7, no. 1, pp. 4:1–4:28, May 2010.
[10] C. Pham-Quoc, Z. Al-Ars, and K. Bertels, "Heterogeneous hardware accelerators interconnect: An overview," in 2013 NASA/ESA Conference on Adaptive Hardware and Systems (AHS-2013), June 2013, pp. 189–197.
[11] J. Canny, "A computational approach to edge detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 679–698, 1986.
[12] J. Scott, L. H. Lee, J. Arends, and B. Moyer, "Designing the low-power M•CORE architecture," in IEEE Power Driven Microarchitecture Workshop, 1998.
[13] Xilinx, "ML510 reference design," 2009.
[14] Micron, "Hybrid core computer," 2012.