
2020 International Conference on Culture-oriented Science & Technology (ICCST)

An improved algorithm for deep learning YOLO network based on Xilinx ZYNQ FPGA

Zhenguang Li1,2, Jintao Wang1,2 (*corresponding author: Jintao Wang)

1 School of Information and Communication Engineering, Communication University of China
2 Key Laboratory of Media Audio and Video (Communication University of China), Ministry of Education
Beijing, China
[email protected], [email protected]

Abstract—With the development of artificial intelligence, convolutional neural networks (CNN) have been widely used in image processing and other fields due to their excellent performance. However, as a computationally intensive algorithm, CNNs face huge challenges when realized on mobile devices. FPGAs have the advantages of high performance, reprogrammability, and low power consumption, and have become a suitable choice for CNN deployment. Compared with other CNN algorithms, the YOLO algorithm regards target detection as a regression problem. It is a one-step algorithm with fast execution speed and a small amount of calculation, and it is suitable for implementation on FPGA hardware platforms. This paper proposes an improved algorithm for the deep learning YOLO network based on the Xilinx ZYNQ FPGA. By optimizing the YOLO network model, applying fixed-point quantization, and so on, the problems of the large computational load of CNNs and the limited resources on FPGA chips are solved, and the parallelism of the FPGA is used to accelerate the CNN. Experimental results show that the method proposed in this paper greatly improves the operation rate while maintaining accuracy, and has important practical value for the realization of CNNs on mobile terminals and for real-time computing.

Keywords-CNN, FPGA-ZYNQ, YOLO, acceleration

I. INTRODUCTION

In recent years, the CNN has become one of the research hotspots in the field of artificial intelligence [1], and has shown excellent performance in many aspects such as computer vision, speech recognition, and natural language processing [2]. Deep convolutional neural network algorithms are computationally intensive; with the gradual development of CNNs, their time complexity and space complexity gradually increase, and implementation on mobile devices faces enormous challenges [3]. As a field programmable logic device, the FPGA has the advantages of low power consumption, low delay, and flexibility, and it is gradually being used to accelerate the forward inference process of CNNs. Among them, Xilinx's ZYNQ [4] FPGA series chips use an ARM+FPGA system architecture, including an ARM Cortex-A9 dual-core processing system (PS) and Xilinx 28nm medium-scale programmable logic (PL). They support Xilinx's SDSoC development flow, enabling rapid deployment and algorithm verification of CNN models on FPGA hardware.

Among the various convolutional neural networks, the most representative ones are SSD [5], Faster R-CNN [6][7][8], and YOLO [9]. Table I shows the accuracy and speed comparison of the main target recognition algorithms. Among them, the YOLO algorithm is a single-neural-network target detection system proposed by Joseph Redmon and Ali Farhadi. It regards target detection as a regression problem and is a one-step algorithm: a picture only requires one pass through the CNN to directly output categories and the corresponding positioning information. Therefore, compared with other algorithms, YOLO's execution speed is very fast and its amount of calculation is relatively small, making it suitable for implementation on an FPGA hardware platform [10].

This paper presents an improved algorithm for the deep learning YOLO network based on the Xilinx ZYNQ FPGA, combined with the hardware characteristics of the FPGA, to accelerate the deep learning YOLO network. The acceleration algorithm mainly solves the problem that the CNN has a large amount of calculation and the FPGA has
limited resources on the chip, and uses the parallelism feature of the FPGA to accelerate the CNN. The algorithm implementation mainly includes three parts: FPGA-YOLO network design, activation function selection, and 16-bit fixed-point optimization of the weight parameters.
TABLE I. COMPARISON OF ACCURACY AND SPEED OF MAIN ALGORITHMS

Detection method | Accuracy (mAP) (%) | FPS
Tiny-YOLO        | 52.7               | 155
YOLO             | 63.4               | 45
R-CNN            | 53.5               | 6
Fast R-CNN       | 70.0               | 0.5
Faster R-CNN     | 73.2               | 7
DPM              | 16.0               | 100
Fastest DPM      | 30.4               | 15
SSD              | 72.1               | 58

II. ALGORITHM IMPLEMENTATION

A. FPGA-YOLO Network Design

After continuous development, there are three versions of the YOLO network. The original network used in this article is Tiny-YOLO, an accelerated version of YOLO that reduces the network depth to further increase processing speed. Compared with YOLO, Tiny-YOLO loses some recognition accuracy, but its processing speed is more than three times that of the original.

The FPGA-YOLO network is designed based on the Tiny-YOLO network. The hierarchical architecture is shown in Figure 1. The FPGA-YOLO network is a fully convolutional network with a total of 14 operating layers, including 9 convolutional layers and 5 pooling layers. The convolutional layers learn representations of the data, and the stacked convolutions extract the low-level and high-level features of the image, respectively. The pooling layers reduce the feature dimensions of the convolutional layer outputs while improving the learning effect, making the learned network less prone to overfitting. Finally, the fully connected layer is replaced with a convolutional layer so that the marked image can be output directly.

Figure 1. FPGA-YOLO network hierarchy architecture diagram.

In the design of the FPGA hardware platform, we mainly use the ARM+FPGA system architecture of the ZYNQ chip. As shown in Figure 2, the FPGA programmable logic part of the ZYNQ forms the PL end, and the processor part with the ARM core forms the PS end. In this system, the FPGA-YOLO network is implemented on the FPGA on the PL side and exchanges data with Linux on the PS side through DMA. Linux uploads the FPGA-YOLO network weight parameters and input data directly to the accelerator; after the CNN calculation is completed, the calculation results are returned to Linux. In Linux, the DMA (Direct Memory Access) driver controls DMA reads and writes, and data is exchanged with the PC through a Socket. The image data is controlled by the HDMI module for input and output, and is buffered by the DDR module.

Figure 2. ZYNQ architecture diagram.

According to the characteristics of the hardware system, we optimize the FPGA-YOLO network. As shown in Table II, compared with the original Tiny-YOLO network, the FPGA-YOLO network changes the input from 416*416 to 640*352, which ensures that it matches the 1280*720 format video source input through the HDMI interface, and the amount of image pre-processing calculation is reduced by extraction. The pooling layer numbered 11 is removed to avoid the problem that its pooling format does not match the subsequent 16-bit quantization format. The two following convolutional layers with 1024 filters are modified to 512 filters to reduce the amount of calculation. The FPGA-YOLO network divides the input picture into 20*11 basic grids and makes a prediction for each grid. Each group of prediction results contains 8 prediction data, so the final network output is 20*11*5*8.
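To make these dimensions concrete, the following minimal C++ sketch (our own illustration, not code from the paper) recomputes the 20*11 grid from the 640*352 input and shows one plausible flat indexing of the 20*11*5*8 output volume; the cell-major storage order assumed here is not stated in the paper.

```cpp
#include <cstddef>
#include <cstdio>

// Dimensions taken from the text: 640x352 input, 20x11 grid, 5 prediction
// groups per grid cell, 8 prediction data per group (20*11*5*8 output).
constexpr int IN_W = 640, IN_H = 352;
constexpr int GRID_W = 20, GRID_H = 11;
constexpr int GROUPS = 5, FIELDS = 8;

// One plausible flat index for (grid x, grid y, group g, field f);
// the cell-major ordering is our assumption for illustration only.
inline std::size_t out_index(int x, int y, int g, int f) {
    return ((static_cast<std::size_t>(y) * GRID_W + x) * GROUPS + g) * FIELDS + f;
}

int main() {
    std::printf("downsampling per cell: %dx%d\n", IN_W / GRID_W, IN_H / GRID_H); // 32x32
    std::printf("output elements: %d\n", GRID_W * GRID_H * GROUPS * FIELDS);     // 8800
    std::printf("last flat index: %zu\n",
                out_index(GRID_W - 1, GRID_H - 1, GROUPS - 1, FIELDS - 1));      // 8799
    return 0;
}
```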

TABLE II. PARAMETER CONFIGURATION OF THE FPGA-YOLO MODEL

Layer | Type | Filters | Size/Stride | Input       | Output
1     | C    | 16      | 3×3/1       | 640×352×3   | 640×352×16
2     | M    |         | 2×2/2       | 640×352×16  | 320×176×16
3     | C    | 32      | 3×3/1       | 320×176×16  | 320×176×32
4     | M    |         | 2×2/2       | 320×176×32  | 160×88×32
5     | C    | 64      | 3×3/1       | 160×88×32   | 160×88×64
6     | M    |         | 2×2/2       | 160×88×64   | 80×44×64
7     | C    | 128     | 3×3/1       | 80×44×64    | 80×44×128
8     | M    |         | 2×2/2       | 80×44×128   | 40×22×128
9     | C    | 256     | 3×3/1       | 40×22×128   | 40×22×256
10    | M    |         | 2×2/2       | 40×22×256   | 20×11×256
11    | C    | 512     | 3×3/1       | 20×11×256   | 20×11×512
12    | C    | 512     | 3×3/1       | 20×11×512   | 20×11×512
13    | C    | 512     | 3×3/1       | 20×11×512   | 20×11×512
14    | C    | 40      | 1×1/1       | 20×11×512   | 20×11×40

Note: C = Convolutional Layer, M = Maxpool Layer.
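As a quick cross-check of the Input/Output columns above, the short C++ sketch below (ours; it assumes the 3×3 stride-1 convolutions use "same" zero padding, which the paper does not state explicitly) recomputes the feature-map sizes layer by layer.

```cpp
#include <cstdio>

// Recompute the Input/Output columns of Table II.
// Assumption: 3x3 stride-1 convolutions preserve width and height (same padding);
// 2x2 stride-2 maxpool halves width and height.
int main() {
    int w = 640, h = 352, c = 3;
    struct Layer { char type; int filters; };
    const Layer net[] = {
        {'C', 16}, {'M', 0}, {'C', 32}, {'M', 0}, {'C', 64}, {'M', 0},
        {'C', 128}, {'M', 0}, {'C', 256}, {'M', 0}, {'C', 512},
        {'C', 512}, {'C', 512}, {'C', 40},   // last layer is the 1x1/1 conv with 40 filters
    };
    int i = 1;
    for (const Layer& l : net) {
        if (l.type == 'C') { c = l.filters; }   // conv keeps w,h under the same-padding assumption
        else               { w /= 2; h /= 2; }  // maxpool halves w,h
        std::printf("layer %2d (%c): %dx%dx%d\n", i++, l.type, w, h, c);
    }
    return 0;  // the final line should read 20x11x40, matching Table II
}
```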
B. Activation Function Selection

The traditional Tiny-YOLO network uses the Leaky function as its activation function; the Leaky formula is shown in equation (1). When x < 0, the Leaky function outputs y = kx, where k is a decimal between 0 and 1. The FPGA needs to call DSP resources and consume considerable time when performing the resulting floating-point multiplication. Therefore, this paper uses the Relu function instead of the Leaky function as the activation function; the Relu formula is shown in equation (2). When x < 0, the Relu function outputs y = 0. Compared with the Leaky function, the Relu function is simple to implement in circuitry and saves calculation time.

y = \begin{cases} x, & x \ge 0 \\ kx, & x < 0 \end{cases}    (1)

y = \begin{cases} x, & x \ge 0 \\ 0, & x < 0 \end{cases}    (2)
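A minimal 16-bit fixed-point sketch of this trade-off is given below (our own illustration; the Q8.8 format and the example value of k are assumptions, not taken from the paper): the Leaky branch needs a multiply, while the Relu branch is just a compare and select.

```cpp
#include <cstdint>
#include <cstdio>

using fixed16 = int16_t;  // 16-bit fixed-point value, assumed Q8.8 here

// Leaky activation: y = x for x >= 0, y = k*x otherwise. The multiplication by
// the fractional constant k costs a DSP multiply per activation on the FPGA.
inline fixed16 leaky(fixed16 x, fixed16 k_q8_8) {
    if (x >= 0) return x;
    return static_cast<fixed16>((static_cast<int32_t>(x) * k_q8_8) >> 8);
}

// Relu activation: y = x for x >= 0, y = 0 otherwise. No multiplier is needed,
// which is why the paper prefers it on the PL side.
inline fixed16 relu(fixed16 x) {
    return x >= 0 ? x : static_cast<fixed16>(0);
}

int main() {
    const fixed16 k = 26;     // ~0.1 in Q8.8 (assumed example value of k)
    const fixed16 x = -512;   // -2.0 in Q8.8
    std::printf("leaky(-2.0) = %d (Q8.8), relu(-2.0) = %d\n", leaky(x, k), relu(x));
    return 0;
}
```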

C. Fixed-point Optimization

Floating-point numbers and fixed-point numbers are two representations of data in a machine. The benefit of representing data in floating point is a larger representation range and higher precision. In hardware design, however, FPGA fixed-point operations can significantly reduce resource consumption and shorten clock cycles compared with floating-point operations, and can also reduce power consumption.

After training the FPGA-YOLO network with the Darknet open-source framework, the output is a 32-bit weight parameter file containing a large amount of data. Caffe-Ristretto is an automatic CNN quantization tool that can compress 32-bit floating-point networks, allowing networks to be tested, trained, and fine-tuned with limited numerical precision. Because the internal resources of the FPGA are limited, we chose the Caffe-Ristretto open-source tool to perform fixed-point quantization on the 32-bit weight parameters.
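The core of such a quantization step can be pictured with the toy C++ sketch below; it is only an illustration of converting float weights to 16-bit fixed point with a fixed fractional length that we choose by hand, whereas Caffe-Ristretto selects the fixed-point format automatically.

```cpp
#include <cstdint>
#include <cmath>
#include <vector>
#include <cstdio>

// Toy illustration: quantize 32-bit float weights to 16-bit fixed point.
// The Q-format (frac_bits) here is an assumed example, not the paper's setting.
std::vector<int16_t> quantize16(const std::vector<float>& w, int frac_bits) {
    std::vector<int16_t> q(w.size());
    const float scale = std::ldexp(1.0f, frac_bits);   // 2^frac_bits
    for (std::size_t i = 0; i < w.size(); ++i) {
        long v = std::lround(w[i] * scale);
        if (v >  32767) v =  32767;                     // saturate to int16 range
        if (v < -32768) v = -32768;
        q[i] = static_cast<int16_t>(v);
    }
    return q;
}

int main() {
    std::vector<float> w = {0.75f, -0.125f, 1.5f};
    auto q = quantize16(w, 12);                         // Q3.12 example format
    for (std::size_t i = 0; i < w.size(); ++i)
        std::printf("%.4f -> %d (back: %.4f)\n", w[i], q[i], q[i] / 4096.0f);
    return 0;  // each stored weight now takes 2 bytes instead of 4 (50% of the size)
}
```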
TABLE III. COMPARISON OF THE WEIGHT PARAMETERS OF DIFFERENT DIGIT REPRESENTATIONS

Fixed-point strategy | Parameter size | Compression ratio
floating point       | 50.6M          | 1
32-bit fixed point   | 50.6M          | 1
16-bit fixed point   | 25.3M          | 50%
8-bit fixed point    | 12.7M          | 25%

The weight parameter files of the FPGA-YOLO network are fixed-point processed after training. The space occupied by the weight parameters under different representations is shown in Table III. The resource requirements and processing speed of 8-bit and 16-bit fixed-point numbers are basically the same, but the experimental results show that after the 8-bit fixed-point weight parameter file is deployed to the FPGA, the calculation error causes targets not to be correctly identified. Although 32-bit fixed-point numbers give a good recognition effect, the data size is not compressed at all. We therefore finally chose 16-bit fixed point.
includes the smallest system module, FPGA-YOLO network PERFORMANCE TEST
IP core module, input HDMI module, output HDMI module,
Type CPU GPU FPGA
SD card drive module, DDR module and interconnection
module. The smallest system module is based on ARM for
embedded development and has a task scheduling function in Platform Inter®Core GTX Titan Virtex-7 ZYNQ7035
the entire system. The FPGA-YOLO network IP core TM
i5-6200U X
module is responsible for target recognition of the input VC707 (This Work)
image. The HDMI module is responsible for image input and
output. The SD card driver module is an external storage Frequency 2.3GHz 1GHz 200MHz 50MHz
module of the FPGA-YOLO platform, which stores the
platform startup files and application files. The DDR module Storage 8G 12G 2G 2G
is the platform's buffer. The interconnection module is
responsible for scheduling and combining various modules. DDR3 DDR5 DDR3 DDR3
The system implementation process is shown in Figure 3.
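The host-side loop implied by Figure 3 and by the DMA description in Section II can be sketched roughly as below. Every function here is a placeholder stub (the paper does not publish its Linux driver API), so the code only marks where each step would call into the SD-card, HDMI, and DMA hardware.

```cpp
#include <cstdint>
#include <vector>

// Placeholder stubs standing in for the real SD-card, HDMI, and DMA drivers;
// names and signatures are ours, chosen only to label the steps of Figure 3.
static std::vector<int16_t> load_weights_from_sd() { return {}; }
static std::vector<int16_t> preprocess_hdmi_frame() { return std::vector<int16_t>(640 * 352 * 3); }
static void dma_write(const std::vector<int16_t>&) {}
static bool accel_done() { return true; }
static std::vector<int16_t> dma_read() { return std::vector<int16_t>(20 * 11 * 5 * 8); }
static void postprocess_and_display(const std::vector<int16_t>&) {}

int main() {
    dma_write(load_weights_from_sd());            // load the 16-bit weight file into DDR
    for (int frame = 0; frame < 1; ++frame) {     // one iteration for illustration
        dma_write(preprocess_hdmi_frame());       // send a pre-processed 640x352 frame
        while (!accel_done()) { }                 // poll the accelerator status register
        postprocess_and_display(dma_read());      // fetch the 20x11x5x8 results, draw boxes,
    }                                             // and output the image via OpenCV / HDMI
    return 0;
}
```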
IV. RESULTS AND ANALYSIS

We use the KITTI dataset (currently the world's largest evaluation dataset for computer vision algorithms in autonomous driving scenarios) to train the original Tiny-YOLO network and the optimized FPGA-YOLO network, run experiments on various platforms, and compare network performance. The experimental results are analyzed from three aspects: recognition speed, recognition accuracy, and power consumption.

TABLE IV. MULTI-PLATFORM TINY-YOLO NETWORK PERFORMANCE TEST

Type               | CPU                   | GPU         | FPGA            | FPGA (This Work)
Platform           | Intel® Core™ i5-6200U | GTX Titan X | Virtex-7 VC707  | ZYNQ7035
Frequency          | 2.3GHz                | 1GHz        | 200MHz          | 50MHz
Storage            | 8G DDR3               | 12G DDR5    | 2G DDR3         | 2G DDR3
Speed (s)          | 2.331145              | 0.008451    | 0.192643        | 0.051915
Accuracy (mAP) (%) | 51.3                  | 52.7        | 48.2            | 48.5
Pow (W)            | N/A                   | 170         | 12.7            | 9

We tested the same pictures on the CPU, GPU, and FPGA platforms, and the test results are shown in Table IV. The experimental results show that the average time for the FPGA platform designed in this paper to recognize a single frame is 0.051915s, which is 44.9 times faster than the CPU's 2.331145s. The processing speed is still lower than that of the dedicated graphics processor (GPU), but compared with the GPU, the power consumption is greatly reduced. In terms of recognition accuracy, we use the mAP value to measure the recognition result. The average detection accuracy of the FPGA platform designed in this paper is 48.5%, which is slightly lower than that of the CPU and GPU platforms but still a high detection accuracy. Figure 4 shows the recognition effect of the FPGA hardware platform. The input image is a color image with a resolution of 1280*720. The platform successfully identified the pedestrians and vehicles in the picture and marked them with borders.
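For reference, the speed-up and frame-rate figures quoted above follow directly from the per-frame latencies in Table IV; the short C++ check below (ours) simply reproduces the arithmetic.

```cpp
#include <cstdio>

// Sanity check of the Table IV numbers: per-frame latency to speed-up and fps.
// The latency values are copied from the table.
int main() {
    const double cpu_s  = 2.331145;   // Intel Core i5-6200U
    const double gpu_s  = 0.008451;   // GTX Titan X
    const double fpga_s = 0.051915;   // ZYNQ7035 (this work)

    std::printf("FPGA vs CPU speed-up: %.1fx\n", cpu_s / fpga_s);      // ~44.9x
    std::printf("FPGA throughput: %.1f fps\n", 1.0 / fpga_s);          // ~19.3 fps
    std::printf("GPU is still %.1fx faster than the FPGA\n", fpga_s / gpu_s);
    return 0;
}
```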

(a) Cars and Pedestrians
(b) Face
Figure 4. The recognition effect of the FPGA hardware platform.

V. CONCLUSION

In this article, we propose an improved algorithm for the deep learning YOLO network based on FPGA hardware. By redesigning the Tiny-YOLO network, replacing the activation function, and performing 16-bit fixed-point quantization on the weight parameters, the problem of insufficient resources on the FPGA chip of a mobile device is solved. At the same time, we take advantage of the parallel nature of the FPGA to accelerate the convolutional neural network, which greatly improves the speed of image recognition. Experimental results show that the method proposed in this paper reduces power consumption and greatly improves calculation speed while maintaining accuracy, and has important practical value for the implementation of convolutional neural networks on mobile terminals and for real-time computing.

REFERENCES

[1] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. of International Conference on Neural Information Processing Systems, 2012, pp. 1097-1105.
[2] I. V. Zoev, A. P. Beresnev and N. G. Markov, "Convolutional neural networks of the YOLO class in computer vision systems for mobile robotic complexes," 2019 International Siberian Conference on Control and Communications (SIBCON), Tomsk, Russia, 2019, pp. 1-5.
[3] A. Jinguji, Y. Sada and H. Nakahara, "Real-Time Multi-Pedestrian Detection in Surveillance Camera using FPGA," 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 2019, pp. 424-425, doi: 10.1109/FPL.2019.00078.
[4] V. Kathail, J. Hwang, W. Sun, Y. Chobe, T. Shui and J. Carrillo, "SDSoC: A Higher-level Programming Environment for Zynq SoC and Ultrascale+ MPSoC," Int'l Symp. on Field Programmable Gate Arrays (FPGA), pp. 4-4, Feb 2016.
[5] W. Liu, D. Anguelov, D. Erhan et al., "SSD: Single shot multibox detector," in Proc. of European Conference on Computer Vision, 2016, pp. 21-37.
[6] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, 2014, pp. 580-587, doi: 10.1109/CVPR.2014.81.
[7] R. Girshick, "Fast R-CNN," 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 2015, pp. 1440-1448, doi: 10.1109/ICCV.2015.169.
[8] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, June 2017, doi: 10.1109/TPAMI.2016.2577031.
[9] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 779-788, doi: 10.1109/CVPR.2016.91.
[10] S. Zhang, J. Cao, Q. Zhang, Q. Zhang, Y. Zhang and Y. Wang, "An FPGA-Based Reconfigurable CNN Accelerator for YOLO," 2020 IEEE 3rd International Conference on Electronics Technology (ICET), Chengdu, China, 2020, pp. 74-78, doi: 10.1109/ICET49382.2020.9119500.
[11] X. Xu and B. Liu, "FCLNN: A Flexible Framework for Fast CNN Prototyping on FPGA with OpenCL and Caffe," 2018 International Conference on Field-Programmable Technology (FPT), Naha, Okinawa, Japan, 2018, pp. 238-241, doi: 10.1109/FPT.2018.00043.
[12] G. Wei, Y. Hou, Q. Cui, G. Deng, X. Tao and Y. Yao, "YOLO Acceleration using FPGA Architecture," 2018 IEEE/CIC International Conference on Communications in China (ICCC), Beijing, China, 2018, pp. 734-735, doi: 10.1109/ICCChina.2018.8641256.
[13] G. Zhang, K. Zhao, B. Wu, Y. Sun, L. Sun and F. Liang, "A RISC-V based hardware accelerator designed for Yolo object detection system," 2019 IEEE International Conference of Intelligent Applied Systems on Engineering (ICIASE), Fuzhou, China, 2019, pp. 9-11, doi: 10.1109/ICIASE45644.2019.9074051.
[14] Y. Liang, L. Lu, Q. Xiao and S. Yan, "Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 4, pp. 857-870, April 2020, doi: 10.1109/TCAD.2019.2897701.

