An Improved Algorithm For Deep Learning YOLO Network Based On Xilinx ZYNQ FPGA
Zhenguang Li1,2, Jintao Wang1,2*
1 School of Information and Communication Engineering, Communication University of China
2 Key Laboratory of Media Audio and Video (Communication University of China), Ministry of Education
Beijing, China
[email protected], [email protected]
*Corresponding author: Jintao Wang
Abstract—With the development of artificial intelligence, convolutional neural networks (CNNs) have been widely used in image processing and other fields due to their excellent performance. However, as computationally intensive algorithms, CNNs face huge challenges when deployed on mobile devices. FPGAs offer high performance, reprogrammability, and low power consumption, and have become a suitable choice for CNN deployment. Among CNN-based detection algorithms, YOLO treats target detection as a regression problem. It is a one-step algorithm with fast execution speed and a small amount of calculation, which makes it suitable for implementation on FPGA hardware platforms. This paper proposes an improved algorithm for the deep learning YOLO network based on the Xilinx ZYNQ FPGA. Through measures such as optimizing the YOLO network model and fixed-point quantization, the problems of the large computational load of CNNs and the limited resources on FPGA chips are addressed, and the parallelism of the FPGA is used to accelerate the CNN. Experimental results show that the proposed method greatly improves the operation rate while maintaining accuracy, and has important practical value for deploying CNNs on mobile terminals and for real-time computing.

Keywords-CNN, FPGA-ZYNQ, YOLO, acceleration

I. INTRODUCTION

In recent years, CNNs have become one of the research hotspots in the field of artificial intelligence [1], and have shown excellent performance in many areas such as computer vision, speech recognition, and natural language processing [2]. Deep convolutional neural network algorithms are computationally intensive. As CNNs have developed, their time complexity and space complexity have gradually increased, so implementing them on mobile devices faces enormous challenges [3]. As a field-programmable logic device, the FPGA has the advantages of low power consumption, low latency, and flexibility, and it is increasingly used to accelerate the forward inference process of CNNs. Among FPGA families, Xilinx's ZYNQ [4] series chips use an ARM+FPGA system architecture, comprising an ARM Cortex-A9 dual-core processing system (PS) and Xilinx 28nm medium-scale programmable logic (PL). The series supports Xilinx's SDSoC development flow, enabling rapid deployment and algorithm verification of CNN models on FPGA hardware.

Among the various convolutional neural networks, the most representative ones are SSD [5], the R-CNN family (R-CNN, Fast R-CNN, and Faster R-CNN) [6][7][8], and YOLO [9]. Table I shows the accuracy and speed comparison of the main target recognition algorithms. Among them, the YOLO algorithm is a single-neural-network target detection system proposed by Joseph Redmon, Ali Farhadi, and colleagues. It treats target detection as a regression problem and is a one-step algorithm: a picture needs only one pass through the CNN, which directly outputs the categories and the corresponding positioning information. Therefore, compared with other algorithms, YOLO executes very fast and requires a relatively small amount of calculation, which makes it suitable for implementation on the FPGA hardware platform [10].

This paper presents an improved algorithm for the deep learning YOLO network based on the Xilinx ZYNQ FPGA, which exploits the hardware characteristics of the FPGA to accelerate the network. The acceleration algorithm mainly addresses the problem that the CNN requires a large amount of calculation while the FPGA has limited on-chip resources, and it uses the parallelism of the FPGA to accelerate the CNN. The algorithm implementation mainly includes three parts: FPGA-YOLO network design, activation function selection, and 16-bit fixed-point optimization of the weight parameters.

Authorized licensed use limited to: UNIVERSITY OF WESTERN ONTARIO. Downloaded on May 27,2021 at 22:58:07 UTC from IEEE Xplore. Restrictions apply.
TABLE I. COMPARISON OF ACCURACY AND SPEED OF MAIN ALGORITHMS

Algorithm  Accuracy (mAP, %)  Speed (FPS)
YOLO       63.4               45
function outputs y=0. Compared with the Leaky function, the ReLU function is simpler to implement in circuitry and saves computation time.

$$y = \begin{cases} x, & x \ge 0 \\ kx, & x < 0 \end{cases} \tag{1}$$

$$y = \begin{cases} x, & x \ge 0 \\ 0, & x < 0 \end{cases} \tag{2}$$
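The two activation functions in (1) and (2) can be compared side by side in a minimal Python sketch. The slope `k = 0.1` and the helper `relu_q16` are illustrative assumptions, not values or names taken from this paper; the sketch is only meant to show why (2) maps to simpler hardware: on a signed fixed-point word it reduces to a sign test, whereas (1) needs a multiplication for negative inputs.

```python
def leaky_relu(x, k=0.1):
    """Eq. (1): y = x for x >= 0, y = k*x otherwise (k = 0.1 is an assumed slope)."""
    return x if x >= 0 else k * x


def relu(x):
    """Eq. (2): y = x for x >= 0, y = 0 otherwise."""
    return x if x >= 0 else 0.0


def relu_q16(q):
    """ReLU on a signed 16-bit fixed-point integer (hypothetical helper).

    Only the sign is inspected, so no multiplier is needed.
    """
    return q if q >= 0 else 0


if __name__ == "__main__":
    for value in (-2.0, -0.5, 0.0, 1.5):
        print(value, leaky_relu(value), relu(value))
    print(relu_q16(-1234), relu_q16(1234))
```

In this sketch the negative branch of `leaky_relu` is the only place a multiply appears, which is the extra cost the ReLU choice avoids.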
[Figure 2: block diagram. Recoverable labels: HDMI Input, HDMI Output, DDR, DMA, DMA driver, App, CNN, FPGA-YOLO, Socket, Linux, Linux PC, ZYNQ, PL (FPGA), PS (ARM)]
Figure 2. ZYNQ architecture diagram

TABLE II. PARAMETER CONFIGURATION OF THE FPGA-YOLO MODEL

Layer  Type  Filters  Size/Stride  Input        Output
1      C     16       3×3/1        640×352×3    640×352×16
2      M              2×2/2        640×352×16   320×176×16
3      C     32       3×3/1        320×176×16   320×176×32
4      M              2×2/2        320×176×32   160×88×32
5      C     64       3×3/1        160×88×32    160×88×64
6      M              2×2/2        160×88×64    80×44×64
7      C     128      3×3/1        80×44×64     80×44×128
8      M              2×2/2        80×44×128    40×22×128
9      C     256      3×3/1        40×22×128    40×22×256
13     C     512      3×3/1        20×11×512    20×11×512

C. Fixed-point Optimization

Floating-point numbers and fixed-point numbers are two representations of data in the machine. The benefit of floating-point representation is that it has a larger representation range and higher accuracy. However, in hardware design, compared with floating-point operations, FPGA fixed-point operations can significantly reduce resource consumption, shorten clock cycles, and reduce power consumption.

After training the FPGA-YOLO network with the Darknet open source framework, the output is a 32-bit weight parameter file, which contains a large amount of data. Caffe-Ristretto is an automatic CNN quantization tool that can compress 32-bit floating-point networks, allowing networks to be tested, trained, and fine-tuned with limited numerical precision. Due to the limited internal resources of the FPGA, we chose the Caffe-Ristretto open source tool to perform fixed-point quantization of the 32-bit weight parameters.

TABLE III. COMPARISON OF THE WEIGHT PARAMETERS OF DIFFERENT DIGIT REPRESENTATIONS

32bit-fixedpoint 50.6M 1
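The idea of converting 32-bit floating-point weights to 16-bit fixed point can be illustrated with a short Python sketch. It does not reproduce the procedure of Caffe-Ristretto; the function names, the per-tensor choice of fractional length, and the sample weights are all assumptions made for illustration. The sketch picks the number of fractional bits so that the largest weight magnitude fits in a signed 16-bit word, then rounds and saturates each weight.

```python
import math


def choose_frac_bits(weights, total_bits=16):
    """Pick a fractional length so the largest |w| fits in a signed word.

    Assumes the largest magnitude is well below 2**(total_bits - 1).
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return total_bits - 1
    # Bits needed left of the binary point (the sign bit is counted separately).
    int_bits = max(0, math.floor(math.log2(max_abs)) + 1)
    return total_bits - 1 - int_bits


def quantize(weights, frac_bits, total_bits=16):
    """Scale by 2**frac_bits, round to nearest, and saturate to the word range."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return [min(hi, max(lo, round(w * scale))) for w in weights]


def dequantize(q_values, frac_bits):
    """Map fixed-point integers back to real values for error checking."""
    scale = 1 << frac_bits
    return [q / scale for q in q_values]


if __name__ == "__main__":
    sample = [0.75, -0.5, 0.1234, -1.5]  # made-up weights, not from the paper
    fb = choose_frac_bits(sample)
    q = quantize(sample, fb)
    restored = dequantize(q, fb)
    worst = max(abs(a - b) for a, b in zip(sample, restored))
    print("fractional bits:", fb)
    print("quantized:", q)
    print("worst-case error:", worst)
```

For values that are not saturated, the rounding error of this scheme is bounded by half of one least significant bit, that is, 2^-(frac_bits+1), which is why a shorter word length gives a coarser approximation of the trained weights.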
fixed-point weight parameter file is deployed to the FPGA, the calculation error will cause the target to not be correctly identified. Although the 32-bit fixed-point number has a good recognition effect, the size of the data is not compressed at all. So we finally chose 16-bit fixed-point.

III. SYSTEM IMPLEMENTATION

In the hardware system design, we chose Xilinx's high-performance ZYNQ SoC FPGA chip to implement the system architecture. The hardware platform uses the ARM+FPGA architecture of the ZYNQ chip and includes a minimal system module, the FPGA-YOLO network IP core module, an input HDMI module, an output HDMI module, an SD card driver module, a DDR module, and an interconnection module. The minimal system module is based on the ARM core, is used for embedded development, and handles task scheduling for the entire system. The FPGA-YOLO network IP core module is responsible for target recognition on the input image. The HDMI modules are responsible for image input and output. The SD card driver module is the external storage of the FPGA-YOLO platform and stores the platform startup files and application files. The DDR module is the platform's buffer. The interconnection module is responsible for scheduling and combining the various modules. The system implementation process is shown in Figure 3.

[Figure 3: system implementation flowchart. Recoverable box labels: System initialization; Load weight parameter file to DDR; Initialize image DMA; Query accelerator register status; Initialize data channel DMA; Memcpy gets the buffered image to be processed; Send data to the accelerator; Get results from accelerator; FPGA-YOLO algorithm post-processing; OpenCV processes and outputs images]

IV. RESULTS AND ANALYSIS

We use the KITTI dataset (currently the world's largest evaluation dataset for computer vision algorithms in autonomous driving scenarios) to train the original Tiny-YOLO network and the optimized FPGA-YOLO network. We run experiments on various platforms, compare network performance, and analyze the experimental results from three aspects: recognition speed, recognition accuracy, and power consumption.

TABLE IV. MULTI-PLATFORM TINY-YOLO NETWORK PERFORMANCE TEST

Type               CPU                    GPU          FPGA            FPGA
Platform           Intel® Core™ i5-6200U  GTX Titan X  Virtex-7 VC707  ZYNQ7035 (This Work)
Frequency          2.3GHz                 1GHz         200MHz          50MHz
Storage            8G DDR3                12G DDR5     2G DDR3         2G DDR3
Speed (s)          2.331145               0.008451     0.192643        0.051915
Accuracy (mAP, %)  51.3                   52.7         48.2            48.5

We tested the same pictures on the CPU, GPU, and FPGA platforms, and the test results are shown in Table IV. The experimental results show that the average time for the FPGA platform designed in this paper to recognize a single-frame image is 0.051915 s, a speedup of about 44.9 times over the CPU's 2.331145 s. The processing speed is still lower than that of the dedicated GPU, but compared with the GPU, the power consumption is greatly reduced. In terms of recognition accuracy, we use the mAP value to measure the recognition result. The average detection accuracy of the FPGA platform designed in this paper is 48.5%, which is slightly lower than that of the CPU and GPU platforms but still a high detection accuracy.

Figure 4 shows the recognition effect of the FPGA hardware platform. The input image is a color image with a resolution of 1280×720. The platform successfully identified pedestrians and vehicles in the picture and marked them with bounding boxes.
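The speed comparison in Table IV can be checked with a few lines of Python. The times are copied from the table; the dictionary keys and helper names are illustrative choices, and the frames-per-second figures are simply the reciprocals of the single-frame times rather than separately measured values.

```python
# Mean single-frame recognition times in seconds, copied from Table IV.
TIMES_S = {
    "cpu_i5_6200u": 2.331145,
    "gpu_titan_x": 0.008451,
    "fpga_vc707": 0.192643,
    "fpga_zynq7035": 0.051915,  # this work
}


def speedup(baseline_s, target_s):
    """Ratio of the baseline time to the target time (>1 means target is faster)."""
    return baseline_s / target_s


def frames_per_second(frame_time_s):
    """Throughput implied by a single-frame latency, assuming no pipelining."""
    return 1.0 / frame_time_s


if __name__ == "__main__":
    ours = TIMES_S["fpga_zynq7035"]
    for name, t in TIMES_S.items():
        print(f"{name:>14}: {frames_per_second(t):8.2f} frames/s, "
              f"time relative to ZYNQ7035 = {speedup(t, ours):6.2f}x")
```

The CPU row reproduces the factor of about 44.9 quoted in the text, and the ZYNQ7035 latency corresponds to roughly 19 frames per second under the no-pipelining assumption.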
(a) Cars and Pedestrians
(b) Face
Figure 4. The recognition effect of FPGA hardware platform

V. CONCLUSION

In this article, we propose an improved algorithm for the deep learning YOLO network based on FPGA hardware. By redesigning the Tiny-YOLO network, replacing the activation function, and performing 16-bit fixed-point operations on the weight parameters, the problem of insufficient resources on the FPGA chip of the mobile device is solved. At the same time, we take advantage of the parallel nature of the FPGA to accelerate the convolutional neural network, which greatly improves the speed of image recognition. Experimental results show that the proposed method reduces power consumption and greatly improves the speed of calculation while maintaining accuracy, and has important practical value for implementing convolutional neural networks on mobile terminals and for real-time computing.

REFERENCES

[1] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. of International Conference on Neural Information Processing Systems, 2012, pp. 1097-1105.
[2] I. V. Zoev, A. P. Beresnev and N. G. Markov, "Convolutional neural networks of the YOLO class in computer vision systems for mobile robotic complexes," 2019 International Siberian Conference on Control and Communications (SIBCON), Tomsk, Russia, 2019, pp. 1-5.
[3] A. Jinguji, Y. Sada and H. Nakahara, "Real-Time Multi-Pedestrian Detection in Surveillance Camera using FPGA," 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 2019, pp. 424-425, doi: 10.1109/FPL.2019.00078.
[4] V. Kathail, J. Hwang, W. Sun, Y. Chobe, T. Shui and J. Carrillo, "SDSoC: A Higher-level Programming Environment for Zynq SoC and Ultrascale+ MPSoC," Int'l Symp. on Field Programmable Gate Arrays (FPGA), Feb. 2016, pp. 4-4.
[5] W. Liu, D. Anguelov and D. Erhan, "SSD: Single shot multibox detector," in Proc. of European Conference on Computer Vision, 2016, pp. 21-37.
[6] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, 2014, pp. 580-587, doi: 10.1109/CVPR.2014.81.
[7] R. Girshick, "Fast R-CNN," 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 2015, pp. 1440-1448, doi: 10.1109/ICCV.2015.169.
[8] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 1 June 2017, doi: 10.1109/TPAMI.2016.2577031.
[9] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 779-788, doi: 10.1109/CVPR.2016.91.
[10] S. Zhang, J. Cao, Q. Zhang, Q. Zhang, Y. Zhang and Y. Wang, "An FPGA-Based Reconfigurable CNN Accelerator for YOLO," 2020 IEEE 3rd International Conference on Electronics Technology (ICET), Chengdu, China, 2020, pp. 74-78, doi: 10.1109/ICET49382.2020.9119500.
[11] X. Xu and B. Liu, "FCLNN: A Flexible Framework for Fast CNN Prototyping on FPGA with OpenCL and Caffe," 2018 International Conference on Field-Programmable Technology (FPT), Naha, Okinawa, Japan, 2018, pp. 238-241, doi: 10.1109/FPT.2018.00043.
[12] G. Wei, Y. Hou, Q. Cui, G. Deng, X. Tao and Y. Yao, "YOLO Acceleration using FPGA Architecture," 2018 IEEE/CIC International Conference on Communications in China (ICCC), Beijing, China, 2018, pp. 734-735, doi: 10.1109/ICCChina.2018.8641256.
[13] G. Zhang, K. Zhao, B. Wu, Y. Sun, L. Sun and F. Liang, "A RISC-V based hardware accelerator designed for Yolo object detection system," 2019 IEEE International Conference of Intelligent Applied Systems on Engineering (ICIASE), Fuzhou, China, 2019, pp. 9-11, doi: 10.1109/ICIASE45644.2019.9074051.
[14] Y. Liang, L. Lu, Q. Xiao and S. Yan, "Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 4, pp. 857-870, April 2020, doi: 10.1109/TCAD.2019.2897701.