Neural Network for Real-Time Object Detection on FPGA
Abstract—Object detection is one of the most active research and application areas of neural networks. In this article we combine FPGA and neural network technologies to solve the real-time object recognition problem. The article discusses the integration of the YOLOv3 neural network on the DE10-Nano FPGA. The slightly worse values of the main metrics (mAP, FPS, inference time) when running the neural network on a DE10-Nano board, compared with more expensive GPU-based solutions, are offset by the lower cost and smaller dimensions of the FPGA board used. Based on the results of the study of various methods for converting neural networks to FPGA, it was concluded that this architecture is applicable for detecting objects in a video stream in real time.

Keywords—FPGA, neural networks, YOLOv3, object detection, object recognition, CNN

I. INTRODUCTION

The aim of this work is to build a smart system that works in real time and is able to analyze the surrounding space. Using this project as an example, we want to demonstrate the possibilities of using FPGAs for processing a video data stream. We offer a lightweight neural network in an HDL implementation, which can be used to solve a wide range of tasks, for example, to detect and recognize people, animals, vehicles and other objects based on a computer vision algorithm. To solve the problem, we use a board with a chip of the Cyclone V family (DE10-Nano). The FPGA makes it possible to parallelize all the necessary calculations and thereby fully utilize its hardware resources [1]. It is also worth noting the low power consumption of the FPGA combined with its high performance. Because of this, FPGAs are an excellent tool for solving these kinds of problems in embedded systems.

Significant advantages of the DE10-Nano are its low price and the presence of an ARM core, which reduces the development time of the project thanks to the possibility of connecting peripherals and controlling the board at a higher level. Thus, most of the time can be devoted directly to the development and testing of the neural network. In addition, the DE10-Nano consumes significantly less power than, for example, Nvidia video cards, which are inferior to FPGAs in terms of computing power per unit of electricity.

For processing images in real time, a computing base was chosen that offers low power consumption and high data-processing speed, which makes it possible to run a neural network. Also, the relatively small size of the DE10-Nano board makes it possible to integrate it into various embedded systems.

The developed neural network is able to determine the boundaries of identified objects. Due to the ability of the HPS core to locally implement powerful data processing algorithms and parallelize their execution at the hardware level, it was decided, for stable operation of the robot, to create a server that performs this computer vision task. It also gives an opportunity to combine several neural networks on one platform, for example, to use it in conjunction with a neural network that recognizes speech commands [2]. Implementation on FPGA most accurately conveys the parallel architecture of neural layers and provides the flexibility to reconfigure the entire neural network and its components, the artificial neurons. In addition, the configuration of FPGA-based neural networks is easy to change.

So, the main goal of this project is multiclass recognition of objects on the FPGA. Possible applications vary depending on the requirements and wishes of the customer. Thus, by changing the target data, the system adapts to a new task without changing the hardware basis. Examples of tasks: securing institutions; counting the number of people in a queue; counting cars in a traffic stream; detecting non-standard behavior of people in public places; detecting animals and birds in dangerous places, etc. In light of recent events, proposals for identifying coronavirus in potentially infected people based on x-ray images of their respiratory tract, or for analyzing data from thermal imagers in public places, will also be relevant. With the wide deployment of cameras in a particular country, it is possible to search for wanted people. In addition, this development may be useful in industry: intelligent video surveillance systems are able to recognize signs of an impending accident in a factory or warehouse in advance, which makes it possible to eliminate the causes of an accident before it occurs.

Based on all of the above, it can be concluded that success in solving real-time object detection problems is limited only by the collection of data for a specific task. If the necessary data is available, the neural network can be trained, and the necessary settings can be adjusted by changing the hyperparameters.

The result of the project is an FPGA board that recognizes the surrounding space from a camera; the output of the results is displayed on a laptop.
The article [7] introduces the REQ-YOLO architecture, which is based on the YOLO architecture. In essence, REQ-YOLO is a highly compressed version of YOLO aimed at improving FPGA performance. A special feature of REQ-YOLO is its simplicity at the software and hardware levels when detecting objects. In both works, quantization of the weights is used, which makes it possible to significantly reduce the number of calculations and, therefore, the amount of memory used by the neural network. Unlike the work [7], our project provides an accurate assessment of the recognition quality and of the image processing speed of the neural network.

YOLOv3 is a rather heavyweight neural network and requires a large amount of video memory and computing resources to recognize objects with high accuracy and quality. Therefore, given the limited resources of the DE10-Nano, it was decided to use a lighter version, Tiny YOLOv3. Reducing the resolution of the input image and reducing the number of layers, both in the feature extraction part and in the part responsible for object classification and bounding-box regression, made it possible to significantly lighten the neural network; however, the quality of object detection also deteriorated.
TABLE II. TINY YOLOV3 ARCHITECTURE [8]

Layer  Type           Filters  Size/Stride  Input            Output
0      Convolutional  16       3×3/1        416 × 416 × 3    416 × 416 × 16
1      Maxpool                 2×2/2        416 × 416 × 16   208 × 208 × 16
2      Convolutional  32       3×3/1        208 × 208 × 16   208 × 208 × 32
3      Maxpool                 2×2/2        208 × 208 × 32   104 × 104 × 32
4      Convolutional  64       3×3/1        104 × 104 × 32   104 × 104 × 64
5      Maxpool                 2×2/2        104 × 104 × 64   52 × 52 × 64
6      Convolutional  128      3×3/1        52 × 52 × 64     52 × 52 × 128
7      Maxpool                 2×2/2        52 × 52 × 128    26 × 26 × 128
8      Convolutional  256      3×3/1        26 × 26 × 128    26 × 26 × 256
9      Maxpool                 2×2/2        26 × 26 × 256    13 × 13 × 256
10     Convolutional  512      3×3/1        13 × 13 × 256    13 × 13 × 512
11     Maxpool                 2×2/1        13 × 13 × 512    13 × 13 × 512
12     Convolutional  1024     3×3/1        13 × 13 × 512    13 × 13 × 1024
13     Convolutional  256      1×1/1        13 × 13 × 1024   13 × 13 × 256
14     Convolutional  512      3×3/1        13 × 13 × 256    13 × 13 × 512
15     Convolutional  255      1×1/1        13 × 13 × 512    13 × 13 × 255
16     YOLO
17     Route 13
18     Convolutional  128      1×1/1        13 × 13 × 256    13 × 13 × 128
19     Up-sampling             2×2/1        13 × 13 × 128    26 × 26 × 128
20     Route 19 8
21     Convolutional  256      3×3/1        26 × 26 × 384    26 × 26 × 256
22     Convolutional  255      1×1/1        26 × 26 × 256    26 × 26 × 255
23     YOLO
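The Input and Output columns of Table II follow directly from the kernel sizes and strides. A minimal sketch that recomputes them in plain Python is given below; it assumes 'same' padding for the convolutions and ceiling division for the pooling layers, and covers the backbone rows 0–15 only.

# Minimal sketch: recompute the Input/Output columns of Table II from the
# layer specifications alone, assuming 'same' padding for the convolutions
# and ceiling division for the maxpool layers. Covers backbone rows 0-15.
import math

# (layer kind, output channels, kernel size, stride); None keeps the input depth
LAYERS = [
    ("conv", 16, 3, 1), ("pool", None, 2, 2),
    ("conv", 32, 3, 1), ("pool", None, 2, 2),
    ("conv", 64, 3, 1), ("pool", None, 2, 2),
    ("conv", 128, 3, 1), ("pool", None, 2, 2),
    ("conv", 256, 3, 1), ("pool", None, 2, 2),
    ("conv", 512, 3, 1), ("pool", None, 2, 1),
    ("conv", 1024, 3, 1), ("conv", 256, 1, 1),
    ("conv", 512, 3, 1), ("conv", 255, 1, 1),
]

def trace(size=416, channels=3):
    """Print the feature-map shape after every backbone layer."""
    for idx, (kind, out_channels, _kernel, stride) in enumerate(LAYERS):
        if kind == "conv":
            size = size // stride            # 'same' padding keeps H, W for stride 1
            channels = out_channels
        else:
            size = math.ceil(size / stride)  # 2x2 maxpool with 'same'-style padding
        print(f"{idx:2d} {kind:4s} -> {size} x {size} x {channels}")

if __name__ == "__main__":
    trace()  # reproduces the Output column for rows 0-15 of Table II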
There are many different factors to take into account when training a neural network. The number of classes, the specifics of the problem, the size of the bounding rectangles and others have to be considered. Data-independent factors also have a big impact, for example, the choice of the correct training step, the error backpropagation algorithm, and the number of processed pictures per update of the weights.

Quantization was applied to the layers of the neural network, that is, a reduction in the number of bits allocated to represent one network parameter. So, instead of using 32 bits for one floating-point number, 8 bits are allocated per parameter. Since the model weights occupy approximately 2–3 times less RAM and the calculations themselves take approximately 2–2.5 times less execution time, it became possible to significantly accelerate the neural network and reduce the amount of energy consumed, albeit at the cost of a 15–20% reduction in the accuracy of the model [9].
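To make the idea concrete, a minimal sketch of post-training 8-bit affine quantization of a single weight tensor is shown below. It illustrates the principle from [9] only and is not the quantization scheme of the toolchain actually used in this project.

# Minimal sketch of post-training 8-bit affine quantization of one weight
# tensor (in the spirit of [9]); an illustration of the principle, not the
# quantization scheme of the toolchain used in this project.
import numpy as np

def quantize_uint8(weights: np.ndarray):
    """Map float32 weights to uint8 values plus a scale and zero point."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0 or 1.0           # guard against a constant tensor
    zero_point = int(round(-w_min / scale))
    q = np.clip(np.round(weights / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an approximation of the original float32 weights."""
    return (q.astype(np.float32) - zero_point) * scale

if __name__ == "__main__":
    w = np.random.randn(3, 3, 16, 32).astype(np.float32)   # e.g. a 3x3 conv kernel
    q, scale, zero_point = quantize_uint8(w)
    error = np.abs(w - dequantize(q, scale, zero_point)).max()
    print(f"max round-trip error with 8-bit weights: {error:.6f}")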
As a dataset for training, the OpenImages V4 [10] image set from Google was selected. This is an open dataset containing almost 2 million tagged images with a hierarchical structure of classes (about 600 of them). To train the neural network, a data subset with 18 classes was used, including classes such as people, various types of furniture, transportation, and various office, kitchen and other accessories. In total, the subset contains about 28,600 images. They were downloaded using the OIDv4 Toolkit [11].
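Open Images V4 stores its bounding boxes as normalized corner coordinates in CSV files, whereas YOLO-style pipelines usually expect one text file per image with normalized center coordinates and sizes. A possible conversion is sketched below; the class list and file paths are placeholders, and only the CSV column names (ImageID, LabelName, XMin, XMax, YMin, YMax) come from the Open Images annotation format.

# Sketch: convert Open Images V4 box annotations (normalized corner
# coordinates in a CSV) into YOLO-style label files, one .txt per image with
# "class_id x_center y_center width height". CLASSES and the paths are
# placeholders for the real 18-class configuration.
import csv
from collections import defaultdict
from pathlib import Path

CLASSES = ["/m/01g317"]                 # e.g. the label ID for Person; extend to 18 classes
CLASS_INDEX = {label: i for i, label in enumerate(CLASSES)}

def convert(annotations_csv: str, out_dir: str) -> None:
    """Write one YOLO label file per image found in the annotation CSV."""
    boxes = defaultdict(list)
    with open(annotations_csv, newline="") as f:
        for row in csv.DictReader(f):
            if row["LabelName"] not in CLASS_INDEX:
                continue                # skip classes outside the chosen subset
            x_min, x_max = float(row["XMin"]), float(row["XMax"])
            y_min, y_max = float(row["YMin"]), float(row["YMax"])
            center_x, center_y = (x_min + x_max) / 2, (y_min + y_max) / 2
            width, height = x_max - x_min, y_max - y_min
            boxes[row["ImageID"]].append(
                f"{CLASS_INDEX[row['LabelName']]} {center_x:.6f} {center_y:.6f} "
                f"{width:.6f} {height:.6f}")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for image_id, lines in boxes.items():
        (out / f"{image_id}.txt").write_text("\n".join(lines))

if __name__ == "__main__":
    convert("train-annotations-bbox.csv", "labels/train")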
To train the neural network, the BlueOil [12] framework is used, which makes it possible to solve various machine learning problems on FPGAs.

The first step is to prepare a server with a GPU for training the neural network. It is worth mentioning that newer generations of Nvidia GPUs are preferred for this task, since the vast majority of libraries for developing and training neural networks are written specifically for CUDA kernels in C or C++. The server can be either a local computer or a remote machine running a Linux operating system. The development of the project was carried out on the Ubuntu 18.04 distribution. GPU drivers newer than version 410 are also needed, and it is recommended to have about 50 GB of free disk space for development. Docker must be installed on the server to get started with the project. The hardware used for training the neural network was a local computer with an Nvidia GeForce 940MX video card with 2 GB of video memory.

The ability to develop and train neural networks without manually creating an environment in which all the software components do not conflict with each other, despite differences in the versions of the modules and libraries used, and the resulting portability of the development in general, is especially convenient. That is why it was decided to build the entire project around a Docker container, which makes it possible to reproduce the project even on a completely new device or server. The developed Docker container is based on the Linux operating system, the Ubuntu 18.04 distribution, and contains both the hardware-related and software modules necessary for project development.

The architecture of the trained neural network was converted with BlueOil into a binary file for the firmware of the DE10-Nano board, and a configuration file with the neural network weights was added to it. After that, the SD card image was modified to provide BlueOil support. The received files were then copied to Ubuntu on the board, the board was reprogrammed, and the necessary Python packages were installed on the FPGA board. This made it possible, with proper connection of the camera and the rest of the peripherals, to launch a neural network that detects objects on the DE10-Nano.
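A rough sketch of how the on-board detection loop can be organized is given below. The run_inference function is a hypothetical placeholder for the call into the BlueOil-generated runtime; the camera capture, resizing and drawing use standard OpenCV calls.

# Sketch of the on-board detection loop. run_inference() is a hypothetical
# placeholder for the call into the BlueOil-generated runtime; the capture,
# resizing and drawing use standard OpenCV calls.
import cv2
import numpy as np

INPUT_SIZE = 128   # input resolution used for training (see the 128x128 result below)

def run_inference(blob: np.ndarray) -> list:
    """Placeholder: should return a list of (class_id, confidence, x, y, w, h) boxes."""
    return []       # integrate the actual runtime call here

def detection_loop(camera_index: int = 0) -> None:
    cap = cv2.VideoCapture(camera_index)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # Resize to the network input resolution and scale pixels to [0, 1].
            blob = cv2.resize(frame, (INPUT_SIZE, INPUT_SIZE)).astype(np.float32) / 255.0
            for class_id, confidence, x, y, w, h in run_inference(blob):
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
                cv2.putText(frame, f"{class_id}: {confidence:.2f}", (x, y - 5),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
            cv2.imshow("detections", frame)   # or hand the frame to the streaming code
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    finally:
        cap.release()
        cv2.destroyAllWindows()

if __name__ == "__main__":
    detection_loop()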
Based on the experience of working with the DE10-Nano board, it was decided to develop a cooling system for the board's chip. This board carries an industrial Cyclone V chip that requires additional cooling or overheating protection, which the developers did not provide when creating the board. Thus, it was decided to use the development from our previous project [13], which solves this problem.

Fig. 4. An example of the operation of the neural network.

Fig. 5. Different examples of the operation of the neural network.

The video stream from the camera is used as input to the neural network.

To demonstrate the operation of the neural network, the processed video stream is transmitted via SSH to the working machine in real time.
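The exact transport is not fixed in the paper, so the sketch below is only one possible arrangement: the board-side script writes length-prefixed JPEG frames to standard output, and the laptop launches that script over ssh and displays the decoded frames. The host name and the script name board_stream.py are placeholders.

# Sketch of streaming processed frames from the board to the laptop over SSH.
# Assumption: the board-side script writes length-prefixed JPEG frames to
# stdout; the laptop runs it via ssh and displays them. Names are placeholders.
import struct
import subprocess
import sys

import cv2
import numpy as np

def send_frame(frame) -> None:
    """Board side: JPEG-encode a frame and write it to stdout with a length prefix."""
    ok, jpeg = cv2.imencode(".jpg", frame)
    if ok:
        data = jpeg.tobytes()
        sys.stdout.buffer.write(struct.pack(">I", len(data)) + data)
        sys.stdout.buffer.flush()

def view_remote(host: str = "root@de10-nano") -> None:
    """Laptop side: launch the board script over SSH and show incoming frames."""
    proc = subprocess.Popen(["ssh", host, "python3", "board_stream.py"],
                            stdout=subprocess.PIPE)
    while True:
        header = proc.stdout.read(4)
        if len(header) < 4:
            break
        size = struct.unpack(">I", header)[0]
        payload = proc.stdout.read(size)
        frame = cv2.imdecode(np.frombuffer(payload, dtype=np.uint8), cv2.IMREAD_COLOR)
        cv2.imshow("DE10-Nano stream", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    proc.terminate()
    cv2.destroyAllWindows()

if __name__ == "__main__":
    view_remote()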
Depending on the initial set of object classes to be detected, the recognition quality and the speed of the algorithms change. An increase in the complexity of the detection task leads to a deterioration of these characteristics.

The horizontal axis represents the number of pictures, and the vertical axis shows the input resolution of the pictures. The rest of the training parameters were the same in each experiment.

As a result, the optimal value turned out to be a resolution of 128×128 pixels, as it is excellent in terms of learning speed and preserves the maximum amount of information during data preprocessing.

The training step was 0.003, and it decreased by a factor of 10 every 1000 updates of the parameters. There were 20,000 iterations in total.
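This corresponds to a simple step-decay schedule, which can be written down directly from the numbers above:

# Sketch of the step-decay schedule described above: start at 0.003 and
# divide the learning rate by 10 every 1000 parameter updates.
BASE_LR = 0.003
DECAY_FACTOR = 10
DECAY_EVERY = 1000
TOTAL_STEPS = 20_000     # total number of parameter updates

def learning_rate(step: int) -> float:
    """Learning rate at a given parameter update (0-indexed)."""
    return BASE_LR / (DECAY_FACTOR ** (step // DECAY_EVERY))

if __name__ == "__main__":
    for step in (0, 999, 1000, 1999, 2000, 3000):
        print(step, learning_rate(step))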
REFERENCES
[1] T. V. Huynh, “Deep neural network accelerator based on FPGA,” 2017 4th NAFOSTED Conf. on Information and Computer Science, pp. 254–257, 2017. DOI: 10.1109/NAFOSTED.2017.8108073.
[2] R. A. Solovyev, “Deep Learning Approaches for Understanding Simple Speech Commands,” 2020 IEEE 40th Int. Conf. on Electronics and Nanotechnology, pp. 688–693, 2020. DOI: 10.1109/ELNANO50318.2020.9088863.
[3] M. B. Ullah, “CPU Based YOLO: A Real Time Object Detection Algorithm,” 2020 IEEE Region 10 Symposium, pp. 552–555, 2020. DOI: 10.1109/TENSYMP50017.2020.9230778.
[4] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” Tech. Rep., pp. 1–6, 2018.
[5] R. Girshick, “Fast R-CNN,” 2015.
[6] YOLO: Real-Time Object Detection.
[7] C. Ding, S. Wang, N. Liu, K. Xu, Y. Wang, and Y. Liang, “REQ-YOLO: A resource-aware, efficient quantization framework for object detection on FPGAs,” FPGA 2019 – Proc. 2019 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, pp. 33–42, 2019. DOI: 10.1145/3289602.3293904.
[8] W. He, Z. Huang, Z. Wei, C. Li, and B. Guo, “TF-YOLO: An improved incremental network for real-time object detection,” Appl. Sci., vol. 9, no. 16, 2019. DOI: 10.3390/app9163225.
[9] B. Jacob, “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference,” 2018.
[10] A. Kuznetsova, “The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale,” Int. J. Comput. Vis., vol. 128, no. 7, pp. 1956–1981, 2018. DOI: 10.1007/s11263-020-01316-z.
[11] GitHub – EscVM/OIDv4_ToolKit: Download and visualize single or multiple classes from the huge Open Images v4 dataset.
[12] GitHub – blue-oil/blueoil: Bring Deep Learning to small devices.
[13] InnovateFPGA | EMEA | EM029 – Anthropomorphic robot on FPGA.
[14] N. Rajovic, L. Vilanova, C. Villavieja, N. Puzovic, and A. Ramirez, “The low power architecture approach towards exascale computing,” J. Comput. Sci., vol. 4, no. 6, pp. 439–443, 2013. DOI: 10.1016/j.jocs.2013.01.002.
[15] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal Speed and Accuracy of Object Detection,” arXiv, 2020.
[16] W. Liu, “SSD: Single Shot MultiBox Detector,” Lect. Notes Comput. Sci., vol. 9905 LNCS, pp. 21–37, 2015. DOI: 10.1007/978-3-319-46448-0_2.
[17] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated Residual Transformations for Deep Neural Networks.”