Neural Network for Real-Time Object Detection on FPGA

Conference Paper · May 2021


DOI: 10.1109/ICIEAM51226.2021.9446384



2021 International Conference on Industrial Engineering, Applications and Manufacturing (ICIEAM)

Neural Network for Real-Time Object Detection on FPGA

Edward Rzaev, Anton Khanaev, Aleksandr Amerikanov
HSE University, Moscow, Russian Federation

Abstract—Object detection is one of the most active research and application areas of neural networks. In this article we combine FPGA and neural network technologies to solve the real-time object recognition problem. The article discusses the integration of the YOLOv3 neural network on the DE10-Nano FPGA board. The slightly worse values of the main metrics (mAP, FPS, inference time) obtained on a DE10-Nano board, compared with more expensive GPU-based solutions, are offset by the lower cost and smaller dimensions of the FPGA board. Based on a study of various methods for converting neural networks to FPGA, it was concluded that this architecture is applicable to detecting objects in a video stream in real time.

Keywords—FPGA, neural networks, YOLOv3, object detection, object recognition, CNN

I. INTRODUCTION

The aim of this work is to build a smart system that works in real time and is able to analyze the surrounding space. Using this project as an example, we want to demonstrate the possibilities of using FPGAs for processing a video data stream. We offer a lightweight neural network in an HDL implementation, which can be used to solve a wide range of tasks, for example, detecting and recognizing people, animals, vehicles and other objects with a computer vision algorithm. To solve the problem, we use a board with a chip of the Cyclone V family (DE10-Nano). The FPGA device makes it possible to parallelize all the necessary calculations, thereby fully utilizing its hardware resources [1]. The FPGA also combines low power consumption with high performance. Because of this, FPGAs are an excellent tool for solving these kinds of problems in embedded systems.

Significant advantages of the DE10-Nano are its low price and the presence of an ARM core, which reduces the development time of the project because peripherals can be connected and the board can be controlled at a higher level. Thus, most of the time can be devoted directly to the development and testing of the neural network. In addition, the DE10-Nano consumes significantly less power than, for example, Nvidia video cards, which are inferior to FPGAs in terms of computing power per unit of electricity.

For processing images in real time, a computing platform was chosen that offers low power consumption and high data processing speed, which makes it possible to run a neural network. Also, the relatively small size of the DE10-Nano board makes it possible to integrate it into various embedded systems.

The developed neural network is able to determine the boundaries of identified objects. Due to the ability of the HPS core to locally implement powerful data processing algorithms and parallelize their execution at the hardware level, for stable operation of the robot, it was decided to create a server that will perform this computer vision task. It also gives an opportunity to combine several neural networks on one platform, for example, to use the detector in conjunction with a neural network that recognizes speech commands [2]. Implementation on FPGA most accurately conveys the parallel architecture of neural layers and provides the flexibility to reconfigure the entire neural network and its components, the artificial neurons. In addition, the configuration of FPGA-based neural networks is easy to change.

So, the main goal of this project is multiclass recognition of objects on the FPGA. The possible applications vary depending on the requirements and wishes of the customer. By changing the target data, the system adapts to a new task without changing the hardware basis. Examples of tasks: security of institutions, counting the number of people in a queue, counting cars in a stream, detecting non-standard behavior of people in public places, detecting animals and birds in dangerous places, etc. In light of recent events, proposals for identifying coronavirus in potentially infected people based on X-ray images of their respiratory tract or on analysis of data from thermal imagers in public places will also be relevant. With the global placement of cameras in a particular country, it is possible to search for wanted people. In addition, this development may be useful in production. Intelligent video surveillance systems are able to recognize in advance signs of an impending accident in a factory or warehouse, which makes it possible to correct the causes of the accident before it occurs.

Based on all of the above, it can be concluded that success in solving problems that involve detecting objects in real time is limited only by the collection of data for a specific task. If the necessary data is available, then it is possible to train the neural network. The necessary settings can be adjusted by changing the hyperparameters.

The result of the project is an FPGA board that recognizes the surrounding space from the camera; the results are displayed on a laptop.

978-1-7281-4587-7/21/$31.00 ©2021 IEEE


II. RELATED WORKS

Thanks to its low resource requirements, YOLO makes it possible to detect objects from a camera in real time using either a processor [3] or a GPU. It uses a reduced number of layers, which can significantly increase the speed of the neural network.

Many earlier neural networks designed for object detection repurpose classifiers or localizers to perform detection. They apply the model to the image at multiple locations and scales. Areas of the image with a high score are considered detections.

The article [4] provides a comparison of different meta-architectures, which reflects the advantage of YOLOv3 in comparison with analogs. Table II in that article shows that YOLOv3, with the same or better recognition quality (metric mAP@50), has a significantly shorter image processing time (about 4-5 times). For the task of detecting objects in a video stream, high image processing speed is one of the key advantages of the YOLOv3 architecture compared to other architectures.

In the proposed neural network, a completely different approach is used: a single neural network is applied to the complete image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities. This model has several advantages over classifier-based systems. The neural network processes the entire image during testing, so its predictions are informed by the global context of the image. It also makes predictions with a single network evaluation, as opposed to systems like the Region-based Convolutional Network (R-CNN), which require thousands of evaluations for a single image. All of the above combined makes it extremely fast, over 1000 times faster than R-CNN and 100 times faster than Fast R-CNN [5].

Figure 1 depicts the process of building bounding boxes.

Fig. 1. The process of detecting objects in the image [6].

The article [7] introduces the REQ-YOLO architecture, which is based on the YOLO architecture. In fact, REQ-YOLO is a highly compressed version of the YOLO architecture for improving FPGA performance. A special feature of REQ-YOLO is its simplicity at the software and hardware levels when detecting objects. In both works, quantization of weights is used, which makes it possible to significantly reduce the number of calculations and, therefore, the amount of memory used by the neural network. Unlike the work [7], our project provides an accurate assessment of the quality of recognition and the speed of image processing by a neural network.

Table I compares YOLO with other state-of-the-art (SoTA) neural networks. It can be seen that on the Titan X, YOLO processes images at a speed of 40-90 frames per second, with a mean Average Precision (mAP) of 78.6% on Visual Object Classes (VOC) 2007 and 48.1% on COCO test-dev.

TABLE I. COMPARISON OF NEURAL NETWORKS ON VARIOUS DATA ON THE TITAN X GRAPHICS CARD.

Model             Train            mAP, %   FPS
Old YOLO          VOC 2007+2012    63.4     45
SSD300            VOC 2007+2012    74.3     46
SSD500            VOC 2007+2012    76.8     19
YOLOv2            VOC 2007+2012    76.8     67
YOLOv2 544x544    VOC 2007+2012    78.6     40
Tiny YOLO         VOC 2007+2012    57.1     207
SSD300            COCO trainval    41.2     46
SSD500            COCO trainval    46.5     19
YOLOv2 608x608    COCO trainval    48.1     40
Tiny YOLO         COCO trainval    -        200

III. METHODOLOGY

The diagram of connected devices is shown in Figure 2.

Fig. 2. The block diagram of the project.

YOLOv3 is a rather heavyweight neural network and requires a large amount of video memory and computing resources to recognize objects with high accuracy and quality. Therefore, given the limited resources of the DE10-Nano, it was decided to use a lighter version, Tiny YOLOv3. Reducing the resolution of the input image and reducing the number of layers, both for feature extraction and for object classification and location regression, made the neural network significantly lighter; however, the quality of object detection also deteriorated.
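As a rough illustration of why a smaller input and a shallower backbone lighten the network, the sketch below traces how the spatial size of the feature maps shrinks through stride-2 pooling and how many output channels a detection head needs. It assumes the standard Tiny YOLOv3 layout (five stride-2 pooling stages, three anchor boxes per scale, and 4 box coordinates plus 1 objectness score plus one score per class for each anchor). The function names are illustrative, and the network actually produced by the training framework may be laid out differently.

```python
from typing import List


def pooled_size(size: int, stride: int) -> int:
    """Spatial size after a 'same'-padded pooling layer with the given stride."""
    return -(-size // stride)  # ceiling division


def backbone_sizes(input_size: int, num_stride2_pools: int = 5) -> List[int]:
    """Feature-map side length at the input and after each stride-2 pooling stage.

    Assumption: five stride-2 pools, as in the standard Tiny YOLOv3 backbone.
    """
    sizes = [input_size]
    for _ in range(num_stride2_pools):
        sizes.append(pooled_size(sizes[-1], 2))
    return sizes


def head_channels(num_classes: int, anchors_per_scale: int = 3) -> int:
    """Output channels of one detection head.

    Assumption: each anchor predicts 4 box coordinates, 1 objectness score
    and one score per class (standard YOLOv3 head layout).
    """
    return anchors_per_scale * (5 + num_classes)


if __name__ == "__main__":
    print(backbone_sizes(416))  # [416, 208, 104, 52, 26, 13]
    print(backbone_sizes(128))  # [128, 64, 32, 16, 8, 4]
    print(head_channels(80))    # 255
    print(head_channels(18))    # 69
```

With a 416×416 input the coarsest grid is 13×13, while a 128×128 input gives only a 4×4 grid, so far fewer activations have to be computed per frame, at the price of coarser localization.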
TABLE II. TINY YOLOV3 ARCHITECTURE [8].

Layer  Type           Filters  Size/Stride  Input             Output
0      Convolutional  16       3×3/1        416 × 416 × 3     416 × 416 × 16
1      Maxpool                 2×2/2        416 × 416 × 16    208 × 208 × 16
2      Convolutional  32       3×3/1        208 × 208 × 16    208 × 208 × 32
3      Maxpool                 2×2/2        208 × 208 × 32    104 × 104 × 32
4      Convolutional  64       3×3/1        104 × 104 × 32    104 × 104 × 64
5      Maxpool                 2×2/2        104 × 104 × 64    52 × 52 × 64
6      Convolutional  128      3×3/1        52 × 52 × 64      52 × 52 × 128
7      Maxpool                 2×2/2        52 × 52 × 128     26 × 26 × 128
8      Convolutional  256      3×3/1        26 × 26 × 128     26 × 26 × 256
9      Maxpool                 2×2/2        26 × 26 × 256     13 × 13 × 256
10     Convolutional  512      3×3/1        13 × 13 × 256     13 × 13 × 512
11     Maxpool                 1×1/1        13 × 13 × 512     13 × 13 × 512
12     Convolutional  1024     3×3/1        13 × 13 × 512     13 × 13 × 1024
13     Convolutional  256      1×1/1        13 × 13 × 1024    13 × 13 × 256
14     Convolutional  512      3×3/1        13 × 13 × 256     13 × 13 × 512
15     Convolutional  255      1×1/1        13 × 13 × 512     13 × 13 × 255
16     YOLO
17     Route 13
18     Convolutional  128      1×1/1        13 × 13 × 256     13 × 13 × 128
19     Up-sampling             2×2/1        13 × 13 × 128     26 × 26 × 128
20     Route 19 8
21     Convolutional  256      3×3/1        26 × 26 × 384     26 × 26 × 256
22     Convolutional  255      1×1/1        26 × 26 × 256     26 × 26 × 255
23     YOLO

There are many different factors to take into account during the training of a neural network. The number of classes, the specifics of the problem, the size of the bounding rectangles and others are to be considered. Data-independent factors also have a big impact, for example, the choice of the correct training step (learning rate), the algorithm for backpropagating the error, and the number of pictures processed per update of the weights.

Quantization was applied to the layers of the neural network, that is, the number of bits allocated to represent one network parameter was reduced. Instead of using 32 bits for one floating-point number, 8 bits are allocated for one parameter. Since the model weights occupy approximately 2–3 times less RAM and the calculations themselves take approximately 2–2.5 times less execution time, it became possible to significantly accelerate the neural network and reduce the amount of energy consumed, albeit at the cost of a 15–20% reduction in the accuracy of the model [9].

As a dataset for training, the OpenImagesV4 [10] image set from Google was selected. This is an open dataset with almost 2 million tagged images and a hierarchical structure of classes (about 600 in total). To train the neural network, a data subset with 18 classes was used, including classes such as people, various types of furniture, transportation, and various office, kitchen and other accessories. In total, the subset contains about 28,600 images. They were downloaded using the OIDv4 Toolkit [11].

To train the neural network, the Blueoil [12] framework is used, which allows various machine learning problems to be solved using FPGAs.

The first step is to prepare a server with a GPU for training a neural network. It is worth mentioning that newer generations of Nvidia GPUs are preferred for this task, since the vast majority of libraries for developing and training neural networks are written specifically for CUDA in C or C++. The server can be either a local computer or a remote device running a Linux operating system. The development of the project was carried out on the Ubuntu 18.04 distribution. Also, a GPU driver version higher than 410 is needed. It is recommended to have about 50 GB of free space for the development. Docker must be installed on the server to get started with the project. The hardware used for training the neural network is a local computer with an Nvidia GeForce 940MX video card with 2 GB of video memory.

It is especially convenient to develop and train neural networks without first having to build an environment in which the many software components do not conflict with each other because of differing versions of the modules and libraries used, and to keep the development portable in general. That is why, for the successful operation of the entire project, it was decided to create a Docker container. It makes it possible to reproduce the project even on a completely new device or server. The developed Docker container is based on the Linux operating system (the Ubuntu 18.04 distribution) and contains both the hardware and software modules necessary for project development.

Using Blueoil, the architecture of the trained neural network was converted into a binary firmware file for the DE10-Nano board. A configuration file with the neural network weights was added to it. After that, the SD card image was modified to provide Blueoil support. Then the resulting files were added to Ubuntu on the board, the board was reprogrammed, and the necessary Python packages were installed on the FPGA board. With the camera and the rest of the peripherals properly connected, this made it possible to launch the neural network to detect objects on the DE10-Nano.

Based on the experience of working with the DE10-Nano board, it was decided to develop a cooling system for the board's chip. This board has an industrial Cyclone V chip that requires additional cooling or overheating protection, which the developers did not implement when creating the board.
Thus, it was decided to use the development from our previous project [13], which solves the problem.

Fig. 3. Board cover with cooler.

In our experiments, the DE10-Nano gives on average the following figures on the OpenImagesV4 sample from Google: FPS ≈ 28-33; mAP ≈ 29.1%. Based on these data, we can conclude that the DE10-Nano copes well with the task when compared with the top-class Titan X video card, which costs more than 10 times as much as the DE10-Nano board used.

The computing power of the DE10-Nano [14] available for solving the problem is:

• Cyclone V FPGA: 0.16 GFLOPS;
• Dual Core ARM Cortex-A9 MPCore: 2 GFLOPS.

Table III shows the data on the use of the computing resources of the board.

TABLE III. RESOURCES USED BY DE10-NANO.

Resource Usage Summary (estimates)
Resource                     Usage
Logic utilization            59%
ALUTs                        39%
Dedicated logic registers    25%
Memory blocks                57%
DSP blocks                   43%

IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. Hyperparameter Tuning

The neural network was trained on a GPU with 2 GB of GPU memory. On average, it took about 8 hours to train the model. The capacity of the GPU memory was enough for a maximum of 4 pictures per update of the model weights. It did not make sense to take a smaller number, since the training time of the neural network would increase significantly and gradient computation would become less stable. However, varying the input resolution of the pictures gave the following results:

TABLE IV. ERROR FUNCTION VALUE FOR DIFFERENT INPUT DATA FORMAT

Input resolution    Number of pictures per weight update
                    1       2       4
96×96               3.28    2.95    2.8
128×128             3.1     2.78    2.6
168×168             2.91    2.72    MemoryError

The horizontal axis represents the number of pictures. The vertical axis shows the input resolution of the picture. The rest of the training parameters were the same in each experiment.

As a result, the best value was a resolution of 128×128 pixels, as it offers a good balance between learning speed and preserving the maximum amount of information during data preprocessing.

The training step was 0.003, which decreased by a factor of 10 every 1000 updates of the parameters. There were 20,000 iterations in total.

B. Converting a model as a firmware to an FPGA

The process of transferring the developed and trained neural network to the DE10-Nano can be conditionally divided into 2 parts. The first is a computational graph, in which the entire object recognition algorithm is written, and the second is the parameters of the neural network involved in the calculations. It is advisable to load the model parameters once from the main memory of the board, while the computational graph can be represented as a binary FPGA firmware file.

The computations of the neural network run directly on the FPGA chip of the DE10-Nano, while the ARM core is used for high-level control of the board. The ARM core also handles connecting and configuring peripherals. It is possible to update the board configuration directly from a terminal, without powering the board off, using Bash and Python scripts.

C. Testing and debugging the project in real time

During the development of the project, its debugging and testing were successfully carried out. The mAP metric is 29.4%. The average FPS fluctuates around 30 frames per second, which makes it possible to successfully analyze the environment in real time. With an input image resolution of 224×224, the FPS dropped to 10 frames per second. Conversely, there was no point in going below a resolution of 128×128 pixels, since frames are then processed faster but the quality of object recognition drops to 19.7%.

V. RESULTS

This project shows that Cyclone V chips are able to handle the processing of a 128×128 video stream by a neural network in real time.

The chip cooling system developed in one of our previous projects [13] proved well suited to improving the chip's performance when processing a video stream. This system makes it possible to obtain stable FPS over a long period of time while the FPGA is in active use.

This project is notable for being much cheaper than similar solutions [15]–[17] that use expensive video cards, while the important indicators (mAP, FPS, inference time) remain acceptable for the object detection problem; that is, this project is applied in nature and workable in real-life tasks.

As a result of the project, the following metrics were achieved: mAP = 29.4% and FPS in the range [28.3, 33.4].
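Because inference time is listed among the key metrics, it can help to express the reported frame rates (28.3 to 33.4 FPS) as a per-frame time budget. The sketch below is plain arithmetic, not a measurement from the board: it converts FPS into milliseconds per frame and shows one way to compute an average frame rate from frame timestamps. The function names and the timestamp list are illustrative assumptions.

```python
from typing import Sequence


def fps_to_latency_ms(fps: float) -> float:
    """Convert a frame rate into the per-frame time budget in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def average_fps(frame_timestamps_s: Sequence[float]) -> float:
    """Average frame rate over a run, from increasing timestamps in seconds."""
    if len(frame_timestamps_s) < 2:
        raise ValueError("need at least two timestamps")
    elapsed = frame_timestamps_s[-1] - frame_timestamps_s[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    # N timestamps delimit N - 1 frame intervals
    return (len(frame_timestamps_s) - 1) / elapsed


if __name__ == "__main__":
    fps_low, fps_high = 28.3, 33.4  # FPS range reported in the text
    print(f"{fps_to_latency_ms(fps_high):.1f} ms to "
          f"{fps_to_latency_ms(fps_low):.1f} ms per frame")
    # Hypothetical timestamps: three frames captured half a second apart
    print(average_fps([0.0, 0.5, 1.0]))
```

For the reported range this gives roughly 30 to 35 ms per frame, which is the time available for capture, inference and output combined if the frame rate is to be sustained.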
Below are examples of how the neural network operates on the DE10-Nano board.

Fig. 4. An example of a demonstration of the operation of a neural network.

Fig. 5. Different examples of the operation of a neural network.

The video stream from the camera is used as input to the neural network.

To demonstrate the operation of the neural network, the processed video stream is transmitted via SSH to the working machine in real time.

Depending on the initial set of classes of objects to be detected, the recognition quality and the speed of the algorithms change. An increase in the complexity of the detection task leads to a deterioration in these characteristics.

VI. CONCLUSIONS

In this paper, the implementation of a lightweight neural network in HDL is presented. It is applicable to a wide range of tasks, for example, detecting and recognizing people, animals, vehicles and other objects, based on computer vision algorithms. This development runs on the DE10-Nano FPGA board and achieves good FPS and mAP values, which make it efficient and applicable to various tasks.

REFERENCES
[1] T. V. Huynh, “Deep neural network accelerator based on FPGA,” 2017 4th NAFOSTED Conf. on Information and Computer Science, pp. 254–257, 2017. DOI: 10.1109/NAFOSTED.2017.8108073.
[2] R. A. Solovyev, “Deep Learning Approaches for Understanding Simple Speech Commands,” 2020 IEEE 40th Int. Conf. on Electronics and Nanotechnology, pp. 688–693, 2020. DOI: 10.1109/ELNANO50318.2020.9088863.
[3] M. B. Ullah, “CPU Based YOLO: A Real Time Object Detection Algorithm,” 2020 IEEE Region 10 Symposium, pp. 552–555, 2020. DOI: 10.1109/TENSYMP50017.2020.9230778.
[4] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” Tech Rep., pp. 1–6, 2018.
[5] R. Girshick, Fast R-CNN, 2015.
[6] YOLO: Real-Time Object Detection.
[7] C. Ding, S. Wang, N. Liu, K. Xu, Y. Wang, and Y. Liang, “REQ-YOLO: A resource-aware, efficient quantization framework for object detection on FPGAs,” FPGA 2019 – Proc. 2019 ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, pp. 33–42, 2019. DOI: 10.1145/3289602.3293904.
[8] W. He, Z. Huang, Z. Wei, C. Li, and B. Guo, “TF-YOLO: An improved incremental network for real-time object detection,” Appl. Sci., vol. 9, no. 16, 2019. DOI: 10.3390/app9163225.
[9] B. Jacob, Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, 2018.
[10] A. Kuznetsova, “The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale,” Int. J. Comput. Vis., vol. 128, no. 7, pp. 1956–1981, 2018. DOI: 10.1007/s11263-020-01316-z.
[11] GitHub - EscVM/OIDv4_ToolKit: Download and visualize single or multiple classes from the huge Open Images v4 dataset.
[12] GitHub - blue-oil/blueoil: Bring Deep Learning to small devices.
[13] InnovateFPGA|EMEA|EM029 - Anthropomorphic robot on FPGA.
[14] N. Rajovic, L. Vilanova, C. Villavieja, N. Puzovic, and A. Ramirez, “The low power architecture approach towards exascale computing,” J. Comput. Sci., vol. 4, no. 6, pp. 439–443, 2013. DOI: 10.1016/j.jocs.2013.01.002.
[15] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal Speed and Accuracy of Object Detection,” arXiv, 2020.
[16] W. Liu, “SSD: Single Shot MultiBox Detector,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 9905 LNCS, pp. 21–37, 2015. DOI: 10.1007/978-3-319-46448-0_2.
[17] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, Aggregated Residual Transformations for Deep Neural Networks.
