
ABSTRACT

Advancements in image and video processing have been growing over the years for

industrial robots, autonomous vehicles, cryptography, surveillance, medical imaging and

computer-human interaction applications. One of the major challenges in real-time image

and video processing is the execution of complex functions and high computational tasks.

To overcome this issue, a hardware acceleration of different filter algorithms for both

image and video processing is implemented on a Xilinx Zynq®-7000 System-on-Chip (SoC) device, which consists of dual-core Cortex™-A9 processors and provides the computing ability to perform these tasks with the help of software libraries using Vivado® High-Level Synthesis (HLS).

There are a few reasons why tracking is preferable to detecting objects in

each frame. When there are several objects, tracking helps maintain the identity of each item across frames. Object detection may fail in some instances, but tracking may still be achievable because it takes into account the location and appearance of the

object in the previous frame. The key hurdles in real-time image and video processing

applications are object tracking and motion detection. Some tracking algorithms are

extremely fast because they perform a local search rather than a global search. Tracking

algorithms such as mean-shift, multiple hypothesis tracking (MHT), probabilistic data

association, particle filter, nearest neighbor, Kalman filter and interactive multiple

model (IMM) are available to estimate and predict the state of a system.

For linear models, Kalman filter is the most widely used prediction algorithm

as it is very simple, efficient and easy to implement. However, these types of filter

algorithms are customized on hardware platforms such as Field-Programmable Gate

Arrays (FPGAs) and Graphics Processing Units (GPUs) to achieve the design requirements for embedded applications. This research work also proposes a multi-dimensional Kalman

filter (MDKF) algorithm for object tracking and motion detection.

Further, the research is fueled by the application of deep learning (DL) algorithms to image processing applications. With the technological improvements in artificial intelligence, deep learning in particular is providing effective outcomes along with FPGAs in various domains. FPGAs with their reconfigurable architectures provide

flexibility, better performance and high levels of parallelism. However, many

applications require accuracy and rapid processing for better results.

For real-time conditions, CNN algorithms can be implemented on hardware

accelerators. This research also proposes the implementation of the CNN algorithm on the PYNQ-Z2 SoC, which is suitable for real-time object detection. The proposed work shows resource utilization of about 174 (79%) DSPs, 115 (82.17%) BRAMs, 45.8k (43.04%) flip-flops and 23.2k (43.6%) look-up tables (LUTs) at a frequency of 100 MHz, with a performance of 9.14 GOP/s, which is 2x more efficient compared with state-of-the-art methods.


CHAPTER 1

INTRODUCTION

The deployment of high-speed computational hardware platforms for image processing applications has increased in recent years. It is not surprising that image processing units are already included in cellphones and cameras, but the computational demand of image processing algorithms remains a barrier to bringing a wide variety of theoretical advances into the reality of real-time implementation. On the other hand, the large number of resources available on FPGAs, along with the freedom they provide for testing and developing Application-Specific Integrated Circuits (ASICs), has made the FPGA a strong platform for implementing image processing algorithms in real time and in autonomous vehicles. FPGAs are well suited to providing a trade-off between parallelism and flexibility when compared with the other hardware platforms shown in Figure 1.1. A brief survey of hardware architectures reveals that, despite considerable advances in designing new algorithms or enhancing current ones, only a small amount of attention is paid to their realization in hardware.

Figure 1.1. Overview of Hardware Platforms Functionality.


FPGAs have long been employed for their versatility in creating
reprogrammable designs that target actual hardware for boosting application performance.
Because of their inability to integrate functional changes into their hardware designs, ASICs become a burden for high-level workloads with regularly varying parameters. FPGAs allow developers to construct algorithms and architectures that can be updated and reconfigured in configurable logic blocks (CLBs) to achieve near-true hardware acceleration without the cost of developing and re-fabricating physical ASIC chips. FPGAs have been used for years, but they have lately made a significant comeback as a result of the increasing number of high-level applications that require both accelerated implementation and efficient designs.

Most current FPGAs have multi-core hard processors, GPUs and different I/O
ports that may interface with the low-level architectures in the FPGA fabric. As a result,
current System-on-Chip (SoC) FPGAs are more powerful than ever before for processing
high-end applications. Based on the programming technology, FPGAs can be classified
into three categories: SRAM-based, flash-based and anti-fuse FPGAs. The comparison of these FPGAs in terms of various parameters is shown in Table 1.1.

Table 1.1. Comparison of Types of FPGAs

Parameters                     Anti-fuse-based FPGAs   Flash-based FPGAs   SRAM-based FPGAs
Reprogrammability              No                      Yes                 Yes
Volatility                     No                      No                  Yes
Area utilization               Low                     Moderate            High
Switching resistance (Ω)       ≈100                    ≈1000               ≈1000
Switching capacitance (fF)     ≤1                      1–2                 1–2
In-system programmable         No                      Yes                 Yes

Table 1.2 compares the power consumption of SRAM-based and flash-based FPGAs. SRAM-based FPGAs have the limitation of consuming larger static power when compared with flash-based FPGAs. Apart from this limitation, SRAM-based FPGAs are widely used for their significant performance and their ability to be reprogrammed. Hence, these types of FPGAs provide support for the integration of embedded systems for various applications. The major vendors of FPGAs and SoCs are Xilinx® and Intel®. Some of the target applications of prominent FPGAs from Xilinx® are shown in Figure 1.2. Since this research is focused on image processing applications, the PYNQ-Z2 SoC is chosen as the primary hardware.

Table 1.2. Comparison of Power Consumption For FPGAs

Parameters                  Flash-based FPGAs   SRAM-based FPGAs
Active power                Low                 High
Standby or static power     Ultra-low           High
Configuration power         Low                 High
Low-power (sleep) mode      Ultra-low           Low

Figure 1.2. Xilinx FPGAs and Their Target Applications.

1.1 FPGA ARCHITECTURE

FPGAs are composed of a collection of programmable logic blocks (PLBs)


embedded in a programmable interconnect. The programmable logic blocks provide the basic computational and storage elements used in electronic designs. The logic blocks in an FPGA architecture are made up of a few logic cells that include look-up tables (LUTs), multiplexers and D flip-flops; various combinations of memory and logic blocks are found in recent FPGAs. To connect these logic blocks, routing channels and I/O blocks are required [1]. In the programmable routing architecture, pre-fabricated switches and programmable wires are positioned in vertical and lateral routing channels. This allows logic and I/O blocks to communicate with one another. Around the FPGA chip, the programmable routing network connects the I/O blocks. The functional components and routing framework are connected to peripherals using configurable I/O pads. Logic circuits surround the I/O pads, forming I/O cells that take up a considerable amount of space on the chip. The simplest form of FPGA architecture is shown in Figure 1.3.

Figure 1.3. Generic FPGA Architecture

FPGAs are reprogrammable platforms that enable the reuse of hardware


components and software libraries. Xilinx creates SoCs that combine the software programmability of a processing unit with the hardware programmability of an FPGA. They offer a variety of boards to potential consumers who want SoC platforms for design, which are grouped into three categories: cost-optimized, mid-range and high-end. Devices in the cost-optimized category include the Artix® and PYNQ-Z2 series. These boards offer developers a low-cost way to construct programs that do not require substantial software processing. As a result, these devices are available with either single-core or dual-core ARM Cortex-A9 processors.
1.1.1. Xilinx PYNQ-Z2 SoC

PYNQ-Z2 APSoCs are distinct from all other Xilinx FPGA families. The device is built with a dual-core ARM Cortex-A9 Processing System (PS), Advanced Microcontroller Bus Architecture (AMBA) interconnects and a variety of peripheral devices including a USB JTAG interface, Quad SPI flash memory, UART, CAN and Ethernet, as well as Xilinx Programmable Logic (PL) of the Artix 7-series [2]. Figure 1.4 shows the schematic view of the Xilinx SoC device. Figure 1.5 gives an overview of the structure of the FPGA hardware, which is chosen as the primary hardware for the proposed design to meet its prerequisites. The significant features of the PYNQ-Z2 SoC are listed below [3].

Memory
 Supports 32-bit data width
 IIC - 1 KB EEPROM
 16MB Quad SPI Flash
 DDR3 Component Memory 1GB
Configuration
 USB JTAG configuration port (Digilent)
 16MB Quad SPI Flash
Communication
 USB OTG 1 (PS) - Host USB
 USB UART (PS)
 IIC Bus Headers/HUB (PS)
CLB
 Look-up tables (LUTs)
 Adders
 Flip-flops (FFs)
DSP blocks
 48-bit adder/accumulator
 18 x 25 signed multiply
 25-bit pre-adder
Application Processor Unit (APU)
 CoreSight™ and Program Trace Macrocell (PTM)
 NEON™ media-processing engine
 Coherent multiprocessor support
 Vector Floating Point Unit (VFPU)
 CPU frequency: Up to 1 GHz
Clocking
 156.25MHz I2C Programmable Oscillator (Differential LVDS)
 200MHz Fixed PL Oscillator (Differential LVDS)
 33.33MHz Fixed PS System Oscillator (Single-Ended CMOS)
Connectors and peripherals
 Secure Digital (SD) connector
 ADV7511 HDMI codec
 Status LEDs
 SoC PS reset pushbuttons
 IIC bus multiplexing
 FMC1 LPC connector
 FMC2 LPC connector
 RTC-8564JE real-time clock
 PMBUS data/clock
 M24C08 EEPROM (1 kB)
Interfacing PS to PL
 2x AXI 32-bit Master, 2x AXI 32-bit Slave
 4x AXI 64-bit/32-bit Memory
 16 interrupts
 AXI 64-bit ACP
 8 DMA channels
On-chip memories
 L1 cache (32 kB)
 On-chip memory (256 kB)
 L2 cache (512 kB)
Security
 AES and SHA
 RSA Authentication
Figure 1.4. Schematic View of PL and PS Portions of the ZYNQ-Z2 SoC

The programmable logic section comprises of CLBs, LUTs, BRAMs, DSP


slices, FFs and so on. LUTs can be configured as a single 6-input LUT (64-bit ROM) with a single output, or as two 5-input LUTs (32-bit ROMs) with distinct outputs but shared addresses or logic inputs. Each LUT output can be registered in a flip-flop if desired. A slice is formed by four such LUTs and their eight flip-flops, as well as multiplexers and arithmetic carry logic, and two slices comprise a configurable logic block (CLB). Four of the eight flip-flops per slice (one flip-flop per LUT) can be configured as latches if desired. The block diagram of the Zynq SoC is shown in Figure 1.6.
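As a quick consistency check against the device resources listed later in Table 1.3 (an estimate from the slice organization above, not a vendor specification):

53,200 LUTs ÷ (4 LUTs per slice × 2 slices per CLB) = 6,650 CLBs
53,200 LUTs × 2 flip-flops per LUT = 106,400 flip-flops

which matches the flip-flop count of the XC7Z020 device.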

The PYNQ-Z2 SoC series of devices enables designers to target both cost-sensitive and high-performance applications from a single platform using industry-standard tools. While all devices in the PYNQ-Z2 family have an identical PS, the PL and I/O resources may vary. As a result, PYNQ-Z2 SoCs can execute a plethora of applications such as networking, cryptography, wireless networks, video and surveillance, tracking and detection, medical imaging, autonomous systems and industrial applications. The resources available in the PYNQ-Z2 SoC are listed in Table 1.3.
Table 1.3 List of Hardware Resources Available in ZYNQ XC7Z020 SoC [2].

Entity                       Availability / Details
DSP slices                   220
FFs                          106,400
LUTs                         53,200
BRAMs                        140 (4.9 Mb)
Programmable logic cells     85 K
Performance of DSPs          276 GMACs
I/O pins                     128

Figure 1.5. Overview of PYNQ-Z2 SoC Block Diagram [2]


Figure 1.6. Block Description of PYNQ-Z2 SoC [3]

1.2 XILINX VIVADO

Vivado Design Suite, created by Xilinx, is used for HDL design synthesis and
analysis [4]. Vivado is an IDE that allows users to create low-level hardware designs for
Xilinx FPGAs. This suite includes a plethora of Xilinx-developed intellectual property (IP)
that may be included in designs to minimise development time. Users can also create their own HDL-based IP for application modification with Vivado. The hardware designs can be developed as a set of HDL files that are linked together, or by utilising the built-in block diagram GUI, which allows users to drop in IP blocks and manually connect signals in Vivado. When a design is finished, Vivado can output a bitstream file that can be used to configure the FPGA.

Before simulation or synthesis, the tool provides design validation, which


allows the user to ensure that the developed hardware design is correctly configured and
free of major design flaws. Users can build testbenches for their designs to emulate the
functionality of their applications. When a simulation is run, a testbench (an HDL-based framework that wraps around the hardware design) provides it with a sequence of inputs, and the resulting outputs are reported to the user. Running simulations within Vivado is a useful way for users to assess the correctness and functionality of their designs prior to synthesis. The complete design flow of Vivado HLS is illustrated in Figure 1.7.

Simulation is merely a technique for functional testing of a design; it does not


ensure that the design will pass synthesis. The most significant feature that Vivado offers is
synthesis. The synthesis process will convert the user's design, which may be in the form
of HDL code or a schematic, into a netlist. This step is crucial since the netlist is the
component in charge of mapping and connecting logic gates and FFs throughout the fabric.
In general, synthesis is the process of converting a software design into the hardware
components required to physically represent the application. When the netlist targets FPGA hardware, it ensures that when an output signal is generated, the data can reach the input of the next component within the time physically required to transport it. This concept is known as setup and hold slack in static timing analysis, and it is defined as the difference between the data required time and the data arrival time.
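For example, on a hypothetical path clocked at 100 MHz (the frequency used for the proposed designs later in this work), with an assumed data arrival time of 7.2 ns and a required time of one clock period:

setup slack = data required time − data arrival time = 10 ns − 7.2 ns = 2.8 ns

A positive slack means the path meets timing; a negative slack points to the state or IP block that must be fixed.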

Figure 1.7. Vivado HLS Design Flow [4]


To transition from one state to the next, each custom IP produced for this project implements an FSM that is reliant on the system clock and the defined state variable. This architecture enables a function to be divided into numerous states, each of which requires one clock cycle to perform. This has the advantage of allowing a timing issue to be traced back to a specific state within an IP block when a static timing analysis report is generated. Once the source of the timing error has been identified, it can be rectified by providing the state machine with additional states to complete its execution.

The Vivado 2020.1 SDK tool was used for this research work to build high-
level software designs that operate on the FPGA processors and interface with the
hardware design in the FPGA fabric. These software designs are in charge of retrieving
parameter and frame data from the FPGA's I/O ports and writing it to BRAM. The SDK
included a graphical user interface (GUI) for developing applications directly on the
MicroBlaze® soft-processor found in ZC702 FPGAs and the dual-core ARM Cortex-A9
CPU. It differs from the standard Eclipse IDE in that it can import Vivado-generated
hardware designs, create and configure Board Support Packages (BSPs), support single-processor and multi-processor development for FPGA-based software applications, and includes off-the-shelf software reference designs that can be used to test the application's hardware and software functionality.

SDK is the first application IDE to provide genuine heterogeneous


multiprocessor design, debugging and performance analysis for Xilinx FPGAs. The
compilers that optimise C/C++ code and generate assembly code are the main feature that
the Vivado SDK offered for this project. These compilers allow high-level software designs to be targeted at the FPGA processors. Apart from Vivado, TeraTerm is used in this research work. It is an open-source Telnet and SSH tool used to connect to the serial ports of FPGAs and provide a terminal-like interface for troubleshooting the high-level algorithms that operate on FPGA platforms.

1.3 NEED FOR HARDWARE ACCELERATION

Hardware acceleration refers to the utilization of dedicated hardware resources, such as FPGAs and graphics processing units (GPUs), to accomplish certain activities more quickly than software execution on a general-purpose processor. High performance, lower power consumption, lower latency, improved parallelism and bandwidth, and better utilization of the space and functional components available on an integrated circuit are all positive aspects of hardware acceleration.

FPGAs are often considered the first option for true hardware acceleration since they have a reconfigurable fabric that can express a software programme as logic gates. The trade-off between flexibility, performance and power consumption is constantly examined when considering hardware platforms to accelerate domain-specific applications. FPGAs fall somewhere between these extremes and provide a good balance across these three measures [5].

1.4 APPLICATIONS OF HARDWARE ACCELERATION

Hardware acceleration provides high performance by offloading particular operations from the CPU to FPGAs or other specialized hardware. This process can be enabled, for example, in Google Chrome when a dedicated graphics card is present. Some of the hardware accelerators are shown in Figure 1.8.

Figure 1.8. List of Few Hardware Accelerators


1.5 MOTIVATION AND PROBLEM STATEMENTS

FPGA-based hardware acceleration for image and video processing techniques provides high performance and parallelism. The major concern in the implementation process is the effective utilization of hardware resources such as BRAMs, DSP slices, LUTs, FFs and PLBs. The most challenging task in real-time image processing is the tracking of multiple objects. For object tracking algorithms, accuracy and speed are considered the primary parameters for evaluation and validation. CNN-based tracking algorithms are not time efficient, and feature extraction involves a multi-layer network to perform the operation. These characteristics prompted the researchers to choose FPGAs to implement tracking and prediction.

1.6 RESEARCH OBJECTIVES

To implement hardware acceleration of edge detection algorithms with effective


resource utilization on Zynq SoC.

To implement FPGA-based CNN acceleration with minimum prediction time when compared with other hardware accelerators.
CHAPTER 3

IMPLEMENTATION OF DEEP-LEARNING BASED TECHNIQUE FOR


OBJECT DETECTION

With the technological improvements in artificial intelligence, deep learning in particular is providing effective outcomes along with hardware platforms in various domains. FPGAs, with their reconfigurable architectures, provide flexibility, better performance and high levels of parallelism. Object detection is one of the prominent areas of research in the fields of computer vision and image processing. CNN (You Only Look Once) is a state-of-the-art object detection algorithm that is fast and accurate. However, many applications require both accuracy and rapid processing for better results. For such conditions, these algorithms can be implemented on hardware accelerators. This work proposes the implementation of the CNN algorithm on the Xilinx® Zynq-7000 SoC, which is suitable for real-time object detection.

3.1 CNN ALGORITHM

CNN uses the Darknet-53 framework for object detection, which is a fast and reliable state-of-the-art algorithm [62]. As the name suggests, Darknet-53 has 53 convolutional layers. CNN v4 is faster than its previous version, CNN v3, and consists of Darknet as its backbone with CNN v3 on top of it. The middle layers include the Path Aggregation Network (PAN) [60], Spatial Pyramid Pooling (SPP) and Feature Pyramid Network (FPN) [61]. The generic architecture of the CNN is shown in Figure 3.1. To extract features from images, it uses convolutional layers, and bounding boxes are predicted by regression from anchor boxes with a k x k kernel size to generate the feature maps. It also contains pooling layers, which down-sample each input map. A fully connected (FC) layer can act as a classifier. There are also non-linearity layers with activation functions to enhance the fitting ability of the neural network. The most commonly used activation functions are the Rectified Linear Unit (ReLU), Leaky ReLU, sigmoid and hyperbolic tangent. The important measurements for CNN object detection are discussed in Section 3.3. A minimal sketch of two of these activation functions is given after Figure 3.1.

Figure 3.1. CNN Architecture.
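As a minimal illustration of the activation functions mentioned above, the NumPy sketch below implements ReLU and Leaky ReLU; the negative slope of 0.1 is an assumed value, not the one used in the accelerated network.

import numpy as np

def relu(x):
    # ReLU: passes positive values and zeroes out negative ones
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.1):
    # Leaky ReLU: keeps a small non-zero slope for negative inputs,
    # which mitigates the vanishing-gradient behaviour of plain ReLU
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.   0.   0.   1.5]
print(leaky_relu(x))  # [-0.2  -0.05  0.   1.5]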

3.2 DATASETS

In this work, the CNN network is used for object detection on the MS-COCO benchmark dataset, which has a larger number of categories and instances than PASCAL VOC and ImageNet [101]. MS-COCO contains images with 91 different object types and over 2.5 million annotated instances. In contrast to the ImageNet and PASCAL VOC benchmark datasets, only about 10% of MS-COCO images contain a single category per image.

3.3 EVALUATION METRICS

In this section, evaluation parameters are discussed for proposed hardware


acceleration.

3.3.6 Precision, Recall and Intersection over Union

Precision can be simplified as the ratio of true positives (TP) to the total number of positive predictions, that is, the sum of false positives (FP) and TP, as specified by Equation (3.1). Recall is determined as the ratio of true positive predictions to the total number of objects in the image, as given in Equation (3.2). Intersection over Union (IoU) is a universal standard measure of the similarity between the predicted and the ground-truth areas [102]. IoU can be expressed as in Equations (3.3) and (3.4); it is the ratio of the area of overlap to the area of union. A detailed representation of these measures is shown in Figure 3.2. Another important measure, the mean Average Precision (mAP), is given in Equation (3.5). Before predicting bounding boxes, it is necessary to calculate the IoU between the predicted bounding box and the ground-truth box, which should ideally be close to 1. IoU is widely used as an evaluation parameter for applications such as 2D/3D object detection, computer vision and so on.
Precision = TP / (FP + TP)                                                          (3.1)

Recall = TP / (FN + TP)                                                             (3.2)

IoU = TP / (FP + TP + FN)                                                           (3.3)

IoU = (Area_pred ∩ Area_gt) / (Area_pred ∪ Area_gt) = Area of Overlap I(X) / Area of Union U(X)    (3.4)

mAP = (1 / |classes|) · Σ_(c ∈ classes) TP / (TP + FP)                              (3.5)

Figure 3.2. Precision, Recall and IoU measures.
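As an illustration of Equation (3.4), the short sketch below computes the IoU of two axis-aligned boxes given as (x, y, w, h) tuples; the sample coordinates are invented for demonstration only.

def iou(box_a, box_b):
    # Boxes are (x, y, w, h) with (x, y) the top-left corner
    xa1, ya1, wa, ha = box_a
    xb1, yb1, wb, hb = box_b
    xa2, ya2 = xa1 + wa, ya1 + ha
    xb2, yb2 = xb1 + wb, yb1 + hb
    # Area of overlap (intersection of the two rectangles)
    inter_w = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    inter_h = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = inter_w * inter_h
    # Area of union = sum of the two areas minus the overlap
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 100, 100), (50, 50, 100, 100)))  # ≈ 0.22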

3.3.7 Bounding box prediction

The CNN network operates on a horizontal bounding box to locate the position of the target in the image, as shown in Figure 3.3. The dimensional vectors of the bounding box are denoted by bx, by, bw and bh, where (bx, by) represents the centre of the bounding box and bw and bh are its width and height; they are given in Equations (3.6), (3.7), (3.8) and (3.9) respectively. The dimensions of the predefined anchor box are given by pw and ph. For locating bounding boxes, greedy non-maximum suppression (NMS) is used. The activation function is denoted as fa. (Cx, Cy) are the coordinates of the top-left corner of the anchor box.

𝑏𝑥 = 𝑓𝑎(𝑡𝑥) + 𝐶𝑥 (3.6)

𝑏𝑦 = 𝑓𝑎(𝑡𝑦) + 𝐶𝑦 (3.7)
𝑏𝑤 = 𝑝𝑤 . 𝑒𝑡𝑤 (3.8)

𝑏ℎ = 𝑝ℎ . 𝑒𝑡ℎ (3.9)

Figure 3.3. Bounding box prediction.
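A small sketch of Equations (3.6)-(3.9), assuming the activation fa is a sigmoid (a common choice in this family of detectors); the network outputs, anchor dimensions and cell offsets below are made up for illustration.

import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    fa = lambda t: 1.0 / (1.0 + math.exp(-t))   # activation function fa (assumed sigmoid)
    bx = fa(tx) + cx          # Eq. (3.6): box centre x, offset from the cell corner Cx
    by = fa(ty) + cy          # Eq. (3.7): box centre y, offset from the cell corner Cy
    bw = pw * math.exp(tw)    # Eq. (3.8): width scaled from the predefined box width pw
    bh = ph * math.exp(th)    # Eq. (3.9): height scaled from the predefined box height ph
    return bx, by, bw, bh

# Example: predefined box of size (1.5, 2.0) anchored at cell corner (3, 4)
print(decode_box(0.2, -0.1, 0.3, 0.1, 3, 4, 1.5, 2.0))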

3.3.8 Loss Function

The multi-part loss function is given in Equation (3.10). It is calculated as the sum of the coordinate loss (Coord_Loss), bounding loss (Bound_Loss), confidence loss (Confid_Loss) and category loss (Category_Loss) given in Equations (3.11), (3.12), (3.13) and (3.14) respectively. Equation (3.11) gives the coordinate loss of the bounding box, in which (x, y) are the predicted coordinates and (x̂, ŷ) are the ground-truth coordinates. 𝟙_ij^obj indicates that the object in cell i is 'responsible' for the prediction in bounding box j.

The width and height of the box are given by w and h respectively. The number of grid cells and the number of classes are denoted as S² and C respectively. B is the number of bounding boxes predicted by each grid cell. The probability is given by p. λ_noobj and λ_coord are the loss parameters used to control stability.
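For reference, under the assumption that the loss follows the standard YOLO-style formulation (which matches the terms and symbols described above), the full expression can be sketched as:

Loss = λ_coord Σ_(i=0..S²) Σ_(j=0..B) 𝟙_ij^obj [ (x_i − x̂_i)² + (y_i − ŷ_i)² ]
     + λ_coord Σ_(i=0..S²) Σ_(j=0..B) 𝟙_ij^obj [ (√w_i − √ŵ_i)² + (√h_i − √ĥ_i)² ]
     + Σ_(i=0..S²) Σ_(j=0..B) 𝟙_ij^obj (C_i − Ĉ_i)²
     + λ_noobj Σ_(i=0..S²) Σ_(j=0..B) 𝟙_ij^noobj (C_i − Ĉ_i)²
     + Σ_(i=0..S²) 𝟙_i^obj Σ_(c ∈ classes) (p_i(c) − p̂_i(c))²

where C_i here denotes the predicted box confidence (not the number of classes), the first two terms are the coordinate and bounding (size) losses, the next two the confidence loss, and the last the category loss.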
Algorithm: Bounding Box
Inputs: Network output detections; image dimensions img_w (width) and img_h (height).
Output: Bounding boxes drawn over the predicted regions.
Step-1: Get the output layers. Initialization-1: Set the confidence threshold to 0.5 and the greedy non-maximum suppression (NMS) threshold to 0.4.
1. import cv2; import numpy as np
2. def get_output_layers(net):
3.     layer_names = net.getLayerNames()
4.     return [layer_names[i - 1] for i in net.getUnconnectedOutLayers()]
5. outs = net.forward(get_output_layers(net))
6. class_ids = []; confid = []; boxes = []            # class ids, confidences and boxes per detection
7. confid_thres = 0.5; nms_thres = 0.4
8. for out in outs:
9.     for detection in out:
10.        scores = detection[5:]; class_id = np.argmax(scores); confidence = scores[class_id]
11.        if confidence > confid_thres:               # neglect detections below the confidence threshold
12.            center_x = int(detection[0] * img_w); center_y = int(detection[1] * img_h)
13.            w = int(detection[2] * img_w); h = int(detection[3] * img_h)
14.            x = center_x - w / 2; y = center_y - h / 2   # estimation of bounding box parameters
15.            class_ids.append(class_id); confid.append(float(confidence)); boxes.append([x, y, w, h])
16. indices = cv2.dnn.NMSBoxes(boxes, confid, confid_thres, nms_thres)   # greedy non-maximum suppression
17. for i in np.array(indices).flatten():
18.     x, y, w, h = boxes[i]
19.     draw_bounding_box(img, class_ids[i], confid[i], round(x), round(y), round(x + w), round(y + h))
20. cv2.imshow("object detection", out_img)            # display the output image
3.4 PROPOSED ACCELERATION OF CNN ALGORITHM

The CNN algorithm is implemented on the Zynq platform. The Zynq™-7000 SoC XC7Z020-CLG484-1 is used for the implementation; it has Advanced Microcontroller Bus Architecture (AMBA) interconnects, DDR3 component memory, an ARM Cortex-A9 Processing System (PS), 7-series Xilinx Programmable Logic (PL) and several peripherals such as Ethernet, Quad SPI flash memory, a USB JTAG interface, a CAN controller and UART [2]. Xilinx Inc. provides development tools such as Vivado High-Level Synthesis (HLS) for the deployment of deep learning algorithms, along with some Targeted Reference Designs (TRDs) [103]. Vivado also includes a high-level synthesis (HLS) tool for C-based IP generation in a high-level language (C, C++ or SystemC). Vivado 2020.1 provides different types of libraries for deploying various deep learning algorithms. The overview of the proposed acceleration is shown in Figure 3.4.

Figure 3.4. Synoptic of Proposed Architecture

The design flow from the CNN model to the hardware acceleration is shown in Figure 3.5. In the proposed hardware acceleration, the layers are processed serially. As far as the CONV layer is concerned, the kernel size is 3 x 3, which can be modified. The timing optimization of each layer is important in the streaming design. Because of the large amount of input data, loop pipelining is used to enhance the system throughput in high-level synthesis. The network model uses Leaky ReLU as its activation function. The main purpose of Leaky ReLU is to overcome the vanishing-gradient problem of ReLU, whose output is zero when the input is less than zero. The Data Scatter module is instrumental in creating the write addresses and distributing the information read from the DRAM memory to the on-chip buffers. The proposed hardware implementation is shown in Figure 3.6.

Figure 3.5. Design Flow of Hardware Acceleration of CNN Models


The pixel buffers are responsible for the processing of the maximum pooling
layer (Pool), convolutional layer (Conv and Leaky ReLU), and the reorg layer (Reorg).
The Data Gather module generates the DRAM write-back addresses and writes the data to
the DRAM from the output buffer. The accelerated hardware has two AXI4 master
interfaces and one AXI4-Lite slave interface for data transformation. In this proposed
method, fully connected (FC) layer is connected to weighted buffers.

Figure 3.3. Proposed architecture of CNN acceleration

In the proposed FPGA implementation, the PS consists of Direct Memory Access (DMA), general-purpose input/output (GPIO), an interrupt controller and so on. The PL comprises the decoding and data-reordering modules, the parameter and processing arrays, etc. The DDR memories in both sections are used to communicate with the SoC through the Memory Interface Generator for storing network parameters and feature maps, as shown in Figure 3.7. During processing, each layer's configuration instruction is prepared by the ARM processor and delivered to the PL through GPIO. Control signals are transmitted to the required modules after decoding the instruction. The DMA retrieves the original picture from the PS-DDR and sends it to the PL.

The input data-reordering module rearranges the pixels before passing them on to the processing array. The DDR controller in the PL retrieves the model parameters from the PL-DDR and stores them in the parameter buffer, which then supplies parameters to the processing array (PA). The suggested design employs processing elements (PEs) to construct PAs for parallel processing. These parallel PEs use the same input feature map and perform calculations for various output channels, and they finish each layer's calculation concurrently. Finally, the PS-DDR transfers the output feature maps from the last layer back to the host PC. The host PC uses the final feature maps to perform Non-Maximum Suppression (NMS) to obtain the object detection results.

Figure 3.7 Dataflow of Proposed Architecture
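As a rough illustration of how the PS side can drive this dataflow from Python on the PYNQ-Z2, the sketch below uses the PYNQ overlay API to load a bitstream and push one frame through an AXI DMA, then pull back the final feature maps for NMS on the host; the bitstream name, DMA instance name and buffer shapes are assumptions rather than the exact design described here.

import numpy as np
from pynq import Overlay, allocate

overlay = Overlay("cnn_accelerator.bit")     # load the accelerator bitstream into the PL (name assumed)
dma = overlay.axi_dma_0                      # AXI DMA between PS-DDR and the PL buffers (instance name assumed)

# Physically contiguous buffers in PS-DDR for one input frame and the output feature maps
in_buf = allocate(shape=(416, 416, 3), dtype=np.uint8)
out_buf = allocate(shape=(13 * 13 * 255,), dtype=np.float32)   # output size assumed

frame = np.zeros((416, 416, 3), dtype=np.uint8)   # placeholder for the pre-processed input image
in_buf[:] = frame

dma.sendchannel.transfer(in_buf)             # stream the frame into the PL
dma.recvchannel.transfer(out_buf)            # receive the last layer's feature maps
dma.sendchannel.wait()
dma.recvchannel.wait()

feature_maps = np.array(out_buf)             # NMS is then performed on the host side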

3.5 RESULTS AND DISCUSSION

The object detection algorithm is trained on the MS-COCO dataset, which contains 91 object types with 2.5 million labeled instances in 328 k images [48]. The implementation results are shown in Table 3.1. The proposed implementation achieves better resource utilisation, with 23.2 k LUTs, 45.8 k flip-flops, 115 BRAMs and 174 DSPs at an optimum frequency of 100 MHz. The timing delay for each frame is shown in Figure 3.8. The implementation and synthesis are obtained from Vivado HLS, and the IP core subsystems are shown in Figure 3.9. Since CNN is one of the fastest object detection algorithms, the performance of the overall system is 9.14 GOP/s and the power consumption is 10.36 W. The achieved throughput is restricted by resource constraints and data dependencies in the design.

Figure 3.8 An Illustration of Frame Rate Processing.

The total delay for the nth frame is given by,

D_total(n) = D_s(n) + D_q(n)                                                (3.15)

where D_s denotes the detection service delay and D_q the queuing delay of the frame.

However, because the queue’s size is finite, the delay in reaching its maximum
queue size occurs. However, because the proposed accelerator serves the most recent
frame, this cumulative delay does not occur, demonstrating that each frame can be handled
inside the deadline. In the context of the CNN algorithm, the waiting time for the frames
kept in the queue is compounded as time passes. For example, when Ta is larger than or
equal to Ds, the system is often operating in a high-performance hardware system, and the
object detection service time is faster than the input rate.
Figure 3.9. IP Core Subsystems of Accelerated CNN Algorithm Using Vivado
2020.1
Table 3.1. Comparison of Resources Utilized for CNN on Zynq XC7Z020 Platform.

Parameters              K. Guo et al. [78], 2016       Proposed work
Target platforms        XC7Z020                        XC7Z020
Target networks         Angel-Eye (modified CNN)       CNN v4
LUTs                    27 k                           23.2 k
Flip-flops              24 k                           45.8 k
BRAMs                   68                             115
DSPs                    198                            174
Clock freq. (MHz)       100                            140
Dynamic power (W)       N/A                            10.36

From Table 3.1, it is understood that the LUTs and DSPs are effectively utilized in the proposed implementation, but the FF and BRAM utilization is higher because of the large memory occupied by the high-resolution images in the MS-COCO dataset [101]. This limitation can be avoided by optimizing the convolutional layers of the algorithm for high-resolution images. The existing implementations of object detection on different platforms are summarized in Table 3.2. As this chapter focuses on the real-time implementation, optimizing these layers is not attempted.
Table 3.2. Overview of Resources Utilized For Object Detection on Different Platforms in Existing Works.

Work                              Target network                        Target platform   LUTs      Flip-flops   BRAMs   DSPs   Clock Freq. (MHz)   Power (W)
K. Guo et al. [78], (2016)        Angel-Eye (only for face detection)   XC7Z020           27 k      24 k         68      198    100                 N/A
K. Guo et al. [78], (2016)        Angel-Eye (only for face detection)   XC7Z045           182.6 k   127.6 k      486     780    150                 9.63
Z. Yu et al. [77], (2020)         CNNv3 tiny                            Zedboard          25.9 k    43.7 k       185     160    663.7               3.36
D.T. Nguyen et al. [76], (2019)   Sim-CNNv2 tiny                        Virtex-7 VC707    155 k     115 k        1144    272    200                 18.29
H. Nakahara et al. [75], (2018)   Lightweight CNNv2                     ZCU102            135 k     370 k        1706    377    N/A                 4.5
G. Wei et al. [79], (2018)        FPGA CNN                              Zynq 7035         47 k      40 k         787     409    N/A                 7.518
3.6 SUMMARY

The hardware acceleration of the CNN algorithm for object detection is implemented on the PYNQ-Z2 SoC. Unlike previous investigations that focus on changing the network structure of the CNN, this chapter described the real-time acceleration of the CNN algorithm trained on the MS-COCO benchmark dataset for object detection. The proposed acceleration achieved a performance of 30 fps with minimum power consumption and effective resource utilization.

The Vivado HLS tool is used for the synthesis and implementation of the object detection algorithm on the SoC, which achieved a resource utilization of 43.6% of the LUTs, 43.04% of the flip-flops, 82.17% of the BRAMs and 79% of the DSPs when compared with other hardware implementations. The real-time acceleration of the CNN algorithm takes 10.125 ms per detection. The results show that the performance is 2x more efficient than previous works and that the prediction time is also very low when the CNN algorithm is implemented on FPGA hardware.
CHAPTER 4

CONCLUSION AND FUTURE PERSPECTIVES

4.1 CONCLUSION

Object tracking and detection are important tasks in computer vision


applications. The real-time performance of these processes can be achieved by implementing them on FPGAs, GPUs and multi-core architectures. The primary objective of this research work is to provide different hardware accelerations for image processing techniques such as object tracking and detection in order to achieve high performance, effective utilization of FPGA resources and good prediction accuracy.

We have implemented edge detection filter algorithms such as Sobel-Feldman, posterize and threshold on the Zynq SoC for a 1920 x 1080 image resolution; their effective resource utilization provides the baseline for the other accelerations. The synthesis and simulations were carried out using Vivado 2018.2, and libraries were utilized from OpenCV functions. The outcomes of this work were compared with similar implementations in terms of hardware resource utilization.

In the case of tracking, multiple object tracking (MOT) is a challenging process when it is performed in real time with a high degree of accuracy and speed. MOT aims to estimate the trajectories of all objects under various factors, for example occlusions and distracters. To overcome this issue, the MDKF algorithm is proposed; it is evaluated on various benchmark datasets such as OTB-100, UAVDT and MOT, and ablation studies show that MDKF achieves 70.3% precision and 44.7% AUC. These results demonstrate that the proposed tracking algorithm outperforms other state-of-the-art trackers.

Further, the research is fueled by the application of deep learning to image processing applications. Among several deep learning based object detection algorithms, the CNN algorithm is chosen and its acceleration is attempted. In addition, a hardware-based neural network for object detection was designed. The model is trained on the MS-COCO benchmark dataset and the outcomes are compared with existing implementations. For the proposed acceleration of real-time detection, the prediction time is about 10.12 ms, which is faster than existing SoC accelerations.

4.2 FUTURE PERSPECTIVES

The research work can be extended with the intention of incorporating full reconfiguration or partial reconfiguration (PR), which are key features of FPGAs. The most difficult problems are programming for reconfigurable architectures and the effective virtualization of FPGA resources for PR. With the development of FPGA technology, it is possible to implement reconfigurability for real-time image processing applications.

Modern FPGAs have greater capacity and faster memory speeds than in the past, allowing for a larger design space. In our research, we discovered that there may be a performance difference of up to 95% between two different solutions that use identical logic resources of an FPGA. It is not trivial to settle on one optimal solution, particularly when the computation resources and memory bandwidth of an FPGA platform are taken into account. If an accelerator structure is not designed properly, its compute performance will be insufficient to exploit the memory bandwidth provided by FPGAs; in other words, performance suffers as a result of insufficient usage of either logic resources or memory bandwidth.

Finally, the future work is focused on implementing object detection


algorithms on different types of hardware platforms to analyze various parameters like
power consumption, speed and resource utilization. Although FPGA implementations surpass software implementations in terms of timing accuracy and efficiency, their deployment is complex and time-consuming. Moreover, developers with specialised expertise are required for the development and customization of FPGA algorithms. Upcoming platforms that can generate a system configuration from software requirements could be a solution.
REFERENCES

[1] S.A. Fahmy, K. Vipin, FPGA dynamic and partial reconfiguration: A survey of
architectures, methods, and applications. Comput. Surveys 51, pp. 1-39, 2018.

[2] Xilinx, ZC702 Evaluation Board for the PYNQ-Z2 XC7Z020 SoC: User Guide (2017). [Online]. Available at: https://www.xilinx.com/support/documentation/boards_and_kits/zc702_zvik/ug850-zc702-eval-bd.pdf. (Accessed on 23rd March, 2022)

[3] Xilinx Inc., PYNQ-Z2 All Programmable SoC Technical Reference Manual (2021). [Online]. Available at: https://www.xilinx.com/support/documentation/user_guides/ug585-PYNQ-Z2-TRM.pdf. (Accessed on 23rd March, 2022)

[4] Xilinx Inc., “Vivado Design Suite Tutorial: High-Level Synthesis,” UG871 (v2014.1), May 6, 2014. [Online]. Available at: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_1/ug871-vivado-high-level-synthesis-tutorial.pdf. (Accessed on 23rd March, 2022)

[5] P. Babu, E. Parthasarathy, “Reconfigurable FPGA Architectures: A Survey and


Applications,” J. Inst. Eng. India Ser. B 102, pp. 143–156, 2021.

[6] J. C. Mora, E. C. Gallego and S. S. Solano, “Hardware/software co-design of video


processing applications on a reconfigurable platform,” in Int. Conf. on Industrial
Technology (ICIT), Seville, Spain: IEEE, pp. 1694–1699, 2015.

[7] K. F. Kong Wong, V. Yap and T. P. Chiong, “Hardware accelerator implementation


on FPGA for video processing,” in IEEE Conf. on Open Systems (ICOS), Kuching,
Malaysia, pp. 47–51, 2013.

[8] A. L. Sangiovanni-Vincentelli et al., “Defining Platform-Based Design,” in EEDesign. Available at: www.eedesign.com/story/OEG20020204S0062. (Accessed on 23rd March, 2022)
[9] L. Kechiche, L. Touil and B. Ouni, “Real-time image and video processing: Method
and architecture,” in 2nd Int. Conf. on Advanced Technologies for Signal and
Image Processing (ATSIP), IEEE, Monastir, Tunisia, pp. 194–199, 2016.

[10] J. G. Pandey, A. Karmakar and S. Gurunarayanan, “Architectures and algorithms


for image and video processing using FPGA-based platform,” in 18th Int. Sym. on
VLSI Design and Test (VDAT), IEEE, pp. 1, 2014.

[11] J. Rettkowski, A. Boutros and D. Göhringer, “HW/SW co-design of the HOG


algorithm on a Xilinx Zynq SoC,” Journal of Parallel and Distributed Computing,
vol. 109, pp. 50–62, 2017.

[12] S. Madhava Prabhu and S. Verma, "A Comprehensive Survey on Implementation


of Image Processing Algorithms using FPGA," 2020 5th IEEE International
Conference on Recent Advances and Innovations in Engineering (ICRAIE), 2020,
pp. 1-6, doi: 10.1109/ICRAIE51050.2020.9358384.

[13] Ali Azarian, Mahmood Ahmadi, “Reconfigurable Computing Architecture: Survey


and introduction,” in 2nd International Conference on Computer Science and
Information Technology, IEEE, Beijing, China, pp. 269–27, 2009.

[14] A. DeHon, "Reconfigurable Architectures for General-Purpose Computing",


Technical Report Massachusetts Institute of Technology, 1996.

[15] I. Kuon, R. Tessier and J. Rose, "FPGA Architecture: Survey and Challenges", J.
Found. and Trends in Electronic Design Automation, vol. 2, no. 2, pp. 135-253,
2008.

[16] K. Compton and S. Hauck, "Reconfigurable computing: A survey of systems and


software", ACM Computing Surveys, vol. 34, no. 2, pp. 171-211, 2002.

[17] R. Cumplido, M. Gokhale and M. Huebner, “Guest Editorial: Special issue on


Reconfigurable Computing and FPGA technology,” Journal of Parallel and
Distributed Computing, vol. 133, pp. 359–361, 2019.

[18] C. Claus, W. Stechele and A. Herkersdorf, “Autovision – A run-time reconfigurable


MPSoC architecture for future driver assistance systems,” IT- Information
Technology, vol. 49, no. 3, pp. 181–187, 2007.
[19] C. Khongprasongsiri, P. Kumhom, W. Suwansantisuk, T. Chotikawanid, S.
Chumpol et al., “A hardware implementation for real-time lane detection using
high-level synthesis,” in International Workshop on Advanced Image Technology
(IWAIT), Chiang Mai, Thailand: IEEE, pp. 1–4, 2018.

[20] D. G. Bailey, “Image processing using FPGAs,” Journal of Imaging, vol. 5, no. 53,
pp. 1–4, 2019.

[21] M. Kowalczyk, D. Przewlocka and T. Krvjak, “Real-time implementation of


contextual image processing operations for 4K video stream in Zynq UltraScale+
MPSoC,” in Conf. on Design and Architectures for Signal and Image Processing
(DASIP), Porto, Portugal, pp. 37–42, 2018.

[22] A. B. Amara, E. Pissaloux and M. Atri, “Sobel edge detection system design and
integration on an FPGA based HD video streaming architecture,” in 11th Int.
Design & Test Sym. (IDT), Hammamet, Tunisia, pp. 160–164, 2016.

[23] E. Onat, “FPGA implementation of real time video signal processing using Sobel,
Robert, Prewitt and Laplacian filters,” in 25th Signal Processing and
Communications Applications Conf. (SIU), Antalya, Turkey, pp. 1–4, 2017.

[24] R. Tessier, I. Kuon, J. Rose, “FPGA architecture: survey and challenges,” Found.
Trends Electron. Des. Autom. 2(2), pp. 135–253, 2008.

[25] Y. Fang, L. Yu and S. Fei, "An Improved Moving Tracking Algorithm With Multiple
Information Fusion Based on 3D Sensors," in IEEE Access, vol. 8, pp. 142295-
142302, 2020, doi: 10.1109/ACCESS.2020.3008435.

[26] Y. Wang and X. Mu, "Dynamic Siamese Network With Adaptive Kalman Filter for
Object Tracking in Complex Scenes," in IEEE Access, vol. 8, pp. 222918-222930,
2020, doi: 10.1109/ACCESS.2020.3043878.

[27] S. Yang and M. Baum, "Extended Kalman filter for extended object tracking,"
2017 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2017, pp. 4386-4390, doi: 10.1109/ICASSP.2017.7952985.

[28] C. G. Prevost, A. Desbiens and E. Gagnon, "Extended Kalman Filter for State
Estimation and Trajectory Prediction of a Moving Object Detected by an
Unmanned Aerial Vehicle," 2007 American Control Conference, 2007, pp. 1805-
1810, doi: 10.1109/ACC.2007.4282823.
[29] I.A. Iswanto, T. Choa, B. Li, “Object tracking based on meanshift and particle-
Kalman filter algorithm with multi features,” Procedia Computer Science, vol.
157, pp. 521–529, 2019.

[30] F. Farahi, H.S. Yazdi, “Probabilistic Kalman filter for moving object tracking,”
Signal Processing: Image Communication, vol. 82 no. 10, pp.115751, 2020.

[31] E. Gundogdu, A. A. Alatan, “Good features to correlate for visual tracking,” IEEE
Transactions on Image Processing, vol. 27, no. 5, pp. 2526–2540, 2018.

[32] L. Bertinetto, J. Valmadre, J.F. Henriques, A. Vedaldi, P.H.S. Torr, “Fully-


convolutional Siamese networks for object tracking,” in Proc. ECCV Workshop,
pp. 850–865, 2016.

[33] D. S. Bolme, J. R. Beveridge, B. A. Draper and Y. M. Lui, "Visual object tracking


using adaptive correlation filters", Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., pp. 2544-2550, 2010.

[34] J.-M. Jeong, T.-S. Yoon, J.-B. Park, Kalman filter based multiple objects detection-
tracking algorithm robust to occlusion, in: 2014 Proceedings of the SICE Annual
Conference (SICE), 2014, pp. 941–
946, https://doi.org/10.1109/SICE.2014.6935235.

[35] M. Heimbach, K. Ebadi and S. Wood, "Improving Object Tracking Accuracy in


Video Sequences Subject to Noise and Occlusion Impediments by Combining
Feature Tracking with Kalman Filtering," 2018 52nd Asilomar Conference on
Signals, Systems, and Computers, 2018, pp. 1499-1502, doi:
10.1109/ACSSC.2018.8645175.

[36] Z. Zhou, X. Gao, J. Xia, Z. Zhu, D. Yang, J. Quan, “Multiple instance learning
tracking based on Fisher linear discriminant with incorporated priors,” Int. J. Adv.
Robotic Syst. vol. 15(1), pp. 1–19, 2018. doi:
https://doi.org/10.1177/1729881417750724.

[37] Z. Zhou, J. Wang, Y. Wang, Z. Zhu, J. Du, X. Liu, J. Quan, “Visual tracking using
improved multiple instance learning with co-training framework for moving
robot,” KSII Trans. Internet Inf. Syst. vol. 12 (11), pp. 5496–5521, 2018.

[38] E. Gundogdu, A.A.Alatan, Good features to correlate for visual tracking, IEEE
Trans. Image Process. 27(5) (2018) 2526–2540. doi:10.1109/TIP.2018.2806280..
[39] J. S. Bergstra, R. Bardenet, Y. Bengio and B. Kégl, "Algorithms for hyper-
parameter optimization", Proc. 24th Int. Conf. Neural Inf. Process. Syst., pp.
2546- 2554, 2011.

[40] H. Li, Y. Li, F. Porikli, Deeptrack: Learning discriminative feature representations


online for robust visual tracking, Proc. BMVC, pp. 1–12, 2014.

[41] M. Danelljan, G. Bhat, F.S. Khan, M. Felsberg, ECO: Efficient convolution


operators for tracking, in: Proc. IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2017, pp. 6931–6939.

[42] D. Yuan, X. Chang, P.Y. Huang, Q. Liu, Z. He, Self-supervised deep correlation
tracking, IEEE Trans. Image Process. 30, pp. 976–985, 2021.
https://doi.org/10.1109/TIP.2020.3037518.

[43] X. Dong, J. Shen, W. Wang, Y. Liu, L. Shao, F. Porikli, Hyperparameter


Optimization for Tracking with Continuous Deep Q-Learning, in: Proc. 2018
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018,
pp. 518–527.

[44] J.V. Fonseca, R.C.L. Oliveira, J.A.P. Abreu, E. Ferreira, M. Machado, Kalman filter
embedded in FPGA to improve tracking performance in ballistic rockets, in:2013
UKSim 15th International Conference on Computer Modelling and Simulation,
2013, pp. 606–610, https://doi.org/10.1109/UKSim.2013.149.

[45] Al-Rababah, A.A. Qadir, Embedded architecture for object tracking using Kalman
filter, J. Comput. Sci. 12(5) pp. 241–245. 2016 doi:10.3844/jcssp.2016.241.245.

[46] W. Liu, H. Chen, L. Ma, Moving object detection and tracking based on Zynq
FPGA and ARM SoC, IET International Radar Conference, pp. 1–4, 2015
https://doi.org/10.1049/cp.2015.1356.

[47] A. Sudarsanam, Analysis of Field Programmable Gate Array-based Kalman Filter


Architectures, url:http://digitalcommons.usu.edu/etd/788. (Accessed on 14th
March, 2022)

[48] P. Rao, M.A. Bayoumi, An efficient vlsi implementation of real-time Kalman filter,
IEEE International Symposium on Circuits and Systems pp. 2353–2356, 1990.
https://doi.org/10.1109/ISCAS.1990.112482.
[49] L. Bossuet, G. Gogniat, J. Diguet, J. Philippe, A modeling method for
Reconfigurable Architectures, in: System-on-Chip for Real-Time Applications. The
Kluwer International Series in Engineering and Computer Science, vol. 711, 2003.
doi:10.1007/978-1-4615-0351-4_16.

[50] A. Mills, P.H. Jones, J. Zambreno, Parameterizable FPGA-based Kalman Filter


Coprocessor using Piecewise Affine Modeling, in: IEEE International Parallel and
Distributed Processing Symposium Workshops (IPDPSW), 2016, pp. 139–147,
https://doi.org/10.1109/IPDPSW.2016.101.

[51] A. Jarrah, A. Al-Tamimi, T. Albashir, Optimized parallel implementation of


extended kalman filter using FPGA, J. Circuits Syst. Comput. vol. 27(1), 2017
1850009(1–22). doi:10.1142/S0218126618500093.

[52] J. Soh, X. Wu, An fpga-based unscented Kalman filter for System-On-Chip


Applications, IEEE Trans. Circuits Syst. II: Express Briefs 64(4) (2017) 447–451.
doi:10.1109/TCSII.2016.2565730.

[53] P. Zhang, W. Li, X. Yang, Efficient implementation of recursive multi-frame track-


before-detect algorithm based on FPGA, in: Proc. 2019 International Conference
on Control, Automation and Information Sciences (ICCAIS), 2019, pp. 1–6.

[54] Q. Iqbal et al., Design and fpga implementation of an adaptive video subsampling
algorithm for energy-efficient single object tracking, in: Proc. 2020 IEEE
International Conference on Image Processing (ICIP), 2020, pp. 3065–3069.

[55] Intel, FPGA vs GPU for deep learning [Online] Available:


https://www.intel.com/content/www/us/en/artificialintelligence/programmable
/fpg a-gpu.html (Accessed on 23rd March, 2022)

[56] W. Liu, D. Anguelov, D. Erhan, C. Szegedy et al., “SSD: Single Shot MultiBox
Detector,” in European Conference on Computer Vision, Cham, Switzerland, pp.
21-37, 2016.
[57] K. He, X. Zhang, S. Ren, J. Sun, “Spatial Pyramid Pooling in Deep Convolutional
Networks for Visual Recognition,” in European Conference on Computer Vision,
Cham, Switzerland, pp. 346-361, 2014.
[58] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich Feature Hierarchies for
Accurate Object Detection and Semantic Segmentation," 2014 IEEE Conference
on Computer Vision and Pattern Recognition, pp. 580-587, 2014.
[59] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object
Detection with Region Proposal Networks," in IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 2017.

[60] S. Liu, et al., "Path Aggregation Network for Instance Segmentation," 2018
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8759-
8768, 2018.

[61] T. -Y. Lin et al., "Feature Pyramid Networks for Object Detection," 2017 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936-944,
2017.

[62] J. Redmon, A. Farhadi, “CNN9000: Better, Faster, Stronger,” In: IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pp. 7263
—7271, 2017.

[63] A. Bochkovskiy, C.Y. Wang, H.Y.M Liao, “CNN: Optimal Speed and Accuracy of
Object Detection,” arXiv:2004.10934, 2020.

[64] Y. Zhou and J. Jiang, “An FPGA-based accelerator implementation for deep
convolutional neural networks. In Proceedings of the 2015 4th International
Conference on Computer Science and Network Technology, ICCSNT 2015,
Harbin, China, vol. 1, pp. 829–832, 2015.

[65] A. Shawahna, S.M Sait and A. El-Maleh, “FPGA Based Accelerators of Deep
Learning Networks for Learning and Classification: A Review. IEEE Access, 7, pp.
7823–7859, 2019.

[66] Wang, E., Davis, J., Zhao, R., Ng, H-C., et al.: Deep neural network approximation
for custom hardware. where we have been, where we are going. ACM Comput.
Surv. 52(2), pp. 1–39, 2019.

[67] M. A Dias and D. A. P Ferreira, “Deep Learning in Reconfigurable Hardware: A


Survey. In: IEEE International Parallel and Distributed Processing Symposium
Workshops (IPDPSW), Rio de Janeiro, Brazil, pp. 95–98, 2019.

[68] Blaiech, A.G., Khalifa, K.-B., Valderrama, CV., et al.: A Survey and Taxonomy of
FPGA-based Deep Learning Accelerators. J. Syst. Architect. 98, 331–345, 2019.
[69] HajiRassouliha, A., Taberner, A.J., Nash, M.P., Nielsen, P.M.F.: Suitability of
recent hardware accelerators (DSPs, FPGAs, and GPUs) for computer vision and
image processing algorithms. Signal Process. Image Comm. 68, 101–119, 2018.

[70] Tong, K., Wu, Y., Zhou, F.: Recent advances in small object detection based on
deep learning: A review. Image Vis. Comput. 97, 103910, 2020.

[71] El-Shafie, A.-H.A., Habib, S.E.: Survey on hardware implementations of visual


object trackers. IET Image Process. 13, pp. 863–876, 2019.

[72] C. Ding, S. Wang, N. Liu, N., Xu, K., et al., “REQ-CNN: A Resource-Aware, Efficient
Quantization framework for Object Detection on FPGAs,” In: 2019 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA,
pp. 33–42, 2019.

[73] J. Wang, J. Lin, Z. Wang, “Efficient Hardware Architectures for Deep


Convolutional Neural Network,” IEEE Trans. Circuits Syst. I: Regul. Pap. 65(6),
1941–1953, 2018.

[74] Q. Mao et al, “Mini-CNNv3: Real-Time Object Detector for Embedded


Applications,” IEEE Access 7, 133529–133538, 2019.

[75] H. Nakahara, et al. “A Lightweight CNNv2: A Binarized CNN with A Parallel


Support Vector Regression for an FPGA,” In:2018 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, pp. 31–40
2018.

[76] D.T Nguyen, T.N Nguyen, H. Kim, H-J. Lee, “A High-Throughput and Power-
Efficient FPGA Implementation of CNN CNN for Object Detection,” IEEE Trans.
Very Large Scale Integr. (VLSI) Syst. 27(8), 1861—1873, 2019.

[77] Z. Yu, CS Bouganis, “A Parameterisable FPGA Tailored Architecture for CNNv3-


Tiny,” Proc. International Symposium on Applied Reconfigurable Computing,
Cham, Switzerland, pp. 330–344, 2020.

[78] K. Guo, L. Siu, J. Qiu, S. Yao, et al. “Angel-Eye: A Complete Design Flow for
Mapping CNN onto Customized Hardware,” In: IEEE Computer Society Annual
Symposium on VLSI (ISVLSI), Pittsburgh, PA, USA, pp. 24–29, 2016.
[79] G. Wei, Y. Hou, Q. Cui, G. Deng, et al., “CNN Acceleration using FPGA
Architecture,” In: IEEE/CIC International Conference on Communications in China
(ICCC), Beijing, China, pp. 734–735 (2018)

[80] C. Zhang, P. Li, G. Sun, Y. Guan, et al. “Optimizing FPGA-based Accelerator Design
for Deep Convolutional Neural Networks,” In: 2015 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, pp. 161–
170 , 2018.

[81] Cambay, V.Y., Uc¸ar, A., and Arserim, M. A.: Object Detection on FPGAs and
GPUs by Using Accelerated Deep Learning. In: International Artificial Intelligence
and Data Processing Symposium (IDAP), Malatya, Turkey, pp. 1–5, 2019.

[82] D. Pestana, et al. “A Full Featured Configurable Accelerator for Object Detection
With CNN. IEEE Access, 9, pp. 75864–75877, 2021.

[83] N. Zhang, X. Wei, H. Chen and W. Liu, “FPGA Implementation for CNN-Based
Optical Remote Sensing Object Detection,” Electronics, 2021.

[84] R.E. Kalman, A new approach to linear filtering and prediction problems, Trans.
ASME–J. Basic Eng. 82 (1) (1960) 35–45.

[85] Q. Li, R. Li, K. Ji, W. Dai, Kalman filter and its application, in: 2015 8th International
Conference on Intelligent Networks and Intelligent Systems (ICINIS), IEEE, 2015,
pp. 74–77.
