
IET Computers & Digital Techniques

DOI: 10.1049/cdt2.12016

REVIEW

Received: 18 July 2020    Revised: 5 December 2020    Accepted: 11 December 2020

Accelerating Deep Neural Networks implementation: A survey

Meriam Dhouibi | Ahmed Karim Ben Salem | Afef Saidi | Slim Ben Saoud

Advanced Systems Laboratory, Tunisia Polytechnic School, University of Carthage, BP 743, 2078 La Marsa, Tunisia

Correspondence
Meriam Dhouibi, Advanced Systems Laboratory, Tunisia Polytechnic School, University of Carthage, BP 743, 2078, La Marsa, Tunisia.
Email: [email protected]

Abstract
Recently, Deep Learning (DL) applications are getting more and more involved in different fields. Deploying such Deep Neural Networks (DNN) on embedded devices is still a challenging task considering the massive requirements of computation and storage. Given that the number of operations and parameters increases with the complexity of the model architecture, the performance will strongly depend on the hardware target resources and basically on the memory footprint of the accelerator. Recent research studies have discussed the benefit of implementing some complex DL applications based on different models and platforms. However, it is necessary to guarantee the best performance when designing hardware accelerators for DL applications to run at full speed, despite the constraints of low power, high accuracy and throughput. Field Programmable Gate Arrays (FPGAs) are promising platforms for the deployment of large-scale DNNs which seek to reach a balance between the above objectives. Besides, the growing complexity of DL models has made researchers think about applying optimization techniques to make them more hardware-friendly. Herein, the DL concept is presented. Then, a detailed description of the different optimization techniques used in recent research works is explored. Finally, a survey of research works aiming to accelerate the implementation of DNN models on FPGAs is provided.

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
© 2021 The Authors. IET Computers & Digital Techniques published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.

IET Comput. Digit. Tech. 2021;15:79-96.    wileyonlinelibrary.com/journal/cdt2

1 | INTRODUCTION

Recently, DL technology has been used successfully for a variety of tasks in several fields of applications related to signal, information and image processing, such as computer vision [1], Natural Language Processing (NLP) [2], medical applications [3], video games [4] and all areas of science and human activity. DL models like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) continue to make great progress in solving complex problems. However, the deployment of such models is a hard task considering the massive amount of computation and the big storage requirements. Therefore, the performance of the model depends on the target hardware resources. The training and the inference phases of DL models are executed on powerful computation machines using advanced technologies such as new multicore Central Processing Units (CPUs), Graphics Processing Units (GPUs) or clusters of CPUs and GPUs. Usually, GPU platforms are better at supporting training and inference of more sophisticated models. GPU technology offers a high computation capacity, but managing the interdependence of the data is expensive in terms of power. Application Specific Integrated Circuits (ASICs) can achieve even higher performance and can improve the energy efficiency, which is a key factor in embedded systems. However, the deployment of a DL model on a customised ASIC requires high investments due to a long and complex design cycle. Recently, FPGAs have become a promising solution to accelerate inference: they offer the performance advantages of reconfigurable logic with a high degree of flexibility. A specific hardware design on such platforms can be more efficient in speed and energy than other platforms. Moreover, the deployment of large-scale DNNs with large numbers of parameters is still a daunting task, because the large dimensionality of such models increases the computation and data movement. So, to deploy such sophisticated models on embedded platforms and to obtain a more robust model, the internal operations and number of parameters can be reduced by optimising the network architecture. Several optimization techniques have been discussed in the literature. One of the most popular optimization approaches that makes models faster, more energy efficient and more hardware friendly is model compression, which includes low data precision, network pruning, low-rank approximation, etc.
Furthermore, for efficient implementation of an optimised DL model, further acceleration improvement is required. Indeed, it is necessary to maximise the utilization of all the opportunities offered at several levels of hardware/software codesign to achieve high performance in terms of precision, energy consumption and throughput. This survey takes a deep dive into DL implementation on advanced and dedicated computation platforms and reveals its bottlenecks. In addition, it focusses on hardware and software techniques to optimise the implementation of DNNs and also provides a summary of recent research work. Some surveys dealing with DL implementation have been published. However, those papers have not discussed the state of the art across different hardware platforms. Most of the recent surveys have focussed on FPGA-based CNN acceleration without pointing out the choice of FPGA over other platforms. Another strong aspect of our work is that we discuss the optimization of DNNs at both the software and the hardware level. Furthermore, we present a classification of advanced hardware acceleration techniques based on throughput and energy optimizations. An investigation of the algorithmic side and its effect on designing accelerators is also included in this survey. Additionally, we expose the tools that can automatically generate hardware designs from software, which are used for implementing and evaluating deep learning approaches. Herein,

- Section 2 presents the basics of DL and its popular models and architectures currently in use and highlights the complexity of these models.
- Section 3 describes the various hardware platforms used to implement DNNs.
- Section 4 exposes the optimization techniques that can be applied to make the model more efficient in terms of speed and power.

Finally, a synthesis of the different acceleration techniques explored in recent research works is given and analysed.

2 | BACKGROUND AND MOTIVATIONS

Currently, DL represents the leading-edge solution in virtually all relevant machine learning tasks in a large variety of fields [5,6]. DL algorithms show significant improvement over traditional machine learning algorithms based on manual extraction of relevant features (handcrafted features) [7]. DL models perform a hierarchical feature extraction and also show better performance as the amount of data increases [8]. There are different methods and architectures of DL, such as the Multi-Layer Perceptron (MLP), Autoencoder (AE), Deep Belief Network (DBN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), Generative Adversarial Network (GAN), Deep Reinforcement Learning (DRL), etc. [9]. These models have covered several fields with a variety of applications. Particularly, CNN models have demonstrated impressive performance in computer vision applications such as autonomous car vision systems [10], drone navigation, robotics [11], etc. CNNs have also proved to be effective in the medical field, especially in image recognition: they have been shown to detect a tumour or any other type of lesion better than the most experienced radiologists [12]. In Ref. [13], an image extracted from Magnetic Resonance Imaging (MRI) of a human brain was processed to predict Alzheimer's disease using a CNN. DL models are also used in drug research by predicting molecular properties such as toxicity or binding capacity. In particular, DL can be used to simulate biological or chemical processes of different molecules without the need for expensive software simulators and is 30,000 times faster [14]. Moreover, RNN models have excelled in natural language processing, including automatic speech recognition, recommendation systems, audio recognition, machine translation, social media filtering, etc. For example, various LSTM models have been proposed for sequence-to-sequence mapping that are suitable for machine translation [15]. Furthermore, CNNs and RNNs have been combined to add sounds to silent movies [16] and to generate captions that describe the contents of images [17]. Besides, it is important to note that the effective implementation of DL models on embedded platforms is behind this diffusion of such applications. The performance of such AI algorithms using DL models lies in the capacity of processors to support the DNN with its varied number of layers, neurons per layer, multiple filters, filter sizes and channels while treating large datasets. Indeed, DL workloads are both computation and memory intensive. For example, the well-known CNN ResNet50 [18] requires up to 7.7 billion floating point operations (FLOPs) and 25.6 million model parameters to classify a 224 × 224 × 3 image. As shown in Figure 1, the more complex and larger model VGG16 [19], with a model size of 138.3 million parameters, requires up to 30.97 Giga FLOPs (GFLOPs). Thus, the number of operations and parameters increases with the complexity of the model architecture. Table 1 presents the state-of-the-art models' sizes and complexities.

VGG models were developed by the Visual Geometry Group of the University of Oxford and are among the most preferred choices in the community for extracting features from images. They are widely used in many applications despite their expensive architecture in terms of both parameter count and computational requirements (Figure 1). The large dimensionality of these models increases the computation and data movement. More precisely, it increases the amount of generated data, whose movement is considered more expensive than computation in terms of power on hardware platforms [21]. At this inflection point, it is therefore necessary to benefit from new design methodologies, to make good use of new design opportunities and to explore optimization techniques to reduce the network size and to enhance the implementation performance in terms of throughput and energy consumption. Besides, the choice of a suitable hardware platform to implement a DL model is of paramount importance [24]. In the next section, we explore the different computation platforms for DL implementation.
FIGURE 1  Computational cost of the most popular models: inference on the ImageNet dataset [20]

TABLE 1  Size and complexity of state-of-the-art models

Model                 | AlexNet [22] | VGG16 [19] | VGG19 [19] | ResNet50 [18] | ResNet152 [18] | GoogLeNet [23]
Operations (GFLOPs)   | 1.4          | 30.97      | 39         | 7.7           | 22.6           | 1.57
Parameters (M)        | 58.3         | 138.3      | 144        | 25.6          | 57             | 6
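As a rough illustration of how figures like those in Table 1 are tallied, the sketch below estimates the parameter count and multiply-accumulate (MAC) count of a single convolutional layer. It is written for this survey (plain Python, the function name and 'same'-padding assumption are ours, not from any cited work); whole-model totals also include fully connected layers and follow each paper's own counting conventions.

```python
def conv_layer_cost(h, w, cin, cout, k, stride=1):
    """Parameter count and MAC count of one k x k convolutional layer
    (bias ignored, 'same' padding assumed)."""
    params = k * k * cin * cout
    out_h, out_w = h // stride, w // stride
    macs = params * out_h * out_w          # every output pixel reuses all weights
    return params, macs

# Example: the first 3x3, 64-channel layer of VGG-16 on a 224x224x3 input.
p, m = conv_layer_cost(224, 224, 3, 64, 3)
print(f"params = {p:,}, MACs = {m:,} (~{2 * m / 1e9:.2f} GFLOPs counting mul+add)")
```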

3 | COMPUTATION PLATFORMS FOR DL IMPLEMENTATION

The employment of DL in daily applications of different fields will depend on the ease with which it is possible to deploy a DL model on small, low-power devices rather than large servers. In the majority of cases, the training phase is performed in the cloud. However, the inference phase is less demanding; it can happen locally or in the cloud depending on the application [24]. Research is underway on implementing the two phases using parallel architectures on different hardware targets and computing devices. Four major types of technology are being used to accelerate DNNs: CPU, GPU, FPGA and ASIC.

3.1 | Central processing units

Traditionally, DNNs were mainly tested on the CPU of a computer. The CPU works by sequentially performing the computations that are sent to it. Sometimes, a programme has different tasks that can be calculated independently of each other. To optimise the time required to complete all tasks, many processors have multiple threads or cores that can perform parallel calculations. Some manufacturers have sought to optimise the hardware architectures of their processors to meet the needs of DL: Intel has tweaked the CPUs of its servers to improve their performance with DL [25]. Google has developed a chip to perform DL tasks more economically [26]. However, it is still very difficult for CPUs, even with a multicore architecture, to support the high computation and storage complexity of large DNN models.
TABLE 2  Reduced-precision effect on DNN models (bitwidths after reduction; accuracy loss is relative to the float 32-bit top-1 baseline)

Precision reduction                    | Technique                  | DL model    | Input | Weight          | Activation | Gradient | Float 32-bit baseline (top-1) | Accuracy loss
Reduce weight                          | [49]                       | MobileNetV1 | -     | 8-bit           | 32-bit     | 32-bit   | 70.77%                        | 2.74%
                                       | INQ [50]                   | ResNet-18   | -     | 5-bit           | 32-bit     | 32-bit   | 68.27%                        | -0.71%
                                       |                            |             | -     | 4-bit           | 32-bit     | 32-bit   |                               | -0.62%
                                       |                            |             | -     | 3-bit           | 32-bit     | 32-bit   |                               | 0.19%
                                       |                            |             | -     | 2-bit (ternary) | 32-bit     | 32-bit   |                               | 2.25%
                                       | TWN [51]                   | ResNet-18   | -     | 2-bit (ternary) | 32-bit     | 32-bit   | 68.27%                        | 6.47%
                                       | BWN [52]                   | ResNet-18   | -     | 1-bit           | 32-bit     | 32-bit   | 68.27%                        | 7.47%
                                       | BWNH [64]                  | ResNet-18   | -     | 1-bit           | 32-bit     | 32-bit   | 68.27%                        | 3.97%
Reduce weight and activation           | FFN [65]                   | AlexNet     | -     | 2-bit           | 32-bit     | 32-bit   | 57.20%                        | 1.70%
                                       | Ristretto [53]             | CaffeNet    | -     | 8-bit           | 8-bit      | 32-bit   | 56.90%                        | 0.90%
                                       | Balanced quantization [54] | GoogLeNet   | -     | 8-bit           | 8-bit      | 32-bit   | 71.50%                        | 4.90%
                                       |                            |             | -     | 4-bit           | 4-bit      | 32-bit   |                               | 3.80%
                                       | QNN [55]                   | GoogLeNet   | -     | 4-bit           | 4-bit      | 32-bit   | 71.50%                        | 5.10%
                                       | HWGQ [56]                  | GoogLeNet   | -     | 1-bit           | 2-bit      | 32-bit   | 68.70%                        | 5.70%
Reduce input, weight and activation    | BNN [57]                   | AlexNet     | 1-bit | 1-bit           | 1-bit      | 32-bit   | 57.20%                        | 30.10%
                                       | XNOR-Net [52]              | AlexNet     | 1-bit | 1-bit           | 1-bit      | 32-bit   | 56.60%                        | 12.40%
Reduce weight, activation and gradient | DoReFa-Net [58]            | AlexNet     | -     | 8-bit           | 8-bit      | 8-bit    | 55.90%                        | 2.90%
                                       |                            |             | -     | 1-bit           | 4-bit      | 6-bit    |                               | 7.70%
                                       |                            |             | -     | 1-bit           | 3-bit      | 6-bit    |                               | 8.80%
                                       |                            |             | -     | 1-bit           | 2-bit      | 8-bit    |                               | 9.60%
                                       |                            |             | -     | 1-bit           | 2-bit      | 6-bit    |                               | 9.80%
                                       |                            |             | -     | 1-bit           | 1-bit      | 8-bit    |                               | 16.40%
                                       |                            |             | -     | 1-bit           | 1-bit      | 6-bit    |                               | 16.40%

3.2 | Graphics processing units

A GPU excels at parallel computing. A CPU typically has between one and eight cores, while high-end GPUs have thousands of cores (e.g. the GeForce GTX TITAN Z included 5760 cores; a more recent example is the GeForce RTX 2080). GPUs are slow during sequential operations, but shine when given tasks that can run in parallel. Since the operations required to run a DL algorithm can be done in parallel, GPUs have become extremely valuable tools. Furthermore, by using OpenCL [27], an open standard for portable parallelisation, compute kernels written using a limited subset of the C programing language can be launched on GPUs. In this perspective, NVIDIA has invested much in its CUDA (Compute Unified Device Architecture) language to make it support most DL development frameworks. Similar to OpenCL, CUDA affords a general-purpose programing environment and enables parallel processing over NVIDIA GPU cores. NVIDIA GPUs are currently the most used for implementing DL algorithms. Most lately, NVIDIA [28] invented NVDLA, a scalable and highly configurable open source accelerator for DL inference, to simplify integration and portability. In late 2018, AMD announced the first 7 nm (nanometer) GPU specifically designed for DL. The company's new Radeon delivers up to 7.4 TFLOPS (trillions of floating point operations per second). AMD also revealed software to improve performance [29]. This proves the interest shown by manufacturers in choosing the right hardware that best suits the deployment of these DNNs. The NVIDIA Tesla V100, for example, embeds 640 'Tensor' cores. These units offer neural networks a high computing capacity of over 100 teraflops and are particularly suited to popular development frameworks.
3.3 | FPGA

When evaluating the acceleration of hardware platforms, the trade-off between flexibility and performance must inevitably be taken into consideration. FPGAs serve as a good compromise between flexibility and performance. They are reconfigurable integrated circuits with programmable processor cores. They offer the performance advantages of integrated circuits with a high degree of flexibility. At a low level, FPGAs can implement sequential logic using Flip-Flops (FFs) and combinational logic using Look-Up Tables (LUTs). FPGAs also contain hardened components for functions that are commonly used, such as full processor cores, communication cores, arithmetic cores and RAM blocks. In addition, the adoption of the System-on-Chip (SoC) design approach, in which the ARM coprocessors and FPGA logic cells are generally located on the same chip, has enhanced the flexibility of such devices. The current FPGA market is dominated by Intel (formerly Altera) and Xilinx, representing a combined market share of 85% [30]. On FPGAs, programmable logic cells can be used to implement the data and control path. They are also able to exploit the distributed on-chip memory and the pipeline parallelism that is naturally part of deep feed-forward networks. FPGAs also support partial dynamic reconfiguration, which may have implications for large DL models, where individual layers could be reconfigured on the FPGA without disrupting the current calculation in the other layers. To speed up hardware designs, FPGA platforms can be a promising option compared to GPUs. With fixed architectures like GPUs, a software execution model is followed and structured around the execution of tasks in parallel on independent computing units: the goal of developing DL techniques for GPUs is to adapt the algorithms to follow this architecture, where the computation is carried out in parallel and where the interdependence of the data is ensured. However, when developing DL techniques for FPGAs, it is less important to adapt algorithms to a fixed computation structure, which allows more flexibility to explore algorithm optimizations. Techniques that require many complex low-level hardware control operations that are difficult to implement in high-level software languages are of particular interest for FPGA implementations. Recently, software-level programing models for FPGA have been adopted, including OpenCL, High-Level Synthesis (HLS), C and C++, making it a more attractive option [31]. In this perspective, Xilinx invented PYNQ to design embedded systems with their Zynq SoCs in an easier way. It uses the Python language and libraries, which offer the benefits of the programmable logic and microprocessors in Zynq and help build high-performance embedded DL applications [32,33]. Furthermore, to accelerate DL inference with optimised and tuned hardware and software, Xilinx unveiled an adaptive compute acceleration platform (ACAP), Versal [34], a new heterogeneous compute architecture. Versal delivered higher performance (8×) than high-end GPUs. More recently, Xilinx designed an integrated IP block for Zynq SoC and MPSoC devices, a programmable engine dedicated to CNNs called the DL Processor Unit (DPU) [35]. Lately, FPGA-based accelerators like the Xilinx Alveo cards [36] with new architectures have appeared more often. They offer ready-to-programme FPGAs on accelerator cards which can be directly plugged into servers and allow reconfigurable acceleration that adapts to the continuous optimization of DL algorithms. For example, when executing inference, the Alveo U250 reduces latency by 3× over GPUs. Another FPGA-based multi-accelerator platform, Maxeler's MPC-X 2000 [37], which supports reconfigurable designs, is widely used. It comprises Data Flow Engines (DFEs), each using a Xilinx Virtex-6 FPGA. Currently, the cloud represents a simple and efficient solution for using FPGAs without investing in specific hardware. In major cloud platforms and modern data centres, FPGA-based accelerators have shown impressive performance in terms of parallel computing and power consumption. The Microsoft Azure cloud computing platform integrates Altera Stratix FPGAs [38]. Amazon AWS provided Elastic Compute Cloud (EC2) F1 [39], a compute instance designed to accelerate data centre workloads including DL inference. Equipped with eight Virtex UltraScale+ VU9P FPGAs, the F1 instance can perform up to 170 TOPs (tera operations per second) with INT8 data representation. An FPGA computing instance provides an easy way to create FPGA designs with dedicated and customised hardware accelerators, based on the cloud elastic computing framework. Alibaba Cloud [40], Huawei Cloud [41], Tencent Cloud [42], Baidu Cloud [43] and many others have also launched FPGA services. Alibaba Cloud's F1 instance is based on an Intel Arria 10 GX 1150 computing card. The instance introduced by Tencent Cloud is based on a Xilinx Kintex UltraScale KU115 FPGA. This progress on the hardware devices side goes hand in hand with progress on the software side. To make FPGAs much easier to use, Xilinx provided the Vitis Unified Software Platform, a development environment to design and deploy accelerated applications on Xilinx platforms such as ACAP, FPGA instances in the cloud, Alveo cards and embedded platforms. Vitis AI, an integral part of Vitis, allows the acceleration of DL applications. It supports the Tensorflow and Caffe frameworks and provides tools and APIs to optimise pre-trained DL models by applying pruning and quantization techniques. With specifically designed hardware, FPGAs can exceed GPUs not only in energy efficiency but also in speed.

3.4 | ASIC

ASICs are designed for a specific fixed functionality or application. During its operating life, a customised ASIC has a fixed logic function because its digital circuitry is made up of gates and flip-flops permanently connected in silicon. Several research works have focussed on building customised ASICs to accelerate DL model training and inference [44]. Compared to FPGAs, ASIC platforms with a customised architecture are more efficient in terms of power and speed. An ASIC can perform fixed operations extremely fast since the entire chip's logic area can be devoted to a set of narrow functions. Despite its high performance, designing an ASIC can be highly expensive due to the complexity of its construction process. A customised ASIC needs verification and frequent updates to keep abreast of new techniques. Moreover, the rapid evolution of DL models requires design changes, which is costly in terms of time and price for ASICs. Even with a lack of flexibility, ASICs are still an attractive solution for dealing efficiently with the massive workloads of DL models. Currently, more than 100 companies are building ASICs targeted towards DL applications, including Google, Facebook, etc. Google designed the Tensor Processing Unit (TPU), a 28 nm customised ASIC to accelerate DL applications. The 700 MHz TPU performed 95 TFLOPs and 23 TFLOPs for 8-bit and 16-bit calculations respectively, whilst drawing only 40 W. TPU v2, announced in May 2017, is a four-ASIC board that can deliver 180 TFLOPs of performance. A year later, Google announced TPU v3 and improved the peak performance to 420 TFLOPs. In February 2018, cloud TPUs that power Google products like Translate, Search, Assistant and Gmail became available for use in the Google Cloud Platform (GCP) [45]. The TPU can handle both training and inference and it has the highest training throughput. More recently, the startup Habana Labs developed the HL-1000, a 16 nm custom ASIC chip [46]. The designed architecture is very similar to that of Google's TPU, using a large on-chip Static Random Access Memory (SRAM) and a large matrix-multiply accelerator. The only difference is that Habana includes eight programmable CPU cores to handle non-convolutional layers, whereas Google implements these layers in fixed-function logic. The startup Gyrfalcon Technology Inc (GTI) [47] introduced the Lightspeeur 2801S, Lightspeeur 2802M and Lightspeeur 2803S edge-based ASICs for the deployment of AI applications. The Lightspeeur 2801S, a 28 nm neural accelerator with no external memory and 28,000 parallel computing cores, performs up to 2.8 TOPs at 9.3 TOPs/W. Based on a 16-chip server, a 2803S performs 271 TOPs at 28 W [48]. ASICs are still more efficient than FPGAs. However, the combination of GPUs' training performance and FPGAs' efficiency and flexibility for inference can be an alternative and promising solution.
While running DNNs, it is still difficult for CPUs to achieve high performance levels compared to GPUs, FPGAs and ASICs, due to the massive computation and memory bandwidth requirements. However, GPUs, with their high memory bandwidth and throughput, are the most widely used for training DNNs. GPUs' high performance is due to their parallel processing. However, they consume a large amount of power. FPGAs and ASICs can also offer very high bandwidth by being directly connected to inputs. Moreover, compared with GPUs, FPGAs and ASICs can provide higher performance with lower power consumption while running DL algorithms. As DL models rapidly evolve and change, FPGAs offer more flexibility and reconfigurability than ASICs. Additionally, FPGAs benefit from new tools that make programing DNN applications much easier.

For further improvement of performance, various optimization techniques have been proposed. In the next section, we give an overview of some of the most used techniques.

4 | OPTIMIZATION TECHNIQUES

There are several techniques focussed on modifying DL algorithms to make them more hardware-friendly with minimal loss of accuracy. Many approaches have been explored to effectively digest the redundancy of models and provide improved computing efficiency, such as low data precision, network pruning and Low-Rank Approximation (LRA).

4.1 | Precision reduction

The use of lower precision in representing data to run DNNs reduces the storage demand of the DNN models, lowering the data bandwidth requirements. It optimises the computing efficiency and improves performance. However, special attention must be paid to the possible degradation of accuracy. From the algorithmic perspective, recent research work can be divided into three categories: weight precision reduction, precision reduction of both weights and activations, and precision reduction of inputs, weights and activations. Many researchers targeted weight precision reduction, since reducing weight precision directly reduces the network size. In Ref. [49], a quantization-friendly scheme applied to the MobileNetV1 model reached an accuracy of 68.03% with an 8-bit weight representation, which almost closed the gap to the floating point representation (70.77%). Zhou et al. presented INQ [50], a generalized quantization framework to convert any pre-trained full-precision CNN model with 32-bit floating point into a lossless low-precision version of weights with 5-bit, 4-bit, 3-bit and even 2-bit. The use of this framework on ResNet-18 improved accuracy for 5-bit and 4-bit quantization by 0.71% and 0.62% respectively. Li et al. squeezed the representation to 2-bit in Ref. [51], which resulted in 6.47% accuracy degradation. Also, Rastegari et al. proposed a binary-weight network called BWN in Ref. [52]. BWN gained 32× memory saving with 12.4% accuracy degradation. Other recent research works applied the precision reduction technique to weights and activations. Indeed, in Ref. [53], CaffeNet inference is successfully performed with an 8-bit fixed-point representation of weights and activations, resulting in less than 1% degradation of accuracy. A Balanced Quantization method is introduced in Ref. [54]. It performed 66.6% top-1 accuracy when applying an 8-bit representation of weights and activations on GoogLeNet, which is less than 5% degradation compared to the float 32-bit baseline. Moreover, a quantized version of GoogLeNet with 4-bit weights and activations in Ref. [55] achieved 66.5% top-1 accuracy, which is a 5.1% drop in accuracy. Cai et al. [56] introduced Half-Wave Gaussian Quantization (HWGQ), which reduced the accuracy by 5.7% on GoogLeNet with binary weights and ternary activations. Some other studies have shown that quantized inputs, weights and activations can achieve better computational efficiency. The binarisation of inputs, weights and activations is explored in Ref. [57]. The authors proposed fully Binarised Neural Networks (BNN) that drastically reduce memory size and accesses. Based on BWN, Rastegari et al. [52] presented XNOR-Net by binarising all activations, resulting in 58× faster convolutional operations. XNOR-Net achieved better accuracy than BNN [57]. In Ref. [58], the authors proposed DoReFa-Net, a method that uses low bitwidth parameter gradients to train CNNs with low bitwidth inputs, weights and activations. DoReFa-Net achieved accuracy comparable to the 32-bit baseline on the SVHN and ImageNet datasets. Detailed results are summarised in Table 2.
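To make the bitwidth/accuracy trade-offs above concrete, the following minimal sketch applies per-tensor symmetric post-training quantization to a weight matrix. It is an illustration only, assuming NumPy; it is not the exact scheme of INQ, TWN, DoReFa-Net or any other surveyed method, which rely on finer-grained scaling and retraining to recover accuracy.

```python
import numpy as np

def quantize_symmetric(w, n_bits=8):
    """Uniform symmetric quantization of a weight tensor with one scale per tensor.
    Returns the integer codes, the scale and the dequantized ('fake quantized')
    weights used to measure the accuracy impact."""
    qmax = 2 ** (n_bits - 1) - 1              # e.g. 127 for 8 bits
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale, q * scale

# Example: quantize one layer and look at the reconstruction error.
w = np.random.randn(256, 128).astype(np.float32) * 0.05
for bits in (8, 4, 2):
    _, _, w_hat = quantize_symmetric(w, bits)
    print(f"{bits}-bit weights: mean |error| = {np.abs(w - w_hat).mean():.5f}")
```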
From the hardware perspective, a lot of work applied fixed-point representations to implement DNNs and substantially reduced the bitwidth for energy and area savings and throughput increases. In Ref. [59], LSTM models (Google LSTM and Small LSTM) with a 16-bit fixed-point data type were implemented on two FPGA platforms, resulting in only 1.23% precision degradation. Moss et al. presented an FPGA-based customisable matrix multiplication framework to run DNNs [60]. It allowed runtime switching between static-precision bit-parallel and dynamic-precision bit-serial MAC (Multiply and Accumulate) implementations. The experimental results on AlexNet, VGGNet and ResNet reached up to 50× throughput increases versus FP32 baselines. In Ref. [61], the authors implemented Google LSTM on a Xilinx FPGA using 12-bit fixed point, which resulted in better performance and only 0.3% precision degradation. In Ref. [62], Shen et al. implemented VGG-16 and C3D across multiple FPGA platforms with DSPs (Digital Signal Processors) that support one 16-bit fixed-point multiply and add. It achieved an end-to-end performance 7.3× better than the software implementation. Following the same strategy, Zhang et al. [63] achieved a 3.1× throughput speedup with the implementation of a long-term recurrent convolutional network (LRCN) on a Xilinx FPGA using fixed-point quantization. Although the use of this technique offers a substantial gain in throughput and energy efficiency, representing data values with fewer than 8 bits in large DNNs can increase the accuracy degradation (Table 2).

4.2 | Pruning

Neural networks are considered over-parametrised, as there is a large number of redundant parameters that have little influence on the accuracy but are costly in computation as well as in memory footprint. These parameters can be removed through a process called pruning, which is often followed by some fine tuning to improve the accuracy. Recently, several research studies [75,76] have shown the effectiveness of this technique for model size reduction, the amount of computation and, indirectly, the energy consumption, with minimal accuracy degradation. There are many pruning methods in terms of weights, filters, channels and feature maps. The core idea of weight pruning is to remove redundant weights by setting them to zero. Rather than searching exhaustively for the weights to be pruned per layer, Ref. [66] explored a technique to automatically find the possible pruned weight sets while minimising the loss over all weights. The test error of this method on ResNet110 and ResNet56 was respectively 6.50% and 6.67%. To guarantee the weight reduction ratio, Zhang et al. [67] proposed a systematic framework for weight pruning of DNNs based on the alternating direction method of multipliers (ADMM). This approach achieves weight reduction on the LeNet-5 and AlexNet models of 71.2× and 21× respectively, with no accuracy loss. Yang et al. [68] proposed the Energy-Aware Pruning (EAP) technique for weight pruning using the energy consumption estimation of a CNN. This method achieves an energy consumption reduction for GoogLeNet and AlexNet by 1.6× and 3.7× respectively, compared to their original models, with less than 1% top-5 accuracy loss. For filter pruning, the basic idea is to remove unimportant filters based on an estimation of each filter's importance. Li et al. [69] reported a methodology to prune whole filters and their related feature maps by using the sum of the absolute values of a filter as its importance measure. This approach reduced the inference cost of VGG-16 and ResNet-110 by 34% and 38% respectively, while maintaining nearly the original accuracy. The study by Huang et al. [70] suggested a 'try-and-learn' learning algorithm to prune filters in CNNs while maintaining the performance. The proposed algorithm removes 63.7% of redundant filters in FCN-32s and accelerated the inference by 37.0% on GPU and 49.1% on CPU. Recently, a new method for filter pruning, based on sparsity induction of weights, was explored in Ref. [71]. The proposed technique achieves FLOPs reduction on VGG-16 on the two datasets CIFAR10 and GTSRB by 90.50% and 96.6% respectively, without accuracy loss. Channel pruning reduces the model size by removing channels and the related filters as well as the corresponding feature maps. Several channel pruning methods have been proposed; for instance, Ref. [72] investigated a method for channel selection called Discrimination-aware Channel Pruning (DCP). Experiments of this method on ResNet-50 showed that with a 30% reduction of channels it outperforms several state-of-the-art methods by 0.39% in top-1 accuracy. The study by Liu and Wu [73] proposed a new channel pruning criterion based on the mean gradient of feature maps, which effectively reduces the network FLOPs. Using this approach on VGG-16 and ResNet-110 achieves 5.64× and 2.48× reduction in FLOPs, with less than 1% and 0.08% decrease in accuracy, respectively. Liu et al. [74] enforced a scaling factor during training for channel pruning. The effectiveness of this approach was evaluated with several CNN models (VGGNet, ResNet and DenseNet). For VGGNet, it achieves a 20× reduction in model size and a 5× reduction in computing operations. More details are presented in Table 3.
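The two basic pruning granularities discussed above can be sketched in a few lines of NumPy (an illustrative sketch written for this survey, not the implementation of any surveyed framework). The first function performs unstructured magnitude pruning of individual weights; the second ranks whole filters by the sum of their absolute weights, the criterion reported for Li et al. [69], and drops the weakest ones along with their output feature maps.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    """Unstructured pruning: zero out (approximately) the smallest |w| entries."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy(), np.ones_like(w, dtype=bool)
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    mask = np.abs(w) > thresh          # ties at the threshold are also pruned
    return w * mask, mask

def filter_prune(conv_w, keep_ratio=0.5):
    """Structured pruning: conv_w has shape (filters, channels, K, K);
    rank filters by sum of absolute weights and keep only the strongest."""
    scores = np.abs(conv_w).sum(axis=(1, 2, 3))
    n_keep = max(1, int(conv_w.shape[0] * keep_ratio))
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return conv_w[keep], keep
```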
To achieve speedup, pruning can be combined with other techniques used for optimization. The work in Ref. [77] investigated the benefits and costs of quantization and pruning as well as the combination of both. The evaluation of the approach on an NVIDIA Jetson TX2 showed that when using pruning, the inference time and energy consumption were reduced by 28% and 22.5% respectively, with little saving in storage size. However, when using quantization, the model storage size was reduced by 75% while the inference time and energy were reduced by 1.41× and 1.19× respectively. The combination of these techniques leads to a reduced model storage size (76%) with a small decrease in the top-1 prediction accuracy (less than 7%). This work showed that the combination of techniques depends on the architecture of the neural network and the purpose of the optimization: it shows a positive impact on the inference time for VGG-16, but it results in a longer inference time for ResNet50 and thus less benefit in energy consumption for ResNet50 than for VGG-16. Tung et al. [78] explored the incorporation of network pruning and weight quantization in a single learning framework named CLIP-Q, where both are performed in a joint and parallel manner. Compared to state-of-the-art results, the CLIP-Q technique achieves an improvement in compression rate for AlexNet, GoogLeNet and ResNet-50 of 51×, 10× and 15× respectively. Several studies have investigated this compression technique from the hardware perspective. For instance, Faraone et al. [79] suggested a filter pruning framework that utilises FPGA resources efficiently without accuracy degradation. The evaluation of this approach on the Xilinx KU115 board showed that the pruned AlexNet and TinyYolo networks achieved 2× speedup and 2× reduction in resources (LUTs, DSP, BRAM) without accuracy loss compared to the original networks. Posewsky et al. [80] proposed an FPGA-based accelerator for pruned DNN inference. This accelerator was implemented on a ZedBoard for evaluation. Compared to the software implementation, this approach achieves an improvement of 10× in energy efficiency and 3× in runtime. The hardware implementation of the pruned and non-pruned networks shows an accuracy loss of less than 0.5%. The study by Zhang et al. [81] proposed a compression strategy for CNNs based on pruning and quantization and an FPGA-based accelerator for the compressed CNN. The evaluation of the proposed system on a Xilinx ZCU104 for AlexNet showed an improvement in terms of latency and throughput on convolutional layers compared with CPU and GPU of 182.3× and 1.1× respectively, and an improvement in terms of energy efficiency of 822.0× and 15.8×, respectively.

4.3 | Low-rank approximation

Layer decomposition or LRA has been extensively explored to reduce computation complexity and improve efficiency. This method decomposes the model into a compact and approximate one with more lightweight layers by matrix decomposition. Denton et al. [89] applied an LRA of kernels to reduce computation in convolutional layers. The proposed model performed 2.5× speedup with little drop in accuracy (<1%). In Ref. [85], Wang et al. proposed a factorised convolutional layer that outperforms the standard one in performance-to-complexity ratio. The factorised network achieved similar performance to VGG-16 while requiring 42× less computation. The authors in Ref. [90] proved that the low-rank approximation technique can also be applied to decompose the weights in the FC layers, which resulted in up to 50% reduction in the number of parameters. Following the same strategy, Qiu et al. [91] applied LRA to the FC layer to reduce the number of weights; with 63% fewer parameters, this method performed 87.96% accuracy on VGG16-SVD. Also, to decompose pretrained weights, a Tucker decomposition is used in Refs. [82,92]. In Ref. [86], LRA was adopted for weights and inputs. Zhang et al. used a Generalized Singular Value Decomposition (GSVD) to reduce the accumulated error of multiple layers. By applying this method on VGG-16, the model achieved 4× speedup with only a 0.3% increase in top-5 error. Chen et al. proposed a Layer Decomposition-Recomposition Framework (LDRF) [86], in which they applied a Singular Value Decomposition (SVD) to the weight matrices. During the SVD decomposition, they lowered the rank of each layer to estimate the layer's valid capacity. On VGG-16, the proposed method reached 5.13× speedup with only 0.5% top-5 accuracy reduction. In Ref. [83], the authors showed that low-rank tensor decompositions can speed up large CNNs while maintaining performance. The proposed approach achieved 1.82× speedup with 5.0× weight reduction for AlexNet, with less than 0.4% accuracy degradation. The implementation of DNNs can be more effective when using the layer decomposition method. Wen et al. designed a new LRA to train a DNN model with lower ranks and higher computation efficiency [87] (Table 4). This method gained 2× speedup on GPU while maintaining the accuracy, and 4.05× speedup on CPU with low degradation in accuracy. To accelerate the CNN inference computation, Wang et al. proposed an approach based on low-rank and group-sparse tensor decomposition [88]. On VGG-16, this method achieved 6.6× speedup on CPU with less than 1% degradation in top-5 error. In Ref. [93], the authors proposed a framework to accelerate DNNs based on low-rank approximation. On FPGA, it achieved an average computation efficiency of 64.5%. LRA can obtain a compact and approximate network model. However, to learn an accurate network structure, LRA needs reiterations of decomposing, fine-tuning, etc., resulting in extra computation overhead.
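A minimal example of the idea behind these layer decompositions is a truncated SVD of a fully connected weight matrix, replacing one large matrix-vector product by two thinner ones. The sketch below is illustrative only (NumPy assumed, sizes and rank chosen arbitrarily) and omits the fine-tuning step that the surveyed methods use to recover accuracy.

```python
import numpy as np

def low_rank_fc(W, rank):
    """Factorise a fully connected weight matrix W (out x in) as W ~ U_r @ V_r,
    so y = W @ x becomes y = U_r @ (V_r @ x) with fewer MACs when rank is small."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]          # (out x rank), singular values folded in
    V_r = Vt[:rank, :]                    # (rank x in)
    return U_r, V_r

W = np.random.randn(1024, 1024) * 0.01
U_r, V_r = low_rank_fc(W, rank=64)
orig_macs, lr_macs = W.size, U_r.size + V_r.size
print(f"MACs per input: {orig_macs} -> {lr_macs} ({orig_macs / lr_macs:.1f}x fewer)")
print("relative approximation error:",
      np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W))
```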
The aim of using these optimization techniques is to reduce model size while maintaining good performance. Lower precision in representing data (quantization) usually improves latency and reduces accuracy, especially when dealing with large-scale DNNs. Pruning the network also reduces the size of the model and is able to improve accuracy, but usually not latency. However, weight quantization is more hardware friendly than weight pruning. LRA techniques are efficient for model compression, but the necessity of expensive decompression operations makes them difficult to implement. Furthermore, LRA techniques cannot perform global compression of parameters, as they are applied layer by layer.

To improve the efficiency and achieve further compression, optimization techniques such as pruning and precision reduction, or quantization and LRA, can be combined.

5 | HW ACCELERATION APPROACHES

DNNs have been successful in a wide range of applications thanks to the rapid development of custom hardware to speed up the training phase as well as the inference phase. Among the different hardware targets previously presented in Section 3, FPGA platforms with reconfigurable integrated circuits and embedded hard cores make it easy to design dedicated hardware accelerators for complex DNNs. In this section, we review many recent research works and summarise acceleration methods based on FPGA.

5.1 | Throughput optimization

Throughput optimization is one of the objectives when designing an efficient DNN-based accelerator. Several techniques have been explored to achieve higher throughput. The most used techniques include loop optimization, systolic array architecture and Single Instruction Multiple Data (SIMD) based computation.

5.1.1 | Loop optimization

To achieve high throughput, loop optimization techniques such as loop unrolling, loop tiling and loop interchange have been widely used. They reduce the overheads associated with the massive nested loops, which increases the execution speed. These techniques are based on making effective use of parallel processing capabilities. In Ref. [94], the authors exhaustively analysed loop optimizations and data movement patterns in CNN loops. They provided a new dataflow and architecture, in which they leveraged loop tiling, unrolling and interchange to minimise data communication. Their design achieved 645.25 GOPs of throughput on an Intel FPGA using the VGG model. Loop tiling is used in Ref. [91] to fit large-scale CNN models into limited on-chip buffers. The proposed approach demonstrated higher acceleration on VGG16-SVD when applying a quantization method, and performed 137 GOPs. Also, to explore the design space of dataflow across layers, the authors in Ref. [95] used loop tiling and developed a fused-layer CNN accelerator. The implementation of the proposed approach on a Xilinx FPGA minimised off-chip feature map data transfer by 95% and reached up to 61.62 GOPs in throughput. Based on unrolling and tiling loops, Rahman et al. [96] presented ICAN, a 3D compute tile for convolutional layers. With optimization of on-chip buffer sizes for FPGAs, the proposed technique outperformed [95] by 22%. In Ref. [97], loop unrolling is used to define the computation pattern and the data flow. The paper also proposed an RTL compiler, ALAMO, to automatically integrate the computing primitives that accelerate the operation on FPGA. On AlexNet, the accelerator reported a computational throughput of 114.5 GOPs. In Ref. [98], the authors designed DLAU, an accelerator architecture for large-scale DNNs, by exploiting data reuse in order to reduce the memory bandwidth requirements. It included three pipelined processing units to improve the throughput and a loop tiling technique to improve locality and minimise data transfer operations. On a Xilinx FPGA, the proposed accelerator achieved up to 36.1× speedup with 234 mW power consumption.
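The nested-loop structure that these works tile and unroll can be sketched in plain Python as follows. The tile sizes Tm, Tn, Tr, Tc are hypothetical parameters chosen for illustration; on an FPGA each tile would be staged in on-chip buffers and the two innermost channel loops unrolled into a Tm × Tn array of MAC units, whereas here everything runs sequentially just to expose the loop order. Choosing tile sizes that fit the on-chip buffers while maximising data reuse is the design-space exploration problem addressed in the works above.

```python
import numpy as np

def conv2d_tiled(ifmap, weights, Tm=4, Tn=4, Tr=8, Tc=8):
    """Direct ('valid') convolution with output channels (M), input channels (N)
    and output rows/cols (R, C) processed in tiles of size Tm/Tn/Tr/Tc."""
    N, H, W = ifmap.shape                 # input channels, height, width
    M, _, K, _ = weights.shape            # output channels, K x K kernels
    R, C = H - K + 1, W - K + 1
    ofmap = np.zeros((M, R, C))
    for m0 in range(0, M, Tm):                    # tile over output channels
        for n0 in range(0, N, Tn):                # tile over input channels
            for r0 in range(0, R, Tr):            # tile over output rows
                for c0 in range(0, C, Tc):        # tile over output cols
                    for r in range(r0, min(r0 + Tr, R)):
                        for c in range(c0, min(c0 + Tc, C)):
                            for kr in range(K):
                                for kc in range(K):
                                    # the two loops below are the ones an accelerator
                                    # would unroll into a Tm x Tn MAC array
                                    for m in range(m0, min(m0 + Tm, M)):
                                        for n in range(n0, min(n0 + Tn, N)):
                                            ofmap[m, r, c] += (ifmap[n, r + kr, c + kc]
                                                               * weights[m, n, kr, kc])
    return ofmap
```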
TABLE 3  Pruning effect on DNN models (accuracy loss is relative to the unpruned top-5 baseline)

Pruning technique | Reference                 | DL model            | Top-5 accuracy (baseline) | Accuracy loss | Reduction
Weight pruning    | [66]                      | ResNet110           | 93.50%                    | 0%            | 90% weights
                  |                           | ResNet56            | 93.33%                    |               |
                  | [67]                      | LeNet-300-100       | 98.40%                    | 0%            | 22.9× weights
                  |                           | LeNet-5             | 99.20%                    |               | 71.2× weights
                  |                           | AlexNet             | 80.20%                    |               | 21× weights
                  | Energy-aware pruning [68] | AlexNet             | 80.43%                    | 0.87%         | 3.7× energy consumption
                  |                           | GoogLeNet           | 88.26%                    | 0.98%         | 1.6× energy consumption
Filter pruning    | ICLR'17 [69]              | VGG-16              | -                         | <1%           | 34.2% FLOPs
                  |                           | ResNet-110          | -                         |               | 38% FLOPs
                  | [70]                      | VGG-16              | 92.77%                    | 0.60%         | 45% FLOPs
                  |                           | ResNet-18           | 93.52%                    | 0.30%         | 24.3% FLOPs
                  |                           | FCN-32s             | 90.48%                    | 1.30%         | 55.4% FLOPs
                  |                           | SegNet              | 86.50%                    | -2.10%        | 63.9% FLOPs
                  | Pruned [71]               | VGG-16 on CIFAR10   | 93.23%                    | -0.78%        | 90.50% FLOPs
                  |                           | VGG-16 on GTSRB     | 99.31%                    | 0.55%         | 96.6% FLOPs
Channel pruning   | DCP [72]                  | ResNet-50           | 92.93%                    | -0.14%        | 30% channels
                  | [73]                      | VGG-16              | -                         | <1%           | 5.64× FLOPs
                  |                           | ResNet-110          | -                         | 0.08%         | 2.48× FLOPs
                  | [74]                      | VGGNet on CIFAR-10  | 93.66%                    | -0.14%        | 51.0% FLOPs
                  |                           | VGGNet on CIFAR-100 | 73.26%                    | -0.22%        | 37.1% FLOPs
                  |                           | VGGNet on SVHN      | 97.83%                    | -0.11%        | 50.1% FLOPs

5.1.2 | Systolic array architecture

Systolic array architecture is another technique that employs a high degree of parallelism to improve throughput. It consists of placing, in an organised structure, thousands of Processing Elements (PEs) and connecting them directly to each other to form a large physical matrix of these operators. Each PE has its own limited private memory. In Refs. [99-101], systolic array architecture is applied to FPGA-based CNNs. To accelerate CNN/DNN on FPGA, C. Zhang et al. designed and implemented Caffeine [99], a HW/SW co-designed library which decreased underutilised memory bandwidth. The authors proposed a massive number of parallel PEs and organised them as a systolic array to mitigate timing issues for large designs. The implementation of the proposed accelerator on a Xilinx FPGA using VGG performed 636 GOPs. A 1-D systolic array architecture described in OpenCL is proposed in Ref. [101]. This approach is only suitable for small models because all input feature maps are stored in on-chip memory. The implementation of AlexNet on FPGA resulted in 1382 GFLOPS. In this work the DSP utilization is improved by adopting the Winograd transformation. In Ref. [100], Wei et al. implemented a CNN on an Intel FPGA using a systolic array architecture, which achieved up to 1171 GOPs. In their work they provided an analytical model for resource utilization and performance and developed an automatic design space exploration framework. Besides, the use of current FPGA Computer Aided Design (CAD) tools to synthesise and lay out systolic arrays resulted in frequency degradation. In Ref. [102], a 2-D systolic architecture is analysed to identify the causes, and two methods are proposed to improve the frequency of systolic array designs, which is directly related to throughput. The evaluation results attained 1500 GOPs for VGG inference on a Xilinx FPGA platform (and achieved 1.29× higher frequency). Table 5 summarises some results.
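The dataflow of an output-stationary systolic array can be illustrated with a small cycle-level simulation. This is a purely pedagogical NumPy sketch, unrelated to the specific architectures of Refs. [99-102]: operands are injected, skewed in time, at the array edges, travel one PE per cycle, and every PE performs one multiply-accumulate per cycle. The result can be checked against A @ B; the simulation needs n + m + k - 2 cycles, which is the latency argument usually made for such arrays.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level simulation of an (n x m) output-stationary systolic array.
    PE(i, j) accumulates C[i, j]; A streams in from the left (row i skewed by
    i cycles), B streams in from the top (column j skewed by j cycles)."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    a_reg = np.zeros((n, m))      # A value currently held by each PE
    b_reg = np.zeros((n, m))      # B value currently held by each PE
    for t in range(n + m + k - 2):
        a_reg = np.roll(a_reg, 1, axis=1)     # shift one PE to the right
        b_reg = np.roll(b_reg, 1, axis=0)     # shift one PE downwards
        for i in range(n):                    # inject skewed A at the left edge
            idx = t - i
            a_reg[i, 0] = A[i, idx] if 0 <= idx < k else 0.0
        for j in range(m):                    # inject skewed B at the top edge
            idx = t - j
            b_reg[0, j] = B[idx, j] if 0 <= idx < k else 0.0
        C += a_reg * b_reg                    # every PE does one MAC per cycle
    return C

A, B = np.random.randn(4, 6), np.random.randn(6, 5)
print(np.allclose(systolic_matmul(A, B), A @ B))   # True
```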
TABLE 4  Low-rank approximation (LRA) effect on DNN models (accuracy loss is relative to the top-5 baseline without LRA)

Reference | DL model         | Top-5 accuracy (baseline) | Accuracy loss | Reduction                  | Speedup on CPU
[82]      | AlexNet          | 80.03%                    | 1.70%         | 5.46× weights, 2.67× FLOPs | 2.72×
          | VGG-S            | 84.60%                    | 0.55%         | 7.40× weights, 4.80× FLOPs | 3.68×
          | GoogLeNet        | 88.90%                    | 0.24%         | 1.28× weights, 2.06× FLOPs | 1.42×
          | VGG-16           | 89.90%                    | 0.50%         | 1.09× weights, 4.93× FLOPs | 3.34×
[83]      | AlexNet          | 80.03%                    | 0.34%         | 5.00× weights              | 1.82×
          | VGG-16           | 90.60%                    | 0.29%         | 2.75× weights              | 2.05×
          | GoogLeNet        | 92.21%                    | 0.42%         | 2.84× weights              | 1.20×
[84]      | VGG-16           | 89.90%                    | 0.30%         | -                          | 3.80×
[85]      | VGG-16           | 90.10%                    | 0%            | 22× params, 42× FLOPs      | 14×
[86]      | VGG-16           | 89.90%                    | 0.50%         | -                          | 5.13×
[87]      | AlexNet in Caffe | 80.03%                    | 1.71%         | -                          | 4.05×
[88]      | VGG-16           | -                         | <1%           | -                          | 6.6×

5.1.3 | SIMD-based computation

To achieve high throughput, SIMD-based computation techniques have been used in several recent research works. The authors in Ref. [105] designed a system architecture based on a heterogeneous FPGA with DSPs, supporting the SIMD paradigm to efficiently process parallel computation for CNN layers (convolutional and fully connected layers). The proposed architecture required lower computational time (47%) than a non-SIMD implementation. Furthermore, to accelerate the CNN computation rate on FPGAs, Nguyen et al. proposed Double MAC [103], an approach for packing two SIMD MAC operations into a single DSP block with reduced bitwidth. This work improved the computation throughput by 2 times with the same resource utilization. Zhong et al. designed Synergy [104], a hardware-software co-designed pipelined framework based on a heterogeneous FPGA to accelerate CNN inference. Supporting multi-threading, Synergy leveraged all the available on-chip compute resources, including the CPU, FPGA and NEON SIMD engines. The FPGA and the NEON engines are used to accelerate convolutional layers while the CPU cores execute the fully-connected layers and the preprocessing functions. Additionally, workload balancing across accelerators was provided to adopt various networks at runtime without the need to change the hardware or software implementations. The evaluation of Synergy resulted in higher throughput and energy efficiency over implementations on the same platform. Likewise, an architecture based on the SIMD technique was presented in Ref. [106] to accelerate DNNs for speech recognition. SIMD and MIMD (Multiple Instructions Multiple Data) modes are mixed in Ref. [107] to accelerate DL models. In addition, a SIMD-like architecture is adopted in Ref. [108] to minimise the energy consumption, which is another important key to further improving accelerator efficiency.

5.2 | Energy optimization

Reducing the energy consumption is a key challenge when designing an efficient DNN-based accelerator. Therefore, various techniques have been explored by researchers to obtain high throughput with low energy consumption.

5.2.1 | Reducing the memory bandwidth

Many recent researchers focussed on reducing the on-chip and off-chip memory bandwidth. In Ref. [109], Zhang et al. presented a 2-D interconnection between PEs and local memory to minimise the on-chip memory bandwidth. The authors also increased the data locality, which reduced the off-chip memory requirements. Using OpenCL, the design implementation of VGG on FPGA achieved a 1790 GOPs throughput and an energy efficiency of 47.78 GOPs/W. Also, Memsqueezer [110], an on-chip memory subsystem for low-overhead DL accelerators that can be implemented on FPGA, was proposed. It compresses the data and weights from the hardware perspective and eliminates the data redundancy. With Memsqueezer buffers, CNN accelerators achieve 80% energy reduction over conventional buffer designs with the same area budget. Reducing data transfer between on-chip memory and off-chip memory can also minimise the energy consumption. It is in this context that Shen et al. [111] realized a CNN accelerator with a flexible data buffering scheme, Escher. The latter reduced the bandwidth requirements by 2.4× on FPGA using AlexNet. The study by Li et al. [112] observed that for CNN accelerators, over 80% of the energy is consumed by DRAM accesses. The authors proposed SmartShuttle, an adaptive layer-wise scheme to minimise the off-chip memory accesses by investigating the impact of sparsity and reusability of data on the memory. The evaluation on AlexNet showed that SmartShuttle reduced the DRAM access volume by up to 47.6% and reached up to 36% of energy savings. In the same context, Ref. [113] designed an algorithm called block convolution to completely avoid the off-chip intermediate data transfers. It performed high throughput on FPGA using VGG-16. The fused-layer CNN accelerator proposed by Alwani et al. [95] also minimised the off-chip data transfer by 95%.

5.2.2 | Other approaches

Many other methods have been used to reduce power consumption. In Ref. [114], Zhang et al. proposed a deeply pipelined FPGA architecture to leverage the design space for energy efficiency. The evaluation results on VGG-16 achieved 8.28 GOPs/J of energy efficiency. The study by Zhu et al. [115] showed that using low-rank approximation, a 31% to 53% energy reduction can be reached. Low data representation can also reduce energy consumption: the binarised neural networks in Ref. [116] attained 44.2 GOPs/W. In Ref. [61], Han et al. presented an Efficient Speech Recognition Engine (ESE), a SW/HW co-design framework which works directly on a compressed LSTM model. It achieved up to 428 FPs/W, which is 40× more energy efficient than a CPU. Li et al. [117] presented the Efficient RNN (E-RNN) framework for FPGA implementation of the Automatic Speech Recognition (ASR) application. For more accurate block-circulant training, they used the Alternating Direction Method of Multipliers (ADMM) technique. This approach achieves a 37.4× energy efficiency improvement compared with ESE [61]. Table 6 summarises several acceleration techniques optimised for low energy consumption.

5.3 | Algorithmic optimization

Recent works [59,118,119] have demonstrated that applying mathematical optimizations such as the Fast Fourier (FF) and Winograd algorithms to DNN accelerators can improve resource productivity and efficiency. These transformations can decrease the required MAC operations in the network's layers by reducing the model's arithmetic complexity. For example, each element in the output feature map of a CNN model is normally computed individually. Contrariwise, the FF and Winograd algorithms transform the input feature map and the filter to the corresponding domain (Winograd or frequency) and then perform element-wise matrix multiplication [120]. To get the final output, an inverse transformation is applied. The reduction of the model's arithmetic complexity depends on the parameters of the algorithm. With an 8×8 input tile size, the FF Transform (FFT) algorithm can reduce the multiplications by 3.45 times for 3×3 filters. On the other hand, with a 6×6 input tile size, the Winograd algorithm can reduce the multiplications by 4 times for 3×3 filters. In Ref. [120], Liang et al. investigated both the Winograd and Fast Fourier transformations and proved their considerable effect in reducing arithmetic complexity and improving CNN performance on FPGAs.

5.3.1 | Fast Fourier transform

The FFT is a well-known approach that reduces the computational complexity. The study by Lin et al. [118] presented a framework based on the FFT that achieved a significant processing speedup and a reduction in storage requirements. Zhang et al. also exploited the FFT to deal with the complexity of the convolutional layers' computation [119]. The proposed design performed 123.48 GFLOPs on an Intel QuickAssist QPI FPGA platform using VGG. To accelerate operations in each convolutional layer too, a tile-based FFT algorithm (tFFT) is presented in Ref. [121]. Another proposed framework, C-LSTM [59], used the FFT to accelerate LSTM inference by reducing the computational and storage complexities. The latter performed 18.8× and 33.5× gains in performance and energy efficiency compared with the state-of-the-art ESE [61], respectively.
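The frequency-domain convolution that these FFT-based accelerators exploit can be demonstrated in a few lines. This is an illustrative NumPy sketch for a single feature map and a single kernel; real designs tile the input, batch the transforms over channels and keep the data in fixed point.

```python
import numpy as np

def fft_conv2d(x, k):
    """Linear 2-D convolution computed as element-wise multiplication
    of zero-padded FFTs (one feature map, one kernel)."""
    out_shape = (x.shape[0] + k.shape[0] - 1, x.shape[1] + k.shape[1] - 1)
    X = np.fft.rfft2(x, out_shape)
    K = np.fft.rfft2(k, out_shape)
    return np.fft.irfft2(X * K, out_shape)

# Sanity check against direct (sliding-window) convolution on a small example.
x, k = np.random.randn(8, 8), np.random.randn(3, 3)
direct = np.zeros((10, 10))
xp = np.pad(x, 2)
for i in range(10):
    for j in range(10):
        direct[i, j] = np.sum(xp[i:i + 3, j:j + 3] * k[::-1, ::-1])
print(np.allclose(direct, fft_conv2d(x, k)))   # True
```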
5.3.1 | Fast Fourier transform 5.4 | HW design automation

FFT is a well‐known approach that reduces the computational Design automation frameworks have been explored to
complexity. The study by Lin et al. [118] presented a frame- accelerate DNNs by automatically map their models onto
work based on FFT that achieved significant processing speed hardware platforms. The use of such frameworks can
and reduction in storage requirement. Zhang et al, exploited significantly simplify the development and speedup the
FFT also to deal with the complexity of the convolutional automatic generation of the hardware accelerator. Some ap-
layers computation [119]. The proposed design performed proaches have focussed on using the HLS which is an
123.48 GFLOPs on Intel Quick‐Assist QPI FPGA Platform automated design process that generates high‐performance
using VGG. To accelerate operations in each convolutional FPGA hardware from software. The study by Zhang et al.
layer too, a tile‐based FFT algorithm (tFFT) is presented in [99] designed Caffeine, a HW/SW co‐designed library based
Ref. [121]. Another proposed framework, C‐LSTM [59], used on HLS tools. Kim et al. [124] analysed the efficiency of the
FFT to accelerate the LSTM inference by reducing the HLS implementation and designed a CNN based FPGA
computational and storage complexities. The latter performed accelerator using LegUp HLS tool. The proposed accelerator
18.8� and 33.5� gains for performance and energy efficiency performed 138 GOPs on VGG‐16. SDAccel, OpenCL, HLS
compared with the state‐of‐the‐art ESE [61], respectively. tools are applied in Ref. [122] to synthesise a CNN acceler-
ator that reached 55 GFLOPS on VGG. In Ref. [63], Zhang
et al. applied HLS for the implementation of Long‐term
5.3.2 | Winograd algorithm Recurrent Convolution Network (LRCN) on Xilinx FPGA
based on their designed resource allocation scheme REALM.
Very similar to FFT, the Winograd fast convolution algorithm is applied to DNNs to minimise the multiplication requirement. By adopting the Winograd transformation in Ref. [101], the DSP utilization is improved. Lu et al. [120] used the Winograd algorithm to accelerate CNNs by reducing the multiplication operations and saving DSP resources. On VGG, the proposed design attained 2479.6 GOPs of throughput. Additionally, the study by Di Cecco et al. [122] implemented a Winograd convolution engine on FPGA which performed 55 GOPs when executing VGG. More recently, Huang et al. [123] designed an accelerator based on the Winograd algorithm. In this work, the authors evaluated the Winograd algorithm with different tile sizes. When using VGG, the design achieved 943 GOPs on FPGA. More details are presented in Table 7.
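For readers unfamiliar with the transform itself, the sketch below is ours (using the standard F(2×2, 3×3) transform matrices commonly used in the literature, not code from the cited designs). It produces one 2×2 output tile from a 4×4 input tile with only 16 element-wise multiplications, instead of the 36 a direct computation would need, and verifies the result against direct convolution.

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd transform matrices.
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0, 0, 1]], dtype=float)
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    """One 2x2 output tile from a 4x4 input tile d and a 3x3 filter g."""
    U = G @ g @ G.T         # filter transform (4x4), reusable across tiles
    V = B_T @ d @ B_T.T     # input-tile transform (4x4)
    M = U * V               # 16 element-wise multiplications
    return A_T @ M @ A_T.T  # inverse (output) transform -> 2x2 tile

rng = np.random.default_rng(1)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
direct = np.array([[(d[i:i + 3, j:j + 3] * g).sum() for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), direct)
```

Larger tiles (e.g. F(4×4, 3×3)) save more multiplications at the cost of transform matrices with larger constants, which affects numerical precision and is one reason the designs above evaluate several tile sizes.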
5.4 | HW design automation

Design automation frameworks have been explored to accelerate DNNs by automatically mapping their models onto hardware platforms. The use of such frameworks can significantly simplify the development and speed up the automatic generation of the hardware accelerator. Some approaches have focussed on using HLS, an automated design process that generates high-performance FPGA hardware from software. The study by Zhang et al. [99] designed Caffeine, a HW/SW co-designed library based on HLS tools. Kim et al. [124] analysed the efficiency of the HLS implementation and designed a CNN-based FPGA accelerator using the LegUp HLS tool. The proposed accelerator performed 138 GOPs on VGG-16. SDAccel, OpenCL and HLS tools are applied in Ref. [122] to synthesise a CNN accelerator that reached 55 GFLOPS on VGG. In Ref. [63], Zhang et al. applied HLS for the implementation of a Long-term Recurrent Convolutional Network (LRCN) on a Xilinx FPGA based on their designed resource allocation scheme REALM. More recently, the authors in Ref. [125] presented an implementation of a Neural Machine Translation (NMT) model on FPGA. It used HLS to build parameterised IPs. Many other approaches have used the Register Transfer Level (RTL), which describes the design as the transfers that occur between registers every clock cycle. RTL leveraging offers higher performance. In Ref. [97], Ma et al. proposed an RTL-level CNN compiler that automatically generates a customised FPGA accelerator. The VGG implementation gained a 2.7× throughput improvement over [99]. The study by Ma et al. [126] developed an RTL FPGA-based accelerator which achieved 720.15 GOPs using VGG-16. Using RTL codes, the designed accelerator in Ref. [127] achieved 638.9 GOPs on VGG-16. The study by Zeng et al. [128] used RTL IPs to create a reconfigurable framework for deploying CNN-RNN models on FPGAs.
TABLE 5  FPGA-based CNN accelerators optimising throughput

Acceleration technique | DL Model | Design Tool | DSP Utilization | Throughput (GOPs) | Reference
Loop optimization | VGG-16 | Verilog/Quartus Prime | 1518 | 645.25 | [94]
Loop optimization | VGG16-SVD | HDL | 780 | 136.97 | [91]
Loop optimization | AlexNet | C++/HLS | 2401 | 61.62 | Fused-layer [95]
Loop optimization | AlexNet | Verilog | 2594 | 75.16 | ICAN [96]
Loop optimization | AlexNet | RTL | 256 | 114.5 | ALAMO [97]
Loop optimization | DNN (MNIST) | - | 167 | - | DLAU [98]
Systolic-like architecture | VGG-16 | HLS | 2833 | 636 | Caffeine [99]
Systolic-like architecture | VGG-16 | C/C++/HLS | 1500 | 1171.3 | [100]
Systolic-like architecture | AlexNet | OpenCL | 1476 | 1382 | DLA [101]
Systolic-like architecture | VGG-16 | C/HLS | 1368 | 1495 | [102]
SIMD | VGG-16 | Verilog RTL | 2240 | - | Double MAC [103]
SIMD | CNN (MNIST) | C/HLS | - | 2.15 (96.2 frames/s) | Synergy [104]
SIMD | VGG-16 | - | 880 | 425.32 | [105]
TABLE 6  FPGA-based CNN accelerators optimising low energy consumption

Acceleration technique | DL Model | Design | BRAM Utilization | DSP Utilization | Energy Efficiency | Throughput | Reference
2-D multi-cast interconnection between PEs and local memory; increased data locality | VGG-16 | OpenCL | 1450/2713 | 2756 | 47.78 | 1790 GOPs | [109]
Memory subsystem | CNN | RTL | - | - | 80% | 2× | Memsqueezer [110]
Optimising off-chip memory | AlexNet | Verilog/Synopsys | 47.6% DRAM access reduction | - | 36% | - | SmartShuttle [112]
Avoiding off-chip data transfers: multi-layer fusion, loop tiling | VGG-16 | - | 1090 | 900 | - | 374.98 GOPs | Block convolution [113]
Pipelined FPGA cluster | VGG-16 | HDL | - | - | 8.28 GOPs/J | 290 GOPs | [114]
Flexible data buffering scheme | AlexNet | RTL/HLS | 1745 (59%) | 2182 | 2.4× peak bandwidth reduction | 135 GOPs | Escher [111]
Low-rank approximation | DNN (SVHN) | Verilog RTL/Synopsys | - | - | 31% to 53% energy reduction | 22% to 43% throughput increase | LRADNN [115]
Binarised neural networks | CNN | C++/HLS | 86-94/140 | 3/220 | 44.2 GOPs/W | 207.8 GOPs | BNN [116]
Compression (quantization + pruning), HW/SW co-designed framework | LSTM | - | 1080 | 1504 | 428 FPs/W | 282 GOPs | ESE [61]
On the LRCN network, the hardware system designed in Ref. [128] performed up to 690.76 GOPs of throughput and achieved 86.34 GOPs/W energy efficiency. More results are provided in Table 8. Some other approaches combined the finer-level optimization of RTL and the flexibility of HLS to design DNN accelerators, which achieved 114.5 GOPs in Ref. [129]. Based on an RTL-HLS hybrid library, Guan et al. designed FP-DNN [130], a framework to automatically generate optimised DNN implementations on FPGA. The evaluation results reached 364.36 GOPs on a CNN model and 315.85 GOPs on an RNN model.

The acceleration methods aim to speed up DNNs while improving throughput and reducing energy consumption. Several techniques have been explored to achieve higher throughput, such as loop optimization, systolic array architectures and SIMD-based computation. A DNN accelerator designed using these techniques usually consumes higher energy. Therefore, various techniques have been explored to obtain high throughput with low energy consumption, such as memory bandwidth reduction and model compression. For further improvement, algorithmic optimization approaches like the Fast Fourier and Winograd algorithms can be used. Furthermore, the automatic generation of a high-performance hardware accelerator from software can significantly simplify the development and speed up the process (e.g. HLS). Reducing the energy consumption and improving the throughput are key challenges for designing an efficient DNN-based accelerator. Therefore, various acceleration techniques can be combined along with the optimization approaches.
TABLE 7  DNN accelerators employing computational transform

Algorithm | DL Model | Design | DSP Utilization | Throughput | Energy Efficiency | Reference
FFT | VGG16 | - | 224 | 123.48 GOPs | 9.37 GOPs/W | [119]
FFT | Google LSTM | C/C++/HLS | 2786 | 330,275 FPs | 14,359 FPs/W | C-LSTM [59]
FFT | Small LSTM | C/C++/HLS | 2347 | 559,257 FPs | 25,420 FPs/W | C-LSTM [59]
Winograd | VGG | C/HLS | 2520 | 2940.7 GOPs | 124.6 GOPs/W | [120]
Winograd | VGG | OpenCL/HLS | 1307 | 55 GOPs | - | [122]
Winograd | VGG | C/HLS | 756 | 943 GOPs | 74.5× (over CPU) | [123]
TABLE 8  DNN accelerators employing design automation

Design | DL Model | DSP Utilization | Throughput (GOPs) | Energy Efficiency | Reference
HLS | VGG-16 | 2833 | 354 | - | Caffeine [99]
HLS | VGG-16 | 380 | 138 | 41.8 GOPs/W | [124]
HLS | VGG-16 | 1307 | 55 | - | [122]
HLS | NMT | 5969 | 14.8 | - | [125]
RTL | VGG-16 | 3600 | 720.15 | - | [126]
RTL | VGG-16 | 2967 | 638.9 | - | CaFPGA [127]
RTL | LRCN | 1248 | 690.76 | 86.34 GOPs/W | [128]
RTL-HLS | AlexNet | 256 | 114.5 | - | [129]
RTL-HLS | VGG-19 | 1036 | 364.36 | 14.57 GOPs/J | FP-DNN [130]
RTL-HLS | LSTM-LM | 1036 | 315.85 | 12.63 GOPs/J | FP-DNN [130]
6 | CONCLUSION

Herein, the DL concept was initially presented through the complexity of different models. We also reviewed the exploration of different computation platforms for DL implementation. Then, we discussed a review of the literature about the different approaches used to optimise DL models and make them more hardware friendly. In the end, we presented and analysed the acceleration techniques used for the deployment of DL models on FPGA platforms. The deployment of DL on embedded equipment with high accuracy, high throughput and low consumption is still a challenge. Indeed, the hardware constraints required for lower power consumption, such as limited processing power, a lower memory footprint and less bandwidth, reduce the accuracy. Due to the increasing complexity of DNN models, it is difficult to integrate a large DNN into an embedded hardware design. This made researchers think about applying optimization and acceleration techniques. Optimization techniques focussed on modifying DL algorithms to make them more hardware-friendly. They effectively digest the redundancy of models and provide improved computing efficiency with minimal loss of accuracy. The acceleration methods, in turn, aim to speed up DNNs while improving throughput and reducing energy consumption. Also, applying algorithmic optimizations like the Fast Fourier and Winograd algorithms can accelerate DNNs and improve resource productivity and efficiency. In addition, the use of frameworks to automatically map models onto hardware platforms simplifies the development and speeds up the automatic generation of the hardware accelerator. The efficient implementation of complex DNN models on new and increasingly powerful embedded platforms can offer many benefits for AI applications. Previous works faced challenges such as limited hardware resources, long development time and performance degradation. Moreover, it is difficult to use all the functionalities of neural network algorithms in hardware compared to software implementations [131]. In this context, new FPGAs, using parallel processing and embedded programmable cores, have advantages over other hardware platforms for DNN implementations. Whole systems can be integrated on a chip using many hardware components such as memories, fast devices, DSP units and processor cores, which expedites the design of such large-scale systems. FPGAs are very flexible and allow reconfiguration to optimise bit resolution, clock rate, parallelisation and pipeline processing for a given application. Some FPGA manufacturers like Xilinx have provided accelerators (DPU) along with other tools and APIs to optimise pre-trained DL models by applying pruning and quantization techniques.

ORCID
Meriam Dhouibi https://orcid.org/0000-0002-0273-3262

REFERENCES
1. Li, Y. et al.: Face recognition based on recurrent regression neural network. Neurocomputing. 297, 50–58 (2018)
2. Marra, F. et al.: A deep learning approach for iris sensor model identification. Pattern Recogn. Lett. 113, 46–53 (2018)
3. Lee, J.G., et al.: Deep learning in medical imaging: general overview. Korean J. Radiol. 18(4), 570–584 (2017)
4. Justesen, N. et al.: Deep learning for video game playing. IEEE Trans. Games (2019)
5. Lecun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature. 521, 436–444 (2015)
6. Fan, K., Wen, S., Deng, Z.: Deep learning for detecting breast cancer metastases on WSI. In: Innovation in Medicine and Healthcare Systems, and Multimedia, pp. 137–145. Springer, Singapore (2019)
7. Wang, J. et al.: Deep learning for smart manufacturing: methods and applications. J. Manuf. Syst. 48, 144–156 (2018)
8. Rémy, S.: Apprentissage profond et acquisition de représentations latentes de séquences peptidiques. https://hal.inria.fr/hal-01406368 (2016). Accessed 25 March 2018
9. Groumpos, P.P.: Deep learning vs. wise learning: a critical and challenging overview. IFAC-PapersOnLine. 49(29), 180–189 (2016)
10. Ackerman, E.: How Drive.ai is mastering autonomous driving with deep learning. IEEE Spectrum. https://spectrum.ieee.org/cars-that-think/transportation/self-driving/how-driveai-is-mastering-autonomous-driving-with-deep-learning (2017). Accessed 20 March 2019
11. Giusti, A., et al.: A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robot. Autom. Lett. (2016)
12. Esteva, A., et al.: Dermatologist-level classification of skin cancer with deep neural networks. Nature. 542(7639), 115 (2017)
13. Khagi, B., Lee, C.G., Kwon, G.R.: Alzheimer's disease classification from brain MRI based on transfer learning from CNN. In: BMEiCON 2018 - 11th Biomedical Engineering International Conference, Chiang Mai, 21-24 November 2018 (2019)
14. Gilmer, J., et al.: Neural message passing for quantum chemistry. In: Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 1263–1272. JMLR.org, Sydney, NSW (2017)
15. Zhang, J., et al.: Deep neural networks in machine translation: an overview. IEEE Intelligent Systems. 30(5), 16–25 (2015)
16. Owens, A., et al.: Visually indicated sounds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2405–2413 (2016)
17. Karpathy, A., Fei, L.: Deep visual-semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 664–676 (2017)
18. He, K. et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, 27-30 June 2016 (2016)
19. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, San Diego, CA, 7-9 May 2015 (2015)
20. Canziani, A., Culurciello, E., Paszke, A.: Analysis of deep neural network architectures for practical applications. CoRR. abs/1605.07678. http://arxiv.org/abs/1605.07678 (2016)
21. Horowitz, M.: 1.1 Computing's energy problem (and what we can do about it). In: 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), San Francisco, CA, 9-13 February 2014, pp. 10–14. IEEE (2014)
22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, Lake Tahoe, Nevada, December 2012, pp. 1097–1105 (2012)
23. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, 8-10 June 2015, pp. 1–9 (2015)
24. Sze, V. et al.: Hardware for machine learning: challenges and opportunities. In: 2017 IEEE Custom Integrated Circuits Conference (CICC), Austin, TX, 30 April-3 May 2017, pp. 1–8. IEEE (2017)
25. Oliveira, D., et al.: Experimental and analytical study of Xeon Phi reliability. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p. 28. ACM (2017)
26. Frank, B.H.: Google's new chip makes machine learning way faster. https://www.computerworld.com/article/3072652/googles-new-chip-makes-machine-learning-way-faster.html (2016). Accessed 6 May 2018
27. Conformant products - the Khronos Group Inc. https://www.khronos.org/conformance/adopters/conformant-products#opencl (2019)
28. NVDLA Primer - NVDLA Documentation. http://nvdla.org/primer.html (2020). Accessed 7 May 2020
29. Tayal, P.: AMD's new Vega GPUs target deep learning. https://marketrealist.com/2018/12/amds-new-vega-gpus-target-deep-learning/ (2018). Accessed 15 April 2020
30. Bacon, D.F., et al.: FPGA programing for the masses. Commun. ACM. 56(4), 56–63 (2013)
31. Lacey, G., Taylor, G.W., Areibi, S.: Deep learning on FPGAs: past, present, and future. arXiv preprint arXiv:160204283 (2016)
32. Hou, X. et al.: Vehicle licence plate recognition system based on deep learning deployed to PYNQ. In: ISCIT 2018 - 18th International Symposium on Communication and Information Technology, Bangkok, 26-29 September 2018 (2018)
33. Ranjan, R., Patel, V.M., Chellappa, R.: HyperFace: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Trans. Pattern Anal. Mach. Intell. 41, 121–135 (2019)
34. Gaide, B. et al.: Xilinx adaptive compute acceleration platform: Versal architecture. In: FPGA 2019 - Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, February 2019 (2019)
35. Xilinx: DPU for convolutional neural network - DPU IP product guide. https://www.xilinx.com/products/intellectual-property/dpu.html (2019). Accessed 17 April 2020
36. Alveo. https://www.xilinx.com/products/boards-and-kits/alveo.html (2020). Accessed 7 May 2020
37. MPC-X Series | Maxeler Technologies. https://www.maxeler.com/products/mpc-xseries/ (2020). Accessed 9 May 2020
38. Feldman, M.: Microsoft goes all in for FPGAs to build out AI cloud | TOP500 Supercomputer Sites. https://www.top500.org/news/microsoft-goes-all-in-for-fpgas-to-build-out-cloud-based-ai/ (2017). Accessed 9 May 2020
39. Amazon: Amazon EC2 F1 instances. https://aws.amazon.com/ec2/instance-types/f1/ (2019). Accessed 9 May 2020
40. FPGA cloud server. https://cn.aliyun.com/product/ecs/fpga (2019). Accessed 9 May 2020
41. FPGA accelerated cloud server. https://www.huaweicloud.com/product/fcs.html (2019). Accessed 9 May 2020
42. FPGA cloud server_FPGA instance_hardware acceleration - Tencent Cloud. https://cloud.tencent.com/product/fpga (2019). Accessed 10 May 2020
43. FPGA cloud server_Baidu Cloud. https://cloud.baidu.com/product/fpga.html (2019). Accessed 10 May 2020
44. Chen, Y.H. et al.: Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid State Circ. (2017)
45. Cloud TPU. https://cloud.google.com/tpu (2020)
46. Linley, G.: Habana wins cigar for AI inference: startup takes performance lead with mystery architecture. https://www.linleygroup.com/mpr/article.php?id=12103 (2019). Accessed 4 May 2020
47. Lightspeeur 2801 neural accelerator for edge devices. https://www.gyrfalcontech.ai/solutions/2801s/ (2019). Accessed 10 May 2020
48. Synced: California startup GTI releases AI chips to challenge NVIDIA and Intel. https://syncedreview.com/2019/01/28/california-startup-gti-releases-ai-chips-to-challenge-nvidia-and-intel/ (2019)
49. Sheng, T. et al.: A quantization-friendly separable convolution for MobileNets. In: Proceedings - 1st Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2 2018), Williamsburg, VA, 25 March 2018 (2018)
50. Zhou, A. et al.: Incremental network quantization: towards lossless CNNs with low-precision weights. In: 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, Toulon, France, 24-26 April 2017 (2019)
51. Li, F., Zhang, B., Liu, B.: Ternary weight networks. arXiv preprint arXiv:160504711 (2016)
52. Rastegari, M. et al.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer, Cham, Amsterdam, The Netherlands (2016)
53. Gysel, P.: Ristretto: hardware-oriented approximation of convolutional neural networks. arXiv preprint arXiv:160506402 (2016)
54. Zhou, S.C. et al.: Balanced quantization: an effective and efficient approach to quantized neural networks. J. Comput. Sci. Technol. 32(4), 667–682 (2017)
55. Hubara, I. et al.: Quantized neural networks: training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18(1), 6869–6898 (2017)
56. Cai, Z. et al.: Deep learning with low precision by half-wave Gaussian quantization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5918–5926 (2017)
57. Courbariaux, M. et al.: Binarised neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:160202830 (2016)
58. Zhou, S. et al.: DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:160606160 (2016)
59. Wang, S., et al.: C-LSTM: enabling efficient LSTM using structured compression techniques on FPGAs. In: FPGA 2018 - Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2018 (2018)
60. Moss, D.J.M., et al.: A customisable matrix multiplication framework for the Intel HARPv2 Xeon+FPGA platform: a deep learning case study. In: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2018, pp. 107–116. ACM (2018)
61. Han, S., et al.: ESE: efficient speech recognition engine with sparse LSTM on FPGA. In: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 75–84. ACM (2017)
62. Shen, J. et al.: Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA. In: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2018, pp. 97–106. ACM (2018)
63. Zhang, X., et al.: High-performance video content recognition with long-term recurrent convolutional network for FPGA. In: 27th International Conference on Field Programmable Logic and Applications (FPL), Ghent, 4-8 September 2017, pp. 1–4. IEEE (2017)
64. Hu, Q., Wang, P., Cheng, J.: From hashing to CNNs: training binary weight networks via hashing. In: Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, 2-7 February 2018 (2018)
65. Wang, P., Cheng, J.: Fixed-point factorised networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 21-26 July 2017, pp. 4012–4020 (2017)
66. Carreira-Perpinán, M.A., Idelbayev, Y.: "Learning-Compression" algorithms for neural net pruning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 18-23 June 2018, pp. 8532–8541 (2018)
67. Zhang, T., et al.: A systematic DNN weight pruning framework using alternating direction method of multipliers. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer, Cham, Munich, Germany (2018)
68. Yang, T.J., Chen, Y.H., Sze, V.: Designing energy-efficient convolutional neural networks using energy-aware pruning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 21-26 July 2017, pp. 5687–5695 (2017)
69. Li, H. et al.: Pruning filters for efficient ConvNets. In: 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, Toulon, France, 24-26 April 2017 (2019)
70. Huang, Q. et al.: Learning to prune filters in convolutional neural networks. In: IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, 12-15 March 2018, pp. 709–718. IEEE (2018)
71. Singh, P. et al.: Multi-layer pruning framework for compressing single shot multibox detector. In: IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, 7-11 January 2019, pp. 1318–1327. IEEE (2019)
72. Zhuang, Z., et al.: Discrimination-aware channel pruning for deep neural networks. In: Advances in Neural Information Processing Systems. Curran Associates Inc., Montreal, Canada (2018)
73. Liu, C., Wu, H.: Channel pruning based on mean gradient for accelerating convolutional neural networks. Signal Process. 156, 84–91 (2019)
74. Liu, Z., et al.: Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE International Conference on Computer Vision, Venice, 22-29 October 2017 (2017)
75. Sze, V., et al.: Efficient processing of deep neural networks. Synthesis Lectures on Computer Architecture. 15(2), 1–341 (2020)
76. He, Y., et al.: AutoML for model compression and acceleration on mobile devices. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer, Cham, Munich, Germany (2018)
77. Qin, Q., et al.: To compress, or not to compress: characterising deep learning model compression for embedded inference. In: Proceedings - 16th IEEE International Symposium on Parallel and Distributed Processing with Applications, 17th IEEE International Conference on Ubiquitous Computing and Communications, 8th IEEE International Conference on Big Data and Cloud Computing, 11th IEEE International Conference on Social Computing and Networking and 8th IEEE International Conference on Sustainable Computing and Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom 2018), Melbourne, Australia, 11-13 December 2018 (2019)
78. Tung, F., Mori, G.: CLIP-Q: deep network compression learning by in-parallel pruning-quantization. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 18-23 June 2018 (2018)
79. Faraone, J. et al.: Customising low-precision deep neural networks for FPGAs. In: 28th International Conference on Field Programmable Logic and Applications (FPL), Dublin, 27-31 August 2018, pp. 97–973. IEEE (2018)
80. Posewsky, T., Ziener, D.: A flexible FPGA-based inference architecture for pruned deep neural networks. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer, Cham, Braunschweig, Germany (2018)
81. Zhang, M. et al.: Optimised compression for implementing convolutional neural networks on FPGA. Electronics (Switzerland) (2019)
82. Kim, Y.D. et al.: Compression of deep convolutional neural networks for fast and low power mobile applications. In: 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings, San Juan, Puerto Rico, 2-4 May 2016 (2016)
83. Tai, C. et al.: Convolutional neural networks with low-rank regularisation. In: 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings, San Juan, Puerto Rico, 2-4 May 2016 (2016)
84. Zhang, X. et al.: Accelerating very deep convolutional networks for classification and detection. IEEE Trans. Pattern Anal. Mach. Intell. 38(10), 1943–1955 (2015)
85. Wang, M., Liu, B., Foroosh, H.: Factorized convolutional neural networks. In: Proceedings - 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, 22-29 October 2017 (2017)
86. Chen, W. et al.: A layer decomposition-recomposition framework for neuron pruning towards accurate lightweight networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, January 27-February 1, 2019 (2019)
87. Wen, W. et al.: Coordinating filters for faster deep neural networks. In: Proceedings of the IEEE International Conference on Computer Vision, Venice, 22-29 October 2017 (2017)
88. Wang, P., Cheng, J.: Accelerating convolutional neural networks for mobile applications. In: MM 2016 - Proceedings of the 2016 ACM Multimedia Conference, Amsterdam, The Netherlands, October 2016 (2016)
89. Denton, E. et al.: Exploiting linear structure within convolutional networks for efficient evaluation. In: Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), Vol. 1, pp. 1269–1277. MIT Press, Cambridge, MA (2014)
90. Sainath, T.N. et al.: Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In: ICASSP,
IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Vancouver, BC, Canada, 26-31 May 2013 (2013)
91. Qiu, J., et al.: Going deeper with embedded FPGA platform for convolutional neural network. In: FPGA 2016 - Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2016 (2016)
92. Ding, H., et al.: A compact CNN-DBLSTM based character model for offline handwriting recognition with Tucker decomposition. In: Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), Kyoto, 9-15 November 2017 (2017)
93. Li, B. et al.: Running sparse and low-precision neural network: when algorithm meets hardware. In: Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), Jeju, 22-25 January 2018 (2018)
94. Ma, Y. et al.: Optimising loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In: FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2017 (2017)
95. Alwani, M. et al.: Fused-layer CNN accelerators. In: Proceedings of the Annual International Symposium on Microarchitecture (MICRO), Taipei, Taiwan, 15-19 October 2016 (2016)
96. Rahman, A., Lee, J., Choi, K.: Efficient FPGA acceleration of convolutional neural networks using logical-3D compute array. In: Proceedings of the 2016 Design, Automation and Test in Europe Conference and Exhibition (DATE 2016), Dresden, Germany, 14-18 March 2016 (2016)
97. Ma, Y. et al.: FPGA acceleration of deep learning algorithms with a modularised RTL compiler. Integration. 62, 14–23 (2018)
98. Wang, C. et al.: DLAU: a scalable deep learning accelerator unit on FPGA. IEEE Trans. Comput. Aided Des. Integr. Circ. Syst. 36, 513–517 (2017)
99. Zhang, C. et al.: Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks. IEEE Trans. Comput. Aided Des. Integr. Circ. Syst. 38, 2072–2085 (2019)
100. Wei, X., et al.: Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In: Proceedings - Design Automation Conference, Austin, TX, 18-22 June 2017 (2017)
101. Aydonat, U. et al.: An OpenCL deep learning accelerator on Arria 10. In: FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2017 (2017)
102. Zhang, J. et al.: Frequency improvement of systolic array-based CNNs on FPGAs. In: Proceedings - IEEE International Symposium on Circuits and Systems, Sapporo, Japan, 26-29 May 2019 (2019)
103. Nguyen, D., Kim, D., Lee, J.: Double-MAC: doubling the performance of convolutional neural networks on modern FPGAs. In: Proceedings of the 2017 Design, Automation and Test in Europe (DATE 2017), Lausanne, 27-31 March 2017 (2017)
104. Zhong, G. et al.: Synergy: an HW/SW framework for high throughput CNNs on embedded heterogeneous SoC. ACM Trans. Embed. Comput. Syst. (2019)
105. Spagnolo, F. et al.: Energy-efficient architecture for CNNs inference on heterogeneous FPGA. J. Low Power Electron. Appl. (2020)
106. Price, M., Glass, J., Chandrakasan, A.P.: A scalable speech recogniser with deep-neural-network acoustic models and voice-activated power gating. In: Digest of Technical Papers - IEEE International Solid-State Circuits Conference (2017)
107. Yazdanbakhsh, A. et al.: A unified MIMD-SIMD acceleration for generative adversarial networks. In: Proceedings - International Symposium on Computer Architecture, Los Angeles, CA, June 2018 (2018)
108. Lin, C.Y., Lai, B.C.: Supporting compressed-sparse activations and weights on SIMD-like accelerator for sparse convolutional neural networks. In: Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC) (2018)
109. Zhang, J., Li, J.: Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network. In: FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2017 (2017)
110. Wang, Y., Li, H., Li, X.: Re-architecting the on-chip memory sub-system of machine-learning accelerator for embedded devices. In: IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Digest of Technical Papers, Austin, TX, 7-10 November 2016 (2016)
111. Shen, Y., Ferdman, M., Milder, P.: Escher: a CNN accelerator with flexible buffering to minimise off-chip transfer. In: Proceedings - IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Napa, CA, 30 April-2 May 2017 (2017)
112. Li, J., et al.: SmartShuttle: optimising off-chip memory accesses for deep learning accelerators. In: Proceedings of the 2018 Design, Automation and Test in Europe Conference and Exhibition (DATE 2018), Dresden, 19-23 March 2018 (2018)
113. Li, G. et al.: Block convolution: towards memory-efficient inference of large-scale CNNs on FPGA. In: Proceedings of the 2018 Design, Automation and Test in Europe Conference and Exhibition (DATE 2018), Dresden, 19-23 March 2018 (2018)
114. Zhang, C. et al.: Energy-efficient CNN implementation on a deeply pipelined FPGA cluster. In: Proceedings of the International Symposium on Low Power Electronics and Design, San Francisco Airport, CA, August 2016 (2016)
115. Zhu, J., Qian, Z., Tsui, C.Y.: LRADNN: high-throughput and energy-efficient deep neural network accelerator using low rank approximation. In: Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), Macao, China, 25-28 January 2016 (2016)
116. Zhao, R., et al.: Accelerating binarised convolutional neural networks with software-programmable FPGAs. In: FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2017 (2017)
117. Li, Z., et al.: E-RNN: design optimization for efficient recurrent neural networks in FPGAs. In: Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture (HPCA), Washington, DC, 16-20 February 2019 (2019)
118. Lin, S., et al.: FFT-based deep learning deployment in embedded systems. In: Proceedings of the 2018 Design, Automation and Test in Europe Conference and Exhibition (DATE 2018), Dresden, 19-23 March 2018 (2018)
119. Zhang, C., Prasanna, V.: Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system. In: FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2017 (2017)
120. Lu, L. et al.: Evaluating fast algorithms for convolutional neural networks on FPGAs. In: Proceedings - IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM 2017), Napa, CA, 30 April-2 May 2017 (2017)
121. Lin, J., Yao, Y.: A fast algorithm for convolutional neural networks using tile-based fast Fourier transforms. Neural Process. Lett. (2019)
122. Di Cecco, R. et al.: FPGA framework for convolutional neural networks. In: Proceedings of the 2016 International Conference on Field-Programmable Technology (FPT 2016) (2017)
123. Huang, Y. et al.: A high-efficiency FPGA-based accelerator for convolutional neural networks using Winograd algorithm. In: Journal of Physics: Conference Series, 6-8 March 2018, Avid College, Maldives (2018)
124. Kim, J.H. et al.: FPGA-based CNN inference accelerator synthesised from multi-threaded C software. In: International System on Chip Conference (2017)
125. Li, Q. et al.: Implementing neural machine translation with bi-directional GRU and attention mechanism on FPGAs using HLS. In: Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), Tokyo, Japan, January 2019 (2019)
126. Ma, Y. et al.: An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks. In: 27th International Conference on Field Programmable Logic and Applications (FPL 2017) (2017)
127. Xu, J. et al.: CaFPGA: an automatic generation model for CNN accelerator. Microprocess. Microsyst. 60, 196–206 (2018)
128. Zeng, S., et al.: An efficient reconfigurable framework for general purpose CNN-RNN models on FPGAs. In: International Conference on Digital Signal Processing (DSP), Shanghai, China, 19-21 November 2018 (2019)
129. Ma, Y. et al.: Scalable and modularised RTL compilation of convolutional neural networks onto FPGA. In: FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications, Lausanne, Switzerland, 29 August-2 September 2016 (2016)
130. Guan, Y., et al.: FP-DNN: an automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In: Proceedings - IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM 2017), Napa, CA, 30 April-2 May 2017 (2017)
131. Alrawashdeh, K., Purdy, C.: Reducing calculation requirements in FPGA implementation of deep learning algorithms for online anomaly intrusion detection. In: Proceedings of the IEEE National Aerospace Electronics Conference (NAECON), Dayton, OH, 27-30 June 2017 (2018)

How to cite this article: Dhouibi M, Ben Salem AK, Saidi A, Ben Saoud S. Accelerating Deep Neural Networks implementation: A survey. IET Comput. Digit. Tech. 2021;15:79–96. https://doi.org/10.1049/cdt2.12016