DOI: 10.1049/cdt2.12016
REVIEW
Revised: 5 December 2020 | Accepted: 11 December 2020
Accelerating Deep Neural Networks implementation: A survey
Meriam Dhouibi | Ahmed Karim Ben Salem | Afef Saidi | Slim Ben Saoud
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is
properly cited.
© 2021 The Authors. IET Computers & Digital Techniques published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.
precision, pruning networks, low‐rank approximation, etc. Furthermore, for efficient implementation of an optimised DL model, further acceleration improvement is required. Indeed, it is necessary to maximise the utilization of all the opportunities offered at several levels of hardware/software codesign to achieve high performance in terms of precision, energy consumption and throughput. This survey takes a deep dive into DL implementation on advanced and dedicated computation platforms and reveals its bottlenecks. In addition, it focuses on hardware and software techniques to optimise the implementation of DNNs and also provides a summary of recent research work. Some surveys dealing with DL implementation have been published. However, those papers have not discussed the state of the art across different hardware platforms. Most of the recent surveys have focussed on FPGA‐based CNN acceleration without justifying the choice of FPGA over other platforms. Another strong aspect of our work is that we discuss the optimization of DNNs at both levels, software and hardware. Furthermore, we present a classification of advanced hardware acceleration techniques based on throughput and energy optimizations. An investigation of the algorithmic side and its effect on designing accelerators is also included in this survey. Additionally, we expose the tools that can automatically generate hardware designs from software and that are used for implementing and evaluating deep learning approaches. Herein, the survey is organised as follows:

‐ Section 2 presents the basics of DL, its popular models and architectures currently in use, and highlights the complexity of these models.
‐ Section 3 describes the various hardware platforms used to implement DNNs.
‐ Section 4 exposes the optimization techniques that can be applied to make the model more efficient in terms of speed and power.

Finally, a synthesis of the different acceleration techniques explored in recent research works is given and analysed.

2 | BACKGROUND AND MOTIVATIONS

Currently, DL represents the leading‐edge solution in virtually all relevant machine learning tasks in a large variety of fields [5,6]. DL algorithms show significant improvement over traditional machine learning algorithms based on the manual extraction of relevant features (handcrafted features) [7]. DL models perform a hierarchical feature extraction and also show better performance as the amount of data increases [8]. There are different methods and architectures of DL such as the Multi‐Layer Perceptron (MLP), Autoencoder (AE), Deep Belief Network (DBN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) including Long Short‐Term Memory (LSTM) and Gated Recurrent Units (GRU), Generative Adversarial Network (GAN), Deep Reinforcement Learning (DRL), etc. [9]. These models have covered several fields with a variety of applications. Particularly, CNN models have demonstrated impressive performance in computer vision applications such as autonomous car vision systems [10], drone navigation, robotics [11], etc. CNNs have also proved to be effective in the medical field, especially in image recognition: they have been shown to be better at detecting a tumour or any other type of lesion than the most experienced radiologists [12]. In Ref. [13], an image extracted from Magnetic Resonance Imaging (MRI) of a human brain was processed to predict Alzheimer's disease using a CNN. DL models are also used in drug research by predicting molecular properties such as toxicity or binding capacity. In particular, DL can be used to simulate biological or chemical processes of different molecules without the need for expensive software simulators, and is 30,000 times faster [14]. Moreover, RNN models have excelled in natural language processing, including automatic speech recognition, recommendation systems, audio recognition, machine translation, social media filtering, etc. For example, various LSTM models have been proposed for sequence‐to‐sequence mapping that are suitable for machine translation [15]. Furthermore, CNNs and RNNs were combined to add sounds to silent movies [16] and to generate captions that describe the contents of images [17]. Besides, it is important to note that the effective implementation of DL models on embedded platforms is behind this diffusion of such applications. The performance of such AI algorithms using DL models depends on the capacity of processors to support the DNN, with its varied number of layers, neurons per layer, multiple filters, filter sizes and channels, while treating large datasets. Indeed, DL workloads are both computation and memory intensive. For example, the well‐known CNN ResNet50 [18] requires up to 7.7 billion floating point operations (FLOPs) and 25.6 million model parameters to classify a 224 × 224 × 3 image. As shown in Figure 1, the more complex and larger model VGG16 [19], with a 138.3 million parameter model size, requires up to 30.97 Giga FLOPs (GFLOPs). Thus, the number of operations and parameters increases with the complexity of the model architecture. Table 1 presents state‐of‐the‐art models' sizes and complexities.

VGG models were developed by the Visual Geometry Group at the University of Oxford and are among the most preferred choices in the community for extracting features from images. They are widely used in many applications despite their expensive architecture in terms of both parameter count and computational requirements (Figure 1). The large dimensionality of these models increases the computation and data movement. More precisely, it increases the amount of generated data, whose movement is considered more expensive than computation in terms of power on hardware platforms [21]. At this inflection point, it is therefore necessary to benefit from new design methodologies, to make good use of new design opportunities and to explore optimization techniques that reduce the network size and enhance the implementation performance in terms of throughput and energy consumption. Besides, the choice of a suitable hardware platform to implement a DL model is of paramount importance [24]. In the next section, we will explore the different computation platforms of DL implementation.
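To make these complexity figures concrete, the following Python sketch (our illustrative addition, not code from any cited work) estimates the parameter count and FLOPs of a single convolutional layer; summing such terms over all layers of a network yields totals of the kind quoted above for ResNet50 and VGG16.

```python
def conv2d_cost(h_in, w_in, c_in, c_out, k, stride=1, pad=0):
    """Parameters and FLOPs of one 2-D convolutional layer.

    FLOPs counts one multiply plus one add per MAC (2 ops).
    """
    h_out = (h_in + 2 * pad - k) // stride + 1
    w_out = (w_in + 2 * pad - k) // stride + 1
    params = c_out * (c_in * k * k + 1)            # weights + biases
    macs = h_out * w_out * c_out * c_in * k * k    # one MAC per weight per output pixel
    return params, 2 * macs

# First VGG16 conv layer on a 224 x 224 x 3 image: 64 filters of size 3 x 3.
p, f = conv2d_cost(224, 224, 3, 64, k=3, pad=1)
print(f"params = {p:,}, FLOPs = {f / 1e9:.3f} GFLOPs")  # params = 1,792, ~0.173 GFLOPs
```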
FIGURE 1 Computational cost of most popular models: inference on ImageNet dataset [20]
3 | COMPUTATION PLATFORM OF DL IMPLEMENTATION

The employment of DL in daily applications across different fields will depend on the ease with which DL models can be deployed on small, low‐power devices rather than large servers. In the majority of cases, the training phase is performed in the cloud. The inference phase, however, is less demanding; it can happen locally or in the cloud depending on the application [24]. Research is underway on implementing both phases using parallel architectures on different hardware targets and computing devices. Four major types of technology are being used to accelerate DNNs: CPU, GPU, FPGA and ASIC.

3.1 | Central processing units

Traditionally, DNNs were mainly tested on the CPU of a computer. The CPU works by sequentially performing the computations that are sent to it. Sometimes, a programme has different tasks that can be calculated independently of each other. To optimise the time required to complete all tasks, many processors have multiple threads or cores that can perform parallel calculations. Some manufacturers have sought to optimise the hardware architectures of their processors to meet the needs of DL: Intel has tweaked the CPUs of its servers to improve their performance with DL [25], and Google has developed a chip to perform DL tasks more economically [26]. However, it is still very difficult for CPUs, even with multicore architectures, to support the high computation and storage complexity of large DNN models.

3.2 | Graphics processing units

A GPU excels in parallel computing. A CPU typically has between one and eight cores, whereas high‐end GPUs have thousands of cores (e.g. the GeForce GTX TITAN Z includes 5760 cores; a more recent example is the GeForce RTX 2080). GPUs are slow during sequential operations but shine when given tasks that can run in parallel. Since the operations required to run a DL algorithm can be done in parallel, GPUs have become extremely valuable tools. Furthermore, by using OpenCL [27], an open standard for portable parallelisation, compute kernels written using a limited subset of the C programing language can be launched on GPUs. In this perspective, NVIDIA has invested much in its CUDA (Compute Unified Device Architecture) language to make it support most DL development frameworks. Similar to OpenCL, CUDA affords a general‐purpose programing environment and enables parallel processing over the cores of NVIDIA GPUs. NVIDIA GPUs are currently the most widely used for DL.
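One reason GPUs map so well onto DL workloads is that convolutions can be lowered to large matrix multiplications (GEMM), the dense parallel operation for which GPU cores and vendor libraries are heavily optimised. The NumPy sketch below is our own illustration of the classic im2col lowering, under the simplifying assumptions of unit stride and no padding:

```python
import numpy as np

def conv2d_im2col(x, w):
    """Valid 2-D convolution of a (C, H, W) input with (F, C, K, K) filters,
    lowered to a single matrix multiplication (the form GPUs excel at)."""
    c, h, wd = x.shape
    f, _, k, _ = w.shape
    oh, ow = h - k + 1, wd - k + 1
    # Gather every KxK receptive field into a column: (C*K*K, OH*OW).
    cols = np.empty((c * k * k, oh * ow))
    idx = 0
    for i in range(oh):
        for j in range(ow):
            cols[:, idx] = x[:, i:i + k, j:j + k].ravel()
            idx += 1
    # One GEMM computes all F output maps at once: (F, C*K*K) @ (C*K*K, OH*OW).
    out = w.reshape(f, -1) @ cols
    return out.reshape(f, oh, ow)

x = np.random.rand(3, 8, 8)
w = np.random.rand(4, 3, 3, 3)
print(conv2d_im2col(x, w).shape)  # (4, 6, 6)
```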
TABLE 2 Effect of reduced precision and pruning on DNN models

Reduced precision (baseline: float 32‐bit):

| Reduce precision technique | Reference | DL model | Input | Weight | Activation | Gradient | Baseline Top‐1 accuracy | Accuracy loss after reduction |
|---|---|---|---|---|---|---|---|---|
| Reduce weight | [49] | MobileNetV1 | ‐ | 8‐bit | 32‐bit | 32‐bit | 70.77% | 2.74% |
| Reduce weight and activation | FFN [65] | AlexNet | ‐ | 2‐bit | 32‐bit | 32‐bit | 57.20% | 1.70% |
| Reduce input, weight and activation | BNN [57] | AlexNet | 1‐bit | 1‐bit | 1‐bit | 32‐bit | 57.20% | 30.10% |
| Reduce weight, activation and gradient | DoReFa‐Net [58] | AlexNet | ‐ | 8‐bit | 8‐bit | 8‐bit | 55.90% | 2.90% |

Pruning:

| Pruning technique | Reference | DL model | Top‐5 accuracy (without pruning) | Accuracy loss (with pruning) | Reduction |
|---|---|---|---|---|---|
| Weight pruning | [66] | ResNet110 | 93.50% | 0% | 90% weight |
| | [66] | ResNet56 | 93.33% | 0% | 90% weight |
| | [67] | LeNet‐300‐100 | 98.40% | 0% | 22.9× weight |
| | [67] | LeNet‐5 | 99.20% | 0% | 71.2× weight |
| | [67] | AlexNet | 80.20% | 0% | 21× weight |
| Energy‐aware pruning | [68] | AlexNet | 80.43% | 0.87% | 3.7× energy consumption |
| | [68] | GoogLeNet | 88.26% | 0.98% | 1.6× energy consumption |
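As a minimal illustration of the two families of techniques summarised in Table 2 (the helper names and the 90% sparsity target below are our own, hypothetical choices; the 90% figure echoes the weight reduction reported in Ref. [66]), the following NumPy sketch applies symmetric 8‐bit weight quantization and magnitude‐based weight pruning to a weight matrix:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization of weights to 8-bit integers."""
    scale = np.abs(w).max() / 127.0          # map [-max, max] onto [-127, 127]
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale                           # dequantize with q * scale

def prune_by_magnitude(w, sparsity=0.9):
    """Zero out the smallest-magnitude weights (hypothetical 90% target)."""
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < thresh, 0.0, w)

w = np.random.randn(256, 128).astype(np.float32)
q, s = quantize_int8(w)
w_deq = q.astype(np.float32) * s
print("max quantization error:", np.abs(w - w_deq).max())
w_sparse = prune_by_magnitude(w)
print("fraction of weights kept:", np.count_nonzero(w_sparse) / w.size)
```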
avoid the off‐chip intermediate data transfers. It achieved high throughput on FPGA using VGG‐16. The fused‐layer CNN accelerator proposed by Alwani et al. [95] also reduced the off‐chip data transfer by 95%.

5.2.2 | Other approaches

Many other methods have been used to reduce power consumption. In Ref. [114], Zhang et al. proposed a deeply pipelined FPGA architecture to leverage the design space for energy efficiency. The evaluation results on VGG‐16 achieved 8.28 GOPs/J of energy efficiency. The study by Zhu et al. [115] showed that using low‐rank approximation, a 31% to 53% energy reduction can be reached. Low data representation can also reduce energy consumption: the binarised neural networks in Ref. [116] attained 44.2 GOPs/W. In Ref. [61], Han et al. presented an Efficient Speech Recognition Engine (ESE), a SW/HW co‐design framework which works directly on a compressed LSTM model. It achieved up to 428 FPs/W, which is 40× more energy efficient than a CPU. Li et al. [117] presented the Efficient RNN (E‐RNN) framework for FPGA implementation of the Automatic Speech Recognition (ASR) application. For more accurate block‐circulant training, they used the Alternating Direction Method of Multipliers (ADMM) technique. This approach achieves a 37.4× energy efficiency improvement compared with ESE [61]. Table 4 illustrates several acceleration techniques optimised for low energy consumption.

5.3 | Algorithmic optimization

Recent works [59,118,119] have demonstrated that applying mathematical optimizations such as the Fast Fourier (FF) and Winograd algorithms to DNN accelerators can improve resource productivity and efficiency. These transformations can decrease the required MAC operations in the network's layers by reducing the model's arithmetic complexity. For example, each element in the output feature map of a CNN model is normally computed individually. Contrariwise, the FF and Winograd algorithms transform the input feature map and filter to the corresponding domain (Winograd or frequency) and then perform element‐wise matrix multiplication [120]. To get the final output, an inverse transformation is applied. The reduction of the model's arithmetic complexity depends on the parameters of the algorithm. With an 8×8 input tile size, the FF Transform (FFT) algorithm can reduce the multiplications by 3.45 times for 3×3 filters. On the other hand, with a 6×6 input tile size, the Winograd algorithm can reduce the multiplications by 4 times for 3×3 filters. In Ref. [120], Liang et al. investigated both the Winograd and Fast Fourier transformations and proved their considerable effect in reducing arithmetic complexity and improving CNN performance on FPGAs.
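To see the transform‐domain pattern described above in action — forward transform, element‐wise product, inverse transform — the following NumPy sketch (our illustrative example; the 8×8 tile and 3×3 filter sizes match the figures quoted above) performs a 2‐D convolution in the frequency domain and checks it against direct computation:

```python
import numpy as np
from scipy.signal import convolve2d

def conv2d_fft(x, h):
    """2-D linear convolution via FFT: transform, element-wise multiply, invert."""
    out_shape = (x.shape[0] + h.shape[0] - 1, x.shape[1] + h.shape[1] - 1)
    X = np.fft.rfft2(x, out_shape)
    H = np.fft.rfft2(h, out_shape)
    return np.fft.irfft2(X * H, out_shape)   # element-wise product replaces sliding MACs

x = np.random.rand(8, 8)      # input tile
h = np.random.rand(3, 3)      # 3x3 filter
ref = convolve2d(x, h)        # direct sliding-window computation
assert np.allclose(conv2d_fft(x, h), ref)
print("FFT convolution matches direct convolution")
```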
5.3.1 | Fast Fourier transform

FFT is a well‐known approach that reduces computational complexity. The study by Lin et al. [118] presented a framework based on FFT that achieved significant processing speed and a reduction in storage requirements. Zhang et al. also exploited FFT to deal with the complexity of the convolutional layers' computation [119]. The proposed design performed 123.48 GFLOPs on an Intel QuickAssist QPI FPGA platform using VGG. Likewise, to accelerate the operations in each convolutional layer, a tile‐based FFT algorithm (tFFT) is presented in Ref. [121]. Another proposed framework, C‐LSTM [59], used FFT to accelerate LSTM inference by reducing the computational and storage complexities. The latter achieved 18.8× and 33.5× gains in performance and energy efficiency, respectively, compared with the state‐of‐the‐art ESE [61].

5.3.2 | Winograd algorithm

Very similar to FFT, the Winograd fast matrix multiplication algorithm is applied to DNNs to minimise the multiplication requirement. By adopting the Winograd transformation in Ref. [101], the DSP utilization is improved. Lu et al. [120] used the Winograd algorithm to accelerate CNNs by reducing the multiplication operations and saving DSP resources. On VGG, the proposed design attained 2479.6 GOPs of throughput. Additionally, the study by Di Cecco et al. [122] implemented a Winograd convolution engine on FPGA which performed 55 GOPs when executing VGG. More recently, Huang et al. [123] designed an accelerator based on the Winograd algorithm and evaluated it with different tile sizes. When using VGG, the design achieved 943 GOPs on FPGA. More details are presented in Table 5.

5.4 | HW design automation

Design automation frameworks have been explored to accelerate DNNs by automatically mapping their models onto hardware platforms. The use of such frameworks can significantly simplify the development and speed up the automatic generation of the hardware accelerator. Some approaches have focussed on using HLS, an automated design process that generates high‐performance FPGA hardware from software. The study by Zhang et al. [99] designed Caffeine, a HW/SW co‐designed library based on HLS tools. Kim et al. [124] analysed the efficiency of the HLS implementation and designed a CNN‐based FPGA accelerator using the LegUp HLS tool. The proposed accelerator performed 138 GOPs on VGG‐16. SDAccel, OpenCL and HLS tools are applied in Ref. [122] to synthesise a CNN accelerator that reached 55 GFLOPS on VGG. In Ref. [63], Zhang et al. applied HLS for the implementation of a Long‐term Recurrent Convolution Network (LRCN) on a Xilinx FPGA based on their designed resource allocation scheme REALM. More recently, the authors in Ref. [125] presented an implementation of a Neural Machine Translation (NMT) model on FPGA, using HLS to build parameterised IPs. Many other approaches have used the Register Transfer Level (RTL), which describes the design as the transfers that occur between registers every clock cycle. Leveraging RTL offers higher performance. In Ref. [97], Ma et al. proposed an RTL‐level CNN compiler that automatically generates a customised FPGA accelerator. The VGG implementation gained a 2.7× throughput improvement over [99]. The study by Ma et al. [126] developed an RTL FPGA‐based accelerator which achieved 720.15 GOPs using VGG‐16. Using RTL codes, the designed accelerator in Ref. [127] achieved 638.9 GOPs on VGG‐16.
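The savings quoted in Section 5.3 follow from the transform sizes: producing an m×m output tile from 3×3 filters with Winograd needs (m+2)² element‐wise multiplications instead of the 9m² used by direct convolution, i.e. 36 instead of 144 (a 4× reduction) for m = 4 with a 6×6 input tile. As a concrete, minimal illustration of the Winograd transformation discussed in Section 5.3.2, the following NumPy sketch (our own example, using the standard F(2, 3) transform matrices) computes two outputs of a 1‐D convolution with 4 multiplications instead of 6:

```python
import numpy as np

# Standard Winograd F(2, 3) transform matrices.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])   # 4-element input tile
g = np.array([0.5, 1.0, -1.0])       # 3-tap filter

m = (G @ g) * (B_T @ d)              # 4 element-wise multiplications
y = A_T @ m                          # 2 outputs of the convolution
direct = np.array([np.dot(d[i:i + 3], g) for i in range(2)])  # 6 multiplications
assert np.allclose(y, direct)
print(y)
```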
| Acceleration technique | DL model | Design tool | DSP utilization | Energy optimization | Throughput | Reference |
|---|---|---|---|---|---|---|
| Loop optimization | VGG‐16 | Verilog/Quartus Prime | 1518 | ‐ | 645.25 GOPs (96.2 frames/s) | [94] |
| Optimising off‐chip memory access | AlexNet | Verilog/Synopsys | ‐ | 47.6%/36% DRAM access reduction | ‐ | SmartShuttle [112] |
| Avoiding off‐chip data transfers (multi‐layer fusion, loop tiling) | VGG‐16 | ‐ | 1090/900 | ‐ | 374.98 GOPs | Block convolution [113] |
| Pipelined FPGA cluster | VGG‐16 | HDL | ‐ | 8.28 GOPs/J | 290 GOPs | [114] |
| Flexible data buffering scheme | AlexNet | RTL/HLS | 1745 (59%)/2182 | 2.4× peak bandwidth reduction | 135 GOPs | Escher [111] |
| Low‐rank approximation | DNN (SVHN) | Verilog RTL/Synopsys | ‐ | 31%–53% energy reduction | 22%–43% throughput increase | LRADNN [115] |
| Binarised neural networks | CNN | C++/HLS | 86–94/140, 3/220 | 44.2 GOPs/W | 207.8 GOPs | BNN [116] |
| Compression (quantization + pruning); HW/SW co‐designed framework | LSTM | ‐ | 1080/1504 | 428 FPs/W | 282 GOPs | ESE [61] |
The study by Zeng et al. [128] used RTL IPs to create a reconfigurable framework for deploying CNN‐RNN models on FPGAs. On the LRCN network, the designed hardware system performed up to 690.76 GOPs of throughput and achieved 86.34 GOPs/W of energy efficiency. More results are provided in Table 6. Some other approaches combined the finer‐level optimization of RTL and the flexibility of HLS to design DNN accelerators, achieving 114.5 GOPs in Ref. [129]. Based on an RTL‐HLS hybrid library, Guan et al. designed FP‐DNN [130], a framework to automatically generate optimised DNN implementations on FPGA. The evaluation results reached 364.36 GOPs on a CNN model and 315.85 GOPs on an RNN model.

The acceleration methods aim to speed up DNNs while improving throughput and reducing energy consumption. Several techniques have been explored to achieve higher throughput, such as loop optimization, systolic array architectures and SIMD‐based computation. A DNN accelerator designed using these techniques usually consumes higher energy. Therefore, various techniques have been explored to obtain high throughput with low energy consumption, such as memory bandwidth reduction and model compression. For further improvement, algorithmic optimization approaches like the Fast Fourier and Winograd algorithms can be used. Furthermore, the automatic generation of a high‐performance hardware accelerator from software can significantly simplify the development and speed up the process (e.g. HLS). Reducing the energy consumption and improving throughput are key challenges for designing an efficient DNN‐based accelerator. Therefore, various acceleration techniques can be combined along with the optimization approaches.
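The loop‐tiling idea that recurs in these accelerators (e.g. the loop optimization of Ref. [94] and the loop tiling of Ref. [113]) can be sketched in a few lines. The following Python example (our illustration; the tile size is arbitrary) shows how blocking a matrix multiplication confines each working set to a small tile that can live in on‐chip buffers, which is what reduces off‐chip memory traffic:

```python
import numpy as np

def matmul_tiled(a, b, tile=32):
    """Blocked (tiled) matrix multiply: operands are processed in tile x tile
    blocks so that each block can be reused from fast on-chip memory."""
    n, k = a.shape
    _, m = b.shape
    c = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # One small GEMM per block triple; blocks fit in on-chip buffers.
                c[i:i + tile, j:j + tile] += a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
    return c

a, b = np.random.rand(96, 64), np.random.rand(64, 80)
assert np.allclose(matmul_tiled(a, b), a @ b)
```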
TABLE 7 DNN accelerators employing computational transform (recoverable excerpt): C‐LSTM [59] — Google LSTM model, C/C++/HLS design, resources 2786/330, 275 FPs, 14 359 FPs/W
6 | CONCLUSION

Herein, the DL concept was initially presented through the complexity of different models. We also reviewed the exploration of the different computation platforms for DL implementation. Then, we reviewed the literature on the different approaches used to optimise DL models to make them more hardware friendly. In the end, we presented and analysed the acceleration techniques used for the deployment of DL models on FPGA platforms. The deployment of DL on embedded equipment with high accuracy, high throughput and low consumption is still a challenge. Indeed, the hardware constraints required for lower power consumption, such as limited processing power, a lower memory footprint and less bandwidth, reduce the accuracy. Due to the increasing complexity of DNN models, it is difficult to integrate a large DNN into an embedded hardware design. This made researchers think about applying optimization and acceleration techniques. Optimization techniques focus on modifying DL algorithms to make them more hardware‐friendly. They effectively digest the redundancy of models and provide improved computing efficiency with minimal loss of accuracy. The acceleration methods, meanwhile, aim to speed up DNNs while improving throughput and reducing energy consumption. Also, applying algorithmic optimizations like the Fast Fourier and Winograd algorithms can accelerate DNNs and improve resource productivity and efficiency. In addition, the use of frameworks to automatically map models onto hardware platforms simplifies the development and speeds up the automatic generation of the hardware acceleration. The efficient implementation of complex DNN models on new and increasingly powerful embedded platforms can offer many benefits for AI applications. Previous works faced challenges such as limited hardware resources, long development times and performance degradation. Moreover, it is difficult to use all the functionalities of neural network algorithms in hardware compared to software implementations [131]. In this context, new FPGAs, using parallel processing and embedded programable cores, have advantages over other hardware platforms for DNN implementations. Whole systems can be integrated on a chip using many hardware components such as memories, fast devices, DSP units and processor cores, which expedites the design of such large‐scale systems. FPGAs are very flexible and allow reconfiguration to optimise bit resolution, clock rate, parallelisation and pipeline processing for a given application. Some FPGA manufacturers like Xilinx have provided accelerators (DPU) along with other tools and APIs to optimise pre‐trained DL models by applying pruning and quantization techniques.

ORCID
Meriam Dhouibi https://orcid.org/0000-0002-0273-3262

REFERENCES
1. Li, Y. et al.: Face recognition based on recurrent regression neural network. Neurocomputing. 297, 50–58 (2018)
2. Marra, F. et al.: A deep learning approach for iris sensor model identification. Pattern Recogn. Lett. 113, 46–53 (2018)
3. Lee, J.G., et al.: Deep learning in medical imaging: general overview. Korean J. Radiol. 18(4), 570–584 (2017)
4. Justesen, N. et al.: Deep learning for video game playing. IEEE Trans. Games (2019)
5. Lecun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature. 521, 436–444 (2015)
6. Fan, K., Wen, S., Deng, Z.: Deep learning for detecting breast cancer metastases on WSI. In: Innovation in Medicine and Healthcare Systems, and Multimedia, pp. 137–145. Springer, Singapore (2019)
7. Wang, J. et al.: Deep learning for smart manufacturing: methods and applications. J. Manuf. Syst. 48, 144–156 (2018)
8. Rémy, S.: Apprentissage profond et acquisition de représentations latentes de séquences peptidiques [Deep learning and acquisition of latent representations of peptide sequences]. https://hal.inria.fr/hal-01406368 (2016). Accessed 25 March 2018
9. Groumpos, P.P.: Deep learning vs. wise learning: a critical and challenging overview. IFAC‐PapersOnLine. 49(29), 180–189 (2016)
10. Ackerman, E.: How Drive.ai is mastering autonomous driving with deep learning. IEEE Spectrum. https://spectrum.ieee.org/cars-that-think/transportation/self-driving/how-driveai-is-mastering-autonomous-driving-with-deep-learning (2017). Accessed 20 March 2019
11. Giusti, A., et al.: A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robot. Autom. Lett. (2016)
12. Esteva, A., et al.: Dermatologist‐level classification of skin cancer with deep neural networks. Nature. 542(7639), 115 (2017)
13. Khagi, B., Lee, C.G., Kwon, G.R.: Alzheimer's disease classification from brain MRI based on transfer learning from CNN. In: BMEiCON 2018 ‐ 11th Biomedical Engineering International Conference, Chiang Mai, 21‐24 November 2018 (2019)
14. Gilmer, J., et al.: Neural message passing for quantum chemistry. In: Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 1263–1272. JMLR.org, Sydney, NSW (2017)
15. Zhang, J., et al.: Deep neural networks in machine translation: an overview. IEEE Intelligent Systems. 30(5), 16–25 (2015)
16. Owens, A., et al.: Visually indicated sounds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2405–2413 (2016)
17. Karpathy, A., Fei-Fei, L.: Deep visual‐semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 664–676 (2017)
18. He, K. et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, 27‐30 June 2016 (2016)
19. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large‐scale image recognition. In: 3rd International Conference on Learning Representations, ICLR 2015 ‐ Conference Track Proceedings, San Diego, CA, 7‐9 May 2015 (2015)
20. Canziani, A., Culurciello, E., Paszke, A.: Analysis of deep neural network architectures for practical applications. CoRR. abs/1605.07678. http://arxiv.org/abs/1605.07678 (2016)
21. Horowitz, M.: 1.1 Computing's energy problem (and what we can do about it). In: 2014 IEEE International Solid‐State Circuits Conference Digest of Technical Papers (ISSCC), San Francisco, CA, 9‐13 February 2014, pp. 10–14. IEEE (2014)
22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, Lake Tahoe, Nevada, December 2012, pp. 1097–1105 (2012)
23. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, 8–10 June 2015, pp. 1–9 (2015)
24. Sze, V. et al.: Hardware for machine learning: challenges and opportunities. In: 2017 IEEE Custom Integrated Circuits Conference (CICC), Austin, TX, 30 April‐3 May 2017, pp. 1–8. IEEE (2017)
25. Oliveira, D., et al.: Experimental and analytical study of Xeon Phi reliability. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p. 28. ACM (2017)
26. Frank, B.H.: Google's new chip makes machine learning way faster. https://www.computerworld.com/article/3072652/googles-new-chip-makes-machine-learning-way-faster.html (2016). Accessed 6 May 2018
27. Conformant products ‐ The Khronos Group Inc. https://www.khronos.org/conformance/adopters/conformant-products#opencl (2019)
28. NVDLA Primer — NVDLA documentation. http://nvdla.org/primer.html (2020). Accessed 7 May 2020
29. Tayal, P.: AMD's new Vega GPUs target deep learning. https://marketrealist.com/2018/12/amds-new-vega-gpus-target-deep-learning/ (2018). Accessed 15 April 2020
30. Bacon, D.F., et al.: FPGA programing for the masses. Commun. ACM. 56(4), 56–63 (2013)
31. Lacey, G., Taylor, G.W., Areibi, S.: Deep learning on FPGAs: past, present, and future. arXiv preprint arXiv:1602.04283 (2016)
32. Hou, X. et al.: Vehicle licence plate recognition system based on deep learning deployed to PYNQ. In: ISCIT 2018 ‐ 18th International Symposium on Communication and Information Technology, Bangkok, 26‐29 September 2018 (2018)
33. Ranjan, R., Patel, V.M., Chellappa, R.: HyperFace: a deep multi‐task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Trans. Pattern Anal. Mach. Intell. 41, 121–135 (2019)
34. Gaide, B. et al.: Xilinx adaptive compute acceleration platform: Versal™ architecture. In: FPGA 2019 ‐ Proceedings of the 2019 ACM/SIGDA International Symposium on Field‐Programmable Gate Arrays, Seaside, CA, February 2019 (2019)
35. Xilinx: DPU for convolutional neural network ‐ DPU IP product guide. https://www.xilinx.com/products/intellectual-property/dpu.html (2019). Accessed 17 April 2020
36. Alveo. https://www.xilinx.com/products/boards-and-kits/alveo.html (2020). Accessed 7 May 2020
37. MPC‐X Series | Maxeler Technologies. https://www.maxeler.com/products/mpc-xseries/ (2020). Accessed 9 May 2020
38. Feldman, M.: Microsoft goes all in for FPGAs to build out AI cloud | TOP500 Supercomputer Sites. https://www.top500.org/news/microsoft-goes-all-in-for-fpgas-to-build-out-cloud-based-ai/ (2017). Accessed 9 May 2020
39. Amazon: Amazon EC2 F1 instances. https://aws.amazon.com/ec2/instance-types/f1/ (2019). Accessed 9 May 2020
40. FPGA cloud server. https://cn.aliyun.com/product/ecs/fpga (2019). Accessed 9 May 2020
41. FPGA accelerated cloud server. https://www.huaweicloud.com/product/fcs.html (2019). Accessed 9 May 2020
42. FPGA cloud server_FPGA instance_hardware acceleration ‐ Tencent Cloud. https://cloud.tencent.com/product/fpga (2019). Accessed 10 May 2020
43. FPGA cloud server_Baidu Cloud. https://cloud.baidu.com/product/fpga.html (2019). Accessed 10 May 2020
44. Chen, Y.H. et al.: Eyeriss: an energy‐efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid State Circ. (2017)
45. Cloud TPU. https://cloud.google.com/tpu (2020)
46. Linley, G.: Habana wins cigar for AI inference: startup takes performance lead with mystery architecture. https://www.linleygroup.com/mpr/article.php?id=12103 (2019). Accessed 4 May 2020
47. Lightspeeur 2801 neural accelerator for edge devices. https://www.gyrfalcontech.ai/solutions/2801s/ (2019). Accessed 10 May 2020
48. Synced: California startup GTI releases AI chips to challenge NVIDIA and Intel. https://syncedreview.com/2019/01/28/california-startup-gti-releases-ai-chips-to-challenge-nvidia-and-intel/ (2019)
49. Sheng, T. et al.: A quantization‐friendly separable convolution for MobileNets. In: Proceedings ‐ 1st Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications, EMC2 2018, Williamsburg, VA, 25 March 2018 (2018)
50. Zhou, A. et al.: Incremental network quantization: towards lossless CNNs with low‐precision weights. In: 5th International Conference on Learning Representations, ICLR 2017 ‐ Conference Track Proceedings, Toulon, France, 24‐26 April 2017 (2019)
51. Li, F., Zhang, B., Liu, B.: Ternary weight networks. arXiv preprint arXiv:1605.04711 (2016)
52. Rastegari, M. et al.: XNOR‐Net: ImageNet classification using binary convolutional neural networks. In: Lecture Notes in Computer Science. Springer, Cham, Amsterdam, The Netherlands (2016)
53. Gysel, P.: Ristretto: hardware‐oriented approximation of convolutional neural networks. arXiv preprint arXiv:1605.06402 (2016)
54. Zhou, S.C. et al.: Balanced quantization: an effective and efficient approach to quantized neural networks. J. Comput. Sci. Technol. 32(4), 667–682 (2017)
55. Hubara, I. et al.: Quantized neural networks: training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18(1), 6869–6898 (2017)
56. Cai, Z. et al.: Deep learning with low precision by half‐wave Gaussian quantization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5918–5926 (2017)
57. Courbariaux, M. et al.: Binarised neural networks: training deep neural networks with weights and activations constrained to +1 or ‐1. arXiv preprint arXiv:1602.02830 (2016)
58. Zhou, S. et al.: DoReFa‐Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016)
59. Wang, S., et al.: C‐LSTM: enabling efficient LSTM using structured compression techniques on FPGAs. In: FPGA 2018 ‐ Proceedings of the 2018 ACM/SIGDA International Symposium on Field‐Programmable Gate Arrays, Monterey, CA, February 2018 (2018)
60. Moss, D.J.M., et al.: A customisable matrix multiplication framework for the Intel HARPv2 Xeon+FPGA platform: a deep learning case study. In: Proceedings of the 2018 ACM/SIGDA International Symposium on Field‐Programmable Gate Arrays, Monterey, CA, February 2018, pp. 107–116. ACM (2018)
61. Han, S., et al.: ESE: efficient speech recognition engine with sparse LSTM on FPGA. In: Proceedings of the 2017 ACM/SIGDA International Symposium on Field‐Programmable Gate Arrays, pp. 75–84. ACM (2017)
62. Shen, J. et al.: Towards a uniform template‐based architecture for accelerating 2D and 3D CNNs on FPGA. In: Proceedings of the 2018 ACM/SIGDA International Symposium on Field‐Programmable Gate Arrays, Monterey, CA, February 2018, pp. 97–106. ACM (2018)
63. Zhang, X., et al.: High‐performance video content recognition with long‐term recurrent convolutional network for FPGA. In: 27th International Conference on Field Programmable Logic and Applications (FPL), Ghent, 4‐8 September 2017, pp. 1–4. IEEE (2017)
64. Hu, Q., Wang, P., Cheng, J.: From hashing to CNNs: training binary weight networks via hashing. In: Thirty‐Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, 2‐7 February 2018 (2018)
65. Wang, P., Cheng, J.: Fixed‐point factorised networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 21‐26 July 2017, pp. 4012–4020 (2017)
66. Carreira‐Perpinán, M.A., Idelbayev, Y.: 'Learning‐compression' algorithms for neural net pruning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 18‐23 June 2018, pp. 8532–8541 (2018)
67. Zhang, T., et al.: A systematic DNN weight pruning framework using alternating direction method of multipliers. In: Lecture Notes in Computer Science. Springer, Cham, Munich, Germany (2018)
68. Yang, T.J., Chen, Y.H., Sze, V.: Designing energy‐efficient convolutional neural networks using energy‐aware pruning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 21‐26 July 2017, pp. 5687–5695 (2017)
69. Li, H. et al.: Pruning filters for efficient ConvNets. In: 5th International Conference on Learning Representations, ICLR 2017 ‐ Conference Track Proceedings, Toulon, France, 24‐26 April 2017 (2019)
70. Huang, Q. et al.: Learning to prune filters in convolutional neural networks. In: IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, 12‐15 March 2018, pp. 709–718. IEEE (2018)
71. Singh, P. et al.: Multi‐layer pruning framework for compressing single shot multibox detector. In: IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, 7‐11 January 2019, pp. 1318–1327. IEEE (2019)
72. Zhuang, Z., et al.: Discrimination‐aware channel pruning for deep neural networks. In: Advances in Neural Information Processing Systems. Curran Associates Inc., Montreal, Canada (2018)
73. Liu, C., Wu, H.: Channel pruning based on mean gradient for accelerating convolutional neural networks. Signal Process. 156, 84–91 (2019)
74. Liu, Z., et al.: Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE International Conference on Computer Vision, Venice, 22‐29 October 2017 (2017)
75. Sze, V., et al.: Efficient processing of deep neural networks. Synthesis Lectures on Computer Architecture. 15(2), 1–341 (2020)
76. He, Y., et al.: AutoML for model compression and acceleration on mobile devices. In: Lecture Notes in Computer Science. Springer, Cham, Munich, Germany (2018)
77. Qin, Q., et al.: To compress, or not to compress: characterising deep learning model compression for embedded inference. In: Proceedings ‐ 16th IEEE International Symposium on Parallel and Distributed Processing with Applications, 17th IEEE International Conference on Ubiquitous Computing and Communications, 8th IEEE International Conference on Big Data and Cloud Computing, 11th IEEE International Conference on Social Computing and Networking and 8th IEEE International Conference on Sustainable Computing and Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom 2018), Melbourne, Australia, 11‐13 December 2018 (2019)
78. Tung, F., Mori, G.: CLIP‐Q: deep network compression learning by in‐parallel pruning‐quantization. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 18‐23 June 2018 (2018)
79. Faraone, J. et al.: Customising low‐precision deep neural networks for FPGAs. In: 28th International Conference on Field Programmable Logic and Applications (FPL), Dublin, 27‐31 August 2018, pp. 97–973. IEEE (2018)
80. Posewsky, T., Ziener, D.: A flexible FPGA‐based inference architecture for pruned deep neural networks. In: Lecture Notes in Computer Science. Springer, Cham, Braunschweig, Germany (2018)
81. Zhang, M. et al.: Optimised compression for implementing convolutional neural networks on FPGA. Electronics (Switzerland) (2019)
82. Kim, Y.D. et al.: Compression of deep convolutional neural networks for fast and low power mobile applications. In: 4th International Conference on Learning Representations, ICLR 2016 ‐ Conference Track Proceedings, San Juan, Puerto Rico, 2‐4 May 2016 (2016)
83. Tai, C. et al.: Convolutional neural networks with low‐rank regularisation. In: 4th International Conference on Learning Representations, ICLR 2016 ‐ Conference Track Proceedings, San Juan, Puerto Rico, 2‐4 May 2016 (2016)
84. Zhang, X. et al.: Accelerating very deep convolutional networks for classification and detection. IEEE Trans. Pattern Anal. Mach. Intell. 38(10), 1943–1955 (2015)
85. Wang, M., Liu, B., Foroosh, H.: Factorized convolutional neural networks. In: Proceedings ‐ 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, 22‐29 October 2017 (2017)
86. Chen, W. et al.: A layer decomposition‐recomposition framework for neuron pruning towards accurate lightweight networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, 27 January‐1 February 2019 (2019)
87. Wen, W. et al.: Coordinating filters for faster deep neural networks. In: Proceedings of the IEEE International Conference on Computer Vision, Venice, 22‐29 October 2017 (2017)
88. Wang, P., Cheng, J.: Accelerating convolutional neural networks for mobile applications. In: MM 2016 ‐ Proceedings of the 2016 ACM Multimedia Conference, Amsterdam, The Netherlands, October 2016 (2016)
89. Denton, E. et al.: Exploiting linear structure within convolutional networks for efficient evaluation. In: Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), Vol. 1, pp. 1269–1277. MIT Press, Cambridge, MA (2014)
90. Sainath, T.N. et al.: Low‐rank matrix factorization for deep neural network training with high‐dimensional output targets. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing ‐ Proceedings, Vancouver, BC, Canada, 26‐31 May 2013 (2013)
91. Qiu, J., et al.: Going deeper with embedded FPGA platform for convolutional neural network. In: FPGA 2016 ‐ Proceedings of the 2016 ACM/SIGDA International Symposium on Field‐Programmable Gate Arrays, Monterey, CA, February 2016 (2016)
92. Ding, H., et al.: A compact CNN‐DBLSTM based character model for offline handwriting recognition with Tucker decomposition. In: Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), Kyoto, 9‐15 November 2017 (2017)
93. Li, B. et al.: Running sparse and low‐precision neural network: when algorithm meets hardware. In: Proceedings of the Asia and South Pacific Design Automation Conference (ASP‐DAC), Jeju, 22‐25 January 2018 (2018)
94. Ma, Y. et al.: Optimising loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In: FPGA 2017 ‐ Proceedings of the 2017 ACM/SIGDA International Symposium on Field‐Programmable Gate Arrays, Monterey, CA, February 2017 (2017)
95. Alwani, M. et al.: Fused‐layer CNN accelerators. In: Proceedings of the Annual International Symposium on Microarchitecture (MICRO), Taipei, Taiwan, 15‐19 October 2016 (2016)
96. Rahman, A., Lee, J., Choi, K.: Efficient FPGA acceleration of convolutional neural networks using logical‐3D compute array. In: Proceedings of the 2016 Design, Automation and Test in Europe Conference and Exhibition (DATE 2016), Dresden, Germany, 14‐18 March 2016 (2016)
97. Ma, Y. et al.: FPGA acceleration of deep learning algorithms with a modularised RTL compiler. Integration. 62, 14–23 (2018)
98. Wang, C. et al.: DLAU: a scalable deep learning accelerator unit on FPGA. IEEE Trans. Comput. Aided Des. Integr. Circ. Syst. 36, 513–517 (2017)
99. Zhang, C. et al.: Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks. IEEE Trans. Comput. Aided Des. Integr. Circ. Syst. 38, 2072–2085 (2019)
100. Wei, X., et al.: Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In: Proceedings ‐ Design Automation Conference, Austin, TX, 18‐22 June 2017 (2017)
101. Aydonat, U. et al.: An OpenCL™ deep learning accelerator on Arria 10. In: FPGA 2017 ‐ Proceedings of the 2017 ACM/SIGDA International Symposium on Field‐Programmable Gate Arrays, Monterey, CA, February 2017 (2017)
102. Zhang, J. et al.: Frequency improvement of systolic array‐based CNNs on FPGAs. In: Proceedings ‐ IEEE International Symposium on Circuits and Systems, Sapporo, Japan, 26‐29 May 2019 (2019)
103. Nguyen, D., Kim, D., Lee, J.: Double‐MAC: doubling the performance of convolutional neural networks on modern FPGAs. In: Proceedings of the 2017 Design, Automation and Test in Europe (DATE 2017), Lausanne, 27‐31 March 2017 (2017)
104. Zhong, G. et al.: Synergy: an HW/SW framework for high throughput CNNs on embedded heterogeneous SoC. ACM Trans. Embed. Comput. Syst. (2019)
105. Spagnolo, F. et al.: Energy‐efficient architecture for CNNs inference on heterogeneous FPGA. J. Low Power Electron. Appl. (2020)
106. Price, M., Glass, J., Chandrakasan, A.P.: A scalable speech recogniser with deep‐neural‐network acoustic models and voice‐activated power gating. In: Digest of Technical Papers ‐ IEEE International Solid‐State Circuits Conference (2017)
107. Yazdanbakhsh, A. et al.: A unified MIMD‐SIMD acceleration for generative adversarial networks. In: Proceedings ‐ International Symposium on Computer Architecture, Los Angeles, CA, June 2018 (2018)
108. Lin, C.Y., Lai, B.C.: Supporting compressed‐sparse activations and weights on SIMD‐like accelerator for sparse convolutional neural networks. In: Proceedings of the Asia and South Pacific Design Automation Conference (ASP‐DAC) (2018)
109. Zhang, J., Li, J.: Improving the performance of OpenCL‐based FPGA accelerator for convolutional neural network. In: FPGA 2017 ‐ Proceedings of the 2017 ACM/SIGDA International Symposium on Field‐Programmable Gate Arrays, Monterey, CA, February 2017 (2017)
110. Wang, Y., Li, H., Li, X.: Re‐architecting the on‐chip memory sub‐system of machine‐learning accelerator for embedded devices. In: IEEE/ACM International Conference on Computer‐Aided Design, Digest of Technical Papers (ICCAD), Austin, TX, 7‐10 November 2016 (2016)
111. Shen, Y., Ferdman, M., Milder, P.: Escher: a CNN accelerator with flexible buffering to minimise off‐chip transfer. In: Proceedings ‐ IEEE 25th Annual International Symposium on Field‐Programmable Custom Computing Machines (FCCM), Napa, CA, 30 April‐2 May 2017 (2017)
112. Li, J., et al.: SmartShuttle: optimising off‐chip memory accesses for deep learning accelerators. In: Proceedings of the 2018 Design, Automation and Test in Europe Conference and Exhibition (DATE 2018), Dresden, 19‐23 March 2018 (2018)
113. Li, G. et al.: Block convolution: towards memory‐efficient inference of large‐scale CNNs on FPGA. In: Proceedings of the 2018 Design, Automation and Test in Europe Conference and Exhibition (DATE 2018), Dresden, 19‐23 March 2018 (2018)
114. Zhang, C. et al.: Energy‐efficient CNN implementation on a deeply pipelined FPGA cluster. In: Proceedings of the International Symposium on Low Power Electronics and Design, San Francisco Airport, CA, August 2016 (2016)
115. Zhu, J., Qian, Z., Tsui, C.Y.: LRADNN: high‐throughput and energy‐efficient deep neural network accelerator using low rank approximation. In: Proceedings of the Asia and South Pacific Design Automation Conference (ASP‐DAC), Macao, China, 25‐28 January 2016 (2016)
116. Zhao, R., et al.: Accelerating binarised convolutional neural networks with software‐programmable FPGAs. In: FPGA 2017 ‐ Proceedings of the 2017 ACM/SIGDA International Symposium on Field‐Programmable Gate Arrays, Monterey, CA, February 2017 (2017)
117. Li, Z., et al.: E‐RNN: design optimization for efficient recurrent neural networks in FPGAs. In: Proceedings ‐ 25th IEEE International Symposium on High Performance Computer Architecture (HPCA), Washington, DC, 16‐20 February 2019 (2019)
118. Lin, S., et al.: FFT‐based deep learning deployment in embedded systems. In: Proceedings of the 2018 Design, Automation and Test in Europe Conference and Exhibition (DATE 2018), Dresden, 19‐23 March 2018 (2018)
119. Zhang, C., Prasanna, V.: Frequency domain acceleration of convolutional neural networks on CPU‐FPGA shared memory system. In: FPGA 2017 ‐ Proceedings of the 2017 ACM/SIGDA International Symposium on Field‐Programmable Gate Arrays, Monterey, CA, February 2017 (2017)
120. Lu, L. et al.: Evaluating fast algorithms for convolutional neural networks on FPGAs. In: Proceedings ‐ IEEE 25th Annual International Symposium on Field‐Programmable Custom Computing Machines (FCCM 2017), Napa, CA, 30 April‐2 May 2017 (2017)
121. Lin, J., Yao, Y.: A fast algorithm for convolutional neural networks using tile‐based fast Fourier transforms. Neural Process. Lett. (2019)
122. Di Cecco, R. et al.: FPGA framework for convolutional neural networks. In: Proceedings of the 2016 International Conference on Field‐Programmable Technology (FPT 2016) (2017)
123. Huang, Y. et al.: A high‐efficiency FPGA‐based accelerator for convolutional neural networks using Winograd algorithm. In: Journal of Physics: Conference Series, Avid College, Maldives, 6–8 March 2018 (2018)
124. Kim, J.H. et al.: FPGA‐based CNN inference accelerator synthesised from multi‐threaded C software. In: International System on Chip Conference (2017)
125. Li, Q. et al.: Implementing neural machine translation with bi‐directional GRU and attention mechanism on FPGAs using HLS. In: Proceedings of the Asia and South Pacific Design Automation Conference (ASP‐DAC), Tokyo, Japan, January 2019 (2019)
126. Ma, Y. et al.: An automatic RTL compiler for high‐throughput FPGA implementation of diverse deep convolutional neural networks. In: 27th International Conference on Field Programmable Logic and Applications (FPL 2017) (2017)
127. Xu, J. et al.: CaFPGA: an automatic generation model for CNN accelerator. Microprocess. Microsyst. 60, 196–206 (2018)
128. Zeng, S., et al.: An efficient reconfigurable framework for general purpose CNN‐RNN models on FPGAs. In: International Conference on Digital Signal Processing (DSP), Shanghai, China, 19‐21 November 2018 (2019)
129. Ma, Y. et al.: Scalable and modularised RTL compilation of convolutional neural networks onto FPGA. In: FPL 2016 ‐ 26th International Conference on Field‐Programmable Logic and Applications, Lausanne, Switzerland, 29 August‐2 September 2016 (2016)
130. Guan, Y., et al.: FP‐DNN: an automated framework for mapping deep neural networks onto FPGAs with RTL‐HLS hybrid templates. In: Proceedings ‐ IEEE 25th Annual International Symposium on Field‐Programmable Custom Computing Machines (FCCM 2017), Napa, CA, 30 April‐2 May 2017 (2017)
131. Alrawashdeh, K., Purdy, C.: Reducing calculation requirements in FPGA implementation of deep learning algorithms for online anomaly intrusion detection. In: Proceedings of the IEEE National Aerospace Electronics Conference (NAECON), Dayton, OH, 27‐30 June 2017 (2018)

How to cite this article: Dhouibi M, Ben Salem AK, Saidi A, Ben Saoud S. Accelerating Deep Neural Networks implementation: A survey. IET Comput. Digit. Tech. 2021;15:79–96. https://doi.org/10.1049/cdt2.12016