Abstract—Convolutional Neural Networks (CNNs) are a very popular class of artificial neural networks. Current CNN models [...] The design is verified on the hardware using a Nexys 4 DDR FPGA evaluation board.
In the convolutional layers, the input image is converted into downsampled feature maps. These features are used for the classification in the fully connected layers. Fully connected layers are feed-forward neural networks consisting of one or more hidden layers. After the hidden layers, there is a final output layer showing the class scores of each object to be classified. The general block diagram of a typical CNN is shown in Figure 1.

Figure 1: A typical block diagram of a CNN with one input layer, one convolutional layer, one pooling layer, two hidden layers, and one output layer
In recent years, FPGA-based CNN accelerators have become a promising research area. Custom parallel processing capabilities and higher performance-per-watt values make FPGAs attractive for CNN implementations. Different CNN architectures have been implemented on FPGA platforms in the literature [3]. In order to decrease the computational complexity and memory requirements, binarized neural networks are used in some studies [11], [12]. They reduce execution times using bitwise operations; however, their accuracy is generally lower than that of fixed-point models [13]. Some FPGA implementations focus on the optimization of the convolution engine [14], [15]. These engines perform the convolution operation in a pipelined manner. There are also works using Zynq-series FPGAs, in which the embedded processor and the programmable logic process the data together in the accelerator [16], [17]. LeNet, AlexNet, and VGGNet are the most popular CNNs used in FPGA implementations. However, their power consumption is generally compared against processor, GPU, or PC implementations, which is not a fair comparison [16], [18], [19]. Since FPGAs are inherently energy-efficient devices, a fair comparison should be made between FPGA implementations. In this work, the Artix-7 FPGA family is selected intentionally because Artix-7 devices are the most cost-effective and energy-efficient among the Xilinx FPGA series [20]. Moreover, keeping the resource usage as low as possible without degrading the performance helps fit the whole CNN architecture into a very small package FPGA (i.e., 1 cm x 1 cm) while consuming only 628 mW. This not only helps in developing compact designs but also makes the CNN accelerator cost-effective.
III. ACCELERATOR DESIGN
In this section, the proposed CNN accelerator is explained in detail. In this work, a LeNet CNN architecture has been developed, implemented, and verified on the FPGA platform. The developed LeNet CNN structure is given in Figure 2. The CNN input is a 32 x 32 grayscale image and the output is the classification result. The network is first developed in Python using TensorFlow. It uses fixed-point data types in all stages, and the bit widths are optimized based on the accuracy drop relative to the floating-point model. It is seen that, for less than a 0.1 percentage-point drop in accuracy compared to the floating-point model, using 8-bit weights, 16-bit activations, and 32-bit biases is sufficient for the hardware design. The accuracies of the fixed-point and floating-point designs are both greater than 98%. Moreover, the number of layers and the number of feature maps are heuristically optimized to improve the accuracy. As a result, the optimized CNN consists of 2 convolutional layers, 2 max-pooling layers, a hidden fully connected layer, and an output layer. The two convolutional layers use 3 and 12 feature maps, respectively. Except for the last layer, the ReLU activation function is used in the convolutional and hidden layers, and max-pooling is used in the pooling layers. In the convolutional layers, a 5 x 5 convolution kernel is selected for its better performance compared to smaller kernels. In the pooling layers, 2 x 2 kernels are used and the downsampling factor is two. After the convolutional layers, the data is flattened and fed to the fully connected layers. The hidden layer uses 48 neural network nodes, and the output layer has 10 nodes corresponding to the ten digits to be classified. The CNN is trained and tested using the MNIST dataset, a handwritten digit dataset commonly used for various image processing systems [21].
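For illustration, these bit widths could be expressed in Vitis HLS with arbitrary-precision fixed-point types roughly as in the sketch below; the type names and integer/fraction splits are assumptions, since only the total widths (8/16/32 bits) are specified here.

#include "ap_fixed.h"

// Illustrative quantized types: the total widths follow the text
// (8-bit weights, 16-bit activations, 32-bit biases); the number
// of integer bits is assumed for the sketch.
typedef ap_fixed<8, 2>   weight_t;
typedef ap_fixed<16, 6>  act_t;
typedef ap_fixed<32, 12> bias_t;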
Figure 2: Proposed CNN for FPGA implementation with two convolutional layers, two pooling layers, one fully connected layer, and an output layer

After optimizing the fixed-point model in Python, the CNN accelerator is developed on the FPGA platform using this fixed-point model. The convolutional, pooling, and fully connected layers are coded according to the model. Each blue box in Figure 2 is designed separately in Vitis HLS. The Vitis HLS tool transforms C, C++, or SystemC code into a register transfer level (RTL) implementation for use in Xilinx FPGAs. Using pragmas in the software code, different parallelization levels and hence different hardware can be generated. The Vitis HLS methodology allows designers to develop and verify designs faster than with traditional hardware description languages.
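As a minimal example of this pragma-driven flow (an illustrative loop, not code from this design), the same C loop can be turned into different hardware by changing a single directive:

// With PIPELINE, one loop iteration is issued per clock cycle;
// replacing the directive with "#pragma HLS UNROLL factor=4"
// would instead replicate the multiplier four times.
void scale(const short in[1024], short out[1024], short k) {
    for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
        out[i] = in[i] * k;
    }
}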
In the CNN accelerator design, the key processing operation is the convolution; it dominates the total processing time and therefore needs to be carefully designed in hardware. As seen in Figure 3, the two-dimensional convolution is calculated for each pixel and generates one output pixel of the feature map.

Figure 3: Two-dimensional convolution of the input image (left-hand side) to the output image (right-hand side)

From Figure 3, it can be seen that each pixel of the output feature map requires 25 multiplications and 25 additions. In order to increase the throughput, the convolution for each pixel is done in one clock cycle. As a result, 25 DSP slices are used in the convolution operation. DSP slices are the basic elements of the FPGA for arithmetic operations; basic DSP operations such as accumulators, multipliers, and adders can be implemented using these slices. Since the number of DSP slices is limited, they are used only for multiplication operations; other mathematical operations are done in the programmable logic of the FPGA device.
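A minimal sketch of such a one-pixel-per-cycle engine is shown below, reusing the illustrative fixed-point types defined earlier; the function name and interface are assumptions, but the fully unrolled 5 x 5 loop nest is what maps the 25 multiplications onto 25 DSP slices.

// One output pixel per call; with the function pipelined at II=1,
// HLS fully unrolls the loop nest, so the 25 multiplications run
// in parallel on 25 DSP slices.
act_t conv5x5(act_t window[5][5], weight_t kernel[5][5], bias_t bias) {
#pragma HLS ARRAY_PARTITION variable=window complete dim=0
#pragma HLS ARRAY_PARTITION variable=kernel complete dim=0
#pragma HLS PIPELINE II=1
    bias_t acc = bias;
    for (int r = 0; r < 5; r++) {
        for (int c = 0; c < 5; c++) {
#pragma HLS UNROLL
            acc += window[r][c] * kernel[r][c];  // 25 parallel MACs
        }
    }
    return (act_t)acc;
}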
In the second convolutional layer, since 12 feature maps exist, one convolution engine is used for every four feature maps, giving three engines and a total of 75 DSP slices in that layer. Finally, in the hidden layer and the output layer, the fully connected layers are parallelized in order to decrease the processing time: a parallel path is created for each feature map at the output of the second pooling layer. In other words, the hidden layer multiplications are performed using 12 DSP slices. The resource usage of each layer is shown in Table 1. In addition to these parallelizations, for each layer the internal memories storing the weights and biases are concatenated so that the data can be accessed in one clock cycle, avoiding memory bottlenecks.
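The sketch below illustrates this memory-concatenation idea with the ARRAY_RESHAPE directive (the array name and shape are assumptions): several narrow weight words are packed into one wide BRAM word, so a full kernel row is fetched in a single clock cycle.

// Reshaping the innermost dimension packs each 5-weight kernel
// row into one wide memory word, read in a single cycle.
void conv_layer1(/* interface omitted in this sketch */) {
    static weight_t weights[3][5][5];
#pragma HLS ARRAY_RESHAPE variable=weights complete dim=3
    // ... per-pixel convolution using conv5x5() ...
}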
TABLE I: RESOURCE USAGE OF THE DIFFERENT LAYERS

               BRAM   DSP    LUT     FF
Conv Layer 1      6    25   4670   4690
Conv Layer 2     14    75  11436  12057
Hidden Layer      7    12    371    332
Output Layer      2     8    297    209
In the accelerator design, since optimized bit widths are used for the weights and biases, these coefficients fit into the internal memories of the FPGA. Using the internal memory of the FPGA achieves high memory bandwidth and decreases the number of clock cycles required to finish the CNN operations. Every layer is optimized in terms of processing time as much as possible based on the available resources of the FPGA device. The number of clock cycles for processing each layer is shown in Table 2. As shown in Table 2, the total processing time for one CNN inference is 70 us; in other words, the proposed CNN accelerator can process 14K images/sec.
TABLE II: PROCESSING TIME OF THE DIFFERENT LAYERS

                                  Clock Cycles   Processing Time
Conv Layer 1 + Max Pool Layer 1       3144          25.15 us
Conv Layer 2 + Max Pool Layer 2       3599          28.8 us
Hidden Layer                          1878          15.02 us
Output Layer                           160           1.28 us
Total                                 8781          70.2 us
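As a quick consistency check on Table 2: at the 125 MHz clock reported in the evaluation, one cycle is 8 ns, so 8781 cycles take 8781 x 8 ns ≈ 70.2 us, and the resulting throughput is 1 / 70.2 us ≈ 14,240 images/sec, matching the reported 14K images/sec.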
After designing each layer, the CNN accelerator is created using Vivado by cascading these layers. The final design is placed and routed without any placement, routing, or timing errors.
IV. EVALUATION

The design is implemented and tested on a Digilent Nexys 4 DDR FPGA board [22]. The board is equipped with a Xilinx Artix-7 XC7A100T FPGA. The overall design runs at 125 MHz. The overall resource usage of the whole design is given in Table 3. Since the design has very low resource usage, it can fit into the smallest-package FPGAs of the Xilinx 7 series, such as the XC7A50T (i.e., 1 cm x 1 cm in dimension) [23]. Moreover, the power consumption of the whole design is 628 mW. This figure is taken from Vivado's power report of the implemented design and consists of 94 mW static and 534 mW dynamic power. This is nearly 67% lower than other LeNet CNN architectures, which consume around 1800 mW [16], [17].

TABLE III: RESOURCE USAGE OF THE PROPOSED CNN ACCELERATOR

                                        BRAM   DSP    LUT      FF
Used Resources                            29   120  15951   17664
Resources in Nexys 4 DDR Board           135   240  63400  126800
Utilization in Nexys 4 DDR Board (%)   21.48    50  25.16   13.93

In the experimental setup, the images are loaded using the serial interface of the board and the result is shown on the LEDs of the board. Meanwhile, the output of each layer in the FPGA CNN accelerator is verified bitwise by matching it against the outputs of the Python design using the Vivado hardware manager. In other words, the Python and FPGA designs give exactly the same result at each stage of the CNN. Moreover, for a fair comparison, the proposed accelerator is compared with other LeNet CNN implementations in the literature having the same number of convolutional and fully connected layers [17], [24], [25]. The design of [24] uses a Zynq UltraScale FPGA, and HLS is used in the development stage. In the design of [25], a ZCU102 board with a Xilinx ZU9EG FPGA chip is used, and different accelerators are used for processing the CNN layers. Lastly, the study of [17] uses a Digilent Arty Z7-20 development board based on the Xilinx Zynq-7000 System on Chip (SoC). That design proposes a HW/SW co-processing accelerator: it uses the programmable logic as an accelerator, and the system is managed by the ARM processor. A performance comparison with these studies is given in Table 4. As clearly seen from Table 4, the proposed accelerator has lower usage of DSPs and BRAMs, the most critical FPGA resources, and a much lower processing time than the other implementations. Efficient use of pragmas such as pipelining, loop unrolling, and memory reshaping in the proposed design achieves much higher throughput than the other implementations. Besides, using the internal memories of the FPGA instead of external memory decreases the processing time even further.

TABLE IV: COMPARISON OF DIFFERENT LENET CNN ACCELERATORS

                BRAM   DSP    LUT   Processing Time
González [17]     44   153   4738        2268 us
Cho [24]          95   143  32689        3500 us
Shi [25]          54   204  25276         170 us
This work         29   120  15951          70 us
V. CONCLUSION

In this work, an FPGA-based accelerator for CNN architectures, in particular the LeNet architecture, is implemented. The fixed-point design uses 8 bits for weights, 16 bits for activations, and 32 bits for biases. The accuracy is higher than 98%, and the difference between the fixed-point and floating-point designs is less than 0.1 percentage points. Vitis HLS is used for designing the layers, and the whole CNN accelerator is finalized in Vivado. The FPGA design is tested and verified on a Nexys 4 DDR evaluation board. The accelerator runs at 125 MHz, and its overall throughput is 14K images/sec while consuming only 628 mW. Therefore, the proposed solution is about 7x better than current LeNet FPGA implementations in performance per watt, and it can be used effectively in real-time embedded CNN applications.

ACKNOWLEDGMENT

This research was supported by The Scientific and Technological Research Council of Turkey (TUBITAK).
REFERENCES

[1] M.-J. Lee and Y.-G. Ha, "Autonomous driving control using end-to-end deep learning," in 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), pp. 470–473, IEEE, 2020.
[2] H. Sumida, F. Ren, S. Nishide, and X. Kang, "Environment recognition using robot camera," in 2020 5th IEEE International Conference on Big Data Analytics (ICBDA), pp. 282–286, 2020.
[3] A. Shawahna, S. M. Sait, and A. El-Maleh, "FPGA-based accelerators of deep learning networks for learning and classification: A review," IEEE Access, vol. 7, pp. 7823–7859, 2019.
[4] "Comparing hardware for artificial intelligence: FPGAs vs. GPUs vs. ASICs." http://lreese.dotsenkoweb.com/2019/03/30/comparing-hardware-for-artificial-intelligence-fpgas-vs-gpus-vs-asics/#. Accessed: 2021-03-25.
[5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[6] "Introduction to TensorFlow." https://www.tensorflow.org/learn. Accessed: 2021-03-10.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
[8] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[9] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[10] Y. Wang, Y. Li, Y. Song, and X. Rong, "The influence of the activation function in a convolution neural network model of facial expression recognition," Applied Sciences, vol. 10, no. 5, p. 1897, 2020.
[11] P. Wang, J. Song, Y. Peng, and G. Liu, "Binarized neural network based on FPGA to realize handwritten digit recognition," in 2020 IEEE International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), vol. 1, pp. 1204–1207, 2020.
[12] J. H. Kim, J. Lee, and J. H. Anderson, "FPGA architecture enhancements for efficient BNN implementation," in 2018 International Conference on Field-Programmable Technology (FPT), pp. 214–221, 2018.
[13] T. Simons and D.-J. Lee, "A review of binarized neural networks," Electronics, vol. 8, no. 6, p. 661, 2019.
[14] Z. Liu, Y. Dou, J. Jiang, J. Xu, S. Li, Y. Zhou, and Y. Xu, "Throughput-optimized FPGA accelerator for deep convolutional neural networks," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 10, no. 3, pp. 1–23, 2017.
[15] D. Wu, Y. Zhang, X. Jia, L. Tian, T. Li, L. Sui, D. Xie, and Y. Shan, "A high-performance CNN processor based on FPGA for MobileNets," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL), pp. 136–143, 2019.
[16] D. Rongshi and T. Yongming, "Accelerator implementation of LeNet-5 convolution neural network based on FPGA with HLS," in 2019 3rd International Conference on Circuits, System and Simulation (ICCSS), pp. 64–67, IEEE, 2019.
[17] E. González, W. D. Villamizar Luna, and C. A. Fajardo Ariza, "A hardware accelerator for the inference of a convolutional neural network," Ciencia e Ingeniería Neogranadina, vol. 30, no. 1, pp. 107–116, 2020.
[18] S. Li, Y. Luo, K. Sun, N. Yadav, and K. K. Choi, "A novel FPGA accelerator design for real-time and ultra-low power deep convolutional neural networks compared with Titan X GPU," IEEE Access, vol. 8, pp. 105455–105471, 2020.
[19] Y. Zhou and J. Jiang, "An FPGA-based accelerator implementation for deep convolutional neural networks," in 2015 4th International Conference on Computer Science and Network Technology (ICCSNT), vol. 1, pp. 829–832, IEEE, 2015.
[20] E. Mohsen, "Reducing system power and cost with Artix-7 FPGAs," Xilinx white paper, pp. 1–12, 2012.
[21] L. Deng, "The MNIST database of handwritten digit images for machine learning research [best of the web]," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
[22] "Nexys 4 DDR reference manual." https://reference.digilentinc.com/reference/programmable-logic/nexys-4-ddr/reference-manual. Accessed: 2021-03-10.
[23] Xilinx, "7 series FPGAs packaging and pinout," Product Specification, 2011.
[24] M. Cho and Y. Kim, "Implementation of data-optimized FPGA-based accelerator for convolutional neural network," in 2020 International Conference on Electronics, Information, and Communication (ICEIC), pp. 1–2, IEEE, 2020.
[25] Y. Shi, T. Gan, and S. Jiang, "Design of parallel acceleration method of convolutional neural network based on FPGA," in 2020 IEEE 5th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA), pp. 133–137, IEEE, 2020.