FPGA Convolution Network Acceleration
ISSN (2210-142X)
Int. J. Com. Dig. Sys. 11, No.1 (Jan-2022)
https://dx.doi.org/10.12785/ijcds/110136
Received 22 Sep. 2020, Revised 24 Dec. 2021, Accepted 4 Jan. 2022, Published 20 Jan. 2022
Abstract: Convolutional neural networks are now widely used in computer vision and deep learning applications. The most compute-intensive layer in a convolutional neural network is the convolutional layer, which should therefore be accelerated in hardware. This paper aims to develop an efficient hardware-software co-design framework for machine learning applications on the PYNQ-Z2 board. To achieve this goal, we develop hardware implementations of a convolutional IP core and use them as Python overlays. Experiments show that the hardware implementations of the convolutional IP core outperform their software implementations by factors of up to 9 times. Furthermore, we use the designed convolutional IP core as a hardware accelerator in a handwritten digit recognition application with the MNIST dataset. Thanks to the use of the hardware accelerator for the convolutional layers, the execution performance of the convolutional neural network is improved by a factor of 6.2 times.
• The designed 2D convolutional IP core is then used as a hardware accelerator to accelerate the inference of a convolutional neural network for the handwritten digit recognition application with the MNIST dataset on the PYNQ-Z2 FPGA board.

The remainder of this paper is organized as follows. Section 2 presents the background of the paper. Section 3 presents the architectural design and hardware implementation of the proposed 2D convolution IP core targeting the Xilinx PYNQ FPGA, followed by the evaluation of the designed IP core on the Xilinx PYNQ-Z2 device. The application of the designed IP core in a convolutional neural network for handwritten digit recognition is presented in detail in Section 4. In Section 5, we summarize our work and sketch out future research directions.

2. Background
A. Convolutional Neural Networks
The convolutional neural network (CNN) is a commonly used deep learning model for image processing and computer vision. By combining feature extraction and classification, a CNN can offer very high recognition accuracy. A typical CNN architecture, the LeNet network adopted from [11], is shown in Figure 1. A CNN consists of three main types of layers: convolutional layers, subsampling layers and fully-connected layers.

The convolutional layer performs the two-dimensional (2D) convolution between the input data and the kernel; an activation function is then applied to the convolved result to produce a feature map. The kernel size is normally 3x3 or 5x5 elements. A ReLU (rectified linear unit) activation function is often used in CNNs. Each convolutional layer of a CNN typically uses many kernels to produce many feature maps, so as to extract different types of features from the input data.

The subsampling layer reduces the spatial size of the feature maps produced by the preceding convolutional layer. It helps extract dominant features that are rotation- and position-invariant, thereby maintaining the effectiveness of the training process of the model, and it also reduces the computational complexity of the network. There are two types of subsampling operations, max-pooling and average-pooling, of which max-pooling is usually preferred as it performs better. The most commonly used subsampling operation is 2x2 max-pooling.

A convolutional neural network concatenates multiple pairs of convolutional and subsampling layers. For example, the network in Figure 1 has two convolutional layers and two subsampling layers to perform the feature extraction for the input data.

Once the feature extraction is done, the output of the last subsampling layer is flattened into a single vector of values and fed into the fully-connected layers. The fully-connected layer performs the classification task to produce a label indicating the correct category of the input image. In the example in Figure 1, three fully-connected layers are used.

Among all the layers of the network, the most compute-intensive is the convolutional layer. In this work, the convolutional layer will be implemented on the programmable logic of the FPGA so as to accelerate the inference performance of the whole convolutional neural network. The 2D convolution operation is described in the next subsection.

B. The 2D convolution operation
The convolution operation is, by far, the most commonly used and most compute-intensive operation in both image processing [12], [13] and artificial intelligence applications such as convolutional neural networks [6], [11]. Given an M×N input image I and an S×S kernel W, the 2D convolution output image F of size M×N is computed by Equation (1), as follows:

F(m, n) = \sum_{i=0}^{S-1} \sum_{j=0}^{S-1} W[i, j] \cdot I[m - i, n - j]    (1)

Figure 2 shows an illustrated view of the 2D convolution computation, in which the image size is 5x5 pixels and the kernel size is 3x3 elements. To compute the convolution for each pixel, a sliding window of size S×S is used to extract the neighboring pixels required for the convolution of the pixel at hand. In general, a 2D convolution with an S×S kernel requires S×S multiply-accumulate (MAC) operations for each sample; the number of MAC operations for the whole image is therefore M×N×S×S.

C. The PYNQ-Z2 FPGA
The PYNQ-Z2 board is a Xilinx ZYNQ SoC device based on a dual-core ARM Cortex-A9 processor integrated with an FPGA fabric [14]. The functional block diagram of the Xilinx ZYNQ SoC is shown in Figure 3. The dual-core ARM Cortex-A9 processor is referred to as the Processing System (PS), and the FPGA fabric is referred to as the Programmable Logic (PL). The PS subsystem includes a number of dedicated peripherals (including memory controllers and other peripheral interfaces) and can be extended with additional customized hardware IP cores in the PL overlay.

Overlays, or hardware libraries, are programmable FPGA designs that extend a user application from the PS subsystem of the ZYNQ device into the PL subsystem. Overlays can be used to accelerate a software application, or to customize the hardware platform for a particular application. In addition, the most advantageous feature of the PYNQ-Z2 board is that it provides a Python interface that allows overlays in the PL to be controlled from Python programs running in the PS, making FPGAs easier to use.
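The 2D convolution of Equation (1) can be sketched as a short software model. The sketch below is illustrative only (it is not the hardware IP core), and the zero-padding of pixels outside the image border is our assumption, since the paper does not specify border handling:

```python
def conv2d(image, kernel):
    """Naive 2D convolution following Equation (1):
    F(m, n) = sum_{i,j} W[i, j] * I[m - i, n - j].
    Pixels outside the image are treated as zero (assumed border handling).
    """
    M, N = len(image), len(image[0])
    S = len(kernel)
    out = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for i in range(S):
                for j in range(S):
                    y, x = m - i, n - j
                    if 0 <= y < M and 0 <= x < N:
                        acc += kernel[i][j] * image[y][x]  # one MAC operation
            out[m][n] = acc
    return out

# A 5x5 all-ones image with a 3x3 all-ones kernel, as in Figure 2's setup:
image = [[1.0] * 5 for _ in range(5)]
kernel = [[1.0] * 3 for _ in range(3)]
result = conv2d(image, kernel)
print(result[2][2])  # 9.0: the full 3x3 window lies inside the image
```

Note that the inner double loop performs at most S×S MACs per output sample, which is exactly the M×N×S×S total operation count stated above.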
https://journals.uob.edu.bh
Int. J. Com. Dig. Sys. 11, No.1, 441-449 (Jan-2022) 443
444 Thang Viet Huynh: FPGA-based Acceleration for Convolutional Neural Networks on PYNQ-Z2
D. Packaging the convolution IP core as a Python overlay on the PYNQ-Z2 board
Once the 2D convolution core has successfully been synthesized and verified, we export the design as a user IP core using the Xilinx Vivado software tool [22] (the free WebPACK edition). To simplify the software control, we employ an Advanced eXtensible Interface (AXI) Lite interface to carry out the data communication between the IP core and the ZYNQ-7 host processing system.

Figure 8 shows the block design view of the whole system, in which the 2D convolution core (conv2D_0) is connected with the ZYNQ-7 processing system via the AXI interconnect and is under the common reset control of the processor system reset block.

We then run the bitstream generation and export the system to a Python overlay that can be loaded and executed on the PYNQ-Z2 development board. The exported overlay consists of two main parts: the bitstream file (.bit) that contains the hardware design, and the project block diagram Tcl file (.tcl). The Tcl file is used by PYNQ to automatically identify the ZYNQ system configuration, IPs (including versions), interrupts, resets, and other control signals [14]. As we investigate three hardware implementations of the convolution IP core, we generate three different Python overlays corresponding to the three hardware implementations with input image sizes of 32x32 pixels, 64x64 pixels and 128x128 pixels. The kernel size for all three implementations is fixed at 3x3.

E. Evaluation of the 2D convolution IP core
In this subsection, we present the performance evaluation of the designed IP core. We evaluate both the theoretical peak performance and the practical sustained performance. The peak performance is determined via simulations under the assumption that the data transfers between the IP core and external memory cause no delay. The sustained performance, on the other hand, provides a more realistic figure of merit for the whole system, since it takes the data transfers between the IP core and memory into account.

TABLE I. Synthesis result of the 2D convolutional IP core

TABLE II. Peak performance of the 2D convolution IP core

Table II reports the peak performance of the designed IP core. Since all the computing modules are fully pipelined, the IP core is expected to deliver a computed result at every clock cycle. The execution times for the three implementations are 1034, 4106 and 16394 clock cycles, respectively, including the same overhead latency of 10 clock cycles each. We configure a working clock frequency of 100 MHz for the IP core; the corresponding execution times measured in µs are then reported. At a clock frequency of 100 MHz, the maximal frame rates of the three implementations are 96759, 24358 and 6100 frames per second for the image sizes of 32x32, 64x64 and 128x128, respectively.

TABLE III. Performance comparison of the hardware and software implementations of the 2D convolution IP core on the PYNQ-Z2

Image size               32x32    64x64    128x128
HW execution time (s)    0.033    0.124    0.487
SW execution time (s)    0.260    1.061    4.364
Speed-up (times)         7.8      8.6      9.0

Table III presents the performance comparison between the hardware implementations of the 2D convolutional IP core and their pure software implementations in Python running on the same PYNQ-Z2 board. Figure 9 illustrates the performance speed-ups of the hardware implementations over the software ones. The sustained performance of the hardware implementations is lower than the corresponding peak performance; the degradation is due to the data transfer between the IP core and the external memory. Nevertheless, the hardware implementations outperform their software counterparts by factors of 7.8, 8.6 and 9.0 times, respectively.

4. Convolutional Neural Network Application for Handwritten Digit Recognition on PYNQ-Z2
We make use of the designed convolution IP core in a practical application: handwritten digit recognition with the MNIST dataset [11], [23]. In this application, we train a convolutional neural network to carry out the classification problem on the PYNQ-Z2 SoC device. A hardware-software co-design approach is exploited in this work. Specifically, the forward inference of the trained convolutional neural network model is executed on the
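The peak frame rates reported for the evaluation above follow directly from the cycle counts: each frame takes roughly M·N cycles plus the fixed 10-cycle latency, and the peak rate is the clock frequency divided by that total. A back-of-the-envelope sketch (our reconstruction, not the paper's measurement code):

```python
CLOCK_HZ = 100_000_000  # 100 MHz working clock configured for the IP core
OVERHEAD = 10           # fixed overhead latency in clock cycles

def peak_frame_rate(width, height, clock_hz=CLOCK_HZ, overhead=OVERHEAD):
    """Peak frames/s of a fully pipelined core emitting one output
    sample per clock cycle: cycles = width*height + overhead."""
    cycles = width * height + overhead
    return clock_hz / cycles

for size in (32, 64, 128):
    # cycle counts: 1034, 4106 and 16394, matching the reported figures
    print(size, size * size + OVERHEAD, round(peak_frame_rate(size, size)))
```

The computed rates (about 96712, 24355 and 6100 frames per second) agree closely with the reported 96759, 24358 and 6100; the small differences for the first two presumably come from rounding in the reported values.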
Figure 8. Block design view of the ZYNQ7 system with 2D convolution IP core via AXI
TABLE IV. Convolutional neural network architecture for handwritten digit recognition with the MNIST dataset

No  Layer            Feature maps  Size   Kernel size  Stride  Activation
0   Input image      1             32x32  -            -       -
1   Convolution      16            30x30  3x3          1       ReLU
2   MaxPooling       16            15x15  2x2          2       -
3   Convolution      36            13x13  3x3          1       ReLU
4   MaxPooling       36            7x7    2x2          2       -
5   Flattening       -             1764   -            -       -
6   Fully-connected  -             120    -            -       ReLU
7   Fully-connected  -             84     -            -       ReLU
8   Fully-connected  -             10     -            -       Softmax
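The feature-map sizes in Table IV can be cross-checked in a few lines. The 3x3 convolutions are unpadded (32 → 30, 15 → 13), and the 13x13 → 7x7 step implies that the 2x2 max-pooling rounds up; that ceil-mode behaviour is our inference from the table, not something the paper states explicitly:

```python
import math

def conv_out(size, kernel=3, stride=1):
    # valid (unpadded) convolution: 32 -> 30, 15 -> 13
    return (size - kernel) // stride + 1

def pool_out(size, window=2, stride=2):
    # 2x2 max-pooling; ceil mode inferred from the 13 -> 7 step in Table IV
    return math.ceil((size - window) / stride) + 1

size = 32
size = conv_out(size)    # layer 1: 30
size = pool_out(size)    # layer 2: 15
size = conv_out(size)    # layer 3: 13
size = pool_out(size)    # layer 4: 7
flat = 36 * size * size  # layer 5: flatten 36 feature maps of 7x7
print(size, flat)        # 7 1764, matching Table IV
```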
TABLE V. Execution time [seconds] of convolutional neural network implementations with the MNIST dataset on the PYNQ-Z2 FPGA
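The abstract reports an overall inference speed-up of 6.2 times when only the convolutional layers are accelerated, by up to 9 times. Amdahl's law lets us estimate what fraction of the software inference time those layers must account for. Taking 9x as representative of the per-layer speed-up is our simplifying assumption:

```python
def conv_fraction(overall_speedup, layer_speedup):
    """Invert Amdahl's law, overall = 1 / ((1 - p) + p / s),
    for p, the fraction of run time spent in the accelerated part."""
    return (1 - 1 / overall_speedup) / (1 - 1 / layer_speedup)

p = conv_fraction(6.2, 9.0)
print(f"{p:.1%}")  # roughly 94% of inference time in the convolutional layers
```

This back-of-the-envelope figure is consistent with the convolutional layer being the dominant cost of the network, as argued in Section 2.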
[3] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, "Speech recognition using deep neural networks: A systematic review," IEEE Access, vol. 7, pp. 19143–19165, 2019.

[4] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al., "End to end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.

[5] D. Strigl, K. Kofler, and S. Podlipnig, "Performance and scalability of GPU-based convolutional neural networks," in 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing. IEEE, 2010, pp. 317–324.

[6] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2016.

[7] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015, pp. 161–170.

[8] M. Sit, R. Kazami, and H. Amano, "FPGA-based accelerator for losslessly quantized convolutional neural networks," in 2017 International Conference on Field Programmable Technology (ICFPT). IEEE, 2017, pp. 295–298.

[9] "PYNQ Homepage," accessed: 2020-10-18. [Online]. Available: pynq.io/home.html

[10] TUL Technology Unlimited, "TUL PYNQ-Z2 board," accessed: 2020-10-18. [Online]. Available: tul.com.tw/ProductsPYNQ-Z2.html

[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[12] B. Cope et al., "Implementation of 2D convolution on FPGA, GPU and CPU," Imperial College Report, pp. 2–5, 2006.

[13] S. Perri, M. Lanuzza, P. Corsonello, and G. Cocorullo, "A high-performance fully reconfigurable FPGA-based 2D convolution processor," Microprocessors and Microsystems, vol. 29, no. 8-9, pp. 381–391, 2005.

[14] "PYNQ Overlay Tutorials," accessed: 2020-10-18. [Online]. Available: pynq.readthedocs.io/en/v2.5.1/pynq_overlays.html

[15] J. N. Coleman, E. Chester, C. I. Softley, and J. Kadlec, "Arithmetic on the European logarithmic microprocessor," IEEE Transactions on Computers, vol. 49, no. 7, pp. 702–715, 2000.

[16] F. Albu, J. Kadlec, N. Coleman, and A. Fagan, "Pipelined implementations of the a priori error-feedback LSL algorithm using logarithmic arithmetic," in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3. IEEE, 2002, pp. III–2681.

[17] T. V. Huynh, "Design space exploration for a single-FPGA handwritten digit recognition system," in 2014 IEEE Fifth International Conference on Communications and Electronics (ICCE). IEEE, 2014, pp. 291–296.

[18] T. V. Huynh, "Evaluation of artificial neural network architectures for pattern recognition on FPGA," International Journal of Computing and Digital Systems, vol. 6, no. 03, pp. 133–138, 2017.

accessed: 2020-10-18. [Online]. Available: colab.research.google.com/notebooks/intro.ipynb

[25] A. Baldominos, Y. Saez, and P. Isasi, "A survey of handwritten character recognition with MNIST and EMNIST," Applied Sciences, vol. 9, no. 15, p. 3169, 2019.