FPGA Convolution Network Acceleration
ISSN (2210-142X)
Int. J. Com. Dig. Sys. 11, No.1 (Jan-2022)
https://dx.doi.org/10.12785/ijcds/110136
Received 22 Sep. 2020, Revised 24 Dec. 2021, Accepted 4 Jan. 2022, Published 20 Jan. 2022
Abstract: Convolutional neural networks are now widely used in computer vision and deep learning applications. The most compute-intensive layer in a convolutional neural network is the convolutional layer, which should therefore be accelerated in hardware. This paper aims to develop an efficient hardware-software co-design framework for machine learning applications on the PYNQ-Z2 board. To achieve this goal, we develop hardware implementations of a convolutional IP core and use them as Python overlays. Experiments show that the hardware implementations of the convolutional IP core outperform their software implementations by factors of up to 9 times. Furthermore, we use the designed convolutional IP core as a hardware accelerator in a handwritten digit recognition application with the MNIST dataset. Thanks to the use of the hardware accelerator for the convolutional layers, the execution performance of the convolutional neural network is improved by a factor of 6.2 times.
• The designed 2D convolutional IP core is then used as a hardware accelerator to accelerate the inference of a convolutional neural network for the handwritten digit recognition application with the MNIST dataset on the PYNQ-Z2 FPGA board.

The remainder of this paper is organized as follows. Section 2 presents the background of the paper. Section 3 presents the architectural design and hardware implementation of the proposed 2D convolution IP core targeting the Xilinx PYNQ FPGA, followed by the evaluation of the designed IP core on the Xilinx PYNQ-Z2 device. The application of the designed IP core in a convolutional neural network for handwritten digit recognition is presented in detail in Section 4. In Section 5, we summarize our work and sketch out future research directions.

2. Background
A. Convolutional Neural Networks
The convolutional neural network (CNN) is a commonly used deep learning model for image processing and computer vision. By combining feature extraction and classification, a CNN can offer very high recognition accuracy. A typical CNN architecture, the LeNet network adopted from [11], is shown in Figure 1. A CNN consists of three main types of layers: convolutional layers, subsampling layers and fully-connected layers.

The convolutional layer performs the two-dimensional (2D) convolution between the input data and the kernel; an activation function is then applied to the convolved result to produce a feature map. The kernel size is normally 3x3 or 5x5 elements. A ReLU (rectified linear unit) activation function is often used in CNNs. Each convolutional layer of a CNN typically uses many kernels to produce many feature maps, so as to extract different types of features from the input data.

The subsampling layer reduces the spatial size of the feature maps produced by the preceding convolutional layer. It helps extract dominant features that are rotation- and position-invariant, thereby maintaining the effectiveness of the training process of the model, and it also reduces the computational complexity of the network. There are two types of subsampling operations, max-pooling and average-pooling, of which max-pooling is usually preferred as it performs better. The most commonly used subsampling operation is 2x2 max-pooling.

A convolutional neural network concatenates multiple pairs of convolutional and subsampling layers. For example, the network in Figure 1 has two convolutional layers and two subsampling layers to perform the feature extraction for the input data.

Once the feature extraction is done, the output of the last subsampling layer is flattened into a single vector of values and fed into the fully-connected layers. The fully-connected layer performs the classification task to produce a label indicating the correct category of the input image. In the example in Figure 1, three fully-connected layers are used.

Among all the layers of the network, the most compute-intensive is the convolutional layer. In this work, the convolutional layer will be implemented on the programmable logic of the FPGA so as to accelerate the inference performance of the whole convolutional neural network. The 2D convolution operation is described in the next subsection.

B. The 2D convolution operation
The convolution operation is, by far, the most commonly used and most compute-intensive operation in both image processing [12], [13] and artificial intelligence applications such as convolutional neural networks [6], [11]. Given an M×N input image I and an S×S kernel W, the 2D convolution output image F of size M×N is computed by Equation (1), as follows:

F(m, n) = \sum_{i=0}^{S-1} \sum_{j=0}^{S-1} W[i, j] \cdot I[m - i, n - j]    (1)

Figure 2 shows an illustrated view of the 2D convolution computation, in which the image size is 5x5 pixels and the kernel size is 3x3 elements. To compute the convolution for each pixel, a sliding window of size S×S is used to extract the neighboring pixels required for the convolution of the pixel at hand. In general, a 2D convolution with an S×S kernel requires S×S multiply-accumulate (MAC) operations for each sample; the number of MAC operations for the whole image is therefore M×N×S×S.

C. The PYNQ-Z2 FPGA
The PYNQ-Z2 board is a Xilinx ZYNQ SoC device based on a dual-core ARM Cortex-A9 processor integrated with an FPGA fabric [14]. The functional block diagram of the Xilinx ZYNQ SoC is shown in Figure 3. The dual-core ARM Cortex-A9 processor is referred to as the Processing System (PS), and the FPGA fabric is referred to as the Programmable Logic (PL). The PS subsystem includes a number of dedicated peripherals (including memory controllers and other peripheral interfaces) and can be extended with additional customized hardware IP cores in the PL overlay.

Overlays, or hardware libraries, are programmable FPGA designs that extend a user application from the PS subsystem of the ZYNQ device into the PL subsystem. Overlays can be used to accelerate a software application, or to customize the hardware platform for a particular application. In addition, the most advantageous feature of the PYNQ-Z2 board is that it provides a Python interface that allows overlays in the PL to be controlled from Python programs running in the PS, making FPGAs easier to use.
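The 2D convolution of Equation (1) can be sketched as a short software model. The sketch below is illustrative only (it is not the hardware IP core), and the zero-padding of pixels outside the image border is our assumption, since the paper does not specify border handling:

```python
def conv2d(image, kernel):
    """Naive 2D convolution following Equation (1):
    F(m, n) = sum_{i,j} W[i, j] * I[m - i, n - j].
    Pixels outside the image are treated as zero (assumed border handling).
    """
    M, N = len(image), len(image[0])
    S = len(kernel)
    out = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for i in range(S):
                for j in range(S):
                    y, x = m - i, n - j
                    if 0 <= y < M and 0 <= x < N:
                        acc += kernel[i][j] * image[y][x]  # one MAC operation
            out[m][n] = acc
    return out

# A 5x5 all-ones image with a 3x3 all-ones kernel, as in Figure 2's setup:
image = [[1.0] * 5 for _ in range(5)]
kernel = [[1.0] * 3 for _ in range(3)]
result = conv2d(image, kernel)
print(result[2][2])  # 9.0: the full 3x3 window lies inside the image
```

Note that the inner double loop performs at most S×S MACs per output sample, which is exactly the M×N×S×S total operation count stated above.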
https://journals.uob.edu.bh
Int. J. Com. Dig. Sys. 11, No.1, 441-449 (Jan-2022) 443
444 Thang Viet Huynh: FPGA-based Acceleration for Convolutional Neural Networks on PYNQ-Z2
D. Packaging the convolution IP core as a Python overlay on the PYNQ-Z2 board
Once the 2D convolution core has successfully been synthesized and verified, we export the design as a user IP core using the Xilinx Vivado software tool [22] (the free WebPACK edition). To simplify the software control, we employ an Advanced eXtensible Interface (AXI) Lite interface to carry out the data communication between the IP core and the ZYNQ-7 host processing system.

Figure 8 shows the block design view of the whole system, in which the 2D convolution core (conv2D_0) is connected with the ZYNQ-7 processing system via the AXI interconnect and is under the common reset control of the processor system reset block.

We then run the bitstream generation and export the system to a Python overlay that can be loaded and executed on the PYNQ-Z2 development board. The exported overlay consists of two main parts: the bitstream file (.bit) that contains the hardware design, and the project block diagram Tcl file (.tcl). The Tcl file is used by PYNQ to automatically identify the ZYNQ system configuration, IPs (including versions), interrupts, resets, and other control signals [14]. As we investigate three hardware implementations of the convolution IP core, we generate three different Python overlays corresponding to the three hardware implementations with input image sizes of 32x32 pixels, 64x64 pixels and 128x128 pixels. The kernel size for all three implementations is fixed at 3x3.

E. Evaluation of the 2D convolution IP core
In this subsection, we present the performance evaluation of the designed IP core. We evaluate both the theoretical peak performance and the practical sustained performance. The peak performance is determined via simulations under the assumption that the data transfers between the IP core and external memory cause no delay. The sustained performance, on the other hand, provides a more realistic figure of merit for the whole system, since it takes the data transfers between the IP core and memory into account.

TABLE I. Synthesis result of the 2D convolutional IP core

TABLE II. Peak performance of the 2D convolution IP core

Table II reports the peak performance of the designed IP core. Since all the computing modules are fully pipelined, the IP core is expected to deliver a computed result at every clock cycle. The execution times for the three implementations are 1034, 4106 and 16394 clock cycles, respectively, including the same overhead latency of 10 clock cycles each. We configure a working clock frequency of 100 MHz for the IP core; the corresponding execution times measured in µs are then reported. At a clock frequency of 100 MHz, the maximal frame rates of the three implementations are 96759, 24358 and 6100 frames per second for the image sizes of 32x32, 64x64 and 128x128, respectively.

TABLE III. Performance comparison of the hardware and software implementations of the 2D convolution IP core on the PYNQ-Z2

Image size               32x32    64x64    128x128
HW execution time (s)    0.033    0.124    0.487
SW execution time (s)    0.260    1.061    4.364
Speed-up (times)         7.8      8.6      9.0

Table III presents the performance comparison between the hardware implementations of the 2D convolutional IP core and their pure software implementations in Python running on the same PYNQ-Z2 board. Figure 9 illustrates the performance speed-ups of the hardware implementations over the software ones. The sustained performance of the hardware implementations is lower than the corresponding peak performance; the degradation is due to the data transfer between the IP core and the external memory. Nevertheless, the hardware implementations outperform their software counterparts by factors of 7.8, 8.6 and 9.0 times, respectively.

4. Convolutional Neural Network Application for Handwritten Digit Recognition on PYNQ-Z2
We make use of the designed convolution IP core in a practical application: handwritten digit recognition with the MNIST dataset [11], [23]. In this application, we train a convolutional neural network to carry out the classification problem on the PYNQ-Z2 SoC device. A hardware-software co-design approach is exploited in this work. Specifically, the forward inference of the trained convolutional neural network model is executed on the
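The peak frame rates reported for the evaluation above follow directly from the cycle counts: each frame takes roughly M·N cycles plus the fixed 10-cycle latency, and the peak rate is the clock frequency divided by that total. A back-of-the-envelope sketch (our reconstruction, not the paper's measurement code):

```python
CLOCK_HZ = 100_000_000  # 100 MHz working clock configured for the IP core
OVERHEAD = 10           # fixed overhead latency in clock cycles

def peak_frame_rate(width, height, clock_hz=CLOCK_HZ, overhead=OVERHEAD):
    """Peak frames/s of a fully pipelined core emitting one output
    sample per clock cycle: cycles = width*height + overhead."""
    cycles = width * height + overhead
    return clock_hz / cycles

for size in (32, 64, 128):
    # cycle counts: 1034, 4106 and 16394, matching the reported figures
    print(size, size * size + OVERHEAD, round(peak_frame_rate(size, size)))
```

The computed rates (about 96712, 24355 and 6100 frames per second) agree closely with the reported 96759, 24358 and 6100; the small differences for the first two presumably come from rounding in the reported values.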
Figure 8. Block design view of the ZYNQ7 system with 2D convolution IP core via AXI
TABLE IV. Convolutional neural network architecture for handwritten digit recognition with the MNIST dataset

No  Layer            Feature maps  Size   Kernel size  Stride  Activation
0   Input image      1             32x32  -            -       -
1   Convolution      16            30x30  3x3          1       ReLU
2   MaxPooling       16            15x15  2x2          2       -
3   Convolution      36            13x13  3x3          1       ReLU
4   MaxPooling       36            7x7    2x2          2       -
5   Flattening       -             1764   -            -       -
6   Fully-connected  -             120    -            -       ReLU
7   Fully-connected  -             84     -            -       ReLU
8   Fully-connected  -             10     -            -       Softmax
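The feature-map sizes in Table IV can be cross-checked in a few lines. The 3x3 convolutions are unpadded (32 → 30, 15 → 13), and the 13x13 → 7x7 step implies that the 2x2 max-pooling rounds up; that ceil-mode behaviour is our inference from the table, not something the paper states explicitly:

```python
import math

def conv_out(size, kernel=3, stride=1):
    # valid (unpadded) convolution: 32 -> 30, 15 -> 13
    return (size - kernel) // stride + 1

def pool_out(size, window=2, stride=2):
    # 2x2 max-pooling; ceil mode inferred from the 13 -> 7 step in Table IV
    return math.ceil((size - window) / stride) + 1

size = 32
size = conv_out(size)    # layer 1: 30
size = pool_out(size)    # layer 2: 15
size = conv_out(size)    # layer 3: 13
size = pool_out(size)    # layer 4: 7
flat = 36 * size * size  # layer 5: flatten 36 feature maps of 7x7
print(size, flat)        # 7 1764, matching Table IV
```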
TABLE V. Execution time [seconds] of convolutional neural network implementations with the MNIST dataset on the PYNQ-Z2 FPGA
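The abstract reports an overall inference speed-up of 6.2 times when only the convolutional layers are accelerated, by up to 9 times. Amdahl's law lets us estimate what fraction of the software inference time those layers must account for. Taking 9x as representative of the per-layer speed-up is our simplifying assumption:

```python
def conv_fraction(overall_speedup, layer_speedup):
    """Invert Amdahl's law, overall = 1 / ((1 - p) + p / s),
    for p, the fraction of run time spent in the accelerated part."""
    return (1 - 1 / overall_speedup) / (1 - 1 / layer_speedup)

p = conv_fraction(6.2, 9.0)
print(f"{p:.1%}")  # roughly 94% of inference time in the convolutional layers
```

This back-of-the-envelope figure is consistent with the convolutional layer being the dominant cost of the network, as argued in Section 2.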
[3] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, "Speech recognition using deep neural networks: A systematic review," IEEE Access, vol. 7, pp. 19143–19165, 2019.

[4] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al., "End to end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.

[5] D. Strigl, K. Kofler, and S. Podlipnig, "Performance and scalability of GPU-based convolutional neural networks," in 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing. IEEE, 2010, pp. 317–324.

[6] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2016.

[7] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015, pp. 161–170.

[8] M. Sit, R. Kazami, and H. Amano, "FPGA-based accelerator for losslessly quantized convolutional neural networks," in 2017 International Conference on Field Programmable Technology (ICFPT). IEEE, 2017, pp. 295–298.

[9] "PYNQ Homepage," accessed: 2020-10-18. [Online]. Available: pynq.io/home.html

[10] TUL Technology Unlimited, "TUL PYNQ-Z2 board," accessed: 2020-10-18. [Online]. Available: tul.com.tw/ProductsPYNQ-Z2.html

[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[12] B. Cope et al., "Implementation of 2D convolution on FPGA, GPU and CPU," Imperial College Report, pp. 2–5, 2006.

[13] S. Perri, M. Lanuzza, P. Corsonello, and G. Cocorullo, "A high-performance fully reconfigurable FPGA-based 2D convolution processor," Microprocessors and Microsystems, vol. 29, no. 8-9, pp. 381–391, 2005.

[14] "PYNQ Overlay Tutorials," accessed: 2020-10-18. [Online]. Available: pynq.readthedocs.io/en/v2.5.1/pynq_overlays.html

[15] J. N. Coleman, E. Chester, C. I. Softley, and J. Kadlec, "Arithmetic on the European logarithmic microprocessor," IEEE Transactions on Computers, vol. 49, no. 7, pp. 702–715, 2000.

[16] F. Albu, J. Kadlec, N. Coleman, and A. Fagan, "Pipelined implementations of the a priori error-feedback LSL algorithm using logarithmic arithmetic," in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3. IEEE, 2002, pp. III–2681.

[17] T. V. Huynh, "Design space exploration for a single-FPGA handwritten digit recognition system," in 2014 IEEE Fifth International Conference on Communications and Electronics (ICCE). IEEE, 2014, pp. 291–296.

[18] T. V. Huynh, "Evaluation of artificial neural network architectures for pattern recognition on FPGA," International Journal of Computing and Digital Systems, vol. 6, no. 03, pp. 133–138, 2017.

accessed: 2020-10-18. [Online]. Available: colab.research.google.com/notebooks/intro.ipynb

[25] A. Baldominos, Y. Saez, and P. Isasi, "A survey of handwritten character recognition with MNIST and EMNIST," Applied Sciences, vol. 9, no. 15, p. 3169, 2019.