
International Journal of Computing and Digital Systems

ISSN (2210-142X)
Int. J. Com. Dig. Sys. 11, No.1 (Jan-2022)
https://dx.doi.org/10.12785/ijcds/110136

FPGA-based Acceleration for Convolutional Neural Networks on PYNQ-Z2

Thang Viet Huynh
Faculty of Electronics and Telecommunication Engineering, The University of Danang - University of Science and Technology, Danang city, Vietnam
E-mail address: [email protected]

Received 22 Sep. 2020, Revised 24 Dec. 2021, Accepted 4 Jan. 2022, Published 20 Jan. 2022

Abstract: Convolutional neural network is now widely used in computer vision and deep learning applications. The most compute-
intensive layer in convolutional neural networks is the convolutional layer, which should be accelerated in hardware. This paper aims
to develop an efficient hardware-software co-design framework for machine learning applications on the PYNQ-Z2 board. To achieve
this goal, we develop hardware implementations of convolutional IP core and use them as Python overlays. Experiments show that
the hardware implementations of the convolutional IP core outperform their software implementations by factors of up to 9 times.
Furthermore, we make use of the designed convolutional IP core as hardware accelerator in the handwritten digit recognition application
with MNIST dataset. Thanks to the use of the hardware accelerator for the convolutional layers, the execution performance of the
convolutional neural network has been improved by a factor of 6.2 times.

Keywords: FPGA, Convolutional Neural Network, Hardware Accelerator, Python, PYNQ

1. Introduction

In recent years, convolutional neural networks (CNNs) have become very popular in the deep learning area. CNNs combine the advantages of convolutional filtering operations and traditional artificial neural networks in both feature extraction and classification. They have been used in a variety of applications requiring high accuracy, such as image classification [1], [2], speech recognition [3], or self-driving cars [4]. Since the applications of CNNs are becoming more complex, the number of layers and computational operations in CNN architectures is rapidly increasing, thereby requiring a large amount of computing resources and memory storage.

To overcome this problem, many researchers have proposed various architectures and techniques to accelerate the inference of CNNs. With respect to hardware implementations, three hardware platforms can be used as CNN accelerators: graphics processing units (GPUs) [5], application-specific integrated circuits (ASICs) [6], and field-programmable gate arrays (FPGAs) [7], [8]. Among these platforms, the FPGA has revealed itself as a high-performance, low-cost embedded device that is very suitable for the hardware prototyping of convolutional accelerators. Recently, new FPGA hardware solutions have been introduced, allowing for an efficient hardware-software co-design framework for deep learning applications. PYNQ [9] is an open-source project from Xilinx that makes it easier to develop FPGA-based deep learning applications, in which designers can efficiently combine the benefits of programmable logic and microprocessors using the Python language and libraries.

The typical architecture of a CNN comprises many layers. The three common types of layers are the convolutional layer, the subsampling layer and the fully-connected layer. The most compute-intensive layer in a CNN is the convolutional layer. Therefore, to accelerate the inference of CNNs, the convolutional layer should be deployed in hardware.

In this work, we aim to develop an efficient hardware-software co-design framework for machine learning applications on the PYNQ board. To achieve this goal, we implement the convolutional layer in the programmable logic of the FPGA hardware while keeping the other layers of the network executed on the software microprocessor for flexibility. The Xilinx ZYNQ SoC based PYNQ-Z2 device [10] is used in this work. The scientific contributions of this paper are the following:

• We design and implement in VHDL (Very High Speed Integrated Circuits Hardware Description Language) a 2D convolutional intellectual property (IP) core fully synthesizable for FPGA.

• The designed 2D convolutional IP core is then used as a hardware accelerator to accelerate the inference of a convolutional neural network for the handwritten digit recognition application with the MNIST dataset on the PYNQ-Z2 FPGA board.

The remainder of this paper is organized as follows. Section 2 presents the background of the paper. Section 3 shows the architectural design and hardware implementation of the proposed 2D convolution IP core targeting the Xilinx PYNQ FPGA, followed by the evaluation of the designed IP core on the Xilinx PYNQ-Z2 device. The application of the designed IP core in a convolutional neural network for handwritten digit recognition is presented in detail in Section 4. In Section 5, we summarize our work and sketch out future research directions.

2. Background

A. Convolutional Neural Networks

The convolutional neural network is a commonly used deep learning model for image processing and computer vision. By combining feature extraction and classification, a CNN can offer very high recognition accuracy. A typical CNN architecture, the LeNet network adopted from [11], is shown in Figure 1. A CNN consists of three main types of layers: convolutional layers, subsampling layers and fully-connected layers.

Figure 1. The LeNet convolutional neural network architecture

The convolutional layer performs the two-dimensional (2D) convolution between the input data and the kernel; an activation function is then applied to the convolved result to produce a feature map. The kernel size is normally 3x3 or 5x5 elements. A ReLU (rectified linear unit) activation function is often used in CNNs. Many kernels are typically used in each convolutional layer of the CNN to produce many feature maps, in order to extract different types of features from the input data.

The subsampling layer reduces the spatial size of the feature maps produced by its preceding convolutional layer. It is useful for extracting dominant features that are rotationally and positionally invariant, thereby maintaining the effectiveness of the training process of the model. The subsampling layer also reduces the computational complexity of the network. There are two types of subsampling operations, max-pooling and average-pooling, of which max-pooling is preferable as it performs better than average-pooling. The commonly used subsampling operation is 2x2 max-pooling.

Multiple pairs of convolutional and subsampling layers are concatenated in a convolutional neural network. For example, the network in Figure 1 has two convolutional layers and two subsampling layers to perform the feature extraction for the input data.

Once the feature extraction is done, the output of the subsampling layer is flattened into a single vector of values and fed into the fully-connected layer. The fully-connected layer performs the classification task to produce a label indicating the correct category of the input image. In the example in Figure 1, three fully-connected layers are used.

Among all the layers of the network, the most compute-intensive layer is the convolutional layer. In this work, the convolutional layer will be implemented on the programmable logic of the FPGA so as to accelerate the inference performance of the whole convolutional neural network. The 2D convolution operation is described in the next subsection.

B. The 2D convolution operation

The convolution operation is, by far, the most commonly used and most compute-intensive operation in both image processing [12], [13] and artificial intelligence applications such as convolutional neural networks [6], [11]. Given an M×N input image I and an S×S kernel W, the 2D convolution output image F of size M×N is computed by Equation (1), as follows:

F(m, n) = \sum_{i=0}^{S-1} \sum_{j=0}^{S-1} W[i, j] \cdot I[m - i, n - j]    (1)

Figure 2 shows an illustration of the 2D convolution computation, in which the image size is 5x5 pixels and the kernel size is 3x3 elements. To compute the convolution for each pixel, a sliding window of size S×S is used to extract the neighboring pixels needed for the convolution of the pixel at hand. In general, a 2D convolution with an S×S kernel requires S×S multiply-accumulate (MAC) operations per output sample; the number of MAC operations is therefore M×N×S×S for the whole image.

Figure 2. An illustration of the 2D convolution operation
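To make Equation (1) concrete, the following NumPy sketch computes the same M×N output with a straightforward loop nest. It is a software reference model only, not the paper's hardware implementation, and it assumes zero values outside the image borders (a boundary convention the paper does not state explicitly).

```python
import numpy as np

def conv2d_reference(image, kernel):
    """Direct implementation of Equation (1): true 2D convolution.

    Pixels outside the image are assumed to be zero (our assumption;
    the paper does not specify the boundary handling).
    """
    M, N = image.shape
    S = kernel.shape[0]                      # square S x S kernel
    out = np.zeros((M, N), dtype=np.float64)
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for i in range(S):
                for j in range(S):
                    r, c = m - i, n - j      # I[m-i, n-j], as in Eq. (1)
                    if 0 <= r < M and 0 <= c < N:
                        acc += kernel[i, j] * image[r, c]
            out[m, n] = acc                  # S*S MACs per output pixel
    return out

# Example matching Figure 2: 5x5 image, 3x3 kernel -> 5*5*3*3 = 225 MACs total.
img = np.arange(25, dtype=np.float64).reshape(5, 5)
ker = np.ones((3, 3)) / 9.0
print(conv2d_reference(img, ker).shape)      # (5, 5), i.e. an M x N output
```

The M×N×S×S MAC count quoted above is visible directly in the four nested loops.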


C. The PYNQ-Z2 FPGA

The PYNQ-Z2 board is a Xilinx ZYNQ SoC device based on a dual-core ARM Cortex-A9 processor integrated with an FPGA fabric [14]. The functional block diagram of the Xilinx ZYNQ SoC is shown in Figure 3. The dual-core ARM Cortex-A9 processor is referred to as the Processing System (PS), and the FPGA fabric is referred to as the Programmable Logic (PL). The PS subsystem includes a number of dedicated peripherals (including memory controllers and other peripheral interfaces) and can be extended with additional customized hardware IP cores in the PL overlay.

Figure 3. Functional block diagram of the Xilinx ZYNQ SoC

Overlays, or hardware libraries, are programmable FPGA designs that extend the user applications from the PS subsystem of the ZYNQ device into the PL subsystem. Overlays can be used to accelerate a software application, or to customize the hardware platform for a particular application. In addition, the most advantageous feature of the PYNQ-Z2 board is that it provides a Python interface that allows overlays in the PL to be controlled from Python programs running in the PS, making FPGAs easier to use with most computer vision and machine learning applications. Figure 4 presents the design flow applied in this work for the implementation of CNN models on the PYNQ-Z2 board.

Figure 4. The design flow for CNN on PYNQ-Z2
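As a concrete illustration of the overlay flow in Figure 4, the snippet below shows how a bitstream is typically loaded from Python with the pynq package. The file name conv32.bit and the IP instance name conv2D_0 are placeholders for whatever names the exported design actually uses.

```python
from pynq import Overlay

# Load the exported bitstream (.bit); PYNQ reads the companion .tcl/.hwh
# metadata with the same base name. "conv32.bit" is a hypothetical file
# name for the 32x32 variant of the convolution overlay.
overlay = Overlay("conv32.bit")

# List the IP blocks PYNQ discovered in the design.
print(overlay.ip_dict.keys())

# Access the 2D convolution core; the attribute name follows the
# block-design instance name (assumed here to be conv2D_0).
conv_ip = overlay.conv2D_0
```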
D. Data representation for FPGA implementation

One challenge for efficient hardware implementations of image processing and machine learning applications on FPGA is choosing a suitable data format for real-number operands and operations. Both fixed-point and floating-point number formats, as well as the logarithmic number system, can be used for the hardware realization on FPGA, as reported in [15], [16], [17], [18], [19], [20]. The logarithmic number system, in which a real number is represented as a fixed-point logarithm, was developed in [15] and applied to adaptive signal processing algorithms [16]. Floating-point number formats give more accurate computed results [17], [18], [19], while fixed-point number formats bring much better computation performance with respect to execution time and hardware resources.

Recently, it has been shown that a low-precision fixed-point number format is sufficient for the training and computation of deep learning neural network models [20], with little to no degradation in classification accuracy. To perform the arithmetic computations in the designed 2D convolution core, we employ the VHDL fixed-point package designed by David Bishop [21] and use the fixed-point signed number format Q1.7 with 8-bit operands in this work.
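To illustrate the chosen number format, the helper below quantizes real values to Q1.7, which we read here as an 8-bit signed format with 7 fractional bits, i.e. values in [-1, 1) with a step of 2^-7. This reading of the notation is our assumption; the VHDL fixed-point package [21] is the authoritative definition used in the IP core.

```python
import numpy as np

def to_q1_7(x):
    """Quantize float values to the 8-bit signed Q1.7 format.

    Assumes 1 sign/integer bit and 7 fractional bits, so the representable
    range is [-1.0, 1.0 - 2**-7] with a resolution of 2**-7 = 0.0078125.
    """
    q = np.round(np.asarray(x, dtype=np.float64) * 128.0)
    q = np.clip(q, -128, 127)                 # saturate to the 8-bit range
    return q.astype(np.int8)

def from_q1_7(q):
    """Convert Q1.7 integers back to floating point."""
    return q.astype(np.float64) / 128.0

w = np.array([0.5, -0.25, 0.99, -1.0])
print(from_q1_7(to_q1_7(w)))   # [ 0.5       -0.25       0.9921875 -1.       ]
```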
3. Design and Implementation of the 2D Convolution Core on FPGA

In this section, we present the design and implementation of the 2D convolution IP core targeting Xilinx FPGAs. For the design and hardware implementation of the targeted IP core, we utilize the Vivado Design Suite 2018.3 WebPack Edition from Xilinx, and we implement the design on the PYNQ-Z2 FPGA board.

In this work, we investigate three different hardware implementations of the 2D convolution core that correspond to image sizes of 32x32, 64x64 and 128x128 pixels. The kernel size is fixed at 3x3 elements. We then generate three different Python overlays corresponding to the three chosen hardware implementations of the designed IP core.

Figure 5 briefly presents the general block diagram of the designed IP core. The IP core consists of three modules: an input buffer module, a 2D convolution module and an output buffer module. The three modules are fully pipelined to increase the execution performance and data throughput. The IP core architecture is implemented in VHDL.

Figure 5. Functional block diagram of the 2D convolution IP core

A. Input buffer and output buffer modules

The input- and output-buffer modules are designed to support the communication between the IP core and the host processing system. The input buffer module is used to store all the incoming image pixels before sending them to the convolution module. Similarly, the output buffer module stores all the computed results and notifies the host processing system about the readiness of the convolution computation via the result_rdy signal.

B. Convolution module

The 2D convolution module, the heart of the IP core, includes two submodules: a line buffer and a muladdtree3x3 submodule, as shown in Figure 6.

Figure 6. Block diagram of the 2D convolution module

The line buffer reads the input image line I(m, n) and extracts the nine neighboring image pixels (including the convolved pixel at hand) necessary for the convolutional computation of each input image pixel. The nine output pixels of the line buffer are denoted as x_m y_n, where m, n = 0, 1, 2. After some overhead clock cycles, the line buffer continuously provides the nine image pixels for the convolutional computation at every clock cycle.

The muladdtree3x3 submodule performs the convolution between the kernel and the nine neighboring image pixels; this computation is a dot product between the weight vector and the output vector of the line buffer. To implement this dot product, we use a mul-add tree architecture to increase the computing speed. Figure 7 gives details of the architecture of the muladdtree3x3 submodule. The mul-add tree architecture is fully pipelined using a series of registers (DFFs), thereby providing the convolved result for each input pixel at every clock cycle.

Figure 7. Block diagram of the muladdtree3x3 submodule
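A small software golden model is convenient when verifying the muladdtree3x3 datapath in simulation. The sketch below reproduces the 9-element dot product with a pairwise (tree-shaped) reduction; the exact adder-tree grouping inside the IP core is not given in the text, so the grouping used here is only an illustrative assumption.

```python
import numpy as np

def muladdtree3x3_model(window, kernel):
    """Golden model of the 3x3 dot product computed by the mul-add tree.

    `window` holds the nine line-buffer outputs x_m y_n and `kernel` the
    nine weights; both are flattened to length-9 vectors.
    """
    p = (np.asarray(window, dtype=np.float64).ravel()
         * np.asarray(kernel, dtype=np.float64).ravel())   # 9 parallel multipliers
    # Pairwise adder tree: 9 -> 5 -> 3 -> 2 -> 1 (grouping is our assumption).
    while p.size > 1:
        paired = p[: p.size // 2 * 2].reshape(-1, 2).sum(axis=1)
        p = np.concatenate([paired, p[p.size // 2 * 2:]])
    return p[0]

window = np.arange(9)          # nine neighboring pixels from the line buffer
kernel = np.full(9, 1.0 / 9)   # e.g. an averaging kernel
print(muladdtree3x3_model(window, kernel))   # 4.0, same as np.dot(window, kernel)
```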


C. Synthesis result of the 2D convolution core

Table I presents the synthesis results of the 2D convolutional IP core for three different implementations corresponding to the three input image sizes of 32x32 pixels, 64x64 pixels and 128x128 pixels. As shown in Table I, all three implementations fit successfully on the chosen Xilinx PYNQ-Z2 FPGA board; the hardware resources increase with the size of the input image.

TABLE I. Synthesis result of the 2D convolutional IP core

Resource   Available    Utilization
                        32x32    64x64    128x128
LUT        53200        1463     3517     11766
LUTRAM     17400        570      2218     8786
FF         106400       486      559      795
LUT: Look-Up Table; FF: Flip-Flop

D. Packaging the convolution IP core as a Python overlay on the PYNQ-Z2 board

Once the 2D convolution core has been successfully synthesized and verified, we export the design as a user IP core using the Xilinx Vivado software tool [22] (the free WebPack edition). To simplify the software control, we employ an Advanced eXtensible Interface (AXI) Lite interface to carry out the data communication between the IP core and the ZYNQ-7 host processing system.

Figure 8 shows the block design view of the whole system, in which the 2D convolution core (conv2D_0) is connected to the ZYNQ-7 processing system via the AXI interconnect and is under the common reset control of the processor system reset block.

Figure 8. Block design view of the ZYNQ7 system with the 2D convolution IP core via AXI

We then run the bitstream generation and export the system to a Python overlay that can be loaded and executed on the PYNQ-Z2 development board. The exported overlay consists of two main parts: the bitstream file (.bit) that contains the hardware design, and the project block diagram Tcl file (.tcl). The Tcl file is used by PYNQ to automatically identify the ZYNQ system configuration, IP versions, interrupts, resets, and other control signals [14]. As we investigate three hardware implementations of the convolution IP core, we generate three different Python overlays corresponding to the three hardware implementations of the IP core with input image sizes of 32x32 pixels, 64x64 pixels and 128x128 pixels. The kernel size for all three implementations is fixed at 3x3.
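Because the core is controlled over AXI-Lite, the host can drive it with simple register reads and writes. The sketch below uses the register-access methods that PYNQ exposes on a discovered IP block; the register offsets (0x10, 0x18, 0x20) and the control protocol are purely hypothetical placeholders, since the paper does not publish the core's register map.

```python
from pynq import Overlay

overlay = Overlay("conv32.bit")      # hypothetical overlay for the 32x32 core
conv_ip = overlay.conv2D_0           # AXI-Lite slave discovered by PYNQ

# The offsets below are illustrative only; the real register map comes
# from the VHDL design and the Vivado address editor.
REG_KERNEL = 0x10   # assumed base address of the 3x3 kernel coefficients (Q1.7)
REG_PIXEL  = 0x18   # assumed input-pixel write register
REG_RESULT = 0x20   # assumed output / result_rdy register

def write_kernel(q17_weights):
    """Write nine Q1.7 kernel coefficients, one 32-bit word each (assumed)."""
    for k, w in enumerate(q17_weights):
        conv_ip.write(REG_KERNEL + 4 * k, int(w) & 0xFF)   # two's-complement byte

def push_pixel(q17_pixel):
    conv_ip.write(REG_PIXEL, int(q17_pixel) & 0xFF)

def read_result():
    return conv_ip.read(REG_RESULT)
```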
E. Evaluation of the 2D convolution IP core

In this subsection, we present the performance evaluation of the designed IP core. We evaluate both the theoretical peak performance and the practical sustained performance of the designed IP core. The peak performance can be determined via simulations under the assumption that the data transfers between the IP core and external memory cause no delay. On the other hand, the sustained performance provides a more realistic figure of merit for the whole system, since it takes into account the data transfers between the IP core and memory.

TABLE II. Peak performance of the 2D convolution IP core

Specification            Implementation
                         32x32    64x64    128x128
Number of pixels         1024     4096     16384
Execution cycles         1034     4106     16394
Overhead cycles          10       10       10
Execution time [µs]      10.34    41.06    163.94
Frames per second        96759    24358    6100
Note: Measurements are carried out at a clock frequency of 100 MHz.

Table II reports the peak performance of the designed IP core. Since all the computing modules are fully pipelined, the IP core is expected to provide a computed result at every clock cycle. The execution cycles for the three implementations are 1034, 4106 and 16394 clock cycles, respectively, with the same overhead latency of 10 clock cycles each. We configure a working clock frequency of 100 MHz for the IP core. The corresponding execution times measured in µs are then reported. The maximal frame rates at a clock frequency of 100 MHz of the three implementations are 96759, 24358 and 6100 frames per second for the image sizes of 32x32, 64x64 and 128x128, respectively.
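The peak numbers in Table II follow directly from the pipeline model: one pixel per clock cycle plus a fixed 10-cycle overhead at 100 MHz. The short check below reproduces the cycle counts and execution times exactly; the computed frame rates agree with the reported values to within a fraction of a percent.

```python
CLOCK_HZ = 100e6          # 100 MHz working clock
OVERHEAD_CYCLES = 10      # fixed pipeline overhead per frame

for side in (32, 64, 128):
    pixels = side * side                      # one result per clock cycle
    cycles = pixels + OVERHEAD_CYCLES         # 1034, 4106, 16394
    exec_time_us = cycles / CLOCK_HZ * 1e6    # 10.34, 41.06, 163.94 us
    fps = CLOCK_HZ / cycles                   # ~9.67e4, ~2.44e4, ~6.10e3 frames/s
    print(f"{side}x{side}: {cycles} cycles, {exec_time_us:.2f} us, {fps:.0f} fps")
```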
TABLE III. Performance evaluation of the 2D convolution IP core

Measurement              Implementation
                         32x32    64x64    128x128
HW execution time (s)    0.033    0.124    0.487
SW execution time (s)    0.260    1.061    4.364
Speed-up (times)         7.8      8.6      9.0

Table III presents the performance comparison between the hardware implementations of the 2D convolutional IP core and their pure software implementations in Python running on the same PYNQ-Z2 board. Figure 9 illustrates the performance speedups of the hardware implementations over the software ones. The sustained performances of the hardware implementations are lower than their corresponding peak performances; the performance degradation is due to the data transfer between the IP core and the external memory. Nevertheless, the hardware implementations outperform their software counterparts by factors of 7.8, 8.6 and 9.0 times, respectively.

Figure 9. Performance comparison among various implementations of the 2D convolution IP core

4. Convolutional Neural Network Application for Handwritten Digit Recognition on PYNQ-Z2

We make use of the designed convolution IP core in a practical application: handwritten digit recognition with the MNIST dataset [11], [23]. In this application, we train a convolutional neural network to carry out the classification problem on the PYNQ-Z2 SoC device. A hardware-software co-design approach is exploited in this work. Specifically, the forward inference of the trained convolutional neural network model is executed on the chosen FPGA device as follows:

• The convolutional operation will be executed on the Programmable Logic of the PYNQ-Z2 device; the designed 2D convolutional IP core is loaded as a Python overlay and is executed as a Python function.

• The other operations of the network (i.e., ReLU activation function, max-pooling, flattening, fully-connected layers) will be executed on the Processing System of the PYNQ-Z2 device based on the ARM Cortex-A9 processor.
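A minimal sketch of this hardware-software split is shown below: the convolution is delegated to a Python function that wraps the overlay (here a placeholder named conv2d_hw), while ReLU, max-pooling and the fully-connected layers run as plain NumPy on the ARM cores. The function names and the way weights are passed are our assumptions, not the paper's actual code.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def maxpool2x2(x):
    """2x2 max-pooling with stride 2 on a (H, W, C) feature map."""
    H, W, C = x.shape
    # Pad odd dimensions so 13x13 -> 7x7 as in Table IV (ceil division).
    x = np.pad(x, ((0, H % 2), (0, W % 2), (0, 0)), constant_values=-np.inf)
    H, W, _ = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def dense(x, w, b):
    return x @ w + b

def forward(image, params, conv2d_hw):
    """Forward inference; conv2d_hw(x, kernels) is the overlay-backed
    convolution (placeholder name), everything else runs in software."""
    x = relu(conv2d_hw(image, params["conv1_kernels"]))   # PL: 2D convolution IP
    x = maxpool2x2(x)                                     # PS: ARM Cortex-A9
    x = relu(conv2d_hw(x, params["conv2_kernels"]))       # PL
    x = maxpool2x2(x)                                     # PS
    x = x.reshape(-1)                                     # flattening
    x = relu(dense(x, params["fc1_w"], params["fc1_b"]))  # PS
    x = relu(dense(x, params["fc2_w"], params["fc2_b"]))  # PS
    logits = dense(x, params["fc3_w"], params["fc3_b"])   # PS; softmax omitted,
    return np.argmax(logits)                              # argmax is unchanged
```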

The MNIST dataset is used for training and testing the network. The dataset has a training set of 60,000 samples and a test set of 10,000 samples. There are 10 different handwritten digits, ranging from 0 to 9, in the dataset. Each digit is normalized and centered in a gray-level image of size 28x28. For the sake of convenience, the samples are extended to 32x32 with background pixels.
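For reference, the 28x28 MNIST digits can be extended to 32x32 by adding two background pixels on every side, e.g. as follows (assuming the background value is 0, as in the normalized MNIST images):

```python
import numpy as np

def pad_to_32x32(digit_28x28):
    """Pad a 28x28 MNIST image with 2 background pixels on every side."""
    assert digit_28x28.shape == (28, 28)
    return np.pad(digit_28x28, pad_width=2, constant_values=0)

x = np.random.rand(28, 28)
print(pad_to_32x32(x).shape)   # (32, 32)
```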
A. CNN configuration and model training

The chosen architecture of the CNN for handwritten digit recognition is shown in Table IV. There are eight layers in the network. The first layer is a convolutional layer having 16 kernel maps, followed by a max-pooling layer for dimensionality reduction. The third and fourth layers are another pair of convolutional and max-pooling layers, having 36 kernel maps.
All convolutional layers use a kernel size of 3x3 and the ReLU activation function, while all max-pooling layers use a window size of 2x2.

After the convolution and max-pooling operations, all the extracted features are flattened into a vector by the flattening layer, resulting in a feature vector of 1764 elements. This feature vector then becomes the input of three consecutive fully-connected layers having 120, 84 and 10 neurons, respectively, which perform the classification task. The final fully-connected layer uses the softmax activation function, while the other fully-connected layers use the ReLU activation function.

TABLE IV. Convolutional neural network architecture for handwritten digit recognition with the MNIST dataset

No  Layer            Number of feature maps  Size   Kernel size  Stride  Activation function
0   Input Image      1                       32x32  -            -       -
1   Convolution      16                      30x30  3x3          1       ReLU
2   MaxPooling       16                      15x15  2x2          2       -
3   Convolution      36                      13x13  3x3          1       ReLU
4   MaxPooling       36                      7x7    2x2          2       -
5   Flattening       -                       1764   -            -       -
6   Fully-Connected  -                       120    -            -       ReLU
7   Fully-Connected  -                       84     -            -       ReLU
8   Fully-Connected  -                       10     -            -       Softmax
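The paper states only that the network is described in Python and trained in Colab; a Keras definition that matches Table IV layer-for-layer would look roughly like the sketch below. The use of Keras/TensorFlow, and the 'same' padding on the second pooling layer needed to obtain the 7x7 maps of Table IV, are our assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_mnist_cnn():
    """CNN matching Table IV: 32x32x1 input, two conv/pool pairs, three FC layers."""
    return keras.Sequential([
        keras.Input(shape=(32, 32, 1)),
        layers.Conv2D(16, (3, 3), activation="relu"),    # 30x30x16
        layers.MaxPooling2D((2, 2)),                     # 15x15x16
        layers.Conv2D(36, (3, 3), activation="relu"),    # 13x13x36
        layers.MaxPooling2D((2, 2), padding="same"),     # 7x7x36 (ceil, as in Table IV)
        layers.Flatten(),                                # 7*7*36 = 1764 elements
        layers.Dense(120, activation="relu"),
        layers.Dense(84, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])

model = build_mnist_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```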
For training the convolutional neural network model, we exploit Google Colab [24]. The chosen convolutional neural network is described in Python and the model training is executed in the Colab framework. Figure 10 illustrates the result of the training process. After 20 epochs, the convolutional neural network reaches accuracies of 99.77% on the training set and 99.04% on the test set. These accuracies are comparable with those of other related models reported in [25] for handwritten digit recognition applications. In [19], a deep neural network for handwritten digit recognition with the MNIST dataset was used, resulting in an accuracy of 97.14% on the test set. Compared with the previous work in [19], this work offers a much higher accuracy.

Figure 10. Convolutional neural network model training after 20 epochs, with a test accuracy of 99.04%

Once the training is done, all the parameters of the trained network are saved for use in the forward inference of the network on the PYNQ-Z2 device.

B. Performance evaluation on PYNQ-Z2

The performance evaluation of the handwritten digit recognition application is carried out on the PYNQ-Z2 device. Two implementations are run on the device: i) a pure software implementation that runs on the Processing System of the device, fully based on the ARM Cortex-A9 processor; and ii) a hardware-software co-design implementation with the 2D convolution IP core executed on the Programmable Logic of the FPGA device, while the other operations are executed by the ARM Cortex-A9 processor. For a better understanding of the performance of the entire network, the execution time of each layer is also measured.

TABLE V. Execution time [second] of the convolutional neural network implementations with the MNIST dataset on the PYNQ-Z2 FPGA

Layer                     Pure software implementation    Implementation with 2D convolution IP core on hardware
                          (s)      (%)                    (s)      (%)
1st Convolution           4.017    69.30                  0.634    68.22
1st ReLU + MaxPooling     0.005    0.08                   0.002    0.24
2nd Convolution           1.735    29.94                  0.253    27.28
2nd ReLU + MaxPooling     0.004    0.06                   0.004    0.38
Flattening                0.001    0.02                   0.001    0.11
1st Fully-Connected       0.028    0.47                   0.028    2.96
2nd Fully-Connected       0.005    0.08                   0.005    0.51
3rd Fully-Connected       0.003    0.05                   0.003    0.31
Total execution time      5.797    100.00                 0.929    100.00
Speedup                   1X                              6.2X

Table V reports the total execution time of the two implementations of the network. Thanks to the use of the convolution IP core on the FPGA fabric, the hardware-software co-design implementation outperforms the pure software implementation by a factor of 6.2 times.

Table V also reports the execution times of all layers, which allows for efficient performance profiling. The convolution layers account for most of the computational load of the network: 98% and 95% of the total execution times of the two implementations, respectively. Obviously, the performance of the whole network can be further improved if the execution of the convolutional layers is sped up further.
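The headline numbers can be cross-checked directly from Table V: the overall speedup is the ratio of the total execution times, and the convolution share of the accelerated implementation follows from its two convolution entries.

```python
sw_total, hw_total = 5.797, 0.929        # total execution times from Table V [s]
hw_conv = 0.634 + 0.253                  # convolution layers in the co-design [s]

print(f"overall speedup : {sw_total / hw_total:.1f}x")       # ~6.2x
print(f"conv share (HW) : {hw_conv / hw_total * 100:.0f}%")   # ~95%
```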
5. Conclusions

In this paper, we have presented the design, implementation and evaluation of a 2D convolutional IP core synthesizable for FPGAs. We have developed and generated hardware implementations of the IP core as Python overlays, and carried out the performance evaluation on the PYNQ-Z2 device. It has been shown that the hardware implementations of the IP core outperform their software implementations. Furthermore, we have used the designed convolutional IP core as a hardware accelerator in the handwritten digit recognition application, in which a hardware-software co-design framework is deployed. Thanks to the use of the hardware accelerator for the convolutional layers, the execution performance of the convolutional neural network has been improved by a factor of 6.2 times. We believe that the framework presented in this work will help to accelerate FPGA-based hardware implementations of image processing and deep learning applications.

To further increase the performance of the convolutional neural network implementations, we will improve the communication interface between the IP core and the ARM-based processing system by employing an AXI-Stream interface with a direct memory access (DMA) control mechanism. This will be our future work.

Acknowledgment

This research is funded by the Funds for Science and Technology Development of the University of Danang under project number B2019-DN02-61.

References

[1] Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, "HCP: A flexible CNN framework for multi-label image classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1901–1907, 2015.

[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.

[3] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, "Speech recognition using deep neural networks: A systematic review," IEEE Access, vol. 7, pp. 19143–19165, 2019.

[4] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al., "End to end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.

[5] D. Strigl, K. Kofler, and S. Podlipnig, "Performance and scalability of GPU-based convolutional neural networks," in 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing. IEEE, 2010, pp. 317–324.

[6] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2016.

[7] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015, pp. 161–170.

[8] M. Sit, R. Kazami, and H. Amano, "FPGA-based accelerator for losslessly quantized convolutional neural networks," in 2017 International Conference on Field Programmable Technology (ICFPT). IEEE, 2017, pp. 295–298.

[9] "PYNQ Homepage," accessed: 2020-10-18. [Online]. Available: pynq.io/home.html

[10] TUL Technology Unlimited, "TUL PYNQ-Z2 board," accessed: 2020-10-18. [Online]. Available: tul.com.tw/ProductsPYNQ-Z2.html

[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[12] B. Cope et al., "Implementation of 2D convolution on FPGA, GPU and CPU," Imperial College Report, pp. 2–5, 2006.

[13] S. Perri, M. Lanuzza, P. Corsonello, and G. Cocorullo, "A high-performance fully reconfigurable FPGA-based 2D convolution processor," Microprocessors and Microsystems, vol. 29, no. 8-9, pp. 381–391, 2005.

[14] "PYNQ Overlay Tutorials," accessed: 2020-10-18. [Online]. Available: pynq.readthedocs.io/en/v2.5.1/pynq_overlays.html

[15] J. N. Coleman, E. Chester, C. I. Softley, and J. Kadlec, "Arithmetic on the European logarithmic microprocessor," IEEE Transactions on Computers, vol. 49, no. 7, pp. 702–715, 2000.

[16] F. Albu, J. Kadlec, N. Coleman, and A. Fagan, "Pipelined implementations of the a priori error-feedback LSL algorithm using logarithmic arithmetic," in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3. IEEE, 2002, pp. III–2681.

[17] T. V. Huynh, "Design space exploration for a single-FPGA handwritten digit recognition system," in 2014 IEEE Fifth International Conference on Communications and Electronics (ICCE). IEEE, 2014, pp. 291–296.
[18] T. V. Huynh, "Evaluation of artificial neural network architectures for pattern recognition on FPGA," International Journal of Computing and Digital Systems, vol. 6, no. 03, pp. 133–138, 2017.

[19] T. V. Huynh, "Deep neural network accelerator based on FPGA," in 2017 4th NAFOSTED Conference on Information and Computer Science. IEEE, 2017, pp. 254–257.

[20] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in International Conference on Machine Learning. PMLR, 2015, pp. 1737–1746.

[21] D. W. Bishop, "VHDL-2008 support library," 2011, accessed: 2020-10-18. [Online]. Available: github.com/FPHDL/fphdl

[22] Xilinx, "Vivado Design Suite Evaluation and WebPACK," accessed: 2020-10-18. [Online]. Available: xilinx.com/products/design-tools/vivado/vivado-webpack.html

[23] "MNIST database," accessed: 2020-10-18. [Online]. Available: yann.lecun.com/exdb/mnist/

[24] Google, "Google Colaboratory (Colab) Introduction," accessed: 2020-10-18. [Online]. Available: colab.research.google.com/notebooks/intro.ipynb

[25] A. Baldominos, Y. Saez, and P. Isasi, "A survey of handwritten character recognition with MNIST and EMNIST," Applied Sciences, vol. 9, no. 15, p. 3169, 2019.

Thang Viet Huynh received his PhD degree in Electrical and Electronic Engineering from Graz University of Technology (TUGraz), Austria in 2012. He is currently working as a senior lecturer at the Faculty of Electronics and Telecommunication Engineering, Danang University of Science and Technology (DUT), The University of Danang (UDN), in Danang City, Vietnam. His research interests include embedded reconfigurable computing (FPGA), hardware implementations of deep learning models, edge computing, and respective applications.
