
An Efficient Implementation of Convolutional Neural Network With CLIP-Q Quantization on FPGA

Wei Cheng, Ing-Chao Lin, Senior Member, IEEE, and Yun-Yang Shih

Abstract— Convolutional neural networks (CNNs) have achieved tremendous success in the computer vision domain recently. The pursuit of better model accuracy drives up the model size and storage requirements of CNNs as well as the computational complexity. Therefore, Compression Learning by In-Parallel Pruning-Quantization (CLIP-Q) was proposed to reduce the vast weight storage requirements by using a few quantized segments to represent all weights in a CNN layer. Among various quantization strategies, CLIP-Q is suitable for hardware accelerators because it reduces model size significantly while maintaining the full-precision model accuracy. However, the current CLIP-Q approach did not consider the hardware characteristics, and it is not straightforward to map it to a CNN hardware accelerator. In this work, we propose a software-hardware codesign platform that includes a modified version of the CLIP-Q algorithm and a hardware accelerator, which consists of 5 × 5 reconfigurable convolutional arrays with input and output channel parallelization. Additionally, the proposed CNN accelerator maintains the same accuracy as a full-precision CNN on the Cifar-10 and Cifar-100 datasets.

Index Terms— Convolutional neural network, CLIP-Q, accuracy, energy, hardware implementation.

Manuscript received 2 April 2022; revised 23 June 2022; accepted 9 July 2022. Date of publication 4 August 2022; date of current version 29 September 2022. This work was supported in part by the Ministry of Science and Technology under Grant 110-2221-E-006-084-MY3 and Grant 109-2628-E-006-012-MY3, and in part by the Intelligent Manufacturing Research Center from the Featured Areas Research Center Program by the Ministry of Education, Taiwan. This article was recommended by Associate Editor J. Di. (Corresponding author: Ing-Chao Lin.) Wei Cheng and Ing-Chao Lin are with the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan (e-mail: [email protected]). Yun-Yang Shih is now with MediaTek Inc., Hsinchu 300, Taiwan. Digital Object Identifier 10.1109/TCSI.2022.3193031

I. INTRODUCTION

IN RECENT years, convolutional neural networks (CNNs) have been widely used in many applications, such as image classification [1]–[3], object detection [4]–[7], semantic segmentation [8]–[11], visual question answering [12]–[15], speech recognition [16], and self-driving cars [17]. CNNs achieve higher model accuracy than traditional image processing methods in the above applications given enough training data.

However, to achieve better accuracy, the number of layers as well as the complexity of CNN models has increased significantly. The increased model complexity leads to an exponentially growing computational time of a CNN. For example, CNN models with more than 100 layers, such as ResNet101 [18] and DenseNet121 [19], require a considerable amount of computing resources and memory space. In order to use computing resources and memory space more efficiently, quantization, which simplifies and optimizes the CNN model, has become a popular research field.

Quantization [31], [35], [37] constrains a data representation to a smaller set, for example, using an 8-bit fixed-point format to represent a 32-bit floating-point format. Because fewer bits are used to represent a number, quantization greatly reduces storage requirements. For example, the authors in [20] use 16-bit and 8-bit fixed-point formats to represent data. Binary neural networks (BNNs) [21], [22] and ternary neural networks (TNNs) [23] represent data in a CNN with less than two bits, which reduces the memory space requirement by more than sixteen-fold.

Recently, many attempts have been made to deal with model sparsity through model compression [35]–[37], and Compression Learning by In-Parallel Pruning-Quantization (CLIP-Q) has been proposed in [24], [25]. It quantizes the full-precision weights by combining pruning and weight quantization into a single learning framework during CNN model training. Weight fine-tuning is also applied after model training is completed. Full-precision weights are discarded while quantized weights are kept. There are significantly fewer quantized weights than full-precision weights, since these weights are compressed and stored in a sparse encoding format. Joint pruning and quantization help CLIP-Q achieve a near-zero accuracy drop compared with full-precision models.

Meanwhile, in order to accelerate CNN computation, many hardware CNN accelerators have been proposed. Instead of using a full-precision format, model weights, activations, and/or inputs are quantized. Because the computing units in the accelerators are designed according to the quantized data, the hardware CNN accelerator can effectively accelerate CNN computation. For example, if a BNN only uses +1 and −1 to represent inputs and weights, XNOR gates can be used to replace the multiplication in a BNN. The authors in [26] designed a highly parallelized hardware CNN accelerator based on a BNN using XNOR for multiplication. The authors in [27] proposed a BNN accelerator, in which all convolutional operations are binarized and unified, to achieve better performance and energy efficiency.
However, for BNN accelerators, due to the limited data precision of the binarized weights, these CNN hardware accelerators cannot achieve the accuracy of full-precision weights.
CLIP-Q quantizes CNN weights and maintains full-
precision accuracy. Meanwhile, due to high computational
complexity, it is a trend to design a hardware accelerator
to accelerate CNNs. However, CLIP-Q in [24], [25] did not
consider the hardware characteristics, and the method used to
apply CLIP-Q when designing a CNN hardware accelerator
is not straightforward. In order to design a CNN accelerator
with CLIP-Q, we thus propose a software-hardware codesign
platform that includes both the software flow and a hardware
accelerator. Based on the results obtained from the software flow, we design an efficient CNN hardware accelerator. The contributions of this paper can be summarized as follows:
• We propose a software-hardware codesign platform that includes both a software flow and a hardware accelerator. In the software flow, the parameters of CLIP-Q are determined. A CNN with the proposed CLIP-Q setup and adjustment only requires four 8-bit weights per layer and 8 bits for activations, and it still has the same accuracy as a full-precision CNN on Cifar-10 and Cifar-100.
• In the hardware accelerator, we propose a simple but effective weight decoder to retrieve weights during convolutional operations.
• We implement a hardware CNN accelerator with a parallel architecture and design a reconfigurable convolutional array that performs convolutional operations with various kernel sizes.
• The simulation results show that the proposed CNN hardware accelerator achieves better Giga Operations Per Second Per Watt (GOP/S/W) than the state-of-the-art approach.
The rest of the paper is organized as follows: Section II introduces the background, and Section III introduces the software-hardware codesign platform and the software flow on the platform. Section IV details the architecture of the CNN hardware accelerator. Section V introduces the experimental results, and Section VI concludes the paper.

II. BACKGROUND

A. CNN and Quantization

CNNs have been widely used in many applications, including image classification [1]–[3], object detection [4]–[7], semantic segmentation [8]–[11], visual question answering [12]–[15], speech recognition [16], and self-driving cars [17]. A CNN is mainly composed of convolutional, pooling, and fully connected layers, as shown in Figure 1. A CNN performs feature extraction through multiple convolutional layers and outperforms many current image processing methods.

Fig. 1. CNN overview and details of a convolutional layer.

However, with the increasing number of layers and model complexity, a CNN requires considerable memory and computing resources. As shown in Figure 1, with regard to memory usage, the weights of each convolutional layer require K∗K∗IC∗OC∗Wsize bytes of memory space, where K∗K is the kernel size, IC is the input channel, OC is the output channel, and Wsize is the number of bytes for each weight. With regard to computations, if additions are not counted, each convolutional layer still requires K∗K∗IC∗OC∗OW∗OH multiplications, where OW is the output width and OH is the output height. The larger the input and output channels, the more memory usage and computation are required, which leads to increased latency and energy consumption. Therefore, reducing memory usage and computation requirements to accelerate CNNs has attracted extensive attention.
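As a concrete illustration of these two cost formulas, the short Python sketch below evaluates them for a hypothetical 3 × 3 convolutional layer; the layer dimensions and the helper name are our own assumptions for illustration, not values taken from the paper.

```python
def conv_layer_cost(K, IC, OC, OW, OH, Wsize=4):
    """Weight storage (bytes) and multiplication count of one convolutional layer.

    K: kernel width/height, IC/OC: input/output channels,
    OW/OH: output width/height, Wsize: bytes per weight (4 for FP32, 1 for 8-bit).
    """
    weight_bytes = K * K * IC * OC * Wsize        # memory usage of the layer's weights
    multiplications = K * K * IC * OC * OW * OH   # multiplications only; additions not counted
    return weight_bytes, multiplications

# Hypothetical layer: 3x3 kernel, 192 -> 192 channels, 32x32 output feature map.
fp32_bytes, muls = conv_layer_cost(K=3, IC=192, OC=192, OW=32, OH=32, Wsize=4)
int8_bytes, _ = conv_layer_cost(K=3, IC=192, OC=192, OW=32, OH=32, Wsize=1)
print(fp32_bytes, int8_bytes, muls)               # 8-bit weights need 1/4 of the FP32 storage
```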
In a CNN, weight quantization is a widely used technique to reduce memory usage and computation demands. In weight quantization, weights are constrained to a set of discrete values, allowing the weights to be represented using fewer bits. Research on quantization includes [20] and [28]. These studies quantize data into 16-bit or 8-bit fixed-point formats using flexible quantization algorithms. Therefore, memory usage is only 1/2 to 1/4 of the original size while full-precision accuracy is still maintained.

To further reduce memory usage and computation, binary neural networks (BNNs) [21] and ternary neural networks (TNNs) [23] have been proposed. A BNN only uses +1 and −1 to represent data, while a TNN uses +1, 0, and −1 for data representation. These methods only require 1 or 2 bits to represent a weight or input in a CNN, which can significantly reduce memory usage. However, due to the limited precision of the data representation, the accuracy of BNNs and TNNs is lower than that of a CNN with full-precision weights.

B. CLIP-Q

CLIP-Q is a CNN quantization algorithm that combines pruning and quantization into a single learning framework. The joint pruning and quantization help CLIP-Q achieve the accuracy of full-precision weights with significantly reduced memory usage. The authors in [24], [25] showed that a CNN with CLIP-Q can preserve the same accuracy as a CNN with full-precision weights.

Figure 2 shows the four steps in CLIP-Q. The first step is clipping, where weights that are close to 0 are pruned to 0. The parameter P is the proportion of the weights that are pruned to 0. In Figure 2, P is set to 0.2, indicating that 20% of the positive weights will be changed to 0, and 20% of the negative weights will be changed to 0 as well.

Fig. 2. An example of CLIP-Q with 25 weights, P = 0.2, B = 2. (a) Original weights. (b) Clipping. (c) Partitioning. (d) Average. (e) Quantizing.

Note that the weights that are closer to 0 are selected first. In Figure 2(b), the weights with the gray background are the weights that become 0 after clipping.

The second step is partitioning. Given a predefined number B, this step divides the remaining weights into 2^B − 1 segments. In Figure 2(c), parameter B is set to 2; hence, the remaining weights are partitioned into three segments. The partitioning method used in this work is linear partitioning, which is the same as in [24], [25]. However, other partitioning methods can be used to improve accuracy. The blue, green, and orange blocks in Figure 2(c) are the three segments after partitioning.

The third step is averaging and quantizing. First, the average of all the numbers in each segment is computed. After that, the averages represent all the numbers in their segments. As shown in Figure 2(d), −1.02, −0.4, and 0.91 are the averages of the three segments, respectively. Then, the three numbers replace all the numbers in the blue, green, and orange blocks, as shown in Figure 2(e). Then, these weights are quantized. After CLIP-Q, if the number 0 is counted, the weights of a CNN layer have only 2^B different values. Therefore, these quantized weights can be stored in an array of B-bit weight indexes, and these indexes are decoded to retrieve the quantized weights during computation.
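The Python sketch below walks a small weight vector through the same clip, partition, average, and quantize steps using the linear partitioning described above; it is only a toy illustration of the procedure (in CLIP-Q proper these steps run in parallel with training and the full-precision weights are discarded only after training finishes), and the example values are ours.

```python
import numpy as np

def clipq_quantize(w, P=0.2, B=2):
    """Toy CLIP-Q pass over one layer: clip, partition (linear), average, quantize."""
    w = np.asarray(w, dtype=np.float64).copy()

    # 1) Clipping: the P fraction of positive and of negative weights closest to 0 become 0.
    for idx in (np.where(w > 0)[0], np.where(w < 0)[0]):
        k = int(len(idx) * P)
        if k:
            w[idx[np.argsort(np.abs(w[idx]))[:k]]] = 0.0

    # 2) Partitioning: split the surviving weights into 2^B - 1 equal-width segments.
    nz = w != 0
    edges = np.linspace(w[nz].min(), w[nz].max(), 2 ** B)
    seg = np.clip(np.digitize(w[nz], edges[1:-1]), 0, 2 ** B - 2)

    # 3)+4) Averaging and quantizing: every weight in a segment becomes the segment average.
    levels = np.array([w[nz][seg == s].mean() for s in range(2 ** B - 1)])
    q = w.copy()
    q[nz] = levels[seg]
    return q, levels          # counting 0, at most 2^B distinct values remain

q, levels = clipq_quantize([-1.3, -0.9, -0.4, -0.2, -0.05, 0.05, 0.2, 0.5, 0.8, 1.1, 1.3])
print(np.unique(q))           # the pruned 0 plus the three segment averages
```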
TABLE I. Network size comparison. CLIP-Q uses equal to or less than 8 bits to represent a weight.

Table I shows the network size and model accuracy of AlexNet [2], GoogLeNet [3], and ResNet50 [18] when a model is uncompressed and when a model is processed by CLIP-Q. It can be observed that the accuracy of CLIP-Q enabled models still matches the accuracy of the uncompressed ones with full-precision weights. In other words, CLIP-Q dramatically reduces the storage and computational requirements with minimum overhead, which makes CLIP-Q particularly suitable for CNN hardware accelerator designs.

Although CLIP-Q offers a great model compression rate and model accuracy, the original algorithm proposed in [24], [25] did not consider the characteristics of the hardware, and it is not straightforward to map a CLIP-Q enabled model to a CNN hardware accelerator. The original CLIP-Q algorithm offers a high degree of freedom, so a different parameter B is used for different layers (the weights in a layer will be quantized into 2^B segments), and this leads to an inefficient hardware design because the accelerator has to select weights from segments of varying length. In our design, we set the parameter B to 2 so that the model weights in all layers are quantized into 2^2 segments. Also, each weight is represented by 8 bits so that the 32-bit memory bandwidth can be fully utilized by reading 4 weights in a cycle. The software flow, detailed in the next section, determines the CNN model and related parameters that will be implemented in the CNN hardware accelerator. The details of the CNN hardware accelerator are explained in Section IV.

III. SOFTWARE AND HARDWARE CODESIGN PLATFORM

This section first gives an overview of our software and hardware codesign platform, which contains both the software flow and the hardware accelerator. Then, we present how we select the neural network model for our hardware accelerator. Finally, it describes how we determine the parameters of CLIP-Q and the bit width of the weights and activations.

A. Software and Hardware Codesign Platform Overview

Figure 3 shows the overview of this platform, which contains both a software flow and a CNN hardware accelerator. In the software flow, we first select a CNN model that is suitable for hardware implementation. After that, model information, such as the number of layers and the kernel size of each layer, is determined. Then, we set up the parameters P and B of CLIP-Q, and we determine the number of bits required for each weight and activation. Finally, the model training is completed on GPU servers to obtain the quantized weights.

Notice that the software flow described in the previous paragraph is very flexible. Users can choose a preferred CNN model, suitable CLIP-Q parameters, and bit widths for quantization according to their needs. Subsequently, after a proper CNN model is selected, the bit widths for the weights and activations can be determined. The proposed software flow compresses the model such that it can be easily mapped to the hardware accelerator while maintaining its model accuracy. The following subsections explain the details of each step in the software flow, and Section IV details the CNN hardware accelerator.


Fig. 3. Overview of the software and hardware codesign platform.

B. Neural Network Model Selection

The first step is to determine a suitable CNN model for the hardware implementation. One important factor in determining a suitable model is the size of the available on-chip block RAM memory (BRAM). Since the latency and energy required for off-chip DRAM memory access are much greater than those for on-chip memory access, if the model weights can be stored in the on-chip BRAM, the accelerator will have lower latency and less energy consumption. The implementation platform used in this work is Xilinx's XC7Z020 FPGA, which has only 630 KB of on-chip BRAM. When determining a model, we prefer to select a model where as many weights as possible can be stored in the on-chip memory. Note that, based on user requirements, different models can be chosen.

TABLE II. Comparison between neural network models. Cifar-10 is used.

Table II compares four candidate models: AlexNet [2], VGG7 [29], GoogLeNet [3], and Network in Network (NIN) [30]. The first column is the model's name, and the second column is the model structure. The third column is the number of parameters, and the fourth column is the accuracy. "Conv" stands for a convolutional layer, and "FC" stands for a fully-connected layer. As seen from Table II, NIN has the lowest number of weights and accuracy that is comparable to the other models. In addition, since there is no fully connected layer in NIN, its structure is simpler than that of the other three models. Hence, NIN was selected to be implemented in our CNN accelerator.

C. CLIP-Q Setup and Adjustment

CLIP-Q is a weight pruning and quantization technique that is able to maintain the same accuracy as full-precision weights while significantly reducing weight storage. Hence, it is suitable to apply CLIP-Q to a CNN. In the first step, the CNN model is selected. In the second step, the parameters used in CLIP-Q are determined. There are two parameters. The first is the clipping parameter P, which indicates that P% of the positive weights and P% of the negative weights will be clipped to 0. In this work, the clipping parameter P is set to 20 according to [24], [25] and our experimental results.

The second parameter is B, which indicates that the weights in a layer are divided into 2^B segments for further averaging and quantizing. It also means that the number of distinct weights for a layer is 2^B. To reduce weight storage, we set B to 2 for all layers in the CNN model, which is the minimum value of B. Even so, the range of the weight representation can still cover positive numbers, 0, and negative numbers.
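As a rough illustration of why this setup saves storage, the Python sketch below counts the bits needed for one layer under these settings: a codebook of 2^B = 4 eight-bit weights plus one 2-bit index per weight position. The layer size and the 16x figure in the comment are our own illustrative assumptions.

```python
def clipq_layer_storage_bits(num_weights, B=2, weight_bits=8):
    """Approximate weight storage of one CLIP-Q layer: codebook plus per-weight indexes."""
    codebook_bits = (2 ** B) * weight_bits   # the four 8-bit representative weights of the layer
    index_bits = num_weights * B             # one 2-bit index per weight position
    return codebook_bits + index_bits

n = 3 * 3 * 192 * 192                         # hypothetical 3x3, 192 -> 192 channel layer
clipq_bits = clipq_layer_storage_bits(n)
fp32_bits = n * 32                            # the same layer with full-precision weights
print(clipq_bits / 8 / 1024, "KiB vs", fp32_bits / 8 / 1024, "KiB")  # roughly a 16x reduction
```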
D. Weight and Activation Width Determination

After the parameters of CLIP-Q are determined, the next step is to determine how many bits are used to represent a weight and an activation. In this step, we developed an in-house tool in C++ and Python to analyze the accuracy of a 9-layer NIN when different bit widths are used for the weights.

TABLE III. Accuracy comparison of different bit-width quantizations of NIN.

Table III shows the accuracy of the 9-layer NIN when different bit widths are used to represent a weight. The accuracy of the neural network with an 8-bit weight width is almost equal to that of the full precision. Hence, an 8-bit weight width is used in this work.

Aside from the bit width, the position of the decimal point directly affects the numerical representation range and precision. Thus, we also need to choose an appropriate decimal point position.

Fig. 4. Weight distribution of each layer.

Figure 4 shows the numerical distribution of the weights of each layer in the full-precision NIN model. It can be seen that the distribution of the weights of the last three layers is the widest, ranging approximately between +4 and −4. Therefore, an appropriate weight format must cover the range between +4 and −4. In addition, most of the weights in the first three layers of NIN are close to 0. In order to represent the weights of the first three layers precisely, a certain number of fraction bits is required. Based on these observations, we choose an 8-bit data format with a 3-bit integer and a 5-bit fraction for the weights. The range of this data format is +3.96875 to −4, which is quite close to the weight distribution, and the 5-bit fraction is adequate to represent the precision required for the weights.


TABLE IV. Accuracy comparison of different fraction point positions of the 8-bit data format in the NIN model with CLIP-Q.

Finally, we determine the data format for the activations. This is important because, even though the weights are quantized to 8 bits, if the activations between each layer still use a full-precision 32-bit format, the accelerator still requires complex computational circuits and a lot of memory. Therefore, it is necessary to determine the bit width and the location of the decimal point for the activations such that computation and memory usage are reduced without significantly sacrificing accuracy. Since the bit width of the weights is 8, the bit width of the activations is also set to 8 to match the width of the weights. Regarding the decimal point location, we train the neural network with different decimal point positions in the activations and analyze the resulting accuracy, as shown in Table IV. From Table IV, we can see that the format with a 3-bit integer and a 5-bit fraction has the highest accuracy. Therefore, an 8-bit activation format with a 3-bit integer and a 5-bit fraction was selected. The results are consistent with the experimental results in [30]. Therefore, it is appropriate to set the bit width of the weights and activations between each layer to 8 bits with a 3-bit integer and a 5-bit fraction.
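A minimal Python sketch of this number format is given below: a value is mapped to the signed 8-bit fixed-point format with a 3-bit integer part and a 5-bit fraction (step 1/32, range −4 to +3.96875). The rounding and saturation behavior shown here are our assumptions, since the paper does not spell them out.

```python
def to_q3_5(x):
    """Quantize a float to the signed 8-bit fixed-point format (3-bit integer, 5-bit fraction)."""
    step = 1.0 / 32.0                        # 5 fraction bits
    code = int(round(x / step))              # nearest representable level (assumed rounding mode)
    code = max(-128, min(127, code))         # saturate to the two's-complement 8-bit range
    return code, code * step                 # the raw 8-bit code and the value it represents

for v in (0.07, -1.02, 0.91, 5.0):
    code, q = to_q3_5(v)
    print(f"{v:+.3f} -> code {code:4d} -> {q:+.5f}")   # 5.0 saturates at +3.96875
```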
IV. CNN HARDWARE ACCELERATOR

This section first gives an overview of the CNN hardware accelerator architecture. Then, it details the design used in the accelerator to improve the parallelism. Finally, it details the input and output channel parallelism used to improve the performance and the design of a reconfigurable convolutional array that performs convolutions with various kernel sizes.

A. Accelerator Overview

Fig. 5. CNN accelerator architecture overview.

Figure 5 shows an overview of the accelerator architecture. There are five on-chip BRAMs in the architecture. The Param BRAM stores various CNN parameters, including the input channel size, the output channel size, the kernel size, and the stride. The Input BRAM and Output BRAM store the input and output data of a layer. The Weight Index BRAM stores the 2-bit weight indexes for that layer. The CLIP-Q BRAM stores the quantized weights of each layer. The ZYNQ CPU controls the transfer of data between the off-chip DRAM memory and the on-chip BRAMs through the AXI protocol.

The main controller receives information from the CPU and controls the entire execution. The Conv Unit is the circuit that performs the convolutional operations. There are four convolutional modules (Conv Modules) inside the Conv Unit, and each Conv Module contains four reconfigurable convolutional arrays (RCAs). Each reconfigurable convolutional array has 25 processing elements (PEs). The Conv Unit obtains weights through the weight decoder. The designs of the weight decoder and the reconfigurable convolutional array are detailed in later subsections.

The execution flow of the CNN hardware accelerator is as follows. First, all input and weight data are placed in the off-chip DRAM memory. The ZYNQ CPU controls the DMA through the AXI bus and puts the data into the corresponding BRAMs. After the data are placed, the CPU instructs the main controller to begin the operations. The Conv Unit performs the convolutional operations using the input data and the decoded weights. The quantized activations are 8-bit. The output is stored in the Output BRAM. After a layer finishes computing, the roles of the Input BRAM and Output BRAM are exchanged. Therefore, each layer only has to read the weights of the layer from the off-chip DRAM memory. After the last convolutional layer finishes computing, the DMA moves the results from the Output BRAM back to the off-chip DRAM memory.


Fig. 6. Details of the weight decoder.

Section III discussed the fact that, after the CLIP-Q setup and adjustment, the weight bit width of every layer is the same. In addition, since we limit the number of segments in weight partitioning to 4, when designing the weight decoder we only have to use a multiplexer that has 4 inputs and a 2-bit selector to decode the weights of each layer. Figure 6 shows the design of the weight decoder, which contains four registers and a multiplexer. The four registers store the four 8-bit weights of a layer. The Weight Index BRAM provides the 2-bit index of the weights. Based on the weight index, the decoder can select the proper weight that will be used in the Conv Unit. In this way, a decoder with a small area can be built. Note that it is possible to have different weight bit widths for each layer. However, supporting different weight bit widths for different layers would increase the complexity of the hardware design.
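The decoder's behavior can be modeled in a few lines of Python: the four registers act as a per-layer codebook and the 2-bit index drives a 4-to-1 multiplexer. Packing sixteen 2-bit indexes into one 32-bit Weight Index BRAM word, and the example codebook values, are our own illustrative assumptions.

```python
def decode_weights(codebook, index_word, count=16):
    """Model of the weight decoder: select 8-bit weights from the four per-layer registers.

    codebook:   the four 8-bit quantized weights of the current layer (the four registers)
    index_word: a 32-bit word holding 2-bit weight indexes (assumed packing of the index BRAM)
    """
    assert len(codebook) == 4
    weights = []
    for i in range(count):
        sel = (index_word >> (2 * i)) & 0b11   # 2-bit selector of the 4-to-1 multiplexer
        weights.append(codebook[sel])
    return weights

layer_codebook = [0, -33, -4, 29]              # example codes; 0 is kept for the pruned weights
print(decode_weights(layer_codebook, 0b11100100_11100100_11100100_11100100))
```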

B. Input and Output Channel Parallelization

Fig. 7. Conv Unit design.

In order to improve performance, our design improves the parallelization of the memory accesses and the computational operations in both the input and output channels. Figure 7 shows the detailed architecture of the Conv Unit shown in Figure 5. To speed up the operations, a parallel architecture is designed. According to the quantization results in Section III, each input datum has only 8 bits. Regarding the memory bandwidth, a standard BRAM has a 32-bit word. Therefore, the on-chip BRAM can read or write four 8-bit data in the same cycle.

Fig. 8. Conv Module with input and output channel parallelization.

To make the most of the input and output memory bandwidth, our circuits are designed to improve the parallelization of the input and output channels. Figure 8 shows the Conv Module design that improves the parallelism. An input image is considered to be 3-dimensional (3D) data because it normally has a width, a height, and channels. To improve the parallelism, a channel-major layout is used in this work to store the 3D input data in the on-chip BRAM. Therefore, the input data of 4 channels at the same position can be retrieved in the same cycle.
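A small Python sketch of this channel-major layout follows: the 8-bit activations of four consecutive channels at the same (y, x) position are packed into one 32-bit word, so a single BRAM read returns all four values. The exact packing order and word addressing are our own assumptions for illustration.

```python
import numpy as np

def pack_channel_major(fmap):
    """Pack a (C, H, W) 8-bit feature map so that 4 consecutive channels share one 32-bit word."""
    C, H, W = fmap.shape
    assert C % 4 == 0
    words = np.zeros((C // 4, H, W), dtype=np.uint32)
    for g in range(C // 4):
        for c in range(4):                                    # channel offset inside the 4-wide group
            words[g] |= fmap[4 * g + c].astype(np.uint32) << (8 * c)
    return words                                              # one word = 4 channels at one position

fmap = np.arange(4 * 2 * 2, dtype=np.uint8).reshape(4, 2, 2)  # hypothetical 4x2x2 feature map
packed = pack_channel_major(fmap)
print(hex(int(packed[0, 0, 0])))                              # channels 0..3 at position (0, 0)
```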
In Figure 8, four convolutional computing arrays execute in parallel, and their results are added through the adder tree. After the computational results enter the accumulator (accum), they are accumulated and stored in the partial-sum BRAM (Psum BRAM) in the Conv Module. After the data of the last input channel enter the Conv Module, four 8-bit output activations are generated in parallel from the four Conv Modules and combined into a 32-bit output to be written into the Output BRAM.

Similarly, the output memory has the same bandwidth as the input memory; that is, four 8-bit outputs can be written in one cycle. In addition, in the convolutional operation, the same input data are calculated with different weights to obtain the outputs of different channels. This is also exploited in our design to improve the parallelism in the output channel. The corresponding circuit is not shown, but it is similar to the circuit described in the previous paragraph and is implemented in four copies, corresponding to four consecutive output channels. After the input data enter the circuit, four 8-bit outputs are generated from the four circuits. Finally, by directly combining the four 8-bit outputs into a 32-bit output, the outputs of four consecutive channels can be written to the output memory in one cycle. Therefore, through output channel parallelization, the required computation time is only one fourth of the original time.

C. Reconfigurable Convolutional Array

Since the NIN model consists of convolutional layers with different kernel sizes, to adapt to various kernel sizes during the convolutional operations, we designed a reconfigurable convolutional array that can perform convolutional operations for various kernel sizes.


Fig. 9. Reconfigurable convolutional array. It contains 25 PEs that are connected in series.

Figure 9 shows the hardware architecture of the reconfigurable convolutional array. Within the reconfigurable convolutional array, there are 25 processing elements (PEs), where each PE has a multiplier, an adder, and a register. The results of the multiplication and addition are saved in the register. First, each weight is stored in its corresponding PE. Then, the same input data is broadcast to each PE and multiplied by the weight stored in the corresponding PE. Finally, the multiplication result is added to the result from the previous PE and stored in the register. This architecture reduces weight movement by reusing weights, in turn reducing energy consumption.

Fig. 10. Reconfigurable convolutional array for different kernel sizes.

Figure 10 shows how the reconfigurable convolutional computing array carries out convolution for various kernel sizes. When the kernel size is less than 5 × 5, such as 1 × 1 or 3 × 3, a kernel only occupies the same number of PEs as its size. For instance, each 1 × 1 kernel occupies one PE in the array, so in total twenty-five 1 × 1 kernels can be placed in the reconfigurable convolutional array. As for 3 × 3 kernels, each kernel occupies 9 PEs, and two 3 × 3 kernels can be placed in the array at a time. Given a kernel size equal to 5 × 5, all PEs are occupied for the convolution. When the kernel size is greater than 5 × 5, the system completes the convolution by dividing the operations among several reconfigurable convolutional arrays, where each array performs a convolutional operation on up to 25 inputs. Take a 7 × 7 kernel as an example, as shown in Figure 10. The kernel can be split into two smaller kernels, whose sizes are 25 and 24, respectively. After using the two kernels to perform the convolutional operation, the results are added together to complete the convolutional operation for kernel sizes larger than 5 × 5. Hence, although there are only 25 PEs in the proposed reconfigurable convolutional array, it can perform convolutional operations for kernel sizes larger than 5 × 5.
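This splitting rule can be expressed as a short functional model in Python: the flattened kernel is cut into chunks of at most 25 weights, each chunk is assigned to one 25-PE array, and the partial sums are added. This is only a model of the dataflow under our assumptions, not the Verilog implementation.

```python
import numpy as np

def conv_point_with_25pe_arrays(window, kernel):
    """Compute one output value of a KxK convolution using tiles of at most 25 weights."""
    x = window.ravel()                       # KxK input patch at one sliding-window position
    w = kernel.ravel()
    total = 0
    for start in range(0, w.size, 25):       # each chunk maps onto one reconfigurable array
        total += np.dot(x[start:start + 25], w[start:start + 25])   # partial sum of <= 25 MACs
    return total

kernel = np.random.randint(-4, 4, size=(7, 7))   # a 7x7 kernel splits into chunks of 25 and 24
patch = np.random.randint(0, 8, size=(7, 7))
assert conv_point_with_25pe_arrays(patch, kernel) == int(np.sum(patch * kernel))
```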
To take advantage of data reuse, the sliding window moves downwards, so the partial sum of the previous operation can be reused. One row of kernel data can generate a valid output. Taking a 3 × 3 kernel as an example, one output is generated after every three input data values enter the array. In other words, an output is produced every three clock cycles on average. According to the row size of the kernel, the number of cycles needed to generate an output can be determined.

V. EXPERIMENTAL SETUP AND RESULTS

This section first introduces the experimental environment. Then, it details the accuracy and performance comparisons.

A. Experimental Setup

The neural network is built with Python, and CLIP-Q is used to quantize the model weights, where P is set to 20% and B is set to 2. The bit widths of the weights and activations are 8 bits, and the computational data are quantized to 8 bits. Therefore, during CNN inference, the data formats in the software and hardware computations are equivalent, and the inference accuracies are also the same. After training, the parameters with the highest accuracy are saved; i.e., the model's weights and biases are saved into a file to facilitate the hardware implementation.

This design is implemented on the PYNQ-Z2 FPGA development board, whose FPGA chip is the XC7Z020. The design is implemented in Verilog, and the development software is Vivado (v2018.3). The hardware resources utilized in the system are shown in Table V. Due to the limited number of DSPs, the 8-bit multipliers are synthesized using LUTs; therefore, the utilization rate of the LUTs is relatively high. The DSPs are mainly used by the controller circuit to calculate data addresses. Compared to DSPs, implementing 8-bit multipliers with LUTs can reduce energy consumption. The Input BRAM and Output BRAM, which contain the input and output data for a layer, account for a large part of the BRAM usage.


TABLE V. Resource utilization on the XC7Z020.

TABLE VI. Accuracy comparison of different quantization algorithms on the NIN model.

B. Accuracy Comparison

After quantization and the CLIP-Q fixed-segment adjustments, there are only four 8-bit weights per layer, which saves a considerable amount of storage. However, it is important that the accuracy of the model is maintained. If the accuracy can be maintained, the proposed CLIP-Q is suitable for the quantization implemented in the CNN accelerator. Table VI compares the Cifar-10 and Cifar-100 accuracy of models using different quantization methods. All the accuracies in Table VI are generated by experiments on the 9-layer NIN model. FULL means that the 32-bit floating-point full-precision data format is used. 8-bit means that the input and weight data are quantized to 8-bit precision. TNN means that all data are represented by +1, 0, and −1, and BNN only uses +1 and −1 to represent data.

It can be seen from Table VI that although TNN and BNN save a lot of storage space for weights and inputs, their accuracy is reduced. Especially on Cifar-100, the accuracy differs significantly from the full precision. The accuracy of the 8-bit precision is closer to the full precision, but there is still a 2% drop on the Cifar-100 test data. However, the adjusted 8-bit CLIP-Q has almost the same accuracy as the full precision, and only four 8-bit weights are needed. The proposed CLIP-Q significantly reduces storage space and can achieve almost the same accuracy as full precision.

C. Performance Comparison

TABLE VII. Comparison of input feature map reading times for different kernel sizes.

Table VII shows the required cycles to read the input feature maps for various kernel sizes in the reconfigurable convolution design. In the proposed 5 × 5 reconfigurable convolutional array, as long as the kernel size is not larger than 25 (= 5 × 5), it is only necessary to read the input feature map once to complete the convolution. For convolution kernel sizes larger than 5 × 5, the kernel is divided into smaller kernels, each of which is equal to or smaller than 25. Therefore, the number of input accesses for large kernels is equal to the kernel size divided by 25, rounded up. Compared to [32], where convolutions of different kernel sizes were completed using a 3 × 3 kernel size, our reconfigurable design reduces the input data access time. Thus, it can also complete the convolution faster than was the case in [32].
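This read-count rule can be written down directly; the Python snippet below computes how many times the input feature map must be streamed in for a given kernel size in the proposed 5 × 5 array. The comparison count for a 3 × 3-convolver design such as [32] is only our own rough estimate, not a figure from the paper.

```python
import math

def input_reads_5x5_array(K):
    """Input feature map reads needed by the proposed design for a KxK kernel."""
    return math.ceil(K * K / 25)     # a single read as long as the kernel fits into 25 PEs

def input_reads_3x3_convolver(K):
    """Rough estimate for a 3x3-convolver-based design (our assumption, for comparison only)."""
    return math.ceil(K / 3) ** 2

for K in (1, 3, 5, 7, 11):
    print(K, input_reads_5x5_array(K), input_reads_3x3_convolver(K))
```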
TABLE VIII. Comparison between different implementations on FPGA.

Table VIII shows a comparison between the proposed design and related work. Because the proposed 5 × 5 convolutional array improves the input and output channel parallelism, the overall GOP/S performance is increased. Furthermore, since the multipliers were synthesized using LUTs, the power consumption is lower than when using DSPs. Also, we use fewer flip-flops in the convolution circuit, which further reduces the power consumption. According to Table VIII, our CNN accelerator's GOP/S/W is higher than those of the other works, which means the proposed design has higher energy efficiency.


VI. RELATED WORK

In this section, we discuss previously designed FPGA-based CNN accelerators. Angel-Eye [32] proposed a software-hardware codesign flow for embedded CNN applications and used 3 × 3 convolvers to handle the computational workloads of various kernel sizes; however, the utilization rate of its 3 × 3 convolvers is only 1/9 when dealing with 1 × 1 kernels. Instead of designing the accelerator directly, high-level synthesis was used to generate the design with the help of a roofline model in [33]. It measured the compute and memory requirements of each layer of a CNN model and came up with suitable architectures that efficiently utilize the memory bandwidth. However, it mapped full-precision CNN models directly to FPGAs without considering the underlying hardware costs, and common strategies such as data quantization and model pruning were not applied. An end-to-end FPGA-based CNN accelerator aiming for high throughput and high resource utilization was proposed in [34]. Since different layers have different compute-to-memory ratios, it proposed a batch-based method for the fully connected layers to better utilize the memory bandwidth. It adopted 16-bit data quantization for the input and weight data; however, its models were not pruned, which makes it unfriendly to resource-limited FPGAs.

VII. CONCLUSION

CLIP-Q significantly reduces the CNN weight storage requirement while also maintaining accuracy. This feature makes CLIP-Q suitable for CNN hardware accelerators. However, the current CLIP-Q approach did not consider the hardware characteristics, and the method for applying CLIP-Q when designing a CNN hardware accelerator was not straightforward. In this work, we propose a software-hardware codesign platform that includes both the software flow and the hardware accelerator. The software flow obtains neural model parameters suitable for hardware implementation. We also designed a CNN hardware accelerator. The accelerator executes convolutions with various kernel sizes through 5 × 5 reconfigurable convolutional arrays and improves parallelism in both the input and output channels. The experimental results show that the proposed CNN accelerator has higher energy efficiency than the state-of-the-art alternatives.

REFERENCES

[1] Y. Wei et al., "HCP: A flexible CNN framework for multi-label image classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 9, pp. 1901–1907, Jun. 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097–1105.
[3] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[4] S. Gidaris and N. Komodakis, "Object detection via a multi-region and semantic segmentation-aware CNN model," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1134–1142.
[5] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 379–387.
[6] W. Liu et al., "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis. (ECCV), Cham, Switzerland: Springer, 2016, pp. 21–37.
[7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," 2014, arXiv:1412.7062.
[9] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[10] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe, "Full-resolution residual networks for semantic segmentation in street scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4151–4160.
[11] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," 2015, arXiv:1511.07122.
[12] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, "Learning to reason: End-to-end module networks for visual question answering," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 804–813.
[13] M. Malinowski, M. Rohrbach, and M. Fritz, "Ask your neurons: A neural-based approach to answering questions about images," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1–9.
[14] H. Noh, P. H. Seo, and B. Han, "Image question answering using convolutional neural network with dynamic parameter prediction," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 30–38.
[15] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, "Stacked attention networks for image question answering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 21–29.
[16] D. Palaz et al., "Analysis of CNN-based speech recognition system using raw speech as input," Idiap, Martigny, Switzerland, Tech. Rep. Idiap-RR-23-2015, 2015.
[17] M. Bojarski et al., "End to end learning for self-driving cars," 2016, arXiv:1604.07316.
[18] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[19] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
[20] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), 2016, pp. 26–35.
[21] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1," 2016, arXiv:1602.02830.
[22] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst. (NIPS), 2016, pp. 4114–4122.
[23] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," 2016, arXiv:1612.01064.
[24] F. Tung and G. Mori, "CLIP-Q: Deep network compression learning by in-parallel pruning-quantization," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7873–7882.
[25] F. Tung and G. Mori, "Deep neural network compression by in-parallel pruning-quantization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 3, pp. 568–579, Mar. 2018.
[26] H. Yonekawa and H. Nakahara, "On-chip memory based binarized convolutional deep neural network applying batch normalization free technique on an FPGA," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), May 2017, pp. 98–105.
[27] P. Guo, H. Ma, R. Chen, P. Li, S. Xie, and D. Wang, "FBNA: A fully binarized neural network accelerator," in Proc. 28th Int. Conf. Field Program. Log. Appl. (FPL), Aug. 2018, pp. 51–513.
[28] B. Jacob et al., "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 2704–2713.
[29] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[30] M. Lin, Q. Chen, and S. Yan, "Network in network," 2013, arXiv:1312.4400.
[31] T. Liang, J. Glossner, L. Wang, S. Shi, and X. Zhang, "Pruning and quantization for deep neural network acceleration: A survey," Neurocomputing, vol. 461, pp. 370–403, Oct. 2021.
[32] K. Guo et al., "Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 1, pp. 35–47, Jan. 2017.
[33] C. Zhang et al., "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), 2015, pp. 161–170.
[34] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, "A high performance FPGA-based accelerator for large-scale convolutional neural networks," in Proc. 26th Int. Conf. Field Program. Log. Appl. (FPL), Aug. 2016, pp. 1–9.
[35] Z. Song et al., "DRQ: Dynamic region-based quantization for deep neural network acceleration," in Proc. ACM/IEEE 47th Annu. Int. Symp. Comput. Archit. (ISCA), May 2020, pp. 1010–1021, doi: 10.1109/ISCA45697.2020.00086.
[36] X. Zhou et al., "Cambricon-S: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach," in Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2018, pp. 15–28, doi: 10.1109/MICRO.2018.00011.
[37] S. Q. Zhang, B. McDanel, H. T. Kung, and X. Dong, "Training for multi-resolution inference using reusable quantization terms," in Proc. 26th ACM Int. Conf. Architectural Support Program. Lang. Operating Syst. (ASPLOS), Apr. 2021, pp. 845–860, doi: 10.1145/3445814.3446741.


Wei Cheng received the B.E. degree in computer engineering from The University of Hong Kong in 2018. He is currently pursuing the master's degree with the Department of Computer Science and Information Engineering, National Cheng Kung University. His research interests lie in the field of very large-scale integration design, computer architecture, and deep neural network accelerators.

Ing-Chao Lin (Senior Member, IEEE) received the M.S. degree in computer science from National Taiwan University, Taipei, Taiwan, and the Ph.D. degree from the Computer Science and Engineering Department, The Pennsylvania State University, State College, PA, USA, in 2007. From 2007 to 2009, he was with Real Intent Inc., Sunnyvale, CA, USA. Since 2009, he has been with the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, where he is currently a Full Professor. He was a Visiting Scholar at the University of California, Santa Barbara, in 2015, and a Visiting Scholar at Academia Sinica in 2017. His current research interests include very large-scale integration design and computer-aided design for nanoscale silicon, energy-efficient reliable system design, and computer architecture. He has served on the technical program committees of several conferences, such as ASP-DAC, ICCAD, ICCD, and GLSVLSI. He has been an ACM Senior Member since May 2016. He was awarded the Excellent Young Researcher award by the Chinese Institute of Electrical Engineering in 2015, the Best Young Professionals (formerly GOLD) award by the IEEE Tainan Section in 2016, and the Humboldt Fellowship for Experienced Researchers in 2019.

Yun-Yang Shih received the M.S. degree in computer science and information engineering from National Cheng Kung University in 2020. He is currently with MediaTek Inc. His research interests lie in the field of very large-scale integration design and deep neural network accelerators.
