An Efficient Implementation of Convolutional Neural Network With CLIP-Q Quantization On FPGA
Abstract— Convolutional neural networks (CNNs) have achieved tremendous success in the computer vision domain recently. The pursuit of better model accuracy drives up the model size and storage requirements of CNNs as well as their computational complexity. Therefore, Compression Learning by In-Parallel Pruning-Quantization (CLIP-Q) was proposed to reduce the vast weight storage requirement by using a few quantized segments to represent all weights in a CNN layer. Among various quantization strategies, CLIP-Q is suitable for hardware accelerators because it reduces model size significantly while maintaining the full-precision model accuracy. However, the current CLIP-Q approach does not consider hardware characteristics and is not straightforward to map onto a CNN hardware accelerator. In this work, we propose a software-hardware codesign platform that includes a modified version of the CLIP-Q algorithm and a hardware accelerator, which consists of 5 × 5 reconfigurable convolutional arrays with input and output channel parallelization. Additionally, the proposed CNN accelerator maintains the same accuracy as a full-precision CNN on the Cifar-10 and Cifar-100 datasets.

Index Terms— Convolutional neural network, CLIP-Q, accuracy, energy, hardware implementation.

I. INTRODUCTION

… exponentially growing computational time of a CNN. For example, CNN models with more than 100 layers, such as ResNet101 [18] and DenseNet121 [19], require a considerable amount of computing resources and memory space. In order to use computing resources and memory space more efficiently, quantization, which simplifies and optimizes the CNN model, has become a popular research field.

Quantization [31], [35], [37] constrains a data representation to a smaller set, for example, using an 8-bit fixed-point format to represent 32-bit floating-point values. Because fewer bits are used to represent a number, quantization greatly reduces storage requirements. For example, the authors in [20] use 16-bit and 8-bit fixed-point formats to represent data. Binary neural networks (BNNs) [21], [22] and ternary neural networks (TNNs) [23] represent data in a CNN with less than two bits, which reduces the memory space requirement by more than sixteenfold.
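As a rough illustration of the fixed-point idea (a minimal sketch; the 5-bit fraction split below is an assumption for illustration, not the scheme used by any of the cited works), the following Python snippet quantizes 32-bit floating-point values to a signed 8-bit fixed-point code and recovers the approximate values:

```python
import numpy as np

def to_fixed8(x, frac_bits=5):
    """Quantize float values to signed 8-bit fixed point (illustrative only).

    Values are scaled by 2**frac_bits, rounded, and clipped to the
    representable range [-128, 127] of an 8-bit two's-complement integer.
    """
    scaled = np.round(x * (1 << frac_bits))
    return np.clip(scaled, -128, 127).astype(np.int8)

def from_fixed8(q, frac_bits=5):
    """Recover an approximate float value from the 8-bit fixed-point code."""
    return q.astype(np.float32) / (1 << frac_bits)

w = np.float32([0.37, -1.62, 3.9, -4.2])
q = to_fixed8(w)
print(q, from_fixed8(q))  # [ 12 -52 125 -128] [ 0.375 -1.625 3.90625 -4. ]
```

Each value now occupies one byte instead of four; 1-bit and 2-bit codes push the same idea further at the cost of accuracy, which is where the BNN and TNN trade-off comes in.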
Recently, many attempts have been made to deal with model sparsity through model compression [35]–[37], and Compression Learning by In-Parallel Pruning-Quantization (CLIP-Q) has been proposed in [24], [25]. It quantizes the
TABLE I
NETWORK SIZE COMPARISON. CLIP-Q USES EQUAL TO OR LESS THAN 8 BITS TO REPRESENT A WEIGHT
Fig. 3. Overview of software and hardware codesign platform.

TABLE II
COMPARISON BETWEEN NEURAL NETWORK MODELS. CIFAR-10 IS USED

B. Neural Network Model Selection

The first step is to determine a suitable CNN model for the hardware implementation. One important factor in determining a suitable model is the size of the available on-chip block RAM (BRAM). Since the latency and energy required for off-chip DRAM access are much greater than those for on-chip memory access, if the model weights can be stored in on-chip BRAM, the accelerator will have lower latency and less energy consumption. The implementation platform used in this work is Xilinx's XC7Z020 FPGA, which has only 630 KB of on-chip BRAM. When determining a model, we therefore prefer a model whose weights can be stored in on-chip memory as completely as possible. Note that, based on user requirements, different models can be chosen.

Table II compares four model candidates: AlexNet [2], VGG7 [29], GoogLeNet [3], and Network in Network (NIN) [30]. The first column is the model's name, and the second column is the model structure. The third column is the number of parameters, and the fourth column is the accuracy. "Conv" stands for a convolutional layer, and "FC" stands for a fully-connected layer. As seen from Table II, NIN has the lowest number of weights and accuracy that is comparable to the other models. In addition, since there is no fully-connected layer in NIN, its structure is simpler than that of the other three models. Hence, NIN was selected to be implemented in our CNN accelerator.
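As a quick sanity check of this constraint (a sketch only; the layer shapes below are hypothetical placeholders, not the actual NIN configuration from Table II), the total weight storage at 8 bits per weight can be compared against the 630 KB BRAM budget:

```python
# Hypothetical layer list: (name, kernel_h, kernel_w, in_channels, out_channels).
# These shapes are illustrative only and are not the actual NIN configuration.
layers = [
    ("conv1", 5, 5, 3, 192),
    ("cccp1", 1, 1, 192, 160),
    ("cccp2", 1, 1, 160, 96),
]

BYTES_PER_WEIGHT = 1        # 8-bit quantized weights
BRAM_BYTES = 630 * 1024     # on-chip BRAM budget of the target FPGA

total = sum(kh * kw * cin * cout * BYTES_PER_WEIGHT
            for _, kh, kw, cin, cout in layers)
print(f"weight storage: {total / 1024:.1f} KB, fits in BRAM: {total <= BRAM_BYTES}")
```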
C. CLIP-Q Setup and Adjustment

CLIP-Q is a weight pruning and quantization technique that is able to maintain the accuracy of full-precision weights while significantly reducing weight storage. Hence, it is suitable to apply CLIP-Q to a CNN. In the first step, the CNN model is selected. In the second step, the parameters used in CLIP-Q are determined. There are two parameters. The first is the clipping parameter P, which indicates that P% of the positive weights will be clipped to 0 and that P% of the negative weights will be clipped to 0 as well. In this work, the clipping parameter P is set to 20 according to [24], [25] and our experimental results.

The second parameter is B, which indicates that the weights in a layer are divided into 2^B segments for further averaging and quantizing; equivalently, a layer uses only 2^B distinct quantized weight values. To reduce weight storage, we set B to 2, the minimum value, for all layers in the CNN model. Even so, the range of the weight representation can still cover positive numbers, 0, and negative numbers.
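A minimal standalone sketch of the per-layer clip-and-quantize step controlled by P and B is shown below. The magnitude-based choice of which P% of weights to clip and the equal-count partition into 2^B segments are assumptions made for illustration; in CLIP-Q proper, the pruning and quantization decisions are learned in parallel with training [24], [25].

```python
import numpy as np

def clipq_layer(w, p=20, b=2):
    """Illustrative, standalone clip-and-quantize step for one layer's weights.

    Assumptions (not taken verbatim from CLIP-Q): the p% of positive and p% of
    negative weights closest to zero are clipped, and the surviving weights are
    split into 2**b equal-count segments, each replaced by its average value.
    """
    w = w.copy()
    pos, neg = w > 0, w < 0
    if pos.any():
        thr = np.percentile(w[pos], p)        # smallest p% of positive weights
        w[pos & (w <= thr)] = 0.0
    if neg.any():
        thr = np.percentile(w[neg], 100 - p)  # negative weights closest to zero
        w[neg & (w >= thr)] = 0.0
    keep = w != 0
    if keep.any():
        survivors = w[keep]
        edges = np.quantile(survivors, np.linspace(0.0, 1.0, 2**b + 1))
        seg = np.clip(np.searchsorted(edges, survivors, side="right") - 1, 0, 2**b - 1)
        means = np.array([survivors[seg == s].mean() if (seg == s).any() else 0.0
                          for s in range(2**b)])
        w[keep] = means[seg]                  # the layer now shares 2**b values
    return w

quantized = clipq_layer(np.random.randn(1000).astype(np.float32))
print(np.unique(quantized))                   # at most 2**b distinct nonzero values, plus 0
```

With B = 2, each layer ends up with at most four shared weight values, which is what makes the per-layer weight storage in the following subsection so compact.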
D. Weight and Activation Width Determination

After the parameters of CLIP-Q are determined, the next step is to determine how many bits are used to represent a weight and an activation. In this step, we developed an in-house tool in C++ and Python to analyze the accuracy of the 9-layer NIN when different bit widths are used for the weights.

TABLE III
ACCURACY COMPARISON OF DIFFERENT BITS QUANTIZATION OF NIN

Table III shows the accuracy of the 9-layer NIN when different bit widths are used to represent a weight. The accuracy of the neural network with an 8-bit weight width is almost equal to that of the full-precision model. Hence, an 8-bit weight width is used in this work.
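The in-house tool evaluates end-to-end accuracy on the 9-layer NIN; as a lightweight, runnable stand-in, the sketch below sweeps candidate bit widths and reports only the per-weight error of a plain symmetric quantizer on synthetic weights. It is a rough proxy for the analysis behind Table III, not the paper's exact scheme:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Uniform symmetric quantization of a weight tensor to a given bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)
for bits in (2, 4, 8, 16):
    err = np.abs(quantize_symmetric(w, bits) - w).mean()
    print(f"{bits:>2}-bit weights: mean absolute quantization error = {err:.5f}")
```

The error at 8 bits is already tiny relative to the weight magnitudes, consistent with the observation that 8-bit weights track the full-precision accuracy.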
Aside from the bit width, the position of the decimal point directly affects the numerical representation range and precision. Thus, we also need to choose an appropriate decimal point position.

Figure 4 shows the numerical distribution of the weights of each layer of the full-precision NIN model. It can be seen that the distribution of the weights of the last three layers is the widest, ranging approximately between −4 and +4. Therefore, the weight format must cover the range between −4 and +4. In addition, most of the weights in the first three layers of NIN are close to 0; to represent these weights clearly, a certain number of fraction bits is required. Based on these observations, we choose an 8-bit data format with a 3-bit integer part and a 5-bit fraction part for the weights. The range of this data format is −4 to +3.96875, which closely matches the weight distribution, and the 5-bit fraction is adequate for the precision required by the weights.
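The quoted range follows directly from the chosen format. Assuming the 3-bit integer field includes the sign bit of a two's-complement representation (an assumption that is consistent with the range stated above), a format with i integer bits and f fraction bits covers −2^(i−1) to 2^(i−1) − 2^(−f) in steps of 2^(−f):

```python
def q_format_range(int_bits, frac_bits):
    """Range and resolution of a two's-complement fixed-point format.

    int_bits counts the sign bit, so 3 integer bits and 5 fraction bits give
    an 8-bit format covering [-4, +3.96875] in steps of 0.03125.
    """
    step = 2.0 ** -frac_bits
    lo = -(2.0 ** (int_bits - 1))
    hi = 2.0 ** (int_bits - 1) - step
    return lo, hi, step

print(q_format_range(3, 5))  # (-4.0, 3.96875, 0.03125)
```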
Finally, we determine the data format for the activations. This is important because even though the weights are quantized to 8 bits, if the activations between layers still use a full-precision 32-bit format, the accelerator still requires complex
TABLE VI
ACCURACY COMPARISON OF DIFFERENT QUANTIZATION ALGORITHMS IN THE NIN MODEL

TABLE VIII
COMPARISON BETWEEN DIFFERENT IMPLEMENTATIONS ON FPGA
B. Accuracy Comparison

After quantization and the CLIP-Q fixed-segment adjustments, there are only four distinct 8-bit weight values per layer, which conserves a considerable amount of storage. However, it is important that the accuracy of the model is maintained. If the accuracy can be maintained, the proposed CLIP-Q is suitable for the quantization implemented in the CNN accelerator. Table VI compares the Cifar-10 and Cifar-100 accuracy of the models using different quantization methods. All the accuracies in Table VI are generated by experiments on the 9-layer NIN model. FULL means that the 32-bit floating-point full-precision data format is used. 8-bit means that the input and weight data are quantized to 8-bit precision. TNN means that all data are represented by +1, 0, and −1, and BNN only uses +1 and −1 to represent data.
It can be seen from Table VI that although TNN and BNN save a lot of storage space for weights and inputs, their accuracy is reduced; especially on Cifar-100, the accuracy differs significantly from the full-precision result. The accuracy of the 8-bit quantization is closer to the full precision, but there is still a 2% drop on the Cifar-100 test data. However, the adjusted 8-bit CLIP-Q has almost the same accuracy as the full-precision model, and only four 8-bit weight values are needed per layer. The proposed CLIP-Q thus significantly reduces storage space while achieving almost the same accuracy as full precision.
C. Performance Comparison

Table VII shows the required cycles to read input feature maps for various kernel sizes in the reconfigurable convolution design. In the proposed 5 × 5 reconfigurable convolutional array, as long as the kernel size is not larger than 25 (= 5 × 5) taps, it is only necessary to read the input feature map once to complete the convolution. For convolution kernels larger than 5 × 5, the kernel is divided into smaller kernels, each of which has 25 or fewer taps. Therefore, the number of input accesses for a large kernel is equal to the kernel size divided by 25, rounded up. Compared to [32], where convolutions of different kernel sizes were completed using a 3 × 3 kernel size, our reconfigurable design reduces the input data access time. Thus, it can also complete convolution faster than was the case in [32].
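The relationship above can be stated directly: the number of input-feature-map reads for a K × K kernel on the 5 × 5 array is the kernel area divided by 25, rounded up. A small sketch, assuming square kernels:

```python
import math

def input_read_passes(kernel_size, array_size=5):
    """Input-feature-map reads for a K x K kernel on an array_size x array_size array.

    Kernels with at most 25 taps need a single pass; larger kernels are split
    into pieces of at most 25 taps, giving ceil(K*K / 25) passes.
    """
    return math.ceil(kernel_size ** 2 / array_size ** 2)

for k in (1, 3, 5, 7, 11):
    print(f"{k}x{k} kernel -> {input_read_passes(k)} input read pass(es)")
```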
Table VIII shows a comparison between the proposed design and related work. Because the proposed 5 × 5 convolutional array improves the input and output channel parallelism, the overall GOP/S performance is increased. Furthermore, since the multipliers were synthesized using LUTs, the power consumption is lower than when using DSPs. We also used fewer flip-flops in the convolution circuit, which further reduces the power consumption. According to Table VIII, our CNN accelerator's GOP/S/W is higher than that of the other works, which means the proposed design has higher energy efficiency.
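For reference, the throughput and energy-efficiency figures compared in Table VIII are computed as follows; the numbers in the example are illustrative placeholders, not values taken from the table:

```python
def gops(ops, seconds):
    """Throughput in giga-operations per second (GOP/S)."""
    return ops / seconds / 1e9

def gops_per_watt(ops, seconds, watts):
    """Energy efficiency (GOP/S/W): throughput normalized by power draw."""
    return gops(ops, seconds) / watts

# Illustrative numbers only: 0.45 GOP of work finished in 12 ms at 2.0 W.
print(gops(0.45e9, 0.012), gops_per_watt(0.45e9, 0.012, 2.0))  # 37.5, 18.75
```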
VI. RELATED WORK

In this section, we discuss previously designed FPGA-based CNN accelerators. Angel-Eye [32] proposed a software-hardware codesign for embedded CNN applications and used a 3 × 3 convolver to handle the computational workloads of various kernel sizes; however, the utilization rate of its 3 × 3 convolvers is only 1/9 when dealing with 1 × 1 kernels. Instead of designing the accelerator directly, high-level synthesis was used in [33] to generate the design with the help of the roofline model. It measured the compute and memory requirements for each layer of a CNN model and came up with suitable architectures that efficiently utilize the memory bandwidth. However, it mapped full-precision CNN models directly to FPGAs without considering the underlying hardware costs, and common strategies such as data quantization and model pruning were not applied. An end-to-end FPGA-based CNN accelerator aiming for high throughput and high resource utilization was proposed in [34]. Because different layers have different compute-to-memory ratios, it proposed a batch-based method for the fully-connected layers to better utilize memory bandwidth. It adopted 16-bit data quantization for input and weight data; however, its models were not pruned and were unfriendly to resource-limited FPGAs.
VII. CONCLUSION

CLIP-Q significantly reduces the CNN weight storage requirement while also maintaining accuracy. This feature makes CLIP-Q suitable for a CNN accelerator. However, the current CLIP-Q approach did not consider hardware characteristics, and the method for applying CLIP-Q when designing a CNN hardware accelerator was not straightforward. In this work, we propose a software-hardware codesign platform that includes both the software flow and the hardware accelerator. The software flow obtains neural model parameters suitable for hardware implementation. We also designed a CNN hardware accelerator that executes convolutions with various kernel sizes through 5 × 5 reconfigurable convolutional arrays and improves parallelism in both the input and output channels. The experimental results show that the proposed CNN accelerator has higher energy efficiency than the state-of-the-art alternatives.
REFERENCES

[1] Y. Wei et al., "HCP: A flexible CNN framework for multi-label image classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 9, pp. 1901–1907, Jun. 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097–1105.
[3] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[4] S. Gidaris and N. Komodakis, "Object detection via a multi-region and semantic segmentation-aware CNN model," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1134–1142.
[5] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 379–387.
[6] W. Liu et al., "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland: Springer, 2016, pp. 21–37.
[7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," 2014, arXiv:1412.7062.
[9] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[10] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe, "Full-resolution residual networks for semantic segmentation in street scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4151–4160.
[11] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," 2015, arXiv:1511.07122.
[12] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, "Learning to reason: End-to-end module networks for visual question answering," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 804–813.
[13] M. Malinowski, M. Rohrbach, and M. Fritz, "Ask your neurons: A neural-based approach to answering questions about images," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1–9.
[14] H. Noh, P. H. Seo, and B. Han, "Image question answering using convolutional neural network with dynamic parameter prediction," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 30–38.
[15] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, "Stacked attention networks for image question answering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 21–29.
[16] D. Palaz et al., "Analysis of CNN-based speech recognition system using raw speech as input," Idiap, Martigny, Switzerland, Tech. Rep. Idiap-RR-23-2015, 2015.
[17] M. Bojarski et al., "End to end learning for self-driving cars," 2016, arXiv:1604.07316.
[18] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[19] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
[20] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2016, pp. 26–35.
[21] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1," 2016, arXiv:1602.02830.
[22] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2016, pp. 4114–4122.
[23] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," 2016, arXiv:1612.01064.
[24] F. Tung and G. Mori, "CLIP-Q: Deep network compression learning by in-parallel pruning-quantization," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7873–7882.
[25] F. Tung and G. Mori, "Deep neural network compression by in-parallel pruning-quantization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 3, pp. 568–579, Mar. 2018.
[26] H. Yonekawa and H. Nakahara, "On-chip memory based binarized convolutional deep neural network applying batch normalization free technique on an FPGA," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), May 2017, pp. 98–105.
[27] P. Guo, H. Ma, R. Chen, P. Li, S. Xie, and D. Wang, "FBNA: A fully binarized neural network accelerator," in Proc. 28th Int. Conf. Field Program. Log. Appl. (FPL), Aug. 2018, pp. 51–513.
[28] B. Jacob et al., "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 2704–2713.
[29] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[30] M. Lin, Q. Chen, and S. Yan, "Network in network," 2013, arXiv:1312.4400.
[31] T. Liang, J. Glossner, L. Wang, S. Shi, and X. Zhang, "Pruning and quantization for deep neural network acceleration: A survey," Neurocomputing, vol. 461, pp. 370–403, Oct. 2021.
[32] K. Guo et al., "Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 1, pp. 35–47, Jan. 2018.
[33] C. Zhang et al., "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2015, pp. 161–170.
[34] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, "A high performance FPGA-based accelerator for large-scale convolutional neural networks," in Proc. 26th Int. Conf. Field Program. Log. Appl. (FPL), Aug. 2016, pp. 1–9.
[35] Z. Song et al., "DRQ: Dynamic region-based quantization for deep neural network acceleration," in Proc. ACM/IEEE 47th Annu. Int. Symp. Comput. Archit. (ISCA), May 2020, pp. 1010–1021, doi: 10.1109/ISCA45697.2020.00086.
[36] X. Zhou et al., "Cambricon-S: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach," in Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2018, pp. 15–28, doi: 10.1109/MICRO.2018.00011.
[37] S. Q. Zhang, B. McDanel, H. T. Kung, and X. Dong, "Training for multi-resolution inference using reusable quantization terms," in Proc. 26th ACM Int. Conf. Architectural Support Program. Lang. Operating Syst. (ASPLOS), Apr. 2021, pp. 845–860, doi: 10.1145/3445814.3446741.
Wei Cheng received the B.E. degree in computer engineering from The University of Hong Kong in 2018. He is currently pursuing the master's degree with the Department of Computer Science and Information Engineering, National Cheng Kung University. His research interests lie in the field of very large-scale integration design, computer architecture, and deep neural network accelerators.

Yun-Yang Shih received the M.S. degree in computer science and information engineering from National Cheng Kung University in 2020. He is currently with MediaTek Inc. His research interests lie in the field of very large-scale integration design and deep neural network accelerators.