A CNN Accelerator On FPGA Using Depthwise Separable Convolution
Abstract—Convolutional neural networks (CNNs) have been widely deployed in the fields of computer vision and pattern recognition because of their high accuracy. However, large convolution operations are computing intensive and often require a powerful computing platform such as a graphics processing unit (GPU). This makes it difficult to apply CNNs to portable devices. The state-of-the-art CNNs, such as MobileNetV2 and Xception, adopt depthwise separable convolution in place of standard convolution for embedded platforms, which significantly reduces operations and parameters with only a limited loss in accuracy. This highly structured model is very suitable for field-programmable gate array (FPGA) implementation. In this brief, a scalable, high-performance CNN accelerator optimized for depthwise separable convolution is proposed. The accelerator can fit into FPGAs of different sizes by balancing hardware resources against processing speed. As an example, MobileNetV2 is implemented on an Arria 10 SoC FPGA, and the results show that this accelerator can classify each picture from ImageNet in 3.75 ms, which is about 266.6 frames per second. The FPGA design achieves a 20x speedup compared to a CPU.

Index Terms—Convolutional neural network, FPGA, hardware accelerator, MobileNetV2.

I. INTRODUCTION
Nowadays, convolutional neural networks (CNNs) have become the center of interest due to their superior performance in tasks ranging from image classification and semantic segmentation to object detection and tracking. The technique is also widely used in industry, for example in autonomous driving, video surveillance, and speech recognition.

CNN is a computing-intensive model that consumes huge amounts of computing power during training and deployment. In practice, graphics processing units (GPUs) are often selected as the platform. However, the GPU's inherently high power consumption limits its application in embedded scenarios such as portable devices and wearable systems. Therefore, field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs), as replacements for GPUs, are adopted in neural network applications [1]–[12]. More specifically, increasing research attention is focused on FPGA-based CNN accelerators because they offer a trade-off between power consumption and reconfigurability.

To further lighten the computing burden of standard convolution, depthwise separable convolution was proposed in [13]. It was applied in MobileNetV1 [14] and later MobileNetV2 [15], which achieved results comparable to standard CNNs with far fewer multiply-accumulate operations and parameters.
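To make the savings concrete, the following Python sketch counts multiply-accumulate (MAC) operations for a standard convolution versus its depthwise separable counterpart, following the cost model of [13] and [14]. The layer shape used is an illustrative example, not a value taken from this brief.

# MAC cost of a standard convolution versus a depthwise separable one,
# per the cost model used by MobileNet [14]. Sizes are illustrative only.

def standard_conv_macs(h, w, k, c_in, c_out):
    # Every output pixel needs a k x k x c_in dot product per output channel.
    return h * w * k * k * c_in * c_out

def depthwise_separable_macs(h, w, k, c_in, c_out):
    depthwise = h * w * k * k * c_in          # one k x k filter per channel
    pointwise = h * w * c_in * c_out          # 1 x 1 convolution across channels
    return depthwise + pointwise

if __name__ == "__main__":
    h = w = 56; k = 3; c_in = 64; c_out = 128   # hypothetical layer shape
    std = standard_conv_macs(h, w, k, c_in, c_out)
    sep = depthwise_separable_macs(h, w, k, c_in, c_out)
    print(f"standard: {std:,} MACs, separable: {sep:,} MACs, "
          f"reduction: {std / sep:.1f}x")       # ratio is about 1/c_out + 1/k^2

For this example the reduction is about 8.4x, which is why the separable form suits embedded platforms.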
Almost all existing FPGA-based CNN implementations have sought to overcome the limits of memory bandwidth and computing parallelism. To conquer the memory bandwidth limitation, [2] and [3] stored the parameters in on-chip memory. However, as CNNs go deeper, the parameters required by convolution increase sharply, which makes the on-chip memory solution inefficient. Other works such as [4]–[6] alleviated the pressure on off-chip memory by limiting the numerical precision of the network parameters, as lower numerical precision was proved to be sufficient for CNNs [16], [17]. In [7] and [8], the computing engine was optimized for a high degree of parallelism. Reference [6] proposed a pipeline-based CNN solution for high throughput. Reference [9] made a comprehensive evaluation and comparison of the Altera and Xilinx OpenCL frameworks for CNNs. Reference [10] explored sparsity-based optimizations, which could achieve up to 3x higher core energy efficiency and raise device-level energy efficiency by around 70% through data compression. Both [11] and [12] implemented depthwise separable convolution using MobileNetV1 as the example, achieving processing speeds of 7.85 ms per image and 231.7 frames per second (fps), respectively.

The key contributions of this brief are:
(1) A high-performance CNN hardware accelerator framework is proposed in which all layers are processed in a computing unit named the matrix multiplication engine (MME); a sketch of the underlying convolution-as-matrix-multiply idea follows this list.
(2) The use of a hierarchical memory structure and ping-pong on-chip buffers reduces the bandwidth limitation of off-chip memory.
(3) A methodology for scalable design is proposed, so that this framework can be implemented on various FPGAs by balancing on-chip resources against performance.
(4) By applying the proposed framework and methods, the state-of-the-art CNN MobileNetV2 [15] is, for the first time, implemented on an Arria 10 SoC FPGA. The results show 266.6 frames per second and 170.6 giga operations per second (GOPS) at a system clock frequency of 133 MHz. This represents a 20x speedup compared to a CPU [15].
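As a rough illustration of contribution (1), and not the exact dataflow of the proposed MME, the Python sketch below shows how a convolution can be lowered to a single matrix multiplication via the standard im2col transformation (stride 1, no padding assumed).

import numpy as np

# Illustrative im2col lowering: convolution expressed as one matrix multiply.
# This mirrors the general idea behind a matrix multiplication engine, not
# the specific hardware dataflow described in this brief.

def im2col(x, k):
    # x: input feature map of shape (C, H, W); k: square kernel size.
    # Returns a (C*k*k, H_out*W_out) patch matrix.
    c, h, w = x.shape
    h_out, w_out = h - k + 1, w - k + 1
    cols = np.empty((c * k * k, h_out * w_out), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            cols[:, i * w_out + j] = x[:, i:i + k, j:j + k].ravel()
    return cols

def conv_as_matmul(x, weights):
    # weights: (C_out, C, k, k). One matmul produces all output channels.
    c_out, c, k, _ = weights.shape
    cols = im2col(x, k)
    out = weights.reshape(c_out, -1) @ cols    # the core matrix multiply
    h_out = x.shape[1] - k + 1
    return out.reshape(c_out, h_out, -1)

x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
print(conv_as_matmul(x, w).shape)              # (4, 6, 6)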
Manuscript received March 31, 2018; revised June 13, 2018 and July 17, 2018; accepted July 18, 2018. Date of publication August 17, 2018; date of current version September 27, 2018. This work was supported in part by the U.S. NSF under Grant 1626236 and in part by MathWorks. This brief was recommended by Associate Editor J. M. de la Rosa. (Corresponding author: Xinming Huang.)
The authors are with the Department of Electrical and Computer Engineering, Worcester Polytechnic Institute, Worcester, MA 01609 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSII.2018.2865896
TABLE I
MobileNetV2 structure [15], where each line represents a sequence of one or more identical (except stride) layers. All depthwise convolutions use 3x3 kernels.
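For context, each line of Table I expands into MobileNetV2's inverted residual bottleneck blocks [15], which follow an expand/depthwise/project pattern. The Python sketch below traces the MAC count of one such block; the expansion factor t and the example layer shape are taken as illustrative values, not reproduced from Table I.

# MAC count of one MobileNetV2 inverted residual block [15]: a 1x1
# expansion conv, a 3x3 depthwise conv (as in Table I), and a 1x1
# projection conv. The example shape below is illustrative.

def block_macs(h, w, c_in, c_out, t, stride=1):
    """t is the channel expansion factor of the bottleneck block."""
    c_mid = t * c_in
    h_out, w_out = h // stride, w // stride
    expand    = h * w * c_in * c_mid            # 1x1 expansion conv
    depthwise = h_out * w_out * 3 * 3 * c_mid   # 3x3 depthwise conv
    project   = h_out * w_out * c_mid * c_out   # 1x1 projection conv
    return expand + depthwise + project

# Example: a 56x56x24 input expanded 6x, producing 32 output channels.
print(f"{block_macs(56, 56, 24, 32, t=6, stride=2):,} MACs")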
TABLE II
Resource usage of MobileNetV2.

TABLE III
Comparison to other implementations.

… is chosen because it is widely selected by previous works [2], [3], [6], [20].

Based on the description in Section III, a 4-MME array is instantiated in this design after carefully balancing resource usage against processing time. The weight buffer is a 36 Kb ping-pong buffer, which is sufficient because the weights are updated only once every M × M clock cycles when performing depthwise separable convolution. The intermediate feature map buffer is 24.5 Mb.
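The ping-pong (double) buffering mentioned above hides weight-load latency behind computation: one bank feeds the MMEs while the other is refilled from external memory. The following is a minimal software model of the idea; the load and compute functions are hypothetical stand-ins for the hardware behavior, not part of the actual design.

# Minimal model of a ping-pong weight buffer: while the compute engine
# consumes weights from one bank, the next tile's weights are loaded into
# the other bank, so external-memory latency is hidden. In the design
# described in this brief, weights change only every M*M cycles.

def load_weights(tile):
    # Stand-in for a DMA transfer from DDR4 into an on-chip bank.
    return f"weights[{tile}]"

def compute(bank):
    # Stand-in for M*M cycles of depthwise separable convolution.
    print(f"computing with {bank}")

def run(tiles):
    banks = [None, None]
    banks[0] = load_weights(tiles[0])            # prefetch the first tile
    for i, tile in enumerate(tiles):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(tiles):
            banks[nxt] = load_weights(tiles[i + 1])  # overlaps with compute
        compute(banks[cur])

run(list(range(4)))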
B. Implementation Results

Fig. 12 presents the system architecture on the Arria 10 SoC. Since the HPS is not used in this design, only the FPGA part is shown. The DDR4 memory is the one connected to the FPGA fabric. The CNN accelerator runs at 133 MHz, a frequency limited by its adder tree. A Nios II softcore microprocessor is implemented for loading weights and input images from flash memory to the DDR4 external memory. An external memory interface IP combined with a modular scatter-gather direct memory access (mSG-DMA) IP is used to bridge the buffers in the CNN accelerator and the FPGA memory, whose maximum bandwidth is 8.5 GB/s. This structure avoids the host's intervention during the many transfers back and forth with the DDR4 memory and makes non-continuous data movement more efficient. The customized mSG-DMA controller makes it possible to read/write data of different sizes from/to specific addresses, in order to fit convolutions of various sizes.
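To illustrate why scatter-gather DMA suits such non-continuous transfers, the sketch below models a descriptor chain: each descriptor carries an address and a length, so a single kick-off moves many disjoint blocks without host intervention. This is a generic illustration with hypothetical field and function names; it does not reflect the actual register interface of the Intel mSG-DMA IP.

# Generic model of scatter-gather DMA descriptors: a linked list of
# (address, length) records processed without host intervention.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Descriptor:
    src_addr: int        # byte address in external (DDR4) memory
    length: int          # bytes to move
    next: Optional["Descriptor"] = None

def build_chain(blocks):
    """Chain descriptors for (addr, length) blocks, e.g. the rows of a
    feature map tile that are not contiguous in external memory."""
    head = None
    for addr, length in reversed(blocks):
        head = Descriptor(addr, length, head)
    return head

def run_dma(desc, memory):
    """One kick-off streams every block in the chain into the on-chip buffer."""
    out = bytearray()
    while desc is not None:
        out += memory[desc.src_addr: desc.src_addr + desc.length]
        desc = desc.next
    return bytes(out)

memory = bytes(range(256))
tile_rows = [(0, 8), (64, 8), (128, 8)]   # non-contiguous rows of one tile
print(run_dma(build_chain(tile_rows), memory))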
The implementation results are listed in Table II. Table III provides a comparison between the solution proposed in this brief and other similar ones. Note that MobileNetV2 has a more complex structure and higher accuracy on benchmarks.
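As a quick consistency check on the reported throughput (a derivation from the figures above, not an additional measurement), the frame rate and the implied per-image workload follow directly:

\frac{1}{3.75\,\text{ms/image}} \approx 266.6\,\text{fps}, \qquad \frac{170.6\,\text{GOPS}}{266.6\,\text{fps}} \approx 0.64\,\text{GOP/image}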
V. CONCLUSION

In this brief, a high-performance, scalable CNN accelerator is proposed. Its structure is optimized for depthwise separable convolution, which results in remarkably fewer operations and parameters, making it possible to run CNNs on portable devices. By choosing different numbers of MMEs and variable on-chip memories, this accelerator can fit into a large or small FPGA. As an example, the latest MobileNetV2 is implemented on an Arria 10 SoC FPGA, achieving 266.6 fps and 170.6 GOPS.

REFERENCES

[1] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[2] Y. Chen et al., "DaDianNao: A machine-learning supercomputer," in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchit. (MICRO), 2014, pp. 609–622.
[3] Z. Du et al., "ShiDianNao: Shifting vision processing closer to the sensor," ACM SIGARCH Comput. Archit. News, vol. 43, no. 3, pp. 92–104, 2015.
[4] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai, "Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs," in Proc. 54th ACM/EDAC/IEEE Design Autom. Conf. (DAC), Austin, TX, USA, 2017, pp. 1–6.
[5] S. I. Venieris and C.-S. Bouganis, "fpgaConvNet: Automated mapping of convolutional neural networks on FPGAs," in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2017, pp. 291–292.
[6] H. Li et al., "A high performance FPGA-based accelerator for large-scale convolutional neural networks," in Proc. 26th Int. Conf. Field Program. Logic Appl. (FPL), 2016, pp. 1–9.
[7] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2017, pp. 45–54.
[8] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2016, pp. 26–35.
[9] R. Tapiador et al., "Comprehensive evaluation of OpenCL-based convolutional neural network accelerators in Xilinx and Altera FPGAs," arXiv:1609.09296 [cs], Sep. 2016.
[10] A. Aimar et al., "NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps," arXiv:1706.01406v2 [cs], Mar. 2018.
[11] J. Su et al., "Redundancy-reduced MobileNet acceleration on reconfigurable logic for ImageNet classification," in Proc. Appl. Reconfig. Comput. Archit. Tools Appl. (ARC), 2018, pp. 16–28.
[12] R. Zhao, X. Niu, and W. Luk, "Automatic optimising CNN with depthwise separable convolution on FPGA: (Abstract only)," in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2018, p. 285.
[13] L. Sifre and S. Mallat, "Rigid-motion scattering for texture classification," arXiv:1403.1687 [cs], Mar. 2014.
[14] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv:1704.04861 [cs], Apr. 2017.
[15] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," arXiv:1801.04381v3 [cs], Apr. 2018.
[16] M. Courbariaux, Y. Bengio, and J.-P. David, "Training deep neural networks with low precision multiplications," arXiv:1412.7024v5 [cs], Sep. 2015.
[17] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proc. Int. Conf. Mach. Learn. (ICML), 2015, pp. 1737–1746.
[18] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," arXiv:1610.02357v3 [cs], Apr. 2017.
[19] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv:1502.03167v3 [cs], Mar. 2015.
[20] Y. Ma, N. Suda, Y. Cao, J.-S. Seo, and S. Vrudhula, "Scalable and modularized RTL compilation of convolutional neural networks onto FPGA," in Proc. 26th Int. Conf. Field Program. Logic Appl. (FPL), 2016, pp. 1–8.