A CNN Accelerator On FPGA Using Depthwise Separable Convolution
Abstract—Convolutional neural networks (CNNs) have been widely deployed in the fields of computer vision and pattern recognition because of their high accuracy. However, large convolution operations are computing intensive and often require a powerful computing platform such as a graphics processing unit (GPU). This makes it difficult to apply CNNs to portable devices. The state-of-the-art CNNs, such as MobileNetV2 and Xception, adopt depthwise separable convolution in place of standard convolution for embedded platforms, which significantly reduces operations and parameters with only a limited loss in accuracy. This highly structured model is very suitable for field-programmable gate array (FPGA) implementation. In this brief, a scalable, high-performance CNN accelerator optimized for depthwise separable convolution is proposed. The accelerator can fit into FPGAs of different sizes by balancing hardware resources against processing speed. As an example, MobileNetV2 is implemented on an Arria 10 SoC FPGA, and the results show that this accelerator can classify each picture from ImageNet in 3.75 ms, which is about 266.6 frames per second. The FPGA design achieves a 20x speedup compared to a CPU.

Index Terms—Convolutional neural network, FPGA, hardware accelerator, MobileNetV2.

I. INTRODUCTION
Nowadays, convolutional neural networks (CNNs) have become the center of interest due to their superior performance in tasks ranging from image classification and semantic segmentation to object detection and tracking. The technique is also widely used in industry, for example in autonomous driving, video surveillance, and speech recognition.

CNN is a computing-intensive model that consumes huge amounts of computing power during training and deployment. In practice, graphics processing units (GPUs) are often selected as the platform. However, the GPU's inherently high power consumption limits its application in embedded scenarios such as portable devices and wearable systems. Therefore, field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs), as replacements for GPUs, are adopted in neural network applications [1]–[12]. More specifically, increasing research attention is focused on FPGA-based CNN accelerators because they offer a trade-off between power consumption and reconfigurability.

To further lighten the computing burden of standard convolution, depthwise separable convolution was proposed in [13]. It was applied in MobileNetV1 [14] and later MobileNetV2 [15], which achieved results comparable to standard CNNs with far fewer multiply-accumulate operations and parameters.
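To make the savings concrete, the following Python sketch counts multiply-accumulate (MAC) operations for a standard convolution versus its depthwise separable counterpart, following the cost model of [13] and [14]. The layer shape used is an illustrative example, not a value taken from this brief.

# MAC cost of a standard convolution versus a depthwise separable one,
# per the cost model used by MobileNet [14]. Sizes are illustrative only.

def standard_conv_macs(h, w, k, c_in, c_out):
    # Every output pixel needs a k x k x c_in dot product per output channel.
    return h * w * k * k * c_in * c_out

def depthwise_separable_macs(h, w, k, c_in, c_out):
    depthwise = h * w * k * k * c_in          # one k x k filter per channel
    pointwise = h * w * c_in * c_out          # 1 x 1 convolution across channels
    return depthwise + pointwise

if __name__ == "__main__":
    h = w = 56; k = 3; c_in = 64; c_out = 128   # hypothetical layer shape
    std = standard_conv_macs(h, w, k, c_in, c_out)
    sep = depthwise_separable_macs(h, w, k, c_in, c_out)
    print(f"standard: {std:,} MACs, separable: {sep:,} MACs, "
          f"reduction: {std / sep:.1f}x")       # ratio is about 1/c_out + 1/k^2

For this example the reduction is about 8.4x, which is why the separable form suits embedded platforms.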
Almost all existing FPGA-based CNN implementations have sought to overcome the limits of memory bandwidth and computing parallelism. To conquer the memory bandwidth limitation, [2] and [3] stored the parameters in on-chip memory. However, as CNNs go deeper, the parameters required by convolution increase sharply, which makes the on-chip memory solution inefficient. Other works such as [4]–[6] alleviated the pressure on off-chip memory by limiting the numerical precision of the network parameters, as lower numerical precision was proved to be sufficient for CNNs [16], [17]. In [7] and [8], the computing engine was optimized for a high degree of parallelism. Reference [6] proposed a pipeline-based CNN solution for high throughput. Reference [9] made a comprehensive evaluation and comparison of the Altera and Xilinx OpenCL frameworks for CNNs. Reference [10] explored sparsity-based optimizations, which could achieve up to 3x higher core energy efficiency and raise device-level energy efficiency by around 70% through data compression. Both [11] and [12] implemented depthwise separable convolution using MobileNetV1 as the example, achieving processing speeds of 7.85 ms per image and 231.7 frames per second (fps), respectively.

The key contributions of this brief are:
(1) A high-performance CNN hardware accelerator framework is proposed in which all layers are processed in a computing unit named the matrix multiplication engine (MME); a sketch of the underlying convolution-as-matrix-multiply idea follows this list.
(2) The use of a hierarchical memory structure and ping-pong on-chip buffers reduces the bandwidth limitation of off-chip memory.
(3) A methodology for scalable design is proposed, so that this framework can be implemented on various FPGAs by balancing on-chip resources against performance.
(4) By applying the proposed framework and methods, the state-of-the-art CNN MobileNetV2 [15] is, for the first time, implemented on an Arria 10 SoC FPGA. The results show 266.6 frames per second and 170.6 giga operations per second (GOPS) at a system clock frequency of 133 MHz. This represents a 20x speedup compared to a CPU [15].
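As a rough illustration of contribution (1), and not the exact dataflow of the proposed MME, the Python sketch below shows how a convolution can be lowered to a single matrix multiplication via the standard im2col transformation (stride 1, no padding assumed).

import numpy as np

# Illustrative im2col lowering: convolution expressed as one matrix multiply.
# This mirrors the general idea behind a matrix multiplication engine, not
# the specific hardware dataflow described in this brief.

def im2col(x, k):
    # x: input feature map of shape (C, H, W); k: square kernel size.
    # Returns a (C*k*k, H_out*W_out) patch matrix.
    c, h, w = x.shape
    h_out, w_out = h - k + 1, w - k + 1
    cols = np.empty((c * k * k, h_out * w_out), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            cols[:, i * w_out + j] = x[:, i:i + k, j:j + k].ravel()
    return cols

def conv_as_matmul(x, weights):
    # weights: (C_out, C, k, k). One matmul produces all output channels.
    c_out, c, k, _ = weights.shape
    cols = im2col(x, k)
    out = weights.reshape(c_out, -1) @ cols    # the core matrix multiply
    h_out = x.shape[1] - k + 1
    return out.reshape(c_out, h_out, -1)

x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
print(conv_as_matmul(x, w).shape)              # (4, 6, 6)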
Manuscript received March 31, 2018; revised June 13, 2018 and July 17, 2018; accepted July 18, 2018. Date of publication August 17, 2018; date of current version September 27, 2018. This work was supported in part by the U.S. NSF under Grant 1626236 and in part by MathWorks. This brief was recommended by Associate Editor J. M. de la Rosa. (Corresponding author: Xinming Huang.)
The authors are with the Department of Electrical and Computer Engineering, Worcester Polytechnic Institute, Worcester, MA 01609 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSII.2018.2865896
TABLE I
MobileNetV2 structure [15], where each line represents a sequence of one or more identical (except stride) layers. All depthwise convolutions use 3x3 kernels.
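For context, each line of Table I expands into MobileNetV2's inverted residual bottleneck blocks [15], which follow an expand/depthwise/project pattern. The Python sketch below traces the MAC count of one such block; the expansion factor t and the example layer shape are taken as illustrative values, not reproduced from Table I.

# MAC count of one MobileNetV2 inverted residual block [15]: a 1x1
# expansion conv, a 3x3 depthwise conv (as in Table I), and a 1x1
# projection conv. The example shape below is illustrative.

def block_macs(h, w, c_in, c_out, t, stride=1):
    """t is the channel expansion factor of the bottleneck block."""
    c_mid = t * c_in
    h_out, w_out = h // stride, w // stride
    expand    = h * w * c_in * c_mid            # 1x1 expansion conv
    depthwise = h_out * w_out * 3 * 3 * c_mid   # 3x3 depthwise conv
    project   = h_out * w_out * c_mid * c_out   # 1x1 projection conv
    return expand + depthwise + project

# Example: a 56x56x24 input expanded 6x, producing 32 output channels.
print(f"{block_macs(56, 56, 24, 32, t=6, stride=2):,} MACs")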
TABLE II
Resource usage of MobileNetV2.

TABLE III
Comparison to other implementations.

… is chosen because it is widely selected by previous works [2], [3], [6], [20].

Based on the description in Section III, a 4-MME array is instantiated in this design after carefully balancing resource usage against processing time. The weight buffer is a 36 Kb ping-pong buffer, which is sufficient because the weights are updated only once every M × M clock cycles when performing depthwise separable convolution. The intermediate feature map buffer is 24.5 Mb.
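The ping-pong (double) buffering mentioned above hides weight-load latency behind computation: one bank feeds the MMEs while the other is refilled from external memory. The following is a minimal software model of the idea; the load and compute functions are hypothetical stand-ins for the hardware behavior, not part of the actual design.

# Minimal model of a ping-pong weight buffer: while the compute engine
# consumes weights from one bank, the next tile's weights are loaded into
# the other bank, so external-memory latency is hidden. In the design
# described in this brief, weights change only every M*M cycles.

def load_weights(tile):
    # Stand-in for a DMA transfer from DDR4 into an on-chip bank.
    return f"weights[{tile}]"

def compute(bank):
    # Stand-in for M*M cycles of depthwise separable convolution.
    print(f"computing with {bank}")

def run(tiles):
    banks = [None, None]
    banks[0] = load_weights(tiles[0])            # prefetch the first tile
    for i, tile in enumerate(tiles):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(tiles):
            banks[nxt] = load_weights(tiles[i + 1])  # overlaps with compute
        compute(banks[cur])

run(list(range(4)))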
B. Implementation Results

Fig. 12 presents the system architecture on the Arria 10 SoC. Since the HPS is not used in this design, only the FPGA part is shown. The DDR4 memory is the one connected to the FPGA fabric. The CNN accelerator runs at 133 MHz, a frequency limited by its adder tree. A Nios II softcore microprocessor is implemented for loading weights and input images from flash memory to the DDR4 external memory. An external memory interface IP combined with a modular scatter-gather direct memory access (mSG-DMA) IP is used to bridge the buffers in the CNN accelerator and the FPGA memory, whose maximum bandwidth is 8.5 GB/s. This structure avoids the host's intervention during the many transfers back and forth with the DDR4 memory and makes non-continuous data movement more efficient. The customized mSG-DMA controller makes it possible to read/write data of different sizes from/to specific addresses, in order to fit convolutions of various sizes.
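To illustrate why scatter-gather DMA suits such non-continuous transfers, the sketch below models a descriptor chain: each descriptor carries an address and a length, so a single kick-off moves many disjoint blocks without host intervention. This is a generic illustration with hypothetical field and function names; it does not reflect the actual register interface of the Intel mSG-DMA IP.

# Generic model of scatter-gather DMA descriptors: a linked list of
# (address, length) records processed without host intervention.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Descriptor:
    src_addr: int        # byte address in external (DDR4) memory
    length: int          # bytes to move
    next: Optional["Descriptor"] = None

def build_chain(blocks):
    """Chain descriptors for (addr, length) blocks, e.g. the rows of a
    feature map tile that are not contiguous in external memory."""
    head = None
    for addr, length in reversed(blocks):
        head = Descriptor(addr, length, head)
    return head

def run_dma(desc, memory):
    """One kick-off streams every block in the chain into the on-chip buffer."""
    out = bytearray()
    while desc is not None:
        out += memory[desc.src_addr: desc.src_addr + desc.length]
        desc = desc.next
    return bytes(out)

memory = bytes(range(256))
tile_rows = [(0, 8), (64, 8), (128, 8)]   # non-contiguous rows of one tile
print(run_dma(build_chain(tile_rows), memory))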
The implementation results are listed in Table II. Table III provides a comparison between the solution proposed in this brief and other similar ones. Note that MobileNetV2 has a more complex structure and higher accuracy on benchmarks.
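As a quick consistency check on the reported throughput (a derivation from the figures above, not an additional measurement), the frame rate and the implied per-image workload follow directly:

\frac{1}{3.75\,\text{ms/image}} \approx 266.6\,\text{fps}, \qquad \frac{170.6\,\text{GOPS}}{266.6\,\text{fps}} \approx 0.64\,\text{GOP/image}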
V. CONCLUSION

In this brief, a high-performance, scalable CNN accelerator is proposed. Its structure is optimized for depthwise separable convolution, which results in remarkably fewer operations and parameters, making it possible to run CNNs on portable devices. By choosing different numbers of MMEs and variable on-chip memories, this accelerator can fit into a large or small FPGA. As an example, the latest MobileNetV2 is implemented on an Arria 10 SoC FPGA, achieving 266.6 fps and 170.6 GOPS.

REFERENCES

[1] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[2] Y. Chen et al., "DaDianNao: A machine-learning supercomputer," in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchit. (MICRO), 2014, pp. 609–622.
[3] Z. Du et al., "ShiDianNao: Shifting vision processing closer to the sensor," ACM SIGARCH Comput. Archit. News, vol. 43, no. 3, pp. 92–104, 2015.
[4] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai, "Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs," in Proc. 54th ACM/EDAC/IEEE Design Autom. Conf. (DAC), Austin, TX, USA, 2017, pp. 1–6.
[5] S. I. Venieris and C.-S. Bouganis, "fpgaConvNet: Automated mapping of convolutional neural networks on FPGAs," in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2017, pp. 291–292.
[6] H. Li et al., "A high performance FPGA-based accelerator for large-scale convolutional neural networks," in Proc. 26th Int. Conf. Field Program. Logic Appl. (FPL), 2016, pp. 1–9.
[7] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2017, pp. 45–54.
[8] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2016, pp. 26–35.
[9] R. Tapiador et al., "Comprehensive evaluation of OpenCL-based convolutional neural network accelerators in Xilinx and Altera FPGAs," arXiv:1609.09296 [cs], Sep. 2016.
[10] A. Aimar et al., "NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps," arXiv:1706.01406v2 [cs], Mar. 2018.
[11] J. Su et al., "Redundancy-reduced MobileNet acceleration on reconfigurable logic for ImageNet classification," in Proc. Appl. Reconfig. Comput. Archit. Tools Appl. (ARC), 2018, pp. 16–28.
[12] R. Zhao, X. Niu, and W. Luk, "Automatic optimising CNN with depthwise separable convolution on FPGA: (Abstract only)," in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 2018, p. 285.
[13] L. Sifre and S. Mallat, "Rigid-motion scattering for texture classification," arXiv:1403.1687 [cs], Mar. 2014.
[14] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv:1704.04861 [cs], Apr. 2017.
[15] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," arXiv:1801.04381v3 [cs], Apr. 2018.
[16] M. Courbariaux, Y. Bengio, and J.-P. David, "Training deep neural networks with low precision multiplications," arXiv:1412.7024v5 [cs], Sep. 2015.
[17] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proc. Int. Conf. Mach. Learn. (ICML), 2015, pp. 1737–1746.
[18] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," arXiv:1610.02357v3 [cs], Apr. 2017.
[19] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv:1502.03167v3 [cs], Mar. 2015.
[20] Y. Ma, N. Suda, Y. Cao, J.-S. Seo, and S. Vrudhula, "Scalable and modularized RTL compilation of convolutional neural networks onto FPGA," in Proc. 26th Int. Conf. Field Program. Logic Appl. (FPL), 2016, pp. 1–8.