Article
InSight: An FPGA-Based Neuromorphic Computing
System for Deep Neural Networks †
Taeyang Hong, Yongshin Kang and Jaeyong Chung *
Department of Electronic Engineering, Incheon National University, Incheon 22012, Korea;
[email protected] (T.H.); [email protected] (Y.K.)
* Correspondence: [email protected]
† Our system, InSight, is named as a concatenation of “In” from “In”cheon National University and “Sight”
representing its ability to see.
Received: 28 September 2020; Accepted: 27 October 2020; Published: 30 October 2020
Abstract: Deep neural networks have demonstrated impressive results in various cognitive
tasks such as object detection and image classification. This paper describes a neuromorphic
computing system that is designed from the ground up for energy-efficient evaluation of deep
neural networks. The computing system consists of a non-conventional compiler, a neuromorphic
hardware architecture, and a space-efficient microarchitecture that leverages existing integrated
circuit design methodologies. The compiler takes a trained, feedforward network as input,
compresses the weights linearly, and generates a time delay neural network reducing the number of
connections significantly. The connections and units in the simplified network are mapped to silicon
synapses and neurons. We demonstrate an implementation of the neuromorphic computing system
based on a field-programmable gate array that performs image classification on the handwritten
digits (0 to 9) of the MNIST dataset with 99.37% accuracy, consuming only 93 µJ per image. For image
classification on the colour images of the 10-class CIFAR-10 dataset, it achieves 83.43% accuracy
at more than 11× higher energy-efficiency compared to a recent field-programmable gate array
(FPGA)-based accelerator.
Keywords: deep learning; deep neural networks; efficient deep learning; neuromorphic
computing system
1. Introduction
Deep convolutional neural networks (CNNs) have shown state-of-the-art results on various
tasks in computer vision, and their performance has become comparable to humans in some specific
applications [1]. However, they contain a huge number of weight parameters (e.g., 10^8 [2]), and the
inference by the models is computationally expensive. This makes it problematic to deploy these
models to embedded platforms, where computing power, memory, storage, and energy are limited.
It is also problematic to evaluate the large models at the server side. For example, processing images
and videos uploaded by millions of users requires massive amounts of computation, and running a
data center that supports such computation carries enormous costs including cooling expenses and
electricity bills.
To cope with these issues, there has been an enormous amount of research effort put into CNN
acceleration hardware such as GPUs [3], field-programmable gate arrays (FPGAs) [4,5] and ASICs [6,7]
very recently. Traditionally, the processing elements (PEs) of hardware accelerators are complex and
large in area, and the design of the accelerators has focused on maximizing the utilization of a small
number of the PEs considering the limited external memory bandwidth. Most CNN accelerators have
also been developed in that way. At some performance point, most accelerators based on the Von
Neumann architecture become memory bound irrespective of the type and the number of available
compute units. To execute CNNs, a large number of weight parameters must be fetched. In addition,
a large amount of intermediate data must be written to dynamic random-access memory (DRAM)
and read back, so the performance of CNN accelerators is also often limited by the off-chip
bandwidth. Thus, there have been attempts to reduce the memory access [6]. This traditional approach
is practical and applicable today.
However, if we have numerous PEs as our brains do, we may need a radically different architecture
from the traditional one. Having more than several thousand processing elements is not considered
practical today, but it is becoming so. It is now well-known that 8-bit fixed-point arithmetic instead of
32-bit floating point is sufficient to run CNNs with little loss in accuracy [8].
accelerators can be replaced by barrel shifters [9]. In addition, the device community actively performs
research on neuromorphic devices such as memristors [10]. Thus, a new computer architecture to
exercise millions of PEs needs to be developed for the near future. In such a system, operations can be
simply mapped into PEs rather than being scheduled, and the dataflow architecture can be a baseline.
Recent neuromorphic architectures are aligned with this direction, although some of them such as
BrainScaleS [11] and Neurogrid [12] are designed for brain simulation. TrueNorth [13] aims at both
real-world applications and brain simulation [14] and is equipped with 256 million synapses, which not
only store the synaptic weights but also serve as 1-bit compute units. Recent neuromorphic systems
employ detailed neural models such as the leaky integrate-and-fire model [15], in part because they
should be used for brain simulation. However, the recent success of deep learning tells us that the
detailed model may not be necessary for a high predictive performance for today’s applications.
This paper presents a novel neuromorphic computing system that is designed solely for the
execution (i.e., inference or prediction) of deep neural networks. Since our neuromorphic system is
not made for brain simulation, we employ the perceptron as the neural model. Although we do not
employ a detailed neural model at the neuron (circuit) level and we adopt FPGAs as the backend,
we are inspired by the brain at the architectural level, and our system is fundamentally closer to
neuromorphic systems than to traditional accelerators in the sense that it is designed for many
small processing elements. Thus, we call our system a neuromorphic computing system.
The contributions of our neuromorphic computing system are summarized as follows:
• We implement a complete, fully-functional, non-Von Neumann system that can execute modern
deep CNNs and compare it with various existing computing systems including existing
neuromorphic systems, FPGAs, GPUs, and CPUs. This reveals that the neuromorphic approach is
worth exploring despite the progress of conventional specialized systems.
• The dataflow architecture enabled by the one-to-one mapping between operations and compute
units does not require any array-type memory access, but it has a fundamental scalability issue.
This work increases the capacity by adopting model compression and word-serial structures for
2D convolution.
• We demonstrate that neural networks can be implemented in the neuromorphic fashion efficiently
without the crossbars for synapses. This is possible because we can convert dense neural networks
into sparse neural networks.
The rest of the paper is organized as follows. Section 2 introduces neural networks and
neuromorphic systems. Section 3 explains the software part of the system and Section 4 discusses the
hardware part. Section 5 shows experimental results, and Section 6 concludes the paper.
2. Background
A neuron in the perceptron model computes its activation as

y = f( ∑_{i=1}^{N} w_i x_i + b )    (1)
where y is the activation of the neuron, N is the number of inputs, x_i is the input activation, w_i is
the weight, b is the bias, and f is the activation function. This model is loosely connected to biological
neurons and synapses. The weight represents a synaptic strength. The activation represents the
firing rate of a neuron. The product and the sum are associated with the post-synaptic current and the
membrane potential, respectively. The bias is associated with a threshold value, above which a neuron
starts firing. Although this model is coarse and highly abstracted, it provides the best predictive
performance in practical machine learning applications.
In matrix form, a layer of M neurons with N inputs computes y = f(xW + b), where W ∈ R^(N×M) is the
weight matrix, b ∈ R^M is the bias, and f is the activation function.
Figure 1 depicts a directed acyclic graph that represents a feedforward neural network of 3 layers.
The circles (lines) represent neurons (connections). The hidden and output layers are fully-connected.
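For concreteness, the following numpy sketch evaluates Equation (1) for a single neuron and the matrix form for a fully-connected layer; the function names and the choice of ReLU as f are illustrative, not part of our toolflow.

```python
import numpy as np

def relu(z):
    """Rectified linear unit, used here as an example activation function f."""
    return np.maximum(z, 0.0)

def neuron(x, w, b, f=relu):
    """Equation (1): activation of one neuron with N inputs."""
    return f(np.dot(w, x) + b)

def layer(x, W, b, f=relu):
    """Fully-connected layer of M neurons: y = f(xW + b), with W of shape (N, M)."""
    return f(x @ W + b)

# Example with N = 4 inputs and M = 3 neurons.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((4, 3))
b = np.zeros(3)
print(layer(x, W, b))  # three output activations
```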
Figure 2. Neurons in convolutional layers are arranged in a 3D space and have spatially
local connections.
Figure 3. In Von Neumann architecture, the input/output(I/O) operations between the memory and
the processor lead to processing bottlenecks and significant energy consumption. In neuromorphic
architectures, neural network models are considered the software and mapped onto hardware neurons
and synapses.
In Von Neumann architecture, the weight parameters are stored in the off-chip memory
and the intermediate results of computation are written to the memory and read back. In the
neuromorphic architecture, data flow from input to output continuously. In traditional computers,
a small number of processing elements are time-multiplexed, whereas in the neuromorphic architecture,
synaptic operations associated with a connection are dedicated to the corresponding synapse.
Thus, it requires at least as many synapses as the number of connections. TrueNorth [16] belongs to
this neuromorphic architecture.
Figure 5. Convolutional layers are converted into time delay neural nets (TDNNs).
Despite the activation-serial processing of the TDNN, our neuromorphic system achieves much higher
performance than the conventional accelerators, because in the neuromorphic system the processing
of each layer is pipelined.
Figure 6. The TDNN conversion allows us to have simple structures even for convolutional layers.
In spite of the word-serial processing of the TDNN, the performance is much higher than those of the
conventional accelerators because the processing of each layer is pipelined.
The TDNN is represented as a directed acyclic graph (DAG), and a delay or a neuron (a connection) is
represented as a vertex (an edge). The weights are annotated on
the edges. The flow after the TDNN synthesis is similar to that for typical DSP custom hardware design
methodology [23]. Floating-point weights are converted into fixed-point weights (Step 5). For the
fixed-point weights, we use a given fractional bit-width, while the integer bit-width is determined
automatically per layer to be large enough not to cause overflow. Therefore, the total bit-width
(i.e., the word-length) of weights varies per layer. Then, the TDNN simulator written in a high-level
language runs for a given bit-width of activations and given inputs (Step 6). During this process,
it checks if any overflow occurs in the activations. It also evaluates the final accuracy reflecting the
finite-precision effects of both the weights and the activations, and generates the expected outputs
for the given inputs. Finally, the DAG is converted into a Verilog netlist (Step 7). Each vertex
in the DAG corresponds to an instance of pre-designed modules in the netlist and the instances are
connected following the edges in the DAG. The inputs and the expected outputs are used for the functional
verification of the netlist and pre-designed modules.
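To illustrate Steps 5 and 6, the sketch below quantizes one layer's weights for a given fractional bit-width, chooses the integer bit-width per layer so that the largest weight does not overflow, and flags activation overflow. The function names and the exact rounding policy are illustrative, not the NNC's actual code.

```python
import numpy as np

def quantize_layer_weights(w, frac_bits):
    """Step 5 (sketch): convert one layer's floating-point weights to fixed point.
    The fractional bit-width is given; the integer bit-width is chosen per layer so
    that the largest-magnitude weight fits, hence the word length varies per layer."""
    max_abs = float(np.max(np.abs(w)))
    int_bits = max(1, int(np.ceil(np.log2(max_abs + 1.0))) + 1)   # +1 for the sign bit
    word_bits = int_bits + frac_bits                              # per-layer word length
    scale = 1 << frac_bits
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    q = np.clip(np.round(w * scale), lo, hi).astype(np.int64)
    return q, word_bits

def activations_overflow(act, frac_bits, word_bits):
    """Step 6 (sketch): check whether any activation exceeds the given word length."""
    q = np.round(act * (1 << frac_bits))
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return bool(np.any((q < lo) | (q > hi)))
```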
(Toolflow diagram: a trained net from a deep learning library backend (Theano, Caffe, or TensorFlow)
undergoes (1) layer-wise tensor factorization and (2) network pruning, yielding a simplified net with low
accuracy; (3) fine-tuning restores high accuracy; (4) TDNN architecture synthesis, using a module DB,
produces the time delay neural net as a graph; (5) weight quantization follows.)
Figure 7. Toolflow.
4. Neuromorphic Hardware
A fully connected neural net with N input neurons and N output neurons has N² connections,
and existing neuromorphic systems are equipped with large arrays (e.g., memristor, X-bar, or SRAM) to
implement these dense connections efficiently. However, to implement sparsely connected nets
in the neuromorphic fashion, it is more efficient to use a set of small arrays than a large array
as also pointed out in [24]. Thus, for the compressed (simplified) nets, we need to design a new
neuromorphic hardware architecture different from existing large array-based architectures. However,
instead of designing a new neuromorphic architecture, we can leverage FPGAs since they are already
designed for multi-level circuits with sparse connections. In addition, they have programmable
interconnects. All we need to turn an FPGA into a neuromorphic computing system is to implement
the neuron model in logic. We implement the neuron and the synapse as a bit-serial adder and a
bit-serial multiplier, respectively. Then, we can convert a simplified neural net into a logic circuit as
shown in Figure 8. The sparse connection is essential because otherwise the resulting logic circuit
cannot be routed due to the routing congestion. The bit-serial multipliers and adders not only
reduce the circuit area, but also allow us to manage the interconnection between layers. In addition,
the buffer size for the pipelining across layers becomes only M bits for a layer of M neurons. Note
that a convolutional layer with M output channels is converted into a TDNN with M neurons.
Thus, the convolutional layer requires only an M-bit buffer to enable the pipelining. Once the
convolutional layer produces an M-bit code, the next layer can start some processing irrespective
of the output map size. This is possible because our system employs the word-serial and bit-serial
architecture for 2D convolution.
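To make the word-serial processing concrete, the following sketch (illustrative names; border handling is omitted here because the Pool module takes care of it in our system) shows how a delay line of depth (K_h − 1)·W + K_w exposes a full K_h × K_w window for every incoming activation, so a convolutional layer can accept one input pixel per time step regardless of the map size.

```python
from collections import deque
import numpy as np

def stream_conv2d(pixels, width, kernel):
    """Word-serial 2D convolution over a raster-order pixel stream for an image of
    the given row width. A single delay line holds just enough history to provide a
    Kh x Kw window around every newly arrived pixel; outputs near row boundaries are
    invalid and would be discarded by the Pool module."""
    kh, kw = kernel.shape
    depth = (kh - 1) * width + kw            # delay-line depth needed for one window
    line = deque([0.0] * depth, maxlen=depth)
    outputs = []
    for p in pixels:                         # one new activation per time step
        line.append(p)                       # newest sample enters, oldest falls out
        acc = 0.0
        for r in range(kh):
            for c in range(kw):
                # the sample r rows up and c columns left arrived r*width + c steps ago
                acc += kernel[r, c] * line[depth - 1 - (r * width + c)]
        outputs.append(acc)                  # one (possibly invalid) output per pixel
    return outputs
```

In the hardware, the deque corresponds to the delay elements, while the multiply-accumulate is distributed over synapses and a neuron and performed bit-serially.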
Let n_w and n_a be the bit-widths of weights and activations, respectively. The proposed
neuromorphic hardware is a composition of building blocks, and the building blocks in our system
transmit and receive each bit of an activation serially. In our implementation, the least significant bit
(LSB) of an activation comes first. To represent signed real numbers, we use the fixed-point format and
let m_w and m_a denote the fractional bit-widths of weights and activations, respectively. We manually
design the following modules as the building blocks:
• Synapse: consists of an n_w-bit register to store the weight and a bit-serial multiplier, which mainly
comprises full adders and a register to store intermediate results. For minimum area, we employ
the semi-systolic multiplier [25], which requires approximately n_w AND gates, n_w full adders,
and 2n_w flip-flops (a behavioral sketch of the bit-serial multiplication follows this list).
• Neuron: with k inputs comprises a k-input bit-serial adder to sum up the outputs of the k synapses
connected to the neuron. We prepare neurons with various k. Adding a bias and evaluating the
activation function are related to the function of biological neurons, so it may seem natural for the
neuron to have units for those functions. However, in our system, many neurons do not require
those functions, so we implement them in separate modules.
• Delay element: is realized by a 1-bit, n_a-stage shift register.
• Bias: consists of an n_w-bit register to store the bias value and a full adder, whose inputs are fed
by a module input, a selected bit of the register, and the carry out of the previous cycle.
• Relu: zeros out negative activations. Since the sign bit comes last, this module has at least an
n_a-cycle latency.
• Max: compares k input activations by using bit-serial subtractors. If k = 2, one bit-serial
subtractor is used. The comparison is done after the MSB of the activations is received, so this module
also has at least an n_a-cycle latency. If k > 2, we can create a tree of two-input max modules,
but this increases the latency. For minimum latency, we can perform k(k − 1)/2 comparisons
in parallel.
• Pool: performs subsampling, handles borders, and pads zeros. It keeps track of the spatial
coordinates of the current input activation in the activation map and invalidates the output
activations depending on the border handling and the stride. It can also replace invalidated
activations with valid zeros for zero-padding.
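As a behavioral reference for the Synapse module, the following sketch models an LSB-first serial/parallel multiplication of a serial activation by a stored parallel weight. It assumes unsigned operands for clarity; the real synapse is a semi-systolic structure and handles two's-complement values.

```python
def bit_serial_multiply(weight, activation_bits_lsb_first, n_w):
    """Shift-and-add model of a serial/parallel multiplier: one activation bit enters
    per cycle (LSB first) and one product bit leaves per cycle; after the activation
    bits are consumed, n_w more cycles flush the high-order product bits."""
    acc = 0                                   # models the internal partial-sum register
    out_bits = []
    for a_bit in activation_bits_lsb_first:
        acc += weight * a_bit                 # add the weight when the incoming bit is 1
        out_bits.append(acc & 1)              # emit the next product bit (LSB of the sum)
        acc >>= 1                             # align for the next, more significant bit
    for _ in range(n_w):                      # flush the remaining high-order bits
        out_bits.append(acc & 1)
        acc >>= 1
    return out_bits                           # LSB-first bits of weight * activation

# 6 x 5 = 30: activation 5 = 0b0101 sent LSB first, with a 4-bit weight field
assert bit_serial_multiply(6, [1, 0, 1, 0], n_w=4) == [0, 1, 1, 1, 1, 0, 0, 0]
```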
Figure 8. The simplified network is converted into a logic circuit. The 4-bit output of this layer is sent
to the next layer and is consumed immediately, keeping the buffer size between layers to a minimum.
(Timing diagram: clk; phases ø0/ø1; synapse_in at layer 0 (m_a fractional bits); synapse output and
neuron_out at layer 0 (m_w + m_a fractional bits); synapse_in at layer 1; layer 0 and layer 1 valid signals.)
Figure 9. The (n_a + n_w)-bit result is aligned by the pipeline registers, which truncate the least
significant m_w bits.
5. Experimental Results
We implement the neural network compiler (NNC) in Python. Neural networks are
trained off-line using an NVIDIA Titan X GPU and are fed into the compiler. We chose the Artix 7 100T
and Kintex 7 325T field-programmable gate arrays (FPGAs) as the target platforms. Our experiments can
be considered in two ways. First, an FPGA can be considered a general-purpose, not-highly-optimized
neuromorphic processor, and the experiments are regarded as making software for the processor in
part using existing hardware synthesis tools. FPGAs already have plastic connections, and processing
elements and memories are mixed in space. The SRAM-based look-up tables (LUTs) serve as memories
as well as processing elements. Second, our experiments can also be considered to build a prototype
of an application-specific neuromorphic processor. While the neural network compiler generates a
hardware model in Verilog in an application-specific fashion (e.g., weight parameters are hard-coded),
we can easily extend it into a general-purpose hardware model if programmable interconnects
are available. We elaborate the generated hardware models for a target platform using Xilinx Vivado.
The shift registers in delay elements are refined into LUT-based shift registers, not chains of flip-flops,
so even long shift registers are implemented efficiently. We measure the power consumption of the
FPGA boards at a 12 V power supply. For the dynamic power consumption of the FPGA chips, we turn
on and off the entire clock distribution and measure the difference. The static power consumption
is obtained from the power report of Vivado. The FPGA logic operates at 1 V and the efficiency of
12 V–1 V conversion is assumed to be 85% for the chip power measurement.
5.1. Benchmarks
To demonstrate our approach, we use three neural networks. To describe the NN architectures,
the fully-connected layer and the convolutional layer are denoted by F and C, respectively. For the
J. Low Power Electron. Appl. 2020, 10, 36 11 of 18
MNIST hand-written digit classification, we use a softmax regression (1F) and a 3-layer convolutional
neural network (2C1F). They achieve 92.23% and 99.57% for the test set of MNIST, respectively. For the
CIFAR-10 natural image classification, we use a 6-layer convolutional neural network (4C2F), which is
trained with data augmentation. We pad zeros to make the size of the input 40 × 40, and randomly
crop it to be 32 × 32. We also use random flip. The 4C2F has 4 convolutional layers, a fully connected
layer and a final softmax layer. Each convolutional layer has filters of size 3 × 3. The 2nd and the 4th
convolutional layers are followed by 4 × 4 max pooling layers with a stride of 2. We use the rectified
linear unit as the activation function. We preprocess the data using global contrast normalization only.
We train this network for 150 epochs. The weight decay (L2 regularization) is set to 0.002; the learning
rate is set to 0.1 initially; we use batch normalization. This network achieves 89.10% classification
accuracy in the test set.
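For reference, a numpy sketch of the augmentation described above (zero-padding to 40 × 40, random 32 × 32 crop, random horizontal flip); global contrast normalization and the training loop are omitted, and the function name is ours, not taken from a particular deep learning library.

```python
import numpy as np

def augment(img, rng, pad=4, crop=32):
    """Pad a (32, 32, 3) CIFAR-10 image with zeros to (40, 40, 3), take a random
    32 x 32 crop, and flip it horizontally with probability 0.5."""
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    y = rng.integers(0, 2 * pad + 1)          # crop offsets in [0, 8]
    x = rng.integers(0, 2 * pad + 1)
    out = padded[y:y + crop, x:x + crop, :]
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                 # random horizontal flip
    return out
```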
Table 1. We combine tensor factorization with pruning and simplify the convolutional net.
            Original                          Simplified
Config.     Params    SOPs        Config.     Params    SOPs
conv3-64    1.73 K    1.56 M      conv1-3     9         9.22 K
                                  conv3-27    186       167 K
                                  conv1-64    351       316 K
Subtotal    1.73 K    1.56 M      Subtotal    0.55 K    493 K
conv3-64    36.9 K    28.9 M      conv1-64    719       647 K
                                  conv3-64    705       553 K
                                  conv1-64    847       664 K
Subtotal    36.9 K    28.9 M      Subtotal    2.27 K    1.86 M
maxpool
conv3-128   73.7 K    8.92 M      conv1-64    1263      213 K
                                  conv3-128   899       109 K
                                  conv1-128   1586      192 K
Subtotal    73.7 K    8.92 M      Subtotal    3.75 K    514 K
conv3-128   147 K     17.8 M      conv1-128   2689      325 K
                                  conv3-128   964       117 K
                                  conv1-128   2717      329 K
Subtotal    147 K     17.8 M      Subtotal    6.37 K    771 K
maxpool
FC-256      524 K     524 K       conv1-128   1617      26 K
                                  conv4-256   701       701
                                  conv1-256   1497      1497
Subtotal    524 K     524 K       Subtotal    4 K       28 K
FC-10       2.56 K    2.56 K      FC-10       1166      1166
                                  FC-10       84        84
Subtotal    2.56 K    2.56 K      Subtotal    1.25 K    1.25 K
Total
Conv total  260 K     57.2 M      Conv total  13 K      3.64 M
FC total    527 K     527 K       FC total    5 K       29.3 K
Net total   787 K     58 M        Net total   18 K      3.7 M
(Figure: classification accuracy, look-up table (LUT) utilization, and total power (W).)
5.5. Results
Table 2 summarizes the results for the three networks. All the implementations operate at
160 MHz. Thus, the designs seamlessly take one input pixel at a 5 MHz rate (160 MHz/(2 × n_a)).
The classification accuracy (Accu) is evaluated for the original models, the simplified models and
the implementations using the test set. Both CIFAR-10 and MNIST have 10,000 samples in the test
set. The implementation accuracy is the predictive performance of the actual system and reflects
the finite-precision effects. Our Python/numpy-based simulator runs for an implementation model
of the TDNNs and measures the accuracy generating expected outputs of the final softmax layer.
These outputs are validated through RT-level simulation using Synopsys VCS. The two convolutional
NNs have a larger number of connections (Conn) than the number of parameters (Param) due to the
weight sharing. For the TDNNs, the number of units (Unit) and the number of delay units (Delay) are
shown as well. Note that the units, connections, and delay units of the TDNNs are mapped one-to-one
onto the neurons, synapses, and delay elements in the implementations, respectively. Thus, they also
indicate the numbers of neurons, synapses and delay elements. For the original NNs, the number
of connections equals the number of required synaptic operations (SOPs), while they are different
in TDNNs. One synaptic operation (SOP) is translated to 1 MAC in fixed-point systems and 2 FLOPs
in floating-point systems. The total area (Area) of the implementation is measured by the number
of look-up tables (LUTs). Most LUTs in the FPGA are used to implement synapses. The hardware
cost is mainly determined by the number of synapses (thus, the number of connections in the TDNN).
Since each synapse performs 5M SOPs per second, the theoretic (actual) performance of 4C2F comes to
87 G (19 G) SOPs per second.
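As a check on the peak figure, the numbers above combine as 160 MHz/(2 × n_a) = 5 MHz, which implies n_a = 16 in these implementations, and 17,338 synapses × 5 × 10^6 SOP/s ≈ 8.7 × 10^10 ≈ 87 G SOPs per second.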
Table 3 compares our system with existing neuromorphic systems for the MNIST task.
Throughput is measured by images per second. Energy-efficiency (Energy) is evaluated by energy
per image. Table 4 compares our system with existing computing systems and shows that it is a
very different type of computing system from existing ones. As in [6], we assume that our system
targets latency-critical applications, and compare it to the other systems when the batch size is
one. In addition, when we compare the energy-efficiency of our system with that of the GPU, we use
the board power instead of the chip power since it is not available for the GPU. Our system gives
125×, 4.7×, and 9.2× speed up over the CPU, the desktop GPU, and the FPGA accelerator for the
entire network, respectively. It also provides 2168×, 105×, and 62.5× higher energy-efficiency over the
CPU, the GPU, and the FPGA. Even if we exclude the improvement by the simplification, our system
provides 1.68× speed up over the FPGA, and 19.1×, and 11.4× higher energy efficiency over the GPU
and the FPGA. The improvements over the FPGA accelerator have been achieved by using almost the
same FPGA devices in a different way. Our system based on the off-the-shelf chip is even comparable
to TrueNorth that is based on a custom chip. Our system provides a slightly higher accuracy and 3.9×
speed-up at only 2.75× lower energy-efficiency.
Table 2. An original feedforward neural network is simplified and rolled into a TDNN by the neural
network compiler (NNC). The units, connections, and delays of the TDNN are mapped to neurons,
synapses, and delay elements in the field-programmable gate array (FPGA).
Original Model
Net    Task    Accu     Param    SOPs     Conn     Unit    Delay
1F     MNIST   0.9234   7840     7840     7840     -       -
2C1F   MNIST   0.9957   0.1 M    6 M      6 M      -       -
4C2F   CIFAR   0.8910   0.8 M    58 M     58 M     -       -

Simplified TDNN Model
Net    Task    Accu     Param    SOPs     Conn     Unit    Delay
1F     MNIST   0.9265   3000     17 K     3161     243     7047
2C1F   MNIST   0.9938   4000     0.4 M    4567     1173    5020
4C2F   CIFAR   0.8358   18 K     3.7 M    17338    3125    4274

FPGA Implementation
Net    Task    Accuracy   Area (LUTs)   Power
1F     MNIST   0.9209     29 K          0.72 W
2C1F   MNIST   0.9937     49 K          0.54 W
4C2F   CIFAR   0.8343     177 K         2.14 W
A TDNN model is refined to a network of neurons, synapses, and delay elements preserving
the topology, and the IC design tool places them in space and connects them automatically.
This provides an interesting visualization of neural networks. The 4C2F on the FPGA is shown
in Figure 11a–c. The network on the FPGA is stimulated by images in the test set of CIFAR-10,
which are transferred via UART, and the output of the network is visualized in a screen as shown in
Figure 11d. The neuromorphic system successfully classifies images in real time, and the results match
the simulation exactly.
Figure 11. Neural network on a chip and its demonstration. (a) Our system is a network of three types
of building blocks (delays in red, neurons in green, and synapses in blue). The synapses occupy most
area. (b) The connectivity of the building blocks is shown. (c) A function is dedicated to processing
elements in a specific region as in our brain. (d) Our neuromorphic system based on an FPGA classifies
images into 10 categories at a speed of 4882 images per second, consuming only 4.92 W at the
board level. It uses a non-Von Neumann architecture; no external memory is used in this system.
In addition, the internal block RAMs on the FPGA are not used except those for the frame buffer.
6. Conclusions
In this paper, we have presented a neuromorphic computing system that is newly designed from
the microarchitecture to the compiler in order to forward-execute neural networks with minimum
energy consumption. This neuromorphic system can scale by simply using a larger FPGA. Since FPGAs
more than 10 times larger than the target platform are available on the market, with our approach
it now becomes easy to build neuromorphic computing systems that can execute neural networks
with more than 7 million real-valued parameters, fully leveraging existing integrated circuit design
techniques. We believe that a neuromorphic chip derived from FPGAs (or, an FPGA tailored towards
the proposed circuits) serves as a practical processor for large-scale deep neural networks such as
AlexNet and VGG. Although we have mapped the time delay neural networks generated by the neural
network compiler into FPGAs, they can also be mapped into emerging neuromorphic devices such
as memristors. We thus believe that the proposed computing systems can also serve as a research
platform for high-level design studies until new neuromorphic devices are available widely.
Author Contributions: Conceptualization, J.C.; methodology, T.H. and Y.K.; software, T.H. and Y.K.;
validation, T.H. and Y.K.; formal analysis, J.C.; investigation, T.H. and J.C.; data curation, T.H. and Y.K.;
writing—original draft preparation, T.H. and J.C.; writing—review and editing, J.C.; visualization, T.H.;
supervision, J.C.; project administration, J.C.; funding acquisition, J.C. All authors have read and agreed to
the published version of the manuscript.
Funding: This work was supported by the Institute for Information and Communications Technology Promotion
funded by the Korea Government under Grant 1711073912.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Ciresan, D.; Meier, U.; Schmidhuber, J. Multi-column deep neural networks for image classification.
In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Providence, RI, USA, 16–21 June 2012; pp. 3642–3649.
2. Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. Overfeat: Integrated recognition,
localization and detection using convolutional networks. arXiv 2013, arXiv:1312.6229.
3. Chetlur, S.; Woolley, C.; Vandermersch, P.; Cohen, J.; Tran, J.; Catanzaro, B.; Shelhamer, E.
Cudnn: Efficient primitives for deep learning. arXiv 2014, arXiv:1410.0759.
4. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-based Accelerator Design for Deep
Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015.
5. Alwani, M.; Chen, H.; Ferdman, M.; Milder, P. Fused-layer CNN accelerators. In Proceedings of the
2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Taipei, Taiwan,
15–19 October 2016; pp. 1–12.
6. Han, S.; Liu, X.; Mao, H.; Pu, J.; Pedram, A.; Horowitz, M.A.; Dally, W.J. EIE: Efficient inference
engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on
Computer Architecture, Seoul, Korea, 18–22 June 2016; pp. 243–254.
7. Shin, D.; Lee, J.; Lee, J.; Yoo, H.J. 14.2 DNPU: An 8.1 TOPS/W reconfigurable CNN-RNN processor for
general-purpose deep neural networks. In Proceedings of the 2017 IEEE International Solid-State Circuits
Conference (ISSCC), San Francisco, CA, USA, 5–9 February 2017; pp. 240–241.
8. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean,
J.; Devin, M.; et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems.
arXiv 2016, arXiv:1603.04467.
9. Miyashita, D.; Lee, E.H.; Murmann, B. Convolutional neural networks using logarithmic data representation.
arXiv 2016, arXiv:1603.01025.
10. Jo, S.H.; Chang, T.; Ebong, I.; Bhadviya, B.B.; Mazumder, P.; Lu, W. Nanoscale memristor device as synapse
in neuromorphic systems. Nano Lett. 2010, 10, 1297–1301. [CrossRef] [PubMed]
11. Schemmel, J.; Bruderle, D.; Grubl, A.; Hock, M.; Meier, K.; Millner, S. A wafer-scale neuromorphic hardware
system for large-scale neural modeling. In Proceedings of the 2010 IEEE International Symposium on
Circuits and Systems (ISCAS), Paris, France, 30 May–2 June 2010; pp. 1947–1950.
12. Benjamin, B.V.; Gao, P.; McQuinn, E.; Choudhary, S.; Chandrasekaran, A.R.; Bussat, J.M.; Alvarez-Icaza, R.;
Arthur, J.V.; Merolla, P.A.; Boahen, K. Neurogrid: A mixed-analog-digital multichip system for large-scale
neural simulations. Proc. IEEE 2014, 102, 699–716. [CrossRef]
13. Merolla, P.A.; Arthur, J.V.; Alvarez-Icaza, R.; Cassidy, A.S.; Sawada, J.; Akopyan, F.; Jackson, B.L.; Imam, N.;
Guo, C.; Nakamura, Y.; et al. A million spiking-neuron integrated circuit with a scalable communication
network and interface. Science 2014, 345, 668–673. [CrossRef] [PubMed]
14. Cassidy, A.S.; Alvarez-Icaza, R.; Akopyan, F.; Sawada, J.; Arthur, J.V.; Merolla, P.A.; Datta, P.; Tallada, M.G.;
Taba, B.; Andreopoulos, A.; et al. Real-time scalable cortical computing at 46 giga-synaptic OPS/watt with
∼100× Speed Up in Time-to-Solution and ∼100,000× Reduction in Energy-to-Solution. In Proceedings
of the International Conference for High Performance Computing, Networking, Storage and Analysis,
New Orleans, LA, USA, 16–21 November 2014; pp. 27–38.
15. Cassidy, A.S.; Merolla, P.; Arthur, J.V.; Esser, S.K.; Jackson, B.; Alvarez-Icaza, R.; Datta, P.; Sawada, J.;
Wong, T.M.; Feldman, V.; et al. Cognitive Computing Building Block: A Versatile and Efficient Digital
Neuron Model for Neurosynaptic Cores. In Proceedings of the 2013 International Joint Conference on Neural
Networks (IJCNN), Dallas, TX, USA, 4–9 August 2013; pp. 1–10.
16. Arthur, J.V.; Merolla, P.A.; Akopyan, F.; Alvarez, R.; Cassidy, A.; Chandra, S.; Esser, S.K.; Imam, N.; Risk, W.;
Rubin, D.B.; et al. Building block of a programmable neuromorphic substrate: A digital neurosynaptic core.
In Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN), Brisbane, Australia,
10–15 June 2012; pp. 1–8.
17. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014,
arXiv:1409.1556.
18. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning,
trained quantization and huffman coding. arXiv 2015, arXiv:1510.00149.
19. Gong, Y.; Liu, L.; Yang, M.; Bourdev, L. Compressing deep convolutional networks using vector quantization.
arXiv 2014, arXiv:1412.6115.
20. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A.
Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
21. Chung, J.; Shin, T. Simplifying Deep Neural Networks for Neuromorphic Architectures. In Proceedings of
the 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), Austin, TX, USA, 5–9 June 2016.
22. Bosi, B.; Bois, G.; Savaria, Y. Reconfigurable pipelined 2-D convolvers for fast digital signal processing.
IEEE Trans. Very Large Scale Integr. 1999, 7, 299–308. [CrossRef]
23. Cmar, R.; Rijnders, L.; Schaumont, P.; Vernalde, S.; Bolsens, I. A methodology and design environment
for DSP ASIC fixed point refinement. In Proceedings of the Conference on Design, Automation and Test
in Europe, Munich, Germany, 9–12 March 1999; ACM: New York, NY, USA, 1999; p. 56.
24. Wen, W.; Wu, C.R.; Hu, X.; Liu, B.; Ho, T.Y.; Li, X.; Chen, Y. An EDA framework for large scale hybrid
neuromorphic computing systems. In Proceedings of the 52nd Annual Design Automation Conference,
San Francisco, CA, USA, 8–12 June 2015; ACM: New York, NY, USA, 2015; p. 12.
25. Agrawal, E. Systolic and Semi-Systolic Multiplier. MIT Int. J. Electron. Commun. Eng. 2013, 3, 90–93.
26. Esser, S.K.; Appuswamy, R.; Merolla, P.; Arthur, J.V.; Modha, D.S. Backpropagation for energy-efficient
neuromorphic computing. In Proceedings of the Advances in Neural Information Processing Systems,
Montreal, Canada, 7–12 December 2015; pp. 1117–1125.
27. Neil, D.; Liu, S.C. Minitaur, an event-driven FPGA-based spiking network accelerator. IEEE Trans. Very
Large Scale Integr. Syst. 2014, 22, 2621–2628. [CrossRef]
28. Esser, S.K.; Merolla, P.A.; Arthur, J.V.; Cassidy, A.S.; Appuswamy, R.; Andreopoulos, A.; Berg, D.J.;
McKinstry, J.L.; Melano, T.; Barch, D.R.; et al. Convolutional Networks for Fast, Energy-Efficient
Neuromorphic Computing. arXiv 2016, arXiv:1603.08270.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional
affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).