Artificial Intelligence Hardware Design - Challenges and Solutions
IEEE Press
445 Hoes Lane
Piscataway, NJ 08854
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best
efforts in preparing this book, they make no representations or warranties with respect to the
accuracy or completeness of the contents of this book and specifically disclaim any implied
warranties of merchantability or fitness for a particular purpose. No warranty may be created or
extended by sales representatives or written sales materials. The advice and strategies contained
herein may not be suitable for your situation. You should consult with a professional where
appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other
commercial damages, including but not limited to special, incidental, consequential, or other
damages.
For general information on our other products and services or for technical support, please
contact our Customer Care Department within the United States at (800) 762-2974, outside the
United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in
print may not be available in electronic formats. For more information about Wiley products,
visit our web site at www.wiley.com.
ISBN: 9781119810452
10 9 8 7 6 5 4 3 2 1
Contents
Author Biographies xi
Preface xiii
Acknowledgments xv
Table of Figures xvii
1 Introduction 1
1.1 Development History 2
1.2 Neural Network Models 4
1.3 Neural Network Classification 4
1.3.1 Supervised Learning 4
1.3.2 Semi-supervised Learning 5
1.3.3 Unsupervised Learning 6
1.4 Neural Network Framework 6
1.5 Neural Network Comparison 10
Exercise 11
References 12
2 Deep Learning 13
2.1 Neural Network Layer 13
2.1.1 Convolutional Layer 13
2.1.2 Activation Layer 17
2.1.3 Pooling Layer 18
2.1.4 Normalization Layer 19
2.1.5 Dropout Layer 20
2.1.6 Fully Connected Layer 20
2.2 Deep Learning Challenges 22
Exercise 22
References 24
3 Parallel Architecture 25
3.1 Intel Central Processing Unit (CPU) 25
3.1.1 Skylake Mesh Architecture 27
3.1.2 Intel Ultra Path Interconnect (UPI) 28
3.1.3 Sub Non-unified Memory Access Clustering (SNC) 29
3.1.4 Cache Hierarchy Changes 31
3.1.5 Single/Multiple Socket Parallel Processing 32
3.1.6 Advanced Vector Software Extension 33
3.1.7 Math Kernel Library for Deep Neural Network (MKL-DNN) 34
3.2 NVIDIA Graphics Processing Unit (GPU) 39
3.2.1 Tensor Core Architecture 41
3.2.2 Winograd Transform 44
3.2.3 Simultaneous Multithreading (SMT) 45
3.2.4 High Bandwidth Memory (HBM2) 46
3.2.5 NVLink2 Configuration 47
3.3 NVIDIA Deep Learning Accelerator (NVDLA) 49
3.3.1 Convolution Operation 50
3.3.2 Single Data Point Operation 50
3.3.3 Planar Data Operation 50
3.3.4 Multiplane Operation 50
3.3.5 Data Memory and Reshape Operations 51
3.3.6 System Configuration 51
3.3.7 External Interface 52
3.3.8 Software Design 52
3.4 Google Tensor Processing Unit (TPU) 53
3.4.1 System Architecture 53
3.4.2 Multiply–Accumulate (MAC) Systolic Array 55
3.4.3 New Brain Floating-Point Format 55
3.4.4 Performance Comparison 57
3.4.5 Cloud TPU Configuration 58
3.4.6 Cloud Software Architecture 60
3.5 Microsoft Catapult Fabric Accelerator 61
3.5.1 System Configuration 64
3.5.2 Catapult Fabric Architecture 65
3.5.3 Matrix-Vector Multiplier 65
3.5.4 Hierarchical Decode and Dispatch (HDD) 67
3.5.5 Sparse Matrix-Vector Multiplication 68
Exercise 70
References 71
5 Convolution Optimization 85
5.1 Deep Convolutional Neural Network Accelerator 85
5.1.1 System Architecture 86
5.1.2 Filter Decomposition 87
5.1.3 Streaming Architecture 90
5.1.3.1 Filter Weights Reuse 90
5.1.3.2 Input Channel Reuse 92
5.1.4 Pooling 92
5.1.4.1 Average Pooling 92
5.1.4.2 Max Pooling 93
5.1.5 Convolution Unit (CU) Engine 94
5.1.6 Accumulation (ACCU) Buffer 94
5.1.7 Model Compression 95
5.1.8 System Performance 95
5.2 Eyeriss Accelerator 97
5.2.1 Eyeriss System Architecture 97
5.2.2 2D Convolution to 1D Multiplication 98
5.2.3 Stationary Dataflow 99
5.2.3.1 Output Stationary 99
5.2.3.2 Weight Stationary 101
5.2.3.3 Input Stationary 101
5.2.4 Row Stationary (RS) Dataflow 104
5.2.4.1 Filter Reuse 104
5.2.4.2 Input Feature Maps Reuse 106
5.2.4.3 Partial Sums Reuse 106
Author Biographies
Albert Chun Chen Liu is Kneron's founder and CEO. He is an Adjunct Associate Professor at National Tsing Hua University, National Chiao Tung University, and National Cheng Kung University. After graduating from National Cheng Kung University in Taiwan, he received scholarships from Raytheon and the University of California to join the UC Berkeley/UCLA/UCSD research programs and then earned his Ph.D. in Electrical Engineering from the University of California, Los Angeles (UCLA). Before establishing Kneron in San Diego in 2015, he worked in R&D and management positions at Qualcomm, Samsung Electronics R&D Center, MStar, and Wireless Information.
Albert has been invited to lecture on computer vision technology and artificial intelligence at the University of California and to serve as a technical reviewer for many internationally renowned academic journals. He holds more than 30 international patents in artificial intelligence, computer vision, and image processing and has published more than 70 papers. He is a recipient of the IBM Problem Solving Award based on the use of the EIP tool suite in 2007 and the IEEE TCAS Darlington award in 2021.
Oscar Ming Kin Law developed his interest in smart robots in 2014. He has successfully integrated deep learning with self-driving cars, smart drones, and robotic arms, and he is currently working on humanoid development. He received a Ph.D. in Electrical and Computer Engineering from the University of Toronto, Canada.
Oscar currently works at Kneron on in-memory computing and smart robot development. He has worked at ATI Technologies, AMD, TSMC, and Qualcomm, and has led various groups for chip verification, standard cell design, signal integrity, power analysis, and Design for Manufacturability (DFM). He has conducted seminars at the University of California, San Diego, the University of Toronto, Qualcomm, and TSMC. He also holds over 60 patents in various areas.
Preface
With the breakthrough of the Convolutional Neural Network (CNN) for image classification in 2012, Deep Learning (DL) has successfully solved many complex problems and is widely used in our everyday life and in the automotive, finance, retail, and healthcare industries. In 2016, Artificial Intelligence (AI) surpassed human performance when Google AlphaGo won the Go world championship through Reinforcement Learning (RL). The AI revolution is gradually changing our world, much as the personal computer (1977), the Internet (1994), and the smartphone (2007) did. However, most of the effort focuses on software development rather than the hardware challenges:
● Big input data
● Deep neural network
● Massive parallel processing
● Reconfigurable network
● Memory bottleneck
● Intensive computation
● Network pruning
● Data sparsity
This book shows how to resolve these hardware problems through various designs ranging from CPU, GPU, and TPU to NPU. Novel hardware can be evolved from these designs for further performance and power improvement:
● Parallel architecture
● Streaming Graph Theory
● Convolution optimization
● In-memory computation
● Near-memory architecture
● Network sparsity
● 3D neural processing
Acknowledgments
First, we would like to thank all who have supported the publication of the book.
We are thankful to Iain Law and Enoch Law for the manuscript preparation and
project development. We would like to thank Lincoln Lee and Amelia Leung for
reviewing the content. We also thank Claire Chang, Charlene Jin, and Alex Liao
for managing the book production and publication. In addition, we are grateful to
the readers of the Chinese edition for their valuable feedback on improving the
content of this book. Finally, we would like to thank our families for their support
throughout the publication of this book.
Introduction
With the advancement of Deep Learning (DL) for image classification in 2012 [1], the Convolutional Neural Network (CNN) extracted image features and successfully classified the objects. It reduced the error rate by about 10% compared with traditional computer vision algorithmic approaches. Finally, in 2015, ResNet pushed the error rate below the 5% human-level accuracy. Different Deep Neural Network (DNN) models have been developed for various applications ranging from automotive, finance, and retail to healthcare. They have successfully solved complex problems and are widely used in our everyday life. For example, Tesla Autopilot guides the driver through lane changes, interchange navigation, and highway exits. It will support traffic sign recognition and automatic city driving in the near future.
In 2016, Google AlphaGo won the Go world championship through Reinforcement Learning (RL) [2]. It evaluated the environment, decided on the action, and finally won the game. RL has a large impact on robot development because the robot adapts to changes in the environment through learning rather than programming, which expands the robot's role in industrial automation. The Artificial Intelligence (AI) revolution has gradually changed our world, like the personal computer (1977),1 the Internet (1994),2 and the smartphone (2007).3 It significantly improves human life (Figure 1.1).
1 The Apple II (1977) and IBM PC (1981) provided affordable hardware for software development; new software greatly improved working efficiency in our everyday life and changed our world.
2 The information superhighway (1994) connects the whole world through the Internet and improves person-to-person communication. The Google search engine makes information available at your fingertips.
3 The Apple iPhone (2007) changed the phone into a multimedia platform. It not only allows people to listen to music and watch videos but also integrates many utilities (e.g. e-mail, calendar, wallet, and notes) into the phone.
1.1 Development History
Neural networks [3] have been under development for a long time. In 1943, the first general-purpose electronic computer, the Electronic Numerical Integrator and Computer (ENIAC), was under construction at the University of Pennsylvania. At the same time, a neurophysiologist, Warren McCulloch, and a mathematician, Walter Pitts, described how neurons might work [4] and modeled a simple neural network using an electrical circuit. In 1949, Donald Hebb wrote the book The Organization of Behavior, which pointed out how a neural network is strengthened through practice (Figure 1.2).
In the 1950s, Nathaniel Rochester simulated the first neural network at the IBM Research Laboratory. In 1956, the Dartmouth Summer Research Project on Artificial Intelligence linked artificial intelligence (AI) with neural networks for joint project development. In the following year, John von Neumann suggested implementing simple neuron functions using telegraph relays or vacuum tubes.
Figure 1.3 ImageNet challenge top-5 error rate (2010–2015): XRCE, AlexNet, Clarifai, VGG-16, GoogLeNet, ResNet, and human-level performance.
During the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 [9], University of Toronto researchers applied the CNN model AlexNet [1] to successfully recognize objects and achieved a top-5 error rate about 10% better than traditional computer vision algorithmic approaches. ImageNet contains over 14 million images in 21 thousand classes, and more than 1 million images have bounding box annotations. The competition focused on 1000 classes with a trimmed database for image classification and object detection. For image classification, a class label is assigned to the object in the image; for object detection, the object is additionally localized with a bounding box. With the evolution of DNN models, Clarifai [10], VGG-16 [11], and GoogLeNet [12], the error rate was rapidly reduced. In 2015, ResNet [13] showed an error rate below the 5% human-level accuracy. The rapid growth of deep learning is transforming our world.
1.2 Neural Network Models
The brain consists of 86 billion neurons, and each neuron has a cell body (or soma) to control the function of the neuron. Dendrites are the branch-like structures extending away from the cell body that are responsible for neuron communication. They receive messages from other neurons and allow the messages to travel to the cell body. An axon carries an electric impulse from the cell body to the opposite end of the neuron, the axon terminal, which passes the impulse to another neuron. The synapse is the chemical junction between the axon terminal of one neuron and a dendrite where the chemical reaction, excitatory or inhibitory, occurs. It decides how to transmit the message between the neurons. The structure of a neuron allows the brain to transmit messages to the rest of the body and control all its actions (Figure 1.4).
The neural network model is derived from the human neuron. It consists of nodes, weights, and interconnects. The node (cell body) controls the neural network operation and performs the computation. The weight (axon) connects to either a single node or multiple nodes for signal transfer. The activation (synapse) decides how the signal transfers from one node to others.
Neural network models are divided into supervised, semi-supervised, and unsupervised learning.
Figure 1.4 Neuron and neural network model: the inputs X1…Xn are weighted by W1…Wn and summed at the node, and a threshold (activation) function produces the output Y = f(ΣWX) passed to the next layer.
After successful training, the network predicts the outputs based on unknown inputs. The popular supervised models are the CNN and the Recurrent Neural Network (RNN), including the Long Short-Term Memory (LSTM) network.
Regression is typically used for supervised learning; it predicts a value based on the input dataset and finds the relationship between the input and output. Linear regression is the most popular regression approach (Figure 1.5).
Figure 1.5 Regression and clustering applied to the original dataset.
4 Iain Law, Enoch Law and Oscar Law, LEGO Smart AI Robot, https://www.youtube.com/
watch?v=NDnVtFx-rkM.
Table 1.1 Neural network frameworks.

Framework | Developer | Year | Platform | Language
TensorFlow (a) | Google Brain Team | 2015 | Linux, macOS, Windows | Python, C++
Caffe (b) | Berkeley Vision and Learning Center | 2013 | Linux, macOS, Windows | Python, C++, Matlab
Microsoft Cognitive Toolkit (c) | Microsoft | 2016 | Linux, Windows | Python (Keras), C++, BrainScript
Torch (d) | Ronan Collobert, Koray Kavukcuoglu, Clement Farabet | 2002 | Linux, macOS, Windows, iOS, Android | Lua, LuaJIT, C, C++
PyTorch (e) | Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan | 2016 | Linux, macOS, Windows | Python, C, C++, CUDA
MXNet (f) | Apache Software Foundation | 2015 | Linux, macOS, Windows, AWS, iOS, Android | Python, C++, Julia, Matlab, JavaScript, Go, R, Scala, Perl, Clojure
Chainer (g) | Preferred Networks | 2015 | Linux, macOS | Python
Keras (h) | Francois Chollet | 2015 | Linux, macOS, Windows | Python, R
Deeplearning4j (i) | Skymind Engineering Team | 2014 | Linux, macOS, Windows, Android | Python (Keras), Java, Scala, Clojure, Kotlin
Matlab (j) | MathWorks | | Linux, macOS, Windows | C, C++, Java, Matlab
(a) http://www.tensorflow.org
(b) http://caffe.berkeleyvision.org
(c) http://github.com/Microsoft/CNTK
(d) http://torch.ch
(e) http://pytorch.org
(f) https://mxnet.apache.org
(g) http://chainer.org
(h) http://keras.io
(i) http://deeplearning4j.org
(j) http://mathworks.com
The network is trained using the floating-point format for better accuracy; it takes a few hours to a few days to train the network using cloud computing or a High-Performance Computing (HPC) processor. The inference predicts the output through a trained neural network model; it takes only a few seconds to a minute to predict the output using the fixed-point format, and most deep learning accelerators are optimized for inference. All the popular neural network frameworks support the NVIDIA Compute Unified Device Architecture (CUDA), which fully utilizes the powerful GPU for parallel computation.

1.5 Neural Network Comparison
After AlexNet [1] emerged in 2012, various neural network models have been developed. The models have become larger, deeper, and more complex. They all demand intensive computation and high memory bandwidth. Different neural network models are compared [14, 15] in terms of computational complexity, model efficiency, and memory utilization (Figures 1.7–1.9).
To improve computational efficiency, new deep learning hardware architectures are developed to support the intensive computation and high memory bandwidth demands. To understand the deep learning accelerator requirements, the CNN is introduced in the next chapter, which highlights the design challenges and discusses the hardware solutions.
Figure 1.7 Neural network top-1 accuracy vs. computational complexity (operations, G-FLOPs) [14].
Figure 1.8 Neural network top-1 accuracy density vs. model efficiency [14].
Figure 1.9 Neural network memory utilization and computational complexity [14].
Exercise
1 Why is the deep learning approach better than the algorithmic one?
2 How does deep learning impact the automotive, finance, retail, and health-
care industries?
3 How will deep learning affect the job market in the next ten years?
References
1 Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012). ImageNet Classification
with Deep Convolutional Neural Network. NIPS.
2 Silver, D., Huang, A., Maddison, C.J. et al. (2016). Mastering the game of go with
deep neural networks and tree search. Nature: 484–489.
3 Strachnyi, K. (2019). Brief History of Neural Networks. Analytics Vidhya,
23 January 2019 [Online].
4 McCulloch, W.S. and Pitts, W.H. (1943). A logical calculus of the ideas immanent
in nervous activity. The Bulletin of Mathematical Biophysics 5 (4): 115–133.
5 Rosenblatt, F. (1958). The perceptron – a probabilistic model for information
storage and organization in the brain. Psychological Review 65 (6): 386–408.
6 Minsky, M.L. and Papert, S.A. (1969). Perceptrons. MIT Press.
7 Hopfield, J.J. (1982). Neural networks and physical systems with emergent
collective computational abilities. Proceeding of National Academy of Sciences 79:
2554–2558.
8 LeCun, Y., Bottou, L., and Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE 86 (11): 2278–2324.
9 Russakovsky, O., Deng, J., Su, H., et al. (2015). ImageNet Large Scale Visual
Recognition Challenge. arXiv:1409.0575v3.
10 Howard, A.G. (2013). Some Improvements on Deep Convolutional Neural
Network Based Image Classification. arXiv:1312.5402v1.
11 Simonyan, K. and Zisserman, A. (2014). Very Deep Convolutional Networks for
Large-Scale Image Recognition. arXiv:1409.1556v6.
12 Szegedy, C., Liu, W., Jia, Y., et al. (2015). Going deeper with convolutions. IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 1–9.
13 He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image
recognition. IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 770–778.
14 Bianco, S., Cadene, R., Celona, L., and Napoletano, P. (2018). Benchmark
Analysis of Representative Deep Neural Network Architecture.
arXiv:1810.00736v2.
15 Canziani, A., Culurciello, E., and Paszke, A. (2017). An Analysis of Deep Neural
Network Models for Practical Applications. arXiv:1605.07678v4.
Deep Learning
The classical Deep Neural Network (DNN) AlexNet1 [1] is derived from LeNet [2] with wider and deeper layers. It consists of eight layers: the first five are convolutional layers with the nonlinear activation function, the Rectified Linear Unit (ReLU). They are followed by max-pooling layers to reduce the feature map size and Local Response Normalization (LRN) layers to improve the computation stability (Figure 2.1). The last three layers are fully connected layers for object classification (Figure 2.2).
Why does AlexNet show better image classification than LeNet? The deeper DNN model allows the feature maps [3] to evolve from simple features to complete ones. Therefore, the deeper DNN model achieves better top-1 accuracy in the ILSVRC competition. The major drawbacks of the DNN model are intensive computation [4] and high memory bandwidth demands. The convolution requires about 1.1 billion computations, which occupy about 90% of the computational resources (Figure 2.3).
2.1 Neural Network Layer
This section briefly discusses the general neural network layer functions [5–8], including the convolutional layer, activation layer, pooling layer, normalization layer, dropout layer, and fully connected layer.
1 The AlexNet input image size should be 227 × 227 × 3 rather than 224 × 224 × 3, a typo in the original paper.
Figures 2.1 and 2.2 AlexNet architecture: a 224 (227) × 224 × 3 input, five convolutional layers (with max pooling after layers 1, 2, and 5), and three fully connected layers (4096, 4096, and 1000 outputs) [1].
Figure 2.3 Deep neural network AlexNet feature map evolution [3].
Figure 2.4 Convolution with zero padding: the 2 × 2 filter slides over the zero-padded 4 × 4 feature map, and each output element is the dot product of the filter weights and the overlapping input window.
2.1.1 Convolutional Layer
The convolutional layer convolves the input feature maps (ifmaps) with the filter weights, and multiple inputs are processed together as a batch to improve the filter weights' reuse. The output is called the output feature maps (ofmaps). For some network models, an additional bias offset is introduced. Zero-padding is used for edge filtering without reducing the feature size (Figure 2.4). The stride controls the sliding-window step to avoid a large output. The convolution is defined as

$Y = X \otimes W$  (2.1)

$y_{i,j,k} = \sum_{k=0}^{K-1}\sum_{m=0}^{M-1}\sum_{n=0}^{N-1} x_{si+m,\,sj+n,\,k}\; w_{m,n,k} + \beta_{i,j,k}$  (2.2)

$W' = \frac{W - M + 2P}{S} + 1$  (2.3)

$H' = \frac{H - N + 2P}{S} + 1$  (2.4)

$D' = K$  (2.5)
where
$y_{i,j,k}$ is the output feature map with width W′, height H′, and depth D′ at (i, j) for the kth filter
$x_{m,n,k}$ is the input feature map with width W, height H, and depth D at (m, n) for the kth channel
$w_{m,n,k}$ is the kth stacked filter weight with kernel size M (vertical) and N (horizontal)
β is the learning bias
P is the zero-padding size
S is the stride size
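To make Eqs. (2.2)–(2.4) concrete, the following minimal NumPy sketch implements the single-channel case; the function name, array shapes, and example sizes are illustrative assumptions rather than the book's reference code.

import numpy as np

def conv2d(x, w, beta=0.0, stride=1, pad=0):
    # Single-channel convolution (cross-correlation form) following Eq. (2.2)
    x = np.pad(x, pad)                       # zero-padding, P = pad
    kh, kw = w.shape                         # filter height x width
    h_out = (x.shape[0] - kh) // stride + 1  # equals (H - N + 2P)/S + 1, Eq. (2.4)
    w_out = (x.shape[1] - kw) // stride + 1  # equals (W - M + 2P)/S + 1, Eq. (2.3)
    y = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            window = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            y[i, j] = np.sum(window * w) + beta   # dot product plus bias
    return y

x = np.arange(16, dtype=float).reshape(4, 4)   # 4 x 4 input feature map
w = np.ones((2, 2))                            # 2 x 2 filter
print(conv2d(x, w, stride=1, pad=1).shape)     # (5, 5): (4 - 2 + 2)/1 + 1 = 5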
2.1.2 Activation Layer
The conventional nonlinear activation functions are the sigmoid and the hyperbolic tangent:

Sigmoid: $Y = \frac{1}{1 + e^{-x}}$  (2.6)

Hyperbolic tangent: $Y = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$  (2.7)

where
Y is the activation output
x is the activation input
Figure 2.5 Nonlinear activation functions: the conventional sigmoid and hyperbolic tangent, and the modern Rectified Linear Unit (ReLU) Y = max(0, x), Leaky ReLU Y = max(αx, x), and Exponential Linear Unit (ELU) Y = x for x ≥ 0 and Y = α(eˣ − 1) for x < 0.
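A minimal NumPy sketch of these activation functions follows; the vectorized helper names are illustrative, not from the book.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                              # Eq. (2.6)

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))   # Eq. (2.7)

def relu(x):
    return np.maximum(0.0, x)                                    # Y = max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)                              # Y = max(alpha*x, x)

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))        # ELU

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), leaky_relu(x), elu(x))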
2.1.3 Pooling Layer
The pooling layer reduces the feature map size; the output dimensions are

$W' = \frac{W - M}{S} + 1$  (2.13)

$H' = \frac{H - N}{S} + 1$  (2.14)

where
$y_{i,j,k}$ is the pooling output at position (i, j, k) with width W′ and height H′
$x_{m,n,k}$ is the pooling input at position (m, n, k) with width W and height H
M, N are the pooling window width and height
Figure 2.6 Pooling example: 2 × 2 max pooling and average pooling with stride 2 applied to a 4 × 4 input.
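The following NumPy sketch applies 2 × 2 max and average pooling with stride 2 to the 4 × 4 example above; the function name and keyword arguments are illustrative assumptions.

import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    h_out = (x.shape[0] - size) // stride + 1   # Eq. (2.14)
    w_out = (x.shape[1] - size) // stride + 1   # Eq. (2.13)
    y = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            y[i, j] = window.max() if mode == "max" else window.mean()
    return y

x = np.array([[1, 6, 10, 2],
              [14, 7, 3, 13],
              [8, 15, 4, 16],
              [12, 5, 9, 11]], dtype=float)
print(pool2d(x, mode="max"))   # [[14. 13.] [15. 16.]]
print(pool2d(x, mode="avg"))   # [[ 7.  7.] [10. 10.]]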
2.1.4 Normalization Layer
Local Response Normalization (LRN) is defined as

$b^{i}_{x,y} = \dfrac{a^{i}_{x,y}}{\left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^{j}_{x,y}\right)^{2}\right)^{\beta}}$  (2.15)

where
$b^{i}_{x,y}$ is the normalization output at location (x, y)
$a^{i}_{x,y}$ is the normalization input at location (x, y)
α is the normalization constant
β is the contrast constant
k is used to avoid singularities
N is the number of channels
Batch Normalization is defined as

$y_{i} = \gamma\,\dfrac{x_{i} - \mu}{\sqrt{\sigma^{2} + \epsilon^{2}}} + \alpha$  (2.16)

$\mu = \dfrac{1}{n}\sum_{i=0}^{n-1} x_{i}$  (2.17)

$\sigma^{2} = \dfrac{1}{n}\sum_{i=0}^{n-1} (x_{i} - \mu)^{2}$  (2.18)

where
$y_{i}$ is the output of Batch Normalization with depth n
$x_{i}$ is the input of Batch Normalization with depth n
μ and σ are the statistical parameters collected during training
α, ε, and γ are training hyper-parameters
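A minimal NumPy sketch of batch normalization at inference time follows Eqs. (2.16)–(2.18); the scale γ, shift α, and the small constant ε are chosen arbitrarily for illustration.

import numpy as np

def batch_norm(x, gamma=1.0, alpha=0.0, eps=1e-5):
    mu = x.mean()                       # Eq. (2.17)
    var = ((x - mu) ** 2).mean()        # Eq. (2.18)
    return gamma * (x - mu) / np.sqrt(var + eps ** 2) + alpha   # Eq. (2.16)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = batch_norm(x)
print(y.mean().round(6), y.std().round(3))   # ~0.0 mean, ~1.0 standard deviation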
Figure 2.7 AlexNet convolutional and fully connected layers [1].
2.1.6 Fully Connected Layer
The fully connected layer is defined as

$y_{i} = \sum_{m=0}^{M-1}\sum_{n=0}^{N-1} w_{i,m,n}\, x_{m,n}$  (2.19)

where
$y_{i}$ is the fully connected output at position i
$x_{m,n}$ is the fully connected layer input with width M and height N
$w_{i,m,n}$ is the connection weight between the output $y_{i}$ and the input $x_{m,n}$
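The fully connected layer of Eq. (2.19) reduces to a matrix–vector product over the flattened input, as in this small NumPy sketch (the shapes are chosen arbitrarily for illustration).

import numpy as np

def fully_connected(x, w):
    # x: input feature map (M, N); w: weights (outputs, M, N)
    return np.einsum('imn,mn->i', w, x)   # y_i = sum_m sum_n w[i,m,n] * x[m,n]

x = np.random.rand(6, 6)            # 6 x 6 input feature map
w = np.random.rand(10, 6, 6)        # 10 output neurons
print(fully_connected(x, w).shape)  # (10,)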
2.2 Deep Learning Challenges
● Big Input Data: For AlexNet, the 227 × 227 × 3 pixel input image convolves with 96 filters of weights 11 × 11 × 3 and a stride of 4 pixels in the first convolutional layer. It imposes an intense computational requirement, and higher-quality images further increase the load for deep learning processing (Figure 2.8); the parameter and operation counts are illustrated in the sketch after this list.
● Deep Network: AlexNet consists of eight layers. The depth of the models has increased dramatically (e.g. ResNet-152 has 152 layers), which lengthens the training time from hours to days as well as the inference response time.
● Massive Parallel Processing: The convolution occupies over 90% of the computational resources;2 a massively parallel architecture is required to speed up the overall computation. The traditional Central Processing Unit (CPU) is no longer able to handle such requirements.
● Reconfigurable Network: The evolution of deep learning requires a reconfigurable network to fulfill different model demands.
● Memory Bottleneck: High memory access becomes a deep learning challenge. Various memory schemes are developed to resolve this issue.
● Intensive Computation: Deep learning hardware applies floating-point arithmetic for training and inference. It requires complex hardware to process the data with a long runtime and demands a better scheme to speed up the computation.
● Network Pruning: Many filter weights have zero or near-zero values. Network pruning is required to eliminate the unused connections for a more effective network structure.
● Data Sparsity: The deep learning accelerator must skip the ineffectual zeros in order to improve the overall performance.
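To quantify the first bullet, this short Python sketch recomputes the AlexNet layer-1 parameter and multiply–accumulate counts used in the table below (the 55 × 55 output size follows from Eq. (2.3) with W = 227, M = 11, P = 0, S = 4); it is an illustrative calculation, not the book's code.

# First convolutional layer of AlexNet: 96 filters of 11 x 11 x 3, stride 4, no padding
in_h = in_w = 227
in_c, k, n_filters, stride = 3, 11, 96, 4

out_h = (in_h - k) // stride + 1                 # 55, per Eq. (2.3)
params = (k * k * in_c + 1) * n_filters          # weights plus one bias per filter
macs = params * out_h * out_h                    # multiply-accumulates for the whole layer

print(out_h)    # 55
print(params)   # 34944
print(macs)     # 105705600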
AlexNet layer parameters and forward computation:

Layer | Size | Filter | Depth | Stride | Padding | Number of parameters | Forward computation
Conv1 + ReLU | 3 × 227 × 227 | 11 × 11 | 96 | 4 | | (11 × 11 × 3 + 1) × 96 = 34 944 | (11 × 11 × 3 + 1) × 96 × 55 × 55 = 105 705 600
Max pooling | 96 × 55 × 55 | 3 × 3 | | 2 | | |
Norm | 96 × 27 × 27 | | | | | |
Conv2 + ReLU | | 5 × 5 | 256 | 1 | 2 | (5 × 5 × 96 + 1) × 256 = 614 656 | (5 × 5 × 96 + 1) × 256 × 27 × 27 = 448 084 224
Max pooling | 256 × 27 × 27 | 3 × 3 | | 2 | | |
Norm | 256 × 13 × 13 | | | | | |
Conv3 + ReLU | | 3 × 3 | 384 | 1 | 1 | (3 × 3 × 256 + 1) × 384 = 885 120 | (3 × 3 × 256 + 1) × 384 × 13 × 13 = 149 585 280
Conv4 + ReLU | 384 × 13 × 13 | 3 × 3 | 384 | 1 | 1 | (3 × 3 × 384 + 1) × 384 = 1 327 488 | (3 × 3 × 384 + 1) × 384 × 13 × 13 = 224 345 472
Conv5 + ReLU | 384 × 13 × 13 | 3 × 3 | 256 | 1 | 1 | (3 × 3 × 384 + 1) × 256 = 884 992 | (3 × 3 × 384 + 1) × 256 × 13 × 13 = 149 563 648
Max pooling | 256 × 13 × 13 | 3 × 3 | | 2 | | |
Dropout (rate 0.5) | 256 × 6 × 6 | | | | | |
FC6 + ReLU | 4096 | | | | | 256 × 6 × 6 × 4096 = 37 748 736 | 37 748 736
Dropout (rate 0.5) | 4096 | | | | | |
FC7 + ReLU | 4096 | | | | | 4096 × 4096 = 16 777 216 | 16 777 216
FC8 + ReLU | 1000 classes | | | | | 4096 × 1000 = 4 096 000 | 4 096 000
Overall | | | | | | 62 369 152 ≈ 62.3 million (Conv: 3.7 million, 6%; FC: 58.6 million, 94%) | 1 135 906 176 ≈ 1.1 billion (Conv: 1.08 billion, 95%; FC: 58.6 million, 5%)

Exercise
1 Why does the deep learning model often employ a convolutional layer?
10 How can the convolutional layer be modified for the fully connected one?
References
1 Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012). ImageNet Classification with
Deep Convolutional Neural Network. NIPS.
2 LeCun, Y., Kavukcuoglu, K., and Farabet, C. (2010). Convolutional networks and
applications in vision. Proceedings of 2010 IEEE International Symposium on
Circuits and Systems, 253–256.
3 Zeiler, M.D. and Fergus, R. (2013). Visualizing and Understanding Convolutional
Networks. arXiv:1311.2901v3.
4 Gao, H. (2017). A walk-through of AlexNet, 7 August 2017 [Online]. https://
medium.com/@smallfishbigsea/a-walk-through-of-alexnet-6cbd137a5637.
5 Qiu, X. (邱錫鵬) (2019). Neural Networks and Deep Learning (神經網絡與深度學習).
GitHub [Online]. https://nndl.github.io.
6 Alom, M.Z., Taha, T.M., Yakopcic, C., et al. (2018). The History Began from
AlexNet: A Comprehensive Survey on Deep Learning Approaches.
arXiv:1803.01164v2.
7 Sze, V., Chen, Y.-H., Yang, Y.-H., and Emer, J.S. (2017). Efficient processing of deep
neural networks: a tutorial and survey. Proceedings of the IEEE 105 (12):
2295–2329.
8 Abdelouahab, K., Pelcat, M., Serot, J., and Berry, F. (2018). Accelerating CNN
Inference on FPGA: A Survey. arXiv:1806.01683v1.
Parallel Architecture
This chapter describes several popular parallel architectures: the Intel CPU, NVIDIA GPU, Google TPU, and Microsoft NPU. The Intel CPU is designed for general computation and the NVIDIA GPU is targeted at graphical display. However, both employ new memory structures with software support for deep learning applications. The custom designs, the Google TPU and Microsoft NPU, apply novel architectures to resolve deep learning computational challenges.
1 New Vector Neural Network Instruction (VNNI) in 2nd-generation Intel Xeon scalable family.
3.1 Intel Central Processing Unit (CPU)
The new Xeon processor broke the record by training the ResNet-50 model in 31 minutes and the AlexNet model in 11 minutes. Compared with the previous-generation Xeon processor, it improves the training throughput by 2.2× and the inference throughput by 2.4× with the ResNet-18 model using the Intel Neon™ framework.
Figure 3.1 Intel Xeon processor E5-2600 family Grantley platform ring architecture [3].
CHA – Caching and Home Agent; SF – Snoop Filter; LLC – Last Level Cache; SKX Core – Skylake Server Core; UPI – Intel UltraPath Interconnect
Figure 3.2 Intel Xeon processor scalable family Purley platform mesh architecture [3].
Socket addresses are uniformly distributed across the LLC banks, independent of the SNC mode. The overall efficiency is improved through the large LLC storage.
The two-domain configuration consists of SNC domains 0 and 1. Each supports half of the processor cores, half of the LLC banks, and one memory controller with three DDR4 channels. It allows the system to effectively schedule the tasks and allocate the memory for optimal performance (Figure 3.7).
Figure 3.7 Sub-NUMA Clustering (SNC) domain configuration of the Skylake mesh (SKX cores, CHA/SF/LLC banks, and DDR4 memory controllers).
Unlike its predecessor, the data is not necessarily copied to both the MLC and LLC banks. Due to the non-inclusive LLC nature, a snoop filter keeps track of the cache line information when a cache line is absent from the LLC bank.
Figure: relative multi-node training performance (projected time-to-train) vs. the single-worker baseline for GoogLeNet-V1, Inception-V3, ResNet-50, ResNet-152, AlexNet (BS 512), and VGG-16 with 1 node (4 workers), 2 nodes (8 workers), and 4 nodes (16 workers).
Figure 3.11 Intel AVX-512 16-bit FMA using two instructions: VPMADDWD multiplies the 16-bit pairs and produces A0·B0 + A1·B1, and VPADDD adds the 32-bit accumulator C0.
Figure 3.12 Intel AVX-512 with VNNI 16 bits FMA operation (VPDPWSSD).
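The sketch below emulates the fused VPDPWSSD behavior on a single 32-bit lane with NumPy: two signed 16-bit products are summed and accumulated into a signed 32-bit value. It is a behavioral illustration only, not Intel intrinsics.

import numpy as np

def vpdpwssd_lane(a, b, c):
    # a, b: two int16 values each; c: int32 accumulator for one 32-bit lane
    products = a.astype(np.int32) * b.astype(np.int32)   # widen before multiplying
    return np.int32(c + products.sum())                  # A0*B0 + A1*B1 + C0

a = np.array([1200, -300], dtype=np.int16)
b = np.array([25, 40], dtype=np.int16)
c = np.int32(1000)
print(vpdpwssd_lane(a, b, c))   # 1200*25 + (-300)*40 + 1000 = 19000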
Figure: Intel MKL-DNN lower-precision inference flow: fp32 weights, bias, and inputs are quantized to s8/s32/u8, the eight-bit convolution and inner product accumulate into s32, and the outputs are dequantized (or normalized) back to fp32.
The MKL-DNN library is optimized through prefetching, data reuse, cache blocking, data layout, vectorization, and register blocking. Prefetching and data reuse avoid fetching the same data multiple times to reduce the memory access. Cache blocking fits the data block into the cache to maximize the computation. The data layout arranges data consecutively in memory to avoid unnecessary data gather/scatter during the looping operation; it provides better cache utilization and improves the prefetching operation. Vectorization restricts the outer looping dimension to be a multiple of the SIMD width and makes the inner loop operate over groups of the SIMD width for effective computation. With the optimized MKL-DNN library, the overall training and inference using the Intel Xeon scalable processor (Intel Xeon Platinum 8180) significantly outperform its predecessor (Intel Xeon E5-2699 v4) (Figure 3.13).
Recently, the Intel MKL-DNN library has been updated for inference with lower numerical precision [6]. It implements eight-bit convolution with unsigned eight-bit (u8) activations and signed eight-bit (s8) weights to speed up the overall operations. For training, it still supports 32-bit floating-point convolution to achieve better accuracy. This approach optimizes the numerical precision for both training and inference.
The quantization transforms the non-negative activations and the weights from 32-bit floating point to eight-bit integers. It first calculates the quantization factors:

$Q_x = \frac{255}{R_x}$  (3.3)

$Q_w = \frac{127}{R_w}$  (3.4)
where
$R_x$ is the maximum of the activation x
$R_w$ is the maximum of the weights w
$Q_x$ is the quantization factor of the activation x
$Q_w$ is the quantization factor of the weights w
Then, the quantized activation x, weights w, and bias b are rounded to the nearest integer

$x_{u8} = \lVert Q_x\, x_{fp32} \rVert \in [0, 255]$  (3.5)

$w_{s8} = \lVert Q_w\, w_{fp32} \rVert \in [-128, 127]$  (3.6)

$b_{s32} = \lVert Q_x Q_w\, b_{fp32} \rVert$  (3.7)

where
$x_{u8}$ is the unsigned eight-bit integer activation
$x_{fp32}$ is the 32-bit floating-point activation
$w_{s8}$ is the signed eight-bit integer weights
$w_{fp32}$ is the 32-bit floating-point weights
$b_{s32}$ is the signed 32-bit integer bias
$\lVert \cdot \rVert$ is the rounding operation
The integer computation is done using eight-bit multipliers and 32-bit accumulators with the rounding approximation

$y_{fp32} \approx \frac{1}{Q_x Q_w}\, y_{s32}$  (3.12)

$D = \frac{1}{Q_x Q_w}$  (3.14)

where
$y_{fp32}$ is the 32-bit floating-point output
D is the dequantization factor
The scheme is modified to support the activation with a negative value

$Q_{x'} = \frac{255}{R_{x'}}$  (3.16)

where
x′ is the activation with a negative value
$R_{x'}$ is the maximum of the activation x′
$Q_{x'}$ is the quantization factor of the activation x′
The activation and weights are changed, and the bias becomes

$b'_{fp32} = b_{fp32} - \frac{\mathrm{shift} \cdot W_{fp32}}{Q_{x'}}$  (3.19)

where
$b'_{s32}$ is the 32-bit signed integer bias supporting the negative activation
$b'_{fp32}$ is the 32-bit floating-point bias supporting the negative activation
shift performs the shift-left operation to scale up the number
With the same eight-bit multipliers and 32-bit accumulators, the calculation is defined as

$y_{s32} = Q_{x'} Q_w\, y_{fp32}$  (3.24)
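A minimal NumPy sketch of this quantization scheme follows, using the factors and rounding of Eqs. (3.3)–(3.6) and the dequantization of Eqs. (3.12) and (3.14); the helper names and the random test data are illustrative assumptions, not the MKL-DNN implementation.

import numpy as np

def quantize(x_fp32, w_fp32, b_fp32):
    Qx = 255.0 / x_fp32.max()                  # Eq. (3.3), non-negative activations
    Qw = 127.0 / np.abs(w_fp32).max()          # Eq. (3.4)
    x_u8 = np.clip(np.rint(Qx * x_fp32), 0, 255).astype(np.uint8)     # Eq. (3.5)
    w_s8 = np.clip(np.rint(Qw * w_fp32), -128, 127).astype(np.int8)   # Eq. (3.6)
    b_s32 = np.rint(Qx * Qw * b_fp32).astype(np.int32)
    return x_u8, w_s8, b_s32, 1.0 / (Qx * Qw)  # D, Eq. (3.14)

x = np.random.rand(64).astype(np.float32)           # non-negative activations
w = (np.random.rand(64) - 0.5).astype(np.float32)   # weights
b = np.float32(0.25)
x_u8, w_s8, b_s32, D = quantize(x, w, b)

# Integer inner product with a 32-bit accumulator, then dequantize: Eq. (3.12)
y_s32 = np.dot(x_u8.astype(np.int32), w_s8.astype(np.int32)) + b_s32
y_fp32 = D * y_s32
print(abs(y_fp32 - (np.dot(x, w) + b)))   # small quantization error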
With the lower numerical precision Intel MKL-DNN library, both the training and inference throughput are roughly twice those of the predecessor. The software optimization dramatically improves the overall system performance with less power (Figures 3.14 and 3.15).
Figures 3.14 and 3.15 Inference and training throughput (images/s) of Intel Optimized Caffe and TensorFlow for ResNet-50, VGG-16, and Inception v3 at various batch sizes.
3.2 NVIDIA Graphics Processing Unit (GPU)
The NVIDIA Graphics Processing Unit (GPU) is widely applied to deep learning applications (i.e. image classification, speech recognition, and self-driving vehicles) due to its effective floating-point computation and high-speed memory support (Figure 3.16). With the new Turing architecture [7], it speeds up the deep learning training and inference arithmetic operations to 14.2 TFLOPS (FP32) and supports the high-speed NVLink2 interconnect and HBM2 (2nd-generation High Bandwidth Memory) with 900 GB/s bandwidth. The Turing Multi-Process Service further improves the overall performance through a hardware accelerator. The new neural graphics framework NVIDIA NGX™ with NGX DLSS (Deep Learning Super-Sampling) accelerates and enhances graphics, rendering, and other applications (Table 3.2).
The key features of the Turing architecture based on the TU102 GPU (GeForce RTX-2080) are listed as follows:
● Six Graphics Processing Clusters (GPCs)
● Each GPC has Texture Processing Clusters (TPCs) with two Streaming Multiprocessors (SMs) per TPC
● A total of 34 TPCs and 68 SMs
● Each SM has 64 CUDA cores and eight Tensor Cores, with an additional 68 Ray Tracing (RT) cores across the GPU
● GPU clock speed: 1350 MHz
● 14.2 TFLOPS of single-precision (FP32) performance
● 28.5 TFLOPS of half-precision (FP16) performance
● 14.2 TIPS concurrent with FP32, through independent integer execution units
● 113.8 Tensor TFLOPS
Figure 3.16 Turing GPU architecture: Graphics Processing Clusters (GPCs) contain Texture Processing Clusters (TPCs), each with two Streaming Multiprocessors (SMs) holding INT32/FP32 units, Tensor Cores, an RT core, warp schedulers, and a shared data cache/memory.
Figure: Tensor Core operation: FP16 storage/input, full-precision products, FP32 accumulation, sum with more products, and conversion to the FP32 result.
Figure: Tensor Core matrix operation D = A × B + C on 4 × 4 matrices; the Pascal FP16 mode compared with the Turing Tensor Core FP16, INT8, and INT4 modes.
For the 16 × 16 matrix multiplication [10], the warp is first divided into eight thread groups of four threads. Each group computes an 8 × 4 block through four sets of operations; through eight group computations, it creates the 16 × 16 matrix (Figure 3.20).
Matrices A and B are first divided into multiple sets; the instruction then executes on Set 0, followed by Set 1, Set 2, and Set 3. Finally, it correctly computes 4 × 8 elements of matrix D (Figure 3.21).
Figures 3.20–3.22 Tensor Core thread groups (Groups 0–7) computing the 16 × 16 product over Sets 0–3, and the supported warp matrix shapes 32 × 8 × 16 (D[32×8] = A[32×16] × B[16×8] + C[32×8]) and 8 × 32 × 16 (D[8×32] = A[8×16] × B[16×32] + C[8×32]).
3.2.2 Winograd Transform
The Winograd transform reduces the number of multiplications in convolution. The one-dimensional convolution F(2, 2) with two outputs and a two-tap filter is written as

$F(2,2) = \begin{bmatrix} d_0 & d_1 \\ d_1 & d_2 \end{bmatrix} \begin{bmatrix} g_0 \\ g_1 \end{bmatrix}$  (3.15)

$F(2,2) = \begin{bmatrix} m_1 + m_2 \\ m_2 - m_3 \end{bmatrix}$  (3.16)

$m_1 = (d_0 - d_1)\, g_0$  (3.17)

$m_2 = d_1 (g_0 + g_1)$  (3.18)

$m_3 = (d_1 - d_2)\, g_1$  (3.19)

The standard algorithm requires four multiplications and two additions, but the Winograd transform requires three multiplications and five additions.

$F(2,3) = \begin{bmatrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{bmatrix} \begin{bmatrix} g_0 \\ g_1 \\ g_2 \end{bmatrix}$  (3.20)

$F(2,3) = \begin{bmatrix} m_1 + m_2 + m_3 \\ m_2 - m_3 - m_4 \end{bmatrix}$  (3.21)

$m_1 = (d_0 - d_2)\, g_0$  (3.22)

$m_2 = (d_1 + d_2)\, \frac{g_0 + g_1 + g_2}{2}$  (3.23)

$m_3 = (d_2 - d_1)\, \frac{g_0 - g_1 + g_2}{2}$  (3.24)

$m_4 = (d_1 - d_3)\, g_2$  (3.25)

The standard algorithm requires six multiplications and four additions, while the Winograd transform only requires four multiplications, 12 additions, and two shift operations. F(2, 3) is preferred over F(2, 2) for convolution computation due to its efficiency.
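The following NumPy sketch checks the F(2, 3) formulas of Eqs. (3.20)–(3.25) against the direct sliding-window computation; it is an illustrative verification, not production Winograd code.

import numpy as np

def winograd_f23(d, g):
    # d: four inputs d0..d3, g: three filter taps g0..g2 (Eqs. 3.22-3.25)
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])   # Eq. (3.21)

def direct_f23(d, g):
    # Standard algorithm: six multiplications and four additions
    return np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                     d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])

d = np.random.rand(4)
g = np.random.rand(3)
print(np.allclose(winograd_f23(d, g), direct_f23(d, g)))   # True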
The Winograd 1D algorithm is nested to support the Winograd 2D convolution F(m × m, r × r) with an r × r filter and an m × m output tile. The input activation is partitioned into (m + r − 1) × (m + r − 1) tiles with an overlap of r − 1 elements between neighboring tiles. The standard algorithm uses m × m × r × r multiplications, but the Winograd 2D convolution only requires (m + r − 1) × (m + r − 1) multiplications.
For the F(2 × 2, 3 × 3) operation, the standard algorithm performs 2 × 2 × 3 × 3 multiplications (36 in total), but the Winograd 2D convolution only requires (2 + 3 − 1) × (2 + 3 − 1) multiplications (16 in total). It is optimized for convolution and reduces the number of multiplications by 36/16 = 2.25 using a 3 × 3 filter. However, the drawback of the Winograd 2D convolution is that it requires different computations for various filter sizes.
With Simultaneous Multithreading (SMT), the 32-thread warp is divided into subsets, and each subset does not interact with the others. After the matrix multiplication, the subsets are regrouped into the same warp to obtain the results (Figure 3.24).
The new SMT also supports independent thread scheduling with a scheduler optimizer; each thread has its own program counter and stack information. It determines how to group the active threads of the same warp to fully utilize the SIMT unit. For example, a new synchronization feature called syncwarp allows the threads to diverge and reconverge at a fine granularity to maximize parallel efficiency.
Figure: thread divergence and reconvergence with independent thread scheduling: the branch if (threadIdx.x < 4) { A; B; } else { X; Y; } diverges the warp, Z executes after the branch, and __syncwarp() reconverges the threads.
The HBM2 memory stacks DRAM dies on an interposer over the package substrate. Compared with HBM1, HBM2 supports up to eight DRAM dies per stack. The memory size is increased from 2 Gb to 8 Gb per die, and the memory bandwidth is increased from 125 to 180 Gb/s. The high memory bandwidth dramatically enhances the overall GPU performance.
Figure: NVLink2 configurations connecting GPUs to GPUs and CPUs, compared with PCIe-switch-based topologies.
The NVLink2 successfully links CPUs and GPUs together to construct the DGX-1 supercomputer. For example, the NVIDIA DGX-1 with 8 GPUs (Tesla V100) and 2 CPUs (Intel Xeon E5-2698 v4, 2.2 GHz) achieves over 1 PFLOPS performance for deep learning applications.
3.3 NVIDIA Deep Learning Accelerator (NVDLA)
Figure: NVDLA core with the convolution buffer, convolution engine, bridge DMA, and CSB/interrupt interface.
2 http://nvdla.org
3 The NVDLA accelerator is an open-source project that can be implemented using an FPGA approach.
The Local Response Normalization (LRN) operation is defined as

$\mathrm{Result}_{w,h,c} = \dfrac{\mathrm{Source}_{w,h,c}}{\left(j + \dfrac{\alpha}{n} \sum_{i=\max(0,\, c-n/2)}^{\min(C-1,\, c+n/2)} \mathrm{Source}_{w,h,i}^{\,2}\right)^{\beta}}$  (3.26)
Figure: NVDLA system configurations: the small system model (headless; the CPU manages the NVDLA directly through IRQ/CSB, with the system DRAM on the DBB interface) and the large system model (with an additional microcontroller and a dedicated SRAM interface).
The large system model employs an additional coprocessor with the memory
interface to support multiple task local operations. The memory interface con-
nects to the high-bandwidth memory to reduce host loading.
Figure: NVDLA software flow: the parser, compiler, and optimizer produce a loadable; the User-Mode Driver (UMD) submits the jobs, and the Kernel-Mode Driver (KMD) configures the NVDLA hardware.
In order to run the models on NVDLA, the software is divided into two modes, the User-Mode Driver (UMD) and the Kernel-Mode Driver (KMD). The UMD loads the compiled NVDLA loadable and submits the jobs to the KMD. The KMD schedules the layer operations and configures each functional block for inference operations.
3.4 Google Tensor Processing Unit (TPU)
Google successfully deployed the Tensor Processing Unit (TPU) [15, 16] to resolve the growing demand for speech recognition in the datacenter in 2013. The TPU has evolved from the standalone v1 to the cloud v2/v3 [17, 18] to support a wide range of deep learning applications today. The key features of TPU v1 are listed below:
● 256 × 256 eight-bit MAC unit
● 4 MB on-chip Accumulator Memory (AM)
● 24 MB Unified Buffer (UB) – activation memory
● 8 GB off-chip weight DRAM memory
● Two 2133 MHz DDR3 channels
TPU v1 handles six different neural network applications that account for 95% of the workload:
● Multi-Layer Perceptron (MLP): the layer is a set of nonlinear weighted sums of the prior layer outputs (fully connected) and reuses the weights
● Convolutional Neural Network (CNN): the layer is a set of nonlinear weighted sums of spatially nearby subsets of the prior layer outputs and reuses the weights
● Recurrent Neural Network (RNN): the layer is a set of nonlinear weighted sums of the prior layer outputs and the previous state. The popular RNN is the Long Short-Term Memory (LSTM), which determines what state to forget or pass to the next layer and reuses the weights (Table 3.3)
Table 3.3 TPU v1 workload characteristics: name, layers (Conv, Pool, FC, Vector, Total), nonlinear function, weights, Ops/Weight byte, batch size, and deployment share (%).
Figure: TPU v1 block diagram: the PCIe Gen3 ×16 host interface (14 Gb/s) feeds the unified buffer (local activation storage) and the weight FIFO (loaded from the DDR3 weight DRAM at 30 Gb/s); the systolic Matrix Multiply Unit (MMU) streams data at 167 Gb/s into the accumulators (4K × 256 × 32 b), followed by the activation and normalize/pool units under the control unit.
Figure: systolic array cell: each cell multiplies the input data by the stored weight, adds the incoming partial sum with the adder, and passes the result through a register to its neighbor.
The numerical formats compare as follows: the IEEE 32-bit single-precision floating point (FP32) has 1 sign, 8 exponent, and 23 mantissa bits with a range of about 1e−38 to 3e38; the IEEE 16-bit half-precision floating point (FP16) has 1 sign, 5 exponent, and 10 mantissa bits with a range of about 5.96e−8 to 65 504; and the Google 16-bit brain floating-point format (BFP16) has 1 sign, 8 exponent, and 7 mantissa bits, retaining the FP32 range of about 1e−38 to 3e38.
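Because BFP16 simply keeps the sign, the 8-bit exponent, and the top 7 mantissa bits of FP32, a conversion can be sketched as a 16-bit truncation of the FP32 bit pattern; the NumPy round-trip below is illustrative only (real hardware may round instead of truncate).

import numpy as np

def fp32_to_bfp16_bits(x):
    # Reinterpret FP32 as uint32 and keep the upper 16 bits (sign, exponent, 7 mantissa bits)
    return (np.asarray(x, dtype=np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bfp16_bits_to_fp32(b):
    # Place the 16 stored bits back into the upper half of a 32-bit word
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159, -0.001234, 65504.0, 1e-38], dtype=np.float32)
bf = fp32_to_bfp16_bits(x)
print(bfp16_bits_to_fp32(bf))   # close to x, with roughly 2-3 decimal digits of precision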
Figure: TPU roofline model (log–log scale): performance vs. operational intensity (Ops/weight byte) for the MLP0/1, CNN0/1, and LSTM0/1 workloads.
Figure: TPU core with the scalar unit, vector unit, 256 × 256 Multiply Matrix Unit (MMU), and 8 GB DDR3 memory.
The 32-bit floating-point format (FP32) is used for both MXU inputs and outputs; however, the MXU performs BFP16 multiplication internally.
4 Google applies a software approach to optimize the TensorFlow subgraph. The Blaize GSP and Graphcore IPU realize a similar approach through hardware design in Chapter 4.
Figure: TensorFlow computation graph example: the feature maps and filter weights feed a multiplication node, followed by bias addition, ReLU, and softmax to produce the label.
3.5 Microsoft Catapult Fabric Accelerator
With the growth of speech search demand, Microsoft initiated the Brainwave project to develop a low-cost and highly efficient datacenter with the Azure Network using the Catapult fabric accelerator [24–26] for different applications (Figure 3.45).
Figure 3.45 Catapult fabric accelerator card: the FPGA with local DRAM sits between the NIC and the host CPU over PCIe3 links.
Figure: Brainwave software stack: deep learning models enter through a frontend interface, are converted to a portable graph intermediate representation, pass through the graph splitter and optimizer into a transformed graph intermediate representation, and are deployed as a hardware microservice.
Figure: Catapult fabric shell: DDR3 memory, QSPI configuration flash, host CPU PCIe/DMA engine, JTAG, transceivers, the inter-FPGA router, and the soft-core role logic.
Figure: Brainwave neural processing unit: matrix–vector multiply (MVM) tile engines with matrix register files (MRFs) and vector register files (VRFs), multifunction units, and a vector crossbar/arbitration network connected to DRAM.
The accelerator converts FP16 input data into the Microsoft narrow-precision formats (MS-FP8/MS-FP9), similar to the Google BF16 format. MS-FP8/MS-FP9 refer to eight-bit and nine-bit floating-point formats where the mantissa is truncated to only two or three bits. This format can achieve higher accuracy with a better dynamic range. The design employs multiple tile engines to support native-size matrix-vector multiplication. The input data is loaded into the VRFs and the filter weights are stored in the MRFs for multiplication (Figure 3.51).
Each tile engine is supported by a series of Dot-Product Engines (DPEs). Each DPE multiplies the input vector with one row of the matrix tile, and the results are fed to an accumulator.
Figure: accuracy comparison for Model 1 (GRU-based), Model 2 (LSTM-based), and Model 3 (LSTM-based).
Figure: MVM datapath: an FP16-to-BFP converter, a fan-out tree into the tile engines, accumulators, vector addition/reduction, and a BFP-to-FP16 converter around the vector register file, all under a scheduler.
Figure: hierarchical decode and dispatch (HDD): the top-level scheduler dispatches O(T × N) operations, the MVM schedulers O(R × C × T × N), and the MFU decoders O(E × R × C × T × N).
Figure: sparse matrix-vector multiplier: the matrix fetcher streams row lengths, row IDs, column IDs, and matrix values from DRAM into the channel slots; each channel multiplies the matrix value with the corresponding vector value, and a fused accumulator sums the products into the output buffer.
(a) Sparse matrix (8 × 8 with 16 nonzero elements A–P): row 0: A (col 0), B (col 3); row 1: C (col 3); row 2: D (col 4), E (col 5); row 3: F (col 1), G (col 5), H (col 7); row 4: I (col 2), J (col 6), K (col 7); row 5: L (col 3), M (col 7); row 6: N (col 1), O (col 5); row 7: P (col 6)
(b) CSR: values A B C D E F G H I J K L M N O P; column indices 0 3 3 4 5 1 5 7 2 6 7 3 7 1 5 6; row pointer 0 2 3 5 8 11 13 15 16
(c) CISR (four slots): values A C D F B I E G L J N H M K O P; column indices 0 3 4 1 3 2 5 5 3 6 1 7 7 7 5 6; slot number 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3; row lengths 2 1 2 3 2 3 2 1
Figure 3.55 (a) Sparse Matrix; (b) CSR Format; and (c) CISR Format.
The last entry of the row pointer array is the total number of nonzero elements. If two consecutive row pointers are equal, the corresponding row contains no nonzero elements. This approach is difficult to parallelize: an additional buffer is required to store intermediate data for parallel multiplication, and sequential decoding steps are needed to determine the row boundaries.
CISR allows a simple parallel multiplication hardware design. It applies four channel slots for data scheduling. The first nonzero elements of the first four rows (A, C, D, F) are placed in the first slots, and the corresponding column indices are placed in the same order in the indices array. Once a row's elements are exhausted, the next unassigned row is placed in the empty slot. The process is repeated until all nonzero elements are placed in the slots, and any remaining empty slots in a channel are filled with zero padding. The third array records the row lengths. The static row scheduling is controlled by software, which simplifies the hardware design.
For CISR decoding, the decoder first initializes the sequential row IDs and sets the counters from the row lengths FIFO. Each counter decrements every cycle, and its row ID is placed in the channel row ID FIFO for parallel multiplication. When a counter reaches zero, the row is fully processed and a new row ID is assigned. The process repeats until the matrix row lengths array is exhausted and the matrix is fully decoded.
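The sketch below encodes a small sparse matrix into a CISR-like layout with four channel slots and replays it for a sparse matrix-vector product; the function names and data layout are illustrative assumptions that follow the scheduling described above (zero padding of empty slots is omitted).

import numpy as np

def cisr_encode(rows, num_slots=4):
    # rows: list of [(column, value), ...] per matrix row; returns one stream per slot
    streams = [[] for _ in range(num_slots)]
    remaining = list(range(len(rows)))
    active = [remaining.pop(0) if remaining else None for _ in range(num_slots)]
    pos = [0] * num_slots
    while any(r is not None for r in active):
        for s in range(num_slots):
            r = active[s]
            if r is None:
                continue
            if pos[s] < len(rows[r]):
                col, val = rows[r][pos[s]]
                streams[s].append((r, col, val))
                pos[s] += 1
            if pos[s] >= len(rows[r]):
                # row exhausted: statically schedule the next row into this slot
                active[s] = remaining.pop(0) if remaining else None
                pos[s] = 0
    return streams

def cisr_spmv(streams, x, num_rows):
    y = np.zeros(num_rows)
    for stream in streams:                # each channel slot works independently
        for row_id, col, val in stream:
            y[row_id] += val * x[col]     # fused accumulate per row
    return y

# The 8 x 8 example of Figure 3.55 with A..P replaced by 1..16
rows = [[(0, 1), (3, 2)], [(3, 3)], [(4, 4), (5, 5)], [(1, 6), (5, 7), (7, 8)],
        [(2, 9), (6, 10), (7, 11)], [(3, 12), (7, 13)], [(1, 14), (5, 15)], [(6, 16)]]
x = np.arange(1, 9, dtype=float)
dense = np.zeros((8, 8))
for r, entries in enumerate(rows):
    for c, v in entries:
        dense[r, c] = v
print(np.allclose(cisr_spmv(cisr_encode(rows), x, 8), dense @ x))   # True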
Exercise
1 Why does Intel choose mesh configuration over the ring network?
2 What are the advantages of the Intel new AXV-512 VNNI instruction set?
7 Why does Google change 256 × 256 MMU to 128 × 128 MXU?
10 Which is the best approach among Intel, Google, and Microsoft numerical
precision format?
References
1 You, Y., Zhang, Z., Hsieh, C.-J. et al. (2018). ImageNet Training in Minutes.
arXiv:1709.05011v10.
2 Rodriguez, A., Li, W., Dai, J. et al. (2017). Intel® Processors for Deep Learning
Training. [Online]. Available: https://software.intel.com/en-us/articles/
intel-processors-for-deep-learning-training.
3 Mulnix, D. (2017). Intel® Xeon® Processor Scalable Family Technical Overview.
[Online]. Available: https://software.intel.com/en-us/articles/intel-xeon-processor-
scalable-family-technical-overview.
4 Saletore, V., Karkada, D., Sripathi, V. et al. (2018). Boosting Deep Learning Training
& Inference Performance on Intel Xeon and Intel Xeon Phi Processor. Intel.
5 (2019). Introduction to Intel Deep Learning Boost on Second Generation Intel
Xeon Scalable Processors. [Online]. Available: https://software.intel.com/en-us/
articles/introduction-to-intel-deep-learning-boost-on-second-generation-intel-
xeon-scalable.
6 Rodriguez, A., Segal, E., Meiri, E. et al. (2018). Lower Numerical Precision Deep
Learning Inference and Training. Intel.
7 (2018). Nvidia Turing GPU Architecture -Graphics Reinvented. Nvidia.
8 (2017). Nvidia Tesla P100 -The Most Advanced Datacenter Accelerator Ever Built
Featuring Pascal GP100, the World’s Fastest GPU. Nvidia.
9 (2017). Nvidia Tesla V100 GPU Architecture -The World’s Most Advanced Data
Center GPU. Nvidia.
10 Oh, N. (2018). The Nvidia Titan V Deep Learning Deep Dive: It's All About
Tensor Core. [Online].
11 Lavin, A. and Gray, S. (2016). Fast algorithms for convolutional neural networks.
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
4013–4021.
12 Winograd, S. (1980). Arithmetic Complexity of Computations, Society for
Industrial and Applied Mathematics (SIAM).
13 NVDLA Primer. [Online]. Available: http://nvdla.org/primer.html.
14 Farshchi, F., Huang, Q. and Yun, H. (2019). Integrating NVIDIA Deep Learning
Accelerator (NVDLA) with RISC-V SoC on FireSim. arXiv:1903.06495v2.
15 Jouppi, N.P., Young, C., Patil, N. et al. (2017). In-Datacenter Performance
Analysis of a Tensor Processing Unit. arXiv:1704.04760v1.
16 Jouppi, N.P., Young, C., Patil, N. et al. (2018). A Domain-Specific Architecture for
Deep Neural Network. [Online].
17 Teich, P. (2018). Tearing Apart Google’s TPU 3.0 AI Processor. [Online].
18 System Architecture. [Online]. Available: http://cloud.google.com/tpu/docs/
system-architecture.
19 Kung, H. (1982). Why systolic architecture? IEEE Computer 15 (1): 37–46.
20 Kung, S. (1988). VLSI Systolic Array Processors. Prentice-Hall.
21 Dally, W. (2017). High Performance Hardware for Machine Learning, Conference
on Neural Information Processing Systems (NIPS) Tutorial.
22 Williams, S., Waterman, A., and Patterson, D. (2009). Roofline: An insightful
visual performance model for floating-point programs and multicore
architecture. Communications of the ACM 52 (4): 65–76.
23 XLA (2017). TensorFlow, compiled. Google Developers (6 March 2017) [Online].
24 Putnam, A., Caulfield, A.M., Chung, E.S. et al. (2015). A reconfigurable fabric for
accelerating large-scale datacenter services. IEEE Micro 35 (3): 10–22.
25 Caulfield, A.M., Chung, E.S., Putnam, A. et al. (2016). A cloud-scale acceleration
architecture. 2016 49th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO), pp. 1–13.
26 Fowers, J., Ovtcharov, K., Papamichael, M. et al. (2018). A configurable cloud-
scale DNN processor for real-time AI. ACM/IEEE 45th Annual
International Symposium on Computer Architecture (ISCA), 1–14.
27 Chung, E., Fowers, J., Ovtcharov, K. et al. (2018). Serving DNNs in real time at
datacenter scale with project brainwave. IEEE Micro 38 (2): 8–20.
28 Fowers, J., Ovtcharov, K., Strauss, K. et al. (2014). A high memory bandwidth
FPGA accelerator for sparse matrix-vector multiplication. 2014 IEEE 22nd
Annual International Symposium on Field-Programmable Custom Computing
Machines, 36–43.
Streaming Graph Theory
The streaming model processes a sequence of updates $a_i = (j, U_i)$ to a signal A:

$A_i[j] = A_{i-1}[j] + U_i$  (4.1)

where
$a_i$ is the input data, $a_i = (j, U_i)$, which updates the signal A[j]
$A_i$ is the signal after the ith item in the stream
$U_i$ may be positive or negative to indicate the signal arrival or departure
During the stream, it measures the signal A with the following functions (a small sketch follows this list):
● Process time per item ai in the stream (Process)
● Computing time for A (Compute)
● Storage space of At at time t (Storage) (Figure 4.1)
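A minimal Python sketch of the update rule in Eq. (4.1), tracking the processing work and storage as the stream is consumed; the dictionary-based signal and the example stream are illustrative assumptions.

from collections import defaultdict

def consume_stream(stream):
    A = defaultdict(float)        # the signal A[j], stored sparsely
    processed = 0
    for j, U in stream:           # each item a_i = (j, U_i)
        A[j] += U                 # Eq. (4.1): A_i[j] = A_{i-1}[j] + U_i
        processed += 1            # per-item process cost
    return dict(A), processed, len(A)   # signal, items processed, storage used

stream = [(0, 2.0), (3, 1.0), (0, -1.0), (7, 5.0)]   # arrivals and departures
signal, processed, storage = consume_stream(stream)
print(signal)               # {0: 1.0, 3: 1.0, 7: 5.0}
print(processed, storage)   # 4 3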
Currently, the research focuses on streaming graph algorithms [4] and different
partitions [5] to enhance the overall efficiency.
Figure 4.1 Streaming task flow: insert a new task when storage is available, process the task, remove the completed task, and repeat until all tasks are completed (Process, Compute, Storage, Transit).
Figure: Blaize Graph Streaming Processor: a system controller, command unit, graph scheduler, special math unit, data cache, and DMA feed an array of execution tiles; each tile contains a thread scheduler, instruction unit, arbiter, processing elements (PE0–PE3), and a special operation unit, with scalar, move, flow-control, memory-operation, and state-management pipelines.
0 0
A A
1 2
1 2
B C B C
3 4 5
3 4 5
D D
6
6
A B C D 1
A 2
1 3 5 6 3
B 4
2 4 C 5
D 6
Time Time
B W X B W X
* *
P P
+ +
Y Y
P= WX Y= P+B
Y= WX+B
Graphcore develops the Intelligence Processing Unit (IPU) [6–9], which adopts a similar
approach to the Blaize GSP: it applies graph theory to perform fine-grained
operations with massively parallel threads for deep learning applications. It
offers Multiple Instruction Multiple Data (MIMD) parallelism with distributed
local memory.
[Figures: IPU organization — each tile pairs a core with local memory; tiles connect through the IPU exchange, IPU links, PCIe links, and card-to-card links; the accumulating matrix product blocks perform 16-wide and 4-wide multiply-accumulate operations.]
and two links are reserved for intra-band transfer. It connects to the host through
the PCIe-4 link (Figure 4.11).
[Figure 4.13: Intelligence processing unit bulk synchronous parallel execution trace [9] — tile count versus time across supersteps of exchange (E), compute (C), and sync (S) phases.]
[Figure: BSP execution across two chips — host I/O, compute phase, and exchange phase separated by synchronization points, where one tile may abstain from a sync.]
Exercise
3 Why does Blaize choose the Depth-First (DF) rather than the Breadth-First (BF)
scheduling approach?
6 What are the advantages of the graph streaming DNN processor over CPU
and GPU?
References
1 Chung, E., Fowers, J., Ovtcharov, K. et al. (2018). Serving DNNs in real time at
datacenter scale with project brainwave. IEEE Micro 38 (2): 8–20.
2 Fowers, J., Ovtcharov, K., Strauss, K., et al. (2014). A high memory bandwidth
FPGA accelerator for sparse matrix-vector multiplication. IEEE 22nd Annual
International Symposium on Field-Programmable Custom Computing
Machines, 36–43.
3 Blaize. Blaize graph streaming processor: the revolutionary graph-native
architecture. White paper, Blaize [Online].
4 Blaize. Blaize Picasso software development platform for graph streaming
processor (GSP): graph-native software platform. White Paper, Blaize [Online].
5 Muthukrishnan, S. (2015). Data streams: algorithms and applications.
Foundations and Trends in Theoretical Computer Science 1 (2): 117–236.
6 McGregor, A. (2014). Graph stream algorithms: a survey. ACM SIGMOD Record
43 (1): 9–20.
7 Abbas, Z., Kalavri, V., Carbone, P., and Vlassov, V. (2018). Streaming graph
partitioning: an experimental study. Proceedings of the VLDB Endowment 11 (11):
1590–1603.
8 Cook, V.G., Koneru, S., Yin, K., and Munagala, D. (2017). Graph streaming
processor: a next-generation computing architecture. In: Hot Chip. HC29.21.
9 Knowles, S. (2017). Scalable silicon compute. Workshop on Deep Learning at
Supercomputer Scale.
10 Jia, Z., Tillman, B., Maggioni, M., and Scarpazza, D.P. (2019). Dissecting the
Graphcore IPU. arXiv:1912.03413v1.
11 NIPS (2017). Graphcore – Intelligence Processing Unit. NIPS.
Convolution Optimization
[Figure: DCNN accelerator datapath — DRAM feeds an instruction decoder, buffer banks, and column buffers; the input feature maps I(0)–I(Fi−1) and filter weights W(0,0) flow through the compute units, with a remap/scratchpad path toward the output layer.]
The results are sent to the ACCU buffer and accumulated in the scratchpad. The
accelerator repeats the process with the next filter weights until all the input fea-
ture maps are computed.
$$I_o(X, Y) = \sum_{i} I_{di}(X + x_i,\, Y + y_i) \tag{5.1}$$
where
Io is the output image
Idi is the ith decomposed filter output images
(X, Y) is the current output address
(xi, yi) is the ith decomposed filter shift address
The filter decomposition is derived from the equations:

$$F_{3K}(a,b) = \sum_{i=0}^{3K-1}\sum_{j=0}^{3K-1} f(i,j)\, I(a+i,\, b+j)
= \sum_{i=0}^{K-1}\sum_{j=0}^{K-1}\sum_{l=0}^{2}\sum_{m=0}^{2} f(3i+l,\, 3j+m)\, I(a+3i+l,\, b+3j+m)
= \sum_{i=0}^{K-1}\sum_{j=0}^{K-1} F_3^{ij}(a+3i,\, b+3j) \tag{5.2}$$

$$F_3^{ij}(a,b) = \sum_{m=0}^{2}\sum_{l=0}^{2} f(3i+l,\, 3j+m)\, I(a+l,\, b+m), \qquad 0 \le i \le K-1,\; 0 \le j \le K-1 \tag{5.3}$$

where
F_{3K}(a, b) is the filter output with kernel size 3K × 3K
f(i, j) is the filter weight at relative position (i, j)
I(a + 3i + l, b + 3j + m) is the image pixel
F_3^{ij} are the K² decomposed filters with 3 × 3 kernel size
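The decomposition in Equations (5.2) and (5.3) can be sketched in C as follows; the image size, K = 2 (a 6 × 6 filter), and the border handling are illustrative assumptions rather than the accelerator's actual datapath.

```c
/* Minimal sketch of filter decomposition: a 6x6 filter (K = 2) is split into
 * K*K = 4 filters of size 3x3; each decomposed filter is applied at its
 * (3i, 3j) shift and the shifted outputs are summed. */
#define IMG   16
#define K     2                      /* large filter is 3K x 3K = 6 x 6 */
#define OUT   (IMG - 3*K + 1)

static float image[IMG][IMG];
static float filt[3*K][3*K];
static float out[OUT][OUT];

/* One 3x3 decomposed filter F3_ij evaluated at output (a, b) with shift (3i, 3j). */
static float conv3x3_at(int a, int b, int i, int j)
{
    float sum = 0.0f;
    for (int l = 0; l < 3; l++)
        for (int m = 0; m < 3; m++)
            sum += filt[3*i + l][3*j + m] * image[a + 3*i + l][b + 3*j + m];
    return sum;
}

static void decomposed_convolution(void)
{
    for (int a = 0; a < OUT; a++)
        for (int b = 0; b < OUT; b++) {
            float acc = 0.0f;
            for (int i = 0; i < K; i++)          /* combine the K*K shifted   */
                for (int j = 0; j < K; j++)      /* decomposed filter outputs */
                    acc += conv3x3_at(a, b, i, j);
            out[a][b] = acc;
        }
}
```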
[Figure: filter decomposition — a 5 × 5 filter is decomposed into filters F0–F3, each convolved separately, and the decomposed outputs are shifted and added to form the combined output.]
To maximize the buffer bank output bandwidth, the sixteen rows of data are
divided into two sets: the odd-numbered channel data set and the even-numbered one. A FIFO
buffer pair is associated with each data set and transfers eight input rows into ten overlapping
output rows. This enables eight 3 × 3 CU engines to run in parallel on overlapped
data to improve the overall performance (Figure 5.6).
5.1.4 Pooling
The pooling function can be separated into average and max pooling and con-
structed differently in the DCNN accelerator.
$$O_{io}(r, c) = \sum_{ii=0}^{l}\sum_{i=0}^{K-1}\sum_{j=0}^{K-1} I_{ii}(r+i,\, c+j)\, W_{io,ii}(i, j) \tag{5.4}$$

[Figure: average pooling datapath — the odd and even channel column buffers feed eight 3 × 3 CU engines whose outputs E(0,1)–E(7,2) are summed by adders into outputs A and B.]

$$W_{io,ii}(i, j) = \begin{cases} \dfrac{1}{K^2} & \text{if } ii = io \\[4pt] 0 & \text{if } ii \neq io \end{cases} \tag{5.5}$$
where
ii is the input channel number
io is the output channel number
(r, c) is the output feature row and column position
W is the filter weight matrix
K is the average pooling window size
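A minimal C sketch of Equations (5.4) and (5.5), assuming non-overlapping windows (stride = K) and illustrative feature-map sizes, shows how average pooling reduces to a convolution whose only nonzero weights are 1/K² on the matching channel:

```c
#define CH   4
#define IN   8
#define KP   2                         /* average pooling window size K */
#define OUTP (IN / KP)

static float ifmap[CH][IN][IN];
static float ofmap[CH][OUTP][OUTP];

static void average_pool(void)
{
    const float w = 1.0f / (KP * KP);  /* W(io,ii,i,j) = 1/K^2 when ii == io */
    for (int io = 0; io < CH; io++)
        for (int r = 0; r < OUTP; r++)
            for (int c = 0; c < OUTP; c++) {
                float acc = 0.0f;
                for (int i = 0; i < KP; i++)
                    for (int j = 0; j < KP; j++)
                        acc += ifmap[io][r*KP + i][c*KP + j] * w;
                ofmap[io][r][c] = acc; /* cross-channel weights are zero, so skipped */
            }
}
```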
[Figure: max pooling datapath — sixteen scratchpad rows (R0–R15) feed comparator stages A0–A7 with programmable stride and size to produce the max-pooled outputs.]
[Figure: 3 × 3 CU processing element array — a bus decoder and control logic drive a 3 × 3 PE grid with filter/coefficient/address and stride registers, data inputs through a MUX, partial-sum adders for 3 × 3 and 1 × 1 convolution, and a readout multiplexer with ping-pong buffers A and B.]
1 Stanford EIE accelerator in Section 8.1 and MIT Eyeriss accelerator in Section 5.2.
[Figure: deep compression pipeline — network training, pruning of connectivity with weight update (9×–13× reduction), weight clustering and quantization (27×–31× reduction), and weight encoding for an overall 35×–49× model compression.]
5.2 Eyeriss Accelerator
[Figures: convolution as matrix multiplication — a 2 × 2 filter convolved with a 3 × 3 feature map gives a 2 × 2 output; equivalently, the filter is flattened into a 1D vector and multiplied by a Toeplitz matrix built from the unrolled input patches. A processing element holds the filter and ifmap elements in its register scratchpads.]
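A minimal C sketch of the Toeplitz (im2col) conversion shown above, with the 2 × 2 filter and 3 × 3 feature map fixed for illustration:

```c
#define FH 2
#define FW 2
#define IH 3
#define IW 3
#define OH (IH - FH + 1)
#define OW (IW - FW + 1)

static void conv_as_matmul(const float fmap[IH][IW],
                           const float filt[FH][FW],
                           float out[OH * OW])
{
    float toeplitz[FH * FW][OH * OW];            /* unrolled input patches    */

    for (int oy = 0; oy < OH; oy++)
        for (int ox = 0; ox < OW; ox++)
            for (int fy = 0; fy < FH; fy++)
                for (int fx = 0; fx < FW; fx++)
                    toeplitz[fy * FW + fx][oy * OW + ox] = fmap[oy + fy][ox + fx];

    for (int o = 0; o < OH * OW; o++) {          /* 1D filter vector x matrix */
        out[o] = 0.0f;
        for (int k = 0; k < FH * FW; k++)
            out[o] += filt[k / FW][k % FW] * toeplitz[k][o];
    }
}
```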
The first two elements of the fmaps and the corresponding ifmaps are loaded into
the spads and multiplied to create the psums, which are stored in the local
spads for the next multiplication (Figure 5.15).
The next two elements are then loaded into the PE, multiplied again, and accumulated
with the stored psums (Figure 5.16).
Once all the elements are exhausted, the PE writes the
result to the ofmaps. The same fmaps and the next ifmaps are loaded into the spads to
start a new multiplication (Figure 5.17).
[Figures 5.15–5.17: PE register scratchpads across steps — pairs of filter and ifmap elements are loaded, multiplied, and accumulated until all elements are exhausted.]
This dataflow is called output stationary. From the index looping,2 the output index stays constant
until the multiplications between the ifmaps and fmaps are exhausted (Figures 5.18
and 5.19).
[Figures: index traces of the three dataflows for a 1D convolution over the cycle count — output stationary (loop order: for o, for w), weight stationary (for w, for o), and input stationary (for i, for w with outputs[i − w] += inputs[i] × weights[w]) — plotting the output, input, and weight indexes against the cycles.]
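The three loop orderings traced in the figures can be sketched for a 1D convolution as follows; the array sizes are assumptions, and the bound check in the input-stationary loop replaces the figure's `if (i–w) > 0` guard:

```c
#define NW 4
#define NO 8
#define NI (NW + NO - 1)

static float inputs[NI], weights[NW], outputs[NO];

static void output_stationary(void)               /* output index held constant */
{
    for (int o = 0; o < NO; o++)
        for (int w = 0; w < NW; w++)
            outputs[o] += inputs[o + w] * weights[w];
}

static void weight_stationary(void)               /* weight index held constant */
{
    for (int w = 0; w < NW; w++)
        for (int o = 0; o < NO; o++)
            outputs[o] += inputs[o + w] * weights[w];
}

static void input_stationary(void)                /* input index held constant  */
{
    for (int i = 0; i < NI; i++)
        for (int w = 0; w < NW; w++)
            if (i - w >= 0 && i - w < NO)
                outputs[i - w] += inputs[i] * weights[w];
}
```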
[Figures: row-stationary filter reuse — the same filter row is shared across the PE rows (PE2,1–PE3,3) while different ifmap rows stream through, so one filter row convolved with input feature map rows 1 & 2 produces partial-sum rows 1 & 2.]
Each PE processes different ifmaps with the same fmaps to generate the psums. The psums are stored in the PE spads
for further computation. This minimizes the fmaps data movement for energy reduction
(Figure 5.25).
[Figures: row-stationary reuse cases — multiple filters share one input feature map, one filter is reused across feature maps 1 & 2, and multiple channels are accumulated into the partial sums.]
Run-Length Compression (RLC) is used to resolve the memory bottleneck. The data is stored in a 64-bit RLC format with
three pairs of Runs and Levels. The five-bit Run represents up to 31
consecutive zeros, followed by a 16-bit Level that stores the nonzero data. The last bit
indicates whether the word is the last one in the code (Figure 5.28).
Except for the first-layer ifmaps, all the fmaps and ifmaps are encoded in the
RLC format and stored in the external memory. The accelerator reads the encoded
ifmaps from external memory and decodes them through the RLC decoder. After the
convolution, the results pass through the ReLU layer to generate the zero data. The
data are then compressed using the RLC encoder to remove the ineffectual zeros and store
the nonzero elements in RLC format. Finally, the encoded data is written back to the
external memory. This approach introduces 5–10% overhead with a 30–75% compression
ratio. It reduces memory access with significant energy saving.
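A hedged C sketch of the 64-bit RLC word described above; the exact bit ordering inside the word and the flushing of a partially filled final word are assumptions:

```c
#include <stdint.h>
#include <stddef.h>

/* Pack three (Run, Level) pairs: 5-bit run (up to 31 zeros), 16-bit level,
 * and a final bit marking the last word of the code. */
static uint64_t rlc_pack(const uint8_t run[3], const uint16_t level[3], int last)
{
    uint64_t word = 0;
    for (int p = 0; p < 3; p++) {
        word |= (uint64_t)(run[p] & 0x1F)   << (p * 21);       /* 5-bit run    */
        word |= (uint64_t)level[p]          << (p * 21 + 5);   /* 16-bit level */
    }
    word |= (uint64_t)(last != 0) << 63;                       /* last-word bit */
    return word;
}

/* Count zeros (saturating at 31) before each nonzero value and emit a word
 * once three pairs are filled; a run of more than 31 zeros emits a zero level. */
static size_t rlc_encode(const int16_t *data, size_t n, uint64_t *out)
{
    uint8_t run[3] = {0}; uint16_t level[3] = {0};
    size_t pairs = 0, words = 0, zeros = 0;

    for (size_t i = 0; i < n; i++) {
        if (data[i] == 0 && zeros < 31) { zeros++; continue; }
        run[pairs]   = (uint8_t)zeros;
        level[pairs] = (uint16_t)data[i];
        zeros = 0;
        if (++pairs == 3) {
            out[words++] = rlc_pack(run, level, i + 1 == n);
            pairs = 0;
        }
    }
    return words;
}
```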
[Figure: Eyeriss multicast network — the global buffer drives the PE array through a Y-bus and 12 X-buses; each X-bus and PE stores a <col, row> ID, and the multicast controller compares the packet tag against the stored ID to enable delivery only to the matching PEs, while each PE keeps its psum in a local scratchpad with reset accumulation.]
[Figures: X-bus row IDs and PE column IDs used to map different convolutional layer shapes onto the PE array, shown cycle by cycle.]
[Figures: Eyeriss v1 versus Eyeriss v2 — v1 connects a single global buffer to the PE array, while v2 uses a 2D mesh of GLB clusters (ifmaps and psums SRAM banks), router clusters, and PE clusters under a top-level control and configuration block.]
The hierarchical mesh network handles three different types of data3: ifmaps, fmaps, and psums. The hierarchical mesh is organized
with cluster arrays: GLB Cluster, Router Cluster, and PE Cluster (Table 5.3).
The hierarchical mesh supports different data movements:
●● ifmaps are loaded into the GLB cluster, where they are either stored in the GLB memory or transferred to the router cluster
●● psums are stored in the GLB memory after computation; the final ofmaps are directly written back to the external memory
●● fmaps are transferred to the router cluster and stored in the PE spads for computation
The Eyeriss v2 accelerator supports two-level control logic like the Eyeriss v1 accelerator.
The top-level control directs the data transfer between the external memory
and the GLB as well as between the PEs and the GLB. The low-level control handles all PE
operations and processes the data in parallel.
[Figure: hierarchical mesh routing — data moves from a source GLB, router, or PE cluster through the router clusters to the destination clusters.]
[Figure: Eyeriss v2 performance comparison for AlexNet, GoogLeNet, and MobileNet.]
[Figure: CSC encoding example — the nonzero elements A–L are stored column by column in the data vector (A, B, C, D, E, F, G, H, I, J, K, L); the counter vector (1, 0, 0, 0, 1, 2, 3, 1, 1, 0, 0, 0) records the leading zeros before each element, and the address vector (0, 2, 5, 6, 6, 7, 9, 9, 12) gives the starting position of each column, with the final entry holding the total element count.]
The CSC compression format is similar to the RLC one. It employs the data vector to store the nonzero elements. The counter
vector records the number of leading zeros from the previous nonzero element, and an
additional address vector indicates the starting address of each encoded segment. This
allows the PE to easily process the nonzero data. The filter weight sparse matrix is
used to illustrate the CSC encoding scheme. The PE reads the first column nonzero
element A with starting address 0 and one leading zero.
[Figure: Eyeriss v2 PE scratchpads with dual read/write ports for ifmaps, fmaps, and psums.]
For the second column reading, it reads the nonzero element C with starting address 2⁴ and no leading
zero. For the fifth column reading, it repeats the starting address six to indicate the
empty column before the element G. The last element shows the total number of
elements in the data vector (Figure 5.50).
Eyeriss v2 PE is modified to skip the ineffectual zero operations. It consists of seven
pipeline stages with five spads to store the address/data of the ifmaps and fmaps as
well as psums data. Due to data dependency, it first examines the address to deter-
mine the nonzero data. It also loads the ifmaps before fmaps to skip zero ifmaps
operation. If the ifmaps are non-zero with corresponding non-zero fmaps, the data
are pipelined for computation. For zero fmaps, it disables the pipeline to save energy.
The PE supports the Single Instruction Multiple Data (SIMD) operation. It
fetches two fmaps into the pipeline for computation. It not only improves the
throughput but reuses the ifmaps to enhance the system performance.
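A minimal C sketch of the zero-skipping idea (not the actual seven-stage Eyeriss v2 pipeline); the data layout and the two-way fmap SIMD reuse are simplifications:

```c
#define NVALS 64

static void zero_skip_mac(const short ifmap[NVALS],
                          const short fmap0[NVALS],
                          const short fmap1[NVALS],
                          int psum[2])
{
    for (int i = 0; i < NVALS; i++) {
        if (ifmap[i] == 0)
            continue;                       /* skip ineffectual ifmap, no fmap read */
        if (fmap0[i] != 0)
            psum[0] += ifmap[i] * fmap0[i]; /* first SIMD lane                      */
        if (fmap1[i] != 0)
            psum[1] += ifmap[i] * fmap1[i]; /* second SIMD lane reuses the ifmap    */
    }
}
```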
4 The starting address is corresponding to last data vector element B with address 2.
[Figures: Eyeriss v2 PE array utilization (active versus idle PEs) for the RS and RS + dataflows, and per-layer speedup/throughput comparisons across convolutional and fully connected layers.]
Exercise
5 What is the major difference between RLC and CSC encoding approaches?
References
1 Du, L., Du, Y., Li, Y. et al. (2018). A reconfigurable streaming deep convolutional
neural network accelerator for internet of things. IEEE Transactions on Circuits
and Systems I 65 (1): 198–208.
2 Du, L. and Du, Y. (2017). Hardware accelerator design for machine learning. In:
Machine Learning – Advanced Techniques and Emerging Applications. IntechOpen,
pp. 1–14.
3 Han, S., Mao, H. and Dally, W.J. (2016). Deep compression: Compressing deep
neural networks with pruning, trained quantization and Huffman coding.
International Conference on Learning Representations (ICLR).
4 Chen, Y.-H., Krishna, T., Emer, J., and Sze, V. (2017). Eyeriss: an energy-efficient
reconfigurable accelerator for deep convolutional neural network. IEEE Journal of
Solid-State Circuits 52 (1): 127–138.
5 Emer, J., Chen, Y-H., and Sze, V. (2019). DNN Accelerator Architectures.
International Symposium on Computer Architecture (ISCA 2019), Tutorial.
6 Chen, Y.-H., Emer, J. and Sze, V. (2018). Eyeriss v2: A Flexible and High-
Performance Accelerator for Emerging Deep Neural Networks.
arXiv:1807.07928v1.
7 Chen, Y.-H., Yang. T.-J., Emer J., and Sze, V. (2019). Eyeriss v2: A Flexible
Accelerator for Emerging Deep Neural Networks on Mobile Devices. arXiv:
1807.07928v2.
In-Memory Computation
6.1 Neurocube Architecture
[Figures: Hybrid Memory Cube — stacked DRAM dies partitioned into vaults (16 partitions) connected by TSVs and microbumps to a logic die; each vault controller on the logic die links a router and a processing element with cache and MAC units computing Y = A × B + C.]
partitions to form a vault with its vault controller. Each vault is connected to one
processing element. All the vaults operate independently to speed up the
overall operation (Figure 6.2).
The Neurocube accelerator consists of the global controller, the Programmable
Neurosequence Generator (PNG), the 2D mesh network, and the PEs. The
Neurocube accelerator first maps the neural network model, the connection
weights, and the states into the memory stack. The host issues the command to the
PNG, which starts the state machine to stream the data from memory to the PEs for
computation. The data path between the memory stack and the logic layer is
known a priori.
The PE consists of eight Multiply-Accumulate (MAC) units, the cache memory,
a temporal buffer, and a memory module to store the synaptic weights. The PE performs
the computation using a modified 16-bit fixed-point format with 1 sign bit, 7 integer
bits, and 8 fractional bits. It simplifies the hardware design with little accuracy loss. Consider a
neural network layer having eight neurons with three inputs from the previous
layer: it can complete the computation within three cycles. On cycle 1, each MAC
computes the summation from the first inputs, followed by the second inputs with their weights in
cycle 2. All eight neurons are updated in cycle 3.
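A small C sketch of the 16-bit fixed-point format (1 sign, 7 integer, 8 fractional bits); the rounding and saturation policy is an assumption:

```c
#include <stdint.h>

typedef int16_t q7_8_t;                      /* value = raw / 256.0 */

static q7_8_t q7_8_from_float(float x)       { return (q7_8_t)(x * 256.0f); }
static float  q7_8_to_float(q7_8_t x)        { return (float)x / 256.0f; }

/* Multiply-accumulate: the 32-bit product carries 16 fractional bits and is
 * shifted back to 8 fractional bits before accumulation. */
static q7_8_t q7_8_mac(q7_8_t acc, q7_8_t a, q7_8_t b)
{
    int32_t prod = ((int32_t)a * (int32_t)b) >> 8;
    return (q7_8_t)(acc + prod);
}
```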
All PEs are connected using a 2D mesh network through a single router with six
input and six output channels (four for neighbors and two for PE and memory).
Each channel has a 16-deep packet buffer. A rotating daisy-chain priority scheme
is used for packet distribution and the priority is updated every cycle. Each packet
has operation identification (OP-ID) indicating the sequence order and an
operation counter to control the packet sequence. An out-of-order packet is
buffered in the SRAM cache. When all the inputs are available, they are moved to
the temporal buffer for computation (Figure 6.3).
[Figures: 2D mesh router — north/west/east/south ports plus PE and memory ports with a routing LUT and priority register; the control loop iterates until all inputs and weights have arrived and all layer neurons are completed.]
[Figure: Programmable Neurosequence Generator — the global controller holds the vault configuration registers; the address generator uses a lookup table and packet logic toward the vault controller, and three chained counters (MAC ID, connection, neuron) iterate to generate the layer's address sequence.]
where
W is the output image width
Addrlast is the previous layer last address
To program the PNG, the host sends the command to the configuration registers.
It initializes the FSM to start the three loop operations, the MAC computation, the
connection calculation, and the layer processing. When the neuron counter equals
the total number of the layer neurons, the PNG has generated all the data address
sequence for the layer. After the last address is computed, the PNG starts to pro-
gram the next layer.
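The three chained counters can be sketched as nested loops in C; the address computation is reduced to a hypothetical helper because only its inputs (the output image width W and the previous layer's last address) are given above:

```c
struct layer_cfg { int num_macs, num_connections, num_neurons; };

/* Placeholder address function - the real mapping is layer dependent. */
static unsigned make_address(int neuron, int conn, int mac,
                             unsigned width, unsigned addr_last)
{
    return addr_last + (unsigned)((neuron * (int)width + conn) + mac);
}

static void png_generate(const struct layer_cfg *cfg,
                         unsigned width, unsigned addr_last,
                         void (*issue)(unsigned addr))
{
    for (int n = 0; n < cfg->num_neurons; n++)          /* neuron counter     */
        for (int c = 0; c < cfg->num_connections; c++)  /* connection counter */
            for (int m = 0; m < cfg->num_macs; m++)     /* MAC counter        */
                issue(make_address(n, c, m, width, addr_last));
}
```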
6.2 Tetris Accelerator

The Stanford University Tetris accelerator [7] adapts the MIT Eyeriss Row Stationary (RS)
dataflow with an additional 3D memory – the Hybrid Memory Cube (HMC) – to optimize
the memory access for in-memory computation.
[Figures: HMC DRAM die and Tetris logic die — each DRAM die has memory banks with row/column decoders and sense amplifiers connected through address/command and data TSVs and an inter-bank data bus; the logic die of each vault contains a memory controller, router, engine, and a global buffer feeding a 4 × 4 PE array with register files and ALUs.]
●● DRAM Die Accumulation: The accumulators are placed on the DRAM die, which
doesn't affect the array layout. This approach introduces a slight latency overhead, but
it is still small compared to the DRAM access latency.
●● Bank Accumulation: The accumulators are placed in the DRAM bank. Two banks
per vault can update the data in parallel without blocking the TSV bus. The
accumulators sit at the DRAM peripheral, which doesn't affect the array
layout. The duplicated accumulator area is twice that of the DRAM Die Accumulation
option, but it is still small.
●● Subarray Accumulation: The accumulators are located inside the DRAM bank
with a shared bitline. This option can eliminate the data read out from the
DRAM bank. The drawback is the large area overhead with DRAM array layout
modification.
As a result, the DRAM die and bank accumulation options are chosen for the
Tetris accelerator implementation.
[Figure: Tetris partitioning schemes between layer i and layer i + 1 — fmap partitioning, output partitioning, and input partitioning, with ifmaps moved from the global buffer into the register files.]
Fmap partitioning is chosen when the filter weight reuse is not significant. For a fully connected layer,
output partitioning is used to handle the large filter weights.
[Figure: normalized runtime and energy of LPDDR3-1 (L1), TETRIS-1 (T1), LPDDR3-4 (L4), Neurocube-16 (N16), and TETRIS-16 (T16) for AlexNet, ZFNet, VGG16, VGG19, and ResNet, broken down into PE, register file/buffer, NoC, and memory dynamic energy plus total static energy.]
[Figures: Smart Memory Cube system — a host connected to multiple smart memory cubes; inside each SMC, clusters of processing elements (PEs) and NeuroStream (NT) coprocessors share a cluster interconnect with DMA; each NeuroStream PE contains a streaming FPU (multiplier, adder, comparator, mux) with an accumulator, operand FIFOs, address generation units (AGUs), and three nested hardware loops.]
Each 4D tile spans the input and output channels. The output dimensions of each tile are determined by
the input width/height, the filter dimensions, the striding, and the zero-padding parameters.
The 4D Tile applies a row-major data layout and flattens the 2D input
volume into a 1D vector. This allows the DMA to transfer an entire tile to the processing
cluster through a single data request (Figure 6.15).
4D Tile also accounts for the sliding filter window using the overlapping
approach. It stores both the raw tiles and augmented tiles (overlapped region) in a
row-major data format which avoids the high-speed memory data fragmentation
issue. It fetches the complete tile (raw and augmented tiles) from DRAM using a
single data request and converts part of the raw tiles into augmented one for the
next layer during DRAM writeback.
To compute the output tile, the partial sum is reused. After all the input tiles are
read, the activation and pooling are completed, the partial sum Q (also output tile)
writes back to DRAM once
$$Q \leftarrow Q + X \circledast K_Q \tag{6.4}$$
where
Q is the partial sum
X is the input tile with X = M, N, P . . .
K is the filter weight
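A minimal C sketch of Equation (6.4); the tile convolution is abstracted to an element-wise stand-in, and the activation/pooling and write-back hooks are illustrative assumptions:

```c
#define TILES_IN 4
#define TILE_LEN 1024                          /* flattened row-major output tile */

/* Stand-in for the tile convolution q += x (*) K_Q; a real implementation would
 * slide the filter window, but the accumulation pattern is what matters here. */
static void conv_tile(float *q, const float *x, const float *k)
{
    for (int i = 0; i < TILE_LEN; i++)
        q[i] += x[i] * k[i];
}

static void relu_pool(float *q)                /* simplified activation + pooling */
{
    for (int i = 0; i < TILE_LEN; i++)
        if (q[i] < 0.0f) q[i] = 0.0f;
}

/* Reuse the partial sum Q across all input tiles, then apply activation and
 * pooling and write the output tile back to DRAM exactly once. */
static void compute_output_tile(float q[TILE_LEN],
                                const float x[TILES_IN][TILE_LEN],
                                const float k[TILES_IN][TILE_LEN],
                                void (*dram_writeback)(const float *))
{
    for (int t = 0; t < TILES_IN; t++)
        conv_tile(q, x[t], k[t]);
    relu_pool(q);
    dram_writeback(q);
}
```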
To convert the augmented tile for the next layer (l + 1), the tile has four regions
(raw, A, B, C), the raw region of T0(l + 1) is written back to DRAM, followed by A, B,
and C regions of T0(l + 1) after T1(l), T3(l), T4(l) are computed.
Since there is no data overlap among the augmented tiles, each cluster can exe-
cute one tile at a time to reduce the data transfer. All the tile information is stored
in the list. The PE processes every tile based on the list order. Each cluster applies
the ping-pong strategy to minimize the setup time and latency impact. When the
tile is computed inside the cluster, another tile is fetched from memory. It repeats
the process until all the tiles in the layer are computed. All the clusters are syn-
chronized before processing the next layer.
Inside the cluster, each master PE partitions the tile in the order of TXo(l), TYo(l),
and TCo(l) dimensions to avoid synchronization. TCi(l) is used for additional parti-
tions for arbitrarily sized tile and corner tile. 4D Tile performs the spatial and
temporal computation assignment inside the cluster.
[Figures: 4D tile — a tile spans the input width/height (Txi × Tyi), the input channels Tci, and the output channels Tco; an augmented tile adds the overlapping regions A, B, and C around the raw tile T0, and the raw and augmented rows are stored contiguously in DRAM in row-major order.]
[Figure: roofline plot — total performance (GFLOPS) versus average DRAM bandwidth (GB/s) for VGG16, VGG19, and ResNet50 against the roofline limit.]
Exercise
2 What are Hybrid Memory Cube limitations for Deep Learning Application?
6 What is the major difference between Hybrid Memory Cube (HMC) and
Smart Memory Cube (SMC)?
References
1 Singh, G., Chelini, L., Corda, S., et al. (2019). Near-Memory Computing: Past,
Present and Future. arXiv:1908.02640v1.
2 Azarkhish, E., Rossi, D., Loi, I., and Benini, L. (2016). Design and evaluation of a
processing-in-memory architecture for the smart memory cube. In: Architecture of
Computing Systems – ARCS 2016, 19–31. Springer.
3 Azarkhish, E., Pfister, C., Rossi, D. et al. (2017). Logic-Base interconnect design for
near memory computing in the smart memory cube. IEEE Transactions on Very
Large Scale Integration (VLSI) Systems 25 (1): 210–223.
4 Jeddeloh, J. and Keeth, B. (2012). Hybrid memory cube new dram architecture
increases density and performance. 2012 Symposium on VLSI Technology
(VLSIT), 87–88.
5 Kim, D., Kung, J., Chai, S., et al. (2016). Neurocube: a programmable
digital neuromorphic architecture with high-density 3D memory. 2016 ACM/
IEEE 43rd Annual International Symposium on Computer Architecture (ISCA),
380–392.
6 Cavigelli, L., Magno, M., and Benini, L. (2015). Accelerating real-time embedded
scene labeling with convolutional networks. 2015 52nd ACM/EDAC/IEEE Design
Automation Conference (DAC), 1–6.
7 Gao, M., Yang, X., Horowitz, M., and Kozyrakis, C. (2017). Tetris: scalable and
efficient neural network acceleration with 3D memory. Proceedings of the Twenty-
second International Conference on Architectural Support for Programming
Languages and Operating Systems, 751–764.
8 Azarkhish, E., Rossi, D., Loi, I., and Benini, L. (2018). Neurostream: scalable and
energy efficient deep learning with smart memory cubes. IEEE Transactions on
Parallel and Distributed Systems 29(2): 420–434.
Near-Memory Architecture
7.1 DaDianNao Supercomputer
[Figures: DaDianNao node and tile — tiles are connected through north/south/east/west data links with a central eDRAM router; each tile holds four eDRAM banks feeding the NFU with synapses, 16 input neurons, and 16 output neurons, with NBin/NBout buffers for input neurons and output partial sums/gradients.]
The 16-bit fixed-point operations are good enough for inference but reduce the training accuracy.
Therefore, 32-bit fixed-point operations are used for training, with an error rate of
less than 1%.
The DaDianNao supercomputer is programmed with a sequence of simple
node instructions to control the tile operations with three operands: start address
(read/write), step (stride), and the number of iterations. The NFU runs in two modes:
the row operation processes one row at a time, and the batch learning processes
multiple rows in parallel. Batch learning benefits from synapse reuse at the cost of slower convergence.
DaDianNao supercomputer is operated with multiple node mapping. The input
neurons are first distributed to all node tiles through the fat-tree network. It
performs the convolution and pooling locally with low internode communication
except for the boundary input neurons for mapping. The local response
normalization is computed internally without any external communication.
Finally, all the results are grouped for classification with high data traffic. At the
end of every layer operations, the output neurons write back to central eDRAM
and become the input neurons for the next layer. All the operations are done using
computing-and-forward communication scheme, each node computes the input
neurons locally and sends out the results for the next operations without the
global synchronization (Figure 7.4).
[Figures: multi-node operation — tiles with NBin and eDRAM across nodes; the convolution layer feature maps are distributed over nodes 0–3 and regrouped for the classification layer.]
[Figures: DaDianNao speedup over the baseline across convolution (CONV), pooling (POOL), local response normalization (LRN), and classifier (CLASS) layers, for both inference and training, up to the full neural network and the geometric mean.]
7.2 Cnvlutin Accelerator
The Cnvlutin accelerator is proposed to exploit network sparsity with a new memory
architecture. It decouples the original parallel multiplication lanes into independent
groups and encodes the nonzero elements in a new data format during the
operation. The new data format allows the multiplication lanes to skip over the
ineffectual zero operations and process all the data in parallel. It significantly
improves overall performance with less power.
[Figures: Cnvlutin basic operation and unit — (value, offset) neuron pairs from NBin select the matching synapse entries in the SB filter lanes, the products are accumulated into NBout; each unit pairs 16 neuron lanes with 16 filter lanes of 16 synapse lanes each.]
There are 16 adders for partial-sum addition and an additional adder for the output neuron
calculation. The number of neuron lanes and filters per unit can be changed
dynamically during operation (Figure 7.11).
The Cnvlutin accelerator partitions the neuron lanes and the synapse lanes into
16 independent groups. Each group contains a single neuron lane and 16 synapse
lanes from different filters. Every cycle, each synapse lane fetches a single
neuron pair (neuron, offset) from NBin and performs the multiplication with the
corresponding synapse entry based on its offset. The partial sums are accumulated
using 16 adders. This keeps the NFUs busy all the time, and the overall performance
is dramatically improved with less power dissipation (Figure 7.12).
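A minimal C sketch of the decoupled (value, offset) operation described above; the brick size, filter-lane count, and synapse layout are illustrative assumptions:

```c
#define FILTERS   16                       /* filter lanes in one unit  */
#define WINDOW    16                       /* synapse entries per brick */

struct neuron_pair { short value; unsigned char offset; };

static void unit_compute(const struct neuron_pair *pairs, int n_pairs,
                         const short synapse[FILTERS][WINDOW],
                         int psum[FILTERS])
{
    for (int p = 0; p < n_pairs; p++)      /* only nonzero neurons are stored */
        for (int f = 0; f < FILTERS; f++)  /* 16 synapse lanes per neuron lane */
            psum[f] += pairs[p].value * synapse[f][pairs[p].offset];
}
```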
3 Both interleaved and brick assignment are redrawn to clarify the processing order.
[Figure: unit organization — units 0 to 15 each hold 16 neuron lanes (NL) with synapse lanes (SL) and SB slices covering filters 0–255.]
The input neurons n′(0, 0, 0) to n′(0, 0, 15) are first fetched into the neuron lanes NL0 to NL15.
They are multiplied with the corresponding synapse lanes, from SL₀⁰–SL₀¹⁵ in unit 0
to SL₁₅²⁴⁰–SL₁₅²⁵⁵ in unit 15. If brick 0 has only one nonzero value, the
next nonzero value n′(1, 0, 0) is fetched into unit 0 rather than n′(0, 0, 1). This keeps
all the units busy all the time.
Since the input neuron assignment order is changed, the order of the synapses
stored in the synapse sublanes is also altered. For example, S0(0, 0, 0) to S15(0, 0, 0)
are stored in the first slice SL₀⁰ to SL₀¹⁵ for unit 0 NL0, and S240(0, 0, 15) to S255(0, 0, 15)
are stored in the last slice SL₁₅²⁴⁰ to SL₁₅²⁵⁵ for unit 15 NL15.
[Figure: ZFNAf encoding example — each encoded brick stores only the nonzero values with their offsets (e.g., values 7, 6, 5 with offsets 3, 2, 1) in the eDRAM NM banks, and the brick buffer broadcasts neurons n0–n15.]
The dispatcher broadcasts each nonzero neuron to the corresponding neuron lane for processing. It keeps fetching
bricks to avoid NM stalling and improve overall throughput (Figure 7.16).
[Figures: Cnvlutin speedup over the baseline for AlexNet, GoogLeNet, NiN, VGG19, CNN-M, CNN-S, and the geometric mean; normalized static/dynamic power breakdown (NM, SB, logic, SRAM) of the baseline and CNV designs.]
The RoE format stores a brick as <flag <offset, value> <offset, value> …>. For example, (2, 1, 3, 4) can't fit within
65 bits, so it is stored in raw format <0, 2, 1, 3, 4>, where the leading 0 indicates raw data
followed by four 16-bit numbers. The value (1, 2, 0, 4) can be encoded using RoE as
<1 <0, 1> <1, 2> <3, 4>>, where 1 + (16 + 4) × 3 = 61 bits. It significantly reduces
the memory overhead.
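A small C sketch of the RoE sizing rule implied by the example; the 65-bit container and the 16 + 4 bits per nonzero are taken from the numbers above, everything else is an assumption:

```c
#define BRICK       4
#define BUDGET_BITS 65

/* Return 1 when the brick can be RoE-encoded within the bit budget,
 * 0 when it must fall back to the raw form <0, four 16-bit values>. */
static int roe_fits(const short brick[BRICK])
{
    int nonzero = 0;
    for (int i = 0; i < BRICK; i++)
        if (brick[i] != 0)
            nonzero++;
    return 1 + (16 + 4) * nonzero <= BUDGET_BITS;  /* e.g. (1,2,0,4): 1 + 20*3 = 61 bits */
}
```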
[Figures: ineffectual activation handling — the decoder expands an activation brick from NM, values flagged as ineffectual are skipped, and the encoder writes back the compact form; filter weight and activation pairing for two filters over time.]
Exercise
1 What are the design challenges for eDRAM for chip implementation?
4 Why is the ZFNAf format better than the CSC approach for resource
utilization?
6 Can you improve the Cnvlutin brick assignment approach for data transfer?
References
1 Chen, Y., Luo, T., Liu, S. et al. (2014). DaDianNao: A machine-learning
supercomputer. 2014 47th Annual IEEE/ACM International Symposium on
Microarchitecture, 609–622.
2 Chen, T., Du, Z., Sun, N. et al. DianNao: A small-footprint high-throughput
accelerator for ubiquitous machine-learning. ASPLOS ’14, Proceedings of the 19th
Network Sparsity
To address network sparsity, various approaches are proposed to skip the
ineffectual zero operations and improve the system throughput. These include feature
map encoding/indexing, weight sharing/pruning, and quantized prediction
schemes.
8.1 EIE Accelerator

[Figure: EIE processing element — nonzero activation values and their indexes arrive from the neighboring PEs, and the PE pipeline performs pointer read, sparse matrix access, arithmetic, and activation read/write using the relative index.]
The arithmetic unit bypasses the output of the adder to its input if the same accumulator is used in consecutive
clock cycles.
The activation read/write unit contains the source and destination activation
register files for the fully connected layer operation. The roles of the register files are
exchanged during the next layer computation.
[Figure: weight sharing — a 4 × 4 weight matrix is clustered into four centroids, the weights are replaced by 2-bit cluster indexes, and the centroids are fine-tuned with the accumulated gradients scaled by the learning rate.]
The fully connected layer computation is written as

$$b_i = \mathrm{ReLU}\!\left(\sum_{j} W_{ij}\, a_j\right) \tag{8.1}$$

where
a_j is the input activation vector
b_i is the output activation vector
W_ij is the weight matrix
The equation is rewritten with deep compression

$$b_i = \mathrm{ReLU}\!\left(\sum_{j \in X_i \cap Y} S[I_{ij}]\, a_j\right) \tag{8.2}$$

where
X_i is the set of columns j for which W_ij ≠ 0
Y is the set of indices j for which a_j ≠ 0
I_ij is the shared weight index
S is the shared weight table
The multiply-and-accumulate operation is only performed for those columns
where both W_ij and a_j are nonzero.
The EIE accelerator applies an interleaved Compressed Sparse Column (CSC) format to
encode the activation sparsity. A vector v stores the nonzero weights of the
weight matrix W, and an equal-length vector z encodes the number of zeros before the
corresponding entry in v. Both v and z entries are stored in a four-bit format. If
there are more than 15 zeros before a nonzero entry, an additional zero is
included in the vector v. For example:

w = (0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3)
v = (1, 2, 0, 3)
z = (2, 0, 15, 2)

All columns of v and z are stored in one large array pair, with a pointer p_j pointing to
the beginning of column j and p_{j+1} pointing to one beyond its last entry. The
number of nonzero values in column j is then given by p_{j+1} − p_j.
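A hedged C sketch of a sparse matrix–vector product using the v/z/p arrays described above; decoding the row from a running index is a simplification of the PE's relative-index bookkeeping:

```c
static void spmv_csc(const float *v, const int *z, const int *p, int n_cols,
                     const float *a, float *b)
{
    for (int j = 0; j < n_cols; j++) {
        if (a[j] == 0.0f)
            continue;                      /* dynamic activation sparsity      */
        int row = -1;                      /* running row position in column j */
        for (int k = p[j]; k < p[j + 1]; k++) {
            row += z[k] + 1;               /* skip z[k] zeros, land on nonzero */
            b[row] += v[k] * a[j];         /* static weight sparsity           */
        }
    }
}
```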
[Figure 8.4: Matrix W, vector a, and vector b interleaved over four processing elements — e.g., a = (0, 0, a₂, 0, a₄, a₅, 0, a₇), and rows with nonzeros such as W₅,₃, W₅,₇ and W₆,₄, W₆,₆ contribute to the outputs b₅ and b₆.]
Each nonzero input activation a_j is broadcast together with its index j to all PEs. Each PE multiplies the nonzero element a_j by the
corresponding nonzero elements of column W_j of the weight matrix W, then accumulates the
partial sums in an accumulator. The PE only scans the vector v between the pointer
positions p_j and p_{j+1} (Figures 8.4 and 8.5).
The first nonzero input activation is a₂; the value a₂ and its column index 2 are
broadcast to all PEs. Each PE multiplies a₂ by every nonzero value in column w₂:
PE0 multiplies a₂ by w₀,₂ and w₁₂,₂
PE1 has an all-zero w₂ and performs no multiplication
[Figure 8.5: PE0 memory layout — the virtual weight vector (W₀,₀, W₈,₀, W₁₂,₀, …, W₁₂,₇), the relative row index of each entry, and the column pointers (0, 3, 4, 6, 6, 8, 10, 11, 13); PE0 updates b₀ += w₀,₂ · a₂ and b₁₂ += w₁₂,₂ · a₂.]
The interleaved CSC format exploits the dynamic sparsity of activation vector a
and static sparsity of weight matrix W. It allows the PEs to identify the nonzero
value for multiplication. It not only speeds up the multiplication but also saves power.
8.2 Cambricon-X Accelerator
[Figure 8.6: EIE energy efficiency comparison [1]. Figure 8.7: EIE timing performance comparison [1].]
[Figures: Cambricon-X architecture — a buffer controller with DMA connects NBin and NBout to an array of PEs over the bus; each PE contains a scratchpad and a functional unit (PEFU) with parallel multipliers and an adder tree for the synapses and neurons.]
are directly fed into the BCFU. After the PE computation, the results are stored in the BCFU
for future processing or written back to NBout (Figure 8.12).
The Indexing Module (IM) identifies the nonzero neurons in the BC and only transfers the nonzero
indexed neurons for processing. There are two IM options: direct and step indexing.
Direct indexing uses a binary string to indicate the corresponding synapse
state, "1" for existence and "0" for absence. Each bit is added to create the
accumulated string. Then an AND operation between the original
and accumulated strings generates the indexed input neurons (Figure 8.13).
[Figure: indexing module — the indexing unit inside the buffer controller selects the indexed input neurons for each PE before they are sent to the PE's output function unit.]
The step indexing uses the distance between the neurons to address the indexed
input neurons. The distance in the index table is added sequentially to get the
index of the input neuron (Figure 8.14).1
1 The index of the input neuron is incorrect in the original paper. The correct indexes should be (1, 2, 5, 7)
rather than (1, 2, 5, 8) (Figure 8.16).
[Figures: indexing examples — for the input-neuron flags (0, 1, 1, 0, 0, 1, 0, 1), direct indexing accumulates the bits into (0, 1, 2, 2, 2, 3, 3, 4) and the MUX selects n1, n2, n5, and n7; step indexing stores the distances (1, 1, 3, 2), accumulates them into the indexes (1, 2, 5, 7), and selects the same neurons.]
Comparing the direct and step indexing approaches, the cost increases
with sparsity. The cost of step indexing is less than that of direct indexing in terms of area
and power.
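Minimal C sketches of the two indexing options, using the 8-neuron window from the figures; these illustrate the selection logic only, not the hardware adder chains:

```c
#define LANES 8

/* Direct indexing: a presence bitmap (1 = synapse exists) selects the neurons. */
static int direct_index(const unsigned char flags[LANES],
                        const float neurons[LANES],
                        float selected[LANES])
{
    int count = 0;
    for (int i = 0; i < LANES; i++)
        if (flags[i])
            selected[count++] = neurons[i];    /* accumulated bit count picks the neuron */
    return count;
}

/* Step indexing: stored distances between used neurons are accumulated. */
static int step_index(const unsigned char dist[], int n_dist,
                      const float neurons[LANES],
                      float selected[LANES])
{
    int index = 0;
    for (int k = 0; k < n_dist; k++) {
        index += dist[k];                      /* distances (1,1,3,2) -> indexes (1,2,5,7) */
        selected[k] = neurons[index];
    }
    return n_dist;
}
```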
[Figures: Cambricon-X speedup and energy benefit (log10) over GPU, DianNao, and Cambricon-X-dense baselines for dense and sparse networks — LeNet-5, AlexNet, VGG16, D. NN1, D. NN2, Cifar10, and the geometric mean.]
8.3 SCNN Accelerator
The convolution is formulated as a nested loop with seven variables (Figure 8.18).
The Planar Tiled-Input Stationary-Cartesian Product-dense (PT-IS-CP-dense)
dataflow illustrates how to decompose the convolution nested loop for parallel
processing. It adopts the Input Stationary (IS) approach: the input activations are
reused for all the filter weight computations to generate K output channels with
W × H output activations each. With multiple C input channels, the loop order becomes
C → W → H → K → R → S.
The input buffer stores the input activations and filter weights for computation.
The accumulator performs a read-add-write operation to accumulate all the partial
sums and produce the output activations. To improve the performance, a
blocking strategy is used: the K output channels are divided into K/Kc output-channel
groups of size Kc, and only the filter weights and output activations of a
single output-channel group are stored at a time.

[Figure 8.18: convolution dimensions — N inputs, C input channels, K filters of size R × S, and W × H input activations producing K output channels of size (W − R + 1) × (H − S + 1).]

for n = 1 to N
 for k = 1 to K
  for c = 1 to C
   for w = 1 to W
    for h = 1 to H
     for r = 1 to R
      for s = 1 to S
       out[n][k][w][h] += in[n][c][w + r − 1][h + s − 1] × filter[k][c][r][s]
Weight buffer: C × Kc × R × S
Input activation buffer: C × W × H
Output accumulation buffer: Kc × W × H

With the output-channel grouping, the loop order becomes K/Kc → C → W → H → Kc → R → S.
It also exploits the spatial reuse within the PE for intra-PE parallelism. The filter
weights (F) are fetched from the weight buffer with the input activations (I) from
the input activation buffer. They are delivered to the F × I array multipliers to
compute Cartesian Product (CP) of the partial sums. Both filter weights and input
activations are reused to reduce the data access. All partial sums are stored for
further computation without any memory access.
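A minimal C sketch of the F × I Cartesian product with scatter-add accumulation; the coordinate math, bank hash, and sizes are illustrative assumptions rather than the SCNN implementation:

```c
#define F 4
#define I 4
#define ACC_BANKS 32

struct coord { int k, x, y; };                  /* output channel and position */

static int bank_of(struct coord c)              /* hypothetical bank hash      */
{
    return (c.k * 7 + c.x * 3 + c.y) % ACC_BANKS;
}

static void cartesian_product(const float wval[F], const struct coord wcoord[F],
                              const float ival[I], const struct coord icoord[I],
                              float acc[ACC_BANKS])
{
    for (int f = 0; f < F; f++)
        for (int i = 0; i < I; i++) {
            struct coord out = { wcoord[f].k,
                                 icoord[i].x - wcoord[f].x,   /* output x = input x - filter x */
                                 icoord[i].y - wcoord[f].y };
            int b = bank_of(out);
            if (b < 0) b += ACC_BANKS;          /* keep the sketch in range for edge coords */
            acc[b] += wval[f] * ival[i];        /* scatter-add partial product              */
        }
}
```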
The spatial tiling strategy is used to partition the load into a PE array for inter-
PE parallelism. The W × H input activations are divided into smaller Wt × Ht
Planar Tiles (PT) and distributed among the PEs. Each PE operates on its own set
of filter weights and input activations to generate output activations. It also sup-
ports the multiple channels processing that C × Wt × Ht are assigned to PEs for
distributive computations.
Due to the sliding window operation, it introduces the cross-tile dependency at
the tile edge. The data halos are used to solve this issue:
●● Input Halos: The PE input buffer is sized up which is slightly larger than C × Wt
× Ht to accommodate the halos. It duplicates across the adjacent PEs but the
outputs are local to each PE.
●● Output Halos: The PE accumulation buffers are also sized up which are slightly
larger than Kc × Wt × Ht to accommodate the halos. The halos contain incom-
plete partial sums that are communicated with neighbor PE to complete the
accumulation at the end of output channel computation.
The PT-IS-CP-dense dataflow is reformulated as follows (Figure 8.19):
[Figures: SCNN accelerator — a PE array with a layer sequencer and DRAM interface for weights and input/output activations (IA/OA); each PE holds buffer banks, coordinate computation logic, sparse IRAM/ORAM with index storage, and a post-processing unit (PPU) for halos, ReLU, and compression, exchanging halos with its neighbors.]
[Figure: SCNN compressed sparse encoding example — the nonzero weights are kept in a data vector (23, 18, 42, 77, 24, 8) with an index vector (6, 1, 4, 1, 0, 3, 3) recording the number of nonzeros and the zeros between them.]
[Figures: per-layer speedup of SCNN and SCNN (oracle) over the DCNN/DCNN-opt baselines for AlexNet (Conv1–Conv5), GoogLeNet (IC_3a–IC_5b), and VGGNet (Conv1_1–Conv5_3).]
Overall, the SCNN accelerator outperforms the DCNN one by 2.37×, 2.19×, and 3.52× for
AlexNet, GoogLeNet, and VGGNet, respectively. The gap between the SCNN and SCNN (oracle)
designs is due to intra-PE fragmentation and the inter-PE synchronization barrier
(Figure 8.24).
[Figures: per-layer energy of DCNN, DCNN-opt, and SCNN relative to DCNN for AlexNet, GoogLeNet, and VGGNet.]
In terms of energy efficiency, the SCNN accelerator achieves 0.89× to 4.7× over the
DCNN design and 0.76× to 1.9× over the DCNN-opt design, depending on the input
activation density.
8.4 SeerNet Accelerator
The Microsoft SeerNet accelerator [6] is proposed to predict the feature-map sparsity
using quantized convolution. It applies a binary sparsity mask derived from the
quantized feature maps to speed up the inference. The feature maps F and the
filter weights W are quantized into Fq and Wq. It runs the quantized low-bit inference,
Quantized Convolution (Q-Conv), and Quantized ReLU (Q-ReLU) activation
to generate the sparsity mask M. Then, it performs the full-precision sparse
inference over W and F to create the output feature maps F′ (Figure 8.25).
[Figure 8.25: SeerNet flow — the quantized prediction branch runs Q-Conv on Wq and Fq followed by Q-ReLU or Q-max-pooling to produce the sparsity mask M, which guides the full-precision sparse convolution (S-Conv with ReLU or max-pooling) from the input feature map to the output feature map.]

$$x' = \left\lfloor \frac{x}{M} \cdot 2^{\,n-1} \right\rfloor \tag{8.3}$$

[Figure: sparsity mask examples — Q-ReLU maps the quantized feature map to a binary ReLU mask (1 where the value is positive), and Q-max-pooling marks the position of each pooling-window maximum.]
[Figure: quantization example (n = 4) — step 1 finds the maximum absolute value M, step 2 quantizes each value as X′ = floor(X/M · 2ⁿ⁻¹), e.g., (1.2, −1, 0.5) → (8, −6, 3).]
The convolution is expressed as

$$Y = \sum_{i}^{N} W_i \otimes X_i \tag{8.4}$$

where
Y is the output feature map
Xi is the ith input feature map
Wi is the ith filter weight
⊗ is the convolution operator
For integer convolution

$$f(Y) = f\!\left(\sum_{i}^{N} W_i \otimes X_i\right) \tag{8.5}$$

$$f(Y) \approx \sum_{i}^{N} f\!\left(W_i \otimes X_i\right) \tag{8.6}$$

$$f(Y) \approx f_{wx}^{-1}\!\left(\sum_{i}^{N} f_w(W_i) \otimes f_x(X_i)\right) \tag{8.7}$$

where
fx is the input feature map quantization function
fw is the filter weight quantization function
f⁻¹wx is the dequantization function
⊗ is the integer convolution
For quantized ReLU activation (Q-ReLU), only the sign of the output matters

$$\mathrm{sign}\big(f(Y)\big) = \mathrm{sign}\!\left(f_{wx}^{-1}\!\left(\sum_{i}^{N} f_w(W_i) \otimes f_x(X_i)\right)\right) \tag{8.8}$$

$$\mathrm{sign}\big(f(Y)\big) = \mathrm{sign}\!\left(\sum_{i}^{N} f_w(W_i) \otimes f_x(X_i)\right) \tag{8.9}$$

because the dequantization applies only a positive scaling and does not change the sign,
where
sign indicates the positive or negative sign of the function
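A hedged C sketch of the quantized prediction path (Equations 8.3 and 8.8–8.9) for a 1D convolution; the sizes and the per-tensor max-based quantization are simplifying assumptions:

```c
#include <math.h>

#define NQ   4                               /* quantization bits */
#define LEN  32
#define TAPS 3

static int quantize(float x, float max_abs)  /* Equation (8.3) */
{
    return (int)floorf(x / max_abs * (float)(1 << (NQ - 1)));
}

static void predict_mask(const float in[LEN], float in_max,
                         const float w[TAPS], float w_max,
                         unsigned char mask[LEN - TAPS + 1])
{
    int qin[LEN], qw[TAPS];
    for (int i = 0; i < LEN; i++)  qin[i] = quantize(in[i], in_max);
    for (int t = 0; t < TAPS; t++) qw[t]  = quantize(w[t], w_max);

    for (int o = 0; o < LEN - TAPS + 1; o++) {
        int acc = 0;
        for (int t = 0; t < TAPS; t++)
            acc += qw[t] * qin[o + t];       /* low-bit integer convolution       */
        mask[o] = acc > 0;                   /* sign only: the positive dequant   */
    }                                        /* scale is skipped (Equation 8.9)   */
}
```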
Batch normalization is used to reduce the feature map variance shift

$$B = \frac{Y + \mathrm{bias}}{\sqrt{\sigma^{2} + \epsilon}} \tag{8.10}$$

where σ² is the batch variance and bias is the folded batch normalization offset. Applying the quantization to the batch-normalized output gives

$$f(B) = \frac{f\!\left(\displaystyle\sum_{i}^{N} W_i \otimes X_i\right) + f(\mathrm{bias})}{f\!\left(\sqrt{\sigma^{2} + \epsilon}\right)} \tag{8.13}$$

$$f(B) = \frac{\displaystyle\sum_{i}^{N} f_w(W_i) \otimes f_x(X_i) + f(\mathrm{bias})}{f\!\left(\sqrt{\sigma^{2} + \epsilon}\right)} \tag{8.14}$$
[Figure: the binary sparsity mask is vectorized row by row and encoded in a compressed sparse row style, with a column-index vector for the nonzero positions and a row-index (pointer) vector marking where each row starts.]
Exercise
5 Why is Cambricon-X step indexing more difficult than direct indexing for
implementation?
6 How do you apply SCNN PT-IS-CP-sparse dataflow for a fully connected layer?
8 What are the advantages and disadvantages among the network sparsity
approaches?
References
1 Han, S., Liu, X., Mao, H. et al. (2016). EIE: Efficient Inference Engine on
Compressed Deep Neural Network. arXiv:1602.01528v2.
2 Han, S., Mao, H., and Dally, W.J. (2016). Deep compression: Compressing deep
neural networks with pruning, trained quantization and Huffman coding. In
International Conference on Learning Representations (ICLR).
3 Han, S., Pool, J., Tran, J. et al. (2015). Learning both Weights and Connections for
Efficient Neural Networks. arXiv: 1506.02626v3.
4 Zhang, S., Du, Z., Zhang, L. et al. (2016). Cambricon-X: An accelerator for sparse
neural networks. 2016 49th Annual IEEE/ACM International Symposium on
Microarchitecture, 1–12.
5 Parashar, A., Rhu, M., Mukkara, A. et al. (2017). SCNN: An Accelerator for
Compressed-Sparse Convolutional Neural Network. arXiv:1708.04485v1.
6 Cao, S., Ma, L., Xiao, W. et al. (2019). SeerNet: Predicting convolutional neural
network feature-map sparsity through low-bit quantization. Conference on
Computer Vision and Pattern Recognition.
7 Dong, X., Huang, J., Yang, Y. et al. (2017). More is less: A more complicated
network with less inference complexity. Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition.
8 Li, H., Kadav, A., Durdanovic, I. et al. (2017). Pruning Filters for Efficient
ConvNets. arXiv: 1608.08710v3.
9 Rastegari, M., Ordonez, V., Redmon, J. et al. (2016). XNOR-Net: ImageNet
classification using binary convolutional neural networks. European Conference on
Computer Vision.
3D Neural Processing
This chapter introduces the novel 3D neural processing for the deep learning
accelerator. It fully utilizes the 3D Integrated Circuit (3D-IC) advantages to allow
the die with the same layer function to stack together. The 3D Network Bridge
(3D-NB) [1–5] provides an additional routing resource to support neural network
massive interconnect and power requirements. It dissipates the heat using the
Redistribution Layer (RDL) and Through Silicon Via (TSV). The Network-on-
Chip (NoC) applies a high-speed link to solve the memory bottleneck. Finally, the
power and clock gating techniques can be integrated into 3D-NB to support large
neural network processing.
3D-IC is typically divided into 2.5D interposer and 3D fully stacked architecture
(Figures 9.1 and 9.2). For the 2.5D interposer design, different dies are placed on
the interpose and connected using horizontal RDL and vertical TSV. NVIDIA
applies this approach to solve the memory bottleneck. It connects the GPU with
High Bandwidth Memory (HBM) using NVLink2 for high-speed data transfer. For
3D fully stacked design, multiple dies are stacked over each other and connected
through TSV. Neurocube and Tetris accelerator adopt this configuration using
Hybrid Memory Cube (HMC) to perform in-memory computation. It avoids off-
chip data access to speed up the overall operation. The performance is further
enhanced for NeuroStream accelerator using Smart Memory Cube (SMC).
However, the cost of 3D-IC is 20% higher than Application-Specific Integrated
Circuit (ASIC) which limits 3D-IC development (Figure 9.3).
[Figures 9.1 and 9.2: 2.5D interposer design — dies mounted with microbumps on an interposer that connects to the substrate through TSVs and bumps; 3D fully stacked design — multiple dies stacked with a PDN and microbumps, connected by TSVs down to the substrate.]

3D-IC faces two design challenges: power and thermal problems. If multiple
dies are stacked together, an additional power rail is required to supply power from the
lower die to the upper one, resulting in a pyramid-shaped Power Distribution
Network (PDN). It occupies a large amount of routing and area. For the deep
learning accelerator, dies with the same layer functions (convolution, pooling,
activation, and normalization) can't be stacked together due to the different physical
implementations. Stacked dies also suffer from thermal problems because the heat
is difficult to dissipate from the die center. The high temperature degrades the
overall performance.
To overcome 3D-IC design challenges, the RDL PDN with X topology is pro-
posed to improve the current flow. New 3D Network Bridge (3D-NB) provides
additional horizontal RDL/vertical TSV resource to solve the power and ther-
mal issues.
9.2 Power Distribution Network

The PDN is directly related to the chip performance through the power rail voltage drop
called IR drop. An improper PDN introduces high-resistive power rails leading to a
high IR drop, which degrades the overall performance. The chip may not function
under the worst-case IR drop.
For chip design, the maximum IR drop is set to 10%, which translates to roughly 10%
performance degradation.

$$V_{PDN} = R_{PDN} \times I_{chip}$$

where
V_PDN is the power rail voltage drop
R_PDN is the power rail effective resistance
I_chip is the chip current
The chip voltage is expressed as

$$V_{chip} = V_{dd} - V_{PDN}$$

where
V_chip is the chip voltage
V_dd is the supply voltage
Depending on the package technology, the IR drop contour differs between
wire bond and flip-chip designs. For a wire bond design, the power supply bumps1
are located at the chip edge; the IR contour is concave downward, and the maximum
IR drop occurs at the chip center. For a flip-chip design, the power supply
bumps are placed at the chip center; the IR contour is concave upward, and the
maximum IR drop is found at the chip edge. Different PDNs are therefore required for wire
bond and flip-chip designs (Figure 9.4).
[Figure 9.5: PDN metal layers — the power rails on layer n run orthogonally to the rails on layer n − 1 and are connected by vias.]
The fabrication process typically offers multiple metal layers for signal routing.
The narrow width/spacing low metal layer is designed for signal routing. The
thick top metal layer is targeted for wire bond I/O pad to withstand the bonding
of mechanical stress and also used for PDN to minimize the IR drop. PDN is con-
figured using Manhattan topology where the top and bottom metal layers are
placed orthogonally to each other. The power rails are connected using multiple
vias to reduce the effective resistance. Multiple vias also help fulfill electromag-
netic (EM) and Design for Yield (DFY) requirements (Figure 9.5).
For the modified PDN [6–9], the top two thick metal layers are replaced with low-
resistive RDL and the power rails are configured in an X shape. It allows the current
to be evenly distributed over the chip for both wire bond and flip-chip designs. The
X topology is recommended for the top power rails but not for signal routing, because it obstructs
the orthogonal signal routing on the lower metal layers. With the modified PDN, the top
two thick metal power rails can be eliminated, reducing the 3D-IC cost to almost the
same as the ASIC one.

9.3 3D Network Bridge
9.3.1 3D Network-on-Chip
For a deep neural network, there are massive connections between the layers. The
node outputs are connected to the next layer inputs (Figure 9.7). Through the 3D
Network-on-Chip (3D-NoC) approach, the node information is encapsulated in the
packet and broadcasted to the network. The corresponding node fetches the packet
for data processing. The 3D Network Switch (3D-NS) is also proposed to transfer the
data in six different directions (East, South, West, North, Top, and Bottom). It is
implemented using simple back-to-back gated inverters. The gated inverter is pro-
grammed dynamically to support various network topologies (Figure 9.8).
The network is further divided into multiple-level segments. The packet is only
routed to the destination segment. The unused network is turned off to save
power. It significantly reduces the network traffic and improves the overall perfor-
mance. The network sparsity scheme is easy to integrate with the 3D-NoC
approach (Figure 9.9).
[Figures: 3D network bridge — memory and logic dies connected through 3D-NBs, RDL, and TSVs on an interposer (top view and cross-section); the 3D network switch (NS) forwards packets in six directions (East, South, West, North, Top, Bottom); the network is divided into level 1/2/3 segments.]
[Figure: bidirectional high-speed link — differential transmitter and receiver pairs in both directions.]
9.4 Power-Saving Techniques
[Figures: power gating — an NMOS footer switch with a delayed enable chain creates the virtual VSS/VDD rails for each voltage domain, while the always-on logic stays on the global VDD/VSS and the control is routed through the 3D network bridge.]
[Figure: clock tree with gated clock latches.]
The clock tree is balanced to avoid an imbalanced clock due to routing congestion. The 3D-NB also supports
clock gating with a programmable quad flop. The gated clock latch turns the
clock branch on and off for power saving. The programmable quad flop groups four
flops together to share the same clock input and balances the trade-off between
routability and power saving. Its area is typically 10–20% higher than four individual
flops, but it reduces the clock routing by three quarters with half the clock power. The quad
flop is programmed to optimize the drive strength for different loadings and avoids cell
swaps during Engineering Change Order (ECO) to speed up the chip design.
This chapter proposes the novel 3D neural processing approach, which is not limited to
the deep learning accelerator and further extends to other ASIC designs.
Exercise
1 What are three 3D-IC major challenges besides the power and thermal issues?
7 How do you design the programmable quad flop for 3D neural processing?
8 How do you apply the 3D neural network processing approach for other
design applications?
References
This chapter covers different neural network topologies. It includes the popular
historical configurations: Perceptron (P), Feed Forward (FF), Hopfield Network
(HN), Boltzmann Machine (BM), Support Vector Machine (SVM), Convolutional
Neural Network (CNN), and Recurrent Neural Network (RNN).
[Figure: neural network topology chart — cell types (recurrent cell, memory cell, kernel, convolution/pool) and topologies including auto encoder (AE), variational AE (VAE), denoising AE (DAE), sparse AE (SAE), Markov chain (MC), Hopfield network (HN), Boltzmann machine (BM), restricted BM (RBM), deep belief network (DBN), deep convolutional network (DCN), deconvolutional network (DN), deep convolutional inverse graphics network (DCIGN), generative adversarial network (GAN), liquid state machine (LSM), extreme learning machine (ELM), echo state network (ESN), deep residual network (DRN), Kohonen network (KN), support vector machine (SVM), and neural Turing machine (NTM).]
Index

a
Accumulating Matrix Product (AMP) Unit 79–81
Activation 17–18
 exponential linear unit 17–18
 hyperbolic tangent 17–18
 leaky rectified linear unit 17–18
 rectified linear unit 17–18
 sigmoid 17–18
Advanced vector software extension (AVX) 26, 33–34
AlexNet 2, 4, 10, 13–15, 22–23, 110–114
Auto encoder (AE) 6

b
Blaize graph streaming processor (GSP) 73–76
Brain floating-point format (BFP) 56–57
Brainwave Project 61–63
Bulk synchronous parallel model 81–83
 computation phase 81–82
 communication phase 81–82
 synchronization phase 82

c
Cambricon-X accelerator 169–175
 control processor (CP) 169
 buffer controller (BC) 169, 171–172
 computation unit (CU) 169, 171
 direct memory access (DMA) 169
 input neural buffer (NBin) 169, 171–172
 output neural buffer (NBout) 169, 172
Catapult fabric accelerator 61–68
Central processing unit (CPU) 25–34
Clock gating 199–200
Compressed sparse column (CSC) 113
Compressed sparse row (CSR) 68, 70
Condensed Interleaved Sparse Representation (CISR) 68, 70
Convolution 13, 16, 17
 dot product 16, 17
 zero-padding 16, 17
Convolutional neural network (CNN) 2, 4
Cnvlutin accelerator 150–158
 architecture 153–154
 basic operation 151
 processing order 154–155