For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2018-embedded-vision-summit-sze
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Vivienne Sze, Associate Professor at MIT, presents the "Approaches for Energy Efficient Implementation of Deep Neural Networks" tutorial at the May 2018 Embedded Vision Summit.
Deep neural networks (DNNs) are proving very effective for a variety of challenging machine perception tasks. But these algorithms are very computationally demanding. To enable DNNs to be used in practical applications, it’s critical to find efficient ways to implement them.
This talk explores how DNNs are being mapped onto today’s processor architectures, and how these algorithms are evolving to enable improved efficiency. Sze explores the energy consumption of commonly used CNNs versus their accuracy, and provides insights on "energy-aware" pruning of these networks.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2021/09/introduction-to-dnn-model-compression-techniques-a-presentation-from-xailient/
Sabina Pokhrel, Customer Success AI Engineer at Xailient, presents the “Introduction to DNN Model Compression Techniques” tutorial at the May 2021 Embedded Vision Summit.
Embedding real-time, large-scale deep learning vision applications at the edge is challenging due to their huge computational, memory, and bandwidth requirements. System architects can mitigate these demands by applying model compression techniques that make deep neural networks more energy efficient and less demanding of processing resources.
In this talk, Pokhrel provides an introduction to four established techniques for model compression. She discusses network pruning, quantization, knowledge distillation and low-rank factorization compression approaches.
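As a concrete illustration of the first of these techniques, here is a minimal magnitude-based weight-pruning sketch in NumPy (an illustrative example, not code from the talk):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.random.randn(256, 128)
w_sparse = magnitude_prune(w, sparsity=0.8)
print(f"non-zero fraction: {np.count_nonzero(w_sparse) / w_sparse.size:.2f}")
```

Pruned weights like these are typically fine-tuned afterward to recover accuracy, and the resulting sparse matrices can be stored and executed more cheaply.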
Introduction to Deep Learning with Full Details (sonykhan3)
1. Deep learning involves using neural networks with multiple hidden layers to learn representations of data with multiple levels of abstraction.
2. These neural networks are able to learn increasingly complex features from the input data as the number of layers increases. The layers closer to the input learn simpler features while layers further from the input learn complex patterns in the data.
3. A breakthrough in deep learning was developing algorithms that can successfully train deep neural networks by unsupervised learning on each layer before using the learned features for supervised learning on the final layer. This pretraining helps the network learn useful internal representations.
It’s been roughly 30 years since AI was not only a topic for science-fiction writers but also a major research field surrounded by huge hopes and investments. The over-inflated expectations ended in a crash, followed by a period of absent funding and interest – the so-called AI winter. The last three years, however, changed everything again. Deep learning, a machine learning technique inspired by the human brain, has crushed one benchmark after another, and tech companies like Google, Facebook and Microsoft have started to invest billions in AI research. “The pace of progress in artificial general intelligence is incredibly fast” (Elon Musk – CEO Tesla & SpaceX), leading to an AI that “would be either the best or the worst thing ever to happen to humanity” (Stephen Hawking – physicist).
What sparked this new hype? How is deep learning different from previous approaches? Are the advancing AI technologies really a threat to humanity? Let’s look behind the curtain and unravel the reality. This talk explores why Sundar Pichai (CEO Google) recently announced that “machine learning is a core transformative way by which Google is rethinking everything they are doing” and explains why "Deep Learning is probably one of the most exciting things that is happening in the computer industry” (Jen-Hsun Huang – CEO NVIDIA).
Either a new AI “winter is coming” (Ned Stark – House Stark) or this new wave of innovation might turn out to be the “last invention humans ever need to make” (Nick Bostrom – AI philosopher). Or maybe it’s just another great technology helping humans achieve more.
The document discusses Long Short Term Memory (LSTM) networks, which are a type of recurrent neural network capable of learning long-term dependencies. It explains that unlike standard RNNs, LSTMs use forget, input, and output gates to control the flow of information into and out of the cell state, allowing them to better capture long-range temporal dependencies in sequential data like text, audio, and time-series data. The document provides details on how LSTM gates work and how LSTMs can be used for applications involving sequential data like machine translation and question answering.
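To make the gating concrete, the standard LSTM update can be sketched in a few lines of NumPy (illustrative shapes and a stacked-parameter layout are assumed; this is a generic LSTM step, not material from the document):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W (4H x D), U (4H x H) and b (4H,) hold the stacked
    forget/input/output/candidate parameters for hidden size H."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b      # all four gate pre-activations at once
    f = sigmoid(z[0:H])             # forget gate: what to drop from the cell state
    i = sigmoid(z[H:2*H])           # input gate: what new information to admit
    o = sigmoid(z[2*H:3*H])         # output gate: what part of the cell to expose
    g = np.tanh(z[3*H:4*H])         # candidate cell update
    c = f * c_prev + i * g          # new cell state
    h = o * np.tanh(c)              # new hidden state
    return h, c

H, D = 4, 3
rng = np.random.default_rng(0)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.standard_normal(D), h, c,
                 rng.standard_normal((4*H, D)), rng.standard_normal((4*H, H)),
                 np.zeros(4*H))
```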
1) Deep learning is a type of machine learning that uses neural networks with many layers to learn representations of data with multiple levels of abstraction.
2) Deep learning techniques include unsupervised pretrained networks, convolutional neural networks, recurrent neural networks, and recursive neural networks.
3) The advantages of deep learning include automatic feature extraction from raw data with minimal human effort, and surpassing conventional machine learning algorithms in accuracy across many data types.
Model Compression (NanheeKim)
@NanheeKim @nh9k
This presentation was written based on my own study; sources are listed on the last slide.
Please feel free to contact me anytime if you have any questions!
github: https://github.com/nh9k
email: [email protected]
Transfer Learning and Fine-tuning Deep Neural Networks (PyData)
This document outlines Anusua Trivedi's talk on transfer learning and fine-tuning deep neural networks. The talk covers traditional machine learning versus deep learning, using deep convolutional neural networks (DCNNs) for image analysis, transfer learning and fine-tuning DCNNs, recurrent neural networks (RNNs), and case studies applying these techniques to diabetic retinopathy prediction and fashion image caption generation.
- The document discusses trends in AI chips, including the rise of deep learning models enabled by increased computing power and data availability.
- It outlines the AI stack from algorithms and neural network models down to chips, memory, and hardware. Popular deep learning model types and applications are also summarized.
- The trends are towards more specialized hardware like Google's TPUs for cloud servers and dedicated chips for mobile/edge devices from companies like Qualcomm and Nvidia. Processing-in-memory and new memory technologies may help address bandwidth bottlenecks.
- Overall hardware is still catching up to the needs of large neural networks, and there is a lack of unified software tools and frameworks to program diverse AI accelerators.
Spiking Neural Network: An Introduction I (Dalin Zhang)
1) The document discusses Spiking Neural Networks (SNNs), which are a type of neural network that more closely mimic biological neural behavior.
2) It describes the Leaky Integrate-and-Fire (LIF) neuron model, which is commonly used in SNNs. The LIF model integrates inputs over time and generates spikes when the voltage exceeds a threshold (a minimal simulation sketch follows this list).
3) Different encoding approaches are discussed for representing input data as spike trains, including rate coding, temporal coding, population coding, and the Hough Spiker Algorithm. These approaches transform real-valued inputs into spike timings.
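A minimal discrete-time LIF simulation, with illustrative constants not taken from the document, looks like this:

```python
import numpy as np

def lif_simulate(input_current, v_rest=0.0, v_thresh=1.0, tau=20.0, dt=1.0):
    """Simulate a leaky integrate-and-fire neuron over an input-current trace.
    Returns the membrane voltage trace and the spike times (step indices)."""
    v = v_rest
    voltages, spikes = [], []
    for t, i_t in enumerate(input_current):
        # Leaky integration: voltage decays toward rest while driven by input.
        v += (-(v - v_rest) + i_t) * dt / tau
        if v >= v_thresh:          # threshold crossing emits a spike...
            spikes.append(t)
            v = v_rest             # ...and resets the membrane potential
        voltages.append(v)
    return np.array(voltages), spikes

volts, spike_times = lif_simulate(np.full(200, 1.2))
print(f"{len(spike_times)} spikes in 200 steps")
```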
Introduction to Recurrent Neural Network (Knoldus Inc.)
The document provides an introduction to recurrent neural networks (RNNs). It discusses how RNNs differ from feedforward neural networks in that they have internal memory and can use their output from the previous time step as input. This allows RNNs to process sequential data like time series. The document outlines some common RNN types and explains the vanishing gradient problem that can occur in RNNs due to multiplication of small gradient values over many time steps. It discusses solutions to this problem like LSTMs and techniques like weight initialization and gradient clipping.
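Gradient clipping, one of the techniques mentioned, caps the size of updates when backpropagation through many time steps produces unstable (exploding) gradients, the counterpart of the vanishing-gradient problem. A generic global-norm clip in NumPy, not code from the document:

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm is at most
    max_norm, leaving their relative directions unchanged."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads
```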
Faster R-CNN improves object detection by introducing a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. The RPN slides over feature maps and predicts object bounds and objectness at each position. During training, anchors are assigned positive or negative labels based on Intersection over Union with ground truth boxes. Faster R-CNN runs the RPN in parallel with Fast R-CNN for detection, end-to-end in a single network and stage. This achieves state-of-the-art object detection speed and accuracy while eliminating computationally expensive selective search for proposals.
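The anchor-labeling rule can be sketched as follows; the 0.7/0.3 IoU thresholds match those reported for the original Faster R-CNN, the helper names are illustrative, and this is simplified (the paper additionally labels the highest-overlap anchor for each ground-truth box as positive):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Assign an RPN training label: 1 (object), 0 (background), -1 (ignored)."""
    best = max(iou(anchor, gt) for gt in gt_boxes)
    if best >= pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return -1  # anchors in between do not contribute to the loss
```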
Image Classification with Deep Neural Networks (Yogendra Tamang)
This document discusses image classification using deep neural networks. It provides background on image classification and convolutional neural networks. The document outlines techniques like activation functions, pooling, dropout and data augmentation to prevent overfitting. It summarizes a paper on ImageNet classification using CNNs with multiple convolutional and fully connected layers. The paper achieved state-of-the-art results on ImageNet in 2010 and 2012 by training CNNs on a large dataset using multiple GPUs.
AlexNet (ImageNet Classification with Deep Convolutional Neural Networks) (UMBC)
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
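The “dropout” regularization mentioned in the abstract can be sketched in a few lines; this uses the modern “inverted” formulation, which rescales surviving activations during training (the original method instead scaled outputs at test time):

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True, rng=np.random.default_rng()):
    """Inverted dropout: randomly zero units during training and rescale the
    survivors so the expected activation is unchanged at inference time."""
    if not training or p_drop == 0.0:
        return activations
    keep = rng.random(activations.shape) >= p_drop
    return activations * keep / (1.0 - p_drop)
```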
Hardware for deep learning includes CPUs, GPUs, FPGAs, and ASICs. CPUs are general purpose but support deep learning through instructions like AVX-512 and libraries. GPUs like NVIDIA and AMD models are commonly used due to high parallelism and memory bandwidth. FPGAs offer high efficiency but require specialized programming. ASICs like Google's TPU are customized for deep learning and provide high performance but limited flexibility. Emerging hardware aims to improve efficiency and better match neural network computations.
The document discusses computer vision with deep learning. It provides an overview of convolutional neural networks and their use in computer vision applications like image classification and object detection. Specifically, it discusses how CNNs use convolutional layers to learn visual features from images and provide examples of CNNs being used for pipeline defect classification and filler cap quality control.
Deep learning is a type of machine learning that uses neural networks with multiple layers between the input and output layers. It allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. Deep learning has achieved great success in computer vision, speech recognition, and natural language processing due to recent advances in algorithms, computing power, and the availability of large datasets. Deep learning models can learn complex patterns directly from large amounts of unlabeled data without relying on human-engineered features.
FPGA Hardware Accelerator for Machine Learning
Machine learning publications and models are growing exponentially, outpacing Moore's law. Hardware acceleration using FPGAs, GPUs, and ASICs can provide performance gains over CPU-only implementations for machine learning workloads. FPGAs allow for reprogramming after manufacturing and can accelerate parts of machine learning algorithms through customized hardware while sharing computations between the FPGA and CPU. Vitis AI is a software stack that optimizes machine learning models for deployment on Xilinx FPGAs, providing pre-optimized models, tools for optimization and quantization, and high-level APIs.
This document provides an overview of deep learning, including its history, algorithms, tools, and applications. It begins with the history and evolution of deep learning techniques. It then discusses popular deep learning algorithms like convolutional neural networks, recurrent neural networks, autoencoders, and deep reinforcement learning. It also covers commonly used tools for deep learning and highlights applications in areas such as computer vision, natural language processing, and games. In the end, it discusses the future outlook and opportunities of deep learning.
Recurrent Neural Networks are popular deep learning models that have shown great promise in achieving state-of-the-art results in many tasks, including computer vision, NLP, finance and much more. Although these models were proposed several years ago, RNNs have gained popularity only recently. In this talk, we will review how these models evolved over the years, dissect the RNN architecture, and survey current applications and future directions.
This document provides an agenda for a presentation on deep learning, neural networks, convolutional neural networks, and interesting applications. The presentation will include introductions to deep learning and how it differs from traditional machine learning by learning feature representations from data. It will cover the history of neural networks and breakthroughs that enabled training of deeper models. Convolutional neural network architectures will be overviewed, including convolutional, pooling, and dense layers. Applications like recommendation systems, natural language processing, and computer vision will also be discussed. There will be a question and answer section.
This document summarizes Melanie Swan's presentation on deep learning. It began with defining key deep learning concepts and techniques, including neural networks, supervised vs. unsupervised learning, and convolutional neural networks. It then explained how deep learning works by using multiple processing layers to extract higher-level features from data and make predictions. Deep learning has various applications like image recognition and speech recognition. The presentation concluded by discussing how deep learning is inspired by concepts from physics and statistical mechanics.
A fast-paced introduction to Deep Learning concepts, such as activation functions, cost functions, back propagation, and then a quick dive into CNNs. Basic knowledge of vectors, matrices, and derivatives is helpful in order to derive the maximum benefit from this session.
Artificial Neural Network Seminar Presentation Using PPT (Mohd Faiz)
- Artificial neural networks are inspired by biological neural networks and learning processes. They attempt to mimic the workings of the brain using simple units called artificial neurons that are connected in networks.
- Learning in neural networks involves modifying the synaptic strengths between neurons through mathematical optimization techniques. The goal is to minimize an error function that measures how well the network can approximate or complete a task.
- Neural networks can learn complex nonlinear functions through training algorithms like backpropagation that determine how to adjust the synaptic weights to improve performance on the learning task.
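The weight-adjustment principle can be illustrated with plain gradient descent on a single linear unit (a toy sketch of the idea that backpropagation generalizes to multiple layers, not the full algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))            # 100 samples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.standard_normal(100)

w = np.zeros(3)                              # "synaptic weights" to learn
lr = 0.1
for _ in range(200):
    err = X @ w - y                          # prediction error
    grad = X.T @ err / len(y)                # gradient of mean squared error
    w -= lr * grad                           # step downhill to reduce the error
print(w)                                     # approaches true_w
```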
DLD Meetup 2017, Efficient Deep Learning (Brodmann17)
The document discusses efficient techniques for deep learning on edge devices. It begins by noting that deep neural networks have high computational complexity which makes inference inefficient for edge devices without powerful GPUs. It then outlines the deep learning stack from hardware to libraries to frameworks to algorithms. The document focuses on how algorithms define model complexity and discusses the evolution of CNN architectures from LeNet5 to ResNet which generally increased in complexity. It covers techniques for reducing model size and operations like pruning, quantization, and knowledge distillation. The challenges of real-life applications on edge devices are discussed.
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2019-embedded-vision-summit-wu
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Bichen Wu, Graduate Student Researcher in the EECS Department at the University of California, Berkeley, presents the "Enabling Automated Design of Computationally Efficient Deep Neural Networks" tutorial at the May 2019 Embedded Vision Summit.
Efficient deep neural networks are increasingly important in the age of AIoT (AI + IoT), in which people hope to deploy intelligent sensors and systems at scale. However, optimizing neural networks to achieve both high accuracy and efficient resource use on different target devices is difficult, since each device has its own idiosyncrasies.
In this talk, Wu introduces differentiable neural architecture search (DNAS), an approach for hardware-aware neural network architecture search. He shows that, using DNAS, the computation cost of the search itself is two orders of magnitude lower than previous approaches, while the models found by DNAS are optimized for target devices and surpass the previous state-of-the-art in efficiency and accuracy. Wu also explains how he used DNAS to find a new family of efficient neural networks called FBNets.
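The core idea of DNAS, relaxing a discrete choice among candidate operations into a differentiable weighted mix, can be sketched roughly as follows (the operations here are illustrative stand-ins, and this omits the hardware-cost term and the backward pass):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=np.random.default_rng()):
    """Differentiable relaxation of a categorical choice over candidate ops."""
    g = -np.log(-np.log(rng.random(logits.shape)))   # Gumbel(0, 1) noise
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

# Candidate operations for one layer of the supernet (illustrative stand-ins
# for convolutions of different cost).
ops = [lambda x: x,                 # skip connection
       lambda x: np.maximum(x, 0),  # cheap op stand-in
       lambda x: np.tanh(x)]        # expensive op stand-in

theta = np.zeros(3)   # architecture parameters, learned jointly by gradient descent
x = np.ones(4)
w = gumbel_softmax(theta, tau=0.5)
layer_out = sum(wi * op(x) for wi, op in zip(w, ops))  # weighted mix of candidates
```

After training, the highest-weighted operation at each layer is kept, yielding a discrete architecture.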
Architecture Design for Deep Neural Networks I (Wanjin Yu)
This document summarizes Gao Huang's presentation on neural architectures for efficient inference. The presentation covered three parts: 1) macro-architecture innovations in convolutional neural networks (CNNs) such as ResNet, DenseNet, and multi-scale networks; 2) micro-architecture innovations including group convolution, depthwise separable convolution, and attention mechanisms; and 3) moving from static networks to dynamic networks that can adaptively select simpler or more complex models based on input complexity. The key idea is to enable faster yet accurate inference by matching computational cost to input difficulty.
In this talk, after a brief overview of AI concepts, in particular machine learning (ML) techniques, some of the well-known computer design concepts for high performance and power efficiency are presented. Subsequently, those techniques that have had a promising impact on computing ML algorithms are discussed. Deep learning has emerged as a game changer for many applications in various fields of engineering and medical sciences. Although the primary computation function is matrix-vector multiplication, many competing efficient implementations of this primary function have been proposed and put into practice. This talk reviews and compares some of the techniques used for ML computer design.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2023/11/introduction-to-computer-vision-with-cnns-a-presentation-from-mohammad-haghighat/
Independent consultant Mohammad Haghighat presents the “Introduction to Computer Vision with Convolutional Neural Networks” tutorial at the May 2023 Embedded Vision Summit.
This presentation covers the basics of computer vision using convolutional neural networks. Haghighat begins by introducing some important conventional computer vision techniques and then transitions to explaining the basics of machine learning and convolutional neural networks (CNNs) and showing how CNNs are used in visual perception.
Haghighat illustrates the building blocks and computational elements of neural networks through examples. This session provides an overview of how modern computer vision algorithms are designed, trained and used in real-world applications.
In this talk, an overview of current trends in machine learning will be discussed with an emphasis on challenges and opportunities facing this field. It will focus on deep learning methods and applications. Deep learning has emerged as one of the most promising research fields in artificial intelligence. The significant advancements that deep learning methods have brought about for large scale image classification tasks have generated a surge of excitement in applying the techniques to other problems in computer vision and more broadly into other disciplines of computer science. Moreover, the impact of machine learning on education, research, and economy will be briefly presented. The rapid growth of machine learning is positioned to impact our lives in a way that we have not been able to fully imagine. It behooves government leaders to take a lead in developing the necessary resources to ride the projected benefits of machine learning.
Opening Keynote at GTC 2015: Leaps in Visual Computing (NVIDIA)
NVIDIA CEO and co-founder Jen-Hsun Huang took the stage for the GPU Technology Conference in the San Jose Convention Center to present some major announcements on March 17, 2015. You'll find out how NVIDIA is innovating in the field of deep learning, what NVIDIA DRIVE PX can do for automakers, and where Pascal, the next-generation GPU architecture, fits in the new performance roadmap.
Application of Neural Networks in Embedded Systems Applications (Ihor Starepr...) (IT Arena)
Lviv IT Arena is a conference specially designed for programmers, designers, developers, top managers, investors, entrepreneurs and startup founders. Annually it takes place at the beginning of October in Lviv at the Arena Lviv stadium. In 2016 the conference gathered more than 1800 participants and over 100 speakers from companies like Microsoft, Philips, Twitter, UBER and IBM. More details about the conference at itarena.lviv.ua.
This document discusses techniques for deploying deep learning models on low-power devices with limited compute resources. It describes methods such as parameter quantization, pruning parameters and connections, convolutional filter compression, matrix factorization, network architecture search, and knowledge distillation that can reduce model size and computational requirements while maintaining accuracy. Parameter quantization decreases model size by reducing precision. Pruning removes redundant connections and filters. Filter compression replaces large filters with smaller ones. Matrix factorization and architecture search optimize models for efficiency. Knowledge distillation transfers knowledge from large models to smaller, more efficient ones.
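Parameter quantization, the first technique listed, can be sketched as uniform symmetric int8 quantization (a generic illustration, not the document's specific method):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Uniform symmetric quantization of float weights to int8, plus the scale
    needed to dequantize. Storage drops 4x versus float32."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(dequantize(q, s) - w).max())
```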
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/auvizsystems/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Nagesh Gupta, CEO and Founder of Auviz Systems, presents the "Trade-offs in Implementing Deep Neural Networks on FPGAs" tutorial at the May 2015 Embedded Vision Summit.
Video and images are a key part of Internet traffic—think of all the data generated by social networking sites such as Facebook and Instagram—and this trend continues to grow. Extracting usable information from video and images is thus a growing requirement in the data center. For example, object and face recognition are valuable for a wide range of uses, from social applications to security applications. Convolutional neural networks (CNNs) are currently the most popular form of deep neural network used in data centers for such applications. 3D convolutions are a core part of CNNs. Nagesh presents alternative implementations of 3D convolutions on FPGAs, and discusses trade-offs among them.
1. The document summarizes several papers on deep learning and convolutional neural networks. It discusses techniques like pruning weights, trained quantization, Huffman coding, and designing networks with fewer parameters like SqueezeNet.
2. One paper proposes techniques to compress deep neural networks by pruning, trained quantization, and Huffman coding to reduce model size. It evaluates these techniques on networks for MNIST and ImageNet, achieving compression rates of 35x to 49x with no loss of accuracy.
3. Another paper introduces SqueezeNet, a CNN architecture with AlexNet-level accuracy but 50x fewer parameters and a model size of less than 0.5MB. It employs fire modules that use 1x1 convolutions to squeeze the number of channels feeding larger filters, drastically cutting the parameter count (a sketch of the idea follows).
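A rough sketch of the fire-module idea, with the 3x3 expand branch simplified to 1x1 for brevity (illustrative only, not the paper's code):

```python
import numpy as np

def conv1x1(x, w):
    """Pointwise convolution as a channel-mixing matmul: x is (C, H, W), w is (C_out, C)."""
    c, h, wd = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, wd)

def fire_module(x, w_squeeze, w_expand1, w_expand3):
    """SqueezeNet-style fire module: squeeze channels with 1x1 filters, then
    expand with parallel branches and concatenate their outputs."""
    s = np.maximum(conv1x1(x, w_squeeze), 0)    # squeeze: few channels feed the expand
    e1 = np.maximum(conv1x1(s, w_expand1), 0)   # expand 1x1 branch
    e3 = np.maximum(conv1x1(s, w_expand3), 0)   # stand-in for the 3x3 branch
    return np.concatenate([e1, e3], axis=0)     # concatenate along channels

x = np.random.randn(8, 5, 5)
out = fire_module(x, np.random.randn(2, 8), np.random.randn(4, 2), np.random.randn(4, 2))
print(out.shape)  # (8, 5, 5): 4 + 4 expand channels
```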
This document discusses various techniques for optimizing deep neural network models and hardware for efficiency. It covers approaches such as exploiting activation and weight statistics, sparsity, compression, pruning neurons and synapses, decomposing trained filters, and knowledge distillation. The goal is to reduce operations, memory usage, and energy consumption to enable efficient inference on hardware like mobile phones and accelerators. Evaluation methodologies are also presented to guide energy-aware design space exploration.
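Knowledge distillation, one of the techniques covered, is commonly implemented with a temperature-softened loss along these lines (a generic sketch in the style of Hinton et al., not necessarily the document's exact formulation):

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    """Blend cross-entropy on the hard label with matching of the teacher's
    softened class probabilities (temperature T); the T*T factor keeps the
    soft-term gradients on a comparable scale."""
    p_teacher = softmax(teacher_logits, T)
    p_student_T = softmax(student_logits, T)
    soft = -np.sum(p_teacher * np.log(p_student_T + 1e-12)) * T * T
    hard = -np.log(softmax(student_logits)[label] + 1e-12)
    return alpha * soft + (1 - alpha) * hard
```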
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/09/introduction-to-computer-vision-with-convolutional-neural-networks-a-presentation-from-ebay/
Mohammad Haghighat, Senior Manager for CoreAI at eBay, presents the “Introduction to Computer Vision with Convolutional Neural Networks” tutorial at the May 2024 Embedded Vision Summit.
This presentation covers the basics of computer vision using convolutional neural networks. Haghighat begins by introducing some important conventional computer vision techniques and then transitions to explaining the basics of machine learning and convolutional neural networks (CNNs) and showing how CNNs are used in visual perception.
Haghighat illustrates the building blocks and computational elements of neural networks through examples. You’ll gain a good overview of how modern computer vision algorithms are designed, trained and used in real-world applications.
Hardware Acceleration for Machine Learning (CastLab, KAIST)
This document provides an overview of a lecture on hardware acceleration for machine learning. The lecture will cover deep neural network models like convolutional neural networks and recurrent neural networks. It will also discuss various hardware accelerators developed for machine learning, including those designed for mobile/edge and cloud computing environments. The instructor's background and the agenda topics are also outlined.
The next evolution in cloud computing is a smarter application that runs outside the cloud. As the cloud has continued to evolve, the applications that utilize it have taken on more and more of its capabilities. This presentation shows how to push logic and machine learning from the cloud to an edge application. Afterward, creating edge applications that utilize the intelligence of the cloud should become effortless.
AI gold rush, tool vendors and the next big thing
2017/12/27 at Mediatek
- Overview of booming AI applications, from media, entertainment, e-commerce, autonomous driving, surveillance, industrial inspection, medical imaging, bioinformatics, finance, etc., along with expert predictions of their market size and growth.
- Dissect the applications with largest size and growth into their technical components and their unmet demands.
- Among all the unmet demands and uncertainties in this AI gold rush, what should an IC design company do? I’ll briefly cover NVIDIA’s case, which most of us know well already, then supplement case studies of Qualcomm, Intel, Google TPU and other smaller firms.
Even when we have a clear target, it takes years for supporting libraries and software to be properly optimized. I’ll share some thoughts and personal experiences on how to make sequentially-ordered hardware/software/library optimization happen faster and in parallel, and the tools that an IC design house needs to provide in order for it to happen.
Deep Learning and Applications in Non-cognitive Domains I (Deakin University)
This document outlines an agenda for a presentation on deep learning and its applications in non-cognitive domains. The presentation is divided into three parts: an introduction to deep learning theory, applying deep learning to non-cognitive domains in practice, and advanced topics. The introduction covers neural network architectures like feedforward, recurrent, and convolutional networks. It also discusses techniques for improving training like rectified linear units and skip connections. The practice section will provide hands-on examples in domains like healthcare and software engineering. The advanced topics section will discuss unsupervised learning, structured outputs, and positioning techniques in deep learning.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/11/improved-data-sampling-techniques-for-training-neural-networks-a-presentation-from-karthik-rao-aroor/
Independent AI Engineer Karthik Rao Aroor presents the “Improved Data Sampling Techniques for Training Neural Networks” tutorial at the May 2024 Embedded Vision Summit.
For classification problems in which there are equal numbers of samples in each class, Aroor proposes and presents a novel mini-batch sampling approach to train neural networks using gradient descent. His proposed approach ensures a uniform distribution of samples from all classes in a mini-batch. He shares results showing that this approach yields faster convergence than the random sampling approach commonly used today.
Aroor illustrates his approach using several neural network models trained on commonly used datasets, including a truncated version of ImageNet. He also presents results for large and small mini-batch sizes relative to the number of classes. Comparing these results to a suboptimal sampling approach, he hypothesizes that having a uniform distribution of samples from each class in a mini-batch is an optimal sampling approach. His approach benefits model trainers by achieving higher model accuracy with reduced training time.
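One plausible reading of such a sampler, assuming equal class counts and a batch size divisible by the number of classes (the talk's exact procedure may differ), is:

```python
import numpy as np

def balanced_batches(labels, batch_size, num_classes, rng=np.random.default_rng()):
    """Yield mini-batch index arrays with a uniform spread of classes.
    Assumes batch_size is a multiple of num_classes and equal class counts."""
    per_class = batch_size // num_classes
    by_class = [rng.permutation(np.where(labels == c)[0]) for c in range(num_classes)]
    n_batches = len(by_class[0]) // per_class
    for b in range(n_batches):
        batch = np.concatenate(
            [idx[b * per_class:(b + 1) * per_class] for idx in by_class])
        yield rng.permutation(batch)

labels = np.repeat(np.arange(10), 100)       # 10 classes, 100 samples each
for batch in balanced_batches(labels, batch_size=50, num_classes=10):
    pass                                      # each batch holds 5 samples per class
```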
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/cost-efficient-high-quality-ai-for-consumer-grade-smart-home-cameras-a-presentation-from-wyze/
Lin Chen, Chief Scientist at Wyze, presents the “Cost-efficient, High-quality AI for Consumer-grade Smart Home Cameras” tutorial at the May 2024 Embedded Vision Summit.
In this talk, Chen explains how Wyze delivers robust visual AI at ultra-low cost for millions of consumer smart cameras, and how his company is rapidly expanding these capabilities. He begins by introducing Wyze’s edge-AI-enabled consumer cameras and related cloud services, which were created based on goals of delivering “too good to be true” value and building long-term relationships with customers.
Chen explains how Wyze has improved model accuracy with user feedback while keeping hardware resource usage to a minimum. Next, he shows how the company has expanded its cameras’ AI capabilities to include things like pet, vehicle and package detection. He also shows how Wyze enables customers to utilize smart cameras as “supersensors” for home security, replacing conventional single-purpose sensors and overcoming many of their limitations. Finally, he shows how the company is empowering users to add their own personalized AI capabilities, enabling support for numerous diverse use cases.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/edge-ai-optimization-on-rails-literally-a-presentation-from-wabtec/
Matthew Pietrzykowski, Principal Data Scientist at Wabtec, presents the “Edge AI Optimization on Rails—Literally” tutorial at the May 2024 Embedded Vision Summit.
In this talk, Pietrzykowski shares highlights from his company’s adventures developing computer vision solutions for the rail transportation industry. He begins with an introduction to the types of machine perception problems encountered in this unique industry. He then shares insights gained during his company's implementation of a perception system for rail auditing.
In particular, Pietrzykowski discusses the challenges faced in implementing multiple optimized CNNs in a constrained compute system, and how his company addressed these challenges. He also explains trade-offs associated with using classical homographic techniques in combination with neural networks.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/how-large-language-models-are-impacting-computer-vision-a-presentation-from-voxel51/
Jacob Marks, Senior ML Engineer and Researcher at Voxel51, presents the “How Large Language Models Are Impacting Computer Vision” tutorial at the May 2024 Embedded Vision Summit.
Large language models (LLMs) are revolutionizing the way we interact with computers and the world around us. However, in order to truly understand the world, LLM-powered agents need to be able to see.
Will models in production be multimodal, or will text-only LLMs leverage purpose-built vision models as tools? Where do techniques like multimodal retrieval-augmented generation (RAG) fit in? In this talk, Marks gives an overview of key LLM-centered projects that are reshaping the field of computer vision and discusses where we are headed in a multimodal world.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/implementing-ai-computer-vision-for-corporate-security-surveillance-a-presentation-from-vmware/
Prasad Saranjame, former Head of Physical Security and Resiliency at VMware, presents the “Implementing AI/Computer Vision for Corporate Security Surveillance” tutorial at the May 2024 Embedded Vision Summit.
AI-enabled security cameras offer substantial benefits for corporate security and operational efficiency. However, successful deployment requires thoughtful selection of use cases and meticulous attention to managing the technology’s impact on teams. This talk explores VMware’s journey through a multiyear digital transformation centered on AI/ML-based security monitoring, decision-making and response.
Saranjame discusses the business, technology and organizational challenges faced, along with best practices for overcoming them. Key strategies include running proof of concept projects, focusing on operational excellence to drive cost reductions and investing in change management to ensure team buy-in. Additionally, he provides insights for others looking to deploy AI in their operations, emphasizing disciplined prioritization, strategic partnerships and a focus on long-term scalability over short-term cost reduction. You will gain actionable insights for effectively harnessing AI in your own organizations.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/continual-learning-thru-sequential-lightweight-optimization-a-presentation-from-vision-elements/
Guy Lavi, Managing Partner at Vision Elements, presents the “Continual, On-the-fly Learning through Sequential, Lightweight Optimization” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, Lavi shows how techniques of sequential optimization are applied to enable continual learning during run-time, as new observations flow in. The lightweight nature of these techniques, using only the new batches of observations for processing, allows for new training iterations to be performed on the edge without losing memory of the entire pool of observations used for the initial training. He presents detailed examples using this technique, showing how it can be used to optimize a linear function, an image warping algorithm and an object classification neural network.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/multi-object-tracking-systems-a-presentation-from-tryolabs/
Javier Berneche, Senior Machine Learning Engineer at Tryolabs, presents the “Multiple Object Tracking Systems” tutorial at the May 2024 Embedded Vision Summit.
Object tracking is an essential capability in many computer vision systems, including applications in fields such as traffic control, self-driving vehicles, sports and more. In this talk, Berneche walks through the construction of a typical multiple object tracking (MOT) algorithm step by step. At each step, he identifies key challenges and explores design choices (for example, detection-based vs. detection-free approaches and online vs. offline tracking).
Berneche discusses available off-the-shelf MOT algorithms and open-source libraries. He also identifies areas where current MOT algorithms fall short. And he introduces metrics and benchmarks commonly used to evaluate MOT solutions.
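A typical detection-based association step, matching existing tracks to new detections with the Hungarian algorithm over an IoU cost matrix, can be sketched as follows (a generic MOT building block, not a specific library's API; iou_fn is assumed to compute box overlap):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracks, detections, iou_fn, min_iou=0.3):
    """Match tracks to detections; unmatched detections typically spawn new tracks."""
    if not tracks or not detections:
        return [], list(range(len(detections)))
    cost = np.array([[1.0 - iou_fn(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)          # minimum-cost matching
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - min_iou]
    matched_dets = {c for _, c in matches}
    new_dets = [j for j in range(len(detections)) if j not in matched_dets]
    return matches, new_dets
```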
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/improved-navigation-assistance-for-the-blind-via-real-time-edge-ai-a-presentation-from-tesla/
Aishwarya Jadhav, Software Engineer in the Autopilot AI Team at Tesla, presents the “Improved Navigation Assistance for the Blind via Real-time Edge AI,” tutorial at the May 2024 Embedded Vision Summit.
In this talk, Jadhav presents recent work on AI Guide Dog, a groundbreaking research project aimed at providing navigation assistance for the blind community. This multiyear project at Carnegie Mellon University leverages AI to predict sighted human reactions in real time and convey this information audibly to blind individuals, overcoming the limitations of existing GPS apps and mobility tools for the blind.
Jadhav discusses the various vision-only and multimodal models evaluated. She also discusses imitation learning approaches currently being explored. In addition, she highlights trade-offs among the strict requirements for models to ensure explainable predictions, high accuracy and real-time processing on mobile devices. And she shares insights gained through three iterations of this project, explaining data collection procedures, training pipelines and cutting-edge vision and multimodal modeling methodologies. She concludes with some exciting results.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/using-vision-systems-generative-models-and-reinforcement-learning-for-sports-analytics-a-presentation-from-sportlogiq/
Mehrsan Javan, Chief Technology Officer at Sportlogiq, presents the “Using Vision Systems, Generative Models and Reinforcement Learning for Sports Analytics” tutorial at the May 2024 Embedded Vision Summit.
At a high level, sport analytics systems can be broken into two components: sensory data collection and analytical models that turn sensory data into insights for users. In this talk, Javan focuses on the latter, and more specifically on the challenges his company has encountered in adapting advanced analytics originally developed for professional leagues to create a new product for use in a new market—youth sports.
These challenges arise due to the unfamiliarity of end users with sophisticated analytical metrics, incomplete and partially accurate underlying visual data and the inherent limitations of vision-based data collection systems. Javan explains how Sportlogiq uses a combination of vision systems, generative models and reinforcement learning techniques to develop compelling products for youth sports, and shares what’s been learned in this process.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/introduction-to-cameras-for-embedded-applications-a-presentation-from-sensorspace/
Brian Rodricks, CTO of SensorSpace, presents the “Introduction to Cameras for Embedded Applications” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, Rodricks introduces the essential features of cameras for embedded vision applications. He explains lens mounts and camera interface options such as MIPI, USB and GigE. He explores sensor features and trade-offs, including rolling shutter vs. global shutter, monochrome vs. color, RGB vs. RGBW and pixel size vs. array size.
Rodricks also examines trade-offs between field of view and distortion, as well as performance trade-offs such as dynamic range vs. sensitivity, pixel size vs. resolution and frame rates. Finally, he addresses the crucial aspects of price vs. performance and control, covering access to register settings, raw data and software.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/introduction-to-modern-radar-for-machine-perception-a-presentation-from-sensor-cortek/
Robert Laganière, Professor at the University of Ottawa and CEO of Sensor Cortek, presents the “Introduction to Modern Radar for Machine Perception” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, Laganière provides an introduction to radar (short for radio detection and ranging) for machine perception. Radar is a proven technology with a long history of successful development and it plays an increasingly important role in the deployment of robust perception systems.
Laganière explains how radar sensors work—in particular, how radio waves are used to accomplish detection and ranging. He explains key concepts behind this technology, including Doppler effect, time-of-flight, frequency modulation and continuous waves (FMCW). Finally, he explores the main advantages and disadvantages of radar for machine perception.
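For example, in an FMCW radar the target range follows directly from the measured beat frequency between the transmitted and received chirps; a small sketch with illustrative parameters:

```python
C = 3.0e8              # speed of light, m/s

def fmcw_range(f_beat_hz, chirp_duration_s, bandwidth_hz):
    """Target range from the beat frequency of an FMCW chirp:
    f_beat = 2*R*B / (c*T), so R = c * f_beat * T / (2 * B)."""
    return C * f_beat_hz * chirp_duration_s / (2.0 * bandwidth_hz)

# Illustrative numbers: a 4 GHz sweep over 40 microseconds with a
# measured 20 MHz beat frequency gives a 30 m target range.
print(f"range: {fmcw_range(20e6, 40e-6, 4e9):.1f} m")
```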
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/diagnosing-problems-and-implementing-solutions-for-deep-neural-network-training-a-presentation-from-sensor-cortek/
Fahed Hassanat, COO and Head of Engineering at Sensor Cortek, presents the “Deep Neural Network Training: Diagnosing Problems and Implementing Solutions” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, Hassanat delves into some of the most common problems that arise when training deep neural networks. He provides a brief overview of essential training metrics, including accuracy, precision, false positives, false negatives and F1 score.
Hassanat then explores training challenges that arise from problems with hyperparameters, inappropriately sized models, inadequate models, poor-quality datasets, imbalances within training datasets and mismatches between training and testing datasets. To help detect and diagnose training problems, he covers techniques such as understanding performance curves, recognizing overfitting and underfitting, analyzing confusion matrices and identifying class interaction issues.
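The metrics he reviews derive directly from confusion-matrix counts; a small sketch (generic definitions, not tied to any particular framework):

```python
def classification_metrics(tp, fp, fn, tn):
    """Basic training/evaluation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # how many flagged were real
    recall = tp / (tp + fn) if tp + fn else 0.0      # how many real were flagged
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of the two
    return accuracy, precision, recall, f1

print(classification_metrics(tp=80, fp=10, fn=20, tn=90))
```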
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/seeing-through-machines-a-guide-to-image-sensors-for-edge-ai-applications-a-presentation-from-seedar-consulting/
Armita Abadian, Advisor to SEEdar Consulting, presents the “Seeing Through Machines: A Guide to Image Sensors for Edge AI Applications” tutorial at the May 2024 Embedded Vision Summit.
Imagine a robot navigating a busy factory floor, a self-driving car detecting obstacles on the road or a security system accurately recognizing faces. These advancements in artificial intelligence are powered by tiny devices called CMOS image sensors. With a vast array of sensor options available, choosing the right one for your specific application can be challenging.
In this talk, Abadian demystifies the world of image sensors, exploring how they work and uncovering the key performance criteria you need to consider for projects in automation, robotics and beyond. You’ll be able to navigate the sensor landscape, equipped with the knowledge to make informed choices and unlock the potential of edge AI.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/transformer-networks-how-they-work-and-why-they-matter-a-presentation-from-ryddle-ai/
Rakshit Agrawal, Co-Founder and CEO of Ryddle AI, presents the “Transformer Networks: How They Work and Why They Matter” tutorial at the May 2024 Embedded Vision Summit.
Transformer neural networks have revolutionized artificial intelligence by introducing an architecture built around self-attention mechanisms. This has enabled unprecedented advances in understanding sequential data, such as human languages, while also dramatically improving accuracy on nonsequential tasks like object detection.
In this talk, Agrawal explains the technical underpinnings of transformer architectures, with particular focus on self-attention mechanisms. He also explores how transformers have influenced the direction of AI research and industry innovation. Finally, he touches on ethical considerations and discusses how transformers are likely to evolve in the near future.
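The self-attention mechanism at the heart of the architecture can be sketched in a few lines of NumPy (single head, no masking or multi-head projections, with illustrative shapes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X (T, d).
    Each position attends to every position, weighted by query-key similarity."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T, T) pairwise similarities
    return softmax(scores) @ V                # weighted mix of value vectors

T, d = 5, 8
rng = np.random.default_rng(1)
X = rng.standard_normal((T, d))
out = self_attention(X, *(rng.standard_normal((d, d)) for _ in range(3)))
print(out.shape)  # (5, 8)
```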
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/removing-weather-related-image-degradation-at-the-edge-a-presentation-from-rivian/
Ramit Pahwa, Machine Learning Scientist at Rivian, presents the “Removing Weather-related Image Degradation at the Edge” tutorial at the May 2024 Embedded Vision Summit.
For machines that operate outdoors—such as autonomous cars and trucks—image quality degradation due to weather conditions presents a significant challenge. For example, snow, rainfall and raindrops on optical surfaces can wreak havoc on machine perception algorithms. In this talk, Pahwa explains the key challenges in restoring images degraded by weather, such as lack of annotated datasets, and the need for multiple models to address different types of image degradation.
Pahwa also introduces metrics for assessing image degradation. He then explains Rivian’s solutions and shares results, demonstrating the efficacy of transformer-based models and of a novel, language-driven, all-in-one model for image restoration. Finally, he highlights the techniques used to create efficient implementations of Rivian’s models for deployment at the edge—including quantization and pruning—and shares lessons learned from implementing these models on a target processor.
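The talk's exact metrics aren't listed here, but peak signal-to-noise ratio (PSNR) is one widely used measure of restoration quality; the NumPy sketch below shows how it is computed on a hypothetical clean/degraded image pair.

    import numpy as np

    def psnr(reference, restored, max_val=255.0):
        """Peak signal-to-noise ratio (dB) between a clean reference image
        and a restored image; higher means closer to the reference."""
        mse = np.mean((reference.astype(np.float64) -
                       restored.astype(np.float64)) ** 2)
        if mse == 0:
            return float("inf")   # identical images
        return 10.0 * np.log10(max_val ** 2 / mse)

    # Hypothetical example: a clean frame vs. a noisy ("rain-degraded") copy.
    rng = np.random.default_rng(0)
    clean = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)
    noisy = np.clip(clean + rng.normal(0, 10, clean.shape), 0, 255).astype(np.uint8)
    print(f"PSNR: {psnr(clean, noisy):.1f} dB")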
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/seeing-the-invisible-unveiling-hidden-details-through-advanced-image-acquisition-techniques-a-presentation-from-qualitas-technologies/
Raghava Kashyapa, CEO of Qualitas Technologies, presents the “Seeing the Invisible: Unveiling Hidden Details through Advanced Image Acquisition Techniques” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, Kashyapa explores how advanced image acquisition techniques reveal previously unseen information, improving the ability of algorithms to provide valuable insights. He introduces various techniques, including high-resolution optics, filters, high dynamic range imaging and multispectral imaging.
Kashyapa explains how these techniques work and shares recommendations on how system developers can best utilize them to improve image quality and boost the performance of innovative machine vision applications. You’ll learn how to make the invisible visible in your projects.
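As one concrete example of these techniques, the sketch below merges a hypothetical three-shot exposure bracket into a high-dynamic-range image using OpenCV's Debevec merge and a simple tonemap; the file paths and exposure times are placeholders, and this is just one common HDR workflow, not necessarily the one Kashyapa presents.

    import cv2
    import numpy as np

    # Hypothetical exposure bracket: three shots of the same static scene.
    files = ["exp_short.jpg", "exp_mid.jpg", "exp_long.jpg"]   # placeholder paths
    times = np.array([1/1000, 1/60, 1/4], dtype=np.float32)    # exposure times (s)
    images = [cv2.imread(f) for f in files]

    # Merge the bracket into a single high-dynamic-range radiance map.
    merge = cv2.createMergeDebevec()
    hdr = merge.process(images, times=times)

    # Tonemap back to 8-bit for display on a standard monitor.
    tonemap = cv2.createTonemap(gamma=2.2)
    ldr = np.clip(tonemap.process(hdr) * 255, 0, 255).astype(np.uint8)
    cv2.imwrite("hdr_result.jpg", ldr)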
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/data-efficient-and-generalizable-the-domain-specific-small-vision-model-revolution-a-presentation-from-pixel-scientia-labs/
Heather Couture, Founder and Computer Vision Consultant at Pixel Scientia Labs, presents the “Data-efficient and Generalizable: The Domain-specific Small Vision Model Revolution” tutorial at the May 2024 Embedded Vision Summit.
Large vision models (LVMs) trained on a large and diverse set of imagery are revitalizing computer vision, just as LLMs did for language modeling. However, LVMs are not nearly as effective when applied to unique types of imagery. To handle labeled data scarcity without overfitting, we need models that are tuned to a specific domain of imagery. Whether it’s a single medical imaging modality, multispectral drone photos or snapshots from a manufacturing line, these fine-grained applications are best captured with a model that can accommodate the available data.
A small vision model with fewer parameters improves generalizability with the added bonus of better computational efficiency so that it can run on an edge device. In this talk, Couture shows why domain-specific models are essential and how they can be trained without labeled data. She concludes by demonstrating the efficacy of domain-specific models in handling small training sets, imbalanced data and distribution shifts for various types of imagery.
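One established way to train without labels, which may or may not be the method Couture uses, is contrastive self-supervised pretraining; the PyTorch sketch below implements the SimCLR-style NT-Xent loss on embeddings of two augmented views of the same batch, with all shapes illustrative.

    import torch
    import torch.nn.functional as F

    def nt_xent_loss(z1, z2, temperature=0.5):
        """SimCLR-style contrastive loss: embeddings of two augmented views
        of the same image (z1[i], z2[i]) are pulled together, while all other
        pairs in the batch are pushed apart. No labels are required."""
        n = z1.size(0)
        z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d)
        sim = z @ z.t() / temperature                        # cosine similarities
        sim.fill_diagonal_(float("-inf"))                    # drop self-pairs
        targets = torch.cat([torch.arange(n, 2 * n),         # positive of z1[i]
                             torch.arange(0, n)])            # positive of z2[i]
        return F.cross_entropy(sim, targets)

    # Illustrative use: embeddings of two augmentations from a small backbone.
    z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
    print(nt_xent_loss(z1, z2).item())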
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/omnilert-gun-detect-harnessing-computer-vision-to-tackle-gun-violence-a-presentation-from-omnilert/
Chad Green, Director of Artificial Intelligence at Omnilert, presents the “Omnilert Gun Detect: Harnessing Computer Vision to Tackle Gun Violence” tutorial at the May 2024 Embedded Vision Summit.
In the United States in 2023, there were 658 mass shootings, and 42,996 people lost their lives to gun violence. Detecting and rapidly responding to potential and actual shootings in an automated fashion is critical to reducing these tragic figures. In 2020, Omnilert, a pioneer in emergency notification systems, launched Omnilert Gun Detect, an AI-powered platform that combines gun detection, verification, activation of security systems and notification.
In this talk, Green describes the development of Omnilert Gun Detect. He covers why computer vision is the right solution to this problem, how Omnilert went about building the product and the business and technical challenges overcome along the way. He also talks about Omnilert’s market traction, and concludes with lessons learned in building this important system.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/adventures-in-moving-a-computer-vision-solution-from-cloud-to-edge-a-presentation-from-metaconsumer/
Nate D’Amico, CTO and Head of Product at MetaConsumer, presents the “Adventures in Moving a Computer Vision Solution from Cloud to Edge” tutorial at the May 2024 Embedded Vision Summit.
Optix is a computer vision-based AI system that measures advertising and media exposures on mobile devices for real-time marketing optimization. Optix was initially developed as a cloud-based solution, but costs and limitations associated with relying entirely on the cloud drove MetaConsumer to implement an edge solution.
In this talk, D’Amico introduces his company’s application, the role that computer vision plays in it and the challenges of deploying it at scale. He shares the lessons learned from operating a cloud-based solution at scale, the trade-offs that drove MetaConsumer to create an edge-based solution and the hardware and software challenges faced in implementing it.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/10/bridging-vision-and-language-designing-training-and-deploying-multimodal-large-language-models-a-presentation-from-meta-reality-labs/
Adel Ahmadyan, Staff Engineer at Meta Reality Labs, presents the “Bridging Vision and Language: Designing, Training and Deploying Multimodal Large Language Models” tutorial at the May 2024 Embedded Vision Summit.
In this talk, Ahmadyan explores the use of multimodal large language models in real-world edge applications. He begins by explaining how these large multimodal models (LMMs) work and highlighting their key components, giving special attention to how LMMs merge understanding in the vision and language domains.
Next, Ahmadyan discusses the process of training LMMs and the types of data needed to tune them for specific tasks. Finally, he highlights some of the key challenges in deploying LMMs in resource-constrained edge devices and shares techniques for overcoming these challenges.
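Although Ahmadyan's specific architecture isn't given here, many LMMs bridge the two modalities with a learned projector that maps vision-encoder features into the language model's embedding space; the PyTorch sketch below illustrates that common pattern with assumed dimensions.

    import torch
    import torch.nn as nn

    class VisionLanguageProjector(nn.Module):
        """Sketch of the common LMM 'projector' pattern: patch features from
        a (typically frozen) vision encoder are mapped into the LLM embedding
        space and prepended to the text token embeddings. All dimensions here
        are assumptions for illustration."""
        def __init__(self, vision_dim=1024, llm_dim=4096):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, vision_feats, text_embeds):
            # vision_feats: (batch, patches, vision_dim) from the image encoder
            # text_embeds:  (batch, tokens, llm_dim) from the LLM embedding table
            image_tokens = self.proj(vision_feats)
            return torch.cat([image_tokens, text_embeds], dim=1)

    # Illustrative shapes: one image of 256 patches plus a 16-token prompt.
    bridge = VisionLanguageProjector()
    fused = bridge(torch.randn(1, 256, 1024), torch.randn(1, 16, 4096))
    print(fused.shape)  # torch.Size([1, 272, 4096])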
DePIN = Real-World Infra + Blockchain
DePIN stands for Decentralized Physical Infrastructure Networks.
It connects physical devices to Web3 using token incentives.
How Does It Work?
Individuals contribute to infrastructure like:
- Wireless networks (e.g., Helium)
- Storage (e.g., Filecoin)
- Sensors, compute, and energy
They earn tokens for their participation.
Agentic AI - The New Era of Intelligence, by Muzammil Shah
This presentation is specifically designed to introduce final-year university students to the foundational principles of Agentic Artificial Intelligence (AI). It aims to provide a clear understanding of how Agentic AI systems function, their key components, and the underlying technologies that empower them. By exploring real-world applications and emerging trends, the session will equip students with essential knowledge to engage with this rapidly evolving area of AI, preparing them for further study or professional work in the field.
As data privacy regulations become more pervasive across the globe and organizations increasingly handle and transfer (including across borders) meaningful volumes of personal and confidential information, the need for robust contracts to be in place is more important than ever.
This webinar will provide a deep dive into privacy contracting, covering essential terms and concepts, negotiation strategies, and key practices for managing data privacy risks.
Whether you're in legal, privacy, security, compliance, GRC, procurement, or otherwise, this session will include actionable insights and practical strategies to help you enhance your agreements, reduce risk, and enable your business to move fast while protecting itself.
This webinar will review key aspects and considerations in privacy contracting, including:
- Data processing addenda, cross-border transfer terms including EU Model Clauses/Standard Contractual Clauses, etc.
- Certain legally-required provisions (as well as how to ensure compliance with those provisions)
- Negotiation tactics and common issues
- Lessons from recent regulatory actions and disputes
AI in Java - MCP in Action, Langchain4J-CDI, SmallRye-LLM, Spring AI, by Buhake Sindi
This is the presentation I gave on AI in Java and the work I have been doing. I showcased the Model Context Protocol (MCP) in Java, creating a server-side MCP server in Java. I also introduced Langchain4J-CDI, previously known as SmallRye-LLM, a CDI-managed tool for injecting AI services into enterprise Java applications. Also, an honourable mention: Spring AI.
UiPath Community Berlin: Studio Tips & Tricks and UiPath Insights, by UiPathCommunity
Join the UiPath Community Berlin (Virtual) meetup on May 27 to discover handy Studio Tips & Tricks and get introduced to UiPath Insights. Learn how to boost your development workflow, improve efficiency, and gain visibility into your automation performance.
📕 Agenda:
- Welcome & Introductions
- UiPath Studio Tips & Tricks for Efficient Development
- Best Practices for Workflow Design
- Introduction to UiPath Insights
- Creating Dashboards & Tracking KPIs (Demo)
- Q&A and Open Discussion
Perfect for developers, analysts, and automation enthusiasts!
This session streamed live on May 27, 18:00 CET.
Check out all our upcoming UiPath Community sessions at:
👉 https://community.uipath.com/events/
Join our UiPath Community Berlin chapter:
👉 https://community.uipath.com/berlin/
Content and eLearning Standards: Finding the Best Fit for Your Training, by Rustici Software
Tammy Rutherford, Managing Director of Rustici Software, walks through the pros and cons of different standards to better understand which standard is best for your content and chosen technologies.
Dev Dives: System-to-system integration with UiPath API Workflows, by UiPathCommunity
Join the next Dev Dives webinar on May 29 for a first contact with UiPath API Workflows, a powerful tool purpose-fit for API integration and data manipulation!
This session will guide you through the technical aspects of automating communication between applications, systems and data sources using API workflows.
📕 We'll delve into:
- How this feature delivers API integration as a first-party concept of the UiPath Platform.
- How to design, implement, and debug API workflows to integrate with your existing systems seamlessly and securely.
- How to optimize your API integrations with runtime built for speed and scalability.
This session is ideal for developers looking to solve API integration use cases with the power of the UiPath Platform.
👨🏫 Speakers:
Gunter De Souter, Sr. Director, Product Manager @UiPath
Ramsay Grove, Product Manager @UiPath
This session streamed live on May 29, 2025, 16:00 CET.
Check out all our upcoming UiPath Dev Dives sessions:
👉 https://community.uipath.com/dev-dives-automation-developer-2025/
What’s New in Web3 Development: Trends to Watch in 2025, by Lisa Ward
Emerging Web3 development trends in 2025 include AI integration, enhanced scalability, decentralized identity, and increased enterprise adoption of blockchain technologies.
Unlock your organization’s full potential with the 2025 Digital Adoption Blueprint. Discover proven strategies to streamline software onboarding, boost productivity, and drive enterprise-wide digital transformation.
For those who have ever wanted to recreate classic games, this presentation covers my five-year journey to build a NES emulator in Kotlin. Starting from scratch in 2020 (you can probably guess why), I’ll share the challenges posed by the architecture of old hardware, performance optimization (surprise, surprise), and the difficulties of emulating sound. I’ll also highlight which Kotlin features shine (and why concurrency isn’t one of them). This high-level overview will walk through each step of the process—from reading ROM formats to where GPT can help, though it won’t write the code for us just yet. We’ll wrap up by launching Mario on the emulator (hopefully without a call from Nintendo).
Master tester AI toolbox - Kari Kakkonen at Testaus ja AI 2025 Professio, by Kari Kakkonen
My slides at Professio Testaus ja AI 2025 seminar in Espoo, Finland.
Deck in English, even though I talked in Finnish this time, in addition to chairing the event.
I discuss the different motivations for using AI tools to help in testing, and give several examples in each category, some open source, some commercial.
"AI in the browser: predicting user actions in real time with TensorflowJS", ...Fwdays
With AI becoming increasingly present in our everyday lives, the latest advancements in the field now make it easier than ever to integrate it into our software projects. In this session, we’ll explore how machine learning models can be embedded directly into front-end applications. We'll walk through practical examples, including running basic models such as linear regression and random forest classifiers, all within the browser environment.
Once we grasp the fundamentals of running ML models on the client side, we’ll dive into real-world use cases for web applications—ranging from real-time data classification and interpolation to object tracking in the browser. We'll also introduce a novel approach: dynamically optimizing web applications by predicting user behavior in real time using a machine learning model. This opens the door to smarter, more adaptive user experiences and can significantly improve both performance and engagement.
In addition to the technical insights, we’ll also touch on best practices, potential challenges, and the tools that make browser-based machine learning development more accessible. Whether you're a developer looking to experiment with ML or someone aiming to bring more intelligence into your web apps, this session will offer practical takeaways and inspiration for your next project.