Tutorial on DNN — 4 of 9: DNN Accelerator Architectures
Memory Access is the Bottleneck

Memory Read → MAC* → Memory Write
* multiply-and-accumulate

Each MAC requires three memory reads (filter weight, fmap activation, partial sum) and one memory write (the updated partial sum). Worst case: all of these reads and writes are DRAM accesses. Adding levels of local memory (Mem) between the ALU and DRAM creates the opportunity to serve most accesses locally instead of from DRAM.

[Figure: ALU between DRAM and local memory (Mem) on both the read and write paths.]
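As a back-of-the-envelope illustration (not from the slides; the layer shape below is hypothetical), the worst case of four DRAM accesses per MAC scales as follows:

```python
def conv_layer_macs(N, M, C, E, F, R, S):
    """MACs in a CONV layer: N fmaps, M filters, C channels,
    E x F output positions, R x S filter."""
    return N * M * C * E * F * R * S

def worst_case_dram_accesses(macs):
    # Worst case: 3 DRAM reads (weight, activation, partial sum)
    # plus 1 DRAM write (updated partial sum) for every MAC.
    return 4 * macs

macs = conv_layer_macs(N=1, M=64, C=32, E=28, F=28, R=3, S=3)  # hypothetical shape
print(f"MACs: {macs:,}")
print(f"worst-case DRAM accesses: {worst_case_dram_accesses(macs):,}")
```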
Types of Data Reuse in DNN

• Convolutional Reuse — CONV layers only (sliding window): the same filter slides over one input fmap, so each filter weight is reused across output positions and each input activation is reused across overlapping filter positions. Reuse: activations and filter weights.
• Fmap Reuse — CONV and FC layers: multiple filters are applied to the same input fmap. Reuse: activations.
• Filter Reuse — CONV and FC layers (batch size > 1): the same filter is applied to multiple input fmaps. Reuse: filter weights.
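To make these reuse opportunities concrete, here is a small counting sketch (not from the slides). The layer-shape values are hypothetical, and it assumes unit stride and ignores border effects, so the per-activation reuse factor is an upper bound.

```python
def reuse_factors(N, M, C, E, F, R, S):
    """How often each piece of data can be reused in one CONV layer:
    N fmaps (batch), M filters, C channels, E x F output, R x S filter.
    Assumes stride 1 and ignores border effects."""
    return {
        # Convolutional reuse (CONV only, sliding window):
        "each weight reused across output positions": E * F,
        "each activation reused across filter positions": R * S,
        # Fmap reuse (CONV and FC): every filter reads the same fmap.
        "each activation reused across filters": M,
        # Filter reuse (CONV and FC, batch > 1): every fmap uses the same filter.
        "each weight reused across fmaps in the batch": N,
    }

# Hypothetical layer shape, for illustration only.
for kind, times in reuse_factors(N=16, M=64, C=32, E=28, F=28, R=3, S=3).items():
    print(f"{kind}: {times}x")
```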
Memory Access is the Bottleneck

With a local memory hierarchy, there are two opportunities to reduce DRAM traffic:
1) Data reuse: keep filter weights and fmap activations in local memory so they are read from DRAM far less often.
2) Local accumulation: accumulate partial sums near the ALU so they do not have to be written to and re-read from DRAM.

[Figure: ALU between DRAM and local memory (Mem), with the reuse (1) and accumulation (2) opportunities marked.]
Low-Cost Local Data Access

[Figure: the data needed to run a MAC at a PE's ALU can be fetched from several levels: DRAM → Global Buffer → a neighboring PE → the PE's own local storage; the closer the level, the cheaper the access.]
Weight Stationary (WS)

• Filter weights (W0–W7 in the figure) stay resident in the PEs, minimizing the movement of filter weights; activations and psums move between the global buffer and the PE array.
WS Example: nn-X (NeuFlow)

• A 3×3 2D convolution engine; as a weight-stationary design, the weights are held locally while activations stream in and psums stream out.
Output Stationary (OS)

• Partial sums (P0–P7 in the figure) stay resident in the PEs, minimizing the movement of psums; activations and weights move between the global buffer and the PE array.
OS Example: ShiDianNao

[Figure: psums are held and accumulated locally inside the PE array.]
No Local Reuse (NLR)

• No PE-local storage: weights, activations, and psums all move directly between a large global buffer and the PEs.
NLR Example: UCLA

[Figure: activations and weights are read from, and psums are exchanged with, the global buffer.]
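Before comparing energy, it helps to see WS and OS as different loop orders around the same computation. The sketch below is my own 1D simplification (not from the slides or from any of the chips above); it only marks which operand each "PE" would keep stationary in its register file.

```python
import numpy as np

def conv1d_weight_stationary(x, w):
    """WS: each 'PE' pins one filter weight w[r]; activations and
    partial sums stream past it."""
    out = np.zeros(len(x) - len(w) + 1)
    for r, weight in enumerate(w):        # weight stays put (stationary)
        for e in range(len(out)):         # activations / psums stream
            out[e] += weight * x[e + r]
    return out

def conv1d_output_stationary(x, w):
    """OS: each 'PE' pins one partial sum out[e]; weights and
    activations stream past it."""
    out = np.zeros(len(x) - len(w) + 1)
    for e in range(len(out)):             # partial sum stays put (stationary)
        for r, weight in enumerate(w):    # weights / activations stream
            out[e] += weight * x[e + r]
    return out

x = np.arange(8.0)                        # toy 1D input fmap
w = np.array([1.0, 0.0, -1.0])            # toy 1D filter
assert np.allclose(conv1d_weight_stationary(x, w),
                   conv1d_output_stationary(x, w))
print(conv1d_output_stationary(x, w))
```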
Energy Efficiency Comparison

• Same total area • 256 PEs
• AlexNet CONV layers • Batch size = 16

[Figure: normalized energy per MAC (0–2) for the CNN dataflows WS, OSA, OSB, OSC (variants of OS), NLR, and RS.]

[Chen et al., ISCA 2016]
Energy-Efficient Dataflow:
Row Stationary (RS)

• Maximize reuse and accumulation at RF

[Figure: Filter * Input Fmap = Output Fmap]
1D Row Convolution in PE

[Figure: filter row (a b c) * input fmap row (a b c d e) = partial-sum row (a b c).]

• The filter row (a b c) stays resident in the PE's register file.
• A sliding window of the input fmap row (a b c d e) streams through the register file.
• Each step produces one partial sum, so an entire 1D row convolution runs out of the register file.
• Maximize row convolutional reuse and row psum accumulation in the RF.
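A minimal functional sketch of this 1D row primitive (illustrative only; the numeric values stand in for the symbolic a, b, c weights and a–e activations on the slide):

```python
def pe_1d_row_conv(filter_row, fmap_row):
    """1D row convolution in one PE: the filter row stays resident in the
    register file while a sliding window of the fmap row streams through,
    producing one partial sum per step."""
    R = len(filter_row)
    psum_row = []
    for s in range(len(fmap_row) - R + 1):
        window = fmap_row[s:s + R]                     # sliding window in the RF
        psum_row.append(sum(w * x for w, x in zip(filter_row, window)))
    return psum_row

# filter row (a, b, c) and fmap row (a, b, c, d, e), with placeholder numbers
print(pe_1d_row_conv([1, 2, 3], [1, 2, 3, 4, 5]))      # -> [14, 20, 26]
```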
2D Convolution in PE Array

Each PE performs one 1D row convolution (filter row * fmap row), and rows are mapped across the array:

PE 1: Row 1 * Row 1   PE 4: Row 1 * Row 2   PE 7: Row 1 * Row 3
PE 2: Row 2 * Row 2   PE 5: Row 2 * Row 3   PE 8: Row 2 * Row 4
PE 3: Row 3 * Row 3   PE 6: Row 3 * Row 4   PE 9: Row 3 * Row 5

Each column of PEs combines filter rows 1–3 with a different window of fmap rows, producing one row of the output fmap.
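A sketch of how this row mapping composes into a full 2D convolution (illustrative only; `pe_1d_row_conv` repeats the 1D primitive from the previous sketch, and the filter/fmap values are hypothetical):

```python
def pe_1d_row_conv(filter_row, fmap_row):
    # 1D row primitive from the previous sketch
    R = len(filter_row)
    return [sum(w * x for w, x in zip(filter_row, fmap_row[s:s + R]))
            for s in range(len(fmap_row) - R + 1)]

def conv2d_row_stationary(filt, fmap):
    """PE (i, j) computes filter row i * fmap row (i + j); the partial-sum
    rows of the PEs in column j are then accumulated vertically into
    output row j."""
    R = len(filt)                            # number of filter rows
    E = len(fmap) - R + 1                    # number of output rows
    output = []
    for j in range(E):                       # one PE column per output row
        col = [pe_1d_row_conv(filt[i], fmap[i + j]) for i in range(R)]
        output.append([sum(vals) for vals in zip(*col)])
    return output

filt = [[1, 0, -1]] * 3                                  # 3x3 filter (hypothetical)
fmap = [[r + c for c in range(5)] for r in range(5)]     # 5x5 input fmap
print(conv2d_row_stationary(filt, fmap))                 # 3x3 output
```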
Convolutional Reuse Maximized

With the same mapping:
• Each filter row is reused across the PEs in its row of the array (applied to several fmap rows).
• Each fmap row is reused across PEs along a diagonal of the array (combined with several filter rows).
Maximize 2D Accumulation in PE Array

• With the same mapping, the partial sums produced by the PEs in each column are accumulated vertically to form one output row, so 2D accumulation happens inside the array.
Dimensions Beyond 2D Convolution
1 Multiple Fmaps 2 Multiple Filters 3 Multiple Channels
Filter Reuse in PE

1 Multiple Fmaps   2 Multiple Filters   3 Multiple Channels

Filter 1, Channel 1, Row 1 * Fmap 1, Row 1 = Psum 1, Row 1
Filter 1, Channel 1, Row 1 * Fmap 2, Row 1 = Psum 2, Row 1

• Rows from different fmaps processed in the same PE share the same filter row, which stays resident in the register file.
Fmap Reuse in PE

1 Multiple Fmaps   2 Multiple Filters   3 Multiple Channels

Filter 1, Channel 1, Row 1 * Fmap 1, Row 1 = Psum 1, Row 1
Filter 2, Channel 1, Row 1 * Fmap 1, Row 1 = Psum 2, Row 1

• Rows from different filters processed in the same PE share the same fmap row.
Channel Accumulation in PE

1 Multiple Fmaps   2 Multiple Filters   3 Multiple Channels

Filter 1, Channel 1, Row 1 * Fmap 1, Channel 1, Row 1 = Psum Row 1
Filter 1, Channel 2, Row 1 * Fmap 1, Channel 2, Row 1 = Psum Row 1

• The partial-sum rows from different input channels are accumulated inside the PE: Psum Row 1 + Psum Row 1 = Psum Row 1.
DNN Processing – The Full Picture

Multiple fmaps:    Filter 1     * Fmap 1 & 2 = Psum 1 & 2
Multiple filters:  Filter 1 & 2 * Fmap 1     = Psum 1 & 2
Multiple channels: Filter 1     * Fmap 1     = Psum

Map rows from multiple fmaps, filters, and channels to the same PE to exploit other forms of reuse and local accumulation.
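A sketch of what mapping rows from multiple fmaps, filters, and channels to the same PE looks like functionally (my own illustration with hypothetical sizes; a real PE interleaves these computations cycle by cycle rather than looping over them in software):

```python
import numpy as np

def pe_multi_row(filter_rows, fmap_rows):
    """One PE processing one row position for several fmaps, filters, and channels.
    filter_rows: [M filters][C channels][R]   fmap_rows: [N fmaps][C channels][W]
    Channels are accumulated locally into the same psum row (channel accumulation);
    each fmap row is reused across all M filters (fmap reuse); each filter row is
    reused across all N fmaps (filter reuse)."""
    filter_rows = np.asarray(filter_rows, dtype=float)
    fmap_rows = np.asarray(fmap_rows, dtype=float)
    M, C, R = filter_rows.shape
    N, _, W = fmap_rows.shape
    psums = np.zeros((N, M, W - R + 1))
    for n in range(N):                       # multiple fmaps    -> filter reuse
        for m in range(M):                   # multiple filters  -> fmap reuse
            for c in range(C):               # multiple channels -> local accumulation
                for e in range(W - R + 1):
                    psums[n, m, e] += filter_rows[m, c] @ fmap_rows[n, c, e:e + R]
    return psums

# 2 fmaps, 2 filters, 3 channels, 3-wide filter row, 5-wide fmap row (hypothetical)
out = pe_multi_row(np.ones((2, 3, 3)), np.ones((2, 3, 5)))
print(out.shape, out[0, 0])                  # (2, 2, 3) [9. 9. 9.]
```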
Optimal Mapping in Row Stationary

[Figure: the CNN configuration (shape parameters N, M, C, H, R, E) and the hardware resources (PE array and storage hierarchy) feed an optimization step, the compiler-like mapper, which decides how rows from multiple fmaps, filters, and channels are assigned to the PE array.]

The mapping flow mirrors a compiler flow:

Compilation:
• DNN Shape and Size (Program)
• Dataflow, … (Architecture)
• Implementation Details (µArch)
• Mapper (Compiler) → Mapping (Binary)

Execution:
• DNN Accelerator (Processor) runs the Mapping (Binary) on the Input Data to produce the Processed Data.
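A toy sketch of what such a mapper could look like (purely illustrative: the tiling choices, the register-file usage estimate, and the cost function below are hypothetical stand-ins, not the optimization used in the paper): enumerate mappings that fit the hardware constraints and keep the one with the lowest estimated cost.

```python
from itertools import product

def enumerate_mappings(num_pes, rf_words, R, E):
    """Candidate mappings: how many filter rows (r_tile) and output rows
    (e_tile) to place on the PE array at once, subject to the array size
    and a crude per-PE register-file budget."""
    for r_tile, e_tile in product(range(1, R + 1), range(1, E + 1)):
        pes_needed = r_tile * e_tile        # one PE per (filter row, output row) pair
        rf_needed = 2 * r_tile + 1          # crude stand-in for words held per PE
        if pes_needed <= num_pes and rf_needed <= rf_words:
            yield (r_tile, e_tile)

def pick_best(mappings, cost_of):
    return min(mappings, key=cost_of)

# Hypothetical cost: number of passes over the array (fewer passes -> less
# buffer/DRAM traffic). R = 11 filter rows, E = 55 output rows.
best = pick_best(enumerate_mappings(num_pes=168, rf_words=16, R=11, E=55),
                 cost_of=lambda m: (11 * 55) / (m[0] * m[1]))
print("chosen (filter-row, output-row) tile:", best)
```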
Evaluate Reuse in Different Dataflows

• Weight Stationary
– Minimize movement of filter weights
• Output Stationary
– Minimize movement of partial sums
• No Local Reuse
– No PE local storage. Maximize global buffer size.
• Row Stationary
– Maximize reuse and accumulation at RF

Evaluation setup:
• Same total area
• 256 PEs
• AlexNet
• Batch size = 16

Normalized energy cost of a data access (relative to one MAC at the ALU):
• ALU: 1× (reference)
• RF → ALU: 1×
• PE → ALU: 2×
• Buffer → ALU: 6×
• DRAM → ALU: 200×
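Given these normalized costs, the energy of a mapping can be estimated by weighting the number of accesses at each level. A minimal sketch (the access counts below are hypothetical, chosen only to contrast an all-DRAM mapping with one that exploits local reuse):

```python
# Normalized energy per access, relative to one MAC at the ALU (from the table above).
ENERGY_PER_ACCESS = {"ALU": 1, "RF": 1, "PE": 2, "buffer": 6, "DRAM": 200}

def total_energy(macs, accesses):
    """Energy in MAC-equivalents: compute energy plus data-movement energy
    at each storage level. `accesses` maps level -> access count."""
    energy = macs * ENERGY_PER_ACCESS["ALU"]
    for level, count in accesses.items():
        energy += count * ENERGY_PER_ACCESS[level]
    return energy

macs = 100_000_000                                # hypothetical layer
all_dram = {"DRAM": 4 * macs}                     # every operand from/to DRAM
with_reuse = {"RF": 3 * macs, "buffer": macs // 10, "DRAM": macs // 100}
for name, acc in [("all-DRAM", all_dram), ("with local reuse", with_reuse)]:
    print(f"{name}: {total_energy(macs, acc):.3g} MAC-equivalents of energy")
```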
Variants of Output Stationary

[Table: OSA, OSB, and OSC differ in the shape of the parallel output region (which output activations are kept stationary and computed in parallel). OSA targets CONV layers, OSC targets FC layers, and OSB sits in between.]
Dataflow Comparison: CONV Layers

[Figure: normalized energy per MAC for the CNN dataflows WS, OSA, OSB, OSC, NLR, and RS, broken down (a) by data type — psums, weights, activations — and (b) by storage level — ALU, RF, NoC, buffer, DRAM.]

[Chen et al., ISCA 2016]
Row Stationary: Layer Breakdown

[Figure: normalized energy (1 MAC = 1) for AlexNet layers L1–L8 (CONV layers L1–L5, FC layers L6–L8), broken down by storage level (ALU, RF, NoC, buffer, DRAM). Annotation: RF dominates.]

[Chen et al., ISCA 2016]
[Figure: DCNN accelerator top-level block diagram (residue): compression (Comp) and ReLU blocks and a 64-bit off-chip DRAM interface.]

[Chen et al., ISSCC 2016]
Data Delivery with On-Chip Network

[Figure: DCNN accelerator with a 14×12 PE array (separate link clock and core clock), a 108 KB global buffer SRAM, ReLU and compression/decompression (Comp/Decomp) blocks, and a 64-bit off-chip DRAM interface. Filter, image (fmap), and psum delivery patterns are mapped onto the array; unused PEs are clock gated.]

Compared to broadcast, multicast saves >80% of NoC energy.
Chip Spec & Measurement Results

Technology: TSMC 65nm LP 1P9M
On-Chip Buffer: 108 KB
# of PEs: 168
Scratch Pad / PE: 0.5 KB
Core Frequency: 100 – 250 MHz
Peak Performance: 33.6 – 84.0 GOPS
Word Bit-width: 16-bit Fixed-Point
Natively Supported DNN Shapes:
– Filter Width: 1 – 32
– Filter Height: 1 – 12
– Num. Filters: 1 – 1024
– Num. Channels: 1 – 1024
– Horz. Stride: 1 – 12
– Vert. Stride: 1, 2, 4

[Die photo: 4000 µm × 4000 µm; global buffer plus spatial array of 168 PEs.]

To support 2.66 GMACs [8 billion 16-bit inputs (16 GB) and 2.7 billion outputs (5.4 GB)], the chip only requires 208.5 MB of buffer accesses and 15.4 MB of DRAM accesses.

[Chen et al., ISSCC 2016]
Summary of DNN Dataflows
• Weight Stationary
– Minimize movement of filter weights
– Popular with processing-in-memory architectures
• Output Stationary
– Minimize movement of partial sums
– Different variants optimized for CONV or FC layers
• No Local Reuse
– No PE local storage → maximize global buffer size
• Row Stationary
– Adapt to the NN shape and hardware constraints
– Optimized for overall system energy efficiency
Fused Layer
• Dataflow across multiple layers