Tutorial on DNN — 4 of 9: DNN Accelerator Architectures
Memory Access is the Bottleneck

Memory Read → MAC* → Memory Write
* multiply-and-accumulate

Each MAC requires three memory reads (filter weight, fmap activation, partial sum) and one memory write (the updated partial sum). Worst case: all of these reads and writes are DRAM accesses. Adding levels of local memory (Mem) between the ALU and DRAM creates the opportunity to serve most accesses locally instead of from DRAM.

[Figure: ALU between DRAM and local memory (Mem) on both the read and write paths.]
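As a back-of-the-envelope illustration (not from the slides; the layer shape below is hypothetical), the worst case of four DRAM accesses per MAC scales as follows:

```python
def conv_layer_macs(N, M, C, E, F, R, S):
    """MACs in a CONV layer: N fmaps, M filters, C channels,
    E x F output positions, R x S filter."""
    return N * M * C * E * F * R * S

def worst_case_dram_accesses(macs):
    # Worst case: 3 DRAM reads (weight, activation, partial sum)
    # plus 1 DRAM write (updated partial sum) for every MAC.
    return 4 * macs

macs = conv_layer_macs(N=1, M=64, C=32, E=28, F=28, R=3, S=3)  # hypothetical shape
print(f"MACs: {macs:,}")
print(f"worst-case DRAM accesses: {worst_case_dram_accesses(macs):,}")
```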
Types of Data Reuse in DNN

• Convolutional Reuse — CONV layers only (sliding window): the same filter slides over one input fmap, so each filter weight is reused across output positions and each input activation is reused across overlapping filter positions. Reuse: activations and filter weights.
• Fmap Reuse — CONV and FC layers: multiple filters are applied to the same input fmap. Reuse: activations.
• Filter Reuse — CONV and FC layers (batch size > 1): the same filter is applied to multiple input fmaps. Reuse: filter weights.
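To make these reuse opportunities concrete, here is a small counting sketch (not from the slides). The layer-shape values are hypothetical, and it assumes unit stride and ignores border effects, so the per-activation reuse factor is an upper bound.

```python
def reuse_factors(N, M, C, E, F, R, S):
    """How often each piece of data can be reused in one CONV layer:
    N fmaps (batch), M filters, C channels, E x F output, R x S filter.
    Assumes stride 1 and ignores border effects."""
    return {
        # Convolutional reuse (CONV only, sliding window):
        "each weight reused across output positions": E * F,
        "each activation reused across filter positions": R * S,
        # Fmap reuse (CONV and FC): every filter reads the same fmap.
        "each activation reused across filters": M,
        # Filter reuse (CONV and FC, batch > 1): every fmap uses the same filter.
        "each weight reused across fmaps in the batch": N,
    }

# Hypothetical layer shape, for illustration only.
for kind, times in reuse_factors(N=16, M=64, C=32, E=28, F=28, R=3, S=3).items():
    print(f"{kind}: {times}x")
```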
Memory Access is the Bottleneck

With a local memory hierarchy, there are two opportunities to reduce DRAM traffic:
1) Data reuse: keep filter weights and fmap activations in local memory so they are read from DRAM far less often.
2) Local accumulation: accumulate partial sums near the ALU so they do not have to be written to and re-read from DRAM.

[Figure: ALU between DRAM and local memory (Mem), with the reuse (1) and accumulation (2) opportunities marked.]
Low-Cost Local Data Access

[Figure: the data needed to run a MAC at a PE's ALU can be fetched from several levels: DRAM → Global Buffer → a neighboring PE → the PE's own local storage; the closer the level, the cheaper the access.]
Weight Stationary (WS)

• Filter weights (W0–W7 in the figure) stay resident in the PEs, minimizing the movement of filter weights; activations and psums move between the global buffer and the PE array.
WS Example: nn-X (NeuFlow)

• A 3×3 2D convolution engine; as a weight-stationary design, the weights are held locally while activations stream in and psums stream out.
Output Stationary (OS)

• Partial sums (P0–P7 in the figure) stay resident in the PEs, minimizing the movement of psums; activations and weights move between the global buffer and the PE array.
OS Example: ShiDianNao

[Figure: psums are held and accumulated locally inside the PE array.]
No Local Reuse (NLR)

• No PE-local storage: weights, activations, and psums all move directly between a large global buffer and the PEs.
NLR Example: UCLA

[Figure: activations and weights are read from, and psums are exchanged with, the global buffer.]
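Before comparing energy, it helps to see WS and OS as different loop orders around the same computation. The sketch below is my own 1D simplification (not from the slides or from any of the chips above); it only marks which operand each "PE" would keep stationary in its register file.

```python
import numpy as np

def conv1d_weight_stationary(x, w):
    """WS: each 'PE' pins one filter weight w[r]; activations and
    partial sums stream past it."""
    out = np.zeros(len(x) - len(w) + 1)
    for r, weight in enumerate(w):        # weight stays put (stationary)
        for e in range(len(out)):         # activations / psums stream
            out[e] += weight * x[e + r]
    return out

def conv1d_output_stationary(x, w):
    """OS: each 'PE' pins one partial sum out[e]; weights and
    activations stream past it."""
    out = np.zeros(len(x) - len(w) + 1)
    for e in range(len(out)):             # partial sum stays put (stationary)
        for r, weight in enumerate(w):    # weights / activations stream
            out[e] += weight * x[e + r]
    return out

x = np.arange(8.0)                        # toy 1D input fmap
w = np.array([1.0, 0.0, -1.0])            # toy 1D filter
assert np.allclose(conv1d_weight_stationary(x, w),
                   conv1d_output_stationary(x, w))
print(conv1d_output_stationary(x, w))
```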
Energy Efficiency Comparison

• Same total area • 256 PEs
• AlexNet CONV layers • Batch size = 16

[Figure: normalized energy per MAC (0–2) for the CNN dataflows WS, OSA, OSB, OSC (variants of OS), NLR, and RS.]

[Chen et al., ISCA 2016]
Energy-Efficient Dataflow:
Row Stationary (RS)

• Maximize reuse and accumulation at RF

[Figure: Filter * Input Fmap = Output Fmap]
1D Row Convolution in PE

[Figure: filter row (a b c) * input fmap row (a b c d e) = partial-sum row (a b c).]

• The filter row (a b c) stays resident in the PE's register file.
• A sliding window of the input fmap row (a b c d e) streams through the register file.
• Each step produces one partial sum, so an entire 1D row convolution runs out of the register file.
• Maximize row convolutional reuse and row psum accumulation in the RF.
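A minimal functional sketch of this 1D row primitive (illustrative only; the numeric values stand in for the symbolic a, b, c weights and a–e activations on the slide):

```python
def pe_1d_row_conv(filter_row, fmap_row):
    """1D row convolution in one PE: the filter row stays resident in the
    register file while a sliding window of the fmap row streams through,
    producing one partial sum per step."""
    R = len(filter_row)
    psum_row = []
    for s in range(len(fmap_row) - R + 1):
        window = fmap_row[s:s + R]                     # sliding window in the RF
        psum_row.append(sum(w * x for w, x in zip(filter_row, window)))
    return psum_row

# filter row (a, b, c) and fmap row (a, b, c, d, e), with placeholder numbers
print(pe_1d_row_conv([1, 2, 3], [1, 2, 3, 4, 5]))      # -> [14, 20, 26]
```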
2D Convolution in PE Array

Each PE performs one 1D row convolution (filter row * fmap row), and rows are mapped across the array:

PE 1: Row 1 * Row 1   PE 4: Row 1 * Row 2   PE 7: Row 1 * Row 3
PE 2: Row 2 * Row 2   PE 5: Row 2 * Row 3   PE 8: Row 2 * Row 4
PE 3: Row 3 * Row 3   PE 6: Row 3 * Row 4   PE 9: Row 3 * Row 5

Each column of PEs combines filter rows 1–3 with a different window of fmap rows, producing one row of the output fmap.
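A sketch of how this row mapping composes into a full 2D convolution (illustrative only; `pe_1d_row_conv` repeats the 1D primitive from the previous sketch, and the filter/fmap values are hypothetical):

```python
def pe_1d_row_conv(filter_row, fmap_row):
    # 1D row primitive from the previous sketch
    R = len(filter_row)
    return [sum(w * x for w, x in zip(filter_row, fmap_row[s:s + R]))
            for s in range(len(fmap_row) - R + 1)]

def conv2d_row_stationary(filt, fmap):
    """PE (i, j) computes filter row i * fmap row (i + j); the partial-sum
    rows of the PEs in column j are then accumulated vertically into
    output row j."""
    R = len(filt)                            # number of filter rows
    E = len(fmap) - R + 1                    # number of output rows
    output = []
    for j in range(E):                       # one PE column per output row
        col = [pe_1d_row_conv(filt[i], fmap[i + j]) for i in range(R)]
        output.append([sum(vals) for vals in zip(*col)])
    return output

filt = [[1, 0, -1]] * 3                                  # 3x3 filter (hypothetical)
fmap = [[r + c for c in range(5)] for r in range(5)]     # 5x5 input fmap
print(conv2d_row_stationary(filt, fmap))                 # 3x3 output
```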
Convolutional Reuse Maximized

With the same mapping:
• Each filter row is reused across the PEs in its row of the array (applied to several fmap rows).
• Each fmap row is reused across PEs along a diagonal of the array (combined with several filter rows).
Maximize 2D Accumulation in PE Array

• With the same mapping, the partial sums produced by the PEs in each column are accumulated vertically to form one output row, so 2D accumulation happens inside the array.
Dimensions Beyond 2D Convolution
1 Multiple Fmaps 2 Multiple Filters 3 Multiple Channels
Filter Reuse in PE

1 Multiple Fmaps   2 Multiple Filters   3 Multiple Channels

Filter 1, Channel 1, Row 1 * Fmap 1, Row 1 = Psum 1, Row 1
Filter 1, Channel 1, Row 1 * Fmap 2, Row 1 = Psum 2, Row 1

• Rows from different fmaps processed in the same PE share the same filter row, which stays resident in the register file.
Fmap Reuse in PE

1 Multiple Fmaps   2 Multiple Filters   3 Multiple Channels

Filter 1, Channel 1, Row 1 * Fmap 1, Row 1 = Psum 1, Row 1
Filter 2, Channel 1, Row 1 * Fmap 1, Row 1 = Psum 2, Row 1

• Rows from different filters processed in the same PE share the same fmap row.
Channel Accumulation in PE

1 Multiple Fmaps   2 Multiple Filters   3 Multiple Channels

Filter 1, Channel 1, Row 1 * Fmap 1, Channel 1, Row 1 = Psum Row 1
Filter 1, Channel 2, Row 1 * Fmap 1, Channel 2, Row 1 = Psum Row 1

• The partial-sum rows from different input channels are accumulated inside the PE: Psum Row 1 + Psum Row 1 = Psum Row 1.
DNN Processing – The Full Picture

Multiple fmaps:    Filter 1     * Fmap 1 & 2 = Psum 1 & 2
Multiple filters:  Filter 1 & 2 * Fmap 1     = Psum 1 & 2
Multiple channels: Filter 1     * Fmap 1     = Psum

Map rows from multiple fmaps, filters, and channels to the same PE to exploit other forms of reuse and local accumulation.
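A sketch of what mapping rows from multiple fmaps, filters, and channels to the same PE looks like functionally (my own illustration with hypothetical sizes; a real PE interleaves these computations cycle by cycle rather than looping over them in software):

```python
import numpy as np

def pe_multi_row(filter_rows, fmap_rows):
    """One PE processing one row position for several fmaps, filters, and channels.
    filter_rows: [M filters][C channels][R]   fmap_rows: [N fmaps][C channels][W]
    Channels are accumulated locally into the same psum row (channel accumulation);
    each fmap row is reused across all M filters (fmap reuse); each filter row is
    reused across all N fmaps (filter reuse)."""
    filter_rows = np.asarray(filter_rows, dtype=float)
    fmap_rows = np.asarray(fmap_rows, dtype=float)
    M, C, R = filter_rows.shape
    N, _, W = fmap_rows.shape
    psums = np.zeros((N, M, W - R + 1))
    for n in range(N):                       # multiple fmaps    -> filter reuse
        for m in range(M):                   # multiple filters  -> fmap reuse
            for c in range(C):               # multiple channels -> local accumulation
                for e in range(W - R + 1):
                    psums[n, m, e] += filter_rows[m, c] @ fmap_rows[n, c, e:e + R]
    return psums

# 2 fmaps, 2 filters, 3 channels, 3-wide filter row, 5-wide fmap row (hypothetical)
out = pe_multi_row(np.ones((2, 3, 3)), np.ones((2, 3, 5)))
print(out.shape, out[0, 0])                  # (2, 2, 3) [9. 9. 9.]
```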
Optimal Mapping in Row Stationary

[Figure: the CNN configuration (shape parameters N, M, C, H, R, E) and the hardware resources (PE array and storage hierarchy) feed an optimization step, the compiler-like mapper, which decides how rows from multiple fmaps, filters, and channels are assigned to the PE array.]

The mapping flow mirrors a compiler flow:

Compilation:
• DNN Shape and Size (Program)
• Dataflow, … (Architecture)
• Implementation Details (µArch)
• Mapper (Compiler) → Mapping (Binary)

Execution:
• DNN Accelerator (Processor) runs the Mapping (Binary) on the Input Data to produce the Processed Data.
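A toy sketch of what such a mapper could look like (purely illustrative: the tiling choices, the register-file usage estimate, and the cost function below are hypothetical stand-ins, not the optimization used in the paper): enumerate mappings that fit the hardware constraints and keep the one with the lowest estimated cost.

```python
from itertools import product

def enumerate_mappings(num_pes, rf_words, R, E):
    """Candidate mappings: how many filter rows (r_tile) and output rows
    (e_tile) to place on the PE array at once, subject to the array size
    and a crude per-PE register-file budget."""
    for r_tile, e_tile in product(range(1, R + 1), range(1, E + 1)):
        pes_needed = r_tile * e_tile        # one PE per (filter row, output row) pair
        rf_needed = 2 * r_tile + 1          # crude stand-in for words held per PE
        if pes_needed <= num_pes and rf_needed <= rf_words:
            yield (r_tile, e_tile)

def pick_best(mappings, cost_of):
    return min(mappings, key=cost_of)

# Hypothetical cost: number of passes over the array (fewer passes -> less
# buffer/DRAM traffic). R = 11 filter rows, E = 55 output rows.
best = pick_best(enumerate_mappings(num_pes=168, rf_words=16, R=11, E=55),
                 cost_of=lambda m: (11 * 55) / (m[0] * m[1]))
print("chosen (filter-row, output-row) tile:", best)
```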
Evaluate Reuse in Different Dataflows

• Weight Stationary
– Minimize movement of filter weights
• Output Stationary
– Minimize movement of partial sums
• No Local Reuse
– No PE local storage. Maximize global buffer size.
• Row Stationary
– Maximize reuse and accumulation at RF

Evaluation setup:
• Same total area
• 256 PEs
• AlexNet
• Batch size = 16

Normalized energy cost of a data access (relative to one MAC at the ALU):
• ALU: 1× (reference)
• RF → ALU: 1×
• PE → ALU: 2×
• Buffer → ALU: 6×
• DRAM → ALU: 200×
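Given these normalized costs, the energy of a mapping can be estimated by weighting the number of accesses at each level. A minimal sketch (the access counts below are hypothetical, chosen only to contrast an all-DRAM mapping with one that exploits local reuse):

```python
# Normalized energy per access, relative to one MAC at the ALU (from the table above).
ENERGY_PER_ACCESS = {"ALU": 1, "RF": 1, "PE": 2, "buffer": 6, "DRAM": 200}

def total_energy(macs, accesses):
    """Energy in MAC-equivalents: compute energy plus data-movement energy
    at each storage level. `accesses` maps level -> access count."""
    energy = macs * ENERGY_PER_ACCESS["ALU"]
    for level, count in accesses.items():
        energy += count * ENERGY_PER_ACCESS[level]
    return energy

macs = 100_000_000                                # hypothetical layer
all_dram = {"DRAM": 4 * macs}                     # every operand from/to DRAM
with_reuse = {"RF": 3 * macs, "buffer": macs // 10, "DRAM": macs // 100}
for name, acc in [("all-DRAM", all_dram), ("with local reuse", with_reuse)]:
    print(f"{name}: {total_energy(macs, acc):.3g} MAC-equivalents of energy")
```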
Variants of Output Stationary

[Table: OSA, OSB, and OSC differ in the shape of the parallel output region (which output activations are kept stationary and computed in parallel). OSA targets CONV layers, OSC targets FC layers, and OSB sits in between.]
Dataflow Comparison: CONV Layers

[Figure: normalized energy per MAC for the CNN dataflows WS, OSA, OSB, OSC, NLR, and RS, broken down (a) by data type — psums, weights, activations — and (b) by storage level — ALU, RF, NoC, buffer, DRAM.]

[Chen et al., ISCA 2016]
Row Stationary: Layer Breakdown

[Figure: normalized energy (1 MAC = 1) for AlexNet layers L1–L8 (CONV layers L1–L5, FC layers L6–L8), broken down by storage level (ALU, RF, NoC, buffer, DRAM). Annotation: RF dominates.]

[Chen et al., ISCA 2016]
[Figure: DCNN accelerator top-level block diagram (residue): compression (Comp) and ReLU blocks and a 64-bit off-chip DRAM interface.]

[Chen et al., ISSCC 2016]
Data Delivery with On-Chip Network

[Figure: DCNN accelerator with a 14×12 PE array (separate link clock and core clock), a 108 KB global buffer SRAM, ReLU and compression/decompression (Comp/Decomp) blocks, and a 64-bit off-chip DRAM interface. Filter, image (fmap), and psum delivery patterns are mapped onto the array; unused PEs are clock gated.]

Compared to broadcast, multicast saves >80% of NoC energy.
Chip Spec & Measurement Results

Technology: TSMC 65nm LP 1P9M
On-Chip Buffer: 108 KB
# of PEs: 168
Scratch Pad / PE: 0.5 KB
Core Frequency: 100 – 250 MHz
Peak Performance: 33.6 – 84.0 GOPS
Word Bit-width: 16-bit Fixed-Point
Natively Supported DNN Shapes:
– Filter Width: 1 – 32
– Filter Height: 1 – 12
– Num. Filters: 1 – 1024
– Num. Channels: 1 – 1024
– Horz. Stride: 1 – 12
– Vert. Stride: 1, 2, 4

[Die photo: 4000 µm × 4000 µm; global buffer plus spatial array of 168 PEs.]

To support 2.66 GMACs [8 billion 16-bit inputs (16 GB) and 2.7 billion outputs (5.4 GB)], the chip only requires 208.5 MB of buffer accesses and 15.4 MB of DRAM accesses.

[Chen et al., ISSCC 2016]
Summary of DNN Dataflows
• Weight Stationary
– Minimize movement of filter weights
– Popular with processing-in-memory architectures
• Output Stationary
– Minimize movement of partial sums
– Different variants optimized for CONV or FC layers
• No Local Reuse
– No PE local storage → maximize global buffer size
• Row Stationary
– Adapt to the NN shape and hardware constraints
– Optimized for overall system energy efficiency
Fused Layer
• Dataflow across multiple layers