Tutorial on DNN (4 of 9): DNN Accelerator Architectures

This document summarizes accelerator architectures for deep neural network (DNN) inference. It explains why memory access is the bottleneck in DNN workloads, given the large number of memory operations required, and identifies two key opportunities for reducing memory accesses: reusing activations and filter weights within the memory hierarchy, and accumulating partial sums locally without accessing external memory. It then describes the main dataflow paradigms for DNN accelerators, weight stationary (WS), output stationary (OS), no local reuse (NLR), and the energy-efficient row stationary (RS) dataflow used by Eyeriss, with example accelerators that exploit data reuse and local accumulation to reduce memory accesses and improve efficiency.


DNN Accelerator Architectures

ISCA Tutorial (2017)

Website: http://eyeriss.mit.edu/tutorial.html
Joel Emer, Vivienne Sze, Yu-Hsin Chen
1
Highly-Parallel Compute Paradigms

Temporal Architecture (SIMD/SIMT): the memory hierarchy and a shared register file feed a regular grid of ALUs under a centralized control unit.
Spatial Architecture (Dataflow Processing): the memory hierarchy feeds an array of ALUs that can pass data directly to their neighbors.

[Figure: side-by-side block diagrams of the two paradigms]

2
Memory Access is the Bottleneck

Every MAC* requires three memory reads (filter weight, fmap activation, partial sum) and one memory write (updated partial sum).

* multiply-and-accumulate

3
Memory Access is the Bottleneck

Worst case: all memory reads and writes are DRAM accesses.

•  Example: AlexNet [NIPS 2012] has 724M MACs
→ 2896M DRAM accesses required

4
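A quick check of the arithmetic, assuming the per-MAC access pattern above (three reads plus one write, all going to DRAM):

```python
# Worst-case DRAM traffic if every MAC operand moves to/from DRAM.
# Assumes the per-MAC pattern above: 3 reads (filter weight, fmap
# activation, partial sum) plus 1 write (updated partial sum).
MACS_ALEXNET = 724e6          # MACs in AlexNet (from the slide)
ACCESSES_PER_MAC = 3 + 1      # 3 reads + 1 write

dram_accesses = MACS_ALEXNET * ACCESSES_PER_MAC
print(f"{dram_accesses / 1e6:.0f}M DRAM accesses")   # -> 2896M
```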
Memory Access is the Bottleneck

Extra levels of local memory hierarchy are inserted between the ALU and DRAM on both the read and write paths.

Opportunities: (1) data reuse, (2) local accumulation

5 – 6
Types of Data Reuse in DNN

•  Convolutional Reuse – CONV layers only (sliding window). The same activations and filter weights are reused as the filter slides over the input fmap. Reuse: activations, filter weights.

•  Fmap Reuse – CONV and FC layers. The same input fmap activations are reused across multiple filters. Reuse: activations.

•  Filter Reuse – CONV and FC layers (batch size > 1). The same filter weights are reused across multiple input fmaps. Reuse: filter weights.

7 – 9
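To make these reuse factors concrete, here is a rough sketch with a hypothetical CONV-layer shape (R = filter height/width, E = output height/width, M = number of filters, N = batch size); the shape values are illustrative, not from the slides:

```python
# Rough upper bounds on per-value reuse in one CONV layer.
# Hypothetical layer shape, purely for illustration.
R, E = 5, 27           # filter height/width, output fmap height/width
M, N = 128, 16         # number of filters, batch size

# Convolutional reuse (sliding window): each weight is applied at every
# output position; each activation is touched by up to R*R filter positions.
weight_conv_reuse = E * E
act_conv_reuse = R * R

# Fmap reuse: each activation is also shared across all M filters.
# Filter reuse: each weight is also shared across all N fmaps in the batch.
act_total = act_conv_reuse * M
weight_total = weight_conv_reuse * N

print(f"each weight reused ~{weight_total}x, each activation ~{act_total}x")
```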
Memory Access is the Bottleneck

Opportunities: (1) data reuse, (2) local accumulation

1)  Data reuse can reduce DRAM reads of filter/fmap by up to 500×** (** AlexNet CONV layers)
2)  Partial sum accumulation does NOT have to access DRAM

•  Example: DRAM accesses in AlexNet can be reduced from 2896M to 61M (best case)

10 – 12
Spatial Architecture for DNN

Local memory hierarchy (fed from DRAM):
•  Global Buffer (100 – 500 kB)
•  Direct inter-PE network
•  PE-local memory: register file (RF), 0.5 – 1.0 kB per Processing Element (PE)

[Figure: DRAM → Global Buffer → array of PEs, each with an ALU, Reg File, and Control]

13
Low-Cost Local Data Access

A PE can fetch the data to run a MAC from its own RF, a neighboring PE, the Global Buffer, or DRAM.

Normalized Energy Cost* (data movement relative to one ALU op):
•  ALU operation: 1× (reference)
•  RF (0.5 – 1.0 kB) → ALU: 1×
•  Neighboring PE (NoC: 200 – 1000 PEs) → ALU: 2×
•  Global Buffer (100 – 500 kB) → ALU: 6×
•  DRAM → ALU: 200×
* measured from a commercial 65nm process

14
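A tiny cost-model sketch using the normalized numbers above; the access counts passed in are illustrative assumptions, not measurements:

```python
# Normalized energy of data movement per MAC, using the 65nm numbers above
# (relative to one ALU operation).
ENERGY = {"RF": 1.0, "PE": 2.0, "buffer": 6.0, "DRAM": 200.0}

def movement_energy(accesses):
    """accesses: dict of {storage level: number of accesses per MAC}."""
    return sum(ENERGY[level] * n for level, n in accesses.items())

# Illustrative extremes: all four operand accesses served by the local RF
# vs. all four going to DRAM (the worst case from the earlier slides).
print(movement_energy({"RF": 4}))     # 4.0
print(movement_energy({"DRAM": 4}))   # 800.0
```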
Low-Cost Local Data Access

How to exploit (1) data reuse and (2) local accumulation with limited low-cost local storage?

→ A specialized processing dataflow is required!

15 – 16
Dataflow Taxonomy

•  Weight Stationary (WS)


•  Output Stationary (OS)
•  No Local Reuse (NLR)

[Chen et al., ISCA 2016] 17


Weight Stationary (WS)

[Figure: Global Buffer supplies activations and collects psums; weights W0 – W7 are pinned in the PEs' register files]

•  Minimize weight read energy consumption
    −  maximize convolutional and filter reuse of weights

•  Broadcast activations and accumulate psums spatially across the PE array.

18
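A minimal 1-D loop-nest sketch of the WS idea (toy shapes, not any particular accelerator): each weight is fetched once and kept stationary while the activations stream past it.

```python
import numpy as np

def weight_stationary_1d(weights, activations):
    """Toy WS dataflow for a 1-D convolution: each 'PE' pins one weight in
    a local register and reuses it for every output position while the
    activations stream by and partial sums are accumulated."""
    K = len(weights)
    out = np.zeros(len(activations) - K + 1)
    for k, w_reg in enumerate(weights):    # weight read from buffer once ...
        for x in range(len(out)):          # ... then reused for every output
            out[x] += w_reg * activations[x + k]
    return out

print(weight_stationary_1d(np.array([1., 2., 3.]),
                           np.array([1., 0., 2., 1., 3.])))  # [ 7.  7. 13.]
```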
WS Example: nn-X (NeuFlow)

A 3×3 2D convolution engine: the weights are held inside the engine while activations stream in and psums stream out.

[Farabet et al., ICCV 2009] 19


Output Stationary (OS)

[Figure: Global Buffer supplies activations and weights; psums P0 – P7 stay pinned in the PEs' register files]

•  Minimize partial sum R/W energy consumption
    −  maximize local accumulation

•  Broadcast/multicast filter weights and reuse activations spatially across the PE array.

20
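A matching 1-D loop-nest sketch of the OS idea (again a toy, not a specific chip): each output's partial sum stays in a local register until it is fully accumulated.

```python
import numpy as np

def output_stationary_1d(weights, activations):
    """Toy OS dataflow for a 1-D convolution: each 'PE' owns one output and
    keeps its partial sum in a local register until fully accumulated."""
    K = len(weights)
    out = np.zeros(len(activations) - K + 1)
    for x in range(len(out)):        # one PE per output position
        psum = 0.0                   # psum stays stationary in the PE
        for k in range(K):           # weights broadcast across the PEs
            psum += weights[k] * activations[x + k]
        out[x] = psum                # written back exactly once
    return out

print(output_stationary_1d(np.array([1., 2., 3.]),
                           np.array([1., 0., 2., 1., 3.])))  # [ 7.  7. 13.]
```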
OS Example: ShiDianNao

[Figure: top-level architecture and PE architecture; weights and activations are fed into the PE array while psums stay local to the PEs]

[Du et al., ISCA 2015] 21


No Local Reuse (NLR)

[Figure: Global Buffer exchanges weights, activations, and psums directly with the PEs; no PE-local storage]

•  Use a large global buffer as shared storage
    −  reduce DRAM access energy consumption

•  Multicast activations, single-cast weights, and accumulate psums spatially across the PE array.

22
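Since NLR has no PE-local storage, the interesting quantity is buffer traffic rather than loop order; a toy counting sketch for an assumed 1-D convolution (sizes illustrative):

```python
def nlr_access_counts(K, W):
    """Toy access count for NLR on a 1-D convolution: with no PE-local
    storage, every MAC reads its weight, activation, and partial sum from
    the global buffer and writes the updated partial sum back."""
    macs = K * (W - K + 1)
    buffer_reads = 3 * macs       # weight + activation + psum per MAC
    buffer_writes = macs          # updated psum per MAC
    return macs, buffer_reads, buffer_writes

print(nlr_access_counts(K=3, W=5))   # (9, 27, 9)
```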
NLR Example: UCLA

[Figure: activations and weights flow from the buffer into the compute units; psums flow back]

[Zhang et al., FPGA 2015] 23


NLR Example: TPU

[Figure: top-level architecture and Matrix Multiply Unit, with weights, activations, and psums annotated]

[Jouppi et al., ISCA 2017] 24


Taxonomy: More Examples
•  Weight Stationary (WS)
[Chakradhar, ISCA 2010] [nn-X (NeuFlow), CVPRW 2014]
[Park, ISSCC 2015] [ISAAC, ISCA 2016] [PRIME, ISCA 2016]

•  Output Stationary (OS)


[Peemen, ICCD 2013] [ShiDianNao, ISCA 2015]
[Gupta, ICML 2015] [Moons, VLSI 2016]

•  No Local Reuse (NLR)


[DianNao, ASPLOS 2014] [DaDianNao, MICRO 2014]
[Zhang, FPGA 2015] [TPU, ISCA 2017]

25
Energy Efficiency Comparison

•  Same total area, 256 PEs
•  AlexNet CONV layers, batch size = 16
•  Variants of OS: OSA, OSB, OSC

[Figure: bar chart of normalized energy per MAC (0 – 2) for the WS, OSA, OSB, OSC, NLR, and RS (Row Stationary) dataflows]

[Chen et al., ISCA 2016] 26 – 27
Energy-Efficient Dataflow: Row Stationary (RS)

•  Maximize reuse and accumulation at the RF

•  Optimize for overall energy efficiency instead of for only a certain data type

[Chen et al., ISCA 2016] 28


Row Stationary: Energy-efficient Dataflow

[Figure: Filter * Input Fmap = Output Fmap]

29
1D Row Convolution in PE

A filter row (a b c) slides across an fmap row (a b c d e) to produce a row of partial sums. The filter row and a sliding window of the fmap row are held in the PE's register file; each partial sum is accumulated in the RF as the window slides.

•  Maximize row convolutional reuse in RF
    −  keep a filter row and fmap sliding window in RF

•  Maximize row psum accumulation in RF

30 – 34
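A toy model of this single-PE row convolution (filter row pinned in the "RF", a sliding window of the fmap row, and full psum accumulation before write-back); names and sizes are illustrative:

```python
from collections import deque

def rs_1d_row_conv(filter_row, fmap_row):
    """Toy model of the RS 1-D row convolution in one PE: the filter row is
    pinned in the RF, a sliding window of the fmap row is kept in the RF,
    and each output psum is fully accumulated before being passed on."""
    K = len(filter_row)
    window = deque(fmap_row[:K], maxlen=K)   # fmap sliding window in the RF
    psums = []
    for nxt in list(fmap_row[K:]) + [None]:
        # accumulate one output entirely inside the RF
        psums.append(sum(w * a for w, a in zip(filter_row, window)))
        if nxt is not None:
            window.append(nxt)               # slide the window by one
    return psums

# filter (a b c) * fmap (a b c d e) -> a row of 3 partial sums
print(rs_1d_row_conv([1., 2., 3.], [1., 0., 2., 1., 3.]))  # [7.0, 7.0, 13.0]
```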
2D Convolution in PE Array

Each PE performs a 1D row convolution; a 2D convolution with a 3-row filter is mapped onto a 3×3 block of PEs as follows (filter row * fmap row):

PE 1: Row 1 * Row 1    PE 4: Row 1 * Row 2    PE 7: Row 1 * Row 3
PE 2: Row 2 * Row 2    PE 5: Row 2 * Row 3    PE 8: Row 2 * Row 4
PE 3: Row 3 * Row 3    PE 6: Row 3 * Row 4    PE 9: Row 3 * Row 5

Each PE column produces one output fmap row.

•  Convolutional reuse maximized: filter rows are reused across PEs horizontally
•  Convolutional reuse maximized: fmap rows are reused across PEs diagonally
•  2D accumulation maximized: partial sums accumulate across PEs vertically

35 – 41
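A sketch of the 2-D mapping implied by the grid above; `row_conv` stands in for the per-PE row convolution, and the PE indices and shapes are illustrative:

```python
import numpy as np

def row_conv(filter_row, fmap_row):
    """1-D row convolution of one filter row over one fmap row
    (what a single PE computes in the previous sketch)."""
    K = len(filter_row)
    return np.array([np.dot(filter_row, fmap_row[x:x + K])
                     for x in range(len(fmap_row) - K + 1)])

def rs_2d_conv(filt, fmap):
    """Toy RS mapping of a 2-D convolution onto an R x E grid of PEs:
    PE (i, j) computes (filter row i) * (fmap row i + j); partial sums
    are accumulated down each PE column to produce output row j."""
    R = filt.shape[0]
    E = fmap.shape[0] - R + 1
    out = np.zeros((E, fmap.shape[1] - filt.shape[1] + 1))
    for j in range(E):              # one PE column per output fmap row
        for i in range(R):          # one PE per filter row
            # filter row i is reused horizontally across PE columns;
            # fmap row i+j is reused diagonally across the PE array
            out[j] += row_conv(filt[i], fmap[i + j])
    return out

filt = np.arange(9.0).reshape(3, 3)
fmap = np.arange(25.0).reshape(5, 5)
ref = np.array([[np.sum(filt * fmap[r:r + 3, c:c + 3]) for c in range(3)]
                for r in range(3)])
print(np.allclose(rs_2d_conv(filt, fmap), ref))   # True
```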
Dimensions Beyond 2D Convolution

(1) Multiple Fmaps    (2) Multiple Filters    (3) Multiple Channels

42
Filter Reuse in PE (1: Multiple Fmaps)

Filter 1, Channel 1, Row 1 * Fmap 1, Row 1 = Psum 1, Row 1
Filter 1, Channel 1, Row 1 * Fmap 2, Row 1 = Psum 2, Row 1
→ both MACs share the same filter row

Processing in PE: concatenate fmap rows

Filter 1, Channel 1, Row 1 * (Fmap 1 Row 1 concatenated with Fmap 2 Row 1) = (Psum 1 Row 1, Psum 2 Row 1)

43 – 45
Fmap Reuse in PE (2: Multiple Filters)

Filter 1, Channel 1, Row 1 * Fmap 1, Row 1 = Psum 1, Row 1
Filter 2, Channel 1, Row 1 * Fmap 1, Row 1 = Psum 2, Row 1
→ both MACs share the same fmap row

Processing in PE: interleave filter rows

(Filter 1 Row 1 interleaved with Filter 2 Row 1) * Fmap 1, Row 1 = (Psum 1 Row 1 interleaved with Psum 2 Row 1)

46 – 48
Channel Accumulation in PE (3: Multiple Channels)

Filter 1, Channel 1, Row 1 * Fmap 1, Channel 1, Row 1 = Psum 1, Row 1
Filter 1, Channel 2, Row 1 * Fmap 1, Channel 2, Row 1 = Psum 1, Row 1
→ accumulate psums: Row 1 + Row 1 = Row 1

Processing in PE: interleave channels

Filter 1 (Channels 1 & 2, interleaved) * Fmap 1 (Channels 1 & 2, interleaved) = Psum, Row 1

49 – 51
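A small sketch of the channel-accumulation step inside one PE; the filter and fmap rows are made up for illustration, and `row_conv` again stands in for the PE's 1-D row convolution:

```python
import numpy as np

def row_conv(filter_row, fmap_row):
    """1-D row convolution (as computed by one PE)."""
    K = len(filter_row)
    return np.array([np.dot(filter_row, fmap_row[x:x + K])
                     for x in range(len(fmap_row) - K + 1)])

# Channel accumulation inside one PE, with made-up rows for illustration:
# rows from channel 1 and channel 2 of the same filter/fmap pair are
# interleaved in the PE and their psums land in the same accumulator.
filt = {1: np.array([1., 2., 3.]),
        2: np.array([4., 5., 6.])}                  # channel -> filter row
fmap = {1: np.array([1., 0., 2., 1., 3.]),
        2: np.array([2., 1., 0., 1., 1.])}          # channel -> fmap row

psum = np.zeros(3)
for ch in (1, 2):                         # interleave channels in the PE
    psum += row_conv(filt[ch], fmap[ch])  # accumulate into one psum row
print(psum)   # a single psum row, already summed over both channels
```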
DNN Processing – The Full Picture

Multiple fmaps:    Filter 1 * (Fmap 1 & 2) = Psum 1 & 2
Multiple filters:  (Filter 1 & 2) * Fmap 1 = Psum 1 & 2
Multiple channels: Filter 1 * Fmap 1 = Psum

Map rows from multiple fmaps, filters, and channels to the same PE to exploit other forms of reuse and local accumulation.

52
Optimal Mapping in Row Stationary

The CNN configuration (shape parameters such as C, M, R, H, E, N) and the available hardware resources (global buffer, PE array) are fed to an optimizing compiler (mapper), which produces the Row Stationary mapping: which filter rows, fmap rows, and channels each PE processes, and how rows from multiple fmaps, filters, and channels are combined within a PE.

[Figure: CNN configurations + hardware resources → optimization compiler (mapper) → Row Stationary mapping onto the PE array]

[Chen et al., ISCA 2016] 53


Computer Architecture Analogy

Compilation ↔ Execution:
•  DNN shape and size ↔ Program
•  Dataflow, … ↔ Architecture
•  Implementation details ↔ µArch
•  Mapper ↔ Compiler
•  DNN accelerator ↔ Processor
•  Mapping ↔ Binary
The accelerator executes the mapping on the input data to produce the processed data.

[Chen et al., Micro Top-Picks 2017] 54


Dataflow Simulation Results

55
Evaluate Reuse in Different Dataflows

•  Weight Stationary
    −  Minimize movement of filter weights
•  Output Stationary
    −  Minimize movement of partial sums
•  No Local Reuse
    −  No PE local storage; maximize global buffer size
•  Row Stationary

Evaluation setup: same total area, 256 PEs, AlexNet, batch size = 16

Normalized Energy Cost*: ALU 1× (reference), RF → ALU 1×, PE → ALU 2×, Buffer → ALU 6×, DRAM → ALU 200×

56
Variants of Output Stationary

# Output Channels:    OSA = Single,   OSB = Multiple, OSC = Multiple
# Output Activations: OSA = Multiple, OSB = Multiple, OSC = Single
Notes:                OSA targets CONV layers; OSC targets FC layers

[Figure: shape of the parallel output region (across the E and M dimensions) for each variant]

57
Dataflow Comparison: CONV Layers

[Figure: normalized energy per MAC (0 – 2) for WS, OSA, OSB, OSC, NLR, RS, broken down by data type: psums, weights, activations]

RS optimizes for the best overall energy efficiency

[Chen et al., ISCA 2016] 58
Dataflow Comparison: CONV Layers

[Figure: normalized energy per MAC (0 – 2) for WS, OSA, OSB, OSC, NLR, RS, broken down by storage level: ALU, RF, NoC, buffer, DRAM]

RS uses 1.4× – 2.5× lower energy than other dataflows

[Chen et al., ISCA 2016] 59
Dataflow Comparison: FC Layers

[Figure: normalized energy per MAC (0 – 2) for WS, OSA, OSB, OSC, NLR, RS, broken down by data type: psums, weights, activations]

RS uses at least 1.3× lower energy than other dataflows

[Chen et al., ISCA 2016] 60
Row Stationary: Layer Breakdown

[Figure: normalized energy (1 MAC = 1) per AlexNet layer L1 – L8, broken down by ALU, RF, NoC, buffer, DRAM]

•  CONV layers (L1 – L5): RF dominates
•  FC layers (L6 – L8): DRAM dominates
•  Total energy: CONV layers ≈ 80%, FC layers ≈ 20%
→ CONV layers dominate energy consumption!

[Chen et al., ISCA 2016] 61 – 64
Hardware Architecture for RS Dataflow

[Chen et al., ISSCC 2016] 65


Eyeriss DNN Accelerator

[Figure: accelerator block diagram: 14×12 PE array; 108 KB global buffer SRAM holding filters, input fmaps, output fmaps, and psums; decompression (Decomp) on the input path and compression (Comp) plus ReLU on the output path; 64-bit interface to off-chip DRAM; separate link clock and core clock domains]

[Chen et al., ISSCC 2016] 66
Data Delivery with On-Chip Network

[Figure: same accelerator block diagram, highlighting the filter, fmap, and psum delivery patterns from the global buffer to the 14×12 PE array]

How to accommodate different shapes with a fixed PE array?

67
Logical to Physical Mappings

•  Replication: AlexNet layers 3 – 5 map to a logical PE array of 13 × 3 (width × height), which is replicated to fill the 14 × 12 physical PE array.
•  Folding: AlexNet layer 2 maps to a logical PE array of 27 × 5, which is folded into physical-width pieces to fit the 14 × 12 array.
•  Unused PEs are clock gated.

68 – 69
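A rough sketch of the replication/folding bookkeeping for a logical PE set on the fixed 14×12 array; this covers only the PE-count arithmetic, not the chip's actual mapping logic:

```python
# Fitting a logical PE set onto the fixed 14x12 physical array: a rough
# sketch of the PE-count bookkeeping only (not the chip's mapping logic).
PHYS_W, PHYS_H = 14, 12
PHYS_PES = PHYS_W * PHYS_H            # 168

def fit(logical_w, logical_h):
    """Return (copies, active PEs, clock-gated PEs) for one logical set."""
    if logical_w <= PHYS_W:
        # Replication: stack copies of the short logical set vertically.
        copies = PHYS_H // logical_h
        active = copies * logical_h * logical_w
    else:
        # Folding: cut the wide logical set into physical-width strips and
        # stack the strips; all logical PEs stay active, just rearranged.
        copies = 1
        active = logical_h * logical_w
    return copies, active, PHYS_PES - active   # unused PEs are clock gated

print(fit(13, 3))   # AlexNet layers 3-5 (13 wide x 3 tall): (4, 156, 12)
print(fit(27, 5))   # AlexNet layer 2   (27 wide x 5 tall): (1, 135, 33)
```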
Data Delivery with On-Chip Network

[Figure: same accelerator block diagram with the filter, image (fmap), and psum delivery patterns highlighted]

Compared to broadcast, multicast saves >80% of NoC energy

70
Chip Spec & Measurement Results

Technology: TSMC 65nm LP 1P9M
On-Chip Buffer: 108 KB
# of PEs: 168
Scratch Pad / PE: 0.5 KB
Core Frequency: 100 – 250 MHz
Peak Performance: 33.6 – 84.0 GOPS
Word Bit-width: 16-bit fixed-point
Natively Supported DNN Shapes:
    Filter Width: 1 – 32
    Filter Height: 1 – 12
    Num. Filters: 1 – 1024
    Num. Channels: 1 – 1024
    Horz. Stride: 1 – 12
    Vert. Stride: 1, 2, 4

[Die photo: 4000 µm × 4000 µm; global buffer and spatial array of 168 PEs]

To support 2.66 GMACs [8 billion 16-bit inputs (16 GB) and 2.7 billion outputs (5.4 GB)], only 208.5 MB (buffer) and 15.4 MB (DRAM) of data accesses are required.

[Chen et al., ISSCC 2016] 71
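The data-volume figures follow from the earlier three-reads/one-write-per-MAC model; a quick sketch of that arithmetic (rounding differs slightly from the slide):

```python
# Rough arithmetic behind the data-volume figures above, assuming the
# three-reads / one-write-per-MAC model from earlier in the slides.
GMACS = 2.66e9                 # MACs for the measured workload
BYTES_PER_WORD = 2             # 16-bit fixed point

input_words = 3 * GMACS        # weight + activation + psum read per MAC
output_words = 1 * GMACS       # updated psum written per MAC
print(f"inputs : {input_words/1e9:.1f} B words = "
      f"{input_words*BYTES_PER_WORD/1e9:.1f} GB")
print(f"outputs: {output_words/1e9:.1f} B words = "
      f"{output_words*BYTES_PER_WORD/1e9:.1f} GB")
# The RS dataflow plus compression reduce this traffic to ~208.5 MB of
# buffer accesses and ~15.4 MB of DRAM accesses (per the slide).
```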
Summary of DNN Dataflows
•  Weight Stationary
–  Minimize movement of filter weights
–  Popular with processing-in-memory architectures

•  Output Stationary
–  Minimize movement of partial sums
–  Different variants optimized for CONV or FC layers

•  No Local Reuse
–  No PE local storage → maximize global buffer size

•  Row Stationary
–  Adapt to the NN shape and hardware constraints
–  Optimized for overall system energy efficiency

72
Fused Layer
•  Dataflow across multiple layers

[Alwani et al., MICRO 2016] 73
