A CGRA-based Approach for Accelerating Convolutional Neural Networks

A CGRA-based Approach
for Accelerating
Convolutional Neural Networks
Masakazu Tanomoto, Shinya Takamaeda-Yamazaki,
Jun Yao, and Yasuhiko Nakashima
Nara Institute of Science and Technology (NAIST), Japan
E-mail: shinya_at_is_naist_jp
IEEE MCSoC'15 @Torino
September 23, 2015

Outline
n  Motivation: Deep learning on embedded computers
l  Target: Convolutional Neural Network (CNN)
n  Our approach: CGRA-based CNN acceleration
l  EMAX (Energy-aware Multi-mode Accelerator eXtension)
l  Mapping CNN on EMAX
n  Evaluation
l  Performance per memory bandwidth
l  Performance per area
n  Conclusion
MCSoC15 Shinya T-Y, NAIST 2

Deep learning
n  Recognition (Convolutional Neural Network (CNN))
l  Extracting high-level features automatically from raw data
l  Ex) Image, speech, and text recognition, image search
n  Reinforcement learning (Deep Q-Network (DQN))
l  Learning appropriate strategy for controlling something
l  Ex) Gaming AI, Robot control
Playing Atari 2600 automatically
(Human-level control through deep
reinforcement learning [Nature'15])
Extracted features of human and cat
(Building High-level Features Using Large
Scale Unsupervised Learning [ICML'12])

Convolutional Neural Network (CNN)
n  Nesting multiple processing layers
l  Convolution: Multiple small matrix-matrix multiplications
•  Each weight matrix corresponds to a learned feature map
•  Features can be learned automatically by error backpropagation
l  Pooling and Max-out: selection from multiple values
l  Full connection: Large matrix-matrix multiplication
n  Performance Bottleneck: Convolution
l  Numerous small matrix-matrix multiplication with stencil
[Figure: CNN structure from the input layer through the hidden layers to the output layer; stages shown: convolution, pooling, max-out, convolution, full connection.]
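
As a rough illustration of the "selection from multiple values" step, the sketch below implements max pooling over non-overlapping 2x2 windows. The window size, the row-major layout, and the function name `max_pool_2x2` are assumptions made for this example; the slides do not specify them.

```c
#include <assert.h>
#include <stddef.h>

/* Max pooling over non-overlapping 2x2 windows (window size is an
 * assumption of this sketch). in is h x w, row-major; out is
 * (h/2) x (w/2), row-major. */
void max_pool_2x2(const float *in, size_t h, size_t w, float *out)
{
    assert(in != NULL && out != NULL);
    assert(h % 2 == 0 && w % 2 == 0);
    for (size_t y = 0; y < h / 2; y++) {
        for (size_t x = 0; x < w / 2; x++) {
            /* p points at the top-left element of the current window. */
            const float *p = in + (2 * y) * w + 2 * x;
            float m = p[0];
            if (p[1] > m) m = p[1];
            if (p[w] > m) m = p[w];
            if (p[w + 1] > m) m = p[w + 1];
            out[y * (w / 2) + x] = m;   /* keep only the largest value */
        }
    }
}
```

Each output element is selected from four inputs, so the layer performs comparisons only and no multiplications.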

Motivation: DNN on embedded computers
n  Machine learning on IoT: Learning and decision on edge
computers will become more important
l  Sending all data to data centers?: Network traffic problemL
l  Decision on data centers?: Very long latencyL
n  Challenge: Energy efficient embedded accelerators
l  Why not GPU?: GPU is very energy hungry and requires
absolute energy
•  Not only energy efficiency, but also absolute peak energy amount is
important
l  Why not ASIC?: Limited capability of algorithm customization
•  Algorithms of machine learning are rapidly evolving
l  Why not FPGA?: Energy overhead to building computing logics
l  CGRA?

Computation pattern: Full connection
n  Output vector is determined by a simple vector-matrix
multiplication
l  Input and output size is certainly large: more than 1024
l  Weight matrix size is also large
n  GPU is OK: suitable for matrix multiplication
l  GPU has matrix libraries: CUBLAS, ...
[Figure: output vector = input vector dot weight matrix.]
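
The vector-matrix product described on this slide can be sketched in plain C as follows. This is a minimal reference version, not the GPU library path the slide refers to; the function name `full_connection` and the row-major weight layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Fully connected layer as a vector-matrix product:
 *   out[j] = sum over i of in[i] * weight[i * out_dim + j]
 * The row-major in_dim x out_dim weight layout is an assumption. */
void full_connection(const float *in, const float *weight, float *out,
                     size_t in_dim, size_t out_dim)
{
    assert(in != NULL && weight != NULL && out != NULL);
    for (size_t j = 0; j < out_dim; j++) {
        float acc = 0.0f;
        for (size_t i = 0; i < in_dim; i++) {
            acc += in[i] * weight[i * out_dim + j];
        }
        out[j] = acc;
    }
}
```

With input and output sizes above 1024, the weight matrix has more than a million entries, which is the large-matrix case that suits a GPU.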

Computation pattern: Convolution
n  A value of the result matrix is calculated by numerous
matrix-matrix multiplication with a small weight matrix
l  Weight matrix size is usually small: from 3 to 8
n  I know GPU is very fast for matrix-matrix multiplication
l  Really?
[Figure: a small weight matrix (size 3) is applied by dot product to one sub-region of the input, then to the next sub-region.]

SGEMM performance on GPU
n  GPU is fast, if the matrix size is large enough
l  GPU is throughput-oriented processor
l  In case of small matrix, parallelisms and memory bandwidth are
not exploited efficiently
[Figure: SGEMM on NVIDIA Jetson TK1 (GK20A) for matrix sizes 64, 128, 256, 512, 1024, 2048, and 4096. Axes: # active warps per active cycle and performance [GFLOPS]. Series: warp/cycle (small kernel), warp/cycle (large kernel), GFLOPS.]

Preprocessing for Convolution on GPU
n  In order to use fast matrix multiplication library of GPU,
data duplication is usually utilized
l  Converting sub-regions into a single large matrix
n  Faster than the naive convolution, but still just a
performance overhead
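[Figure: data duplication for a k x k window (k = 3) on an n x n input. Each 3 x 3 sub-region, up to the last one spanning [n-3, n-3] to [n-1, n-1], is duplicated into one row of a temporary array for matrix multiplication:
row 0: [0,0] [0,1] [0,2] [1,0] [1,1] [1,2] [2,0] [2,1] [2,2]
row 1: [0,1] [0,2] [0,3] [1,1] [1,2] [1,3] [2,1] [2,2] [2,3]
The array has 9 (= k^2) columns and (n-2)^2 rows.]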
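
The duplication step can be sketched as below. This is an illustrative version for a single-channel, row-major n x n input; the function name `duplicate_subregions` is hypothetical and does not come from any GPU library.

```c
#include <assert.h>
#include <stddef.h>

/* Copy every k x k sub-region of an n x n row-major input into one row
 * of a temporary array with k*k columns. tmp must hold
 * (n-k+1)*(n-k+1)*k*k floats. Returns the number of rows written. */
size_t duplicate_subregions(const float *in, size_t n, size_t k, float *tmp)
{
    assert(in != NULL && tmp != NULL);
    assert(k >= 1 && k <= n);
    size_t out_side = n - k + 1;
    size_t row = 0;
    for (size_t y = 0; y < out_side; y++) {
        for (size_t x = 0; x < out_side; x++) {
            float *dst = tmp + row * k * k;
            for (size_t ky = 0; ky < k; ky++) {
                for (size_t kx = 0; kx < k; kx++) {
                    /* Interior elements are copied into several rows. */
                    dst[ky * k + kx] = in[(y + ky) * n + (x + kx)];
                }
            }
            row++;
        }
    }
    return row;
}
```

For k = 3 the temporary array holds 9 * (n-2)^2 values built from only n^2 inputs, which is the overhead the slide points out: the extra copies cost memory and bandwidth before any multiplication starts.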

Our approach: EMAX
Energy-aware Multi-mode Accelerator eXtension
n  A CGRA of local memory based PEs with several buses
l  Each PE has a local memory for data locality
[Figure: EMAX system overview. A CPU core and DRAM sit on an interconnection; EMAX attaches through a memory interface and contains an array of PEs.]

Real Chip of EMAX
n  12.5mm x 12.5mm in 180nm technology

Processing Element (PE)
n  Local memory on each PE for efficient data locality and
memory bandwidth utilization
[Figure: PE block diagram, shown repeatedly with one component highlighted at a time. Labels: EX1, EX2, EX FIFO, LMM, LMM FIFO, EAG, DIN, ADDR, DOUT, Const (x6), Internal Shuffle Bus, External Shuffle Bus, Memory Bus.]
Highlighted components, in order:
l  LMM: Local Memory
l  FIFO
l  Execution units
l  Constant registers
l  External Shuffle Bus
l  Memory Bus

EMAX instruction
Instruction format
Type1: row#, col#, dist [count] ALU_OP & MEM_OP RGI LMM_CONTROL
Type2: row#, col#, dist [count] ALU_OP
Type3: row#, col#, dist [count] & MEM_OP RGI LMM_CONTROL

(a) EX1 operation
32-bit operation:            add/add3/sub/sub3
16-bitx2 operation:          mauh/mauh3/msuh3
Misc operation:              mulh/mmrg3/msad/minl/minl3/mh2bw/mcas/mmid3/mmax/mmax3/mmin/mmin3
Load from EX_FIFO:           ldb/ldub/ldh/lhuh/ld
Floating Point Operation:    fmul/fma3/fadd

(b) EX2 operation
32-bit operation:            and/or/xor
16-bitx2 operation:          mauh/mauh3/msuh3

(c) LMM operation
Load from LMM or LMM_FIFO:   ldb/ldub/ldh/lhuh/ld
Store to LMM:                stb/sth/st/cst

Forward propagation
n  Weight matrix is constant in the inter-most loop
l  Assigned into constant registers
n  Index of In increases linearly
l  Burst bulk transfer from the external memory
Operations per activation of EMAX
Operations per clock cycle on EMAX
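/* Forward propagation of one convolution layer */
for (int i1 = 0; i1 < InDim; i1++) {                       /* input dimension */
  for (int j1 = 0; j1 < (Nimg-Nk+1); j1++) {               /* output row */
    for (int i2 = 0; i2 < OutDim; i2++) {                  /* output dimension */
      for (int j2 = 0; j2 < (Nimg-Nk+1)*(Nbatch); j2++) {  /* output column */
        for (int ky = 0; ky < Nk; ky++) {                  /* window row */
          for (int kx = 0; kx < Nk; kx++) {                /* window column */
            Out[i2][j1][j2] += Weight[i1][i2][ky][kx] * In[i1][j1+ky][j2+kx];
          }
        }
      }
    }
  }
}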
InDim: Dimension of input data, OutDim: Dimension of output data
Nimg: Side length of input data, Nbatch: Batch size (= number of pixels)
Nk: Convolution window size
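
To make the two observations above concrete, here is a minimal sketch of the j2/ky/kx loops for Nk = 3 and one fixed (i1, j1, i2). The nine weights are read once before the loop, standing in for constant registers, and the input rows are read at indices that advance by one per iteration. The function name `conv3x3_row` and the three-row-pointer interface are assumptions for illustration, not EMAX code.

```c
#include <assert.h>
#include <stddef.h>

/* One output row of a 3x3 convolution for a single (i1, j1, i2).
 * top, mid, bot: three consecutive input rows, each width + 2 long.
 * out: width partial sums, accumulated in place (Out[...] += ...). */
void conv3x3_row(const float *top, const float *mid, const float *bot,
                 const float weight[3][3], float *out, size_t width)
{
    assert(top && mid && bot && weight && out);
    /* The weights never change inside the loop, so read them once. */
    const float w00 = weight[0][0], w01 = weight[0][1], w02 = weight[0][2];
    const float w10 = weight[1][0], w11 = weight[1][1], w12 = weight[1][2];
    const float w20 = weight[2][0], w21 = weight[2][1], w22 = weight[2][2];

    for (size_t j2 = 0; j2 < width; j2++) {
        /* Input indices grow linearly with j2: j2, j2 + 1, j2 + 2. */
        float acc = out[j2];
        acc += w00 * top[j2] + w01 * top[j2 + 1] + w02 * top[j2 + 2];
        acc += w10 * mid[j2] + w11 * mid[j2 + 1] + w12 * mid[j2 + 2];
        acc += w20 * bot[j2] + w21 * bot[j2 + 1] + w22 * bot[j2 + 2];
        out[j2] = acc;
    }
}
```

Because each row is read front to back, the three input rows can be fetched as contiguous blocks, which is what makes a burst transfer from external memory possible.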

CNN on EMAX (3x3 convolution)
[Figure: a 3x3 convolution mapped onto the EMAX PE array (rows 0 to 7, columns 0 to 3), with loop control and the memory interface. Legend: LMM OP, EX1/EX2 OP, FIFO OP, Const. Units shown: LMM LD of in[i-1][], in[i][], and in[i+1][]; FIFO; FMUL; FMA; LMM LD of out[i][j]; FADD; LMM ST; Preload; Drain. The constant registers hold w[0][0] through w[2][2]. The figure is shown repeatedly with one part highlighted at a time.]
Annotations on the figure:
l  Preload the next input (in[i+2][]) and (out[i+1][]) from memory
l  Drain the previous result to memory
l  Store the current result to LMM
l  Weight values are assigned to constant registers. Input data are stored in LMM
l  Next input data set as stencil
Highlighted steps, in order:
n  3x3 weight matrix in constant registers
n  3 input data sets in LMMs; the same read data is forwarded via FIFOs
n  Each stage reads data from the constant register, the LMM, and the
execution unit in the previous stage; the operation result is passed to the next stage
n  Final result is stored into LMM in the next stage
n  Write back the previous data to the main memory
n  Load the next input data from the main memory
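
The row reuse in this mapping can be modeled in software as below. This is only a sequential sketch under stated assumptions: three local buffers stand in for the LMMs, `MAX_WIDTH` is a hypothetical buffer capacity, and the function name `conv3x3_stream` is invented for the example. On EMAX the preload and drain overlap with computation, which this model does not capture.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_WIDTH 64  /* hypothetical capacity of one local row buffer */

/* 3x3 convolution of an n x n row-major input. Three local buffers hold
 * the rows currently in use; each step copies in exactly one new row.
 * out is (n-2) x (n-2), row-major. Returns the number of input rows
 * copied from "main memory" (the in array). */
size_t conv3x3_stream(const float *in, size_t n, const float w[3][3],
                      float *out)
{
    assert(in != NULL && w != NULL && out != NULL);
    assert(n >= 3 && n <= MAX_WIDTH);
    float lmm[3][MAX_WIDTH];
    size_t loads = 0;

    /* Fill two buffers before the loop starts. */
    memcpy(lmm[0], in, n * sizeof(float));
    memcpy(lmm[1], in + n, n * sizeof(float));
    loads = 2;

    for (size_t i = 0; i + 2 < n; i++) {
        /* Overwrite the buffer whose old row is no longer needed. */
        memcpy(lmm[(i + 2) % 3], in + (i + 2) * n, n * sizeof(float));
        loads++;

        for (size_t j = 0; j + 2 < n; j++) {
            float acc = 0.0f;
            for (size_t ky = 0; ky < 3; ky++) {
                const float *row = lmm[(i + ky) % 3];
                for (size_t kx = 0; kx < 3; kx++) {
                    acc += w[ky][kx] * row[j + kx];
                }
            }
            out[i * (n - 2) + j] = acc;
        }
    }
    return loads;
}
```

The return value equals n: every input row is fetched once and then reused for up to three output rows, with no duplicated temporary array.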

Evaluation setup
n  Benchmark: deep learning datasets and networks
l  Imagenet (Alexnet-2), CIFAR10, MNIST (Lenet)
n  Hardware:
l  CPU (Corei7, ARM), GPU (Desktop, Mobile), EMAX
l  Metric: Performance per bandwidth, Performance per area
•  Estimation from actual LSI of EMAX and software simulations

Performance per memory bandwidth
n  EMAX achieves better performance in embedded class
datasets
[Figure: Operations/Byte for EMAX, GTX980, GK20A, Core i7, and ARM on Alexnet-2, CIFAR10-1, CIFAR10-2, CIFAR10-3, CIFAR10 (Avg), Lenet-1, Lenet-2, and Lenet (Avg).]
l  Alexnet: since the matrix size is large, the desktop GPU is 3.17x better
l  CIFAR-10: EMAX is 1.41x better than the mobile GPU
l  Lenet: EMAX is 1.75x better than the mobile GPU

Performance per area
n  EMAX achieves much better performance in embedded
class datasets: CGRA is better for embedded systems?
[Figure: AreaPerf [FLOPS/Tr] for EMAX, GTX980, and Core i7 on Alexnet-2, CIFAR10-1, CIFAR10-2, CIFAR10-3, CIFAR10 (Avg), Lenet-1, Lenet-2, and Lenet (Avg).]
l  Alexnet: since the matrix size is large, the desktop GPU is 2.2x better
l  CIFAR-10: EMAX is 1.76x better than the mobile GPU
l  Lenet: EMAX is 1.95x better than the mobile GPU

Conclusion
n  A CGRA-based acceleration approach of convolutional
neural network (CNN) for embedded accelerators
l  EMAX (Energy-aware Multi-mode Accelerator eXtension)
n  EMAX outperforms GPU in embedded class data sets
l  1.75x better performance per memory bandwidth
l  1.95x better performance per area ( energy)
[Figure: EMAX system overview. A CPU core and DRAM sit on an interconnection; EMAX attaches through a memory interface and contains an array of PEs.]
