Edge Machine Learning For Embedded Deep Dive
Presented By
Andy Luo
Sr. Product Marketing Manager
2018-10-01
Cloud ML
And there are many more …
Edge ML
© Copyright 2018 Xilinx
Xilinx Value Proposition in Edge/Embedded ML
3 Future proof to lower precisions
4 Low latency end-to-end
5 Scalable device family for different applications
Key Challenges for Xilinx in Edge/Embedded ML
Machine Learning
Development tools
Platforms: USB3, HDMI, MIPI
Quantization
Face detection, Pose estimation, Video analytics, Lane detection, Object detection, Segmentation
Framework: Darknet
Tools & IP
  Compression: Pruning, Quantization
  Compilation: Compiler, Assembler
  Runtime: Core API, Loader, Driver, Profiler
HW Platforms: Z7020 Board, Z7020 SOM, ZU2 SOM, ZU2/3 Card, ZU9 Card, ZCU102, ZCU104, Ultra96
Deephi also has LSTM IP for KU115/VU9P as a part of Cloud ML
DNNDK Overview
˃ DECENT
˃ DNNC (Deep Neural Network Compiler)
˃ N2Cube
˃ Simulator
˃ Profiler DSight
[Figure: DPU architecture with control signals and a PE array; labels include Pooling, Upsampling, and 32-bit data paths. Chart labels: VGG16; 85%, 40%, 18%]
DPU configurations by device

Device  LUTs    FFs     BRAM (+URAM)  DSP   DPU config        MACs        Peak perf  Freq    Power
ZU2     47000   94000   5.3Mb         240   1xB1152           576         576GOPS    500MHz  3.5W
ZU3     71000   141000  7.6Mb         360   1xB1152           576         576GOPS    500MHz  N/A
ZU5     117000  234000  5.1Mb+18Mb    1248  1xB4096           2048        1350GOPS   330MHz  N/A
ZU7EV   230000  461000  11Mb+27Mb     1728  1xB4096+2xB1152   2048+2*576  2240GOPS   350MHz  N/A
ZU9     274000  548000  32.1Mb        2520  2xB4096           4096        2700GOPS   330MHz  10W
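
The GOPS figures listed for each device above can be reproduced from the MAC count and the clock frequency, assuming each MAC counts as two operations (one multiply, one add) per cycle. The following is a minimal sketch of that arithmetic; the function name `peak_gops` and the row layout are illustrative and do not come from the slides.

```python
def peak_gops(macs: int, freq_mhz: float) -> float:
    """Peak throughput in GOPS, assuming 2 ops (multiply + add) per MAC per cycle."""
    return 2 * macs * freq_mhz / 1000.0


# (device, total MACs, clock in MHz, GOPS quoted on the slide)
ROWS = [
    ("ZU2", 576, 500, 576),
    ("ZU3", 576, 500, 576),
    ("ZU5", 2048, 330, 1350),
    ("ZU7EV", 2048 + 2 * 576, 350, 2240),
    ("ZU9", 4096, 330, 2700),
]

if __name__ == "__main__":
    for device, macs, mhz, quoted in ROWS:
        computed = peak_gops(macs, mhz)
        print(f"{device}: computed {computed:.2f} GOPS, slide quotes {quoted} GOPS")
```

Under this assumption the ZU2, ZU3, and ZU7EV rows match exactly, while ZU5 and ZU9 come out slightly higher (1351.68 and 2703.36 GOPS), which suggests the slide rounds those two values.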
DPU Utilization
[Bar chart: DPU utilization for VGG-SSD, VGG16, ResNet50, and GoogLeNet; vertical axis 0 to 500; bar labels as extracted: 445, 313, 179, 118, 92, 73, 28.3, 12]
6.8T ZU15
5.5T ZU11
4.1T ZU9
3.5T ZU7
2.9T ZU6
2.8T Z7100
2.4T ZU5
1.2T ZU3
700G Z7030
576G ZU2
230G Z7020
115G Z7014S/Z7015
102G Z7012S
56G Z7010
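
One way to read the device list above is as a sizing aid. The sketch below picks the smallest listed device for a given workload, under several assumptions that are not stated on the slide: that the figures are peak throughput in operations per second (T = 1000 G), that a workload needs GOPs per image times frames per second, and that only a fraction of peak is usable. The `utilization` default of 0.5 is a hypothetical placeholder, not a measured value.

```python
# Peak figures as listed on the slide, converted to GOPS (T = 1000 G is assumed).
PEAK_GOPS = {
    "ZU15": 6800, "ZU11": 5500, "ZU9": 4100, "ZU7": 3500, "ZU6": 2900,
    "Z7100": 2800, "ZU5": 2400, "ZU3": 1200, "Z7030": 700, "ZU2": 576,
    "Z7020": 230, "Z7014S/Z7015": 115, "Z7012S": 102, "Z7010": 56,
}


def smallest_device(gop_per_image, fps, utilization=0.5):
    """Return the lowest-peak device whose derated peak covers the workload, or None.

    `utilization` is a hypothetical derating factor (fraction of peak that is usable).
    """
    required = gop_per_image * fps  # sustained GOPS the workload needs
    candidates = [
        (peak, name)
        for name, peak in PEAK_GOPS.items()
        if peak * utilization >= required
    ]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    print(smallest_device(7.7, 30))   # a 7.7-GOP network at 30 FPS
    print(smallest_device(117, 30))   # a 117-GOP network at 30 FPS
```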
with DNNDK
02 Model Compilation
03 Programming
04 Hybrid Compilation
05 Execution
DECENT – Deephi Deep Compression Tool
[Flowchart: iterative pruning loop with a "prune more?" decision (Y/N), followed by Transform]
[A] SSD+VGG [173G] 57.1 58.7 +1.6 40% 56.6 -0.5 12%
[Chart: Result of DeePhi Pruning on SSD. Operations (G) and mAP (%) over pruning iterations 1 to 12; operations labeled 117G, 19G, and 11.6G. FPS comparison (batch=1): 2x DPU-4096@ZU9 vs. GPU]
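
The "prune more?" loop can be illustrated with a generic iterative magnitude-pruning sketch. This is not DeePhi's pruning algorithm, which the slides do not describe; it only shows the loop structure. The `evaluate` callback is a hypothetical stand-in for measuring accuracy (for example mAP), and a real flow would fine-tune the network between rounds.

```python
def prune_smallest(weights, fraction):
    """Zero out the given fraction of the remaining nonzero weights, smallest magnitude first."""
    nonzero = sorted((abs(w), i) for i, w in enumerate(weights) if w != 0.0)
    k = int(len(nonzero) * fraction)
    pruned = list(weights)
    for _, i in nonzero[:k]:
        pruned[i] = 0.0
    return pruned


def iterative_prune(weights, evaluate, min_score, step=0.2, max_rounds=12):
    """Prune in rounds and stop before the score would fall below min_score.

    `evaluate` is a hypothetical callback returning an accuracy-like score.
    """
    current = list(weights)
    for _ in range(max_rounds):
        candidate = prune_smallest(current, step)
        if candidate == current or evaluate(candidate) < min_score:
            break  # "prune more?" -> N
        current = candidate  # "prune more?" -> Y
    return current


if __name__ == "__main__":
    w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, 0.03, -0.6, 0.1, 0.5]
    # Toy score: number of surviving weights; keep at least 5.
    kept = iterative_prune(w, lambda ws: sum(1 for x in ws if x != 0.0), min_score=5)
    print(kept)
```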
˃ Data needs
Calibration data
‒ Quantize activation
Training data
‒ Further increase accuracy
[Flowchart: need to increase accuracy? Y: finetune; N: deploy]
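
The role of calibration data in quantizing activations can be sketched as follows. This is a generic symmetric, per-tensor scheme with an assumed 8-bit signed range; it is not the DECENT implementation, and the function names are illustrative only.

```python
def calibrate_scale(calibration_batches, num_bits=8):
    """Choose a scale so the largest observed |activation| maps to the top of the int range."""
    max_abs = max(abs(x) for batch in calibration_batches for x in batch)
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round to the nearest integer step and clamp to the signed range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integer codes back to approximate real values."""
    return [q * scale for q in q_values]


if __name__ == "__main__":
    # Activations observed on a small calibration set (made-up numbers).
    scale = calibrate_scale([[0.5, -2.54], [0.3, 1.0]])
    codes = quantize([2.54, -2.54, 0.0, 10.0], scale)
    print(scale, codes, dequantize(codes, scale))
```

Values outside the calibrated range (10.0 in the usage example) saturate at the largest code, which is why the calibration set should be representative of deployment inputs.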
[Scatter plot: Performance (FPS) vs. Computation (GOPs per image); legend: Pruned Network. Labeled points:]

Network      GOPs per image  FPS
Tiny Yolov3  5.6             170
Tiny Yolov2  7               168
ResNet50     3.8             150
SSD          11.6            129
FPN          8.9             120
ResNet50     7.7             118
VGG16        20              100
Yolov2       16              95
VGG16        30              73
Yolov3       17              54
Yolov2       36              42
VPGNet       10              30
Yolov3       65              25
SSD          117             19.7
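
Each labeled point pairs a per-image cost with a frame rate, so their product gives the sustained throughput that point implies. The sketch below does that arithmetic for the points on the chart; it makes no assumption about which device produced them, and the name `sustained_gops` is illustrative.

```python
# (network, GOPs per image, FPS) as labeled on the chart
POINTS = [
    ("Tiny Yolov3", 5.6, 170), ("Tiny Yolov2", 7, 168), ("ResNet50", 3.8, 150),
    ("SSD", 11.6, 129), ("FPN", 8.9, 120), ("ResNet50", 7.7, 118),
    ("VGG16", 20, 100), ("Yolov2", 16, 95), ("VGG16", 30, 73),
    ("Yolov3", 17, 54), ("Yolov2", 36, 42), ("VPGNet", 10, 30),
    ("Yolov3", 65, 25), ("SSD", 117, 19.7),
]


def sustained_gops(gop_per_image, fps):
    """Sustained throughput implied by one chart point."""
    return gop_per_image * fps


if __name__ == "__main__":
    ranked = sorted(POINTS, key=lambda p: sustained_gops(p[1], p[2]), reverse=True)
    for name, gop, fps in ranked:
        print(f"{name:12s} {gop:6.1f} GOPs x {fps:6.1f} FPS = {sustained_gops(gop, fps):7.1f} GOPS")
```

For example, the 117-GOP SSD point works out to about 2305 GOPS at 19.7 FPS, while the 11.6-GOP SSD point works out to about 1496 GOPS at 129 FPS: the smaller network runs at a much higher frame rate even though it sustains fewer operations per second.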
˃ DP8000
Z7020 SOM
˃ DP2400
ZU9 PCIe card
Detection & Tracking
Person Attributes
  Gender: Male
  Upper color: Black
  Lower color: Black
  Hat: No
  Backpack: No
  Handbag: No
  Other bag: No
Plate Detection
License Recognition
  Color: Blue
  Number: 渝C LC689
Lane Detection
˃ Network
SSD compact version
˃ Performance
30fps per channel
˃ Network
FPN compact version
SSD compact version
˃ Performance
15fps per channel
Two Development Flows for Using the Deephi DPU IP
Suitable for FPGA designers
Suitable for algorithm & software developers
HW Integration with Vivado IPI (Cont.)
˃ Steps(Cont.)
Generate bitstream
˃ Note
The port data width is consistent with the DPU data width
For frequencies > 333MHz, a clock wizard is needed between the MPSoC and the DPU
The interrupt configuration is shown in binary:
[3]: 0 - pl_ps_irq0; 1 - pl_ps_irq1
[2:0]: interrupt number 0~7
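
The 4-bit interrupt field described in the note can be packed and unpacked as shown in this minimal sketch. The helper names `encode_irq` and `decode_irq` are illustrative and are not part of any Xilinx tool.

```python
def encode_irq(bus: int, number: int) -> int:
    """Pack the field: bit 3 selects the bus (0: pl_ps_irq0, 1: pl_ps_irq1), bits 2:0 hold the number."""
    if bus not in (0, 1) or not 0 <= number <= 7:
        raise ValueError("bus must be 0 or 1 and number must be in 0..7")
    return (bus << 3) | number


def decode_irq(value: int):
    """Return (bus name, interrupt number) for a 4-bit configuration value."""
    if not 0 <= value <= 0xF:
        raise ValueError("value must fit in 4 bits")
    bus_name = "pl_ps_irq1" if value & 0b1000 else "pl_ps_irq0"
    return bus_name, value & 0b111


if __name__ == "__main__":
    value = encode_irq(bus=1, number=5)
    print(format(value, "04b"), decode_irq(value))
```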
SW Integration with SDK
˃ Device tree configuration
set interrupt number according to block design
set core-num
˃ OpenCV configuration
Enable in Filesystem Packages -> misc or libs
HW Integration with C-callable IP
Create a Library -> Library -> Use the library

˃ Steps
  Header file dpu.hpp
  Configure DPU parameters
  Build application software

Create a Library

  Header file (dpu.hpp):
    void dpu_set_start(uint32_t start);

  Makefile:
    sdx_pack -header dpu.hpp -lib libdpu.a \
      -func dpu_set_start -map start=s_axi:in:0x10 -func-end \
      -ip ../iprepo/dpu/component.xml -control none \
      -add-ip-repo ../iprepo/src/ \
      -target-family zynquplus \
      -target-cpu cortex-a53 -target-os linux -verbose

  The packaged IP must use supported AXI and control interfaces.

Library

  libdpu.a, with dpu.hpp in ../include
  [Diagram: Platform with the dpu IP in the PL and I/O blocks]

Use the library

    #include "dpu.hpp"
    int main() {
        …
        uint32_t start = 0x1;
        dpu_set_start(start);  // write the start value through the packaged IP's s_axi register
        ...
    }

  Link flags:
    LFLAGS = -ldpu
    #LFLAGS = -ldpusw...
Deephi DPU IP Integration with SDSoC
C-callable IP
How to Use DNNDK in SDSoC
Only 3 steps!
˃ SDSoC compiler compares the new data-motion network with the last one
˃ If they are the same, vpl will not be called to rerun synthesis & implementation
˃ The build takes only a few minutes if you:
Use the same C-callable IP library
Use the same platform
Use the same project settings
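
The idea behind skipping the hardware rebuild can be illustrated with a small fingerprint comparison. This is only a conceptual sketch: it is not how the SDSoC compiler is implemented, and the choice of inputs (library, platform, settings) simply mirrors the three conditions listed above. All names here are hypothetical.

```python
import hashlib
import json


def fingerprint(library: str, platform: str, settings: dict) -> str:
    """Stable hash of the inputs assumed to determine the data-motion network."""
    blob = json.dumps(
        {"library": library, "platform": platform, "settings": settings},
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def needs_hw_rebuild(previous, library, platform, settings) -> bool:
    """True when the fingerprint changed, so synthesis and implementation must rerun."""
    return previous != fingerprint(library, platform, settings)


if __name__ == "__main__":
    last = fingerprint("libdpu.a", "zcu102", {"clock_mhz": 300})
    print(needs_hw_rebuild(last, "libdpu.a", "zcu102", {"clock_mhz": 300}))  # unchanged inputs
    print(needs_hw_rebuild(last, "libdpu.a", "zcu104", {"clock_mhz": 300}))  # platform changed
```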
[Block diagram: video inputs (File, USB3, HDMI, and MIPI via ISP/VPSS*), DDR buffers, 1x Deephi DPU running Face detect, Traffic SSD, Ped SSD, and Joint detect, and HDMI output]
˃ Timeframe
Early Access: Now
Public Access: Jan 2019
˃ To be available on AWS in Cloud Editions
˃ Add-on design service

DeePhi Basic
  Free
  Everything you need to do it yourself
  Compiler
  Quantizer
  Pruned Models
  Unlimited Deployment

DeePhi Professional
  Access Pruning Technology & 3-day on-site training by a top-notch ML expert
  30-day evaluation with encrypted pruning output
  Pruning Tools
  Compiler
  Quantizer
  Pruned Models
  Unlimited Deployment
  3-day On-site Training