High-Throughput Convolutional Neural Network
on an FPGA by Customized JPEG Compression
Hiroki Nakahara
Tokyo Institute of Technology, JP
Zhiqiang Que, Wayne Luk
Imperial College London, UK
Outline
• Background
• JPEG compression for a high-speed inference
• CNN model for an FPGA implementation
• Channel shift and point-wise decomposition
• Quantization strategy
• Channel shuffle
• Fully-pipelined CNN architecture
• Experimental results
• Conclusion
Convolutional Neural Networks (CNNs)
• High accuracy and many applications
• Image recognition, NLP, data mining [1]
• FPGAs on cloud services
• Amazon AWS, Microsoft Azure, etc.
[1] Y. Liang, K. Ouyang, L. Jing, S. Ruan, Y. Liu, J. Zhang, D. S. Rosenblum and Y. Zheng,
“UrbanFM: Inferring Fine-Grained Urban Flows,” ACM SIGKDD Conf. on knowledge discovery
and data mining (KDD), 2019, pp.3132–3142.
Problems
• Power consumption
• Performance bottleneck (Data-transfer)
• e.g., AWS F1 provides an overall read/write bandwidth of 6.5 GB/s from host CPU to FPGA [1]
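To make the data-transfer bottleneck concrete, the following is a rough sketch of the frame rate that a link alone would permit. It assumes 8-bit RGB images of 224×224 pixels (the size used for the ImageNet experiments later in the deck), 1 GB = 10^9 bytes, and no protocol overhead; the function names are hypothetical and the result is an ideal upper bound, not a measured figure.

```python
# Back-of-the-envelope estimate of the transfer-bound frame rate.
# Assumptions (not from the slides): 8-bit RGB, 1 GB = 1e9 bytes,
# and the link is the only bottleneck.

def raw_rgb_bytes(width: int, height: int, channels: int = 3) -> int:
    """Size of an uncompressed 8-bit image in bytes."""
    return width * height * channels

def transfer_bound_fps(bandwidth_bytes_per_s: float, bytes_per_image: float) -> float:
    """Upper bound on frames per second if only the link limits throughput."""
    return bandwidth_bytes_per_s / bytes_per_image

image_bytes = raw_rgb_bytes(224, 224)               # bytes per raw frame
peak_fps = transfer_bound_fps(6.5e9, image_bytes)   # ideal bound at 6.5 GB/s
```

Under this model, shrinking each image by some factor raises the transfer bound by the same factor, which is the motivation for sending compressed data instead of raw RGB.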
[Figure: Host PC (.jpg, RAW (RGB) Img.), PCIe, Accelerator Card (Interconnect, CNN Kernel)]
[1] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, “Cloud-DNN: An Open Framework for
Mapping DNN Models to Cloud FPGAs,” FPGA, 2019, pp.73–82.
Our Contributions
• Customized JPEG for a high-speed data transfer
→ Compression ratio (Speed-up) vs. accuracy
• Fully pipelined inference architecture with a lightweight CNN
[Figure: (a) Conventional: Host PC (.jpg, RAW (RGB) Img.), PCIe, FPGA (Interconnect, CNN Kernel). (b) Proposed: Host PC (.jpg, Low-quality Img.), PCIe, FPGA (Interconnect, Decoder, CNN Kernel).]
Dog? Cat?
Image Source: https://www.kaggle.com/c/dogs-vs-cats/data
Labradoodle? Fried Chicken?
Source: https://bit.ly/2zveHGT
Customized JPEG for a High-speed Inference
JPEG Coding
[Figure: JPEG codec. Encoding: RGB Image, Pre-processing, Picture Matrix, DCT, DCT Matrix, Quant. (Quant. Table), Huffman Encoding (Huffman Coding Table), Compressed Image Data with JPEG Header. Decoding: Huffman Decoding, Reverse Quant., IDCT, Post-processing.]
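The encode/decode chain in the JPEG coding figure can be sketched on a single 8×8 block as follows. This is a minimal illustration, not the authors' codec: colour conversion, subsampling, and Huffman coding are omitted, the flat quantization table with step `STEP = 16` is a hypothetical choice, and the sample block is synthetic.

```python
import math

N = 8  # JPEG operates on 8x8 blocks

def _scale(u):
    return math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N)

def _basis(x, u):
    return math.cos((2 * x + 1) * u * math.pi / (2 * N))

def dct2(block):
    """Direct 2D DCT of one 8x8 block (slow reference form)."""
    return [[_scale(u) * _scale(v) * sum(block[x][y] * _basis(x, u) * _basis(y, v)
                                          for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]

def idct2(coef):
    """Direct 2D IDCT, the inverse of dct2."""
    return [[sum(_scale(u) * _scale(v) * coef[u][v] * _basis(x, u) * _basis(y, v)
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]

def quantize(coef, table):
    """Divide each coefficient by its table entry and round (the lossy step)."""
    return [[round(coef[u][v] / table[u][v]) for v in range(N)] for u in range(N)]

def dequantize(levels, table):
    """Reverse quantization: multiply each level by its table entry."""
    return [[levels[u][v] * table[u][v] for v in range(N)] for u in range(N)]

STEP = 16  # hypothetical flat quantization step
table = [[STEP] * N for _ in range(N)]
pixels = [[(16 * x + 8 * y) % 256 for y in range(N)] for x in range(N)]

shifted = [[p - 128 for p in row] for row in pixels]      # pre-processing (level shift)
levels = quantize(dct2(shifted), table)                   # encoder side
decoded = idct2(dequantize(levels, table))                # decoder side
restored = [[v + 128 for v in row] for row in decoded]    # post-processing
max_error = max(abs(restored[x][y] - pixels[x][y])
                for x in range(N) for y in range(N))
```

A coarser table gives smaller integer levels (and therefore a shorter entropy-coded stream) at the cost of a larger `max_error`; that is the trade-off the customized JPEG scheme exploits.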
Proposed JPEG Coding with a CNN Accelerator
[Figure: Host PC side: JPEG Image (.jpg), Huffman Decoding, Reverse Quant., Quant. with Extreme Quant. Value q (Quant. Table), Huffman Encoding (Huffman Coding Table), Image Stream Data. The stream crosses PCIe. FPGA side: RAM (Ping-pong Buffer), Huffman Decoding & Reverse Quant. (Huffman Coding Table), RAM, IDCT, RAM, Fully Pipelining CNN, Detection Result.]
Huffman Decoding and Reverse Quantization Unit
[Figure: Image Data Stream, Shift Register, Priority Encoder (code patterns 00**, 01**, 10**, 110*, 1110 at indices 0 to 4 with shift values 2, 2, 2, 3, 4), Run-length Decoder, Quantized Value combined with Quant. Value q, Buffer RAM (zig-zag writing; ADR from the Zig-zag pattern ROM, WDATA).]
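The back half of this unit (run-length decoding, reverse quantization, and zig-zag writing into the buffer RAM) might look like the sketch below. It assumes the Huffman stage has already produced `(zero_run, quantized_value)` symbols and that reverse quantization is a multiplication by a single scalar `q_step`; both the symbol format and the function names are assumptions made for illustration.

```python
N = 8

def zigzag_order(n: int = N):
    """Return the (row, col) visiting order of the JPEG zig-zag scan."""
    order = []
    for s in range(2 * n - 1):               # s = row + col indexes an anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        # Even diagonals are walked bottom-left to top-right, odd ones the other way.
        for r in (reversed(rows) if s % 2 == 0 else rows):
            order.append((r, s - r))
    return order

def decode_block(pairs, q_step):
    """Expand run-length symbols into an 8x8 coefficient block.

    pairs: (zero_run, quantized_value) symbols for one block in scan order.
    """
    block = [[0] * N for _ in range(N)]
    rom = zigzag_order()                     # plays the role of the zig-zag pattern ROM
    addr = 0
    for zero_run, value in pairs:
        addr += zero_run                     # skip the zeros of this run
        r, c = rom[addr]                     # ADR looked up from the ROM
        block[r][c] = value * q_step         # WDATA after reverse quantization
        addr += 1
    return block

example = decode_block([(0, 5), (1, -2), (3, 1)], q_step=4)
```

Positions never addressed stay zero, so the symbol list can stop early, as an end-of-block marker would.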
2D-IDCT
• Decompose the 8×8 2D-IDCT into 16 1D-IDCTs (AP-922)
Application Note 922, “A Fast Precise Implementation of 8x8 Discrete Cosine Transform Using the Streaming SIMD Extensions and MMX Instructions,”
https://www.cs.cmu.edu/~barbic/cs-740/ap922.pdf
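The row-column decomposition can be checked numerically: eight 1D passes over the rows followed by eight over the columns give the same result as the direct 2D formula. The sketch below uses the orthonormal IDCT in floating point; it does not reproduce the fixed-point arithmetic of AP-922 or of the FPGA unit, and the test coefficients are arbitrary.

```python
import math

N = 8

def _scale(u):
    return math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N)

def idct1(vec):
    """Orthonormal 1D IDCT of an 8-element vector."""
    return [sum(_scale(u) * vec[u] * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                for u in range(N))
            for x in range(N)]

def idct2_row_column(coef):
    """2D IDCT as 8 row passes followed by 8 column passes (16 1D IDCTs)."""
    rows = [idct1(row) for row in coef]                                # 8 row passes
    cols = [idct1([rows[r][c] for r in range(N)]) for c in range(N)]   # 8 column passes
    return [[cols[c][r] for c in range(N)] for r in range(N)]          # transpose back

def idct2_direct(coef):
    """Reference: the 2D IDCT evaluated straight from its definition."""
    return [[sum(_scale(u) * _scale(v) * coef[u][v]
                 * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                 * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]

coef = [[((3 * u + 5 * v) % 7) - 3 for v in range(N)] for u in range(N)]
a = idct2_row_column(coef)
b = idct2_direct(coef)
max_diff = max(abs(a[x][y] - b[x][y]) for x in range(N) for y in range(N))
```

The direct form needs 64 multiply-accumulates per output sample, whereas the separable form needs 16, which is why hardware implementations usually reuse a small number of 1D units.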
2D-IDCT Unit
[Figure: a Controller and 1D-IDCT Units (Operation Units, Reg.) connected to RAMs]
• Two 1D-IDCT units
• Use half precision (16 bits)
CNN Model for an FPGA Implementation
Overview
1. Decompose each k×k convolution into a channel shift [1] and a point-wise (1×1) convolution
2. Binary (1-bit) weight quantization [2]
3. Channel split and shuffle [3]
[1] B. Wu, A. Wan, X. Yue, P. H. Jin, S. Zhao, N. Golmant, A. Gholaminejad, J. Gonzalez,
and K. Keutzer, “Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions,”
CVPR, 2018, pp. 9127–9135.
[2] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training Deep Neural
Networks with Binary Weights During Propagations,” NIPS, 2015, pp. 3105–3113.
[3] X. Zhang, X. Zhou, M. Lin and J. Sun, “ShuffleNet: An Extremely Efficient Convolutional
Neural Network for Mobile Devices,” CVPR, 2018.
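Item 1 can be illustrated with a small sketch: each channel is moved spatially by a fixed offset (no arithmetic, no parameters), and a 1×1 convolution then mixes the channels. Feature maps are plain nested lists indexed `[channel][row][col]`; the round-robin assignment of shift directions to channels is an assumed policy for illustration, not necessarily the one used in the talk.

```python
def shift_channel(fmap, dy, dx):
    """Shift one HxW channel by (dy, dx), filling vacated cells with zeros."""
    h, w = len(fmap), len(fmap[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = fmap[sy][sx]
    return out

def channel_shift(x, k=3):
    """Give channel i the (i mod k*k)-th offset of a k x k neighbourhood (assumed policy)."""
    r = k // 2
    dirs = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    return [shift_channel(ch, *dirs[i % len(dirs)]) for i, ch in enumerate(x)]

def pointwise_conv(x, weights):
    """1x1 convolution: weights[o][i] mixes input channel i into output channel o."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wo[i] * x[i][yy][xx] for i in range(len(x))) for xx in range(w)]
             for yy in range(h)]
            for wo in weights]

channel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
shifted = channel_shift([channel, channel])        # offsets (-1, -1) and (-1, 0)
mixed = pointwise_conv([[[1, 2]], [[3, 4]]], [[1, -1]])
```

Because neighbouring pixels end up aligned across channels after the shift, the 1×1 convolution can gather spatial context without a k×k kernel.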
Building Blocks
[Figure: (a) Plain block: Channel Split, Shift, PWConv, Shift, PWConv, Concat & Shuffle. (b) Down-sampling block: Channel Split, two branches with Shift, PWConv (s=2), Shift, PWConv, then Concat & Shuffle. Annotations: #channel/2, #channel ×2.]
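The split, concatenate, and shuffle steps of the plain block can be sketched on a list of channel labels. The sketch assumes that one half of the channels bypasses the Shift/PWConv branch unchanged (as in ShuffleNet-style units) and that the shuffle uses two groups; `branch` is a hypothetical stand-in for the Shift and PWConv layers.

```python
def channel_split(channels):
    """Split the channel list into two halves."""
    half = len(channels) // 2
    return channels[:half], channels[half:]

def channel_shuffle(channels, groups=2):
    """Interleave the groups: view as (groups, per_group), transpose, flatten."""
    per_group = len(channels) // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]

def plain_block(channels, branch):
    """Split, transform one half with `branch`, concatenate, then shuffle."""
    identity, active = channel_split(channels)
    return channel_shuffle(identity + branch(active))

# Upper-casing stands in for the Shift/PWConv branch so the data movement is visible.
result = plain_block(["a", "b", "c", "d"], lambda xs: [x.upper() for x in xs])
```

The shuffle ensures that, in the next block, each half contains channels from both the bypass path and the processed path, so information mixes across the split.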
Our CNN Model
Layer                 Output size  Kernel size  Stride  #Output channels
Image                 224          -            -       3
PWConv                224          1            2       24
Norm                  224          1            1       24
Shift                 224          1            1       24
Pool                  112          3            1       24
PWConv                112          2            2       24
Norm                  112          1            1       24
ReLU                  112          1            1       24
Shift                 112          3            1       24
Pool                  56           2            2       24
Stage 2 (4 repeats)   28           -            -       116
Stage 3 (8 repeats)   14           -            -       232
Stage 4 (16 repeats)  7            -            -       464
GAP                   1            7            1       464
PWConv                1            1            1       1000
• Training-aware quantization
• Weights (w): binary; activations (a): 8-bit
• 2.54 M params, 0.616 GMACs
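A minimal sketch of what binary weights and 8-bit activations mean for the arithmetic follows. The sign-based binarization mirrors the deterministic rule of BinaryConnect; the unsigned uniform activation quantizer and its `scale` parameter are assumptions, since the slides do not specify the scheme, and the storage figure is a rough estimate that treats every parameter as a 1-bit weight.

```python
def binarize(weights):
    """Deterministic sign binarization: every weight becomes +1 or -1."""
    return [1 if w >= 0 else -1 for w in weights]

def quantize_activation(a, scale, bits=8):
    """Uniform unsigned quantization to `bits` bits (an assumed scheme)."""
    q = round(a / scale)
    return max(0, min((1 << bits) - 1, q))

def binary_dot(acts_q, weights_b):
    """With +/-1 weights, a dot product needs only additions and subtractions."""
    acc = 0
    for a, w in zip(acts_q, weights_b):
        acc = acc + a if w > 0 else acc - a
    return acc

weights_b = binarize([0.3, -0.2, 0.0])
acts_q = [quantize_activation(a, scale=0.5) for a in (5.0, 2.5, 1.0)]
result = binary_dot(acts_q, weights_b)

# Rough storage estimate if all 2.54 M parameters were 1-bit weights.
weight_bytes = 2_540_000 // 8
```

Replacing multipliers by adders and subtractors is what allows a large number of such units to be laid out in parallel on the FPGA.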
Fully-pipelined CNN Architecture
Dataflow for a Residual Stage of a Plain Block
• Double buffers for branch-flow
• Xilinx #pragma HLS dataflow
[Figure: Layer Units connected through F.map Buffers, followed by Shuffle]
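The idea behind double buffering can be modelled in software as two banks that swap roles every time slot: the upstream layer writes one bank while the downstream layer reads the other. The sketch below is only a sequential model with hypothetical class and function names; in hardware (for example under `#pragma HLS dataflow`) the write and the read of a slot happen concurrently.

```python
class PingPongBuffer:
    """Two banks: the producer fills one while the consumer drains the other."""

    def __init__(self):
        self.banks = [None, None]
        self.write_sel = 0

    def write(self, fmap):
        self.banks[self.write_sel] = fmap

    def read(self):
        return self.banks[self.write_sel ^ 1]

    def swap(self):
        self.write_sel ^= 1

def run_pipeline(frames, stage_a, stage_b):
    """Feed frames through stage_a -> buffer -> stage_b, one time slot per frame."""
    buf = PingPongBuffer()
    outputs = []
    for t in range(len(frames) + 1):
        if t >= 1:
            outputs.append(stage_b(buf.read()))   # consume the frame produced in slot t-1
        if t < len(frames):
            buf.write(stage_a(frames[t]))         # produce frame t into the other bank
        buf.swap()
    return outputs

outputs = run_pipeline([1, 2, 3], lambda x: x + 1, lambda x: x * 2)
```

After a one-slot fill latency, both stages are busy in every slot, so throughput is set by the slower stage rather than by the sum of the two.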
2D Convolutional Unit
[Figure: Convolution Unit: W.mem and input feature maps feed an Adder Tree, followed by BN and Act; dimension labels c, n, p, and n×p]
Pooling Units
[Figure: Max. Pooling Unit: F. Map Mem. (n=5, k=2) holding x00 to x44, Write Ctrl. Logic, a Shift Register holding x11 x10 x04 x03 x02 x01 x00, and a Max Selector. Global Ave. Pooling Unit: adder (+) and Register with Reset, scaling by 1/n², Write Ctrl. Logic, F. Map Mem., and Controller.]
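Both pooling units can be modelled as streaming operators over pixels that arrive in raster order. In the sketch below the shift register has (k-1)·n + k entries, which gives 7 for n=5 and k=2 and matches the seven values x00 to x11 shown in the figure; the stride is assumed equal to k, and the averaging constant is assumed to be 1/n².

```python
from collections import deque

def streaming_max_pool(pixels, n, k):
    """Max-pool an n x n map (window k, stride k) arriving in raster order."""
    depth = (k - 1) * n + k                  # 7 registers for n=5, k=2
    sr = deque(maxlen=depth)                 # newest pixel sits at sr[-1]
    out = []
    for idx, px in enumerate(pixels):
        sr.append(px)
        y, x = divmod(idx, n)
        # A window is complete when its bottom-right pixel arrives.
        if y % k == k - 1 and x % k == k - 1:
            window = [sr[-1 - (dy * n + dx)] for dy in range(k) for dx in range(k)]
            out.append(max(window))          # the max selector taps k*k registers
    return out

def global_average_pool(pixels, n):
    """Accumulate all n*n pixels, then scale once by the constant 1/n^2."""
    acc = 0                                  # register, reset for each channel
    for px in pixels:
        acc += px
    return acc * (1.0 / (n * n))

pooled = streaming_max_pool(list(range(25)), n=5, k=2)
average = global_average_pool(list(range(25)), n=5)
```

Only one line of the feature map plus k pixels is buffered, so the unit can emit results while the previous layer is still producing the map.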
Experimental Results
Compression Ratio vs. Accuracy
[Figure: data-transfer speed-up ratio (baseline: RGB image transfer) and ImageNet top-1 accuracy [%] against the JPEG quantization bit q]
JPEG quantization bit  q=1    q=2    q=3   q=4   q=5   Standard
Speed-up ratio         162.2  124.6  82.1  53.5  34.9  11.5
Top-1 accuracy [%]     59.61  66.64  70.8  71.1  71.2  71.2
• ImageNet2012 (224x224 pixel image) classification task
• PyTorch 1.4.0 + modified libjpeg library
At q=3, accuracy decreases by only 0.3 points while data transfer is 82.1 times faster
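One way to read the chart is as a selection problem: pick the setting with the largest speed-up whose accuracy stays within a tolerated drop. The sketch below uses the values printed on the chart; taking standard JPEG as the accuracy reference and the 0.5-point budget are assumptions chosen for illustration, and `best_setting` is a hypothetical helper.

```python
# (speed-up over raw RGB transfer, ImageNet top-1 accuracy in %) per setting,
# as printed on the chart.
RESULTS = {
    "q=1": (162.2, 59.61),
    "q=2": (124.6, 66.64),
    "q=3": (82.1, 70.8),
    "q=4": (53.5, 71.1),
    "q=5": (34.9, 71.2),
    "Standard": (11.5, 71.2),
}

def best_setting(results, max_drop, reference="Standard"):
    """Fastest setting whose accuracy is within `max_drop` points of the reference."""
    ref_acc = results[reference][1]
    ok = {name: v for name, v in results.items() if ref_acc - v[1] <= max_drop + 1e-9}
    return max(ok, key=lambda name: ok[name][0])

choice = best_setting(RESULTS, max_drop=0.5)
lossless_choice = best_setting(RESULTS, max_drop=0.0)
```

With a half-point budget the selection lands on q=3; tightening the budget to zero moves it to q=5, which still transfers faster than standard JPEG.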
Implementation Results
Module #LUTs #FFs #DSPs 18Kb BRAMs #URAMs
JPEG Decoder 11,675 6,646 34 2 0
Huffman Decoder 6,794 2,378 0 0 0
2D-IDCT 4,881 4,278 34 2 0
Pipelined-CNN 263,120 266,784 2,336 2,744 0
Total 274,795 273,440 2,370 2,746 16
(Ratio) (23.2%) (11.5%) (34.6%) (63.5%) (1.6%)
• Xilinx Inc. Virtex UltraScale+ FPGA VCU1525 acceleration development kit
• Xilinx Inc. SDAccel 2018.2
• Operates at 300 MHz and 75 W
• System performance: 3321.25 FPS
• JPEG trans-decode: 81,120 FPS (cf. conventional RGB transfer: 1242.8 FPS)
• The JPEG decoder accounts for only 4.2% of the system's total LUTs
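These figures can be related with a simple pipeline model in which stages run concurrently and the slowest stage sets the end-to-end rate. The model, and the assumption that the CNN would sustain 3321.25 FPS whenever its input arrives fast enough, are simplifications for illustration; the numbers themselves are taken from the slide.

```python
def end_to_end_fps(*stage_fps):
    """A chain of concurrently running stages is paced by its slowest stage."""
    return min(stage_fps)

CNN_FPS = 3321.25          # system performance reported on the slide
JPEG_PATH_FPS = 81_120     # JPEG transfer and decode
RGB_PATH_FPS = 1242.8      # conventional raw RGB transfer

proposed = end_to_end_fps(JPEG_PATH_FPS, CNN_FPS)       # limited by the CNN
conventional = end_to_end_fps(RGB_PATH_FPS, CNN_FPS)    # limited by the transfer

# Share of the system's LUTs used by the JPEG decoder (11,675 of 274,795).
decoder_lut_share = 11_675 / 274_795
```

Under this model the raw RGB path would cap the system below the CNN's rate, whereas the JPEG path leaves the CNN as the limiting stage.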
Comparison with Other FPGA Implementations
Method             AlexNet [1]  FINN-R [2]  Synetgy [3]  MobNetV2 [4]  Cloud-DNN [5]      Ours
FPGA               Stratix V    Zynq ZU3EG  Zynq ZU3EG   Zynq ZU9EG    Virtex US+ XCVU9P  Virtex US+ XCVU9P
FPS                864.7        200.0       96.5         809.8         123.1              3321.2
Top-1 Acc.         42.90%       50.30%      68.30%       68.1%         ---                70.8%
Top-5 Acc.         66.80%       ---         88.12%       ---           ---                90.1%
Precision (W/Act)  16/16        1/2         4/4          8/8           16/16              1/8
Freq. (MHz)        150          220         250          333           214                300
Power (W)          26.2         10.2        5.5          ---           49.25              75.0
[1] S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, “FP-BNN: Binarized neural network on FPGA,” Neurocomputing, 275:1072–1086, 2018.
[2] M. Blott, T. Preusser, N. Fraser, G. Gambardella, K. O’Brien, and Y. Umuroglu, “FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks,” 2018.
[3] Y. Yang, Q. Huang, B. Wu, T. Zhang, L. Ma, G. Gambardella, M. Blott, L. Lavagno, K. A. Vissers, J. Wawrzynek and K. Keutzer, “Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs,” FPGA, 2019, pp. 23–32.
[4] D. Wu, Y. Zhang, X. Jia, L. Tian, T. L, L. Sui, D. Xie, and Y. Shan, “A High-performance CNN Processor Based on FPGA for MobileNets,” 29th International Conference on Field Programmable Logic and Applications (FPL), 2019, pp. 136–143.
[5] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, “Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs,” FPGA, 2019, pp. 73–82.
Comparison with CPU and GPU
Platform CPU GPU FPGA
Device Xeon E5-2690 Tesla V100 Virtex US+ XCVU9P
Clock Freq. 2.6 GHz 1.53 GHz 0.3 GHz
Memory 32GB DDR4 16GB HBM2 9.49 MB BRAM
Throughput (FPS) 24.0 350.0 3321.25
Power (W) 95 295 75
Efficiency (FPS/W) 0.25 1.18 44.28
• Ubuntu 18.04 LTS with PyTorch 1.4.0
• 128 Batch with INT8 quantization (for CPU and GPU)
Note: CPU and GPU did not use our JPEG compression scheme
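The efficiency row and the ratios quoted in the conclusion can be re-derived from the throughput and power rows. The sketch below truncates FPS/W to two decimals, which is an inference from the printed values (350/295 ≈ 1.186 appears as 1.18) rather than something the slides state.

```python
import math

# (throughput in FPS, power in W) from the comparison table
PLATFORMS = {"CPU": (24.0, 95), "GPU": (350.0, 295), "FPGA": (3321.25, 75)}

def efficiency(fps, watts):
    """FPS per watt, truncated to two decimals to match the printed table."""
    return math.floor(fps / watts * 100) / 100

eff = {name: efficiency(fps, watts) for name, (fps, watts) in PLATFORMS.items()}

fpga_fps = PLATFORMS["FPGA"][0]
speedup_vs_cpu = fpga_fps / PLATFORMS["CPU"][0]     # about 138.4
speedup_vs_gpu = fpga_fps / PLATFORMS["GPU"][0]     # about 9.5
eff_gain_vs_cpu = eff["FPGA"] / eff["CPU"]          # about 177.1
eff_gain_vs_gpu = eff["FPGA"] / eff["GPU"]          # about 37.5
```

The energy-efficiency ratios reproduce the quoted 177.1x and 37.5x only when computed from the two-decimal table entries; using unrounded FPS/W values gives slightly different ratios.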
Conclusion
• Customized JPEG compression for a high-speed inference
• 82.1x speed-up, 0.3-point accuracy drop
• CNN model for a fully-pipelined implementation
• Channel shift and point-wise decomposition
• Binary weight quantization
• Channel split-shuffle operation
• Fully-pipelined CNN architecture
• Achieved 3,321 FPS@75W
• Speed-up: 138.4x CPU, 9.5x GPU
• Energy efficiency: 177.1x CPU, 37.5x GPU
• Future work
• Custom compression and other DL applications
Thank you
Hiroki Nakahara (Tokyo Tech, JP)
nakahara@ict.e.titech.ac.jp
31
Ad

More Related Content

What's hot (20)

Book Preview: A Practical Introduction to the Xilinx Zynq-7000 Adaptive SoC
Book Preview: A Practical Introduction to the Xilinx Zynq-7000 Adaptive SoCBook Preview: A Practical Introduction to the Xilinx Zynq-7000 Adaptive SoC
Book Preview: A Practical Introduction to the Xilinx Zynq-7000 Adaptive SoC
Derek Murray
 
Build your career in physical ASIC design
Build your career in physical ASIC designBuild your career in physical ASIC design
Build your career in physical ASIC design
Mohammed Essam Abd El Samee
 
Introduction to TensorFlow Lite
Introduction to TensorFlow Lite Introduction to TensorFlow Lite
Introduction to TensorFlow Lite
Koan-Sin Tan
 
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
Saikiran Panjala
 
LDM_ImageSythesis.pptx
LDM_ImageSythesis.pptxLDM_ImageSythesis.pptx
LDM_ImageSythesis.pptx
AkankshaRawat53
 
Introduction to RISC-V
Introduction to RISC-VIntroduction to RISC-V
Introduction to RISC-V
inside-BigData.com
 
Verilog lab mauual
Verilog lab mauualVerilog lab mauual
Verilog lab mauual
BHUSHAN MHASKE
 
ASIC VS FPGA.ppt
ASIC VS FPGA.pptASIC VS FPGA.ppt
ASIC VS FPGA.ppt
gopakumar885691
 
"Attention Is All You Need" presented by Maroua Maachou (Veepee)
"Attention Is All You Need" presented by Maroua Maachou (Veepee)"Attention Is All You Need" presented by Maroua Maachou (Veepee)
"Attention Is All You Need" presented by Maroua Maachou (Veepee)
Paris Women in Machine Learning and Data Science
 
Design of High Performance 8,16,32-bit Vedic Multipliers using SCL PDK 180nm ...
Design of High Performance 8,16,32-bit Vedic Multipliers using SCL PDK 180nm ...Design of High Performance 8,16,32-bit Vedic Multipliers using SCL PDK 180nm ...
Design of High Performance 8,16,32-bit Vedic Multipliers using SCL PDK 180nm ...
Angel Yogi
 
VLSI Implementation of Vedic Multiplier Using Urdhva– Tiryakbhyam Sutra in VH...
VLSI Implementation of Vedic Multiplier Using Urdhva– Tiryakbhyam Sutra in VH...VLSI Implementation of Vedic Multiplier Using Urdhva– Tiryakbhyam Sutra in VH...
VLSI Implementation of Vedic Multiplier Using Urdhva– Tiryakbhyam Sutra in VH...
iosrjce
 
DesignCon 2019 112-Gbps Electrical Interfaces: An OIF Update on CEI-112G
DesignCon 2019 112-Gbps Electrical Interfaces: An OIF Update on CEI-112GDesignCon 2019 112-Gbps Electrical Interfaces: An OIF Update on CEI-112G
DesignCon 2019 112-Gbps Electrical Interfaces: An OIF Update on CEI-112G
Leah Wilkinson
 
RAPIDS Overview
RAPIDS OverviewRAPIDS Overview
RAPIDS Overview
NVIDIA Japan
 
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Unde...
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Unde...Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Unde...
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Unde...
Vitaly Bondar
 
Clock Distribution
Clock DistributionClock Distribution
Clock Distribution
Abhishek Tiwari
 
AMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop ProductsAMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD
 
Cadence P-cell tutorial
Cadence P-cell tutorial Cadence P-cell tutorial
Cadence P-cell tutorial
Michael Lee
 
Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9
inside-BigData.com
 
RISC-V NOEL-V - A new high performance RISC-V Processor Family
RISC-V NOEL-V - A new high performance RISC-V Processor FamilyRISC-V NOEL-V - A new high performance RISC-V Processor Family
RISC-V NOEL-V - A new high performance RISC-V Processor Family
RISC-V International
 
Object Detetcion using SSD-MobileNet
Object Detetcion using SSD-MobileNetObject Detetcion using SSD-MobileNet
Object Detetcion using SSD-MobileNet
IRJET Journal
 
Book Preview: A Practical Introduction to the Xilinx Zynq-7000 Adaptive SoC
Book Preview: A Practical Introduction to the Xilinx Zynq-7000 Adaptive SoCBook Preview: A Practical Introduction to the Xilinx Zynq-7000 Adaptive SoC
Book Preview: A Practical Introduction to the Xilinx Zynq-7000 Adaptive SoC
Derek Murray
 
Introduction to TensorFlow Lite
Introduction to TensorFlow Lite Introduction to TensorFlow Lite
Introduction to TensorFlow Lite
Koan-Sin Tan
 
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
Saikiran Panjala
 
Design of High Performance 8,16,32-bit Vedic Multipliers using SCL PDK 180nm ...
Design of High Performance 8,16,32-bit Vedic Multipliers using SCL PDK 180nm ...Design of High Performance 8,16,32-bit Vedic Multipliers using SCL PDK 180nm ...
Design of High Performance 8,16,32-bit Vedic Multipliers using SCL PDK 180nm ...
Angel Yogi
 
VLSI Implementation of Vedic Multiplier Using Urdhva– Tiryakbhyam Sutra in VH...
VLSI Implementation of Vedic Multiplier Using Urdhva– Tiryakbhyam Sutra in VH...VLSI Implementation of Vedic Multiplier Using Urdhva– Tiryakbhyam Sutra in VH...
VLSI Implementation of Vedic Multiplier Using Urdhva– Tiryakbhyam Sutra in VH...
iosrjce
 
DesignCon 2019 112-Gbps Electrical Interfaces: An OIF Update on CEI-112G
DesignCon 2019 112-Gbps Electrical Interfaces: An OIF Update on CEI-112GDesignCon 2019 112-Gbps Electrical Interfaces: An OIF Update on CEI-112G
DesignCon 2019 112-Gbps Electrical Interfaces: An OIF Update on CEI-112G
Leah Wilkinson
 
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Unde...
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Unde...Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Unde...
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Unde...
Vitaly Bondar
 
AMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop ProductsAMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD
 
Cadence P-cell tutorial
Cadence P-cell tutorial Cadence P-cell tutorial
Cadence P-cell tutorial
Michael Lee
 
Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9
inside-BigData.com
 
RISC-V NOEL-V - A new high performance RISC-V Processor Family
RISC-V NOEL-V - A new high performance RISC-V Processor FamilyRISC-V NOEL-V - A new high performance RISC-V Processor Family
RISC-V NOEL-V - A new high performance RISC-V Processor Family
RISC-V International
 
Object Detetcion using SSD-MobileNet
Object Detetcion using SSD-MobileNetObject Detetcion using SSD-MobileNet
Object Detetcion using SSD-MobileNet
IRJET Journal
 

Similar to FCCM2020: High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression (20)

Deep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLabDeep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLab
NECST Lab @ Politecnico di Milano
 
Dp2 ppt by_bikramjit_chowdhury_final
Dp2 ppt by_bikramjit_chowdhury_finalDp2 ppt by_bikramjit_chowdhury_final
Dp2 ppt by_bikramjit_chowdhury_final
Bikramjit Chowdhury
 
Convolutional neural networks for speech controlled prosthetic hands
Convolutional neural networks for speech controlled prosthetic handsConvolutional neural networks for speech controlled prosthetic hands
Convolutional neural networks for speech controlled prosthetic hands
Mohsen Jafarzadeh
 
"An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ..."An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ...
butest
 
Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)
DonghyunKang12
 
Science and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated EraScience and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated Era
Larry Smarr
 
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
deawoo Kim
 
EIS_REVIEW_1.pptx
EIS_REVIEW_1.pptxEIS_REVIEW_1.pptx
EIS_REVIEW_1.pptx
01fe20bec143
 
Lifetime maximization of wireless sensor networks with a mobile
Lifetime maximization of wireless sensor networks with a mobileLifetime maximization of wireless sensor networks with a mobile
Lifetime maximization of wireless sensor networks with a mobile
Nexgen Technology
 
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction NetworkEDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
gerogepatton
 
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction NetworkEDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
gerogepatton
 
[20240930_LabSeminar_Huy]GinAR: An End-To-End Multivariate Time Series Foreca...
[20240930_LabSeminar_Huy]GinAR: An End-To-End Multivariate Time Series Foreca...[20240930_LabSeminar_Huy]GinAR: An End-To-End Multivariate Time Series Foreca...
[20240930_LabSeminar_Huy]GinAR: An End-To-End Multivariate Time Series Foreca...
thanhdowork
 
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...Stochastic Computing Correlation Utilization in Convolutional Neural Network ...
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...
TELKOMNIKA JOURNAL
 
Online opportunistic routing using Reinforcement learning
Online opportunistic routing using Reinforcement learningOnline opportunistic routing using Reinforcement learning
Online opportunistic routing using Reinforcement learning
Harshal Solao
 
[20240902_LabSeminar_Huy]Dynamic Semantic-Based Spatial Graph Convolution Net...
[20240902_LabSeminar_Huy]Dynamic Semantic-Based Spatial Graph Convolution Net...[20240902_LabSeminar_Huy]Dynamic Semantic-Based Spatial Graph Convolution Net...
[20240902_LabSeminar_Huy]Dynamic Semantic-Based Spatial Graph Convolution Net...
thanhdowork
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the Continuum
Ian Foster
 
OptIPuter Overview
OptIPuter OverviewOptIPuter Overview
OptIPuter Overview
Larry Smarr
 
Hardware for Deep Learning AI ML CNN.pdf
Hardware for Deep Learning AI ML CNN.pdfHardware for Deep Learning AI ML CNN.pdf
Hardware for Deep Learning AI ML CNN.pdf
AhmedSaeed115917
 
[20240628_LabSeminar_Huy]ScalableSTGNN.pptx
[20240628_LabSeminar_Huy]ScalableSTGNN.pptx[20240628_LabSeminar_Huy]ScalableSTGNN.pptx
[20240628_LabSeminar_Huy]ScalableSTGNN.pptx
thanhdowork
 
FastV2C-HandNet - ICICC 2020
FastV2C-HandNet - ICICC 2020FastV2C-HandNet - ICICC 2020
FastV2C-HandNet - ICICC 2020
RohanLekhwani
 
Dp2 ppt by_bikramjit_chowdhury_final
Dp2 ppt by_bikramjit_chowdhury_finalDp2 ppt by_bikramjit_chowdhury_final
Dp2 ppt by_bikramjit_chowdhury_final
Bikramjit Chowdhury
 
Convolutional neural networks for speech controlled prosthetic hands
Convolutional neural networks for speech controlled prosthetic handsConvolutional neural networks for speech controlled prosthetic hands
Convolutional neural networks for speech controlled prosthetic hands
Mohsen Jafarzadeh
 
"An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ..."An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ...
butest
 
Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)
DonghyunKang12
 
Science and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated EraScience and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated Era
Larry Smarr
 
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
deawoo Kim
 
Lifetime maximization of wireless sensor networks with a mobile
Lifetime maximization of wireless sensor networks with a mobileLifetime maximization of wireless sensor networks with a mobile
Lifetime maximization of wireless sensor networks with a mobile
Nexgen Technology
 
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction NetworkEDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
gerogepatton
 
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction NetworkEDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
gerogepatton
 
[20240930_LabSeminar_Huy]GinAR: An End-To-End Multivariate Time Series Foreca...
[20240930_LabSeminar_Huy]GinAR: An End-To-End Multivariate Time Series Foreca...[20240930_LabSeminar_Huy]GinAR: An End-To-End Multivariate Time Series Foreca...
[20240930_LabSeminar_Huy]GinAR: An End-To-End Multivariate Time Series Foreca...
thanhdowork
 
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...Stochastic Computing Correlation Utilization in Convolutional Neural Network ...
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...
TELKOMNIKA JOURNAL
 
Online opportunistic routing using Reinforcement learning
Online opportunistic routing using Reinforcement learningOnline opportunistic routing using Reinforcement learning
Online opportunistic routing using Reinforcement learning
Harshal Solao
 
[20240902_LabSeminar_Huy]Dynamic Semantic-Based Spatial Graph Convolution Net...
[20240902_LabSeminar_Huy]Dynamic Semantic-Based Spatial Graph Convolution Net...[20240902_LabSeminar_Huy]Dynamic Semantic-Based Spatial Graph Convolution Net...
[20240902_LabSeminar_Huy]Dynamic Semantic-Based Spatial Graph Convolution Net...
thanhdowork
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the Continuum
Ian Foster
 
OptIPuter Overview
OptIPuter OverviewOptIPuter Overview
OptIPuter Overview
Larry Smarr
 
Hardware for Deep Learning AI ML CNN.pdf
Hardware for Deep Learning AI ML CNN.pdfHardware for Deep Learning AI ML CNN.pdf
Hardware for Deep Learning AI ML CNN.pdf
AhmedSaeed115917
 
[20240628_LabSeminar_Huy]ScalableSTGNN.pptx
[20240628_LabSeminar_Huy]ScalableSTGNN.pptx[20240628_LabSeminar_Huy]ScalableSTGNN.pptx
[20240628_LabSeminar_Huy]ScalableSTGNN.pptx
thanhdowork
 
FastV2C-HandNet - ICICC 2020
FastV2C-HandNet - ICICC 2020FastV2C-HandNet - ICICC 2020
FastV2C-HandNet - ICICC 2020
RohanLekhwani
 
Ad

More from Hiroki Nakahara (20)

ROS User Group Meeting #28 マルチ深層学習とROS
ROS User Group Meeting #28 マルチ深層学習とROSROS User Group Meeting #28 マルチ深層学習とROS
ROS User Group Meeting #28 マルチ深層学習とROS
Hiroki Nakahara
 
FPGAX2019
FPGAX2019FPGAX2019
FPGAX2019
Hiroki Nakahara
 
SBRA2018講演資料
SBRA2018講演資料SBRA2018講演資料
SBRA2018講演資料
Hiroki Nakahara
 
DSF2018講演スライド
DSF2018講演スライドDSF2018講演スライド
DSF2018講演スライド
Hiroki Nakahara
 
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...
Hiroki Nakahara
 
ISMVL2018: A Ternary Weight Binary Input Convolutional Neural Network
ISMVL2018: A Ternary Weight Binary Input Convolutional Neural NetworkISMVL2018: A Ternary Weight Binary Input Convolutional Neural Network
ISMVL2018: A Ternary Weight Binary Input Convolutional Neural Network
Hiroki Nakahara
 
FPGA2018: A Lightweight YOLOv2: A binarized CNN with a parallel support vecto...
FPGA2018: A Lightweight YOLOv2: A binarized CNN with a parallel support vecto...FPGA2018: A Lightweight YOLOv2: A binarized CNN with a parallel support vecto...
FPGA2018: A Lightweight YOLOv2: A binarized CNN with a parallel support vecto...
Hiroki Nakahara
 
FPT17: An object detector based on multiscale sliding window search using a f...
FPT17: An object detector based on multiscale sliding window search using a f...FPT17: An object detector based on multiscale sliding window search using a f...
FPT17: An object detector based on multiscale sliding window search using a f...
Hiroki Nakahara
 
(公開版)Reconf研2017GUINNESS
(公開版)Reconf研2017GUINNESS(公開版)Reconf研2017GUINNESS
(公開版)Reconf研2017GUINNESS
Hiroki Nakahara
 
(公開版)FPGAエクストリームコンピューティング2017
(公開版)FPGAエクストリームコンピューティング2017 (公開版)FPGAエクストリームコンピューティング2017
(公開版)FPGAエクストリームコンピューティング2017
Hiroki Nakahara
 
A Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGaA Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGa
Hiroki Nakahara
 
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介
Hiroki Nakahara
 
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
Hiroki Nakahara
 
Tensor flow usergroup 2016 (公開版)
Tensor flow usergroup 2016 (公開版)Tensor flow usergroup 2016 (公開版)
Tensor flow usergroup 2016 (公開版)
Hiroki Nakahara
 
FPGAX2016 ドキュンなFPGA
FPGAX2016 ドキュンなFPGAFPGAX2016 ドキュンなFPGA
FPGAX2016 ドキュンなFPGA
Hiroki Nakahara
 
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた
Hiroki Nakahara
 
Altera sdk for open cl アンケート集計結果(公開版)
Altera sdk for open cl アンケート集計結果(公開版)Altera sdk for open cl アンケート集計結果(公開版)
Altera sdk for open cl アンケート集計結果(公開版)
Hiroki Nakahara
 
Naist2015 dec ver1
Naist2015 dec ver1Naist2015 dec ver1
Naist2015 dec ver1
Hiroki Nakahara
 
Nested RNSを用いたディープニューラルネットワークのFPGA実装
Nested RNSを用いたディープニューラルネットワークのFPGA実装Nested RNSを用いたディープニューラルネットワークのFPGA実装
Nested RNSを用いたディープニューラルネットワークのFPGA実装
Hiroki Nakahara
 
FPL15 talk: Deep Convolutional Neural Network on FPGA
FPL15 talk: Deep Convolutional Neural Network on FPGAFPL15 talk: Deep Convolutional Neural Network on FPGA
FPL15 talk: Deep Convolutional Neural Network on FPGA
Hiroki Nakahara
 
ROS User Group Meeting #28 マルチ深層学習とROS
ROS User Group Meeting #28 マルチ深層学習とROSROS User Group Meeting #28 マルチ深層学習とROS
ROS User Group Meeting #28 マルチ深層学習とROS
Hiroki Nakahara
 
DSF2018講演スライド
DSF2018講演スライドDSF2018講演スライド
DSF2018講演スライド
Hiroki Nakahara
 
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...
ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied ...
Hiroki Nakahara
 
ISMVL2018: A Ternary Weight Binary Input Convolutional Neural Network
ISMVL2018: A Ternary Weight Binary Input Convolutional Neural NetworkISMVL2018: A Ternary Weight Binary Input Convolutional Neural Network
ISMVL2018: A Ternary Weight Binary Input Convolutional Neural Network
Hiroki Nakahara
 
FPGA2018: A Lightweight YOLOv2: A binarized CNN with a parallel support vecto...
FPGA2018: A Lightweight YOLOv2: A binarized CNN with a parallel support vecto...FPGA2018: A Lightweight YOLOv2: A binarized CNN with a parallel support vecto...
FCCM2020: High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression

  • 1. High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression. Hiroki Nakahara (Tokyo Institute of Technology, JP); Zhiqiang Que and Wayne Luk (Imperial College London, UK)
  • 2. Outline
      • Background
      • JPEG compression for a high-speed inference
      • CNN model for an FPGA implementation
          • Channel shift and point-wise decomposition
          • Quantization strategy
          • Channel shuffle
      • Fully-pipelined CNN architecture
      • Experimental results
      • Conclusion
  • 3. Convolutional Neural Networks (CNNs)
      • High accuracy and many applications
          • Image recognition, NLP, data mining [1]
      • FPGAs on cloud services
          • Amazon AWS, Microsoft Azure, etc.
      [1] Y. Liang, K. Ouyang, L. Jing, S. Ruan, Y. Liu, J. Zhang, D. S. Rosenblum and Y. Zheng, “UrbanFM: Inferring Fine-Grained Urban Flows,” ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD), 2019, pp.3132–3142.
  • 4. Problems
      • Power consumption
      • Performance bottleneck (data transfer)
          • e.g., AWS F1 provides overall read/write at 6.5 GB/s from host CPU to FPGA [1]
      [Figure: host PC holding a .jpg, sending a RAW (RGB) image over PCIe to the accelerator card (interconnect and CNN kernel)]
      [1] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, “Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs,” FPGA, 2019, pp.73–82.
  • 5. Our Contributions
      • Customized JPEG for a high-speed data transfer
          → Compression ratio (speed-up) vs. accuracy
      • Fully pipelined inference architecture with a light-weight CNN
      [Figure: (a) Conventional: the host PC sends a RAW (RGB) image over PCIe to the CNN kernel on the FPGA. (b) Proposed: the host PC sends a low-quality .jpg image over PCIe, and a decoder on the FPGA feeds the CNN kernel.]
  • 10. Customized JPEG for a High-speed Inference
  • 12. Proposed JPEG Coding with a CNN Accelerator
      [Figure: on the host PC, the image is quantized with the extreme quantization value q and Huffman-encoded into a JPEG image (.jpg), then streamed over PCIe. On the FPGA, a Huffman decoding and reverse quantization unit (with Huffman coding table and quantization table RAMs), a ping-pong buffer, the IDCT, and the fully pipelined CNN produce the detection result.]
  • 13. Huffman Decoding and Reverse Quantization Unit
      [Figure: block diagram with the image data stream, a shift register, a priority encoder (code patterns 00**, 01**, 10**, 110*, 1110), a run-length decoder, a shift value derived from the quantization value q and applied to the quantized value, a zig-zag pattern ROM, and a buffer RAM (ADR, WDATA) written in zig-zag order]
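The zig-zag writing and reverse quantization steps named on this slide can be illustrated in software. The following is a minimal sketch, not the authors' hardware design: it assumes the quantization step is a power of two, so that reverse quantization is a left shift (as the shift register on the slide suggests), and it takes already Huffman-decoded `(zero_run, value)` pairs as input. The names `zigzag_order` and `decode_block` are hypothetical.

```python
def zigzag_order(n=8):
    """Return (row, col) pairs of an n x n block in JPEG zig-zag scan order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col selects one anti-diagonal
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals run from bottom-left to top-right, odd ones the reverse.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def decode_block(pairs, shift, n=8):
    """Expand run-length pairs, dequantize by a left shift, and write in zig-zag order.

    pairs: (zero_run, quantized_value) tuples in zig-zag order (assumed input format).
    shift: log2 of the quantization step (assumption: power-of-two steps).
    """
    coeffs = []
    for run, value in pairs:
        coeffs.extend([0] * run)             # run-length decoding of zeros
        coeffs.append(value * (1 << shift))  # reverse quantization as a shift
    coeffs.extend([0] * (n * n - len(coeffs)))  # remaining coefficients are zero
    block = [[0] * n for _ in range(n)]
    for (r, c), v in zip(zigzag_order(n), coeffs):
        block[r][c] = v
    return block


block = decode_block([(0, 5), (1, -3)], shift=2)
print(block[0][:3], block[1][:3])
```

In hardware, the role of `zigzag_order` would be played by the zig-zag pattern ROM that supplies write addresses to the buffer RAM.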
  • 14. 2D-IDCT
      • Decompose the 2D-IDCT into 16 1D-IDCTs
      AP-922, Application Note 922, “A Fast Precise Implementation of 8x8 Discrete Cosine Transform Using the Streaming SIMD Extensions and MMX Instructions,” https://www.cs.cmu.edu/~barbic/cs-740/ap922.pdf
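The decomposition works because the 2D transform is separable: an 8x8 block needs 8 row transforms followed by 8 column transforms, 16 one-dimensional transforms in total. The sketch below shows this in floating point with the orthonormal scaling convention; the 16-bit arithmetic and the fast algorithm from the application note are not modeled, and the function names are hypothetical.

```python
import math


def idct_1d(coeffs):
    """Orthonormal 1D inverse DCT (DCT-III) of a list of coefficients."""
    n = len(coeffs)
    out = []
    for x in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos((2 * x + 1) * k * math.pi / (2 * n))
        out.append(total)
    return out


def idct_2d(block):
    """2D inverse DCT as row transforms followed by column transforms."""
    rows = [idct_1d(row) for row in block]             # one 1D-IDCT per row
    cols = [idct_1d(list(col)) for col in zip(*rows)]  # one 1D-IDCT per column
    return [list(r) for r in zip(*cols)]               # transpose back to row-major


# A block with only a DC coefficient decodes to a flat image.
dc_only = [[0.0] * 8 for _ in range(8)]
dc_only[0][0] = 8.0
pixels = idct_2d(dc_only)
print(round(pixels[0][0], 6), round(pixels[7][7], 6))
```

With two 1D-IDCT units, as on the following slide, the row and column passes can presumably overlap on consecutive blocks, but that scheduling is not shown here.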
  • 15. 2D-IDCT Unit
      • Two 1D-IDCT units
      • Use half precision (16 bits)
      [Figure: controller, 1D-IDCT units built from operation units and registers, and three RAMs]
  • 16. CNN model for an FPGA Implementation
  • 17. Overview
      1. Decomposing k×k convolution by channel shift [1] and point-wise (1×1) convolution
      2. Binary (1-bit) weight quantization [2]
      3. Channel split and shuffle [3]
      [1] B. Wu, A. Wan, X. Yue, P. H. Jin, S. Zhao, N. Golmant, A. Gholaminejad, J. Gonzalez, and K. Keutzer, “Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions,” CVPR, 2018, pp. 9127-9135.
      [2] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training Deep Neural Networks with Binary Weights During Propagations,” NIPS, 2015, pp.3105–3113.
      [3] X. Zhang, X. Zhou, M. Lin and J. Sun, “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices,” CVPR, 2018.
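Items 1 and 2 can be combined in a few lines: a channel shift moves each channel spatially without any arithmetic, and a point-wise convolution with binary weights reduces every multiplication to an add or a subtract. The sketch below is illustrative only and assumes zero padding at the borders, +1/-1 weights, and feature maps stored as nested lists; `shift_channel` and `binary_pointwise_conv` are hypothetical names, and normalization and activation are omitted.

```python
def shift_channel(fmap, dy, dx):
    """Shift one H x W map by (dy, dx), filling vacated cells with zeros."""
    h, w = len(fmap), len(fmap[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = fmap[sy][sx]
    return out


def binary_pointwise_conv(fmaps, weights):
    """1x1 convolution where weights[o][i] is +1 or -1 (binary weights)."""
    h, w = len(fmaps[0]), len(fmaps[0][0])
    out = []
    for w_row in weights:  # one row of weights per output channel
        ch = [[0] * w for _ in range(h)]
        for wt, fmap in zip(w_row, fmaps):
            for y in range(h):
                for x in range(w):
                    # A binary weight needs only an add or a subtract.
                    ch[y][x] += fmap[y][x] if wt > 0 else -fmap[y][x]
        out.append(ch)
    return out


fmaps = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
shifted = [shift_channel(fmaps[0], 0, 1), shift_channel(fmaps[1], 1, 0)]
print(binary_pointwise_conv(shifted, [[1, -1]]))
```

Because each channel sees a different spatial offset, the 1×1 convolution that follows mixes information from neighbouring pixels, which is how the pair stands in for a k×k convolution.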
  • 18. Building Blocks
      [Figure: (a) Plain block: Channel Split, Shift, PWConv, Shift, PWConv, Concat & Shuffle. (b) Down-sampling block: Channel Split, Shift, PWConv (s=2), Shift, PWConv, Concat & Shuffle, with a second branch of Shift and PWConv (s=2). Annotations: #channel/2 and #channel x2.]
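The split, concat, and shuffle steps of the plain block can be sketched as list operations on channels. This assumes a ShuffleNet-style arrangement in which one half of the channels bypasses the Shift and PWConv layers and the other half passes through them; the slide does not spell out the branch wiring, so treat that as an assumption. `channel_shuffle` and `plain_block` are hypothetical names, and `branch` stands in for the Shift/PWConv chain.

```python
def channel_shuffle(channels, groups=2):
    """Interleave channels across groups: [a0, a1, b0, b1] -> [a0, b0, a1, b1]."""
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


def plain_block(channels, branch):
    """Split channels in half, transform one half, then concat and shuffle."""
    half = len(channels) // 2
    left, right = channels[:half], channels[half:]  # channel split (#channel/2 each)
    return channel_shuffle(left + branch(right), groups=2)


# Strings stand in for feature maps; upper-casing marks the processed half.
print(plain_block(["a", "b", "c", "d"], lambda xs: [x.upper() for x in xs]))
```

The shuffle ensures that channels processed in one block land in the bypass half of the next, so every channel is eventually transformed.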
  • 19. Our CNN Model
      Layer | Output size | Kernel size | Stride | #Output channel
      Image | 224 | - | - | 3
      PWConv | 224 | 1 | 2 | 24
      Norm | 224 | 1 | 1 | 24
      Shift | 224 | 1 | 1 | 24
      Pool | 112 | 3 | 1 | 24
      PWConv | 112 | 2 | 2 | 24
      Norm | 112 | 1 | 1 | 24
      ReLU | 112 | 1 | 1 | 24
      Shift | 112 | 3 | 1 | 24
      Pool | 56 | 2 | 2 | 24
      Stage 2 | 28 | - | - | 116 (4 repeats)
      Stage 3 | 14 | - | - | 232 (8 repeats)
      Stage 4 | 7 | - | - | 464 (16 repeats)
      GAP | 1 | 7 | 1 | 464
      PWConv | 1 | 1 | 1 | 1000
      • Training-aware quantization
          • Weights (w): binary; activations (a): 8-bit
      • 2.54 M params, 0.616 GMACs
  • 21. Dataflow for a Residual Stage of a Plain Block
      • Double buffers for branch-flow
      • Xilinx #pragma HLS dataflow
      [Figure: chain of layer units connected through feature-map buffers, ending in a shuffle unit]
  • 22. 2D Convolutional Unit
      [Figure: convolution unit with weight memory (W.mem), adder tree, batch normalization (BN), and activation (Act); dimensions labeled c, n, p, and n×p]
  • 23. Pooling Units
      [Figure: max pooling unit (n=5, k=2) with a 5×5 feature map x00 to x44, feature-map memory, a shift register holding x11, x10, x04, x03, x02, x01, x00, write control logic, and a max selector; global average pooling unit with an adder, a register with reset, feature-map memory, write control logic, a controller, and 1/n² scaling]
  • 25. Compression Ratio vs. Accuracy
      [Figure: data transfer speed-up ratio (baseline: RGB image transfer) and ImageNet top-1 accuracy [%] against the JPEG quantization bit q]
      JPEG quantization bit | q=1 | q=2 | q=3 | q=4 | q=5 | Standard
      Speed-up ratio | 162.2 | 124.6 | 82.1 | 53.5 | 34.9 | 11.5
      Top-1 accuracy [%] | 59.61 | 66.64 | 70.8 | 71.1 | 71.2 | 71.2
      • ImageNet2012 (224x224 pixel image) classification task
      • PyTorch 1.4.0 + modified libjpeg library
      • Accuracy decreases by only 0.3 points while achieving an 82.1 times speed-up
  • 26. Implementation Results
      Module | #LUTs | #FFs | #DSPs | 18Kb BRAMs | #URAMs
      JPEG Decoder | 11,675 | 6,646 | 34 | 2 | 0
      Huffman Decoder | 6,794 | 2,378 | 0 | 0 | 0
      2D-IDCT | 4,881 | 4,278 | 34 | 2 | 0
      Pipelined-CNN | 263,120 | 266,784 | 2,336 | 2,744 | 0
      Total | 274,795 | 273,440 | 2,370 | 2,746 | 16
      (Ratio) | (23.2%) | (11.5%) | (34.6%) | (63.5%) | (1.6%)
      • Xilinx Inc. Virtex UltraScale+ FPGA VCU1525 acceleration development kit
      • Xilinx Inc. SDAccel 2018.2
      • Operates at 300 MHz and 75 W
      • System performance: 3321.25 FPS
      • JPEG trans-decode: 81,120 FPS (cf. conventional RGB transfer: 1242.8 FPS)
      • The JPEG decoder accounts for only 4.2% of the total system LUTs
  • 27. Comparison with Other FPGA Implementations
      Method | AlexNet [1] | FINN-R [2] | Synetgy [3] | MobNetV2 [4] | CloudDNN [5] | Ours
      FPGA | Stratix V | Zynq ZU3EG | Zynq ZU3EG | Zynq ZU9EG | Virtex US+ XCVU9P | Virtex US+ XCVU9P
      FPS | 864.7 | 200.0 | 96.5 | 809.8 | 123.1 | 3321.2
      Top-1 Acc. | 42.90% | 50.30% | 68.30% | 68.1% | --- | 70.8%
      Top-5 Acc. | 66.80% | --- | 88.12% | --- | --- | 90.1%
      Precision (W/Act) | 16/16 | 1/2 | 4/4 | 8/8 | 16/16 | 1/8
      Freq. (MHz) | 150 | 220 | 250 | 333 | 214 | 300
      Power (W) | 26.2 | 10.2 | 5.5 | --- | 49.25 | 75.0
      [1] S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, “FP-BNN: Binarized neural network on FPGA,” Neurocomputing, 275:1072–1086, 2018.
      [2] M. Blott, T. Preusser, N. Fraser, G. Gambardella, K. O’Brien, and Y. Umuroglu, “FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks,” 2018.
      [3] Y. Yang, Q. Huang, B. Wu, T. Zhang, L. Ma, G. Gambardella, M. Blott, L. Lavagno, K. A. Vissers, J. Wawrzynek and K. Keutzer, “Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs,” FPGA, pp. 23-32, 2019.
      [4] D. Wu, Y. Zhang, X. Jia, L. Tian, T. L, L. Sui, D. Xie, and Y. Shan, “A High-performance CNN Processor Based on FPGA for MobileNets,” 29th International Conference on Field Programmable Logic and Applications (FPL), 2019, pp.136-143.
      [5] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, “Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs,” FPGA, 2019, pp.73–82.
  • 28. Comparison with CPU and GPU
      Platform | CPU | GPU | FPGA
      Device | Xeon E5-2690 | Tesla V100 | Virtex US+ XCVU9P
      Clock Freq. | 2.6 GHz | 1.53 GHz | 0.3 GHz
      Memory | 32GB DDR4 | 16GB HBM2 | 9.49 MB BRAM
      Throughput (FPS) | 24.0 | 350.0 | 3321.25
      Power (W) | 95 | 295 | 75
      Efficiency (FPS/W) | 0.25 | 1.18 | 44.28
      • Ubuntu 18.04 LTS with PyTorch 1.4.0
      • Batch size 128 with INT8 quantization (for CPU and GPU)
      Note: CPU and GPU did not use our JPEG compression scheme
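The efficiency row and the ratios quoted in the conclusion follow directly from the throughput and power figures. The short check below recomputes them; it assumes the table's FPS/W entries were truncated to two decimals (an inference from 350/295 = 1.186 being listed as 1.18), and the helper names are hypothetical.

```python
import math

# (throughput in FPS, power in W) as listed in the comparison table
platforms = {"CPU": (24.0, 95.0), "GPU": (350.0, 295.0), "FPGA": (3321.25, 75.0)}


def truncate2(x):
    """Truncate to two decimals (assumed to match how the table was filled in)."""
    return math.floor(x * 100) / 100


exact = {name: fps / watts for name, (fps, watts) in platforms.items()}
listed = {name: truncate2(v) for name, v in exact.items()}


def ratios(baseline):
    """Return (speed-up, efficiency gain from listed values, gain from exact values)."""
    speedup = platforms["FPGA"][0] / platforms[baseline][0]
    gain_listed = listed["FPGA"] / listed[baseline]
    gain_exact = exact["FPGA"] / exact[baseline]
    return speedup, gain_listed, gain_exact


for name in ("CPU", "GPU"):
    s, gl, ge = ratios(name)
    print(f"{name}: speed-up {s:.1f}x, efficiency {gl:.1f}x (listed), {ge:.1f}x (exact)")
```

Using the listed two-decimal efficiencies gives 177.1x and 37.5x, matching the conclusion slide; dividing the unrounded values gives roughly 175.3x and 37.3x, so the small difference is a rounding effect.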
  • 30. Conclusion
      • Customized JPEG compression for a high-speed inference
          • 82.1x speed-up, 0.3-point accuracy drop
      • CNN model for a fully-pipelined implementation
          • Channel shift and point-wise decomposition
          • Binary weight quantization
          • Channel split-shuffle operation
      • Fully-pipelined CNN architecture
          • Achieved 3,321 FPS at 75 W
          • Speed-up: 138.4x CPU, 9.5x GPU
          • Energy efficiency: 177.1x CPU, 37.5x GPU
      • Future work
          • Custom compression and other DL applications