Quantizing Deep Networks for
Efficient Inference at the Edge
Raghu Krishnamoorthi, Facebook
Questions/Feedback: raghuraman@fb.com
Acknowledgements
• Results presented here are from work done at Google as part of the TensorFlow Lite team and work at Facebook as part of the PyTorch team.
• Acknowledge contributions from several colleagues at Google, including: Benoit Jacob, Skirmantas Kligys, Dmitry Kalenichenko, Suharsh Sivakumar and Pete Warden.
• Also acknowledge work from colleagues at Facebook: Jongsoo Park, Maxim Naumov, Summer Deng, Marat Dukhan, Bichen Wu, Peizhao Zhang, Jerry Zhang, Dmytro Dzhulgakov, Daya Khudia, Jianyu Huang, James Reed, Mikhail Z, Haixin Liu and Peter Vajda.
Outline
• Motivation
• Quantization: Overview
• Quantizing deep networks
• Post-training quantization
• Quantization aware training
• Lower precision inference
• Hardware accelerator recommendations
• Model system co-design
• Looking ahead
Motivation (1)
• Data-center power consumption is doubling every year
Source: Deep Learning Inference in Facebook Data Centers [1]
Motivation (2)
• The number of edge devices is growing rapidly, and many of these devices are resource constrained.
Source: https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/
Motivation (3)
• While models are becoming more efficient, high accuracy still implies high complexity
From: Benchmark Analysis of Representative Deep Neural Network Architectures, Simone Bianco et al. [2]
Quantization
• Many approaches to solve the problems outlined here:
  • Better hardware accelerators: TPUs => Requires new custom hardware
  • Optimized kernels: cuDNN, Intel MKL-DNN
  • Efficient deep network architectures: NASNet, MobileNet, FBNet => Requires new architectures
• A simpler approach that requires neither re-designed models nor new hardware is quantization.
• Quantization refers to techniques that perform computation and storage at reduced precision
  • Works in combination with the approaches above
  • Requires optimized kernels to efficiently use existing hardware.
Background: Quantization (1)
• Quantization refers to mapping values from fp32 to a lower precision format.
• A quantization scheme is specified by:
  • Format: fp16, bfloat16, int8, int4, binary
  • Mapping type
  • Granularity
(Figure: number-format diagram, from https://commons.wikimedia.org/w/index.php?curid=69415943)
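To make the mapping concrete, here is a minimal sketch of the asymmetric (affine) int8 mapping in NumPy. The helper names are ours, and the min/max-based choice of parameters is one common option, following the conventions in [6]:

```python
import numpy as np

def quant_params(x_min, x_max, n_bits=8):
    """Scale s and zero point z mapping [x_min, x_max] onto [0, 2^n_bits - 1]."""
    qmin, qmax = 0, 2 ** n_bits - 1
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # range must include 0.0
    s = (x_max - x_min) / (qmax - qmin)
    z = int(round(qmin - x_min / s))
    return s, z

def quantize(x, s, z, qmin=0, qmax=255):
    return np.clip(np.round(x / s) + z, qmin, qmax).astype(np.uint8)

def dequantize(q, s, z):
    return s * (q.astype(np.float32) - z)

s, z = quant_params(-1.0, 3.0)        # e.g. an observed activation range
x = np.array([-1.0, 0.0, 1.5, 3.0], dtype=np.float32)
x_hat = dequantize(quantize(x, s, z), s, z)  # matches x up to rounding error
```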
Background: Quantization (2)
• We also consider different granularities of quantization:
  • Per-layer quantization:
    • Same mapping for all elements in a layer.
  • Per-row / per-channel quantization:
    • Choose quantizer parameters independently for each row (fc layers) or for each conv kernel (conv layers).
  • Outlier-aware quantization:
    • Separate out the outliers so that lower precision arithmetic can be used for the bulk of the weights.
    • Dense computation for the inliers, with sparse computation for the outliers that have a large magnitude.
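The difference between per-layer and per-row granularity amounts to how many scale factors are estimated. A small illustrative PyTorch sketch (symmetric int8; the naming is ours):

```python
import torch

w = torch.randn(64, 128)  # an fc weight matrix: 64 output rows

# Per-layer: a single symmetric scale for the whole tensor. One heavy-tailed
# row inflates the step size (and hence the error) for every other row.
scale = w.abs().max() / 127.0
q_layer = torch.clamp((w / scale).round(), -127, 127)

# Per-row (per-channel for convs): an independent scale per output row.
scales = w.abs().amax(dim=1, keepdim=True) / 127.0   # shape (64, 1)
q_row = torch.clamp((w / scales).round(), -127, 127)
```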
Modeling quantization during training
• Emulate quantization by quantizing and de-quantizing in succession
  • Values are still in floating point, but with reduced precision
• x_out = FakeQuant(x) = s · (Clamp(round(x/s) + z) − z) = DeQuant(Quant(x))
• Can also model quantization as a stochastic rounding operation
Fake quantizer (top), showing the quantization of output values. Approximation for purposes of derivative calculation (bottom).
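The equation above can be implemented directly. Below is an illustrative PyTorch sketch in which the backward pass uses the straight-through approximation mentioned in the caption (the class name is ours, not a library API):

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Quantize-dequantize in the forward pass (values stay fp32, but land on
    the int8 grid); straight-through estimator in the backward pass."""

    @staticmethod
    def forward(ctx, x, s, z, qmin=0, qmax=255):
        q = torch.clamp(torch.round(x / s) + z, qmin, qmax)  # Quant(x)
        return s * (q - z)                                   # DeQuant(q)

    @staticmethod
    def backward(ctx, grad_out):
        # Approximate d(FakeQuant)/dx as 1: round/clamp have zero gradient
        # almost everywhere, so we pass the gradient straight through.
        return grad_out, None, None, None, None

x = torch.randn(4, requires_grad=True)
y = FakeQuant.apply(x, 0.05, 128)   # s and z would come from observed ranges
y.sum().backward()                  # x.grad is all ones: identity in backward
```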
Quantization: Benefits
• Applicability: Broad applicability across models and use cases
• Hardware support: Supported by x86, Nvidia Volta, ARM, Mali, QDSP
• Software support: Kernel libraries widely available
• Memory size: 4x reduction
• Memory bandwidth/cache: 4x reduction
• Compute: 2x to 4x speedup, depending on ISA
• Power: Typically 4x reduction (dominated by memory access)
(Comparing float32 implementations with 8-bit inference)
Quantization: Challenges
• Accuracy drop
  • Notes: Loss in accuracy can be too high for certain applications
  • Mitigation: Quantization aware training
• Kernel support
  • Notes: Wide variety of operators + multiple hardware platforms
  • Mitigation: Improving the software tool-chain (e.g., TVM) to handle varied backends
• “Complexity”
  • Notes: Non-trivial: requires calibration/training in some cases
  • Mitigation: Support in software packages: TensorRT, TensorFlow and PyTorch
Quantizing deep networks
Model Quantization: Overview
(Diagram: the deployment pipeline — Train → Convert for inference → Graph optimization → Kernel implementation — shown three times, with quantization inserted at different points: after conversion for post-training quantization, and as fake quantization during training for quantization-aware training.)
What to quantize?
• Only quantize the parts of the network that contribute significantly to performance
  • Roofline analysis to identify compute-bound vs. memory-bandwidth-bound operations (see the sketch after the table below)
  • May need to further reduce scope based on accuracy impact.
• Multiple ways to quantize a network, with different impact:

Quantization scheme | Memory bandwidth reduction (weights) | Memory bandwidth reduction (activations) | Compute speedup | Notes
Weight-only quantization to int8 | 4x | 1x | 1x | Suitable for embedding lookups
Dynamic quantization | 4x | 1x | 2x | Suitable for fc layers with small batches
Static quantization (int32 accumulators) | 4x | 4x | 2x | Suited for all layers, important for convolutions
Static quantization (int16 accumulators) | 4x | 4x | 4x | Requires lower precision weights/activations
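The roofline bullet above can be approximated with a back-of-the-envelope arithmetic-intensity calculation. The sketch below uses an fc layer and illustrative sizes (the helper is ours, not a tool from the talk):

```python
def fc_arithmetic_intensity(batch, c_in, c_out, bytes_per_elem=4):
    """FLOPs per byte of traffic for a fully connected layer (roofline proxy)."""
    flops = 2 * batch * c_in * c_out                      # multiply-accumulates
    traffic = bytes_per_elem * (batch * c_in              # read activations
                                + c_in * c_out            # read weights
                                + batch * c_out)          # write outputs
    return flops / traffic

# batch=1 fc: ~0.5 FLOPs/byte -> memory bound, weight bandwidth dominates, so
# weight-only or dynamic int8 quantization already gives most of the win.
print(fc_arithmetic_intensity(1, 1024, 1024))    # ~0.5
# batch=256: ~85 FLOPs/byte -> compute bound; static quantization with int8
# arithmetic is what actually buys speed here.
print(fc_arithmetic_intensity(256, 1024, 1024))  # ~85
```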
Post-training quantization: Weight compression
• The simplest quantization scheme is to compress the weights to lower precision
  • Requires no input data and can be done statically as part of preparing a model for inference
  • Hardware accelerators can benefit if de-compression is done after the memory access
• Trivial for the case of fp16/int8 quantization of weights.
• K-means compression is also supported on select platforms and is amenable to simple de-compression (see the sketch below)
  • Scatter-gather operation in select processors
  • Supported in CoreML
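As an illustration of the k-means option, here is a minimal sketch built on scikit-learn's KMeans. The helper names are ours, and a real deployment would bit-pack the codes and run the gather in hardware:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_compress(w, n_clusters=16):
    """Cluster weights into n_clusters shared values: storage becomes a tiny
    fp32 codebook plus a log2(n_clusters)-bit code per weight (4 bits here)."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(w.reshape(-1, 1))
    codebook = km.cluster_centers_.ravel().astype(np.float32)
    codes = km.labels_.astype(np.uint8)          # would be bit-packed on device
    return codebook, codes

def decompress(codebook, codes, shape):
    return codebook[codes].reshape(shape)        # the gather/table-lookup step

w = np.random.randn(64, 64).astype(np.float32)
codebook, codes = kmeans_compress(w)
w_hat = decompress(codebook, codes, w.shape)     # de-compress after memory access
```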
Dynamic quantization
• Dynamic quantization refers to schemes where the activations are read/written in fp32 and are dynamically quantized to lower precision for compute.
• Requires no calibration data
• Data exchanged between operations stays in floating point, so there is no need to worry about format conversion.
• Provides performance improvements close to static quantization when memory access is dominated by weights
  • Suitable for inference in RNNs
  • Smaller gains for conv layers
• Supported by:
  • PyTorch
  • TensorFlow Lite
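A minimal sketch of dynamic quantization using PyTorch's eager-mode API (announced around the time of this talk); only the Linear modules are replaced, and the model here is a placeholder:

```python
import torch
import torch.nn as nn

float_model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 256),
)

# Weights are converted to int8 once, up front; activations stay fp32 at op
# boundaries and are quantized on the fly inside each Linear.
quantized_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
y = quantized_model(x)   # fp32 in, fp32 out; int8 GEMMs inside
```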
Quantizing weights and activations
• Post-training quantization refers to quantizing both weights and activations to reduced precision, typically int8.
• Requires estimating statistics of the activations to determine quantizer parameters.
• Quantizer parameters are determined by minimizing an error metric:
  • KL divergence: TensorRT
  • Saturation error: TensorFlow Lite
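A minimal sketch of this static flow in PyTorch's eager-mode API (insert observers, calibrate on representative data, convert); the module and calibration data here are placeholders:

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # fp32 -> int8 boundary
        self.fc = nn.Linear(1024, 256)
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> fp32 boundary

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = Net().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # x86 backend
torch.quantization.prepare(model, inplace=True)   # insert min/max observers

with torch.no_grad():                             # calibration: run representative
    for _ in range(32):                           # data so the observers record
        model(torch.randn(8, 1024))               # activation statistics

torch.quantization.convert(model, inplace=True)   # pick s, z; swap in int8 ops
```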
Results
Setup
• Standard classification model architectures
• Evaluate classification accuracy on the ImageNet validation dataset.
• Results obtained using TensorFlow; more details are at: https://arxiv.org/abs/1806.08342
• More results, and quantization support in PyTorch, to be announced at PyTorch DevCon on October 10th.
Post-training quantization: Results

Network | Asymmetric, per layer | Symmetric, per channel | Asymmetric, per channel | Activation only quantized | Weight only, symmetric, per channel | Floating point
Mobilenet-v2-1-224 | 0.001 | 0.698 | 0.697 | 0.7 | 0.698 | 0.719
Mobilenet-v1-1-224 | 0.001 | 0.591 | 0.703 | 0.708 | 0.591 | 0.709
Nasnet-Mobile | 0.72 | 0.72 | 0.74 | 0.74 | 0.72 | 0.74
Inception-v3 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78
Resnet-v1-50 | 0.75 | 0.75 | 0.75 | 0.751 | 0.75 | 0.752

• 8 bits for weights and activations is sufficient for common CV classification tasks
• Smaller networks are “harder” to quantize
• At 8 bits, the accuracy drop is dominated by weight quantization
Quantization aware training
Results

Network | Asymmetric, per layer, post-training quant | Symmetric, per channel, post-training quant | Asymmetric, per layer, QAT | Symmetric, per channel, QAT | Floating point
Mobilenet-v2-1-224 | 0.001 | 0.698 | 0.709 | 0.711 | 0.719
Mobilenet-v1-1-224 | 0.001 | 0.591 | 0.70 | 0.707 | 0.709
Nasnet-Mobile | 0.72 | 0.72 | 0.73 | 0.73 | 0.74
Inception-v3 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78
Resnet-v1-50 | 0.75 | 0.75 | 0.75 | 0.751 | 0.752

• Quantization aware training provides the best accuracy and allows for simpler quantization schemes.
Performance: Operator-level benchmarks
(Chart: Server — FBGEMM (quantized) vs. MKL-DNN (fp32))
Performance: Model-level benchmarks (Mobile: TensorFlow Lite)
(Chart: Mobile — inference time, float vs. quantized, TFLite on a Pixel 2)
QNNPACK kernels provide an additional 2x speedup
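A rough way to reproduce operator-level comparisons like these on your own machine is torch.utils.benchmark with a dynamically quantized Linear; absolute numbers depend heavily on the ISA and backend, so treat this as a sketch:

```python
import torch
import torch.nn as nn
from torch.utils import benchmark

fp32 = nn.Sequential(nn.Linear(1024, 1024))
int8 = torch.quantization.quantize_dynamic(fp32, {nn.Linear}, dtype=torch.qint8)
x = torch.randn(1, 1024)   # batch=1: the weight-bandwidth-bound regime

for label, mod in [("fp32", fp32), ("int8 dynamic", int8)]:
    timer = benchmark.Timer(stmt="mod(x)", globals={"mod": mod, "x": x})
    print(label, timer.timeit(1000))
```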
Lower precision inference (1)
• Four-bit precision for weights provides good accuracy, but needs to be applied selectively.
• Larger networks are more robust to lower precision
• Quantization aware training is critical
• Selectively quantizing layers of a network to different precisions can reduce the accuracy drop
(Table: 4-bit weights, 8-bit activations — top-1 accuracy results)
Lower precision inference (2)
• Different layers of a neural network have different sensitivity to quantization errors
• Exciting work on differentiable architecture search [7] for determining precision allocations across layers, showing excellent performance.
Architecture trade-offs (1)
• Clear trade-off between the number of parameters and robustness to quantization
• One can also trade off the number of feature maps vs. precision
  • Having 2x the number of feature maps at 4 bits is better than 8-bit quantization of the base network.
Architecture trade-offs (2)
• Restricting the ranges of activations a priori can hurt accuracy
• Preferable to learn activation ranges instead of fixing them beforehand (see the sketch below).
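One way to learn an activation range is a PACT-style learned clipping threshold; this sketch is our illustration of the idea, not a method prescribed by the talk:

```python
import torch
import torch.nn as nn

class LearnedClip(nn.Module):
    """Clip activations to a learned upper bound alpha instead of a fixed
    range like ReLU6, so the quantization grid can adapt during training."""

    def __init__(self, alpha_init=6.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))

    def forward(self, x):
        # Equal to clamp(x, 0, alpha), written so gradients reach alpha:
        return 0.5 * (x.abs() - (x - self.alpha).abs() + self.alpha)

act = LearnedClip()
y = act(torch.randn(8, 16) * 4)
y.sum().backward()           # act.alpha.grad is nonzero: the range is trained
```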
Co-design: Training quantized models
• Designing models that quantize well requires co-design of the model architecture, training algorithms and hardware.
• Specific training enhancements include (see the sketch below):
  • Fine-tune from floating point models when building quantized models.
  • Freeze batch normalization statistics updates to exactly model inference, for further benefits.
  • Model the exact rounding arithmetic done in hardware during training.
  • Stochastic quantization provides models robust to random perturbations of weights, but underperforms techniques that model quantization as done at inference.
• Other enhancements to improve accuracy:
  • Use distillation to train a quantized student from a floating point teacher network [3]
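A sketch of this recipe using PyTorch's eager-mode QAT API (released shortly after this talk); the tiny network, data and schedule are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Sequential(                          # stand-in for a pretrained fp32 net
    nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU(),
).train()

# Fuse conv+bn+relu so BN folds into the fake-quantized conv during training
# (newer releases spell this torch.ao.quantization.fuse_modules_qat).
model = torch.quantization.fuse_modules(model, [["0", "1", "2"]])
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)   # insert fake-quant modules

opt = torch.optim.SGD(model.parameters(), lr=1e-3)
for epoch in range(6):                                # fine-tune with fake quant
    x, target = torch.randn(8, 3, 32, 32), torch.randn(8, 16, 30, 30)
    loss = ((model(x) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if epoch == 3:
        # Late in training, freeze quantizer ranges and BN statistics so the
        # network exactly models inference-time arithmetic.
        model.apply(torch.quantization.disable_observer)
        model.apply(torch.nn.intrinsic.qat.freeze_bn_stats)

qmodel = torch.quantization.convert(model.eval())     # final int8 model
```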
Conclusions
Hardware accelerator recommendations: Basics
• Optimize memory bandwidth
  • First-order predictor of power consumption
  • Don’t ignore activations: most literature focuses on weights, but activations can be very significant for large-resolution inputs.
• Fuse multiple operations
• Have floating point support as a backup
  • Avoid switching compute to different hardware
• Optimize for GEMM
  • Still the workhorse for most DNN applications
• Support low precision inference
  • 8-bit is required, but supporting lower precision can provide really high throughput.
Hardware accelerator recommendations: Software
• Don’t forget the software toolchain!
  • Need to make it easy for customers to use the hardware
  • Integration with TensorFlow/PyTorch is important
• Writing optimized kernels for new hardware is hard
  • Most implementations optimize for a specific set of models, with poor performance for kernels needed by other models.
Hardware accelerator recommendations: Differentiation
• Build a strategy for operator support
  • Take a close look at the TVM/MLIR efforts.
  • Code generation along with hand-written kernels
• To get the best out of hardware with quantization:
  • Provide exact models of HW kernels for integration with training frameworks
• Consider other techniques beyond quantization
  • Sparsity
  • K-means compression
  • Dynamic/adaptive execution
• Don’t forget privacy
  • Secure aggregation/homomorphic encryption are becoming increasingly important
• Training at the edge:
  • Depending on the application, this can be very important for privacy/personalization
References
1. J. Park, M. Naumov et al., “Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications”
2. S. Bianco et al., “Benchmark Analysis of Representative Deep Neural Network Architectures”
3. A. Polino et al., “Model Compression via Distillation and Quantization”
4. B. Jacob, S. Kligys et al., “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”
5. M. Courbariaux, Y. Bengio et al., “BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations”
6. R. Krishnamoorthi, “Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper”
7. B. Wu et al., “Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search”
Ad

More Related Content

What's hot (20)

Introduction To TensorFlow
Introduction To TensorFlowIntroduction To TensorFlow
Introduction To TensorFlow
Spotle.ai
 
Perceptron & Neural Networks
Perceptron & Neural NetworksPerceptron & Neural Networks
Perceptron & Neural Networks
NAGUR SHAREEF SHAIK
 
Tensor Processing Unit (TPU)
Tensor Processing Unit (TPU)Tensor Processing Unit (TPU)
Tensor Processing Unit (TPU)
Antonios Katsarakis
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural Networks
Christian Perone
 
Transformer in Vision
Transformer in VisionTransformer in Vision
Transformer in Vision
Sangmin Woo
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
Yuta Niki
 
Data Augmentation
Data AugmentationData Augmentation
Data Augmentation
Md Tajul Islam
 
ONNX - The Lingua Franca of Deep Learning
ONNX - The Lingua Franca of Deep LearningONNX - The Lingua Franca of Deep Learning
ONNX - The Lingua Franca of Deep Learning
Hagay Lupesko
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)
Jeong-Gwan Lee
 
CNN Attention Networks
CNN Attention NetworksCNN Attention Networks
CNN Attention Networks
Taeoh Kim
 
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Ryo Takahashi
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning
Dr. Swaminathan Kathirvel
 
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Sergey Karayev
 
Attention Is All You Need
Attention Is All You NeedAttention Is All You Need
Attention Is All You Need
Illia Polosukhin
 
ViT (Vision Transformer) Review [CDM]
ViT (Vision Transformer) Review [CDM]ViT (Vision Transformer) Review [CDM]
ViT (Vision Transformer) Review [CDM]
Dongmin Choi
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks
남주 김
 
Machine Learning - Convolutional Neural Network
Machine Learning - Convolutional Neural NetworkMachine Learning - Convolutional Neural Network
Machine Learning - Convolutional Neural Network
Richard Kuo
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual Introduction
Lukas Masuch
 
“Deploying PyTorch Models for Real-time Inference On the Edge,” a Presentatio...
“Deploying PyTorch Models for Real-time Inference On the Edge,” a Presentatio...“Deploying PyTorch Models for Real-time Inference On the Edge,” a Presentatio...
“Deploying PyTorch Models for Real-time Inference On the Edge,” a Presentatio...
Edge AI and Vision Alliance
 
R-CNN
R-CNNR-CNN
R-CNN
Mohamed Rashid
 
Introduction To TensorFlow
Introduction To TensorFlowIntroduction To TensorFlow
Introduction To TensorFlow
Spotle.ai
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural Networks
Christian Perone
 
Transformer in Vision
Transformer in VisionTransformer in Vision
Transformer in Vision
Sangmin Woo
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
Yuta Niki
 
ONNX - The Lingua Franca of Deep Learning
ONNX - The Lingua Franca of Deep LearningONNX - The Lingua Franca of Deep Learning
ONNX - The Lingua Franca of Deep Learning
Hagay Lupesko
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)
Jeong-Gwan Lee
 
CNN Attention Networks
CNN Attention NetworksCNN Attention Networks
CNN Attention Networks
Taeoh Kim
 
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Ryo Takahashi
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning
Dr. Swaminathan Kathirvel
 
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Sergey Karayev
 
ViT (Vision Transformer) Review [CDM]
ViT (Vision Transformer) Review [CDM]ViT (Vision Transformer) Review [CDM]
ViT (Vision Transformer) Review [CDM]
Dongmin Choi
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks
남주 김
 
Machine Learning - Convolutional Neural Network
Machine Learning - Convolutional Neural NetworkMachine Learning - Convolutional Neural Network
Machine Learning - Convolutional Neural Network
Richard Kuo
 
Deep learning - A Visual Introduction
Deep learning - A Visual IntroductionDeep learning - A Visual Introduction
Deep learning - A Visual Introduction
Lukas Masuch
 
“Deploying PyTorch Models for Real-time Inference On the Edge,” a Presentatio...
“Deploying PyTorch Models for Real-time Inference On the Edge,” a Presentatio...“Deploying PyTorch Models for Real-time Inference On the Edge,” a Presentatio...
“Deploying PyTorch Models for Real-time Inference On the Edge,” a Presentatio...
Edge AI and Vision Alliance
 

Similar to "Quantizing Deep Networks for Efficient Inference at the Edge," a Presentation from Facebook (20)

Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)
DonghyunKang12
 
Deep_Learning_Frameworks_CNTK_PyTorch
Deep_Learning_Frameworks_CNTK_PyTorchDeep_Learning_Frameworks_CNTK_PyTorch
Deep_Learning_Frameworks_CNTK_PyTorch
Subhashis Hazarika
 
Deep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLabDeep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLab
NECST Lab @ Politecnico di Milano
 
Morph : a novel accelerator
Morph : a novel acceleratorMorph : a novel accelerator
Morph : a novel accelerator
BaharJV
 
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
Edge AI and Vision Alliance
 
Approximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithmsApproximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithms
Sabidur Rahman
 
Efficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approachEfficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approach
jemin lee
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
Jinwon Lee
 
Introduction to computer vision with Convoluted Neural Networks
Introduction to computer vision with Convoluted Neural NetworksIntroduction to computer vision with Convoluted Neural Networks
Introduction to computer vision with Convoluted Neural Networks
MarcinJedyk
 
Introduction to computer vision
Introduction to computer visionIntroduction to computer vision
Introduction to computer vision
Marcin Jedyk
 
Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptx
AkshitAgiwal1
 
Once-for-All: Train One Network and Specialize it for Efficient Deployment
 Once-for-All: Train One Network and Specialize it for Efficient Deployment Once-for-All: Train One Network and Specialize it for Efficient Deployment
Once-for-All: Train One Network and Specialize it for Efficient Deployment
taeseon ryu
 
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...Performance Benchmarking of the R Programming Environment on the Stampede 1.5...
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...
James McCombs
 
Convolutional Neural Networks : Popular Architectures
Convolutional Neural Networks : Popular ArchitecturesConvolutional Neural Networks : Popular Architectures
Convolutional Neural Networks : Popular Architectures
ananth
 
[2020 CVPR Efficient DET paper review]
[2020 CVPR Efficient DET paper review][2020 CVPR Efficient DET paper review]
[2020 CVPR Efficient DET paper review]
taeseon ryu
 
Vertex Perspectives | AI Optimized Chipsets | Part II
Vertex Perspectives | AI Optimized Chipsets | Part IIVertex Perspectives | AI Optimized Chipsets | Part II
Vertex Perspectives | AI Optimized Chipsets | Part II
Vertex Holdings
 
Neural Networks from Scratch - TensorFlow 101
Neural Networks from Scratch - TensorFlow 101Neural Networks from Scratch - TensorFlow 101
Neural Networks from Scratch - TensorFlow 101
Gerold Bausch
 
DigitRecognition.pptx
DigitRecognition.pptxDigitRecognition.pptx
DigitRecognition.pptx
ruvex
 
Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neu...
Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neu...Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neu...
Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neu...
InVID Project
 
High Efficiency Video Codec
High Efficiency Video CodecHigh Efficiency Video Codec
High Efficiency Video Codec
Tejus Adiga M
 
Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)
DonghyunKang12
 
Deep_Learning_Frameworks_CNTK_PyTorch
Deep_Learning_Frameworks_CNTK_PyTorchDeep_Learning_Frameworks_CNTK_PyTorch
Deep_Learning_Frameworks_CNTK_PyTorch
Subhashis Hazarika
 
Morph : a novel accelerator
Morph : a novel acceleratorMorph : a novel accelerator
Morph : a novel accelerator
BaharJV
 
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap
Edge AI and Vision Alliance
 
Approximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithmsApproximation techniques used for general purpose algorithms
Approximation techniques used for general purpose algorithms
Sabidur Rahman
 
Efficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approachEfficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approach
jemin lee
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
Jinwon Lee
 
Introduction to computer vision with Convoluted Neural Networks
Introduction to computer vision with Convoluted Neural NetworksIntroduction to computer vision with Convoluted Neural Networks
Introduction to computer vision with Convoluted Neural Networks
MarcinJedyk
 
Introduction to computer vision
Introduction to computer visionIntroduction to computer vision
Introduction to computer vision
Marcin Jedyk
 
Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptx
AkshitAgiwal1
 
Once-for-All: Train One Network and Specialize it for Efficient Deployment
 Once-for-All: Train One Network and Specialize it for Efficient Deployment Once-for-All: Train One Network and Specialize it for Efficient Deployment
Once-for-All: Train One Network and Specialize it for Efficient Deployment
taeseon ryu
 
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...Performance Benchmarking of the R Programming Environment on the Stampede 1.5...
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...
James McCombs
 
Convolutional Neural Networks : Popular Architectures
Convolutional Neural Networks : Popular ArchitecturesConvolutional Neural Networks : Popular Architectures
Convolutional Neural Networks : Popular Architectures
ananth
 
[2020 CVPR Efficient DET paper review]
[2020 CVPR Efficient DET paper review][2020 CVPR Efficient DET paper review]
[2020 CVPR Efficient DET paper review]
taeseon ryu
 
Vertex Perspectives | AI Optimized Chipsets | Part II
Vertex Perspectives | AI Optimized Chipsets | Part IIVertex Perspectives | AI Optimized Chipsets | Part II
Vertex Perspectives | AI Optimized Chipsets | Part II
Vertex Holdings
 
Neural Networks from Scratch - TensorFlow 101
Neural Networks from Scratch - TensorFlow 101Neural Networks from Scratch - TensorFlow 101
Neural Networks from Scratch - TensorFlow 101
Gerold Bausch
 
DigitRecognition.pptx
DigitRecognition.pptxDigitRecognition.pptx
DigitRecognition.pptx
ruvex
 
Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neu...
Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neu...Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neu...
Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neu...
InVID Project
 
High Efficiency Video Codec
High Efficiency Video CodecHigh Efficiency Video Codec
High Efficiency Video Codec
Tejus Adiga M
 
Ad

More from Edge AI and Vision Alliance (20)

“Improved Data Sampling Techniques for Training Neural Networks,” a Presentat...
“Improved Data Sampling Techniques for Training Neural Networks,” a Presentat...“Improved Data Sampling Techniques for Training Neural Networks,” a Presentat...
“Improved Data Sampling Techniques for Training Neural Networks,” a Presentat...
Edge AI and Vision Alliance
 
“Cost-efficient, High-quality AI for Consumer-grade Smart Home Cameras,” a Pr...
“Cost-efficient, High-quality AI for Consumer-grade Smart Home Cameras,” a Pr...“Cost-efficient, High-quality AI for Consumer-grade Smart Home Cameras,” a Pr...
“Cost-efficient, High-quality AI for Consumer-grade Smart Home Cameras,” a Pr...
Edge AI and Vision Alliance
 
“Edge AI Optimization on Rails—Literally,” a Presentation from Wabtec
“Edge AI Optimization on Rails—Literally,” a Presentation from Wabtec“Edge AI Optimization on Rails—Literally,” a Presentation from Wabtec
“Edge AI Optimization on Rails—Literally,” a Presentation from Wabtec
Edge AI and Vision Alliance
 
“How Large Language Models Are Impacting Computer Vision,” a Presentation fro...
“How Large Language Models Are Impacting Computer Vision,” a Presentation fro...“How Large Language Models Are Impacting Computer Vision,” a Presentation fro...
“How Large Language Models Are Impacting Computer Vision,” a Presentation fro...
Edge AI and Vision Alliance
 
“Implementing AI/Computer Vision for Corporate Security Surveillance,” a Pres...
“Implementing AI/Computer Vision for Corporate Security Surveillance,” a Pres...“Implementing AI/Computer Vision for Corporate Security Surveillance,” a Pres...
“Implementing AI/Computer Vision for Corporate Security Surveillance,” a Pres...
Edge AI and Vision Alliance
 
“Continual Learning thru Sequential, Lightweight Optimization,” a Presentatio...
“Continual Learning thru Sequential, Lightweight Optimization,” a Presentatio...“Continual Learning thru Sequential, Lightweight Optimization,” a Presentatio...
“Continual Learning thru Sequential, Lightweight Optimization,” a Presentatio...
Edge AI and Vision Alliance
 
“Multi-object Tracking Systems,” a Presentation from Tryolabs
“Multi-object Tracking Systems,” a Presentation from Tryolabs“Multi-object Tracking Systems,” a Presentation from Tryolabs
“Multi-object Tracking Systems,” a Presentation from Tryolabs
Edge AI and Vision Alliance
 
“Improved Navigation Assistance for the Blind via Real-time Edge AI,” a Prese...
“Improved Navigation Assistance for the Blind via Real-time Edge AI,” a Prese...“Improved Navigation Assistance for the Blind via Real-time Edge AI,” a Prese...
“Improved Navigation Assistance for the Blind via Real-time Edge AI,” a Prese...
Edge AI and Vision Alliance
 
“Using Vision Systems, Generative Models and Reinforcement Learning for Sport...
“Using Vision Systems, Generative Models and Reinforcement Learning for Sport...“Using Vision Systems, Generative Models and Reinforcement Learning for Sport...
“Using Vision Systems, Generative Models and Reinforcement Learning for Sport...
Edge AI and Vision Alliance
 
“Introduction to Cameras for Embedded Applications,” a Presentation from Sens...
“Introduction to Cameras for Embedded Applications,” a Presentation from Sens...“Introduction to Cameras for Embedded Applications,” a Presentation from Sens...
“Introduction to Cameras for Embedded Applications,” a Presentation from Sens...
Edge AI and Vision Alliance
 
“Introduction to Modern Radar for Machine Perception,” a Presentation from Se...
“Introduction to Modern Radar for Machine Perception,” a Presentation from Se...“Introduction to Modern Radar for Machine Perception,” a Presentation from Se...
“Introduction to Modern Radar for Machine Perception,” a Presentation from Se...
Edge AI and Vision Alliance
 
“Diagnosing Problems and Implementing Solutions for Deep Neural Network Train...
“Diagnosing Problems and Implementing Solutions for Deep Neural Network Train...“Diagnosing Problems and Implementing Solutions for Deep Neural Network Train...
“Diagnosing Problems and Implementing Solutions for Deep Neural Network Train...
Edge AI and Vision Alliance
 
“Seeing Through Machines: A Guide to Image Sensors for Edge AI Applications,”...
“Seeing Through Machines: A Guide to Image Sensors for Edge AI Applications,”...“Seeing Through Machines: A Guide to Image Sensors for Edge AI Applications,”...
“Seeing Through Machines: A Guide to Image Sensors for Edge AI Applications,”...
Edge AI and Vision Alliance
 
“Transformer Networks: How They Work and Why They Matter,” a Presentation fro...
“Transformer Networks: How They Work and Why They Matter,” a Presentation fro...“Transformer Networks: How They Work and Why They Matter,” a Presentation fro...
“Transformer Networks: How They Work and Why They Matter,” a Presentation fro...
Edge AI and Vision Alliance
 
“Removing Weather-related Image Degradation at the Edge,” a Presentation from...
“Removing Weather-related Image Degradation at the Edge,” a Presentation from...“Removing Weather-related Image Degradation at the Edge,” a Presentation from...
“Removing Weather-related Image Degradation at the Edge,” a Presentation from...
Edge AI and Vision Alliance
 
“Seeing the Invisible: Unveiling Hidden Details through Advanced Image Acquis...
“Seeing the Invisible: Unveiling Hidden Details through Advanced Image Acquis...“Seeing the Invisible: Unveiling Hidden Details through Advanced Image Acquis...
“Seeing the Invisible: Unveiling Hidden Details through Advanced Image Acquis...
Edge AI and Vision Alliance
 
“Data-efficient and Generalizable: The Domain-specific Small Vision Model Rev...
“Data-efficient and Generalizable: The Domain-specific Small Vision Model Rev...“Data-efficient and Generalizable: The Domain-specific Small Vision Model Rev...
“Data-efficient and Generalizable: The Domain-specific Small Vision Model Rev...
Edge AI and Vision Alliance
 
“Omnilert Gun Detect: Harnessing Computer Vision to Tackle Gun Violence,” a P...
“Omnilert Gun Detect: Harnessing Computer Vision to Tackle Gun Violence,” a P...“Omnilert Gun Detect: Harnessing Computer Vision to Tackle Gun Violence,” a P...
“Omnilert Gun Detect: Harnessing Computer Vision to Tackle Gun Violence,” a P...
Edge AI and Vision Alliance
 
“Adventures in Moving a Computer Vision Solution from Cloud to Edge,” a Prese...
“Adventures in Moving a Computer Vision Solution from Cloud to Edge,” a Prese...“Adventures in Moving a Computer Vision Solution from Cloud to Edge,” a Prese...
“Adventures in Moving a Computer Vision Solution from Cloud to Edge,” a Prese...
Edge AI and Vision Alliance
 
“Bridging Vision and Language: Designing, Training and Deploying Multimodal L...
“Bridging Vision and Language: Designing, Training and Deploying Multimodal L...“Bridging Vision and Language: Designing, Training and Deploying Multimodal L...
“Bridging Vision and Language: Designing, Training and Deploying Multimodal L...
Edge AI and Vision Alliance
 
“Improved Data Sampling Techniques for Training Neural Networks,” a Presentat...
“Improved Data Sampling Techniques for Training Neural Networks,” a Presentat...“Improved Data Sampling Techniques for Training Neural Networks,” a Presentat...
“Improved Data Sampling Techniques for Training Neural Networks,” a Presentat...
Edge AI and Vision Alliance
 
“Cost-efficient, High-quality AI for Consumer-grade Smart Home Cameras,” a Pr...
“Cost-efficient, High-quality AI for Consumer-grade Smart Home Cameras,” a Pr...“Cost-efficient, High-quality AI for Consumer-grade Smart Home Cameras,” a Pr...
“Cost-efficient, High-quality AI for Consumer-grade Smart Home Cameras,” a Pr...
Edge AI and Vision Alliance
 
“Edge AI Optimization on Rails—Literally,” a Presentation from Wabtec
“Edge AI Optimization on Rails—Literally,” a Presentation from Wabtec“Edge AI Optimization on Rails—Literally,” a Presentation from Wabtec
“Edge AI Optimization on Rails—Literally,” a Presentation from Wabtec
Edge AI and Vision Alliance
 
“How Large Language Models Are Impacting Computer Vision,” a Presentation fro...
“How Large Language Models Are Impacting Computer Vision,” a Presentation fro...“How Large Language Models Are Impacting Computer Vision,” a Presentation fro...
“How Large Language Models Are Impacting Computer Vision,” a Presentation fro...
Edge AI and Vision Alliance
 
“Implementing AI/Computer Vision for Corporate Security Surveillance,” a Pres...
“Implementing AI/Computer Vision for Corporate Security Surveillance,” a Pres...“Implementing AI/Computer Vision for Corporate Security Surveillance,” a Pres...
“Implementing AI/Computer Vision for Corporate Security Surveillance,” a Pres...
Edge AI and Vision Alliance
 
“Continual Learning thru Sequential, Lightweight Optimization,” a Presentatio...
“Continual Learning thru Sequential, Lightweight Optimization,” a Presentatio...“Continual Learning thru Sequential, Lightweight Optimization,” a Presentatio...
“Continual Learning thru Sequential, Lightweight Optimization,” a Presentatio...
Edge AI and Vision Alliance
 
“Multi-object Tracking Systems,” a Presentation from Tryolabs
“Multi-object Tracking Systems,” a Presentation from Tryolabs“Multi-object Tracking Systems,” a Presentation from Tryolabs
“Multi-object Tracking Systems,” a Presentation from Tryolabs
Edge AI and Vision Alliance
 
“Improved Navigation Assistance for the Blind via Real-time Edge AI,” a Prese...
“Improved Navigation Assistance for the Blind via Real-time Edge AI,” a Prese...“Improved Navigation Assistance for the Blind via Real-time Edge AI,” a Prese...
“Improved Navigation Assistance for the Blind via Real-time Edge AI,” a Prese...
Edge AI and Vision Alliance
 
“Using Vision Systems, Generative Models and Reinforcement Learning for Sport...
“Using Vision Systems, Generative Models and Reinforcement Learning for Sport...“Using Vision Systems, Generative Models and Reinforcement Learning for Sport...
“Using Vision Systems, Generative Models and Reinforcement Learning for Sport...
Edge AI and Vision Alliance
 
“Introduction to Cameras for Embedded Applications,” a Presentation from Sens...
“Introduction to Cameras for Embedded Applications,” a Presentation from Sens...“Introduction to Cameras for Embedded Applications,” a Presentation from Sens...
“Introduction to Cameras for Embedded Applications,” a Presentation from Sens...
Edge AI and Vision Alliance
 
“Introduction to Modern Radar for Machine Perception,” a Presentation from Se...
“Introduction to Modern Radar for Machine Perception,” a Presentation from Se...“Introduction to Modern Radar for Machine Perception,” a Presentation from Se...
“Introduction to Modern Radar for Machine Perception,” a Presentation from Se...
Edge AI and Vision Alliance
 
“Diagnosing Problems and Implementing Solutions for Deep Neural Network Train...
“Diagnosing Problems and Implementing Solutions for Deep Neural Network Train...“Diagnosing Problems and Implementing Solutions for Deep Neural Network Train...
“Diagnosing Problems and Implementing Solutions for Deep Neural Network Train...
Edge AI and Vision Alliance
 
“Seeing Through Machines: A Guide to Image Sensors for Edge AI Applications,”...
“Seeing Through Machines: A Guide to Image Sensors for Edge AI Applications,”...“Seeing Through Machines: A Guide to Image Sensors for Edge AI Applications,”...
“Seeing Through Machines: A Guide to Image Sensors for Edge AI Applications,”...
Edge AI and Vision Alliance
 
“Transformer Networks: How They Work and Why They Matter,” a Presentation fro...
“Transformer Networks: How They Work and Why They Matter,” a Presentation fro...“Transformer Networks: How They Work and Why They Matter,” a Presentation fro...
“Transformer Networks: How They Work and Why They Matter,” a Presentation fro...
Edge AI and Vision Alliance
 
“Removing Weather-related Image Degradation at the Edge,” a Presentation from...
“Removing Weather-related Image Degradation at the Edge,” a Presentation from...“Removing Weather-related Image Degradation at the Edge,” a Presentation from...
“Removing Weather-related Image Degradation at the Edge,” a Presentation from...
Edge AI and Vision Alliance
 
“Seeing the Invisible: Unveiling Hidden Details through Advanced Image Acquis...
“Seeing the Invisible: Unveiling Hidden Details through Advanced Image Acquis...“Seeing the Invisible: Unveiling Hidden Details through Advanced Image Acquis...
“Seeing the Invisible: Unveiling Hidden Details through Advanced Image Acquis...
Edge AI and Vision Alliance
 
“Data-efficient and Generalizable: The Domain-specific Small Vision Model Rev...
“Data-efficient and Generalizable: The Domain-specific Small Vision Model Rev...“Data-efficient and Generalizable: The Domain-specific Small Vision Model Rev...
“Data-efficient and Generalizable: The Domain-specific Small Vision Model Rev...
Edge AI and Vision Alliance
 
“Omnilert Gun Detect: Harnessing Computer Vision to Tackle Gun Violence,” a P...
“Omnilert Gun Detect: Harnessing Computer Vision to Tackle Gun Violence,” a P...“Omnilert Gun Detect: Harnessing Computer Vision to Tackle Gun Violence,” a P...
“Omnilert Gun Detect: Harnessing Computer Vision to Tackle Gun Violence,” a P...
Edge AI and Vision Alliance
 
“Adventures in Moving a Computer Vision Solution from Cloud to Edge,” a Prese...
“Adventures in Moving a Computer Vision Solution from Cloud to Edge,” a Prese...“Adventures in Moving a Computer Vision Solution from Cloud to Edge,” a Prese...
“Adventures in Moving a Computer Vision Solution from Cloud to Edge,” a Prese...
Edge AI and Vision Alliance
 
“Bridging Vision and Language: Designing, Training and Deploying Multimodal L...
“Bridging Vision and Language: Designing, Training and Deploying Multimodal L...“Bridging Vision and Language: Designing, Training and Deploying Multimodal L...
“Bridging Vision and Language: Designing, Training and Deploying Multimodal L...
Edge AI and Vision Alliance
 
Ad

Recently uploaded (20)

Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Raffi Khatchadourian
 
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptxSmart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Seasia Infotech
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
machines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdfmachines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdf
AmirStern2
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?
Eric Torreborre
 
AI You Can Trust: The Critical Role of Governance and Quality.pdf
AI You Can Trust: The Critical Role of Governance and Quality.pdfAI You Can Trust: The Critical Role of Governance and Quality.pdf
AI You Can Trust: The Critical Role of Governance and Quality.pdf
Precisely
 
Unlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web AppsUnlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web Apps
Maximiliano Firtman
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
Config 2025 presentation recap covering both days
Config 2025 presentation recap covering both daysConfig 2025 presentation recap covering both days
Config 2025 presentation recap covering both days
TrishAntoni1
 
AsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API DesignAsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API Design
leonid54
 
Cybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and MitigationCybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and Mitigation
VICTOR MAESTRE RAMIREZ
 
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
Ivano Malavolta
 
AI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of DocumentsAI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of Documents
UiPathCommunity
 
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier VroomAI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
UXPA Boston
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
SOFTTECHHUB
 
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Mike Mingos
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Raffi Khatchadourian
 
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptxSmart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Seasia Infotech
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
machines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdfmachines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdf
AmirStern2
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?Shoehorning dependency injection into a FP language, what does it take?
Shoehorning dependency injection into a FP language, what does it take?
Eric Torreborre
 
AI You Can Trust: The Critical Role of Governance and Quality.pdf
AI You Can Trust: The Critical Role of Governance and Quality.pdfAI You Can Trust: The Critical Role of Governance and Quality.pdf
AI You Can Trust: The Critical Role of Governance and Quality.pdf
Precisely
 
Unlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web AppsUnlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web Apps
Maximiliano Firtman
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
Config 2025 presentation recap covering both days
Config 2025 presentation recap covering both daysConfig 2025 presentation recap covering both days
Config 2025 presentation recap covering both days
TrishAntoni1
 
AsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API DesignAsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API Design
leonid54
 
Cybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and MitigationCybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and Mitigation
VICTOR MAESTRE RAMIREZ
 
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
Ivano Malavolta
 
AI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of DocumentsAI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of Documents
UiPathCommunity
 
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier VroomAI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
UXPA Boston
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
SOFTTECHHUB
 
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Mike Mingos
 

"Quantizing Deep Networks for Efficient Inference at the Edge," a Presentation from Facebook

  • 1. Quantizing Deep Networks for Efficient Inference at the edge Raghu Krishnamoorthi, Facebook Questions/Feedback: [email protected]
  • 2. Acknowledgements • Results presented here are from work done at Google as part of the Tensorflow lite team and work at facebook as part of the pytorch team. • Acknowledge contributions from several colleagues at Google including: Benoit Jacob, Skiramantas Kligys, Dmitry Kalachenko, Suharsh Sivakumar and Pete Warden. • Also acknowledge work from colleagues at facebook: Jongsoo Park, Maxim Naumov, Summer Deng, Marat Dukhan, Bichen Wu, Peizhao Zhang, Jerry Zhang, Dmytro Dzhulgakov, Daya Khudia, Jianyu Huang, James Reed, Mikhail Z, Haixin Liu and Peter Vajda.
  • 3. Outline • Motivation • Quantization: Overview • Quantizing deep networks • Post Training quantization • Quantization aware training • Lower precision inference • Hardware accelerator recommendations • Model system co-design • Looking ahead
  • 4. Motivation(1) • Data-center power consumption is doubling every year Source: Deep Learning Inference in Facebook Data-Centers [1]
  • 5. Motivation(2) • Number of edge devices is growing rapidly, lots of these devices are resource constrained. Source: https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/
  • 6. Motivation(3) • While models are becoming more efficient, high accuracy still implies high complexity From: Benchmark Analysis of Representative Deep Neural Network Architectures, Simone Bianco et al,
  • 7. Quantization • Many approaches to solve the problems outlined here: • Better hardware accelerators: TPUs => Requires new custom hardware • Optimized kernels: Cudnn, Intel MKL-DNN • Efficient deep network architectures: Nasnet, Mobilenet, FBNet => Requires new architectures • A simpler approach that does not require re-design of models/new hardware is quantization. • Quantization refers to techniques to perform computation and storage at reduced precision • Works in combination with above approaches • Requires optimized kernels to efficiently use existing hardware.
  • 8. Background: Quantization(1) • Quantization refers to mapping values from fp32 to a lower precision format. • Specified by • Format • Mapping type • Granularity From:https://commons.wikimedia.org/w/index. php?curid=69415943 fp32 fp16 bfloat16 int8 int4 binary
  • 9. Background:Quantization(2) • We also consider different granularities of quantization: • Per layer quantization • Same mapping for all elements in a layer. • Per row/ Per-channel quantization: • Choose quantizer parameters independently for each row (fc layers) or for each conv kernel (conv layers) • Outlier aware quantization: • Separate outliers to use lower precision arithmetic for bulk of weights. • Dense computations for inliers with sparse computation for outliers that have a large magnitude
  • 10. Modeling quantization during training • Emulate quantization by quantizing and de-quantizing in succession • Values are still in floating point, but with reduced precision • 𝑥 𝑜𝑢𝑡 = 𝐹𝑎𝑘𝑒𝑄𝑢𝑎𝑛𝑡 𝑥 = 𝑠. 𝐶𝑙𝑎𝑚𝑝 𝑟𝑜𝑢𝑛𝑑 𝑥 𝑠 − 𝑧 + 𝑧 = 𝐷𝑒𝑄𝑢𝑎𝑛𝑡(𝑄𝑢𝑎𝑛𝑡 𝑥 ) • Can also model quantization as a stochastic rounding operation Fake Quantizer (top), showing the quantization of output values. Approximation for purposes of derivative calculation (bottom).
  • 11. Quantization:Benefits Benefits Quantization Applicability Broad applicability across models and use cases Support Supported by x86, Nvidia Volta, ARM, Mali, QDSP Software Support Kernel libraries widely available Memory Size 4x reduction Memory Bandwidth/Cache 4x reduction Compute 2x to 4x speedup, depending on ISA Power Typically 4x (dominated by memory access) o Comparing float32 implementations with 8 bit inference
  • 12. Quantization: Challenges Challenges Notes Mitigation Accuracy drop Loss in accuracy can be too high for certain applications Quantization aware training Kernel Support Wide variety of operators+multiple hardware platforms Improving software tool-chain (TVM) to handle varied backends. “Complexity” Non-trivial: Requires calibration/training in some cases Support in software packages: TensorRT, Tensorflow and Pytorch
  • 14. Model Quantization: Overview Train Convert for inference Graph Optimization Kernel Implementation Train Convert for inference Graph Optimization Kernel Implementation Quantization Train Convert for inference Graph Optimization Kernel Implementation Quantization Fake Quantization Quantization
15. What to quantize?
• Only quantize the parts of the network that contribute significantly to performance.
• Use roofline analysis to identify compute-bound vs memory-bandwidth-bound operations (a sketch follows below).
• May need to further restrict quantization based on accuracy impact.
• There are multiple ways to quantize a network, with different impact:

Quantization scheme | Memory BW reduction (weights) | Memory BW reduction (activations) | Compute speedup | Notes
Weight-only quantization to int8 | 4x | 1x | 1x | Suitable for embedding lookups
Dynamic quantization | 4x | 1x | 2x | Suitable for fc layers with small batches
Static quantization (int32 accumulators) | 4x | 4x | 2x | Suited for all layers, important for convolutions
Static quantization (int16 accumulators) | 4x | 4x | 4x | Requires lower precision weights/activations
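For the roofline analysis mentioned above, a quick arithmetic-intensity estimate tells you whether a layer is compute or bandwidth bound. A sketch for a fully connected layer (the batch size and dimensions are made-up examples):

```python
def fc_arithmetic_intensity(batch, in_dim, out_dim, bytes_per_elem=4):
    """FLOPs per byte of memory traffic for a fully connected layer (rough)."""
    flops = 2 * batch * in_dim * out_dim                        # multiply-accumulates
    bytes_moved = bytes_per_elem * (in_dim * out_dim            # weights
                                    + batch * (in_dim + out_dim))  # activations
    return flops / bytes_moved

# Batch 1, 1024x1024 fc: ~0.5 FLOPs/byte -> memory bound, and the traffic is
# dominated by weights, so weight-only or dynamic quantization already helps.
print(fc_arithmetic_intensity(batch=1, in_dim=1024, out_dim=1024))
```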
16. Post-training quantization: weight compression
• The simplest quantization scheme is to compress the weights to lower precision.
• Requires no input data and can be done statically as part of preparing a model for inference.
• Hardware accelerators can benefit if de-compression is done after the memory access.
• Trivial for fp16/int8 quantization of weights (a sketch follows below).
• K-means compression is also supported on select platforms and is amenable to simple de-compression:
  • Scatter-gather operation in select processors
  • Supported in CoreML
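A minimal sketch of weight-only compression (numpy, symmetric per-tensor int8 for simplicity; this is illustrative, not any particular framework's implementation):

```python
import numpy as np

def quantize_weights_int8(w):
    """Compress an fp32 weight tensor to int8 plus a single scale."""
    scale = np.abs(w).max() / 127.0          # map the max magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate fp32 weights, e.g. after the memory access."""
    return q.astype(np.float32) * scale

w = np.random.randn(64, 128).astype(np.float32)
q, s = quantize_weights_int8(w)              # 4x smaller to store and move
w_hat = dequantize(q, s)                     # used for fp32 compute
```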
17. Dynamic quantization
• Dynamic quantization refers to schemes where activations are read and written in fp32 and are dynamically quantized to lower precision for compute.
• Requires no calibration data.
• Data exchanged between operations stays in floating point, so there is no need to worry about format conversion.
• Provides performance improvements close to static quantization when memory access is dominated by weights.
• Suitable for inference in RNNs; smaller gains for conv layers.
• Supported by PyTorch and TensorFlow Lite (see the sketch below).
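In PyTorch this is a one-line transformation using the eager-mode `torch.quantization.quantize_dynamic` API; the toy model and the choice to cover only Linear layers are just examples:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# Weights are quantized ahead of time; activations are quantized on the
# fly per batch, so no calibration data is needed.
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
out = qmodel(torch.randn(1, 256))
```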
18. Quantizing weights and activations
• Post-training quantization refers to quantizing both weights and activations to reduced precision, typically int8.
• Requires estimating statistics of the activations to determine quantizer parameters.
• Quantizer parameters are chosen by minimizing an error metric (a calibration sketch follows below):
  • KL divergence: TensorRT
  • Saturation error: TensorFlow Lite
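A sketch of the calibration idea: sweep candidate clipping ranges over observed activations and keep the one that minimizes reconstruction error. MSE is used here as a stand-in for the KL-divergence and saturation metrics named above:

```python
import numpy as np

def calibrate_scale(activations, num_steps=100, qmax=255):
    """Pick an unsigned 8-bit clipping range minimizing quantization MSE."""
    best_scale, best_err = None, np.inf
    max_val = activations.max()
    for clip in np.linspace(max_val / num_steps, max_val, num_steps):
        scale = clip / qmax
        q = np.clip(np.round(activations / scale), 0, qmax)
        err = np.mean((activations - q * scale) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

acts = np.abs(np.random.randn(10000)) * 3.0   # stand-in calibration data
scale = calibrate_scale(acts)
```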
20. Setup
• Standard classification model architectures, evaluated on the ImageNet validation dataset.
• Results obtained using TensorFlow; more details at https://arxiv.org/abs/1806.08342
• More results, and PyTorch support for quantization, to be announced at the PyTorch DevCon on October 10th.
21. Post-training quantization: Results (top-1 accuracy)

Network | Asymmetric, per-layer | Symmetric, per-channel | Asymmetric, per-channel | Activation-only quantized | Weight-only, symmetric, per-channel | Floating point
Mobilenet-v2-1-224 | 0.001 | 0.698 | 0.697 | 0.7 | 0.698 | 0.719
Mobilenet-v1-1-224 | 0.001 | 0.591 | 0.703 | 0.708 | 0.591 | 0.709
Nasnet-Mobile | 0.72 | 0.72 | 0.74 | 0.74 | 0.72 | 0.74
Inception-v3 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78
Resnet-v1-50 | 0.75 | 0.75 | 0.75 | 0.751 | 0.75 | 0.752

• 8 bits for weights and activations is sufficient for common CV classification tasks.
• Smaller networks are "harder" to quantize.
• At 8 bits, the accuracy drop is dominated by weight quantization.
23. Results: quantization-aware training (top-1 accuracy)

Network | Asymmetric, per-layer, post-training quant | Symmetric, per-channel, post-training quant | Asymmetric, per-layer, QAT | Symmetric, per-channel, QAT | Floating point
Mobilenet-v2-1-224 | 0.001 | 0.698 | 0.709 | 0.711 | 0.719
Mobilenet-v1-1-224 | 0.001 | 0.591 | 0.70 | 0.707 | 0.709
Nasnet-Mobile | 0.72 | 0.72 | 0.73 | 0.73 | 0.74
Inception-v3 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78
Resnet-v1-50 | 0.75 | 0.75 | 0.75 | 0.751 | 0.752

• Quantization-aware training provides the best accuracy and allows for simpler quantization schemes (a workflow sketch follows below).
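A sketch of the eager-mode QAT workflow in PyTorch. The tiny model, qconfig choice, and training loop are placeholders; in practice one fine-tunes from a pretrained floating point checkpoint, as recommended later in the deck:

```python
import torch
import torch.quantization as tq

class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()       # marks where fp32 -> int8
        self.fc = torch.nn.Linear(32, 10)
        self.dequant = tq.DeQuantStub()   # marks where int8 -> fp32
    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyNet()                          # in practice: a pretrained fp32 model
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
tq.prepare_qat(model, inplace=True)        # inserts fake-quant modules

# Fine-tune with fake quantization in the graph (toy loop on random data).
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(10):
    x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

qmodel = tq.convert(model.eval())          # real int8 model for inference
```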
24. Performance: operator-level benchmarks
[Chart: server-side speedups, FBGEMM (quantized) vs MKL-DNN (fp32).]
25. Performance: model-level benchmarks
[Chart: mobile inference time, float vs quantized, TFLite on Pixel 2.]
• QNNPACK kernels provide an additional 2x speedup.
26. Lower precision inference (1)
• Four-bit precision for weights provides good accuracy, but needs to be applied selectively (a quantization sketch follows below).
• Larger networks are more robust to lower precision.
• Quantization-aware training is critical.
• Selectively quantizing layers of a network to different precisions can reduce the accuracy drop.
[Table: 4-bit weights, 8-bit activations: top-1 accuracy results.]
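A sketch of symmetric 4-bit per-channel weight quantization (numpy; illustrative, not a specific framework's kernel — note only 15 of the 16 int4 levels are used here for symmetry):

```python
import numpy as np

def quantize_weights_int4_per_channel(w):
    """Symmetric 4-bit quantization, one scale per output channel."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0   # use levels [-7, 7]
    q = np.clip(np.round(w / scales), -7, 7).astype(np.int8)
    return q, scales

w = np.random.randn(64, 128).astype(np.float32)           # (out_ch, in_ch)
q, scales = quantize_weights_int4_per_channel(w)
w_hat = q.astype(np.float32) * scales                      # dequantized view
```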
27. Lower precision inference (2)
• Different layers of a neural network have different sensitivity to quantization errors.
• Exciting work on differentiable architecture search [7] for determining precision allocations across layers, showing excellent performance.
28. Architecture trade-offs (1)
• There is a clear trade-off between the number of parameters and robustness to quantization.
• One can also trade off the number of feature maps against precision: having 2x the feature maps at 4 bits is better than 8-bit quantization of the base network.
29. Architecture trade-offs (2)
• Restricting the ranges of activations a priori can hurt accuracy.
• It is preferable to learn activation ranges instead of fixing them beforehand (see the sketch below).
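One way to learn activation ranges is a trainable clipping threshold, in the spirit of PACT-style learned clipping (this sketch is not from the deck; the initial value 6.0 is an arbitrary starting point):

```python
import torch

class LearnedClip(torch.nn.Module):
    """ReLU with a trainable upper bound: the activation range seen by the
    quantizer is learned during training instead of fixed a priori."""
    def __init__(self, init_alpha=6.0):
        super().__init__()
        self.alpha = torch.nn.Parameter(torch.tensor(init_alpha))

    def forward(self, x):
        # Gradient w.r.t. alpha is nonzero wherever the clip is active,
        # letting training shrink or grow the quantization range.
        return torch.minimum(torch.relu(x), self.alpha)

act = LearnedClip()
y = act(torch.randn(8, 16))
```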
30. Co-design: training quantized models
• Designing models that quantize well requires co-design of the model architecture, the training algorithms, and the hardware.
• Specific training enhancements include:
  • Fine-tune from floating point models when building quantized models.
  • Freeze batch normalization statistics updates to exactly model inference, for further benefits.
  • Model the exact rounding arithmetic done in hardware during training.
  • Stochastic quantization produces models robust to random perturbations of the weights, but underperforms techniques that model quantization as done at inference.
• Other enhancements to improve accuracy:
  • Use distillation to train a quantized student from a floating point teacher network [3] (see the sketch below).
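A sketch of the distillation setup [3]: the quantized student is trained against the floating point teacher's soft targets. The temperature and loss weighting below are typical but arbitrary choices:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Blend soft-target KL (teacher -> student) with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# teacher: frozen fp32 model; student: fake-quantized model being trained.
s = torch.randn(8, 10, requires_grad=True)
t = torch.randn(8, 10)
loss = distillation_loss(s, t, torch.randint(0, 10, (8,)))
```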
32. Hardware accelerator recommendations: basics
• Optimize memory bandwidth: it is a first-order predictor of power consumption.
• Don't ignore activations: most literature focuses on weights, but activations can be very significant for large-resolution inputs.
• Fuse multiple operations.
• Have floating point support as a backup, to avoid switching compute to different hardware.
• Optimize for GEMM: still the workhorse for most DNN applications.
• Support low-precision inference: 8 bits is required, but supporting lower precisions can provide very high throughput.
33. Hardware accelerator recommendations: software
• Don't forget the software toolchain! Make it easy for customers to use the hardware; integration with TensorFlow/PyTorch is important.
• Writing optimized kernels for new hardware is hard: most implementations optimize for a specific set of models, with poor performance on the kernels needed for other models.
34. Hardware accelerator recommendations: differentiation
• Build a strategy for operator support:
  • Take a close look at the TVM/MLIR efforts.
  • Use code generation along with hand-written kernels.
• To get the best out of hardware with quantization, provide exact models of hardware kernels for integration with training frameworks.
• Consider other techniques beyond quantization: sparsity, k-means compression, dynamic/adaptive execution.
• Don't forget privacy:
  • Secure aggregation and homomorphic encryption are becoming increasingly important.
  • Training at the edge: depending on the application, this can be very important for privacy and personalization.
35. References
1. J. Park, M. Naumov, et al., "Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications"
2. S. Bianco et al., "Benchmark Analysis of Representative Deep Neural Network Architectures"
3. A. Polino et al., "Model Compression via Distillation and Quantization"
4. B. Jacob, S. Kligys, et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference"
5. M. Courbariaux, Y. Bengio, et al., "BinaryConnect: Training Deep Neural Networks with Binary Weights During Propagations"
6. R. Krishnamoorthi, "Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper"
7. B. Wu et al., "Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search"