INTEGER QUANTIZATION FOR DEEP
LEARNING INFERENCE: PRINCIPLES
AND EMPIRICAL EVALUATION
NVIDIA, Georgia Tech.
arXiv 2020
Presenter: Jemin Lee
https://leejaymin.github.io/index.html
7 Oct. 2020
Neural Acceleration Study Season #3
Contents
• Introduction and Background
• Quantization Fundamentals
• Post-Training Quantization
• Techniques to Recover Accuracy
• Recommended Workflow
• Conclusion
Introduction
• Low-precision formats offer several performance benefits
• First, many processors provide higher-throughput math pipelines for low-bit
formats, which can speed up math-intensive operations such as convolutions and
matrix multiplications.
• Second, smaller word sizes reduce memory bandwidth pressure, improving
performance for bandwidth-limited computations.
• Third, smaller word sizes lead to lower memory size requirements, which can
improve cache utilization as well as other aspects of memory-system operation.
Introduction
• Review the mathematical fundamentals underlying various
integer quantization choices (Section 3) as well as techniques
for recovering accuracy lost due to quantization (Section 5).
Section 6 combines this information into a recommended
workflow.
Related Work
• Uniform [7,59] quantization enables the use of integer or fixed-
point math pipelines, allowing computation to be performed in the
quantized domain.
• Non-uniform [13, 60, 34, 2] quantization requires dequantization, e.g.
a codebook lookup, before doing computation in higher precision,
limiting its benefits to model compression and bandwidth reduction.
• This paper focuses on leveraging quantization to accelerate
computation, so we will restrict our focus to uniform quantization
schemes.
• In this paper, we present an evaluation of int8 quantization on all of
the major network architectures with both PTQ and QAT.
Quantization Fundamentals
• Uniform quantization steps
• 1) Choose the range of the real numbers to be quantized, clamping the
values outside this range.
• 2) Map the real values to integers representable by the bit-width of the
quantized representation (round each mapped real value to the closest
integer value).
• Quantize: convert a real number to a quantized integer
representation (e.g. from fp32 to int8).
• Dequantize: convert a number from quantized integer
representation to a real number (e.g. from int32 to fp16).
Quantization Fundamentals: Affine Quantization (Asymmetric)
• f(x) = s·x + z
• s is the scale factor and z is the zero-point
• For int8, given a chosen real range [α, β]:
• s = 255 / (β − α), z = −round(α·s) − 128 (so that α ↦ −128 and β ↦ 127)
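A minimal NumPy sketch of these two steps (clamp via clip, then round); the function names are illustrative, not from the paper:

```python
import numpy as np

def affine_quantize(x, alpha, beta):
    """Affine (asymmetric) int8 quantization over the real range [alpha, beta]."""
    s = 255.0 / (beta - alpha)             # scale factor
    z = -np.round(alpha * s) - 128         # zero-point: alpha maps to -128
    x_q = np.clip(np.round(s * x + z), -128, 127).astype(np.int8)
    return x_q, s, z

def affine_dequantize(x_q, s, z):
    """Approximate inverse: x ~= (x_q - z) / s."""
    return (x_q.astype(np.float32) - z) / s

x = np.array([-1.5, -0.2, 0.0, 0.7, 2.3], dtype=np.float32)
x_q, s, z = affine_quantize(x, alpha=-1.0, beta=2.0)   # out-of-range values clamp
x_hat = affine_dequantize(x_q, s, z)
```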
Quantization Fundamentals: Scale Quantization (Symmetric)
• A symmetric real range [−α, α] maps to the integer range [−127, 127]
• s = 127 / α, x_q = clip(round(s·x), −127, 127)
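The symmetric counterpart drops the zero-point entirely, which is what keeps the integer matmul cheap; again a sketch, not the paper's code:

```python
import numpy as np

def scale_quantize(x, alpha):
    """Symmetric int8 quantization over [-alpha, alpha]; no zero-point."""
    s = 127.0 / alpha
    x_q = np.clip(np.round(s * x), -127, 127).astype(np.int8)
    return x_q, s

def scale_dequantize(x_q, s):
    return x_q.astype(np.float32) / s
```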
Quantization Fundamentals: Tensor Quantization Granularity
• At the coarsest, per-tensor granularity, the same quantization
parameters are shared by all elements in the tensor.
• The finest granularity would have individual quantization
parameters per element.
• Intermediate granularities: per-row or per-column for 2D matrices,
• per-channel for 3D (image-like) tensors.
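A short illustration of what the granularity choice means for the shape of the scale tensor (absolute-max ranges are used here just for concreteness):

```python
import numpy as np

w = np.random.randn(64, 128).astype(np.float32)   # e.g., 64 output channels

# Per-tensor: a single scalar scale shared by all elements.
s_tensor = 127.0 / np.abs(w).max()

# Per-row (per-output-channel): one scale per row, shape (64, 1).
s_row = 127.0 / np.abs(w).max(axis=1, keepdims=True)

w_q = np.clip(np.round(s_row * w), -127, 127).astype(np.int8)
```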
Quantization Fundamentals: Tensor Quantization Granularity
• Integer matrix multiplication
• per-row or per-tensor for activations
• per-column or per-tensor for weights
• The equation is as below:
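The equation itself was an image in the original deck; a reconstruction from the definitions above (with x_q = round(s·x), real products are recovered by dividing by the two scales):

```latex
% Integer matmul with per-row activation scales s_{x,i} and
% per-column weight scales s_{w,j} (reconstruction, not the paper's rendering):
y_{ij} \;\approx\; \frac{1}{s_{x,i}\, s_{w,j}} \sum_{k} X_{q,ik}\, W_{q,kj}
```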
Quantization Fundamentals: Tensor Quantization Granularity
• For activations, only per-tensor quantization is practical for
performance reasons.
• The number of rows depends on the batch size and can vary at inference
time, which prevents per-row scale factors from being computed offline.
• Offline per-row scales would not be meaningful across different instances
in a mini-batch, whereas determining them online imposes a compute
overhead and in some cases results in poor accuracy (see the Dynamic
Quantization discussion in [58]).
Quantization Fundamentals: Tensor Quantization Granularity
• The corresponding granularity to per-column in convolutions is
per-kernel, or equivalently per-output-channel since each
kernel produces a separate output channel [27, 29].
• This is commonly referred to as “per-channel” weight quantization in the
literature, and we follow that convention [21, 25, 26, 38, 46].
• The impact of granularity on accuracy is examined in Section 4.1.
Quantization Fundamentals: Computational Cost of Affine Quantization
• The affine-quantized matmul expands into three terms (reconstructed below):
• The first term is the integer dot product, just as in scale quantization (Equation 10).
• The second term involves only integer weights and zero-points, so it can be computed
offline, adding only an element-wise addition at inference time.
• If the layer has a bias, this term can be folded in without increasing inference cost.
• The third term involves the quantized input matrix Xq, and thus cannot be computed
offline.
• This overhead can reduce or even eliminate the throughput advantage that integer math
pipelines have over reduced-precision floating-point.
• The extra computation is incurred only if affine quantization is used for the
weight matrix.
• Recommendation: use scale quantization for weights.
• Affine quantization may still be used for activations.
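The expansion was shown as an image on the slide; a reconstruction from affine dequantization, x ≈ (x_q − z)/s, using per-tensor scales for brevity (K is the inner summation dimension):

```latex
% Substituting X ~ (X_q - z_x)/s_x and W ~ (W_q - z_w)/s_w into Y = XW:
y_{ij} \approx \frac{1}{s_x s_w}\Big(
    \underbrace{\sum_{k} X_{q,ik} W_{q,kj}}_{\text{1) integer dot product}}
  \underbrace{-\, z_x \sum_{k} W_{q,kj} + K z_x z_w}_{\text{2) weights/zero-points only: offline}}
  \underbrace{-\, z_w \sum_{k} X_{q,ik}}_{\text{3) depends on } X_q\text{: online, only if } z_w \neq 0}
\Big)
```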
Quantization Fundamentals: Calibration
• Max: Use the maximum absolute value seen during calibration [52].
• Entropy: Use KL divergence to minimize information loss between the original floating-point
values and values that could be represented by the quantized format. This is the default method
used by TensorRT [36].
• Percentile: Set the range to a percentile of the distribution of absolute values seen during
calibration [33]. For example, 99% calibration would clip 1% of the largest magnitude values.
• Clipping the distribution trades off a large clipping error on a few outlier values for smaller
rounding errors on a majority of the values.
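Max and percentile calibration reduce to one line each; a sketch over a sample of activation values (entropy calibration, TensorRT's default, is more involved and omitted here):

```python
import numpy as np

def max_calibrate(samples):
    """Range = maximum absolute value seen during calibration."""
    return np.abs(samples).max()

def percentile_calibrate(samples, pct=99.99):
    """Range = pct-th percentile of absolute values; larger values get clipped."""
    return np.percentile(np.abs(samples), pct)

acts = np.random.randn(100_000).astype(np.float32)
alpha = percentile_calibrate(acts, 99.99)   # then s = 127 / alpha, as before
```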
Post-Training Quantization
• Quantization parameters are calibrated offline by processing the trained model weights and
activations generated by running inference on a sample dataset; no further training is involved.
Post-Training Quantization
• Results refer to the relative accuracy change, computed as (acc_int8 − acc_fp32) / acc_fp32
Post-Training Quantization: Weight Quantization
• Int8 weights: max calibration
• However, as BN parameters are learned per channel, their folding can result in significantly
different weight value distributions across channels.
• Table 4 reports per-channel (per-column for linear layers) granularity and indicates that max
calibration is sufficient to maintain accuracy when quantizing weights to int8. The rest of the
experiments in this paper use per-channel max calibration for weights.
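For context, “BN folding” merges batch normalization (per-channel mean μ, variance σ², learned scale γ and shift) into the preceding convolution; the standard formulation, not spelled out on the slide, is:

```latex
% Folding per-channel batch norm into the preceding conv's weights and bias
% (here \beta_c is BN's learned shift, unrelated to the calibration range \beta):
W'_c = \frac{\gamma_c}{\sqrt{\sigma_c^2 + \epsilon}}\, W_c
\qquad
b'_c = \beta_c + \frac{\gamma_c}{\sqrt{\sigma_c^2 + \epsilon}}\,\big(b_c - \mu_c\big)
```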
Post-Training Quantization: Activation Quantization
• For most networks, at least one activation calibration method achieves
acceptable accuracy,
• except for MobileNets, EfficientNets, Transformer, and BERT, where the accuracy drop is larger than 1%.
• No single calibration method is best for all networks.
Techniques to Recover Accuracy
• Partial quantization: leave the most sensitive layers unquantized.
• One can also train networks with quantization (quantization-aware training).
• There are also approaches that jointly learn the model weights and quantization parameters.
Techniques to Recover Accuracy: Partial Quantization
• Partial quantization: leaves the most sensitive layers unquantized
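A sketch of the one-at-a-time sensitivity analysis that drives partial quantization; all callables are caller-supplied placeholders standing in for a real quantization toolkit:

```python
def sensitivity_analysis(model, layers, evaluate, quantize_layer, restore_layer):
    """Quantize one layer at a time and record the accuracy drop.

    evaluate(model) -> accuracy; quantize_layer/restore_layer mutate one layer.
    """
    baseline = evaluate(model)
    drops = {}
    for name in layers:
        quantize_layer(model, name)
        drops[name] = baseline - evaluate(model)
        restore_layer(model, name)
    # Most sensitive layers first; these are the ones left in floating point.
    return sorted(drops.items(), key=lambda kv: kv[1], reverse=True)
```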
Techniques to Recover Accuracy: Quantization-Aware Training
• Intuition for QAT
• After floating-point training, the model may have converged to a “narrow” minimum, where a small change in the weights
leads to a large change in the loss. By training with quantization, we may avoid such narrow minima by computing
gradients with respect to the quantized weights, as shown in Figure 6b. Narrow minima then yield larger gradients,
potentially letting the model explore the loss landscape for “wide” [22] or “flat” [16, 30] minima, where the
quantization points have lower loss, and thus higher accuracy.
Techniques to Recover Accuracy: Quantization-Aware Training
• Fake Quantization (sketched below)
• FakeQuant(x) = DQ(Q(x)) = INT8(s·x) / s
• Problem: the quantization (rounding) function is non-differentiable
• Solution: the Straight-Through Estimator (STE), which treats the rounding as identity in the backward pass
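A minimal PyTorch sketch of fake quantization with an STE backward pass (symmetric int8; names are illustrative):

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, s):
        # Quantize-dequantize: snap values to the int8 grid, keep float dtype.
        return torch.clamp(torch.round(s * x), -127, 127) / s

    @staticmethod
    def backward(ctx, grad_out):
        # STE: treat the forward pass as identity and pass gradients through.
        return grad_out, None

x = torch.randn(8, requires_grad=True)
s = 127.0 / x.abs().max().item()
y = FakeQuantSTE.apply(x, s)
y.sum().backward()          # x.grad is all ones under the STE
```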
Techniques to Recover Accuracy: Quantization-Aware Training
• Result
Techniques to Recover Accuracy: Learning Quantization Parameters
• Apply the same fine-tuning schedule as before,
• but allow the range of each quantized activation tensor to be learned along with the weights, as opposed to keeping
it fixed throughout fine-tuning.
• Use PACT [1]
• Fixed best: best accuracy with PTQ, as described in Table 5.
• Use max calibration for activation quantization.
• Learning the ranges results in higher accuracy than keeping them fixed for most networks.
• Starting from the best calibration, learning the ranges generates very similar results.
• Learned ranges offer no extra benefit over plain QAT for int8 if activation ranges are already carefully calibrated.
• PACT could yield better results with longer fine-tuning or a separate fine-tuning process.
[1] PACT: Parameterized Clipping Activation for Quantized Neural Networks, ICLR 2018
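PACT [1] clips activations to a learnable range [0, α], trained jointly with the weights; a hedged PyTorch sketch of its clipping function:

```python
import torch
import torch.nn as nn

class PACT(nn.Module):
    """Clip activations to [0, alpha], with alpha learned by gradient descent."""
    def __init__(self, alpha0=6.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha0))

    def forward(self, x):
        # y = 0.5 * (|x| - |x - alpha| + alpha) == clamp(x, 0, alpha),
        # written so the gradient w.r.t. alpha is 1 wherever x >= alpha.
        return 0.5 * (x.abs() - (x - self.alpha).abs() + self.alpha)
```

The clipped output is then fake-quantized with s = 127/α (using an STE as on the previous slide).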
Techniques to Recover Accuracy: Learning Quantization Parameters
• Complex fine-tuning procedure
[1] PACT: Parameterized Clipping Activation for Quantized Neural Networks, ICLR 2018
Recommended Workflow (Recap)
• Weights:
• Per-column (FC), per-channel (conv.)
• Symmetric (high performance)
• Max calibration (static)
• Activations:
• Per tensor, symmetric
• PTQ:
• Try activation calibration methods: max, entropy,
99.99%, and 99.999% percentile.
• If none achieves acceptable accuracy, move on to partial quantization or QAT.
• Partial Quantization:
• Perform sensitivity analysis to identify the most sensitive layers
and leave them in floating-point.
• If accuracy is still insufficient, do QAT.
• QAT:
• Start from the best calibrated quantized model.
• Fine-tune for around 10% of the original training schedule,
with an annealing learning-rate schedule starting at
1% of the initial training learning rate.
• For details, refer to Appendix A.2.
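The recap condenses to roughly this decision procedure (a sketch; all callables are placeholders for the steps above):

```python
def int8_workflow(model, ptq, partial_quant, qat, acceptable):
    """Recommended workflow sketch: PTQ first, then fall back as needed.

    ptq(model, method) and partial_quant/qat(model) return candidate models;
    acceptable(model) checks accuracy, e.g. within 1% of the fp32 baseline.
    """
    for method in ("max", "entropy", "99.99%", "99.999%"):
        candidate = ptq(model, method)
        if acceptable(candidate):
            return candidate
    candidate = partial_quant(model)     # skip the most sensitive layers
    if acceptable(candidate):
        return candidate
    return qat(model)                    # fine-tune ~10% of original schedule
```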
Conclusion
• Reviewed the mathematical background for integer quantization of neural networks, as
well as some performance-related reasons for choosing quantization parameters.
• Empirically evaluated various choices for int8 quantization across a variety of models, leading
to a proposed quantization workflow.
• MobileNets and BERT remain challenging to quantize.
• The workflow combines post-training quantization, partial quantization, and quantization-aware fine-tuning
techniques.
• Some more complex techniques, such as ADMM and distillation, were not required for
int8 quantization of these models.
• However, these techniques should be evaluated when quantizing to even lower-bit
integer representations, which we leave to future work.
Thank you