“Practical Approaches to DNN Quantization,” a Presentation from Magic Leap

Practical Approaches
to DNN Quantization
Dwith Chenna
Senior Embedded DSP Eng., Computer Vision
Magic Leap Inc.

Contents
• Why Quantization?
• Quantization Scheme
• Types of Quantization
• Post Training Quantization
• Quantization Tools
• Network Architecture
• Calibration Dataset
• Min/Max Tuning
• Quantization Evaluation
• Quantization Analysis
• Quantization Aware Training
• Best Practices
© 2023 Magic Leap

• Quantization is a powerful tool to enable deep learning on edge
devices
• Resource-constrained hardware with limited memory and low
power budgets
Why Quantization?

• Model compression: Up to 4x reduction in network size and memory
bandwidth (float32 to int8)
• Latency reduction: Up to 2-3x; int8 compute is significantly
faster than float32 [1]
• Trade-off: Potential effects on the model accuracy
Why Quantization?

• Convert full-precision floating-point numbers to int8 [2]
q - quantized value, r - real value, s - scale, z - zero point
• Convert a quantized value back to its floating-point representation
• For a floating-point distribution, we obtain the scale and zero point as:
Quantization Scheme
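
The scheme on this slide can be sketched in a few lines of Python. This is a minimal sketch assuming the standard affine mapping described in [2]: q = round(r / s) + z for quantization, r ≈ s * (q - z) for dequantization, with s and z derived from the observed floating-point range. The function names, the int8 range [-128, 127], and the widening of the range to include 0.0 are illustrative assumptions, not taken from the slides.

```python
def affine_params(r_min, r_max, q_min=-128, q_max=127):
    """Derive scale s and zero point z from a floating-point range."""
    # Assumption: widen the range to contain 0.0 so zero maps to an exact integer.
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    scale = (r_max - r_min) / (q_max - q_min)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero tensor
    zero_point = round(q_min - r_min / scale)
    zero_point = max(q_min, min(q_max, zero_point))
    return scale, zero_point


def quantize(r, scale, zero_point, q_min=-128, q_max=127):
    """q = clamp(round(r / s) + z)."""
    q = round(r / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """r ~= s * (q - z)."""
    return scale * (q - zero_point)


# Example: a tensor whose values lie in [-1.0, 3.0]
scale, zero_point = affine_params(-1.0, 3.0)
q = quantize(1.234, scale, zero_point)
r_hat = dequantize(q, scale, zero_point)
zero_roundtrip = dequantize(quantize(0.0, scale, zero_point), scale, zero_point)
```

Inside the representable range, the round-trip error of a single value is bounded by half a quantization step (scale / 2).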

• Assumes symmetric distribution for
simplicity, zero point = 0
• Symmetric per tensor
• Calculate scale for the entire tensor
• Symmetric per channel
• Calculate scale for each channel of the
tensor
• Computationally efficient
Quantization Scheme: Symmetric
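
To illustrate the difference between the two symmetric variants, the sketch below computes one scale for a whole weight tensor and one scale per output channel. It assumes int8 with the zero point fixed at 0 and represents the tensor as a list of channels; the function names and the toy weights are hypothetical.

```python
def symmetric_scale(values, q_max=127):
    """Scale for symmetric quantization (zero point fixed at 0)."""
    max_abs = max(abs(v) for v in values)
    return max_abs / q_max if max_abs > 0.0 else 1.0


def per_tensor_scale(weights):
    """One scale shared by every channel of the tensor."""
    return symmetric_scale([v for channel in weights for v in channel])


def per_channel_scales(weights):
    """One scale per output channel."""
    return [symmetric_scale(channel) for channel in weights]


# Two output channels whose magnitudes differ by a factor of 100
weights = [[0.50, -1.27], [0.0127, -0.00635]]
tensor_scale = per_tensor_scale(weights)
channel_scales = per_channel_scales(weights)
```

With the shared per-tensor scale of about 0.01, the small second channel collapses onto the integers 1 and -1, while its own per-channel scale of about 0.0001 lets it use the full int8 range.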

• Accounts for shifts in the distribution, better
utilization of quantization range
• Asymmetric per tensor
• Scale and zero point for the entire tensor
• Asymmetric per channel
• Scale and zero points for each channel of
the tensor
• Better handling of diverse distributions
Quantization Scheme: Asymmetric
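
The "better utilization of quantization range" point can be made concrete by comparing the quantization step size of the two schemes on a one-sided range. This is a small sketch under the assumption of 8-bit quantization; the helper names and the [0, 6] example range (as produced by a ReLU6 activation) are illustrative.

```python
def symmetric_step(r_min, r_max, q_max=127):
    """Step size when the range is forced to be symmetric around zero."""
    return max(abs(r_min), abs(r_max)) / q_max


def asymmetric_step(r_min, r_max, steps=255):
    """Step size when the full [r_min, r_max] range maps onto all 8-bit levels."""
    return (r_max - r_min) / steps


# One-sided distribution: the negative half of the symmetric range is wasted
one_sided_sym = symmetric_step(0.0, 6.0)
one_sided_asym = asymmetric_step(0.0, 6.0)

# Centered distribution: both schemes give nearly the same resolution
centered_sym = symmetric_step(-3.0, 3.0)
centered_asym = asymmetric_step(-3.0, 3.0)
```

For the one-sided range the asymmetric step is roughly half the symmetric one; for the centered range the two are almost identical, which is why the cheaper symmetric scheme is attractive there.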

• Post Training Quantization (PTQ)
• Simple yet efficient
• Uses already trained model and calibration dataset
• Quantization Aware Training (QAT)
• Emulates inference-time quantization
• Resource intensive as it needs retraining
Types of Quantization

• Dynamic Quantization
• Weights are quantized ahead of time
• Activations are quantized during inference (dynamic)
• Static Quantization
• Weights and activations are quantized
• Memory bandwidth and compute savings
• Needs representative dataset
Post Training Quantization
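
The difference between the two PTQ modes lies in when the activation scale is computed. The sketch below is a simplified illustration, assuming symmetric int8 activations; the class and function names are hypothetical and do not correspond to any framework API.

```python
def scale_from_range(values, q_max=127):
    max_abs = max(abs(v) for v in values)
    return max_abs / q_max if max_abs > 0.0 else 1.0


def quantize(values, scale, q_min=-128, q_max=127):
    return [max(q_min, min(q_max, round(v / scale))) for v in values]


class StaticActivationQuantizer:
    """Static: the scale is fixed ahead of time from a calibration set."""

    def __init__(self, calibration_batches):
        all_values = [v for batch in calibration_batches for v in batch]
        self.scale = scale_from_range(all_values)

    def __call__(self, activations):
        return quantize(activations, self.scale), self.scale


def dynamic_quantize(activations):
    """Dynamic: the scale is recomputed from each input at inference time."""
    scale = scale_from_range(activations)
    return quantize(activations, scale), scale


static = StaticActivationQuantizer([[0.5, -1.0], [2.0, 0.1]])
static_q, static_scale = static([4.0, 0.5])      # 4.0 lies outside the calibrated range
dynamic_q, dynamic_scale = dynamic_quantize([4.0, 0.5])
```

The static quantizer saturates on the unseen value 4.0, which is why it needs a representative dataset; the dynamic one adapts but pays for the range computation on every inference.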

• Best quantization scheme for deep neural
networks?
• Weights: Symmetric per channel
• Static distribution makes it easy for
quantization
• Weight distributions tend to be
symmetric [3]
• Symmetric per channel handles
diversity in weight distribution
Empirical distribution in a pre-trained network

• Activations: Asymmetric/Symmetric per tensor
• Dynamic distribution per inference makes it difficult to find
statistics
• Approximation through representative/calibration dataset
• Batch normalization enables better distributions for quantization

• TFLite supports 8-bit integer PTQ [1]
• Quantization scheme
• Weights: Symmetric per channel
• Activations: Asymmetric per tensor
• Quantization analysis
• Selective quantization with mixed precision (float32/16 + int8/int16)
• Layerwise quantization error with custom metrics
Quantization Tools: TFLite
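
As a rough illustration of 8-bit integer PTQ with this tool, the sketch below follows the full-integer flow from the post-training quantization guide [1]. The wrapper name `convert_to_int8` and its parameters `saved_model_dir` and `representative_dataset` are hypothetical; TensorFlow is imported inside the function so that the snippet can be loaded without it installed.

```python
def convert_to_int8(saved_model_dir, representative_dataset):
    """Full-integer post-training quantization with the TFLite converter.

    `representative_dataset` is a generator function that yields lists of
    float32 input arrays, e.g. `yield [image_batch]`.
    """
    import tensorflow as tf  # imported lazily; requires TensorFlow 2.x

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Calibration data used to estimate activation ranges
    converter.representative_dataset = representative_dataset
    # Restrict to int8 kernels so conversion fails on ops that cannot be quantized
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()  # serialized .tflite model as bytes
```

Dropping the `supported_ops` restriction lets the converter fall back to float kernels for unsupported operations instead of raising an error.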

• PyTorch supports 8-bit integer PTQ [4]
• Weights: (A)symmetric per tensor/channel
• Activations: (A)symmetric per tensor/channel
• Layerwise quantization error through custom metrics
Quantization Tools: PyTorch

• FLOPs are not everything!
• Neural Architecture Search (NAS)
• Most NAS-based models (e.g., EfficientNet) try to minimize
compute
• This results in deeper, leaner networks that work well with
cache-based systems
Network Architecture

• Efficient architecture for quantization [5]

• Quantization aware
• Larger models have redundancy, which makes them more robust to
quantization
• Enables the use of simpler, more efficient quantization
schemes

• Optimization tool chain
• Aggressive layer fusion for optimal memory bandwidth
• Optimal quantization parameter selection
• Hardware
• Better suited to the target hardware (CPU/GPU/DSP/accelerator)

• Representative dataset to estimate activation distribution
• Need to address diversity of the use case
• Size: ~100-1000 images are statistically significant [6]
Calibration Dataset

• Minimize quantization error and eliminate outliers
• Trade-offs: range vs quantization error
• Mean/Standard deviation
• Assuming normal distribution
• Min/Max: mean +/- 3*STD
Min/Max Tuning
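
The mean/standard deviation rule above can be sketched as follows. This is a minimal illustration, assuming the activations are available as a flat list of floats and using the population standard deviation; the function names and the toy data are hypothetical.

```python
import statistics


def mean_std_range(values, num_std=3.0):
    """Min/max estimate as mean +/- num_std * STD (assumes a roughly normal distribution)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return mean - num_std * std, mean + num_std * std


def clip(values, lo, hi):
    """Saturate values outside [lo, hi]."""
    return [max(lo, min(hi, v)) for v in values]


# 100 well-behaved activations plus a single large outlier
activations = [-1.0, 1.0] * 50 + [100.0]
lo, hi = mean_std_range(activations)
clipped = clip(activations, lo, hi)
```

The outlier at 100.0 is saturated to the tuned maximum, so the quantization range, and with it the step size, shrinks by roughly a factor of three at the cost of a clipping error on one sample.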

• Histogram
• Ignore the last x percent
• Moving average (TensorFlow default)
• Search max/min (PyTorch/TensorRT)
• Find the histogram range that covers most of the entropy
Min/Max Tuning
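
Two of the strategies listed above, percentile clipping and a moving average of per-batch ranges, can be sketched in plain Python. The function names, the even split of the ignored percentage between both tails, and the momentum value are assumptions made for illustration; they do not reproduce any framework's defaults.

```python
def percentile_range(values, ignore_percent=1.0):
    """Drop `ignore_percent` of the samples, split evenly between both tails."""
    ordered = sorted(values)
    k = int(len(ordered) * ignore_percent / 100.0 / 2.0)
    return ordered[k], ordered[len(ordered) - 1 - k]


def moving_average_range(batches, momentum=0.9):
    """Exponential moving average of the per-batch min and max."""
    lo, hi = min(batches[0]), max(batches[0])
    for batch in batches[1:]:
        lo = momentum * lo + (1.0 - momentum) * min(batch)
        hi = momentum * hi + (1.0 - momentum) * max(batch)
    return lo, hi


values = [float(i) for i in range(200)]
p_lo, p_hi = percentile_range(values, ignore_percent=2.0)

# The second batch contains an outlier (11.0) that only partly moves the maximum
ema_lo, ema_hi = moving_average_range([[0.0, 1.0], [0.0, 11.0]], momentum=0.9)
```

Both approaches trade range for resolution: values beyond the tuned min/max saturate, while everything inside is represented with a finer step.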

• Evaluate the best-fit quantization scheme for the model [2]
• ResNet50: Symmetric per tensor
• MobileNet: Asymmetric per channel
Quantization Evaluation

• Effects of quantization scheme on model accuracy [2]
• Classification accuracy of the quantized model
Quantization Evaluation

• What to do when quantization fails?
• Individual layer support for quantization
• Identifying the few problematic layers can significantly improve performance
• Common pitfalls
• Handling input/output quantization
• Layer fusion before quantization
Quantization Analysis

• Analyze each layer's sensitivity to quantization [1]
• Selective quantization: mixed precision inference for testing
Figure: layer number (x-axis) vs. activation range (y-axis); root mean square error (RMSE) vs. activation range
There are many layers with wide ranges, and some layers have high RMSE/scale values
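
A per-layer RMSE/scale value of the kind mentioned in the caption can be computed by quantizing and dequantizing a layer's output and comparing it with the float reference. The sketch below assumes int8 and a known scale per layer; the function names and toy data are hypothetical. For pure rounding error without clipping, the metric sits near 1/sqrt(12) ≈ 0.289, so much larger values point at clipping or a poorly chosen range.

```python
import math


def quantize_dequantize(values, scale, zero_point=0, q_min=-128, q_max=127):
    """Round-trip values through the integer grid."""
    out = []
    for v in values:
        q = max(q_min, min(q_max, round(v / scale) + zero_point))
        out.append(scale * (q - zero_point))
    return out


def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def rmse_over_scale(float_values, scale, zero_point=0):
    """Quantization error of one layer, normalized by its step size."""
    dequantized = quantize_dequantize(float_values, scale, zero_point)
    return rmse(float_values, dequantized) / scale


# Activations spread evenly over [-1, 1]
values = [i * 0.001 for i in range(-1000, 1001)]
well_scaled = rmse_over_scale(values, scale=1.0 / 127)  # range covers the data
clipped = rmse_over_scale(values, scale=0.5 / 127)      # range is too narrow
```

Ranking layers by this value is one way to pick candidates for selective (mixed-precision) quantization.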

• Non-Linear activations: precision requirement and quantization support
• ReLU/ReLU6 preferred over Sigmoid/LeakyReLU
• Weight/activation distribution: visualization or metrics for data
distribution, i.e., range
• Layer fusion: fuse Conv + BN + ReLU, Conv + BN, or Conv + ReLU before
quantization
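
Fusing a batch normalization layer into the preceding convolution is plain arithmetic on the weights and bias. The sketch below assumes inference-mode BN with per-channel parameters and represents each output channel's kernel as a flat list; the function names are hypothetical.

```python
import math


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel BN parameters into the preceding conv's weight and bias.

    `weight` is a list of output channels, each a flat list of kernel values.
    """
    folded_w, folded_b = [], []
    for w_c, b_c, g, bt, mu, v in zip(weight, bias, gamma, beta, mean, var):
        k = g / math.sqrt(v + eps)          # per-channel BN multiplier
        folded_w.append([w * k for w in w_c])
        folded_b.append((b_c - mu) * k + bt)
    return folded_w, folded_b


def conv_channel(kernel, bias, x):
    """Output of a single channel at one position: dot product plus bias."""
    return sum(w * xi for w, xi in zip(kernel, x)) + bias


# One output channel, checked at one input position (eps=0 for exact arithmetic)
weight, bias = [[2.0, -1.0]], [0.5]
gamma, beta, mean, var = [1.5], [0.1], [0.2], [4.0]
x = [1.0, 3.0]

conv_out = conv_channel(weight[0], bias[0], x)
reference = gamma[0] * (conv_out - mean[0]) / math.sqrt(var[0]) + beta[0]

fw, fb = fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=0.0)
fused = conv_channel(fw[0], fb[0], x)
```

Quantizing the folded weights, rather than the original convolution and BN separately, ensures the quantized ranges match what actually runs at inference.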

• Use larger bit width for more sensitive layers, i.e., fully connected,
network head
• Int16 activation support in TFLite
• Min/Max tuning: clip outlier weights that cause all other weights to be less
precise

• Large difference in weight values for different output channels: more
quantization error
• Asymmetric/Symmetric per channel quantization
• Weight equalization techniques to minimize the variation [7]
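
One weight equalization technique is cross-layer equalization, which AIMET [7] offers among its features; whether the talk refers to this specific variant is an assumption. The sketch below handles two consecutive fully connected layers separated by a ReLU and assumes every channel has a non-zero range. It rescales channel i of the first layer by 1/s_i and the matching input of the second layer by s_i, with s_i = sqrt(r1_i / r2_i), so both per-channel ranges become sqrt(r1_i * r2_i) while the network output is unchanged. All names are hypothetical.

```python
import math


def equalize(w1, b1, w2):
    """w1: [out1][in1] weights, b1: [out1] biases, w2: [out2][out1] weights."""
    w1_eq, b1_eq, scales = [], [], []
    for i, row in enumerate(w1):
        r1 = max(abs(v) for v in row)                # range of output channel i
        r2 = max(abs(out_row[i]) for out_row in w2)  # range of input channel i
        s = math.sqrt(r1 / r2)
        scales.append(s)
        w1_eq.append([v / s for v in row])
        b1_eq.append(b1[i] / s)
    w2_eq = [[v * scales[i] for i, v in enumerate(out_row)] for out_row in w2]
    return w1_eq, b1_eq, w2_eq


def forward(w1, b1, w2, x):
    """Linear -> ReLU -> Linear."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


w1 = [[8.0, -4.0], [0.1, 0.2]]  # channel ranges differ by a factor of 40
b1 = [1.0, -0.05]
w2 = [[0.5, 4.0], [-0.25, 2.0]]
x = [1.0, 2.0]

y_ref = forward(w1, b1, w2, x)
w1_eq, b1_eq, w2_eq = equalize(w1, b1, w2)
y_eq = forward(w1_eq, b1_eq, w2_eq, x)
ranges_after = [max(abs(v) for v in row) for row in w1_eq]
```

The rescaling is exact only because ReLU commutes with multiplication by a positive factor; other activations would need a different treatment.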

• When everything else fails!
• QAT is a fine-tuning process
• Start from a trained floating-point model and fine-tune with reduced momentum and
learning rate
Quantization Aware Training

• Inserting quantization nodes
during training [8]
• Simulate quantization using
floating-point operations
• Tune quantization parameters
during training
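
A quantization node of this kind is often called a fake-quantization op: it rounds to the integer grid and immediately converts back to floating point. The sketch below assumes int8 and pairs it with a straight-through estimator for the backward pass, which is a common choice but is not stated in the slides; the function names are hypothetical.

```python
def fake_quantize(x, scale, zero_point=0, q_min=-128, q_max=127):
    """Quantize then immediately dequantize, staying in floating point."""
    q = max(q_min, min(q_max, round(x / scale) + zero_point))
    return scale * (q - zero_point)


def ste_gradient(x, upstream_grad, scale, zero_point=0, q_min=-128, q_max=127):
    """Straight-through estimator: pass the gradient inside the
    representable range, block it where the value was clipped."""
    lo = scale * (q_min - zero_point)
    hi = scale * (q_max - zero_point)
    return upstream_grad if lo <= x <= hi else 0.0


inside = fake_quantize(0.26, scale=0.1)     # snaps to the nearest grid point
outside = fake_quantize(100.0, scale=0.1)   # saturates at the largest grid point
grad_inside = ste_gradient(0.26, 1.0, scale=0.1)
grad_outside = ste_gradient(100.0, 1.0, scale=0.1)
```

Because the forward pass sees the rounding and clipping error, the fine-tuned weights learn to tolerate it before the model is converted to real integer arithmetic.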

• Model selection
• NAS: Efficient architecture for quantization
• Quantization tools
• Support for quantization schemes and analysis tools
• Calibration dataset
• Representative dataset with ~100-1000 samples
Best Practices

• Quantization accuracy
• Evaluate best-fit quantization scheme for the model
• Identify potentially problematic layers
• Quantization aware training
• Fine tune model for quantization
Best Practices

1. https://www.tensorflow.org/lite/performance/post_training_quantization
2. Quantizing deep convolutional networks for efficient inference: A whitepaper [link]
3. Fixed Point Quantization of Deep Convolutional Networks [link]
4. https://pytorch.org/docs/stable/quantization.html
5. https://deci.ai/resources/achieve-fp32-accuracy-int8-inference-speed/
6. SelectQ: Calibration Data Selection for Post-Training Quantization [link]
7. AI Model Efficiency Toolkit (AIMET) [link]
8. Aspects and best practices of quantization aware training for custom network accelerators [link]
References
