PQAT
https://www.datature.io/blog/a-comprehensive-guide-to-neural-network-model-pruning
Need for Pruning
Pruning removes redundant weights in neural networks to enhance
efficiency. Key benefits include:
• Reduced Model Size: Eliminates insignificant connections, lowering
storage and memory requirements.
• Faster Inference: Sparse models require fewer computations, improving
execution speed.
• Lower Power Consumption: Reduces computational load, making
inference energy-efficient.
• Edge Deployment: Enables deployment on resource-limited devices by
optimizing model complexity.
Pruning maintains model performance while significantly improving efficiency for real-world applications.
Pruning
Pruning Process
Techniques
There are two main approaches to pruning neural networks:
• Train-Time Pruning: Pruning is integrated into the training process, typically by
applying L1 or L2 regularization to encourage sparsity in the weights. This helps
the model learn a more compact representation while training.
• Post-Training Pruning: The model is fully trained first, and then pruning is
applied to remove less significant weights. This method does not influence the
training process and is commonly used to optimize pre-trained models for
deployment.
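For train-time pruning, the sketch below uses the TensorFlow Model Optimization toolkit's magnitude-pruning wrapper. The toolkit is not named in these slides, and the toy model, 50% sparsity target, and step counts are illustrative assumptions:

# A minimal sketch of train-time (magnitude) pruning with the
# TensorFlow Model Optimization toolkit; model and schedule values
# are assumptions for illustration.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Gradually raise sparsity from 0% to 50% over the first 1000 steps.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5,
    begin_step=0, end_step=1000)

pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])

# UpdatePruningStep must be registered so the masks advance each step.
# pruned_model.fit(x_train, y_train, epochs=2,
#                  callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove pruning wrappers before export; the zeros stay in the weights.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)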
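Post-training pruning can be as simple as zeroing the smallest-magnitude weights of a trained layer. A minimal sketch, assuming a 50% sparsity target and random stand-in weights:

# A minimal sketch of post-training magnitude pruning on a raw weight
# matrix; the sparsity target and random weights are illustrative.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 8)).astype(np.float32)

sparsity = 0.5  # fraction of weights to remove
threshold = np.quantile(np.abs(weights), sparsity)

# Zero out the smallest-magnitude connections; the mask could also be
# kept to freeze pruned weights during any later fine-tuning.
mask = np.abs(weights) >= threshold
pruned = weights * mask

print(f"sparsity achieved: {1 - mask.mean():.2f}")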
Quantization
✓ Quantization refers to the process of approximating a model's parameters (weights and activations) using lower-precision data types, such as 8-bit integers (INT8), instead of the commonly used 32-bit floating-point numbers (FP32).
✓ The primary goal is to improve the efficiency of deep learning models by reducing the number of bits required for computations.
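To make the FP32-to-INT8 mapping concrete, the sketch below applies the standard uniform (affine) quantization scheme, q = round(x / scale) + zero_point; the tensor values are made up for illustration:

# A minimal sketch of uniform (affine) INT8 quantization of one
# tensor; the example values are invented for illustration.
import numpy as np

x = np.array([-1.8, -0.4, 0.0, 0.9, 2.5], dtype=np.float32)

qmin, qmax = -128, 127                       # INT8 range
scale = (x.max() - x.min()) / (qmax - qmin)  # FP32 units per integer step
zero_point = int(round(qmin - x.min() / scale))

q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
x_hat = (q.astype(np.float32) - zero_point) * scale  # dequantize

print(q)                        # INT8 codes
print(np.abs(x - x_hat).max())  # worst-case quantization error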
Need for Quantization
Quantization is essential for deploying deep learning models on resource-
constrained devices. Key benefits include:
• Reduced Memory Footprint: Lower-precision formats (e.g., INT8 vs.
FP32) minimize storage needs.
• Faster Inference: Low-precision arithmetic speeds up computations,
especially on specialized hardware like TPUs.
• Lower Power Consumption: Reduces energy use, making it ideal for
battery-powered devices.
• Edge Deployment: Enables efficient on-device inference without reliance
on cloud resources.
Quantization enhances efficiency, making deep learning models lightweight and scalable for real-world applications.
Quantization Process
Verification after Quantization
Techniques
There are two main approaches to quantizing neural networks:
1. Post-Training Quantization (PTQ)
   • Applied after model training, without modifying the learned weights.
   • Converts weights and activations from FP32 to lower precision (e.g., INT8).
   • Types:
     1. Dynamic Quantization: only weights are quantized; activations stay in FP32.
     2. Static Quantization: both weights and activations are quantized, using a calibration dataset.
   • Advantage: faster inference with minimal retraining (see the PTQ sketch after this list).
2. Quantization-Aware Training (QAT)
   • Simulates quantization effects during training so the model learns to adapt to them.
   • Weights and activations remain in FP32 during training, while fake-quantization ops mimic the lower-precision (e.g., INT8) arithmetic used at inference.
   • Advantage: higher accuracy than PTQ, especially for complex models (see the QAT sketch after this list).
• These two methods are widely used in real-world applications, with PTQ preferred for efficiency and QAT for maintaining accuracy.
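A minimal PTQ sketch using the TensorFlow Lite converter, covering both the dynamic and static variants from the list above; "saved_model_dir" and the calibration generator are placeholders for a real model and data:

# A minimal sketch of post-training quantization with the TensorFlow
# Lite converter; the model path and calibration data are placeholders.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# Dynamic-range PTQ: weights become INT8, activations stay FP32.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
dynamic_tflite = converter.convert()

# Static (full-integer) PTQ: a representative dataset calibrates
# activation ranges so activations are quantized as well.
def representative_data_gen():
    for _ in range(100):
        yield [tf.random.normal([1, 784])]  # stand-in for real samples

converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
static_tflite = converter.convert()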
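A minimal QAT sketch using the TensorFlow Model Optimization toolkit; the base model and the commented-out training call are illustrative assumptions:

# A minimal sketch of quantization-aware training; the toy model is
# an assumption for illustration.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

base_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Insert fake-quantization ops so training sees INT8 rounding effects
# while the stored weights themselves remain FP32.
qat_model = tfmot.quantization.keras.quantize_model(base_model)

qat_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
# qat_model.fit(x_train, y_train, epochs=2)

# Convert to a fully quantized TFLite model after fine-tuning.
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()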
Conclusion
• Pruned and quantized models can be exported with the TensorFlow Lite package.
• This makes them lightweight enough for edge deployment on resource-limited devices.