
Optimizing Deep Learning Models for Edge Device Deployment
(Quantization and Pruning)
Table of Contents
↓ Introduction
↓ Need for Edge Deployment
↓ Edge Deployment Process
↓ Techniques
↓ Pruning
↓ Quantization
↓ Tools for Edge Deployment
Introduction
➢ Edge AI: Bringing deep learning models to edge devices
➢ Challenges: Limited compute, memory, and power constraints
➢ Importance of optimization for real-time, low-latency applications
Necessity of Edge Deployment
Low Latency & Real-Time Processing
• Edge devices enable real-time decision-making by processing data locally, reducing dependency on cloud servers.
• This is crucial for applications like autonomous vehicles, medical diagnostics, and industrial automation.
Privacy & Security
• By keeping data processing on the device, edge deployment minimizes the risk of data breaches and enhances privacy compliance.
• This is essential for sensitive applications like healthcare and finance.
Reduced Bandwidth & Power Consumption
• Transmitting large amounts of data to the cloud is costly and power-intensive.
• Edge AI optimizes resource usage by processing data locally, reducing bandwidth consumption and making AI feasible for IoT and battery-powered devices.
Edge Deployment Process
Edge Deployment Pipeline
Key Optimization Aspects for Edge AI

• Model Compression (Pruning, Quantization, Knowledge Distillation)
• Efficient Architectures (MobileNets, EfficientNets, Transformers for Edge)
• Hardware Acceleration (TPUs, NPUs, FPGAs)
• Inference Optimization (TFLite, ONNX, TensorRT, EdgeTPU)


Training vs Inference
Example: XOR gate
Mathematical Calculations
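The slide's worked figures are not reproduced here. As a stand-in, here is a minimal sketch of the same idea in Keras (the layer sizes, optimizer, and epoch count are illustrative assumptions, not values from the slides): training fits weights to the XOR truth table, and inference is a single forward pass with the learned weights.

```python
import numpy as np
import tensorflow as tf

# XOR truth table: inputs and expected outputs.
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

# Training: iteratively adjust weights to fit the data.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(x, y, epochs=500, verbose=0)

# Inference: a single forward pass using the learned weights.
print(model.predict(x, verbose=0).round())
```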
Pruning
• Model pruning refers to removing unimportant parameters from a deep learning model to reduce the model size and enable more efficient inference.
• Generally, only the weights are pruned, leaving the biases untouched, since pruning biases tends to have much more significant downsides for accuracy (see the sketch below).

https://www.datature.io/blog/a-comprehensive-guide-to-neural-network-model-pruning
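As an illustration of the idea, here is a minimal magnitude-pruning sketch in NumPy (the weight matrix and the 50% sparsity target are assumptions for demonstration, not values from the text): weights whose absolute value falls below a percentile threshold are zeroed, while biases are left untouched.

```python
import numpy as np

# Stand-in for a trained layer's FP32 weight matrix.
rng = np.random.default_rng(0)
weights = rng.normal(size=(128, 64)).astype(np.float32)

# Magnitude pruning: zero out the 50% of weights with the
# smallest absolute values; biases are deliberately not pruned.
sparsity = 0.5
threshold = np.quantile(np.abs(weights), sparsity)
mask = np.abs(weights) >= threshold
pruned_weights = weights * mask

print(f"Sparsity achieved: {1 - mask.mean():.2%}")
```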
Need for Pruning
Pruning removes redundant weights in neural networks to enhance efficiency. Key benefits include:
• Reduced Model Size: Eliminates insignificant connections, lowering storage and memory requirements.
• Faster Inference: Sparse models require fewer computations, improving execution speed.
• Lower Power Consumption: Reduces computational load, making inference energy-efficient.
• Edge Deployment: Enables deployment on resource-limited devices by optimizing model complexity.
Pruning maintains model performance while significantly improving efficiency for real-world applications.
Pruning
Pruning Process
Techniques
There are two main approaches to pruning neural networks (a code sketch follows the list):
• Train-Time Pruning: Pruning is integrated into the training process, typically by applying L1 or L2 regularization to encourage sparsity in the weights. This helps the model learn a more compact representation while training.
• Post-Training Pruning: The model is fully trained first, and then pruning is applied to remove less significant weights. This method does not influence the training process and is commonly used to optimize pre-trained models for deployment.
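A minimal sketch of schedule-driven magnitude pruning, assuming the TensorFlow Model Optimization toolkit (tensorflow_model_optimization) is installed and compatible with the installed TF/Keras version; the model architecture, sparsity target, and dummy data are illustrative placeholders:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder model and data; in practice, use the real network/dataset.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
x_train = tf.random.uniform((32, 8))
y_train = tf.random.uniform((32, 1))

# Wrap the model so low-magnitude weights are zeroed during training,
# ramping sparsity from 0% to 50% over the first 1000 steps.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5,
    begin_step=0, end_step=1000)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=schedule)

pruned_model.compile(optimizer="adam", loss="binary_crossentropy")
# UpdatePruningStep advances the pruning schedule each batch.
pruned_model.fit(x_train, y_train, epochs=1,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
```

The same wrapper can also be applied to an already-trained model and briefly fine-tuned, which covers the post-training case.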
Quantization
✓ Quantization refers to the process of approximating a model's parameters (weights and activations) using lower-precision data types, such as 8-bit integers (INT8), instead of the commonly used 32-bit floating-point numbers (FP32).
✓ The primary goal is to improve the efficiency of deep learning models by reducing the number of bits required for computations (a worked numeric example follows).
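As a worked illustration of one common scheme (asymmetric affine quantization with min-max calibration, which is an assumption here; the slides do not specify the mapping), the sketch below quantizes a few FP32 values to INT8 and dequantizes them to check the round-trip error:

```python
import numpy as np

# Asymmetric affine quantization: q = round(x / scale) + zero_point.
x = np.array([-1.8, -0.5, 0.0, 0.7, 2.3], dtype=np.float32)

# Derive scale and zero point from the observed value range (min-max).
qmin, qmax = -128, 127
scale = (x.max() - x.min()) / (qmax - qmin)
zero_point = int(round(qmin - x.min() / scale))

q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
x_hat = (q.astype(np.float32) - zero_point) * scale  # dequantize

print("quantized (INT8):", q)
print("max round-trip error:", np.abs(x - x_hat).max())
```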
Need for Quantization
Quantization is essential for deploying deep learning models on resource-constrained devices. Key benefits include:
• Reduced Memory Footprint: Lower-precision formats (e.g., INT8 vs. FP32) minimize storage needs.
• Faster Inference: Low-precision arithmetic speeds up computations, especially on specialized hardware like TPUs.
• Lower Power Consumption: Reduces energy use, making it ideal for battery-powered devices.
• Edge Deployment: Enables efficient on-device inference without reliance on cloud resources.
Quantization enhances efficiency, making deep learning models lightweight and scalable for real-world applications.
Quantization Process
Verification after Quantization
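The process and verification figures are not reproduced here. Below is a minimal sketch of post-training (dynamic range) quantization with the TensorFlow Lite converter, followed by a check that the quantized model's output stays close to the FP32 output; the tiny model and random input are placeholders:

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in model; in practice this is the trained network.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu", input_shape=(3,)),
    tf.keras.layers.Dense(1),
])

# Post-training quantization via the TFLite converter
# (Optimize.DEFAULT applies dynamic range quantization to weights).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Verification: compare FP32 and quantized outputs on a sample input.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

sample = np.random.rand(1, 3).astype(np.float32)
interpreter.set_tensor(inp["index"], sample)
interpreter.invoke()
print("quantized output:", interpreter.get_tensor(out["index"]))
print("fp32 output:     ", model(sample).numpy())
```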
Techniques
There are two main approaches to quantizing neural networks:
1. Post-Training Quantization (PTQ)
   • Applied after model training without modifying learned weights.
   • Converts weights and activations from FP32 to lower precision (e.g., INT8).
   • Types:
     ◦ Dynamic Quantization (only weights are quantized; activations stay FP32).
     ◦ Static Quantization (both weights and activations are quantized using calibration).
   • Advantage: Faster inference with minimal retraining.
2. Quantization-Aware Training (QAT)
   • Simulates quantization effects during training so the model learns to adapt.
   • Weights and activations remain in FP32 during training but use lower precision (e.g., INT8) at inference.
   • Advantage: Higher accuracy than PTQ, especially for complex models.
Both methods are widely used in real-world applications, with PTQ preferred for efficiency and QAT for maintaining accuracy (a QAT sketch follows).
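A minimal QAT sketch, assuming the TensorFlow Model Optimization toolkit is available; the architecture and dummy data are illustrative placeholders. quantize_model inserts fake-quantization ops so training sees INT8 rounding and clipping effects while the underlying weights stay FP32:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder model and data; in practice, use the real network/dataset.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
x_train = tf.random.uniform((64, 8))
y_train = tf.random.uniform((64,), maxval=10, dtype=tf.int32)

# Wrap the model with fake-quantization ops; training then "sees"
# INT8 effects while weights remain FP32 until final conversion.
qat_model = tfmot.quantization.keras.quantize_model(model)
qat_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy")
qat_model.fit(x_train, y_train, epochs=1, verbose=0)

# After fine-tuning, convert to a real INT8 model with the TFLite
# converter, as in the PTQ example above.
```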
Conclusion
• TensorFlow Lite package
• Edge deployment
