Lec03 Pruning I
Lecture 03:
Pruning and Sparsity
Part I
Song Han
Associate Professor, MIT
Distinguished Scientist, NVIDIA
@SongHan_MIT
MIT 6.5940: TinyML and Efficient Deep Learning Computing (https://efficientml.ai)
Today’s AI is too BIG!
[Figure: ImageNet Top-1 accuracy (69-81%) vs. MACs (0-9 billion) for models including MobileNetV1, ShuffleNet, IGCV3-D, MBNetV2, InceptionV2, ResNet-50, DenseNet-121/169/264, DPN-92, ResNet-101, ResNeXt-50/101, InceptionV3, and Xception; marker size indicates #Parameters (2M-64M). Higher accuracy comes at the cost of rapidly growing model size and compute.]
Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey [Deng et al., IEEE 2020]
Efficient Deep Learning Techniques are Essential
Bridges the Gap between the Supply and Demand of Computation
[Figure: model size (#params in billions, log scale) vs. year, 2017-2022. Language models grow from Transformer (0.05B), GPT (0.11B), BERT (0.34B), GPT-2 (1.5B), MegatronLM (8.3B), T-NLG (17B), and GPT-3 (175B) to MT-NLG (530B), while accelerator memory grows only from TPUv2 (16GB) and V100/TPUv3 (32GB) to A100 (40GB and 80GB). Assuming FP16 data, model compression bridges the gap between model size and GPU memory.]
Part 1 of This Course: Efficient Inference
Pruning
Quantization
Knowledge Distillation
MLPerf (the Olympic Games of AI Computing)
Pruning on Large Language Models
• The open division submission on Llama 2 70B: 2.5x speedup while maintaining 99% accuracy.
• Depth pruning: 80 layers -> 32 layers
• Width pruning: 28,672 intermediate dimensions -> 14,336 intermediate dimensions
[Table: Llama 2 70B performance for both the closed division and the open division, measured on a single NVIDIA H200 GPU.]
NVIDIA Blackwell Platform Sets New LLM Inference Records in MLPerf Inference v4.1
Memory is Expensive
Data Movement → More Memory Reference → More Energy
[Figure: relative energy cost: a single memory access consumes roughly 200x the energy of an arithmetic operation (public-domain image).]
Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]
Neural Network Pruning
• Introduction to Pruning
• What is pruning?
• How should we formulate pruning?
[Figure: a neural network before pruning and after pruning, with some synapses and neurons removed.]
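A common way to formalize pruning (in line with Han et al., NeurIPS 2015) is as constrained optimization over the pruned weights WP, where the L0 norm counts the remaining non-zero weights; a minimal LaTeX sketch:

```latex
% Pruning as constrained optimization (sketch): find pruned weights W_P that
% minimize the task loss subject to a budget of N non-zero parameters.
\arg\min_{W_P} \; L(\mathbf{x}; W_P)
\quad \text{subject to} \quad \lVert W_P \rVert_0 \le N
% \lVert W_P \rVert_0 counts the non-zero weights; the chosen pruning ratio
% (see "Determine the Pruning Ratio" in the outline) determines N.
```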
Neural Network Pruning
• Introduction to Pruning
• What is pruning?
• How should we formulate pruning?
• Determine the Pruning Granularity
• In what pattern should we prune the neural network?
• Determine the Pruning Criterion
• What synapses/neurons should we prune?
• Determine the Pruning Ratio
• What should target sparsity be for each layer?
• Fine-tune/Train Pruned Neural Network
• How should we improve performance of pruned models?
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Neural Network Pruning
• Introduction to Pruning
• What is pruning?
• How should we formulate pruning?
• Determine the Pruning Granularity
• In what pattern should we prune the neural network?
• Determine the Pruning Criterion
• What synapses/neurons should we prune? (which synapses? which neurons?)
• Determine the Pruning Ratio
• What should target sparsity be for each layer?
• Fine-tune/Train Pruned Neural Network
• How should we improve performance of pruned models?
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Neural Network Pruning
• Introduction to Pruning
• What is pruning?
• How should we formulate pruning?
• Determine the Pruning Granularity
• In what pattern should we prune the neural network?
• Determine the Pruning Criterion
• What synapses/neurons should we prune?
• Determine the Pruning Ratio (prune 30%? 50%? 70%?)
• What should target sparsity be for each layer?
[Figure, left: synaptic pruning in the human brain across development (newborn, 2-4 years old, adolescence, adult), with roughly 2,500 synapses per neuron [1] and 7,000 synapses per neuron [2] at different stages as connections are formed and then pruned. Right: pruning synapses and pruning neurons in an artificial neural network, and the resulting accuracy loss.]
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Neural Network Pruning
Make a neural network smaller by removing synapses and neurons
[Figure: pruning a network, and the resulting accuracy loss (y-axis from +0.5% to -2.0%) as more parameters are pruned away; with pruning alone, accuracy eventually drops.]
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Neural Network Pruning
Make a neural network smaller by removing synapses and neurons
[Figure: accuracy loss (y-axis from +0.5% to -2.0%) vs. parameters pruned away, for pruning alone and for pruning + fine-tuning; fine-tuning recovers most of the accuracy lost by pruning.]
Neural Network | #Parameters Before Pruning | #Parameters After Pruning | #Parameters Reduction | MACs Reduction
AlexNet | 61 M | 6.7 M | 9✕ | 3✕
Efficient Methods and Hardware for Deep Learning [Han S., Stanford University]
Neural Network Pruning
Pruning the NeuralTalk LSTM does not hurt image caption quality.
• Baseline: a basketball player in a white uniform is playing with a ball. → Pruned 90%: a basketball player in a white uniform is playing with a basketball.
• Baseline: a brown dog is running through a grassy field. → Pruned 90%: a brown dog is running through a grassy area.
• Baseline: a man is riding a surfboard on a wave. → Pruned 90%: a man in a wetsuit is riding a wave on a beach.
• Baseline: a soccer player in red is running in the field. → Pruned 95%: a man in a red shirt and black and white black shirt is running through a field.
Efficient Methods and Hardware for Deep Learning [Han S., Stanford University]
Neural Network Pruning
Make a neural network smaller by removing synapses and neurons
[Figure: number of publications on pruning and sparse neural networks per year, 1989-2022, growing from a handful around Optimal Brain Damage (1989) to thousands per year after Deep Compression and EIE. Source: https://github.com/mit-han-lab/pruning-sparsity-publications]
[Image: the abstracts of Optimal Brain Damage (LeCun, Denker and Solla) and Learning Both Weights and Connections for Efficient Neural Network (Han et al.): removing unimportant weights makes networks smaller and easier to deploy on embedded systems.]
Pruning in the Industry
Hardware support for sparsity
Neural Network Pruning
• Introduction to Pruning
• What is pruning?
• How should we formulate pruning?
• Determine the Pruning Granularity
• In what pattern should we prune the neural network?
• Determine the Pruning Criterion
• What synapses/neurons should we prune?
• Determine the Pruning Ratio
• What should target sparsity be for each layer?
• Fine-tune/Train Pruned Neural Network
• How should we improve performance of pruned models?
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Section 2: Pruning Granularity
Pruning can be performed at different granularities, from structured to non-structured.
Pruning at Different Granularities
A simple example of a 2D weight matrix
Pruning at Different Granularities
A simple example of a 2D weight matrix
[Figure: a weight matrix with preserved and pruned entries at irregular positions.]
Fine-grained/Unstructured
• More flexible pruning index choice
• Hard to accelerate (irregular)
Pruning at Different Granularities
A simple example of a 2D weight matrix
[Figure: unstructured pruning removes individual entries; structured pruning removes entire rows/columns, leaving a smaller dense matrix.]
Fine-grained/Unstructured
• More flexible pruning index choice
• Hard to accelerate (irregular)
Coarse-grained/Structured
• Less flexible pruning index choice (a subset of the fine-grained case)
• Easy to accelerate (just a smaller matrix!)
Pruning at Different Granularities
The case of convolutional layers
• The weights of convolutional layers have 4 dimensions [co, ci, kh, kw]:
• ci: input channels (or channels)
• co: output channels (or filters)
• kh: kernel size height
• kw: kernel size width
Pruning at Different Granularities
The case of convolutional layers
• Some of the commonly used pruning granularities
[Figure: notation: a convolutional weight tensor with co = 3 filters, ci = 2 input channels, and kh = kw = 3 kernels; preserved and pruned weights are highlighted.]
Exploring the granularity of sparsity in convolutional neural networks [Mao et al., CVPR-W]
Pruning at Different Granularities
The case of convolutional layers
• Some of the commonly used pruning granularities
[Figure: pruning patterns on a weight tensor (co = 3, ci = 2, kh = kw = 3), arranged from irregular (fine-grained) to regular (e.g., vector-level, kernel-level, and channel-level sparsity).]
Exploring the granularity of sparsity in convolutional neural networks [Mao et al., CVPR-W]
Pruning at Different Granularities
The case of convolutional layers
• Some of the commonly used pruning granularities: what are the pros and cons of each?
[Figure: the same pruning granularities, from irregular to regular.]
Exploring the granularity of sparsity in convolutional neural networks [Mao et al., CVPR-W]
Pruning at Different Granularities
Let’s look into some cases
• Fine-grained Pruning (the case we showed before)
• Flexible pruning indices
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Pruning at Different Granularities
Let’s look into some cases
• Fine-grained Pruning (the case we showed before)
• Flexible pruning indices
• Usually larger compression ratio, since we can flexibly find "redundant" weights (we will later discuss how we find them)
Neural Network | #Parameters Before Pruning | #Parameters After Pruning | Reduction
AlexNet | 61 M | 6.7 M | 9✕
Efficient Methods and Hardware for Deep Learning [Han S., Stanford University]
Pruning at Different Granularities
Let’s look into some cases
• Fine-grained Pruning (the case we showed before)
• Flexible pruning indices
• Usually larger compression ratio, since we can flexibly find "redundant" weights (we will later discuss how we find them)
• Can deliver speedup on some custom hardware (e.g., EIE), but not easily on GPUs
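To see why irregular sparsity needs special hardware support, note that fine-grained sparse weights are typically stored in a compressed format such as CSR, which saves memory but requires indirect indexing at compute time; a minimal PyTorch sketch:

```python
import torch

# Fine-grained sparsity is usually stored compressed (here: CSR): only the
# non-zero values plus their column indices and row offsets are kept. This
# saves memory, but turning it into speedup needs hardware/kernels that can
# handle the irregular indexing (e.g., EIE), which stock GPUs do not do easily.
w = torch.tensor([[3., 0., 0., -2.],
                  [0., 0., 0., 0.],
                  [0., 1., 0., -5.]])
w_csr = w.to_sparse_csr()
print(w_csr.values())        # tensor([ 3., -2.,  1., -5.])
print(w_csr.col_indices())   # tensor([0, 3, 1, 3])
print(w_csr.crow_indices())  # tensor([0, 2, 2, 4])
```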
Pruning at Different Granularities
Let’s look into some cases
• Pattern-based Pruning: N:M sparsity
• N:M sparsity means that in every group of M contiguous elements, N of them are pruned
[Figure: a dense weight matrix, before enforcing the N:M pattern.]
Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
Pruning at Different Granularities
Let’s look into some cases
• Pattern-based Pruning: N:M sparsity
• N:M sparsity means that in every group of M contiguous elements, N of them are pruned
• A classic case is 2:4 sparsity (50% sparsity)
Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
Pruning at Different Granularities
Let’s look into some cases
• Pattern-based Pruning: N:M sparsity
• N:M sparsity means that in every group of M contiguous elements, N of them are pruned
• A classic case is 2:4 sparsity (50% sparsity)
• It is supported by NVIDIA's Ampere GPU Architecture, which delivers up to 2x speedup
[Figure: compressed storage for 2:4 sparsity: only the non-zero values are kept, together with 2-bit indices recording their positions within each group of four.]
Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
Pruning at Different Granularities
Let’s look into some cases
• Pattern-based Pruning: N:M sparsity
• N:M sparsity means that in every group of M contiguous elements, N of them are pruned
• A classic case is 2:4 sparsity (50% sparsity)
• It is supported by NVIDIA's Ampere GPU Architecture, which delivers ~2x speedup
• Usually maintains accuracy (tested on a variety of tasks)
Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
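A minimal PyTorch sketch of enforcing the 2:4 pattern offline by keeping the two largest-magnitude weights in every group of four along the input dimension; the `prune_2_to_4` helper is illustrative only, and a production flow would use NVIDIA's sparsity tooling (e.g., TensorRT) plus fine-tuning:

```python
import torch

def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Enforce the 2:4 pattern: in every contiguous group of 4 elements along
    the input dimension, keep the 2 largest-magnitude weights, zero the rest."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "in_features must be a multiple of 4"
    groups = weight.reshape(out_features, in_features // 4, 4)
    keep_idx = groups.abs().topk(2, dim=-1).indices   # 2 largest |w| per group
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, keep_idx, 1.0)                  # 1.0 where the weight is kept
    return (groups * mask).reshape(out_features, in_features)

# Toy usage: every group of 4 has at most 2 non-zeros afterwards.
w = torch.randn(8, 16)
w_24 = prune_2_to_4(w)
print((w_24.reshape(8, -1, 4) != 0).sum(dim=-1))      # all entries are <= 2
```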
Pruning at Different Granularities
Let’s look into some cases
• Channel Pruning
• Pro: direct speedup due to reduced channel numbers (leading to an NN with smaller #channels)
• Con: smaller compression ratio
[Figure: channel pruning reduces #channels layer by layer, with a different sparsity per layer (e.g., Layer 0: 0.5, Layer 1: 0.3, Layer 2: 0.7, Layer 3: 0.2, Layer 4: 0.3, ...).]
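As an illustration of why channel pruning gives direct speedup, here is a minimal PyTorch sketch that physically shrinks a Conv2d by dropping the filters with the smallest L2 norm. The `prune_conv_channels` helper and the L2-norm criterion are just one possible choice (Section 3 discusses criteria); in a real network, the following layer's input channels and any BatchNorm parameters must be pruned to match:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def prune_conv_channels(conv: nn.Conv2d, sparsity: float) -> nn.Conv2d:
    """Remove the output channels (filters) with the smallest L2 norm and
    return a physically smaller Conv2d (no masks needed)."""
    # Filter importance: L2 norm over [ci, kh, kw] for each output channel.
    importance = conv.weight.flatten(1).norm(p=2, dim=1)
    n_keep = max(1, int(round(conv.out_channels * (1 - sparsity))))
    keep = importance.topk(n_keep).indices.sort().values
    new_conv = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    new_conv.weight.copy_(conv.weight[keep])
    if conv.bias is not None:
        new_conv.bias.copy_(conv.bias[keep])
    return new_conv  # the next layer's input channels must be pruned to match

conv = nn.Conv2d(64, 128, 3, padding=1)
print(prune_conv_channels(conv, sparsity=0.5))  # Conv2d(64, 64, ...)
```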
Pruning at Different Granularities
Let’s look into some cases
• Channel Pruning
• Pro: direct speedup due to reduced channel numbers (leading to an NN with smaller #channels)
• Con: smaller compression ratio
We will later discuss how to find sparsity ratios.
[Figure: uniform shrink uses the same sparsity (0.3) for every layer, whereas channel pruning uses per-layer sparsity ratios (0.5, 0.3, 0.7, 0.2, 0.3, ...).]
Pruning at Different Granularities
Let’s look into some cases
• Channel Pruning
• Pro: direct speedup due to reduced channel numbers (leading to an NN with smaller #channels)
• Con: smaller compression ratio
[Figure: measured latency (ms) of the uniform-shrink model vs. the channel-pruned model.]
AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
Neural Network Pruning
• Introduction to Pruning
• What is pruning?
• How should we formulate pruning?
• Determine the Pruning Granularity
• In what pattern should we prune the neural network?
• Determine the Pruning Criterion
• What synapses/neurons should we prune? (which synapses? which neurons?)
• Determine the Pruning Ratio
• What should target sparsity be for each layer?
• Fine-tune/Train Pruned Neural Network
• How should we improve performance of pruned models?
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Section 3: Pruning Criterion
What synapses and neurons should we prune?
Selection of Synapses to Prune
• When removing parameters from a neural network model,
• the less important the parameters being removed are,
• the better the performance of the pruned neural network is.
Example: a single neuron computes
y = f(∑i wi xi + b),  with f(·) = ReLU(·) and W = [w0, w1, w2] = [10, −8, 0.1]
⟹ y = ReLU(10·x0 − 8·x1 + 0.1·x2)
• If one weight is to be removed, which one should it be?
Magnitude-based Pruning
A heuristic pruning criterion
• Magnitude-based pruning considers weights with larger absolute values to be more important than other weights.
• For element-wise pruning, Importance = |W|
• Example (row-wise pruning with the L1 norm): for the matrix [[3, −2], [1, −5]], the row importances are |3| + |−2| = 5 and |1| + |−5| = 6, so the first row is pruned: [[0, 0], [1, −5]]
• Example (row-wise pruning with the L2 norm): the row importances are √(3² + (−2)²) = √13 and √(1² + (−5)²) = √26, so again the first row is pruned: [[0, 0], [1, −5]]
• In general, for a structural set S of parameters W^(S), the Lp-norm importance is
‖W^(S)‖p = (∑_{i∈S} |wi|^p)^{1/p}
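A minimal PyTorch sketch of element-wise magnitude pruning, using the same toy matrix as above; the `magnitude_prune` helper is illustrative:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a mask that keeps the (1 - sparsity) fraction of weights with the
    largest absolute value (Importance = |W|) and zeroes out the rest."""
    num_prune = int(weight.numel() * sparsity)
    if num_prune == 0:
        return torch.ones_like(weight, dtype=torch.bool)
    # Threshold = magnitude of the num_prune-th smallest |w|.
    threshold = weight.abs().flatten().kthvalue(num_prune).values
    mask = weight.abs() > threshold
    return mask  # apply with weight.data *= mask during/after training

w = torch.tensor([[3., -2.], [1., -5.]])
mask = magnitude_prune(w, sparsity=0.5)
print(w * mask)   # keeps 3 and -5, zeroes out -2 and 1
```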
Scaling-based Pruning
Pruning criterion for filter pruning
• A scaling factor is associated with each filter (i.e., output channel) in convolutional layers
• The scaling factor is multiplied to the output of that channel
• The scaling factors are trainable parameters
• The filters/output channels with small scaling factor magnitudes will be pruned
[Figure: weights, per-channel scaling factors, and activations; e.g., filter N-1 has scaling factor 0.56, and channels with small scaling factors are pruned.]
Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
Scaling-based Pruning
Pruning criterion for filter pruning
• A scaling factor is associated with each filter (i.e., output channel) in convolutional layers
• The scaling factors can be reused from the batch normalization layer:
zo = γ · (zi − μℬ) / √(σℬ² + ε) + β
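A minimal PyTorch sketch, assuming the BatchNorm γ is used as the per-channel scaling factor as in Network Slimming, of ranking channels by |γ|. The helper below is hypothetical; the paper additionally trains with L1 regularization on γ to push unimportant scaling factors toward zero:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def select_channels_by_bn_scale(bn: nn.BatchNorm2d, sparsity: float) -> torch.Tensor:
    """Rank output channels by the magnitude of the BatchNorm scaling factor
    gamma and return the indices of the channels to keep."""
    importance = bn.weight.abs()             # gamma, one value per channel
    n_keep = max(1, int(round(importance.numel() * (1 - sparsity))))
    keep = importance.topk(n_keep).indices.sort().values
    return keep  # prune the matching conv filters and the next layer's inputs

bn = nn.BatchNorm2d(8)
with torch.no_grad():
    bn.weight.copy_(torch.tensor([0.9, 0.01, 0.5, 0.02, 0.7, 0.03, 0.56, 0.04]))
print(select_channels_by_bn_scale(bn, sparsity=0.5))   # tensor([0, 2, 4, 6])
```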
Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
Second-Order-based Pruning
Minimize the error on the loss function introduced by pruning synapses
• The induced error can be approximated by a Taylor series:
δL = L(x; W) − L(x; WP = W − δW) = ∑i gi·δwi + ½ ∑i hii·δwi² + ½ ∑i≠j hij·δwi·δwj + O(‖δW‖³)
where gi = ∂L/∂wi and hij = ∂²L/(∂wi ∂wj).
• Optimal Brain Damage assumes that the network has converged (so the first-order terms gi·δwi vanish) and that the Hessian is diagonal (so the cross terms hij·δwi·δwj, i ≠ j, can be ignored).
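Under those assumptions the first-order and cross terms drop out, and the change in loss from removing a single weight reduces to a per-weight importance score; a minimal LaTeX sketch of the resulting criterion:

```latex
% With g_i \approx 0 (converged network) and h_{ij} \approx 0 for i \neq j
% (diagonal Hessian), removing weight w_i (\delta w_i = w_i) gives
\delta L_i \;\approx\; \frac{1}{2}\, h_{ii}\, w_i^2 ,
\qquad h_{ii} = \frac{\partial^2 L}{\partial w_i^2}
% so weights with small magnitude and small curvature are pruned first.
```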
[Figure: neuron pruning in a linear layer and the corresponding channel pruning in a convolution layer.]
Percentage-of-Zero-Based Pruning
• ReLU activation will generate zeros in the output activation.
[Figure: example 4×4 post-ReLU output activation maps across several channels; a large fraction of the entries are exactly zero.]
Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures [Hu et al., ArXiv 2017]
Percentage-of-Zero-Based Pruning
• ReLU activation will generate zeros in the output activation.
• Similar to the magnitude of weights, the Average Percentage of Zero activations (APoZ) can be exploited to measure the importance of the neurons.
Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures [Hu et al., ArXiv 2017]
Percentage-of-Zero-Based Pruning
• ReLU activation will generate zeros in the output activation.
• Similar to the magnitude of weights, the Average Percentage of Zero activations (APoZ) can be exploited to measure the importance of the neurons.
• The smaller the APoZ is, the more important the neuron is.
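A minimal PyTorch sketch of computing APoZ per output channel from a batch of post-ReLU activations (NCHW layout assumed; in practice the statistic is averaged over a calibration dataset):

```python
import torch

def average_percentage_of_zeros(activations: torch.Tensor) -> torch.Tensor:
    """APoZ per output channel: fraction of zero entries in the post-ReLU
    activation, averaged over the batch and spatial positions.
    `activations` has shape [batch, channels, height, width]."""
    zeros = (activations == 0).float()
    apoz = zeros.mean(dim=(0, 2, 3))   # one value per channel
    return apoz                        # higher APoZ => less important neuron

acts = torch.relu(torch.randn(16, 4, 8, 8))   # toy post-ReLU activations
print(average_percentage_of_zeros(acts))      # ~0.5 per channel for random input
```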
Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures [Hu et al., ArXiv 2017]
Regression-based Pruning
Minimize reconstruction error of the corresponding layer’s outputs
• Instead of considering the pruning error of the objective function L(x; W), regression-based pruning minimizes the reconstruction error of the corresponding layer's outputs.
[Figure: the original layer computes Z = X·Wᵀ, with input X of shape b × ci, weights W of shape co × ci, and output Z of shape b × co; the pruned layer computes Ẑ = XP·WPᵀ using only the kept input channels and should reconstruct Z as closely as possible.]
Regression-based Pruning
Minimize reconstruction error of the corresponding layer’s outputs
• Let Z = X·Wᵀ = ∑_{c=0}^{ci−1} Xc·Wcᵀ
[Figure: the output decomposes into per-channel contributions: X is split column-wise into X0, X1, X2, X3 and Wᵀ row-wise into W0ᵀ, W1ᵀ, W2ᵀ, W3ᵀ; the pruned layer Ẑ = XP·WPᵀ keeps only a subset of these channel contributions.]
Channel Pruning for Accelerating Very Deep Neural Networks [He et al., ICCV 2017]
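He et al. alternate between selecting which channels to keep (a LASSO problem over per-channel coefficients β) and refitting the remaining weights by least squares so that the pruned layer reproduces the original outputs. The following is a minimal, hypothetical PyTorch sketch of just the least-squares reconstruction step for a linear / 1×1-conv layer, with assumed shapes and a hand-picked `keep` set standing in for the selected channels:

```python
import torch

def reconstruct_pruned_weights(X: torch.Tensor, W: torch.Tensor,
                               keep: torch.Tensor) -> torch.Tensor:
    """Given sampled layer inputs X [N, ci], original weights W [co, ci], and
    the indices of the kept input channels, refit the remaining weights by
    least squares so that X_p @ W_p.T best reconstructs the original output Z."""
    Z = X @ W.T                                   # original outputs, [N, co]
    X_p = X[:, keep]                              # kept input channels, [N, ci']
    # Solve min_{W_p} ||Z - X_p W_p^T||_F^2 (ordinary least squares).
    W_p = torch.linalg.lstsq(X_p, Z).solution.T   # [co, ci']
    return W_p

# Toy usage: prune 2 of 6 input channels of a 4-output linear / 1x1-conv layer.
X = torch.randn(512, 6)
W = torch.randn(4, 6)
keep = torch.tensor([0, 1, 3, 5])                 # e.g., chosen by LASSO/importance
W_p = reconstruct_pruned_weights(X, W, keep)
err = (X[:, keep] @ W_p.T - X @ W.T).norm() / (X @ W.T).norm()
print(f"relative reconstruction error: {err.item():.3f}")
```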
References
1. Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey [Deng et al., IEEE 2020]
2. Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]
3. Optimal Brain Damage [LeCun et al., NeurIPS 1989]
4. Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
5. Efficient Methods and Hardware for Deep Learning [Han S., Stanford University]
6. Peter Huttenlocher (1931–2013) [Walsh, C. A., Nature 2013]
7. Exploring the Granularity of Sparsity in Convolutional Neural Networks [Mao et al., CVPR-W]
8. Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
9. AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
10. Learning Structured Sparsity in Deep Neural Networks [Wen et al., NeurIPS 2016]
11. Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
12. Pruning Convolutional Filters with First Order Taylor Series Ranking [Wang M.]
13. Importance Estimation for Neural Network Pruning [Molchanov et al., CVPR 2019]
14. Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures [Hu et al., ArXiv 2017]
15. Pruning Convolutional Neural Networks for Resource Efficient Inference [Molchanov et al., ICLR 2017]
16. Channel Pruning for Accelerating Very Deep Neural Networks [He et al., ICCV 2017]
17. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression [Luo et al., ICCV 2017]
18. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot [Frantar and Alistarh, ArXiv 2023]