
EfficientML.ai Lecture 03:
Pruning and Sparsity
Part I

Song Han
Associate Professor, MIT
Distinguished Scientist, NVIDIA
@SongHan_MIT

Today’s AI is too BIG!
[Figure: ImageNet Top-1 accuracy (%) vs. MACs (billions) for models including MobileNetV1, ShuffleNet, IGCV3-D, MBNetV2, InceptionV2, ResNet-50, ResNet-101, DenseNet-121/169/264, DPN-92, ResNeXt-50/101, InceptionV3, and Xception; marker size indicates #Parameters (2M–64M).]

Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey [Deng et al., IEEE 2020]
Efficient Deep Learning Techniques are Essential
Bridges the Gap between the Supply and Demand of Computation

[Figure: model size (#params in billions, log scale) by year, 2017–2022 — Transformer 0.05B, GPT 0.11B, BERT 0.34B, GPT-2 1.5B, MegatronLM 8.3B, T-NLG 17B, GPT-3 175B, MT-NLG 530B — versus single-accelerator memory (16GB–80GB: TPUv2, V100, TPUv3, A100 40GB, A100 80GB), assuming FP16 data. Model compression bridges the gap.]

Part 1 of This Course: Efficient Inference

Pruning

Quantization

Neural Architecture Search

Knowledge Distillation
MLPerf (the Olympic Game for AI Computing)
Pruning on Large Language Models
• The open division submission on Llama 2 70B: 2.5x speedup while maintaining 99% accuracy.
• Depth pruning: 80 layers -> 32 layers
• Width pruning: 28,762 intermediate dimensions -> 14,336 intermediate dimensions

Offline samples/sec: 4488 (closed division) vs. 11189 (open division) — 2.5x speedup

Llama 2 70B performance metrics for the closed and open divisions, measured on a single NVIDIA H200 GPU.

NVIDIA Blackwell Platform Sets New LLM Inference Records in MLPerf Inference v4.1
Memory is Expensive
Data Movement → More Memory Reference → More Energy

Rough energy cost for various operations in 45nm, 0.9V:

  32-bit int ADD:        0.1 pJ
  32-bit float ADD:      0.9 pJ
  32-bit register file:  1 pJ
  32-bit int MULT:       3.1 pJ
  32-bit float MULT:     3.7 pJ
  32-bit SRAM cache:     5 pJ
  32-bit DRAM memory:    640 pJ

A DRAM access costs roughly 200x more energy than an on-chip arithmetic operation; the relative energy cost spans about four orders of magnitude.

Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]
Memory is Expensive
Data Movement → More Memory Reference → More Energy

(Same energy-cost table as the previous slide.)

How should we make deep learning more efficient?

Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]
Neural Network Pruning
• Introduction to Pruning
  • What is pruning?
  • How should we formulate pruning?
• Determine the Pruning Granularity
  • In what pattern should we prune the neural network?
• Determine the Pruning Criterion
  • What synapses/neurons should we prune?
• Determine the Pruning Ratio
  • What should the target sparsity be for each layer?
• Fine-tune/Train Pruned Neural Network
  • How should we improve the performance of pruned models?

[Figure: before pruning vs. after pruning — pruning synapses, pruning neurons.]

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Neural Network Pruning
• In general, we can formulate pruning as follows:

  arg min_{W_P} L(x; W_P)
  subject to ∥W_P∥₀ ≤ N

• L represents the objective function for neural network training;
• x is the input, W are the original weights, and W_P are the pruned weights;
• ∥W_P∥₀ counts the number of nonzeros in W_P, and N is the target number of nonzeros.

[Figure: dense training solves arg min_W L(x; W); pruned training solves arg min_{W_P} L(x; W_P) s.t. ∥W_P∥₀ ≤ N.]

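To make the constraint concrete, here is a minimal PyTorch sketch (my illustration, not from the slides) that produces a pruned W_P satisfying ∥W_P∥₀ ≤ N by keeping the N largest-magnitude weights; magnitude is only one possible criterion and is discussed later in this lecture.

```python
import torch

def prune_to_n_nonzeros(weight: torch.Tensor, n_nonzeros: int) -> torch.Tensor:
    """Return a pruned copy of `weight` keeping only the largest-magnitude
    entries, so that the result satisfies ||W_P||_0 <= N."""
    flat = weight.flatten()
    n = min(n_nonzeros, flat.numel())
    keep_idx = torch.topk(flat.abs(), n, largest=True).indices
    mask = torch.zeros_like(flat)
    mask[keep_idx] = 1.0
    return (flat * mask).view_as(weight)

w = torch.randn(4, 4)
w_pruned = prune_to_n_nonzeros(w, n_nonzeros=4)
assert (w_pruned != 0).sum() <= 4
```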
Neural Network Pruning
(Course outline, repeated.)

[Figure: pruning.]

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Neural Network Pruning
(Course outline, repeated — highlighting the Pruning Criterion: which synapses? which neurons?)

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Neural Network Pruning
(Course outline, repeated — highlighting the Pruning Ratio: prune 30%? 50%? 70%?)

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Neural Network Pruning
(Course outline, repeated — highlighting the formulation: arg min_{W_P} L(x; W_P) s.t. ∥W_P∥₀ ≤ N.)

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Neural Network Pruning
(Course outline, repeated.)

[Figure: before pruning vs. after pruning — pruning synapses, pruning neurons.]

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Pruning Happens in Human Brain
[Figure: number of synapses per neuron over time — about 2,500 per neuron at birth [1], peaking around 15,000 per neuron at 2–4 years old [1], and pruned back to roughly 7,000 per neuron by adulthood [2].]

Do We Have Brain to Spare? [Drachman DA, Neurology 2004] Data Source: 1, 2


Peter Huttenlocher (1931–2013) [Walsh, C. A., Nature 2013] Slide Inspiration: Alila Medical Media
Neural Network Pruning
Make neural network smaller by removing synapses and neurons
[Figure: before pruning vs. after pruning — pruning synapses, pruning neurons.]

Optimal Brain Damage [LeCun et al., NeurIPS 1989]


Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Neural Network Pruning
Make neural network smaller by removing synapses and neurons

[Figure: accuracy loss (+0.5% to −4.5%) vs. pruning ratio (40%–100% of parameters pruned away). Pipeline so far: Train Connectivity.]

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Neural Network Pruning
Make neural network smaller by removing synapses and neurons

[Figure: accuracy loss vs. pruning ratio for pruning alone — accuracy drops sharply as more parameters are pruned away. Pipeline: Train Connectivity → Prune Connections.]

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Neural Network Pruning
Make neural network smaller by removing synapses and neurons

[Figure: accuracy loss vs. pruning ratio for pruning vs. pruning + finetuning — finetuning recovers most of the accuracy lost by pruning alone. Pipeline: Train Connectivity → Prune Connections → Train Weights.]
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Neural Network Pruning
Make neural network smaller by removing synapses and neurons

[Figure: accuracy loss vs. pruning ratio for pruning, pruning + finetuning, and iterative pruning and finetuning — iterative pruning and finetuning maintains accuracy at the highest pruning ratios (up to about 90% of parameters pruned away). Pipeline: Train Connectivity → Prune Connections → Train Weights, iterated.]
Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
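To make the iterative recipe concrete, here is a minimal, hypothetical PyTorch sketch of the train → prune → finetune loop. The helper `finetune_one_epoch`, the layer-wise magnitude criterion, and the schedule are illustrative assumptions, not the exact setup from the paper.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def apply_magnitude_masks(model: nn.Module, sparsity: float) -> dict:
    """Zero out the smallest-magnitude weights of every Linear/Conv2d layer
    (layer-wise, at the given sparsity) and return the binary masks."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            w = module.weight
            k = int(sparsity * w.numel())
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            mask = (w.abs() > threshold).to(w.dtype)
            w.mul_(mask)
            masks[name] = mask
    return masks

def iterative_prune(model, finetune_one_epoch, final_sparsity=0.9, steps=5, epochs_per_step=2):
    """Iterative pruning and finetuning: gradually raise the sparsity and
    finetune after each pruning step, keeping pruned weights at zero."""
    for step in range(1, steps + 1):
        sparsity = final_sparsity * step / steps
        masks = apply_magnitude_masks(model, sparsity)
        for _ in range(epochs_per_step):
            finetune_one_epoch(model)            # user-provided training epoch
            with torch.no_grad():                # re-apply masks after each epoch
                for name, module in model.named_modules():
                    if name in masks:
                        module.weight.mul_(masks[name])
    return model
```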
Neural Network Pruning
Make neural network smaller by removing synapses and neurons

Neural Network | #Parameters Before Pruning | #Parameters After Pruning | #Parameters Reduction | MACs Reduction
AlexNet    | 61 M  | 6.7 M  | 9×   | 3×
VGG-16     | 138 M | 10.3 M | 12×  | 5×
GoogleNet  | 7 M   | 2.0 M  | 3.5× | 5×
ResNet50   | 26 M  | 7.47 M | 3.4× | 6.3×
SqueezeNet | 1 M   | 0.38 M | 3.2× | 3.5×

Efficient Methods and Hardware for Deep Learning [Han S., Stanford University]
Neural Network Pruning
Pruning the NeuralTalk LSTM does not hurt image caption quality.

Baseline: a basketball player in a white uniform is playing with a ball. → Pruned 90%: a basketball player in a white uniform is playing with a basketball.
Baseline: a brown dog is running through a grassy field. → Pruned 90%: a brown dog is running through a grassy area.
Baseline: a man is riding a surfboard on a wave. → Pruned 90%: a man in a wetsuit is riding a wave on a beach.
Baseline: a soccer player in red is running in the field. → Pruned 95%: a man in a red shirt and black and white black shirt is running through a field.
Efficient Methods and Hardware for Deep Learning [Han S., Stanford University]
Neural Network Pruning
Make neural network smaller by removing synapses and neurons
[Figure: number of publications on pruning and sparse neural networks per year, 1989–2022 (y-axis up to 3,200), with milestones marked: Optimal Brain Damage (1989), Deep Compression, and EIE. Also shown: the title pages of Optimal Brain Damage [LeCun, Denker and Solla, AT&T Bell Laboratories] and Learning both Weights and Connections for Efficient Neural Networks [Han, Pool, Tran, Dally — Stanford University / NVIDIA]. Source: https://github.com/mit-han-lab/pruning-sparsity-publications]
Pruning in the Industry
Hardware support for sparsity

EIE [Han et al., ISCA 2016]
ESE [Han et al., FPGA 2017]
SpArch [Zhang et al., HPCA 2020]
SpAtten [Wang et al., HPCA 2021]
2:4 sparsity in the A100 GPU: 2x peak performance, 1.5x measured BERT speedup

Pruning in the Industry
Hardware support for sparsity

EIE [Han et al., ISCA 2016]
ESE [Han et al., FPGA 2017]
SpArch [Zhang et al., HPCA 2020]
SpAtten [Wang et al., HPCA 2021]
Reduce model complexity by 5x to 50x with minimal accuracy impact

Neural Network Pruning
(Course outline, repeated — next: Determine the Pruning Granularity.)

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Section 2: Pruning Granularity
Pruning can be performed at different granularities, from structured to unstructured.
Pruning at Different Granularities
A simple example of 2D weight matrix

Pruning at Different Granularities
A simple example of 2D weight matrix

Preserved
Pruned

Fine-grained/Unstructured
• More flexible pruning index choice
• Hard to accelerate (irregular)
Pruning at Different Granularities
A simple example of 2D weight matrix

Preserved
Pruned

Fine-grained/Unstructured
• More flexible pruning index choice
• Hard to accelerate (irregular)

Coarse-grained/Structured
• Less flexible pruning index choice (a subset of the fine-grained case)
• Easy to accelerate (just a smaller matrix!)
Pruning at Different Granularities
The case of convolutional layers
• The weights of convolutional layers have 4 dimensions [co, ci, kh, kw]:
• ci: input channels (or channels)
• co: output channels (or filters)
• kh: kernel size height
• kw: kernel size width

• The 4 dimensions give us more choices to select pruning granularities

Pruning at Different Granularities
The case of convolutional layers
• Some of the commonly used pruning granularities

[Figure: notation — an example conv weight tensor with co = 3, ci = 2, kh = 3, kw = 3; colors mark preserved vs. pruned weights.]
Exploring the granularity of sparsity in convolutional neural networks [Mao et al., CVPR-W]
Pruning at Different Granularities
The case of convolutional layers
• Some of the commonly used pruning granularities (for the co = 3, ci = 2, kh = 3, kw = 3 example):

[Figure: sparsity patterns from irregular to regular — Fine-grained Pruning, Pattern-based Pruning, Vector-level Pruning, Kernel-level Pruning, Channel-level Pruning ("like Tetris :)").]

Exploring the granularity of sparsity in convolutional neural networks [Mao et al., CVPR-W]
Pruning at Different Granularities
The case of convolutional layers
• Some of the commonly used pruning granularities — pros? cons?

[Figure: the same granularity spectrum, from irregular to regular — Fine-grained, Pattern-based, Vector-level, Kernel-level, Channel-level Pruning.]
Exploring the granularity of sparsity in convolutional neural networks [Mao et al., CVPR-W]
Pruning at Different Granularities
Let’s look into some cases
• Fine-grained Pruning (the case we show before)
• Flexible pruning indices

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Pruning at Different Granularities
Let’s look into some cases
• Fine-grained Pruning (the case we show before)
• Flexible pruning indices
• Usually larger compression ratio, since we can flexibly find "redundant" weights (we will later discuss how we find them)

Neural Network | #Parameters Before Pruning | #Parameters After Pruning | Reduction
AlexNet   | 61 M  | 6.7 M  | 9×
VGG-16    | 138 M | 10.3 M | 12×
GoogleNet | 7 M   | 2.0 M  | 3.5×
ResNet50  | 26 M  | 7.47 M | 3.4×

Efficient Methods and Hardware for Deep Learning [Han S., Stanford University]
Pruning at Different Granularities
Let’s look into some cases
• Fine-grained Pruning (the case we show before)
• Flexible pruning indices
• Usually larger compression ratio, since we can flexibly find "redundant" weights (we will later discuss how we find them)
• Can deliver speedup on some custom hardware (e.g., EIE) but not (easily) on GPUs

Pruning at Different Granularities
The case of convolutional layers
• Some of the commonly used pruning granularities — pros? cons?

[Figure: the same granularity spectrum, from irregular to regular — Fine-grained, Pattern-based, Vector-level, Kernel-level, Channel-level Pruning.]
Exploring the granularity of sparsity in convolutional neural networks [Mao et al., CVPR-W]
Pruning at Different Granularities
Let’s look into some cases
• Pattern-based Pruning: N:M sparsity
• N:M sparsity means that in each group of M contiguous elements, N of them are pruned

Dense Matrix

Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
Pruning at Different Granularities
Let’s look into some cases
• Pattern-based Pruning: N:M sparsity
• N:M sparsity means that in each group of M contiguous elements, N of them are pruned
• A classic case is 2:4 sparsity (50% sparsity)

Dense Matrix 2:4 Sparse Matrix

Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
Pruning at Different Granularities
Let’s look into some cases
• Pattern-based Pruning: N:M sparsity
• N:M sparsity means that in each group of M contiguous elements, N of them are pruned
• A classic case is 2:4 sparsity (50% sparsity)
• It is supported by NVIDIA's Ampere GPU architecture, which delivers up to 2x speedup

[Figure: dense matrix → 2:4 sparse matrix → compressed matrix (non-zero values plus 2-bit indices).]

Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
Pruning at Different Granularities
Let’s look into some cases
• Pattern-based Pruning: N:M sparsity
• N:M sparsity means that in each group of M contiguous elements, N of them are pruned
• A classic case is 2:4 sparsity (50% sparsity)
• It is supported by NVIDIA's Ampere GPU architecture, which delivers ~2x speedup
• Usually maintains accuracy (tested on a variety of tasks)

Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
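As an illustration (not from the slides), here is a minimal sketch that enforces a 2:4 pattern on a weight matrix by zeroing the two smallest-magnitude entries in every group of four along the last dimension — the structured format that Ampere's sparse tensor cores can then exploit.

```python
import torch

def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Enforce 2:4 sparsity: in every contiguous group of 4 elements along the
    last dimension, keep the 2 largest-magnitude weights and zero the rest."""
    assert weight.shape[-1] % 4 == 0, "last dim must be a multiple of 4"
    groups = weight.reshape(-1, 4)
    # Indices of the 2 largest-magnitude elements in each group of 4
    keep = torch.topk(groups.abs(), k=2, dim=1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(1, keep, 1.0)
    return (groups * mask).reshape(weight.shape)

w = torch.randn(8, 16)
w_24 = prune_2_to_4(w)
assert (w_24.reshape(-1, 4) != 0).sum(dim=1).max() <= 2  # at most 2 nonzeros per group
```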
Pruning at Different Granularities
The case of convolutional layers
• Some of the commonly used pruning granularities — pros? cons?

[Figure: the same granularity spectrum, from irregular to regular — Fine-grained, Pattern-based, Vector-level, Kernel-level, Channel-level Pruning.]
Exploring the granularity of sparsity in convolutional neural networks [Mao et al., CVPR-W]
Pruning at Different Granularities
Let’s look into some cases
• Channel Pruning
  • Pro: direct speedup from reduced channel counts (the result is simply a network with fewer channels)
  • Con: smaller compression ratio

[Figure: channel pruning assigns a per-layer sparsity, e.g. Layer 0: 0.5, Layer 1: 0.3, Layer 2: 0.7, Layer 3: 0.2, Layer 4: 0.3, …]

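A minimal, hypothetical sketch of channel pruning for a single Conv2d layer: rank output channels by the L2 norm of their filters (one common criterion; the criteria section of this lecture covers alternatives) and physically remove the lowest-ranked ones, producing a genuinely smaller layer. Handling of the downstream layer's input channels is omitted for brevity.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def prune_conv_out_channels(conv: nn.Conv2d, keep_ratio: float) -> nn.Conv2d:
    """Return a new Conv2d keeping the output channels whose filters have the
    largest L2 norm. The next layer's input channels must be pruned to match."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    importance = conv.weight.flatten(1).norm(p=2, dim=1)        # one value per filter
    keep_idx = torch.topk(importance, n_keep).indices.sort().values
    new_conv = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    new_conv.weight.copy_(conv.weight[keep_idx])
    if conv.bias is not None:
        new_conv.bias.copy_(conv.bias[keep_idx])
    return new_conv

conv = nn.Conv2d(16, 32, 3, padding=1)
smaller = prune_conv_out_channels(conv, keep_ratio=0.5)   # 32 -> 16 output channels
print(smaller.weight.shape)                               # torch.Size([16, 16, 3, 3])
```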
Pruning at Different Granularities
Let’s look into some cases
• Channel Pruning
  • Pro: direct speedup from reduced channel counts (the result is simply a network with fewer channels)
  • Con: smaller compression ratio
• We will later discuss how to find the per-layer sparsity ratios.

[Figure: uniform shrink (the same sparsity, e.g. 0.3, for every layer) vs. channel pruning with per-layer sparsities (0.5, 0.3, 0.7, 0.2, 0.3, …).]

Pruning at Different Granularities
Let’s look into some cases
• Channel Pruning
  • Pro: direct speedup from reduced channel counts (the result is simply a network with fewer channels)
  • Con: smaller compression ratio

[Figure: ImageNet accuracy (%) vs. latency (ms) — channel pruning with AMC achieves a better accuracy–latency trade-off than uniform scaling.]

AMC: Automl for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
Neural Network Pruning
(Course outline, repeated — next: Determine the Pruning Criterion. Which synapses? Which neurons?)

Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Section 3: Pruning Criterion
What synapses and neurons should we prune?

Selection of Synapses to Prune
• When removing parameters from a neural network model,
  • the less important the removed parameters are,
  • the better the performance of the pruned neural network is.

[Figure: a single neuron computing y = f(Σᵢ wᵢxᵢ + b).]
• Example: f(·) = ReLU(·), W = [10, −8, 0.1]  ⇒  y = ReLU(10x₀ − 8x₁ + 0.1x₂)
• If one weight has to be removed, which one?

Magnitude-based Pruning
A heuristic pruning criterion
• Magnitude-based pruning considers weights with larger absolute values to be more important than other weights.
• For element-wise pruning,
  Importance = |W|
• Example (element-wise, L1-norm):
  Weight [[3, −2], [1, −5]] → Importance [[3, 2], [1, 5]] → Pruned weight [[3, 0], [0, −5]]


Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
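PyTorch ships this heuristic in `torch.nn.utils.prune`; the snippet below applies element-wise L1-magnitude pruning to a linear layer (the 50% amount is an arbitrary choice for illustration).

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(4, 4)
# Zero out the 50% of weights with the smallest |w| (element-wise magnitude pruning)
prune.l1_unstructured(layer, name="weight", amount=0.5)

print(layer.weight)             # pruned weights (weight_orig * weight_mask)
print(layer.weight_mask.sum())  # 8 of 16 entries kept
prune.remove(layer, "weight")   # optionally make the pruning permanent
```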
Magnitude-based Pruning
A heuristic pruning criterion
• Magnitude-based pruning considers weights with larger absolute values to be more important than other weights.
• For row-wise pruning, the L1-norm magnitude can be defined as
  Importance = Σ_{i∈S} |wᵢ|, where W^(S) is the structural set S of parameters in W.
• Example (row-wise, L1-norm):
  Weight [[3, −2], [1, −5]] → Importance [|3| + |−2|, |1| + |−5|] = [5, 6] → Pruned weight [[0, 0], [1, −5]]


Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Magnitude-based Pruning
A heuristic pruning criterion
• Magnitude-based pruning considers weights with larger absolute values to be more important than other weights.
• For row-wise pruning, the L2-norm magnitude can be defined as
  Importance = sqrt(Σ_{i∈S} |wᵢ|²), where W^(S) is the structural set S of parameters in W.
• Example (row-wise, L2-norm):
  Weight [[3, −2], [1, −5]] → Importance [√(3² + 2²), √(1² + 5²)] = [√13, √26] → Pruned weight [[0, 0], [1, −5]]


Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
Magnitude-based Pruning
A heuristic pruning criterion
• Magnitude-based pruning considers weights with larger absolute values to be more important than other weights.
• Magnitude is also known as the Lp-norm, defined as
  ∥W^(S)∥ₚ = (Σ_{i∈S} |wᵢ|ᵖ)^(1/p), where W^(S) is a structural set of parameters.
• Example (row-wise, L2-norm):
  Weight [[3, −2], [1, −5]] → Importance [√13, √26] → Pruned weight [[0, 0], [1, −5]]


Learning Structured Sparsity in Deep Neural Networks [Wen et al., NeurIPS 2016]
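For structured (row-wise) magnitude pruning, `torch.nn.utils.prune.ln_structured` implements the same Lp-norm ranking; below it removes the rows (dim=0) of a linear layer's weight with the smallest L2 norm (the 50% amount is illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(4, 4)
# Remove 50% of the rows of `weight` (output neurons), ranked by L2 norm (n=2)
prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)

row_norms = layer.weight.norm(p=2, dim=1)
print(row_norms)  # half of the rows are now exactly zero
```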
Scaling-based Pruning
Pruning criterion for filter pruning
• A scaling factor is associated with each filter (i.e., output channel) in convolutional layers
• The scaling factor is multiplied with the output of that channel
• The scaling factors are trainable parameters

[Figure: filters (Filter 0 … Filter N−1) with per-channel scaling factors (e.g. 1.17, 0.10, 0.29, 0.82, …, 0.56) applied to the corresponding output channel activations (Channel 0 … Channel N−1).]

Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
Scaling-based Pruning
Pruning criterion for filter pruning
• A scaling factor is associated with each filter (i.e., output channel) in convolutional layers
• The scaling factor is multiplied with the output of that channel
• The scaling factors are trainable parameters
• The filters/output channels with small scaling factor magnitudes are pruned

[Figure: channels with small scaling factors (e.g. 0.10 and 0.29) are removed; channels with larger factors (1.17, 0.82, …, 0.56) are kept.]

Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
Scaling-based Pruning
Pruning criterion for filter pruning
• A scaling factor is associated with each filter (i.e., output channel) in convolutional layers
• The scaling factors can be reused from the batch normalization layer:

  z_o = γ · (z_i − μ_B) / sqrt(σ_B² + ϵ) + β

[Figure: as before — channels whose scaling factor (the BN γ) is small are pruned.]

Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
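A minimal sketch of this idea (a simplified take on Network Slimming, not the authors' code): read each BatchNorm2d's γ as the channel importance and select the channels to keep; in the full method the γ's are additionally trained with an L1 sparsity penalty, sketched below as well.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def select_channels_by_bn_gamma(bn: nn.BatchNorm2d, keep_ratio: float) -> torch.Tensor:
    """Return indices of channels to keep, ranked by |gamma| of the BN layer."""
    n_keep = max(1, int(bn.num_features * keep_ratio))
    importance = bn.weight.abs()                  # gamma, one value per channel
    return torch.topk(importance, n_keep).indices.sort().values

def slimming_penalty(model: nn.Module, lam: float = 1e-4) -> torch.Tensor:
    """L1 penalty on all BN gammas, pushing unimportant channels toward zero."""
    return lam * sum(m.weight.abs().sum()
                     for m in model.modules() if isinstance(m, nn.BatchNorm2d))

bn = nn.BatchNorm2d(8)
keep = select_channels_by_bn_gamma(bn, keep_ratio=0.5)
print(keep)  # indices of the 4 channels with the largest |gamma|
```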
Second-Order-based Pruning
Minimize the error on the loss function introduced by pruning synapses
• The induced error can be approximated by a Taylor series:

  δL = L(x; W) − L(x; W_P = W − δW)
     = Σᵢ gᵢ δwᵢ + ½ Σᵢ hᵢᵢ δwᵢ² + ½ Σ_{i≠j} hᵢⱼ δwᵢ δwⱼ + O(∥δW∥³),

  where gᵢ = ∂L/∂wᵢ and hᵢⱼ = ∂²L/(∂wᵢ ∂wⱼ).

• Optimal Brain Damage assumes that:
Optimal Brain Damage [LeCun et al., NeurIPS 1989]


Second-Order-based Pruning
Minimize the error on the loss function introduced by pruning synapses
• (Same Taylor expansion as on the previous slide.)
• Optimal Brain Damage assumes that
  • the objective function L is nearly quadratic: the last term is neglected.

Optimal Brain Damage [LeCun et al., NeurIPS 1989]


Second-Order-based Pruning
Minimize the error on the loss function introduced by pruning synapses
• (Same Taylor expansion as above.)
• Optimal Brain Damage assumes that
  • the objective function L is nearly quadratic: the last term is neglected;
  • the neural network training has converged: the first-order terms are neglected.

Optimal Brain Damage [LeCun et al., NeurIPS 1989]


Second-Order-based Pruning
Minimize the error on the loss function introduced by pruning synapses
• (Same Taylor expansion as above.)
• Optimal Brain Damage assumes that
  • the objective function L is nearly quadratic: the last term is neglected;
  • the neural network training has converged: the first-order terms are neglected;
  • the error caused by deleting each parameter is independent: the cross terms are neglected.

Optimal Brain Damage [LeCun et al., NeurIPS 1989]


Second-Order-based Pruning
Minimize the error on the loss function introduced by pruning synapses
• (Same Taylor expansion and assumptions as above.) Under these assumptions, the error from removing a single weight wᵢ is

  δLᵢ = L(x; W) − L(x; W_P | wᵢ = 0) ≈ ½ hᵢᵢ wᵢ²

Optimal Brain Damage [LeCun et al., NeurIPS 1989]


Second-Order-based Pruning
Minimize the error on loss function introduced by pruning synapses
• Optimal Brain Damage assumes that
  • the objective function L is nearly quadratic,
  • the neural network training has converged, and
  • the error caused by deleting each parameter is independent.
• Then

  δLᵢ = L(x; W) − L(x; W_P | wᵢ = 0) ≈ ½ hᵢᵢ wᵢ², where hᵢᵢ = ∂²L/∂wᵢ²

• The synapses with the smallest induced error |δLᵢ| are removed; that is,

  importance_{wᵢ} = |δLᵢ| = ½ hᵢᵢ wᵢ²  (note that hᵢᵢ is non-negative)

• The Hessian matrix H is difficult to compute.

Optimal Brain Damage [LeCun et al., NeurIPS 1989]


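Since the full Hessian is impractical, implementations typically approximate the diagonal hᵢᵢ. The sketch below uses the common empirical-Fisher approximation hᵢᵢ ≈ E[(∂L/∂wᵢ)²] — an assumption on my part, not the exact scheme in the OBD paper, which derives the diagonal via a dedicated backpropagation pass.

```python
import torch
import torch.nn as nn

def obd_importance(model: nn.Module, loss_fn, data_loader, n_batches: int = 8):
    """Approximate OBD importance 0.5 * h_ii * w_i^2 per parameter, with
    h_ii estimated by the average squared gradient (empirical Fisher)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for i, (x, y) in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: 0.5 * (fisher[n] / n_batches) * p.detach() ** 2
            for n, p in model.named_parameters()}
```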
Selection of Neurons to Prune
• When removing neurons from a neural network model,
  • the less useful the removed neurons are,
  • the better the performance of the pruned neural network is.
• Neuron pruning is coarse-grained weight pruning.

[Figure: removing a neuron in a linear layer, or a channel in a convolution layer, removes an entire row/column of the weight matrix.]

Percentage-of-Zero-Based Pruning
• ReLU activation will generate zeros in the output activation.

[Figure: example output activations after ReLU — a batch of 2 samples, 3 channels, each 4×4; many entries are exactly zero.]

Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures [Hu et al., ArXiv 2017]
Percentage-of-Zero-Based Pruning
• ReLU activation will generate zeros in the output activation.
• Similar to magnitude of weights, the Average Percentage of Zero activations (APoZ) can be
exploited to measure the importance of the neurons.

[Figure: the same example activations. Average Percentage of Zeros (APoZ) per channel, over a batch of 2 samples of 4×4 activations:
  Channel 0: (5 + 6) / (2 · 4 · 4) = 11/32,  Channel 1: (5 + 7) / (2 · 4 · 4) = 12/32,  Channel 2: (6 + 8) / (2 · 4 · 4) = 14/32.]

Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures [Hu et al., ArXiv 2017]
Percentage-of-Zero-Based Pruning
• ReLU activation will generate zeros in the output activation.
• Similar to magnitude of weights, the Average Percentage of Zero activations (APoZ) can be
exploited to measure the importance of the neurons.
• The smaller the APoZ is, the more important the neuron is.

[Figure: the same example — channel 0 has the smallest APoZ (11/32) and is the most important; channel 2 has the largest APoZ (14/32) and would be pruned first.]

Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures [Hu et al., ArXiv 2017]
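A small sketch of computing APoZ per channel from post-ReLU activations (assuming an activation tensor laid out as [batch, channels, height, width]); channels with the largest APoZ are candidates for pruning.

```python
import torch

def apoz_per_channel(activations: torch.Tensor) -> torch.Tensor:
    """Average Percentage of Zeros per channel for a [batch, C, H, W] tensor."""
    zeros = (activations == 0).float()
    return zeros.mean(dim=(0, 2, 3))                 # average over batch and spatial dims

acts = torch.relu(torch.randn(2, 3, 4, 4))           # e.g. 2 samples, 3 channels, 4x4
apoz = apoz_per_channel(acts)
prune_order = torch.argsort(apoz, descending=True)   # highest-APoZ channels first
print(apoz, prune_order)
```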
Regression-based Pruning
Minimize reconstruction error of the corresponding layer’s outputs
• Instead of considering the pruning error on the objective function L(x; W), regression-based pruning minimizes the reconstruction error of the corresponding layer's outputs.

[Figure: Z = X Wᵀ, where X is the layer input (b × cᵢ), W the weights (c_o × cᵢ), and Z the output (b × c_o).]

Regression-based Pruning
Minimize reconstruction error of the corresponding layer’s outputs
• Instead of considering the pruning error on the objective function L(x; W), regression-based pruning minimizes the reconstruction error of the corresponding layer's outputs.

[Figure: original layer Z = X Wᵀ vs. pruned layer Ẑ = X_P W_Pᵀ, where X_P and W_P keep only the selected input channels.]

Regression-based Pruning
Minimize reconstruction error of the corresponding layer’s outputs
• Regression-based pruning minimizes the error between the original output Z = X Wᵀ and the pruned output Ẑ = X_P W_Pᵀ.

[Figure: original layer Z = X Wᵀ vs. pruned layer Ẑ = X_P W_Pᵀ — minimize the error between Z and Ẑ.]

Regression-based Pruning
Minimize reconstruction error of the corresponding layer’s outputs
• Let Z = X Wᵀ = Σ_{c=0}^{cᵢ−1} X_c W_cᵀ (a sum over input channels c).
• Minimize the error between Z and the pruned output Ẑ = X_P W_Pᵀ.

[Figure: X split into per-channel columns X₀, X₁, X₂, X₃ and W into per-channel rows W₀ᵀ, W₁ᵀ, W₂ᵀ, W₃ᵀ.]

Regression-based Pruning
Minimize reconstruction error of the corresponding layer’s outputs
• Let Z = X Wᵀ = Σ_{c=0}^{cᵢ−1} X_c W_cᵀ.
• The problem can be formulated as

  arg min_{W, β} ∥Z − Ẑ∥_F² = ∥Z − Σ_{c=0}^{cᵢ−1} β_c X_c W_cᵀ∥_F²
  subject to ∥β∥₀ ≤ N_c

• β is a coefficient vector of length cᵢ used for channel selection; β_c = 0 means channel c is pruned.
• N_c is the number of nonzero channels.

Channel Pruning for Accelerating Very Deep Neural Networks [He et al., ICCV 2017]
Regression-based Pruning
Minimize reconstruction error of the corresponding layer’s outputs
• (Same formulation as on the previous slide.)
• Solve the problem by alternating:
  • Fix W, solve β for channel selection (a LASSO regression in the paper)
  • Fix β, solve W to minimize the reconstruction error (least squares)
Channel Pruning for Accelerating Very Deep Neural Networks [He et al., ICCV 2017]
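A toy sketch of the reconstruction idea (assumptions: X is the unfolded input of one layer and W its weights; channel selection is done greedily by reconstruction error rather than by the paper's LASSO step, purely for brevity).

```python
import torch

def channel_select_and_reconstruct(X, W, n_keep):
    """Greedy channel selection + least-squares weight reconstruction.
    X: [b, ci] layer input, W: [co, ci] weights, Z = X @ W.T is the target output."""
    Z = X @ W.T
    keep = []
    for _ in range(n_keep):
        best_c, best_err = None, float("inf")
        for c in range(X.shape[1]):
            if c in keep:
                continue
            cols = keep + [c]
            # Least-squares reconstruction of Z from the selected input channels
            W_new = torch.linalg.lstsq(X[:, cols], Z).solution      # [len(cols), co]
            err = (X[:, cols] @ W_new - Z).pow(2).sum()
            if err < best_err:
                best_c, best_err = c, err
        keep.append(best_c)
    W_new = torch.linalg.lstsq(X[:, keep], Z).solution
    return keep, W_new.T    # kept channel indices and reconstructed [co, n_keep] weights

X, W = torch.randn(64, 8), torch.randn(16, 8)
keep, W_p = channel_select_and_reconstruct(X, W, n_keep=4)
print(keep, W_p.shape)      # 4 selected input channels, weights of shape [16, 4]
```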
Summary of Today’s Lecture Pruning Demo
In this lecture, we introduced:
• What is pruning
• Granularities of pruning before pruning after pruning

• Criteria to select weights to prune


• We will cover in the next lecture: pruning
synapses
• How to nd pruning ratio for each layer
• How to train/ ne-tune the pruned layer
pruning
• Automated ways to nd pruning ratios neurons

• System support for di erent granularities

MIT 6.5940: TinyML and E cient Deep Learning Computing https://e cientml.ai 73
ffi
fi
fi
ffi
fi
ff
References
1. Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey [Deng et al., IEEE 2020]
2. Computing's Energy Problem (and What We Can Do About it) [Horowitz, M., IEEE ISSCC 2014]
3. Optimal Brain Damage [LeCun et al., NeurIPS 1989]
4. Learning Both Weights and Connections for Efficient Neural Network [Han et al., NeurIPS 2015]
5. Efficient Methods and Hardware for Deep Learning [Han S., Stanford University]
6. Peter Huttenlocher (1931–2013) [Walsh, C. A., Nature 2013]
7. Exploring the Granularity of Sparsity in Convolutional Neural Networks [Mao et al., CVPR-W]
8. Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
9. AMC: AutoML for Model Compression and Acceleration on Mobile Devices [He et al., ECCV 2018]
10. Learning Structured Sparsity in Deep Neural Networks [Wen et al., NeurIPS 2016]
11. Learning Efficient Convolutional Networks through Network Slimming [Liu et al., ICCV 2017]
12. Pruning Convolutional Filters with First Order Taylor Series Ranking [Wang M.]
13. Importance Estimation for Neural Network Pruning [Molchanov et al., CVPR 2019]
14. Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures [Hu et al., ArXiv 2017]
15. Pruning Convolutional Neural Networks for Resource Efficient Inference [Molchanov et al., ICLR 2017]
16. Channel Pruning for Accelerating Very Deep Neural Networks [He et al., ICCV 2017]
17. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression [Luo et al., ICCV 2017]
18. SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot [Elias Frantar, Dan Alistarh, ArXiv 2023]