2006.03669v2
James T. O’ Neill
Department of Computer Science
University of Liverpool
Liverpool, England, L69 3BX
[email protected]
arXiv:2006.03669v2 [cs.LG] 1 Aug 2020
Abstract
Overparameterized networks trained to convergence have shown impressive performance in domains such as computer vision and natural language processing. Pushing the state of the art on salient tasks within these domains has meant that these models have become larger and more difficult for machine learning practitioners to use, given their increasing memory and storage requirements, not to mention their larger carbon footprint. Thus, in recent years there has been a resurgence in model compression techniques, particularly for deep convolutional neural networks and self-attention based networks such as the Transformer. Hence, in this paper we provide a timely overview of both old and current compression techniques for deep neural networks, including pruning, quantization, tensor decomposition, knowledge distillation and combinations thereof.
Contents
1 Introduction 4
1.1 Further Motivation for Compression . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Categorizations of Compression Techniques . . . . . . . . . . . . . . . . . . . 6
1.3 Compression Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Weight Sharing 7
2.1 Clustering-based Weight Sharing . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Learning Weight Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Weight Sharing in Large Architectures . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Reusing Layers Recursively . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Network Pruning 12
3.1 Categorizing Pruning Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Pruning using Weight Regularization . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Pruning via Loss Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3.1 Pruning using Second Order Derivatives . . . . . . . . . . . . . . . . . 16
3.3.2 Pruning using First Order Derivatives . . . . . . . . . . . . . . . . . . 18
3.4 Structured Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4.1 Structured Pruning via Weight Regularization . . . . . . . . . . . . . . 20
3.4.2 Structured Pruning via Loss Sensitivity . . . . . . . . . . . . . . . . . 20
3.4.3 Sparse Bayesian Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Search-based Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5.1 Evolutionary-Based Pruning . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5.2 Sequential Monte Carlo & Reinforcement Learning Based Pruning . . 25
3.6 Pruning Before Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.6.1 Pruning to Search for Optimal Architectures . . . . . . . . . . . . . . 28
3.6.2 Few-Shot and Data-Free Pruning Before Training . . . . . . . . . . . . 29
5 Knowledge Distillation 34
5.1 Analysis of Knowledge Distillation . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2 Data-Free Knowledge Distillation . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.3 Distilling Recurrent (Autoregressive) Neural Networks . . . . . . . . . . . . . 39
5.4 Distilling Transformer-based (Non-Autoregressive) Networks . . . . . . . . . . 39
5.5 Ensemble-based Knowledge Distillation . . . . . . . . . . . . . . . . . . . . . . 40
5.6 Reinforcement Learning Based Knowledge Distillation . . . . . . . . . . . . . 42
5.7 Generative Modelling Based Knowledge Distillation . . . . . . . . . . . . . . . 43
5.7.1 Variational Inference Learned Student . . . . . . . . . . . . . . . . . . 43
5.7.2 Generative Adversarial Student . . . . . . . . . . . . . . . . . . . . . . 43
5.8 Pairwise-based Knowledge Distillation . . . . . . . . . . . . . . . . . . . . . . 45
6 Quantization 47
6.1 Approximating High Resolution Computation . . . . . . . . . . . . . . . . . . 48
6.2 Adaptive Ranges and Clipping . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.3 Robustness to Quantization and Related Distortions . . . . . . . . . . . . . . 50
6.4 Retraining Quantized Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.4.1 Loss-aware quantization . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.4.2 Differentiable Quantization . . . . . . . . . . . . . . . . . . . . . . . . 54
7 Summary 58
7.1 Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
1. Introduction
Deep neural networks (DNNs) are becoming increasingly large, pushing the limits of generalization performance and tackling more complex problems in areas such as computer vision (CV), natural language processing (NLP), robotics and speech, to name a few. For example, Transformer-based architectures (Vaswani et al., 2017; Sanh et al., 2019; Liu et al., 2019b; Yang et al., 2019; Lan et al., 2019; Devlin et al., 2018) that are commonly used in NLP (and to a lesser extent in CV (Parmar et al., 2018)) have millions of parameters in each fully-connected layer. Similarly, Convolutional Neural Network (CNN; Fukushima, 1980) based architectures (Krizhevsky et al., 2012; He et al., 2016b; Zagoruyko and Komodakis, 2016b; He et al., 2016a) used in vision and NLP tasks (Kim, 2014; Hu et al., 2014; Gehring et al., 2017) also contain a large number of parameters.
From the left of Figure 1, we see that, in general, larger overparameterized CNNs generalize better on ImageNet (a large image classification benchmark dataset). However, recent architectures that aim to reduce the number of floating point operations (FLOPs) and improve training efficiency with fewer parameters have also shown impressive performance, e.g. EfficientNet (Tan and Le, 2019b).
The increase in Transformer network size, shown on the right, is more pronounced given
that the network consists of fully-connected layers that contain many parameters in each
self-attention block (Vaswani et al., 2017). MegatronLM (Shoeybi et al., 2019), shown on the
right-hand side, is a 72-layer GPT-2 model consisting of 8.3 billion parameters, trained by
using 8-way model parallelism and 64-way data parallelism over 512 GPUs. Rosset (2019)
proposed a 17 billion parameter Transformer model for natural language text generation
(NLG) that consists of 78 layers with hidden size of 4,256 and each block containing 28
attention heads. They use DeepSpeed 1 with ZeRO (Rajbhandari et al., 2019) to eliminate
memory redundancies in data parallelism and model parallelism and allow for larger batch
sizes (e.g. 512), resulting in three times faster training and fewer GPUs required in the cluster (256 instead of 1024). Most recently, Brown et al. (2020) train GPT-3, an autoregressive language model that contains 175 billion parameters. This model can perform NLP tasks (e.g. machine translation, question answering) and digit arithmetic relatively well with only a few examples, closing the performance gap to similarly large pretrained models that are further fine-tuned for specific tasks, and in some cases outperforming them, given the large
increase in the number of parameters. The resources required to store the aforementioned
CNN and Transformer models on Graphics Processing Units (GPUs) or Tensor Processing
Units (TPUs), let alone train them, are out of reach for the large majority of machine learning
practitioners. Moreover, these models have predominantly been driven by improving the
state of the art (SoTA) and pushing the boundaries of what complex tasks can be solved
using them. Therefore, we expect that the current trend of increasing network size will
remain.
Thus, the motivation to compress models has grown and expanded in recent years from
being predominantly focused around deployment on mobile devices, to also learning smaller
networks on the same device but with eased hardware constraints, i.e. learning on a small
1. A library that allows for distributed training with mixed precision (MP), model parallelism, memory optimization, clever gradient accumulation, loss scaling with MP, large batch training with specialized optimizers, adaptive learning rates and advanced parameter search. See https://github.com/microsoft/DeepSpeed.git
number of GPUs and TPUs or the same number of GPUs and TPUs but with a smaller
amount of VRAM. For these reasons, model compression can be viewed as a critical research
endeavour to allow the machine learning community to continue to deploy and understand
these large models with limited resources.
Hence, this paper provides an overview of methods and techniques for compressing DNNs.
This includes weight sharing (section 2), pruning (section 3), tensor decomposition (section 4),
knowledge distillation (section 5) and quantization (section 6).
Figure 1: Accuracy vs # Parameters for CNN architectures (source on left: Tan and Le
(2019a)) and # Parameters vs Years for Transformers (source on right: Sanh (2019))
Another motivation comes from the discovery of the double descent phenomenon (Belkin et al., 2019b), whereby a second descent in the test error is found for overparameterized DNNs that have little to no training error, occurring after the critical regime where the test error is initially high. This second descent in test error tends to converge to an error lower than that found in the first descent, where the first descent corresponds to the traditional bias-variance tradeoff. Moreover, the norm of the weights becomes dramatically smaller in each layer during this second descent, during the compression phase (Shwartz-Ziv and Tishby, 2017). Since the weights tend to be close to zero when trained far into this second regime, it becomes clearer why compression techniques, such as pruning, have less effect on the network's behaviour when it is trained to convergence, since the magnitude of individual weights becomes smaller as the network grows larger.
Frankle and Carbin (2018) also showed that training a network with more parameters to convergence makes it easier to find a subnetwork that, when trained from scratch, maintains performance. This further suggests that compressing large pretrained overparameterized DNNs that are trained to convergence has advantages, from a performance and storage perspective, over training an equivalently smaller DNN. Even in cases where the initial compression causes a degradation in performance, retraining the compressed model is commonly carried out to recover performance.
Lastly, large pretrained models are widely and publicly available2,3 and thus can be easily used and compared by the rest of the machine learning community, avoiding the need to train these models from scratch. This further motivates the utility of model compression and its advantages over training an equivalently smaller network from scratch.
We now begin to describe work for each compression type, beginning with weight sharing.
2. Weight Sharing
The simplest form of network reduction involves sharing weights between layers or structures
within layers (e.g filters in CNNs). We note that unlike compression techniques discussed
in later sections (Section 3-6), standard weight sharing is carried out prior to training the
original networks, as opposed to compressing the model after training. However, recent work which we discuss here (Chen et al., 2015; Ullrich et al., 2017; Bai et al., 2019) has also been used to reduce DNNs post-training, and hence we devote this section to this straightforward and commonly used technique.
Weight sharing reduces the network size and avoids sparsity. It is not always clear
how many and what group of weights should be shared before there is an unacceptable
performance degradation for a given network architecture and task. For example, Inan
et al. (2016) find that tying the input and output representations of words leads to good
performance while dramatically reducing the number of parameters proportional to the size
of the vocabulary of a given text corpus. This may be specific to language modelling, however, since the output classes are a direct function of the inputs, which are typically very high dimensional (e.g. vocabulary sizes greater than $10^6$). Moreover, this approach shares the whole embedding matrix, as opposed to sharing individual entries or sub-blocks of the matrix. Other approaches include clustering weights so that the centroid of each cluster is shared among its members, and adding a weight penalty term to the objective that groups weights in a way that makes them
more amenable to weight sharing. We discuss these approaches below along with other recent
techniques that have shown promising results when used in DNNs.
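To make the clustering-based approach concrete, the following is a minimal sketch (not taken from any of the surveyed papers) of weight sharing via k-means clustering: the weights of a layer are clustered and each weight is replaced by its cluster centroid, so that only the centroids and per-weight cluster indices need to be stored. The cluster count and layer shape are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def cluster_share_weights(weights: np.ndarray, n_clusters: int = 16):
    # Cluster all scalar weights and replace each one by its cluster centroid.
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(flat)
    centroids = km.cluster_centers_.squeeze(1)   # the shared weight values
    assignments = km.labels_                     # per-weight cluster index (storable in a few bits)
    shared = centroids[assignments].reshape(weights.shape)
    return shared, centroids, assignments

# Example: a 256x512 dense layer is stored as 16 floats plus 4-bit indices.
W = np.random.randn(256, 512).astype(np.float32)
W_shared, centroids, idx = cluster_share_weights(W, n_clusters=16)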
Soft Weight Sharing Nowlan and Hinton (1992) place a mixture of Gaussians prior over the weights so that weights cluster around a small number of shared values, minimizing the objective

$C = \frac{K}{\sigma_y^2} \sum_c (y_c - d_c)^2 - \sum_i \log\Big[ \sum_j \pi_j\, p_j(w_i) \Big]$   (1)
The expectation maximization (EM) algorithm is used to optimize these mixture pa-
rameters. The number of parameters tied is then proportional to the number of mixture
components that are used in the Gaussian model.
An Extension of Soft-Weight Sharing Ullrich et al. (2017) build on soft weight sharing (Nowlan and Hinton, 1992) with factorized posteriors by optimizing the objective in Equation 2. Here, $\tau = 5e{-}3$ controls the influence of the log-prior, whose means $\mu$, variances $\sigma$ and mixture coefficients $\pi$ are learned during retraining, apart from the $j$-th component, which is fixed to $\mu_j = 0$ and $\pi_j = 0.99$. Each mixture parameter has a learning rate of $5 \times 10^{-4}$. Given that the mixtures are sensitive to collapsing if the correct hyperparameters are not chosen, they also consider an inverse-gamma hyperprior on the mixture variances, which is more stable during training.
After training with the above objective, components whose KL divergence falls under a set threshold are merged (Adhikari and Hollmen, 2012) as shown in Equation 3. Each weight is then set to the mean of the component with the highest responsibility for it, performing GMM-based quantization.
$\pi_{new} = \pi_i + \pi_j, \quad \mu_{new} = \frac{\pi_i \mu_i + \pi_j \mu_j}{\pi_i + \pi_j}, \quad \sigma^2_{new} = \frac{\pi_i \sigma_i^2 + \pi_j \sigma_j^2}{\pi_i + \pi_j}$   (3)
In their experiments, 17 Gaussian components were merged into 6 quantization components, while still achieving performance close to the original LeNet classifier used on MNIST.
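The merging rule of Equation 3 and the subsequent quantization step can be sketched as follows; this is an illustrative implementation rather than the authors' code, and the responsibility-based assignment reflects the interpretation discussed above.

import numpy as np

def merge_components(pi_i, mu_i, var_i, pi_j, mu_j, var_j):
    # Merge two Gaussian mixture components (Equation 3).
    pi_new = pi_i + pi_j
    mu_new = (pi_i * mu_i + pi_j * mu_j) / pi_new
    var_new = (pi_i * var_i + pi_j * var_j) / pi_new
    return pi_new, mu_new, var_new

def gmm_quantize(w, pis, mus, variances):
    # Set each weight to the mean of the mixture component most responsible for it.
    w = np.asarray(w)[..., None]
    log_resp = (np.log(pis) - 0.5 * np.log(2 * np.pi * variances)
                - 0.5 * (w - mus) ** 2 / variances)
    return mus[np.argmax(log_resp, axis=-1)]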
The Universal Transformer (UT) combines a recurrent inductive bias with the parallelizable self-attention and global receptive field of the Transformer (Vaswani et al., 2017). As part of UT, sharing weights across recurrent steps reduces the network size while showing strong results on de facto NLP benchmarks.
Dabre and Fujita (2019) use a 6-hidden layer Transformer network for neural machine
translation (NMT) where the same weights are fed back into the same attention block
recurrently. This straightforward approach surprisingly showed performance similar to an untied 6-hidden-layer Transformer on standard NMT benchmark datasets.
Xiao et al. (2019) use shared attention weights in the Transformer, since dot-product attention can be slow during the auto-regressive decoding stage. Attention weights from hidden states are shared among adjacent layers, drastically reducing the number of parameters proportionally to the number of attention heads used. The Jensen-Shannon (JS) divergence is computed between the self-attention weights of different heads and averaged to obtain an average JS score. They find that the weight distribution is similar for layers 2-6, while larger variance is found among encoder-decoder attention, although some adjacent layers still exhibit relatively similar JS scores. Weight matrices are shared based on the JS score, whereby layers whose JS score is larger than a learned threshold (dynamically updated throughout training) are shared. The criterion involves finding the largest group of attention blocks with similarity above the learned threshold, so as to maximize the number of weight groups that can be shared while maintaining performance. They find a 16-fold storage reduction over the original Transformer while maintaining competitive performance.
Deep Equilibrium Model Bai et al. (2019) propose deep equilibrium models (DEMs) that use a root-finding method to find the equilibrium point of a network, through which one can analytically backpropagate using implicit differentiation. This is motivated by the observation that the hidden states of sequential models converge towards a fixed point. Regardless of the network depth, the approach only requires constant memory because backpropagation only needs to be performed at the equilibrium point of a single layer.
For a recurrent network $f_W(z_{1:T}; x_{1:T})$ of infinite hidden layer depth that takes inputs $x_{1:T}$ and hidden states $z_{1:T}$ up to $T$ timesteps, the transformations can be expressed as

$\lim_{i\to\infty} z^{[i]}_{1:T} = \lim_{i\to\infty} f_W(z^{[i]}_{1:T}; x_{1:T}) := f_W(z^{*}_{1:T}; x_{1:T}) = z^{*}_{1:T}$   (4)

where the final representation $z^{*}_{1:T}$ is the hidden state output corresponding to the equilibrium point of the network. They assume that this equilibrium point exists for large models, such as Transformer and Trellis (Bai et al., 2018) networks (a CNN-based architecture).
Computing $\frac{\partial z^{*}_{1:T}}{\partial W}$ requires implicit differentiation, and Equation 5 can be rewritten as Equation 6:

$\frac{\partial z^{*}_{1:T}}{\partial W} = \frac{d f_W(z^{*}_{1:T}; x_{1:T})}{dW} + \frac{\partial f_W(z^{*}_{1:T}; x_{1:T})}{\partial z^{*}_{1:T}} \frac{\partial z^{*}_{1:T}}{\partial W}$   (5)

$\Big(I - \frac{\partial f_W(z^{*}_{1:T}; x_{1:T})}{\partial z^{*}_{1:T}}\Big) \frac{\partial z^{*}_{1:T}}{\partial W} = \frac{d f_W(z^{*}_{1:T}; x_{1:T})}{dW}$   (6)
For notational convenience they define $g_W(z^{*}_{1:T}; x_{1:T}) = f_W(z^{*}_{1:T}; x_{1:T}) - z^{*}_{1:T} \to 0$, and thus the equilibrium state $z^{*}_{1:T}$ is the root of $g_W$, found using Broyden's method (Broyden, 1965).
The Jacobian of the function $g_W$ at the equilibrium point $z^{*}_{1:T}$ w.r.t $W$ can then be expressed as Equation 7. Note that this is computed without having to consider how the equilibrium $z^{*}_{1:T}$ was obtained.

$J_{g_W}\big|_{z^{*}_{1:T}} = -\Big(I - \frac{\partial f_W(z^{*}_{1:T}; x_{1:T})}{\partial z^{*}_{1:T}}\Big)$   (7)
Since $f_W(\cdot)$ is in equilibrium at $z^{*}_{1:T}$, they do not need to backpropagate through all the layers, assuming all layers are the same (which is why this is considered a weight sharing technique). They only need to solve Equation 8 for the equilibrium points using Broyden's method,

$\frac{\partial z^{*}_{1:T}}{\partial W} = -J_{g_W}^{-1}\big|_{z^{*}_{1:T}} \frac{d f_W(z^{*}_{1:T}; x_{1:T})}{dW}$   (8)

and then perform a single layer update using backpropagation at the equilibrium point.
$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z^{*}_{1:T}} \frac{\partial z^{*}_{1:T}}{\partial W} = -\frac{\partial L}{\partial z^{*}_{1:T}} J_{g_W}^{-1}\big|_{z^{*}_{1:T}} \frac{d f_W(z^{*}_{1:T}; x_{1:T})}{dW}$   (9)
The benefit of using Broyden's method is that the full Jacobian does not need to be stored; instead, an approximation $\hat{J}^{-1}_{g_W}$ is maintained using the Sherman-Morrison formula (Scellier and Bengio, 2017), which is then used as part of the Broyden iteration:

$z^{[i+1]}_{1:T} := z^{[i]}_{1:T} - \alpha\, \hat{J}^{-1}_{g_W}\big|_{z^{[i]}_{1:T}}\, g_W(z^{[i]}_{1:T}; x_{1:T}) \quad \text{for } i = 0, 1, 2, \ldots$   (10)
where $\alpha$ is the learning rate. The weight update can then be expressed as Equation 11:

$W^{+} = W - \alpha \cdot \frac{\partial L}{\partial W} = W + \alpha\, \frac{\partial L}{\partial z^{*}_{1:T}} J_{g_W}^{-1}\big|_{z^{*}_{1:T}} \frac{d f_W(z^{*}_{1:T}; x_{1:T})}{dW}$   (11)
Figure 2 shows the difference between a standard Transformer forward and backward pass and the corresponding DEM passes. The left of the figure illustrates the Broyden iterations used to find the equilibrium point for successive inputs. On WikiText-103, they show that DEMs can improve over SoTA sequence models and reduce memory usage by 88% for similar computational requirements as the original models.
Figure 2: original source Bai et al. (2019): Comparison of the DEQ with conventional
weight-tied deep networks
the number of layers and the number of parameters are the most significant factors, while increasing the number of feature maps (i.e. the representation dimensionality) improves performance only as a byproduct of the increase in parameters. From this, they conclude that adding layers without increasing the number of parameters can increase performance, and that the number of parameters far outweighs the feature map dimensionality with respect to performance.
Köpüklü et al. (2019) have also focused on reusing convolutional layers recursively, applying batch normalization after recursed layers and channel shuffling to allow filter outputs to be passed as inputs to other filters in the same block. With channel shuffling, the LRU blocks become robust to dealing with more than one type of channel, leading to improved performance without increasing the number of parameters. Savarese and Maire (2019) learn a linear combination of parameters from an external group of templates. They too use recursive convolutional blocks as part of the learned parameter-sharing weighting scheme. However, layer recursion can lead to vanishing or exploding gradients (VEGs). Hence, we concisely describe previous work that has aimed to mitigate VEGs in parameter-shared networks, namely those which use the aforementioned recursion.
Kim et al. (2016) have used residual connections between the input and the output
reconstruction layer to avoid signal attenuation, which can further lead to vanishing gradients
in the backward pass. This is applied in the context of self-supervision by reconstructing high
resolution images for image super-resolution. Tai et al. (2017) extend the work of Kim et al.
(2016). Instead of passing the intermediate outputs of a shared parameter recursive block to
another convolutional layer, they use an elementwise addition of the intermediate outputs
of the residual recursive blocks before passing to the final convolutional layer. The original
input image is then added to the output of last convolutional layer which corresponds to the
final representation of the recursive residual block outputs.
Zhang et al. (2018c) combine residual (skip) connections and dense connections, where
skip connections add the input to each intermediate hidden layer input.
Guo et al. (2019a) address VEGs in recursive convolutional blocks by using a gating unit that chooses the number of self-loops for a given block before VEGs occur. They use the Gumbel-Softmax trick without Gumbel noise to make deterministic predictions of how many self-loops there should be for a given recursive block throughout training. They also find that batch normalization is at the root of gradient explosion because of the statistical bias induced by having a different number of self-loops during training, affecting the calculation of the moving average. This is addressed by normalizing inputs according to the number of self-loops, which is dependent on the gating unit. When used in the ResNet-53 architecture, dynamic recursion outperforms the larger ResNet-101 while reducing the number of parameters by 47%.
3. Network Pruning
Pruning weights is perhaps the most commonly used technique to reduce the number of parameters in a pretrained DNN. Pruning can lead to a reduction in storage and model runtime, and performance is usually maintained by retraining the pruned network. Iterative weight pruning prunes while retraining until the desired network size and accuracy tradeoff is met. From a neuroscience perspective, it has been found that as humans learn they also carry out a similar kind of iterative pruning, removing irrelevant or unimportant information from past experiences (Walsh, 2013). Similarly, pruning is not carried out at random, but selected so that unimportant information about past experiences is discarded. In the context of DNNs, random pruning (akin to binary Dropout) can be detrimental to the model's performance and may require even more retraining steps to account for the removal of important weights or neurons (Yu et al., 2018).
The simplest pruning strategy involves setting a threshold γ that decides which weights or units (in the latter case, based on the absolute sum of magnitudes of incoming weights) are removed (Hagiwara, 1993). The threshold can be set based on each layer's weight magnitude distribution, where weights centered around the mean µ are removed, or the threshold can be set globally for the whole network. Alternatively, one can prune the weights with the lowest absolute value of the normalized gradient multiplied by the weight magnitude (Lee et al., 2018) for a given set of mini-batch inputs, again either layer-wise or globally.
Instead of setting a threshold, one can predefine a percentage of weights to be pruned based on the magnitude of $w$, or a percentage aggregated over the weights $w_l$ of each layer, $\forall l \in L$. Most
commonly, the percentage of weights that are closest to 0 are removed. The aforementioned
criteria for pruning are all types of magnitude-based pruning (MBP). MBP has also been
combined with other strategies such as adding new neurons during iterative pruning to further
improve performance (Han and Qiao, 2013; Narasimha et al., 2008), where the number of
new neurons added is less than the number pruned in the previous pruning step and so the
overall number of parameters monotonically decreases.
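As a concrete illustration, the following is a minimal sketch (not tied to any particular paper) of global magnitude-based pruning in PyTorch: a fraction of the weights with the smallest absolute values is zeroed across all layers, and the returned masks can be used to keep pruned weights at zero during retraining.

import torch

def global_magnitude_prune(model: torch.nn.Module, fraction: float = 0.5):
    weights = [p for p in model.parameters() if p.dim() > 1]    # skip biases
    all_mags = torch.cat([w.detach().abs().flatten() for w in weights])
    k = max(1, int(fraction * all_mags.numel()))
    threshold = all_mags.kthvalue(k).values                     # global magnitude threshold
    masks = []
    with torch.no_grad():
        for w in weights:
            mask = (w.abs() > threshold).float()
            w.mul_(mask)                                        # zero out the smallest weights
            masks.append(mask)
    return masks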
MBP is the most commonly used approach in DNNs due to its simplicity, and it performs well for a wide class of machine learning models (including DNNs) on a diverse range of tasks (Setiono and Leow, 2000). In general, global MBP tends to outperform layer-wise MBP (Karnin, 1990; Reed, 1993; Hagiwara, 1993; Lee et al., 2018), because there is more flexibility in the amount of sparsity for each layer, allowing more salient layers to remain dense while less salient layers contain fewer non-zero entries. Before discussing more involved pruning methods, we first make some important categorical distinctions.
This can happen when the DNN is not necessarily overparameterized, in which case almost
all parameters are necessary to maintain good generalization.
Pruning techniques can also be categorized into what type of criteria is used as follows:
1. The aforementioned magnitude-based pruning, whereby the weights with the lowest absolute value are removed, based on a set threshold or percentage, either layer-wise or globally.
2. Methods that penalize the objective with a regularization term (e.g. $\ell_1$, $\ell_2$ or lasso weight regularization) to force the model to learn a network with smaller weights, and then prune the smallest weights.
3. Methods that compute the sensitivity of the loss function when weights are removed
and remove the weights that result in the smallest change in loss.
4. Search-based approaches (e.g. particle filters, evolutionary algorithms, reinforcement learning) that seek to learn or adapt a set of weights to links or paths within the neural network and keep those which are salient for the task. Unlike (2) and (3), the pruning criterion does not involve gradient descent (with the exception of approaches using deep RL).
However, the main issue with using the above quadratic penalty is that all parameters decay exponentially at the same rate, and it disproportionately penalizes larger weights.
Therefore, Weigend et al. (1991) proposed the objective shown in Equation 13. With $f(w) := w^2/(1 + w^2)$, this penalty term is small for small weights and tends to 1 for large weights. Therefore, these terms can be considered as approximating the number of non-zero parameters in the network.

$C(w, v) = \sum_{m=1}^{h} \sum_{l=1}^{n} \frac{w_{ml}^2}{1 + w_{ml}^2} + \sum_{m=1}^{h} \sum_{p=1}^{C} \frac{v_{pm}^2}{1 + v_{pm}^2}$   (13)
The derivative $f'(w) = 2w/(1 + w^2)^2$ computed during backpropagation does not penalize large weights as much as Equation 12. However, in the context of recent years, where large overparameterized networks have shown better generalization when the weights are close to 0, we conjecture that Equation 13 is perhaps more useful in the underparameterized regime. This penalty makes small weights decay faster than large weights. However, the problem of not distinguishing between large and very large weights remains. Therefore, Weigend et al. (1991) further propose the objective in Equation 14.
$C(w, v) = \lambda_1 \Big( \sum_{m=1}^{h} \sum_{l=1}^{n} \frac{\beta w_{ml}^2}{1 + \beta w_{ml}^2} + \sum_{m=1}^{h} \sum_{p=1}^{C} \frac{\beta v_{pm}^2}{1 + \beta v_{pm}^2} \Big) + \lambda_2 \Big( \sum_{m=1}^{h} \sum_{l=1}^{n} w_{ml}^2 + \sum_{m=1}^{h} \sum_{p=1}^{C} v_{pm}^2 \Big)$   (14)
Wan et al. (2009) have proposed a Gram-Schmidt (GS) based variant of backpropagation whereby GS determines which weights are updated and which remain frozen at each epoch.
Li et al. (2016b) prune filters in CNNs by identifying filters which contribute least to the overall accuracy. For a given layer, the sum of the absolute kernel weights is computed for each filter; since the number of channels is the same across filters, this quantity represents the average weight value of each filter. Filters with small weight magnitudes tend to produce weak activations, and hence these are pruned. This simple approach leads to less sparse connectivity and reduces inference costs by 37% on average across the models tested, while remaining close to the original accuracy. Figure 3 shows their figure demonstrating that pruning the filters with the lowest sum of absolute weights best maintains accuracy.
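A minimal sketch of this criterion is shown below (an illustration, not the authors' implementation): filters of a convolutional layer are ranked by the sum of their absolute kernel weights and only the top fraction is kept. Rebuilding a thinner layer and adjusting the next layer's input channels is omitted for brevity.

import torch

def filters_to_keep(conv: torch.nn.Conv2d, keep_ratio: float = 0.7):
    # L1 norm over (in_channels, kH, kW) for each output filter.
    l1 = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    n_keep = max(1, int(keep_ratio * conv.out_channels))
    return torch.topk(l1, n_keep).indices.sort().values    # indices of the filters to keep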
Figure 3: original source: Li et al. (2016b)
To obtain the importance weight for a unit, they calculate the loss derivative with respect to a gating coefficient $\alpha$ as $\hat{\rho}_i = \frac{\partial L}{\partial \alpha_i}\big|_{\alpha_i = 1}$, where $L$ in this context is the sum of squared errors. Units are then pruned when $\hat{\rho}_i$ falls below a set threshold. However, they find that $\hat{\rho}_i$ can fluctuate throughout training, so they propose an exponentially-decayed moving average over time to smooth the volatile gradient and to provide better estimates when the squared error is very small. This moving average is given as

$\hat{\rho}_i(t + 1) = \beta\, \hat{\rho}_i(t) + (1 - \beta)\, \frac{\partial L(t)}{\partial \alpha_i}$   (15)
where β = 0.8 in their experiments. Applying skeletonization to current DNNs is perhaps too slow, as it was originally introduced in the context of neural networks with a relatively small number of parameters. However, assigning importance weights to groups of weights, such as filters in a CNN, is feasible and aligns with the current literature (Wen et al., 2016; Anwar et al., 2017) on structured pruning (discussed in subsection 3.4).
Pruning Weights with Low Sensitivity Karnin (1990) measure the sensitivity S of
the loss function with respect to weights and prune weights with low sensitivity. Instead of
removing each weight individually, they approximate S by the sum of changes experienced
by the weight during training as
$S_{ij} = -\sum_{n=0}^{N-1} \frac{\partial L}{\partial w_{ij}}\, \Delta w_{ij}(n)\, \frac{w^{f}_{ij}}{w^{f}_{ij} - w^{i}_{ij}}$   (16)

where $w^{f}_{ij}$ is the final weight value at each pruning step, $w^{i}_{ij}$ is the initial weight after the previous pruning step and $N$ is the number of training epochs. Using backpropagation to compute $\Delta w$, $\hat{S}_{ij}$ is expressed as

$\hat{S}_{ij} = \sum_{n=0}^{N-1} \big[\Delta w_{ij}(n)\big]^2\, \frac{w^{f}_{ij}}{\eta\,(w^{f}_{ij} - w^{i}_{ij})}$   (17)
If the sum of squared errors is less than that of the previous pruning step, and if a weight in a hidden layer with the smallest $S_{ij}$ changes less than in the previous epoch, then these weights are pruned. This ensures that weights with small initial sensitivity are not pruned too early, as they may perform well given more retraining steps. If all incoming weights to a unit are removed, the unit itself is removed, along with all of its outgoing weights. Lastly, they lower bound the number of weights that can be pruned for each hidden layer; therefore, towards the end of training there may be weights with low sensitivity that remain in the network.
$\delta L = \underbrace{\sum_i g_i\, \delta \breve{w}_i}_{\approx\, 0} + \frac{1}{2}\sum_i h_{ii}\, \delta \breve{w}_i^2 + \frac{1}{2}\sum_{i \neq j} h_{ij}\, \delta \breve{w}_i\, \delta \breve{w}_j + \underbrace{O(\|\delta \breve{W}\|^3)}_{\approx\, 0}$   (18)

where $\breve{w}$ are perturbed weights of $w$, the $\delta \breve{w}_i$ are the components of $\delta \breve{W}$, $g_i$ are the components of the gradient $\partial L / \partial \breve{w}_i$ and $h_{ij}$ are the elements of the Hessian $H$ where $H_{ij} := \partial^2 L / \partial \breve{w}_i \partial \breve{w}_j$. Since most well-trained networks have converged to a local minimum where the gradient $g \approx 0$, the first term is
$\approx 0$. Assuming the perturbations of $W$ are small, the last term will also be small, and hence LeCun et al. (1990) assume the off-diagonal values of $H$ are 0, i.e. $\frac{1}{2}\sum_{i \neq j} h_{ij}\, \delta \breve{w}_i\, \delta \breve{w}_j := 0$. Therefore, $\delta L$ is expressed as

$\delta L \approx \frac{1}{2}\sum_i h_{ii}\, \delta \breve{w}_i^2$   (19)
The diagonal terms are computed with a backpropagation-like recurrence over the unit pre-activations $a_i$ (with unit outputs $z_i = f(a_i)$):

$\frac{\partial^2 L}{\partial a_i^2} = f'(a_i)^2 \sum_l w_{li}^2\, \frac{\partial^2 L}{\partial a_l^2} - f''(a_i)\, \frac{\partial L}{\partial z_i}$   (20)
The second derivative of the mean squared error with respect to the last-layer pre-activations is then

$\frac{\partial^2 L}{\partial a_i^2} = 2 f'(a_i)^2 - 2 (y_i - z_i) f''(a_i)$   (21)
The saliency of weight $w_k$ is then $s_k \approx h_{kk} w_k^2 / 2$, and the portion of weights with the lowest $s_k$ is iteratively pruned during retraining.
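The saliency computation can be sketched as follows. Note this is an illustrative approximation: the exact diagonal Hessian terms of Equation 20 are replaced by averaged squared gradients (an empirical Fisher estimate), which is an assumption rather than LeCun et al.'s recipe.

import torch

def obd_saliencies(model, loss_fn, data_loader, n_batches=10):
    h_diag = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for i, (x, y) in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                h_diag[n] += p.grad.detach() ** 2 / n_batches
    # Saliency s_k ~= h_kk * w_k^2 / 2; the smallest entries are pruned first.
    return {n: 0.5 * h_diag[n] * p.detach() ** 2 for n, p in model.named_parameters()}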
Optimal Brain Surgeon Hassibi et al. (1994) improve over OBD by preserving the off-diagonal values of the Hessian, showing empirically that these terms are important for pruning and that assuming a diagonal Hessian hurts pruning accuracy.
To make this Hessian computation feasible, they exploit a recursive relation for calculating the inverse Hessian $H^{-1}$ from training data and the structural information of the network. Moreover, using $H^{-1}$ has the advantage over OBD that it does not require further re-training post-pruning.
They denote eliminating a weight as setting $w_q = 0$, i.e. $\delta w_q + w_q = 0$, and minimize the following objective:

$\min_q \Big\{ \min_{\delta w}\; \frac{1}{2}\, \delta w^{T} \cdot H \cdot \delta w \quad \text{s.t.} \quad e_q^{T} \cdot \delta w + w_q = 0 \Big\}$   (22)

$\mathcal{L} = \frac{1}{2}\, \delta w^{T} \cdot H \cdot \delta w + \lambda \big(e_q^{T} \cdot \delta w + w_q\big)$   (23)

where $\lambda$ is a Lagrange undetermined multiplier. The functional derivatives are taken and the constraint of Equation 22 is applied. Finally, matrix inversion is used to find the optimal weight change and the resulting change in error:

$\delta w = -\frac{w_q}{[H^{-1}]_{qq}}\, H^{-1} e_q \quad \text{and} \quad L_q = \frac{1}{2}\, \frac{w_q^{2}}{[H^{-1}]_{qq}}$   (24)
Defining the first derivative as $X_k := \frac{\partial f(x; W)}{\partial W}$, the Hessian is expressed as

$H = \frac{1}{P} \sum_{k=1}^{P} \sum_{j=1}^{n} X_{k,j} \cdot X_{k,j}^{T}$   (25)
for an $n$-dimensional output and $P$ samples. This can be viewed as the sample covariance of the gradient, and $H$ can be recursively computed as

$H_{m+1} = H_m + \frac{1}{P}\, X_{m+1} \cdot X_{m+1}^{T}$   (26)
where $H_0 = \alpha I$ and $H_P = H$. Here $10^{-8} \leq \alpha \leq 10^{-4}$ is necessary to make $H^{-1}$ less sensitive to the initial conditions. For OBS, $H^{-1}$ is required, and to obtain it they use a matrix inversion formula (Kailath, 1980) which leads to the following update:

$H^{-1}_{m+1} = H^{-1}_m - \frac{H^{-1}_m \cdot X_{m+1} \cdot X_{m+1}^{T} \cdot H^{-1}_m}{P + X_{m+1}^{T} \cdot H^{-1}_m \cdot X_{m+1}} \quad \text{where } H_0 = \alpha I,\; H_P = H$   (27)
This recursion is then used as part of Equation 24; it can be computed in one pass over the training data ($1 \leq m \leq P$), and the computational complexity of computing $H^{-1}$ remains the same as that of $H$, namely $O(Pn^2)$. Hassibi et al. (1994) have also extended their work on approximating the inverse Hessian (Hassibi and Stork, 1993) to show that this approximation works for any twice differentiable objective (not only the sum of squared errors) using Fisher's score.
Other approaches to Hessian approximation include dividing the network into subsets to use block diagonal approximations, eigendecomposition of $H^{-1}$ (Hassibi et al., 1994) and principal components of $H^{-1}$ (Levin et al., 1994) (unlike the aforementioned approximations, Levin et al. (1994) do not require the network to be trained to a local minimum). However, the main drawback is that the Hessian is relatively expensive to compute for these methods, including OBD. For $n$ weights, the Hessian requires $O(n^2/2)$ elements to store and $O(Pn^2)$ calculations per pruning step, where $P$ is the number of training samples.
Unlike OBD, Molchanov et al. (2016) keep the absolute change $|y|$ resulting from pruning, as the variance $\sigma_y^2$ is non-zero and correlated with the stability of $\partial C / \partial h$ throughout training, where $h$ is the activation of the hidden layer. Under the assumption that samples are independent and identically distributed, $E(|y|) = \sigma \sqrt{2}/\sqrt{\pi}$, where $\sigma$ is the standard deviation of $y$, known as the expected value of the half-normal distribution. So, while $y$ tends to zero, the expectation of $|y|$ is proportional to the variance of $y$, a value which is empirically more informative as a pruning criterion.
They rank the order of filters pruned using the TE criterion and compare it to an oracle ranking (i.e. the best ranking for removing filters), finding that it has a higher Spearman correlation to the oracle than other ranking schemes. This can also be
used to choose which filters should be transferred to a target task model. They compute the
importance of neurons or filters z by estimating the mutual information with target variable
MI(z; y) using information gain IG(y|z) = H(z) + H(y) − H(z, y) where H(z) is the entropy
of the variable z, which is quantized to make this estimation tractable.
Fisher Pruning Theis et al. (2018) extend the work of Molchanov et al. (2016) by motivating the pruning scheme and providing computational cost estimates for pruning as adjacent layers are successively pruned. Unlike OBD and OBS, they use Fisher pruning, which is more efficient since the gradient information is already computed during the backward pass. Hence, this pruning technique uses 1st order information given by the 2nd TE term that approximates the change in loss with respect to $w$. The Fisher information is computed during backpropagation and used as the pruning criterion.
The gradient can be formulated as in Equation 29, where $L(w) = \mathbb{E}_P[-\log Q_w(y|x)]$, $d$ represents a change in parameters, $P$ is the underlying distribution, $Q_w(y|x)$ is the posterior from the model and $H$ is the Hessian matrix.

$g = \nabla L(w), \quad H = \nabla^2 L(w), \quad L(w + d) - L(w) \approx g^{T} d + \frac{1}{2} d^{T} H d$   (29)

The cost of removing the $k$-th parameter, with a computational cost term $C$ weighted by $\beta$, is then

$L(W - W_k e_k) - L(W) + \beta \cdot \big(C(W - W_k e_k) - C(W)\big)$   (30)
Piggyback Pruning Mallya and Lazebnik (2018) propose a dynamic masking (i.e. pruning) strategy whereby a mask is learned to adapt a dense network into a sparse subnetwork for a specific target task. The backward pass for the binary mask is expressed as

$\frac{\partial L}{\partial m_{ji}} = \frac{\partial L}{\partial \hat{y}_j} \cdot \frac{\partial \hat{y}_j}{\partial m_{ji}} = \delta \hat{y}_j \cdot w_{ji} \cdot x_i$   (31)

where $m_{ji}$ is an entry in the mask $m$, $L$ is the loss function, $\hat{y}_j$ is the prediction when the mask is applied to the weights $w$, and $\delta \hat{y}_j := \partial L / \partial \hat{y}_j$. In matrix form, $\frac{\partial L}{\partial m} = (\delta \hat{y} \cdot x^{T}) \odot W$. Note that although the thresholding of the mask $m$ is non-differentiable, they perform a backward pass anyway. The justification is that the gradients of $m$ act as a noisy estimate of the gradients of the real-valued mask weights $m_r$. For every new task, $m$ is tuned along with a new final linear layer.
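A minimal sketch of this masking scheme is given below (an illustration with assumed initialization and threshold values, not the authors' released code): real-valued mask weights are thresholded to a binary mask in the forward pass, while the backward pass ignores the threshold (straight-through) so that gradients reach the real-valued mask.

import torch

class BinaryMask(torch.autograd.Function):
    @staticmethod
    def forward(ctx, m_real, threshold=5e-3):
        return (m_real > threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None            # straight-through: pass gradients to m_real

class MaskedLinear(torch.nn.Module):
    def __init__(self, frozen_weight: torch.Tensor):
        super().__init__()
        self.weight = torch.nn.Parameter(frozen_weight, requires_grad=False)  # backbone stays fixed
        self.m_real = torch.nn.Parameter(torch.full_like(frozen_weight, 1e-2))

    def forward(self, x):
        mask = BinaryMask.apply(self.m_real)
        return torch.nn.functional.linear(x, self.weight * mask)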
matrix multiplications, with little to no non-zero entries in matrices and tensors. CNNs in
particular are suitable for this type of pruning since they are made up of sparse connections.
Hence, below we describe some work that uses group-wise regularizers, structured variational and Bayesian methods, and adversarial approaches to achieve structured pruning in CNNs.
where $\Gamma_{ijs}$ is the group of kernel tensor entries $K(i, j, s, :)$, with $(i, j)$ the pixel at the $i$-th row and $j$-th column for the $s$-th input feature map. This regularization term forces some $\Gamma_{ijs}$ groups to be close to zero, which can then be removed during retraining depending on the amount of compression the practitioner predefines.
Structured Sparsity Learning Wen et al. (2016) show that their proposed structural
regularization can reduce a ResNet architecture with 20 layers to 18 with 1.35 percentage
point accuracy increase on CIFAR-10, which is even higher than the larger 32 layer ResNet
architecture. They use a group lasso regularization to remove whole filters, across channels,
shape and depth as shown in Figure 4.
Equation 33 shows the loss to be optimized to remove unimportant filters and channels, where $W^{(l)}_{n_l, c_l, :, :}$ is the $c_l$-th channel of the $n_l$-th filter in layer $l$ for a collection of all weights $W$, and $\|\cdot\|_g$ is the group lasso regularization term with $\|w^{(g)}\|_g = \sqrt{\sum_{i=1}^{|w^{(g)}|} (w_i^{(g)})^2}$, where $|w^{(g)}|$ is the number of weights in $w^{(g)}$. Since zeroing out the $l$-th layer's filter makes its output feature map redundant, the corresponding channel in layer $l+1$ is zeroed as well. Hence, structured sparsity learning is carried out for both filters and channels simultaneously.
Figure 4: original source: Wen et al. (2016): Structured Sparsity Learning
$L(W) = L_D(W) + \lambda_n \sum_{l=1}^{L} \sum_{n_l=1}^{N_l} \big\|W^{(l)}_{n_l,:,:,:}\big\|_g + \lambda_c \sum_{l=1}^{L} \sum_{c_l=1}^{C_l} \big\|W^{(l)}_{:,c_l,:,:}\big\|_g$   (33)
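A minimal sketch of these group lasso penalties for a single convolutional layer is shown below (an illustrative helper; the coefficient values in the usage comment are assumptions): one group per output filter and one group per input channel.

import torch

def ssl_group_lasso(conv_weight: torch.Tensor, lambda_n: float, lambda_c: float):
    # conv_weight has shape (N_l, C_l, kH, kW)
    filter_norms = conv_weight.pow(2).sum(dim=(1, 2, 3)).sqrt()    # ||W_{n,:,:,:}||_g
    channel_norms = conv_weight.pow(2).sum(dim=(0, 2, 3)).sqrt()   # ||W_{:,c,:,:}||_g
    return lambda_n * filter_norms.sum() + lambda_c * channel_norms.sum()

# Added to the data loss during retraining, for example:
# loss = criterion(model(x), y) + sum(ssl_group_lasso(m.weight, 1e-4, 1e-4)
#                                     for m in model.modules() if isinstance(m, torch.nn.Conv2d))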
(2016). In the case of filters in CNNs, this results in smaller reshaped matrices, leading to smaller and faster CNNs. The GWSR is added as a regularization term when retraining a pretrained CNN, and after a set number of epochs the groups with the smallest $\ell_2$ norm are deleted, where the fraction of groups to remove is predefined as $\tau \in [0, 1]$ (a percentage of the size of the network). However, they find that when choosing a value for $\tau$, it is difficult to set the regularization influence term $\lambda$, and manually tuning it can be time consuming. Moreover, when $\tau$ is small, the regularization strength of $\lambda$ is found to be too heavy, leading to many weight groups being biased towards 0 but not ending up very close to it. This results in poor performance, as it becomes unclear which groups should be removed. However, the drop in accuracy due to this can be remedied by further retraining after performing OBD; hence, retraining occurs on the sparse network without using the GWSR.
Sparse Variational Dropout Sparse variational dropout (SVD) learns individual dropout rates $\alpha_{ij}$ for each weight by maximizing the variational lower bound

$L(\phi) = \max_{\phi}\; L_D(\phi) - D_{KL}\big(q_\phi(w) \,\|\, p(w)\big) \quad \text{where} \quad L_D(\phi) = \sum_{n=1}^{N} \mathbb{E}_{q_\phi(w)}\big[\log p(y_n | x_n, w)\big]$   (34)
They use the reparameterization trick to reduce the variance of the gradient estimator when $\alpha > 0.5$ by replacing the multiplicative noise $1 + \sqrt{\alpha_{ij}}\,\epsilon_{ij}$ with additive noise $\sigma_{ij}\,\epsilon_{ij}$, where $\epsilon_{ij} \sim \mathcal{N}(0, 1)$ and $\sigma_{ij}^2 = \alpha_{ij}\,\theta_{ij}^2$ is tuned by optimizing the variational lower bound w.r.t $\theta$ and $\sigma$. This difference from the original VD allows weights with high dropout rates to be removed.
Since the prior and approximate posterior are fully factorized, the full KL-divergence term in the lower bound decomposes into a sum:

$D_{KL}\big(q(W|\theta, \alpha) \,\|\, p(W)\big) = \sum_{ij} D_{KL}\big(q(w_{ij}|\theta_{ij}, \alpha_{ij}) \,\|\, p(w_{ij})\big)$   (35)
5. Automatic relevance determination provides a data-dependent prior distribution to prune away redundant features in the overparameterized regime, i.e. where there are more features than samples.
Since the uniform log-prior is an improper prior, the KL divergence is only computed up
to an additional constant (Kingma et al., 2015).
$-D_{KL}\big(q(w_{ij}|\theta_{ij}, \alpha_{ij}) \,\|\, p(w_{ij})\big) = \frac{1}{2}\log \alpha_{ij} - \mathbb{E}_{\epsilon \sim \mathcal{N}(1, \alpha_{ij})} \log |\epsilon| + C$   (36)
In the VD model this term is intractable, as the expectation $\mathbb{E}_{\epsilon \sim \mathcal{N}(1, \alpha_{ij})} \log |\epsilon|$ cannot be computed analytically (Kingma et al., 2015). Hence, they approximate the negative KL. The negative KL increases as $\alpha_{ij}$ increases, which means the regularization term prefers large values of $\alpha_{ij}$, and so the corresponding weight $w_{ij}$ is dropped from the model. Because using SVD from the start of training tends to drop too many weights early on (the weights are randomly initialized), SVD is applied after an initial pretraining stage, which is why we consider it a pruning technique.
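The pruning step that follows sparse variational dropout training can be sketched as below (an illustration; the threshold value is an assumption rather than a prescribed constant): weights whose learned log dropout rate exceeds the threshold are removed.

import torch

def sparse_vd_mask(theta: torch.Tensor, log_sigma2: torch.Tensor, thresh: float = 3.0):
    # log(alpha) = log(sigma^2) - log(theta^2); a large alpha means the weight is effectively dropped.
    log_alpha = log_sigma2 - torch.log(theta ** 2 + 1e-8)
    keep = (log_alpha < thresh).float()
    return theta * keep, keep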
Bayesian Structured Pruning Structured pruning has also been achieved from a Bayesian view (Louizos et al., 2017) by learning dropout rates. Sparsity-inducing hierarchical priors are placed over the units of a DNN, and those units with high dropout rates are pruned. Pruning by unit is more efficient than pruning individual weights, as the latter requires a prior for each weight and is therefore far more computationally expensive; pruning units also has the benefit of being more efficient from a hardware perspective, since whole groups of weights are removed.
Consider a DNN with likelihood $p(\mathcal{D}|w) = \prod_{i=1}^{N} p(y_i | x_i, w)$, where $x_i$ is a given input sample with corresponding target $y_i$ and $w$ are the weights of the network, governed by a prior distribution $p(w)$. Since computing the posterior $p(w|\mathcal{D}) = p(\mathcal{D}|w)p(w)/p(\mathcal{D})$ explicitly is intractable, it is approximated with a simpler distribution, such as a Gaussian $q_\phi(w)$, parameterized by variational parameters $\phi$. The variational parameters are then optimized as

$L_E = \mathbb{E}_{q_\phi(w)}[\log p(\mathcal{D}|w)], \qquad L_C = \mathbb{E}_{q_\phi(w)}[\log p(w)] + H(q_\phi(w))$   (37)

$L(\phi) = L_E + L_C$   (38)
where $H(\cdot)$ denotes the entropy and $L(\phi)$ is known as the evidence lower bound (ELBO). They note that $L_E$ is intractable for noisy weights, and in practice Monte Carlo integration is used. When the simpler $q_\phi(w)$ is continuous, the reparameterization trick is used to backpropagate through the deterministic part $\phi$ and Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. By substituting this into Equation 37 and using the local reparameterization trick (Kingma et al., 2015), they can express $L(\phi)$ as

$L(\phi) = \mathbb{E}_{p(\epsilon)}\big[\log p(\mathcal{D}|f(\phi, \epsilon))\big] + \mathbb{E}_{q_\phi(w)}\big[\log p(w)\big] + H(q_\phi(w)), \quad \text{s.t. } w = f(\phi, \epsilon)$   (39)
with unbiased stochastic gradient estimates of the ELBO w.r.t the variational parameters $\phi$. They use a mixture of a log-uniform prior and a half-Cauchy prior for $p(w)$, which equates to a horseshoe distribution (Carvalho et al., 2010). By minimizing the negative KL divergence between the normal-Jeffreys scale prior $p(z)$ and the Gaussian variational posterior $q_\phi(z)$, they can learn the dropout rate $\alpha_i = \sigma^2_{z_i} / \mu^2_{z_i}$ as

$-D_{KL}\big(q_\phi(z) \,\|\, p(z)\big) \approx A \sum_i \big(k_1\, \sigma(k_2 + k_3 \log \alpha_i) - 0.5\, m(-\log \alpha_i) - k_1\big)$   (40)
where $\sigma(\cdot)$ is the sigmoid function, $m(\cdot)$ is the softplus function and $k_1 = 0.64$, $k_2 = 1.87$, $k_3 = 1.49$. A unit $i$ is pruned if its variational dropout rate exceeds a threshold $t$, i.e. if $\log \alpha_i = \log \sigma^2_{z_i} - \log \mu^2_{z_i} \geq t$.
It should be mentioned that this prior parametrization readily allows for a more flexible marginal posterior over the weights, as we now have a compound distribution

$q_\phi(W) = \int q_\phi(W|z)\, q_\phi(z)\, dz$   (41)
Pruning via Variational Information Bottleneck Dai et al. (2018) minimize the variational lower bound (VLB) to reduce redundancy between adjacent layers by penalizing their mutual information, ensuring each layer contains useful and distinct information. A subset of neurons is kept while the remaining neurons are forced toward 0 using the sparsity-inducing regularization that arises as part of their variational information bottleneck (VIB) framework. They show that this sparsity-inducing regularization has advantages over previous sparsity regularization approaches for network pruning.
Equation 42 shows the objective for compressing neurons (or filters in CNNs), where $\gamma_i$ controls the amount of compression in the $i$-th layer and $L$ weights the data term, ensuring that for deeper networks the sum of KL factors does not outweigh the log-likelihood term when finding the globally optimal solution.

$\mathcal{L} = \sum_{i=1}^{L} \gamma_i \sum_{j=1}^{r_i} \log\Big(1 + \frac{\mu_{i,j}^2}{\sigma_{i,j}^2}\Big) - L\, \mathbb{E}_{\{x, y\} \sim \mathcal{D},\, h \sim p(h|x)}\big[\log q(y | h^{L})\big]$   (42)
$L$ arises naturally from the VIB formulation, unlike in probabilistic network models. The $\log(1 + u)$ in the KL term is concave and non-decreasing on the range $[0, \infty)$ and therefore favors solutions that are sparse, with a subset of parameters exactly zero, instead of many shrunken ratios $\alpha_{i,j} := \mu_{i,j}^2\, \sigma_{i,j}^{-2}, \; \forall i, j$.
In the forward pass, $\epsilon_i \sim \mathcal{N}(0, I)$ is sampled for each layer and $h_i$ is computed; the gradients for $\{\mu_i, \sigma_i, W_i\}_{i=1}^{L}$ and the output weights $W_y$ are then updated after backpropagation. Figure 5 shows the conditional distribution $p(h_i | h_{i-1})$, with $h_i$ sampled by multiplying $f_i(h_{i-1})$ with a random variable $z_i := \mu_i + \epsilon_i \circ \sigma_i$.
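A minimal sketch of such a multiplicative gate is given below (the initialization values are illustrative assumptions): the layer output is scaled by the reparameterized gate $z_i$, and the KL term of Equation 42 is computed from the ratio $\mu^2/\sigma^2$.

import torch

class VIBGate(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.ones(dim))
        self.log_sigma2 = torch.nn.Parameter(-9.0 * torch.ones(dim))

    def forward(self, h):
        if self.training:
            eps = torch.randn_like(self.mu)
            z = self.mu + eps * self.log_sigma2.exp().sqrt()   # z = mu + eps * sigma
        else:
            z = self.mu                                        # mean gate at test time
        return h * z

    def kl(self):
        # sum_j log(1 + mu^2 / sigma^2); the layer-wise gamma_i weighting is applied by the caller.
        return torch.log1p(self.mu ** 2 / self.log_sigma2.exp()).sum()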
They show that when using the VIB network, the mutual information between $x$ and $h_1$ increases as the model initially begins to learn, and later in training the mutual information begins to drop as the model enters the compression phase. In contrast, the mutual information for the original network stayed consistently high, tending towards 1.
Figure 5: original source: Dai et al. (2018): Variational Information Structure
Generative Adversarial-based Structured Pruning Lin et al. (2019) extend beyond pruning well-defined structures, such as filters, to more general structures which may not be predefined in the
network architecture. They do so by applying a soft mask to the output of each structure to be pruned and minimizing the mean squared error against a baseline network, together with a minimax objective between the outputs of the baseline and pruned networks, where a discriminator network tries to distinguish between the two. During retraining, soft mask weights are learned over each structure (i.e. filters, channels) with a sparse regularization term (namely, a fast iterative shrinkage-thresholding algorithm) that forces a subset of the weights of each structure to go to 0. Structures whose soft mask weight falls below a predefined threshold are then removed throughout the adversarial learning. This soft masking scheme is motivated by previous work (Lin et al., 2018) that instead used hard thresholding with binary masks, which makes optimization harder due to non-smoothness. Although they claim that this sparse masking can be performed with label-free data and transfer to other domains with no supervision, the method is largely dependent on the baseline (i.e. teacher) network, which implicitly provides labels as it is trained with supervision, and thus the pruned network's transferability is largely dependent on this.
to generate new members of the population for the next iteration. This is achieved using 3 distribution estimation algorithms (DEAs). They find that DEAs can improve GA-based pruning, and that networks pruned using GA-based pruning achieve faster inference with little to no difference in performance compared to the original network.
Recently, Hu et al. (2018) have pruned channels from a pretrained CNN using GAs and performed knowledge distillation on the pruned network. A kernel is converted to a binary string K with a length equal to the number of channels for that kernel. Each channel is encoded as 0 or 1, where channels with a 0 are pruned, and the n-th kernel $K_n$ is represented as a binary series after sampling each bit from a Bernoulli distribution for all C channels. Each member (i.e. channel configuration) of the population is evaluated, and the top-k are kept for the next generation (i.e. iteration) based on the fitness score, where k corresponds to the total amount of pruning. The Roulette Wheel algorithm is used as the selection strategy (Goldberg and Deb, 1991), whereby the n-th member of the m-th generation $I_{m,n}$ has a probability of selection proportional to its fitness relative to all other members. This can simply be implemented by passing the fitness scores of all members through a softmax. To avoid members with high fitness scores losing information after mutation and cross-over, they also copy the highest-fitness members to the next generation along with their mutated versions.
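For example, the roulette-wheel selection step can be sketched as follows (an illustrative helper, not the authors' implementation):

import torch

def roulette_select(fitness: torch.Tensor, n_select: int):
    probs = torch.softmax(fitness, dim=0)                        # selection probabilities
    return torch.multinomial(probs, n_select, replacement=True)  # indices of selected members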
The main contribution is a 2-stage fitness scoring process. First, a local Taylor series approximation of a layer-wise error function using the aforementioned OBS objective (Dong et al., 2017) (recall that OBS mainly revolves around efficient Hessian approximation) is applied sequentially from the first layer to the last, followed by a few epochs of retraining to restore the accuracy of the pruned network. Second, the pruned network is distilled using a cross-entropy loss and a regularization term that forces the feature maps of the pruned network to be similar to those of the distilled model, using an attention map to ensure both corresponding layer feature maps have the same fixed size. They achieve SoTA on ImageNet and CIFAR-10 for VGG-16 and ResNet CNN architectures using this approach.
Pruning via Simulated Annealing Noy et al. (2019) propose to reduce the time spent searching for neural architectures by relaxing the discrete search space to a continuous one, which allows for a differentiable simulated annealing that is optimized using gradient descent (following the DARTS (Liu et al., 2018a) approach). This leads to much faster solutions than black-box search, since optimizing over the continuous search space is an easier combinatorial optimization problem that in turn leads to faster convergence. This pruning technique is not strictly compression in its standard definition, as it prunes during the initial training period as opposed to pruning after pretraining. It falls under the category of neural architecture search (NAS), and here they use an annealing schedule that controls the amount of pruning during NAS, incrementally making it easier to search for sub-modules that are found to have good performance in the search process. Their (0, δ)-PAC theorem guarantees, under a few assumptions (see the paper for further details), that this anneal-and-prune approach prunes less important weights with high probability.
(PF) applies sequential Monte Carlo estimation, with particles representing the probability density, where the posterior is estimated with a random sample and the parameters used for posterior estimation. PF propagates parameters with large magnitudes and deletes parameters with the smallest weights in the re-sampling process, similar to MBP. They use PF to prune the network and then retrain to compensate for the loss in performance due to PF pruning. When applied to CNNs, they reduce the size of kernel and feature map tensors while maintaining test accuracy.
Particle Swarm Optimized Pruning Particle Swarm Optimization (PSO) has also been combined with a correlation merging algorithm (CMA) for pruning (Tu et al., 2010). Equation 43 shows the PSO update formula for the velocity $V_i^d$ of particle $X_i^d$ (i.e. a parameter vector in a DNN) at the $d$-th iteration:

$V_i^d := V_i^d + c_1 u (P_i^d - X_i^d) + c_2 u (P_g^d - X_i^d), \quad \text{where} \quad X_i^d := X_i^d + V_i^d$   (43)

where $u \sim \text{Uniform}(0, 1)$ and $c_1$, $c_2$ are learning rates corresponding to the influence of the social and cognitive components of the swarm respectively (Kennedy and Eberhart, 1995). Once the velocity vectors are updated for the DNN, the standard deviation is computed for the $i$-th unit as $s_i = \sqrt{\sum_{p=1}^{n} (V_i^p - \bar{V}_i)^2}$, where $\bar{V}_i$ is the mean value of $V_i$ over training samples. The Pearson correlation coefficient between the $i$-th and $j$-th units in the hidden layer is then computed as $C_{ij} = \big(\sum_p V_i^p V_j^p - n \bar{V}_i \bar{V}_j\big) / (S_i S_j)$, and if $C_{ij} > \tau_1$, where $\tau_1$ is a predefined threshold, both units are merged: the $j$-th unit is deleted and the weights are updated accordingly.
retraining to improve computational cost and time. During training, they fine-tune the best explored model given by the policy search. The MBP ratio is constrained such that the compressed model produced by the agent is below a resource threshold in the resource-constrained case. Moreover, the maximum amount of pruning for each layer is constrained to be less than 80%. When the focus is instead on maintaining accuracy, they define the reward function to incorporate accuracy and the available hardware resources. By requiring only 1/4 of the FLOPs, they still manage to achieve a 2.7% increase in accuracy for MobileNet-V1. This also corresponds to a 1.53 times speedup on a Titan Xp GPU and a 1.95 times speedup on a Google Pixel 1 Android phone.
Liu et al. (2018b) have further shown that the network architecture itself is more important than the remaining weights after pruning pretrained networks, suggesting that pruning is better viewed as an effective architecture search. This coincides with Weight Agnostic Neural Network (WANN; Gaier and Ha, 2019) search, which avoids weight training. Topologies of WANNs are searched over by first sampling a single shared weight for a small subnetwork and evaluating it over several rollouts with randomly sampled shared weights. For each rollout, the cumulative reward over a trial is computed, and the population of networks is ranked according to the resulting performance and network complexity. The highest ranked networks are probabilistically selected and mixed at random to form a new population. The process repeats until the desired performance and time complexity are met.
The two aforementioned findings (that there exist smaller sparse subnetworks that perform well from scratch, and the importance of architecture design) have revived interest in criteria for finding sparse and trainable subnetworks that lead to strong performance.
However, the original LTH paper was demonstrated on relatively simple CV tasks such as MNIST, and when scaled up it required careful fine-tuning of the learning rate for the lottery ticket subnetwork to achieve the same performance as the full network. To scale LTH up to larger architectures in a stable way without requiring additional fine-tuning, Frankle et al. (2019) relax the restriction that the lottery ticket be found at initialization and instead rewind to the k-th epoch, where k typically corresponds to only a few training epochs from initialization. Since the lottery ticket (i.e. subnetwork) no longer corresponds to a randomly initialized subnetwork but instead to a network trained for k epochs, they refer to these subnetworks as matching tickets instead. This relaxation of
LTH allows tickets to be found on CIFAR-10 with ResNet-20 and ImageNet with ResNet-50,
avoiding the need for using optimizer warmups to precompute learning rate statistics.
Zhou et al. (2019b) further investigate the importance of three main factors in pruning from scratch: (1) the pruning criterion used, (2) the point the model is pruned from (e.g. from initialization or from the k-th epoch) and (3) the type of mask used. They find that measuring the distance between a weight's value at initialization and its value after training is a suitable criterion for pruning, performing at least as well as preserving the weights with the largest magnitude. They also note that weights whose sign is unchanged after training can be preserved. Lastly, for (3) they find that using a binary mask and setting pruned weights to 0 plays an integral part in LTH. The fact that these LTH-based pruning masks outperform random masks at initialization raises the question of whether we can search for architectures by pruning as a way of learning, instead of traditional backpropagation training. In fact, Zhou et al. (2019b) also propose to use REINFORCE (Sutton et al., 2000) to optimize and search for optimal wirings at each layer. In the next subsection, we discuss recent work that aims to find optimal architectures using various criteria.
$$W_k \leftarrow W_k - \eta \frac{\partial E}{\partial W_k} - \eta\alpha + \sqrt{2\eta\Gamma}\, v_k \qquad (46)$$
for the k-th connection. Here, η is the learning rate, Γ is a temperature term, E is the error function and the noise $v_k \sim \mathcal{N}(0, I\sigma^2)$ is drawn for each active weight $W_k$. If $W_k < 0$ then the connection is frozen. When the number of dormant weights exceeds a threshold, they reactivate dormant weights with uniform probability. The main difference between this update rule and SGD lies in the noise term $\sqrt{2\eta\Gamma}\, v_k$, whereby the noise $v_k$, with magnitude controlled by Γ, performs a type of random walk in parameter space. Although unique, this approach is computationally expensive and challenging to apply to large networks and datasets.
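A minimal NumPy sketch of this kind of noisy, freeze-and-reactivate update is given below; the function names, hyperparameter values and reactivation policy are illustrative assumptions rather than the exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_rewiring_step(w, grad, active, eta=0.05, alpha=1e-4, gamma=1e-5):
    """One update in the style of Equation 46, applied to active connections only.

    w      : 1-D array of connection parameters
    grad   : dE/dw for the current mini-batch
    active : boolean mask of currently active (non-frozen) connections
    """
    w = w.copy()
    noise = rng.standard_normal(w.shape)
    # gradient step + constant shrinkage + random-walk noise on the active weights
    w[active] -= eta * grad[active] + eta * alpha
    w[active] += np.sqrt(2.0 * eta * gamma) * noise[active]
    # connections whose parameter drops below zero become dormant (frozen)
    active = active & (w > 0)
    return w, active

def reactivate(active, target_active):
    """If too many connections are dormant, re-activate dormant ones uniformly at random."""
    deficit = target_active - int(active.sum())
    if deficit > 0:
        dormant = np.flatnonzero(~active)
        chosen = rng.choice(dormant, size=min(deficit, dormant.size), replace=False)
        active[chosen] = True
    return active
```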
Sparse evolutionary training (SET; Mocanu et al., 2018) simplifies prune–regrowth cycles by replacing the lowest-magnitude weights with newly randomly initialized weights and retraining, and this process is repeated at each epoch of training. Dai et al. (2019) carry out the same procedure as SET but use gradient magnitude as the criterion for pruning weights. Dynamic Sparse Reparameterization (DSR; Mostafa and Wang, 2019) implements a prune–redistribute–regrowth cycle where target sparsity levels are redistributed among layers based on loss gradients (in contrast to SET, which uses fixed, manually configured sparsity levels). Sparse Momentum (SM; Dettmers and Zettlemoyer, 2019) follows the same cycle but instead uses the mean momentum magnitude of each layer during the redistribute phase. SM outperforms DSR on ImageNet for unstructured pruning by a small margin but shows no performance difference in the CIFAR experiments. Other dynamic approaches use error compensation mechanisms in place of hand-crafted redistribute–regrowth cycles.
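A schematic prune-and-regrow step in the spirit of SET might look like the following sketch; the replacement fraction `zeta` and the re-initialization scale are assumed values, and freshly pruned positions are not excluded from regrowth for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def set_prune_regrow(w, mask, zeta=0.3, init_std=0.01):
    """Remove the smallest-magnitude active weights and regrow the same
    number of connections at random positions (SET-style cycle).

    w    : 2-D weight matrix
    mask : binary matrix of the same shape marking active connections
    zeta : fraction of active connections to replace in this cycle
    """
    active_idx = np.flatnonzero(mask)
    n_replace = int(zeta * active_idx.size)

    # prune: drop the n_replace active weights with the smallest magnitude
    order = np.argsort(np.abs(w.flat[active_idx]))
    pruned = active_idx[order[:n_replace]]
    mask.flat[pruned] = 0
    w.flat[pruned] = 0.0

    # regrow: activate the same number of currently inactive positions
    inactive_idx = np.flatnonzero(mask.ravel() == 0)
    regrown = rng.choice(inactive_idx, size=n_replace, replace=False)
    mask.flat[regrown] = 1
    w.flat[regrown] = rng.normal(0.0, init_std, size=n_replace)
    return w, mask
```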
Ramanujan et al. (2020) propose the edge-popup algorithm, which searches within a randomly initialized network for a pruned subnetwork with optimal accuracy. The algorithm works by switching edges until the optimal configuration is found. Each weight from neuron u to v is assigned a “popup” score s_uv. The top-k% of weights with the highest popup scores are preserved while the remaining weights are pruned. Since the top-k threshold is a step function and therefore non-differentiable, they use a straight-through estimator to allow gradients to backpropagate and differentiate the loss with respect to s_uv for each weight, i.e., the thresholding is treated as the identity function in the backward pass. The scores s_uv are then updated via SGD. Unlike Theis et al. (2018), who use the absolute value of the gradient, they find that preserving the direction of momentum leads to better performance. During training, removed edges that are not within the top-k can switch to other positions in the same layer as the scores change. They show that this shuffling of weights to find an optimal permutation leads to lower cross-entropy loss throughout training. Interestingly, this type of adaptive pruning leads to competitive performance on ImageNet when compared to ResNet-34 and can also be performed on pretrained networks.
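A minimal PyTorch-style sketch of the edge-popup idea, keeping only the edges with the top-k% scores in the forward pass while letting gradients reach all scores through a straight-through estimator, is shown below; the layer sizes, the `k` ratio and the use of score magnitudes are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMask(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scores, k):
        # keep the k highest-scoring edges, prune the rest
        n_keep = int(k * scores.numel())
        threshold = scores.flatten().topk(n_keep).values.min()
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        # straight-through estimator: pass gradients to all scores
        return grad_output, None

class EdgePopupLinear(nn.Module):
    def __init__(self, in_features, out_features, k=0.5):
        super().__init__()
        self.k = k
        # weights stay frozen at their random initialization
        self.weight = nn.Parameter(0.1 * torch.randn(out_features, in_features),
                                   requires_grad=False)
        # learnable "popup" scores, one per edge
        self.scores = nn.Parameter(torch.rand(out_features, in_features))

    def forward(self, x):
        mask = TopKMask.apply(self.scores.abs(), self.k)
        return F.linear(x, self.weight * mask)

layer = EdgePopupLinear(20, 10)
out = layer(torch.randn(4, 20))
out.sum().backward()          # gradients flow into layer.scores only
```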
Layer collapse cuts off gradient flow (in the extreme case, a whole layer is removed), leading to poor signal propagation; the maximal compression that still allows the gradient to flow is referred to as critical compression.
They show that retraining with MBP avoids layer-wise collapse because gradient-based optimization encourages compression with high signal propagation. From this insight, they propose a measure of synaptic flow, expressed in Equation 47. The parameters are first masked as $\theta_\mu \leftarrow \mu \odot \theta_0$. The iterative synaptic flow pruning objective is then evaluated as

$$\mathcal{R} = \mathbf{1}^{T}\Big(\prod_{l=1}^{L}\big|\theta_\mu^{[l]}\big|\Big)\mathbf{1} \qquad (47)$$

where $\mathbf{1}$ is a vector of ones. The score is computed as $\mathcal{S} = \frac{\partial \mathcal{R}}{\partial \theta} \odot \theta_\mu$ and the threshold $\tau$ is defined as $\tau = (1 - \rho^{-k/n})$, where $n$ is the number of pruning iterations, $k$ the current iteration and $\rho$ the compression ratio. If $\mathcal{S} > \tau$ then the mask $\mu$ is updated.

Figure 6: original source: Tanaka et al. (2020) - Layer collapse in a VGG-16 network for different pruning criteria on CIFAR-100.
The effects of layer collapse for random pruning, MBP, SNIP and synaptic flow (SynFlow) are shown in Figure 6. We see that SynFlow achieves a far higher compression ratio for the same test accuracy without requiring any data.
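As a concrete illustration, the SynFlow score of Equation 47 can be computed data-free with a single forward–backward pass on an all-ones input over the absolute values of the parameters; the sketch below, for a small feed-forward PyTorch model, is a simplified reading of that procedure rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def synflow_scores(model, input_shape):
    """Data-free saliency: R = 1^T ( prod_l |theta^[l]| ) 1, scored as |dR/dtheta * theta|."""
    # temporarily replace parameters by their absolute values
    signs = {}
    for name, p in model.state_dict().items():
        signs[name] = torch.sign(p)
        p.abs_()

    ones = torch.ones(1, *input_shape)           # all-ones input, no data needed
    R = model(ones).sum()                         # scalar "synaptic flow" objective
    R.backward()

    scores = {name: (p.grad * p).abs().clone()
              for name, p in model.named_parameters() if p.grad is not None}

    # restore the original parameter signs
    for name, p in model.state_dict().items():
        p.mul_(signs[name])
    return scores

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 4))
scores = synflow_scores(model, input_shape=(10,))
```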
where, for a low rank k < r, $W \in \mathbb{R}^{m\times k}$, $H \in \mathbb{R}^{k\times n}$ and $\|\cdot\|_F$ is the Frobenius norm.
A common technique for achieving this low-rank decomposition is Singular Value Decomposition (SVD). For orthogonal matrices $U \in \mathbb{R}^{m\times r}$, $V \in \mathbb{R}^{n\times r}$ and a diagonal matrix $\Sigma \in \mathbb{R}^{r\times r}$ of singular values, we can express A as
$$A = U\Sigma V^{T} \qquad (49)$$
where, if k < r, this is called truncated SVD. The nonzero elements of $\Sigma$ are sorted in decreasing order and the top k values $\Sigma_k \in \mathbb{R}^{k\times k}$ are used to form $A \approx U_k \Sigma_k V_k^T$.
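For instance, a dense weight matrix can be replaced by two smaller factors obtained from the truncated SVD; the NumPy sketch below is generic, with the rank k a free choice trading accuracy for size.

```python
import numpy as np

def truncated_svd_factors(A, k):
    """Factor A (m x n) into U_k (m x k) and V_k (k x n) with A ~= U_k @ V_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k] * s[:k]          # fold the top-k singular values into U
    V_k = Vt[:k, :]
    return U_k, V_k

A = np.random.randn(512, 256)
U_k, V_k = truncated_svd_factors(A, k=32)
# storage drops from 512*256 parameters to 32*(512 + 256)
rel_err = np.linalg.norm(A - U_k @ V_k) / np.linalg.norm(A)
```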
Randomized SVD (Halko et al., 2011) has also been introduced for faster approximation using ideas from random matrix theory. The range of A is approximated by finding a matrix Q with r orthonormal columns such that $A \approx QQ^T A$. The SVD is then computed on the smaller matrix $B = Q^T A = S\Sigma V^T$ as before, using Equation 49. Since $A \approx QB = Q(S\Sigma V^T)$, taking $U = QS$ gives a low-rank approximation $A \approx U\Sigma V^T$. Q is approximated by forming a Gaussian random matrix $\Omega \in \mathbb{R}^{n\times l}$, computing the sketch $Z = A\Omega$ and taking the QR decomposition $QR = Z$, so that $Q \in \mathbb{R}^{m\times l}$ has columns that form an orthonormal basis for the range of Z.
Numerical precision is maintained by taking intermediate QR and LU decompositions during o power iterations of $AA^T$ to reduce the spectrum of the sketch, since if the singular values of A are $\Sigma$, then the singular values of $(AA^T)^o A$ are $\Sigma^{2o+1}$. With each power iteration the spectrum decays exponentially, so only very few iterations are required.
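A bare-bones randomized SVD along these lines (Gaussian sketch, a few power iterations, then an exact SVD of the small matrix B) could be written as the following sketch.

```python
import numpy as np

def randomized_svd(A, l, n_power_iter=2, seed=0):
    """Approximate the top-l singular triplets of A (m x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, l))          # Gaussian test matrix
    Z = A @ Omega                                 # sketch of the range of A
    Q, _ = np.linalg.qr(Z)
    for _ in range(n_power_iter):                 # power iterations sharpen the spectrum
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    B = Q.T @ A                                   # small l x n matrix
    S, Sigma, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ S                                     # lift back to the original space
    return U, Sigma, Vt

A = np.random.randn(1000, 400)
U, Sigma, Vt = randomized_svd(A, l=50)
A_approx = U @ np.diag(Sigma) @ Vt
```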
These products are used when performing Canonical Polyadic (CP; Hitchcock, 1927), Tucker (Tucker, 1966) and Tensor Train (TT; Oseledets, 2011) decompositions to find the factor matrices X := [[A, B, . . . , C]]. For the sake of simplicity we proceed with 3-way tensors.
As before in Equation 48, we can express the optimization objective as
$$\min_{A,B,C} \sum_{i,j,k} \Big\| x_{ijk} - \sum_{l} a_{il}\, b_{jl}\, c_{kl} \Big\|^2 \qquad (50)$$
Since the components a, b, c are not orthogonal, we cannot compute the SVD as was the case for matrices. Computing the rank r of a tensor is also NP-hard, and the solutions found for lower-rank approximations are not necessarily part of the solution for higher ranks. Unlike matrices, where one can rotate the row or column vectors, apply dimensionality reduction (e.g., PCA) and still obtain the same solution, this is not the case for TD. Whereas a matrix can admit many low-rank factorizations, a tensor requires a low-rank factorization that is compatible with all tensor slices. This interconnection between different slices makes tensors more restrictive, and hence uniqueness holds under weaker conditions.
One way to perform TD using Equation 50 is alternating least squares (ALS), which minimizes over A while fixing B and C, and repeats this for B and for C. ALS is suitable because, although the overall problem is nonconvex, each subproblem is convex. CP can be generalized to objectives other than the squared loss, such as Rayleigh (when entries are non-negative), Boolean (when entries are binary) and Poisson (when entries are counts) losses. Similar to the randomized SVD described in the previous subsection, randomized variants have also been successful in scaling up TD.
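A compact ALS loop for the 3-way CP objective of Equation 50 might look like the sketch below (NumPy, using the Khatri–Rao product, a fixed rank r and a fixed number of iterations; convergence checks are omitted).

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (J*K) x r."""
    r = B.shape[1]
    return np.einsum('jr,kr->jkr', B, C).reshape(-1, r)

def unfold(X, mode):
    """Mode-n unfolding of a 3-way tensor."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def cp_als(X, r, n_iter=50, seed=0):
    """Alternating least squares for X_ijk ~= sum_l a_il * b_jl * c_kl."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, r))
    B = rng.standard_normal((J, r))
    C = rng.standard_normal((K, r))
    for _ in range(n_iter):
        # each subproblem is a linear least-squares solve with the other factors fixed
        A = unfold(X, 0) @ np.linalg.pinv(khatri_rao(B, C)).T
        B = unfold(X, 1) @ np.linalg.pinv(khatri_rao(A, C)).T
        C = unfold(X, 2) @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C

X = np.random.randn(8, 9, 10)
A, B, C = cp_als(X, r=4)
```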
With this introduction, we now move on to how low-rank TD has been applied to DNNs to reduce the size of large weight matrices.
For N 2-d filters {f j }1≤j≤N , one can obtain a shared set of separable (rank-1) filters by
minimizing the objective in Equation 52
$$\underset{\{f^j\},\{m_i^j\}}{\arg\min}\; \sum_i \Big\| x_i - \sum_{j=1}^{N} f^j \ast m_i^j \Big\|_2^2 + \lambda_1 \sum_{j=1}^{N} \big\| m_i^j \big\|_1 \qquad (52)$$
where $x_i$ is an input image, $\ast$ denotes the convolution operator, $\{m_i^j\}_{j=1\ldots N}$ are the feature maps obtained during training and $\lambda_1$ is a regularization coefficient. Both the latent feature maps $m_i^j$ and the filters $f^j$ can be optimized using stochastic gradient descent (SGD).
In the first approach they identify low-rank filters using the objective in Equation 53 to penalize high-rank filters.
$$\underset{\{s^j\},\{m_i^j\}}{\arg\min}\; \sum_i \Big\| x_i - \sum_{j=1}^{N} s^j \ast m_i^j \Big\|_2^2 + \lambda_1 \sum_{j=1}^{N} \big\| m_i^j \big\|_1 + \lambda_* \sum_{j=1}^{N} \big\| s^j \big\|_* \qquad (53)$$
where the $s^j$ are the learned linear filters, $\|\cdot\|_*$ is the sum of singular values (a convex relaxation of the rank), and $\lambda_*$ is an additional regularization parameter. The second approach separately optimizes the squared difference between the original filter $f^j$ and a weighted combination of learned linear filters $w_k^j s^k$, together with the sum of singular values of the learned filters s.
$$\underset{\{s_k\},\{w_k^j\}}{\arg\min}\; \sum_j \Big\| f^j - \sum_{k=1}^{M} w_k^j s^k \Big\|_2^2 + \lambda_* \sum_{k=1}^{M} \big\| s^k \big\|_* \qquad (54)$$
They find empirically that decoupling the computation of the non-separable filters from that of the separable ones leads to better results compared to jointly optimizing over $s^j$, $m_i^j$ and $w_k^j$, which is a difficult optimization problem.
Figure 7: original source: Jaderberg et al. (2014) - Low Rank Expansion Methods: (a) standard CNN filter, (b) low-rank approximation along the spatial dimension with 2-d separable filters and (c) extension to 3-d filters, where each conv. layer is factored as a sequence of two standard conv. layers with rectangular filters.
5. Knowledge Distillation
Knowledge distillation involves learning a smaller network from a large network using
supervision from the larger network and minimizing the entropy, distance or divergence
between their probabilistic estimates.
To our knowledge, Buciluǎ et al. (2006) first explored the idea of reducing model size
by learning a student network from an ensemble of models. They use a teacher network to
label a large amount of unlabeled data and train a student network using supervision from
the pseudo labels provided by the teacher. They find performance is close to that of the original ensemble with a network 1000 times smaller.
Hinton et al. (2015) propose a neural network knowledge distillation approach where a relatively small model (2 hidden layers with 800 hidden units and ReLU activations) is trained using supervision (class probability outputs) from the original “teacher” model (2 hidden layers, 1200 hidden units). They showed that learning from the larger network outperformed the smaller network learning from scratch in the standard supervised classification setup. In the case of learning from an ensemble, the average class probability is used as the target.
The cross-entropy loss is used between the class probability outputs of the student $y^S$ and the one-hot target $y$, and a second term ensures that the student logits $z^S$ are similar to the teacher logits $z^T$. This is expressed as
$$\mathcal{L}_{KD} = (1-\alpha)\, H\big(y, y^S\big) + \alpha\rho^2\, H\Big(\phi\big(\tfrac{z^T}{\rho}\big),\, \phi\big(\tfrac{z^S}{\rho}\big)\Big) \qquad (55)$$
where $\phi$ denotes the softmax and $\rho$ the distillation temperature.
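A generic PyTorch rendering of this loss, with temperature T and mixing weight α as the usual hyperparameters, is sketched below; it is a standard implementation of Equation 55 rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    """(1 - alpha) * CE(student, labels) + alpha * T^2 * KL(teacher_T || student_T)."""
    hard_loss = F.cross_entropy(student_logits, targets)
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                         F.softmax(teacher_logits / T, dim=-1),
                         reduction='batchmean')
    # the T^2 factor keeps the soft-target gradients on the same scale as the hard-target ones
    return (1 - alpha) * hard_loss + alpha * (T ** 2) * soft_loss

student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, targets)
loss.backward()
```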
Figure 8: original source Mirzadeh et al. (2019)
Teacher Assistant Knowledge Distillation Mirzadeh et al. (2019) show that the
performance of the student network degrades when the gap between the teacher and the
student is too large for the student to learn from. Hence, they propose an intermediate ‘teaching assistant’ network to supervise and distil the student network, where the intermediate network is itself distilled from the teacher network.
Figure 8 shows their plot, where on the left side (a) and (b) we see that, as the gap between the student and teacher networks widens with the student network size fixed, the performance of the student network gradually degrades. Similarly, on the right-hand side, a similar trend is observed when the student network size is increased with a fixed teacher network.
Theoretical analysis and extensive experiments on the CIFAR-10, CIFAR-100 and ImageNet datasets, with CNN and ResNet architectures, substantiate the effectiveness of their proposed approach.
Their Figure 9 shows the loss surface of CNNs trained on CIFAR-100 for 3 different
approaches: (1) no distillation, (2) standard knowledge distillation and (3) teaching assisted
knowledge distillation. As shown, the teaching assisted knowledge distillation has a smoother
Figure 10: original source Cho and Hariharan (2019): Early Stopping Teacher Networks to
Improve Student Network Performance
surface around the local minima, corresponding to more robustness when the inputs are
perturbed and better generalization.
Avoid Training the Teacher Network with Label Smoothing Muller et al. (2019) show that because label smoothing forces representations of samples from the same class to be closer to each other in the embedding space, it provides less information to the student network about the boundaries between classes and in turn leads to poorer generalization performance. They quantify the variation in logit predictions due to the hard targets using the mutual information between the input and the output logit, and show that label smoothing reduces this mutual information. Hence, they draw a connection between label smoothing and the information bottleneck principle and show through experiments that label smoothing can implicitly calibrate the predictions of a DNN.
Distilling with Noisy Labels Sau and Balasubramanian (2016) propose to use noise to simulate learning from multiple teacher networks by simply adding Gaussian noise to the logit outputs of the teacher network, resulting in better compression when compared to training with the original teacher logits as targets. They choose a set of samples from each mini-batch with probability α to be perturbed by noise while the remaining samples are unchanged. They find that a relatively high α = 0.8 performed best for image classification tasks, corresponding to 80% of teacher logits having noise.
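The logit-perturbation scheme itself is only a few lines; in the sketch below the noise scale and the per-sample selection are illustrative assumptions.

```python
import torch

def perturb_teacher_logits(teacher_logits, alpha=0.8, noise_std=1.0):
    """Add Gaussian noise to the teacher logits of a randomly chosen
    fraction alpha of the samples in the mini-batch."""
    noisy = teacher_logits.clone()
    chosen = torch.rand(teacher_logits.size(0)) < alpha
    noisy[chosen] = teacher_logits[chosen] + noise_std * torch.randn_like(teacher_logits[chosen])
    return noisy

teacher_logits = torch.randn(16, 10)
noisy_targets = perturb_teacher_logits(teacher_logits)   # used in place of the clean logits
```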
Li et al. (2017) distil models with noisy labels and use a small dataset with clean labels, alongside a knowledge graph that contains the label relations, to estimate the risk associated with training on each noisy label. A model is trained on the clean dataset D_c and the main model is trained over the whole dataset D with noisy labels using a combined loss function.
Figure 11: original source Lopes et al. (2017): Data-Free Knowledge Distillation.
strategy could also be useful for using the teacher network to provide samples to a smaller
student network that improve the learning of the student.
Layer Fusion Layer Fusion (LF) (Neill et al., 2020) is a technique to identify similar layers in very deep pretrained networks and fuse the top-k most similar layers during retraining for a target task. Various alignment measures with desirable properties for layer fusion are proposed, and freezing, averaging and dynamic mixing of the top-k layer pairs are all experimented with for fusing the layers. This can be considered a unique approach to knowledge distillation, as it aims to preserve the knowledge in the network while keeping the network dense, but without having to train a student network from scratch.
5.3 Distilling Recurrent (Autoregressive) Neural Networks
Although the work by Buciluǎ et al. (2006) and Hinton et al. (2015) has often proven successful in reducing the size of neural models for non-sequential tasks, many sequential tasks in NLP and CV have high-dimensional outputs (machine translation, pixel generation, image captioning etc.). This means that using the teacher's probabilistic outputs as targets can be expensive.
Kim and Rush (2016) use the teacher's hard targets (1-hot vectors) given by the highest-scoring beam search prediction from an encoder-decoder RNN, instead of the soft output probability distribution. The teacher distribution $q(y_t|x)$ is approximated by its mode, $q(y_t|x) \approx \mathbb{1}\{y_t = \arg\max_{y_t\in\mathcal{Y}} q(y_t|x)\}$, with the following objective
$$\mathcal{L}_{SEQ\text{-}KD} = -\mathbb{E}_{x\sim\mathcal{D}} \sum_{y_t\in\mathcal{Y}} q(y_t|x)\log p(y_t|x) \;\approx\; -\mathbb{E}_{x\sim\mathcal{D}}\big[\log p(y_t = \hat{y}_s \mid x)\big], \qquad \hat{y}_s = \underset{y_t\in\mathcal{Y}}{\arg\max}\; q(y_t|x) \qquad (58)$$
where $y_t \in \mathcal{Y}$ are teacher targets (originally defined by the predictions with the highest beam search score) in the space of possible target sequences. When the temperature $\tau \rightarrow 0$, this is equivalent to standard knowledge distillation.
In sequence-level interpolation, the targets from the teacher with the highest similarity to the ground truth are used as the targets for the student network. Experiments on NMT showed performance improvements compared to soft targets, and further pruning the distilled model results in a pruned student that has 13 times fewer parameters than the teacher network with only a 0.4 point decrease in BLEU.
and T. The TinyBERT distillation objective is shown below, where it combines multiple reconstruction errors: between the S and T embeddings (when m = 0); between the hidden and attention layers of S and T when M ≥ m > 0, where M is the index of the last hidden layer before the prediction layer; and lastly the cross-entropy between the predictions, where t is the temperature of the softmax.
$$\mathcal{L}_{\text{layer}}\big(S_m, T_{g(m)}\big) = \begin{cases} \text{MSE}\big(E^S W_e,\, E^T\big) & m = 0 \\ \text{MSE}\big(H^S W_h,\, H^T\big) + \frac{1}{h}\sum_{i=1}^{h}\text{MSE}\big(A_i^S, A_i^T\big) & M \ge m > 0 \\ -\,\text{softmax}\big(z^T\big)\cdot \text{log-softmax}\big(z^S/t\big) & m = M + 1 \end{cases}$$
Through extensive ablation experiments, they find distilling the knowledge from the multi-head attention layers to be an important step in improving distillation performance.
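A generic layer-matching distillation term of this form, combining a hidden-state MSE through a learned projection with an attention-map MSE, can be sketched as follows; the shapes and the `proj` module are assumptions for illustration rather than TinyBERT's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def layer_distillation_loss(student_hidden, teacher_hidden,
                            student_attn, teacher_attn, proj):
    """MSE between projected student and teacher hidden states, plus the
    MSE over the attention matrices of the matched layer pair."""
    hidden_loss = F.mse_loss(proj(student_hidden), teacher_hidden)
    attn_loss = F.mse_loss(student_attn, teacher_attn)
    return hidden_loss + attn_loss

# toy shapes: batch 2, sequence 16, student dim 128, teacher dim 256, 4 heads
proj = nn.Linear(128, 256, bias=False)       # maps the student space into the teacher space
s_h, t_h = torch.randn(2, 16, 128), torch.randn(2, 16, 256)
s_a, t_a = torch.rand(2, 4, 16, 16), torch.rand(2, 4, 16, 16)
loss = layer_distillation_loss(s_h, t_h, s_a, t_a, proj)
```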
ALBERT Lan et al. (2019) proposed factorized embeddings to reduce the size of the vocabulary embeddings, and parameter sharing across layers to reduce the number of parameters without a performance drop; they further improve performance by replacing next-sentence prediction with an inter-sentence coherence loss. ALBERT is 5.5% of the size of the original BERT and has produced state-of-the-art results on top NLP benchmarks such as GLUE (Wang et al., 2018a), SQuAD (Rajpurkar et al., 2016) and RACE (Lai et al., 2017).
BERT Distillation for Text Generation Chen et al. (2019) use a conditional masked language model that enables BERT to be used on generation tasks. The outputs of a pretrained BERT teacher network are used to provide sequence-level supervision to improve Seq2Seq models and allow them to plan ahead. Figure 12 illustrates the process, showing where the predicted probability distribution for the remaining tokens is minimized with respect to the masked output sequence from the BERT teacher.

Figure 12: original source (Chen et al., 2019): BERT Distillation for Text Generation
Applications to Machine Translation Zhou et al. (2019a) seek to better understand
why knowledge distillation leads to better non-autoregressive distilled models for machine
translation. They find that the student network finds it easier to model variations in the
output data since the teacher network reduces the complexity of the dataset.
its misclassification rate. Similarly, this can be used on the ensemble to learn which outputs from each model to use for supervision. Instead of learning from a combination of teachers, an oracle that approximates the best outcome of the ensemble is used for automatic speech recognition (ASR) as
$$P_{\text{oracle}}(s|x) = \sum_{i=1}^{N} [O(u) = i]\, P_i(s|x) = P_{O(u)}(s|x) \qquad (59)$$
where the oracle $O(u) \in \{1, \ldots, N\}$ over $N$ teachers assigns all the weight to the model that has the lowest word error for a given utterance $u$. Each model is an RNN with a different architecture trained with different objectives, and the student $s$ is trained using the Kullback–Leibler (KL) divergence between the oracle-assigned teacher's output and the student network output. They achieve an 8.9% word error rate improvement over similarly structured baseline models.
Freitag et al. (2017) apply knowledge distillation to NMT by distilling an ensemble of networks and an oracle BLEU teacher network into a single NMT system. They find that a student network of equal size to the teacher network outperforms the teacher after training. They also reduce training time by only updating the student network with samples filtered based on the knowledge of the teacher network, which further improves translation performance.
Cui et al. (2017) propose two strategies for learning from an ensemble of teacher networks: (1) alternate between each teacher in the ensemble when assigning labels for each mini-batch, and (2) simultaneously learn from multiple teacher distributions via data augmentation. They experiment with both approaches where the teacher networks are deep VGG and LSTM acoustic models.
Cui et al. (2017) extend knowledge distillation to multilingual problems. They use multiple pretrained teacher LSTMs trained on multiple low-resource languages to distil into a smaller standard (fully-connected) DNN. They find that student networks with good input features learn more easily from the teacher's labels and can improve over the original teacher network. Moreover, their experiments suggest that allowing the ensemble of teachers to learn from one another further improves the distilled model.
Mean Teacher Networks Tarvainen and Valpola (2017) find that averaging the model weights of an ensemble at each epoch is more effective than averaging label predictions for semi-supervised learning. This means the Mean Teacher can be used as an unsupervised distillation approach, since the distillation itself does not need labels, unlike methods which rely on supervision for each ensemble model. They find this straightforward approach outperforms previous ensemble-based distillation approaches (Laine and Aila, 2016) when only given 1000 labels on the Street View House Numbers (SVHN; Goodfellow et al., 2013) dataset. Moreover, using Mean Teacher networks with Residual Networks improved the SoTA with 4000 labels from 10.55% error to 6.28% error.
On-the-Fly Native Ensemble Zhu et al. (2018) focus on using distillation on the fly in a scenario where the teacher may not be fully pretrained or does not have a high capacity. This reduces compression from a two-phase process (pretrain, then distil) to a single phase where both student and teacher network learn together. They propose an On-the-fly Native Ensemble (ONE) learning strategy that essentially learns a strong teacher network which assists the student network as it is learning. Performance improvements for on-the-fly distillation are found on the top benchmark image classification datasets.
Multi-Task Teacher Networks Liu et al. (2019a) perform knowledge distillation for multi-task learning (MTL), using the outputs of teacher models from each natural language understanding (NLU) task as supervision for a student network that performs MTL. The distilled MT-DNN outperforms the original network on 7 out of 9 NLU tasks (including sentence classification, pairwise sentence classification and pairwise ranking) on the GLUE (Wang et al., 2018a) benchmark dataset.
comparable, i.e., $u_h$ and $r$ must be the same non-linearity. The student tries to imitate the flow matrices of the teacher, which are defined as the inner product between feature maps, such as the layers in a residual block.
$$\mathcal{L}(x, y, W_s, W_t, \alpha) = -\frac{1}{N}\sum_{n=1}^{N}\bigg[ y_n \log(z_n^s) + \lambda T^2\, D_{KL}\Big(\sigma\big(\tfrac{z'^t}{T}\big),\, \sigma\big(\tfrac{z'^s}{T}\big)\Big)\bigg] + \lambda_V\, \mathcal{L}_{KL}(W_s, \alpha) + \lambda_g \sum_{m=1}^{M} \max_{n,k,h,l}\, W_{T:S}(m, n, k, h, l) \qquad (62)$$
Figure 13 shows their training procedure and the loss function used for learning compact and sparse student networks. The roles of the different terms in the variational loss function are: likelihood - for the student network's independent learning; hint - learning induced from the teacher network; variational term - promotes sparsity by optimizing the variational dropout parameters α; Block Sparse Regularization - promotes and transfers sparsity from the teacher network.
$$\min_{\theta\in\Theta}\max_{w\in\mathcal{W}}\; \mathbb{E}_{x\sim p_{\text{data}}}\big[\log f_w(x)\big] + \mathbb{E}_{z\sim p_z}\big[\log\big(1 - f_w(g_\theta(z))\big)\big] \qquad (63)$$
where the global minimum is found when the generator distribution $p_g$ is similar to the data distribution $p_{\text{data}}$ (referred to as the Nash equilibrium).
Figure 14: original source Wang et al. (2018c): Comparison among KD, NaGAN, and
KDGAN
Wang et al. (2018c) learn a generative adversarial student network where the generator learns from the teacher network using the minimax objective in Equation 63. By using the Gumbel-Max trick in their formulation of GAN knowledge distillation, they reduce the variance in gradient updates, which means fewer epochs are required to train to convergence.
First they propose a Naive GAN (NaGAN) which consists of a classifier C and a discriminator D, where C generates pseudo labels for a sample x from a categorical distribution $p_c(y|x)$ and D distinguishes between the true targets and the generated ones. The objective for NaGAN is expressed as
$$\min_{c}\max_{d}\; V(c, d) = \mathbb{E}_{y\sim p_u}\big[\log p_d(x, y)\big] + \mathbb{E}_{y\sim p_c}\big[\log\big(1 - p_d(x, y)\big)\big] \qquad (64)$$
where V(c, d) is the value function. The scoring functions of C and D are h(x, y) and g(x, y) respectively, and $p_c(y|x)$ and $p_d(x, y)$ are expressed in terms of these scores,
where φ is the softmax function and σ is the sigmoid function. However, NaGAN requires a large number of samples and epochs to converge to a Nash equilibrium using this objective, since the gradients from D that update C can often vanish or explode.
This brings us to their main contribution, the Knowledge Distilled GAN (KDGAN). KDGAN somewhat remedies the aforementioned convergence problem by introducing a pretrained teacher network T along with C and D. The objective then consists of a distillation ℓ2 loss component between T and C and an adversarial loss between T and D. Therefore, both C and T aim to fool D by generating fake labels that seem real, while C tries to distil the knowledge from T such that both C and T agree on a good fake label. The student network's convergence is tracked by observing the generator outputs and loss changes. Since the gradients from T tend to have low variance, this can help C converge faster, reaching a Nash equilibrium. The difference between these models is illustrated in Figure 14.
A student–teacher training framework with a joint loss is used for student training, where the teacher generator is trained using the deep convolutional GAN (DCGAN; Radford et al., 2015) framework. The joint training loss can be expressed as
$$\min_{\theta\in\Theta}\max_{w\in\mathcal{W}}\; \mathbb{E}_{x\sim p_{\text{data}}}\big[\log f_w(x)\big] + \mathbb{E}_{z\sim p_z}\Big[\alpha \log\big(1 - f_w(g_\theta(z))\big) + (1-\alpha)\,\big\| g_{\text{teacher}}(z) - g_\theta(z) \big\|_2 \Big] \qquad (66)$$
where α controls the influence of the MSE loss between the logit predictions gteacher (z)
and gθ (z) of teacher and student respectively. The terms with expectations correspond to
the standard adversarial loss.
where $\|\cdot\|_F$ denotes the Frobenius norm, I is the total number of layer pairs considered and γ controls the influence of the similarity-preserving term between the two networks.
In the transfer learning setting, their experiments show that similarity preserving can be a robust way to deal with domain shift. Moreover, this method complements the SoTA attention transfer (Zagoruyko and Komodakis, 2016a) approach.
Contrastive Representation Distillation Instead of minimizing the KL divergence between the scalar outputs of the teacher network T and the student network S, Tian et al. (2019) propose to preserve the structural information of the embedding space. Similar to Hinton et al. (2012b), they force the representations of the student and teacher networks to be similar, but instead use a contrastive loss that moves positively paired representations closer together while pushing positive–negative pairs apart. This contrastive objective is given by,
Figure 16: original source: Tian et al. (2019)
$$\underset{f^S}{\arg\max}\; \max_{h}\; \mathbb{E}_{q(T,S|C=1)}\big[\log h(T, S)\big] + N\, \mathbb{E}_{q(T,S|C=0)}\big[\log\big(1 - h(T, S)\big)\big] \qquad (68)$$
where $h(T, S) = \frac{e^{g^T(T)^{\top} g^S(S)/\tau}}{e^{g^T(T)^{\top} g^S(S)/\tau} + \frac{N}{M}}$, M is the number of data samples and τ is the temperature. If the dimensionalities of the outputs of $g^T$ and $g^S$ are not equal, a linear transformation to a fixed size is applied, followed by an $\ell_2$ normalization.
Figure 16 demonstrates how the correlations between the student and teacher networks are accounted for in CRD (d), while standard teacher–student training (a) ignores these correlations; to a lesser extent this is also the case for attention transfer (b) (Zagoruyko and Komodakis, 2016b) and for a student distilled with the KL divergence (c) (Hinton et al., 2015).
Distilling SimCLR Chen et al. (2020) show that a CNN pretrained with unsupervised contrastive learning requires 10 times fewer labels for fine-tuning on ImageNet than a purely supervised CNN (ResNet architecture). They find a strong correlation between the size of the pretrained network and the amount of labels it requires for fine-tuning. Finally, the contrastive network is distilled into a smaller version while sacrificing little classification accuracy.
Figure 17: original source Park et al. (2019): Individual knowledge distillation (IKD) vs.
relational knowledge distillation (RKD)
They find that measuring the angle between teacher and student outputs as the input to the Huber loss $l_\delta$ leads to improved performance when compared to the previous SoTA on metric learning tasks.
$$\mathcal{L}_{RKD\text{-}A} = \sum_{(x_i, x_j, x_k)\in \mathcal{X}^3} l_\delta\Big(\psi_A(t_i, t_j, t_k),\, \psi_A(s_i, s_j, s_k)\Big) \qquad (70)$$
This is then used as a regularization term added to the task-specific loss (Equation 71). Figure 18 shows the test-data recall@1 on the tested relational datasets. The teacher network is trained with the triplet loss and the student distils the knowledge using Equation 71. Left of the dashed line are results on the training domain, while the right shows results on the remaining domains.
Song et al. (2018) use attention-based knowledge distillation for fashion matching, jointly learning to match clothing items while incorporating domain-knowledge rules defined by clothing descriptions, where the attention learns to assign weights corresponding to the rule confidence.
To speed up training, enable faster inference and reduce memory bandwidth requirements, ongoing research has focused on training and performing inference with lower-precision networks using integer precision (IP) as low as INT-8, INT-4, INT-2 or 1-bit representations (Dally, 2015). Designing such networks makes it easier to train and deploy them on CPUs, FPGAs, ASICs and GPUs.
Two important features of quantization are the range of values that can be represented and the bit spacing. Signed integers with n bits represent the range $[-2^{n-1}, 2^{n-1}-1]$, while full precision (FP-32) covers approximately $\pm 3.4 \times 10^{38}$. There are $2^n$ representable values for signed n-bit integers and approximately $4.2 \times 10^9$ for FP-32. Floating point can represent a large variety of distributions, which is useful for neural network computation, but this comes at a larger computational cost compared to integer values. For integers to be used to represent weight matrices and activations, an FP scale factor is often used, hence many quantization approaches involve a hybrid of mostly integer formats with FP-32 scaling factors. This approach is often referred to as mixed precision (MP), and different MP strategies have been used to avoid overflow during training and/or inference of low-resolution networks given the limited range of integer formats.
In practice, this often requires storing hidden layer outputs in full precision (or at least represented with more bits than the lower-resolution copies). The main forward pass and backpropagation are carried out with the lower-resolution copies, converting back to the full-precision stored “accumulators” for the gradient updates.
In the extreme case where binary weights (-1, 1) or 2-bit ternary weights (-1, 0, 1) are
used in fully-connected or convolutional layers, multiplications are not used, only additions
and subtractions. For binary activations, bitwise operations are used (Rastegari et al.,
2016) and therefore addition is not used. For example, Rastegari et al. (2016) proposed
XNOR-Networks, where binary operations are used in a network made up of xnor gates
which approximate convolutions leading to 58 times speedup and 32 times memory savings.
consider not adapting the min–max ranges online and clipping outlying values that may occur as a result, in order to drastically reduce the min–max range. They find SoTA speedups for CNN parallelism, achieving a 50× speedup over baselines on 96 GPUs.
Gupta et al. (2015) show that stochastic rounding techniques are important for FP-16 DNNs to converge and maintain test accuracy compared to their FP-32 counterpart models. In stochastic rounding the weight x is rounded down to the nearest fixed-point value $\lfloor x \rfloor$ with probability $1 - (x - \lfloor x \rfloor)/\epsilon$, where $\epsilon$ is the smallest positive number representable in the fixed-point format; otherwise x is rounded to $\lfloor x \rfloor + \epsilon$. Hence, if x is close to $\lfloor x \rfloor$ then the probability of being assigned $\lfloor x \rfloor$ is higher. Wang et al. (2018b) train DNNs with FP-8 while using FP-16 chunk-based accumulations with the aforementioned stochastic rounding hardware.
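A small sketch of stochastic rounding onto a fixed-point grid with spacing `eps` is given below; it is illustrative, not a hardware implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x, eps):
    """Round x down to the grid with probability 1 - (x - floor)/eps, otherwise up."""
    low = np.floor(x / eps) * eps
    p_up = (x - low) / eps                    # closer to the upper grid point -> round up more often
    return low + eps * (rng.random(x.shape) < p_up)

x = np.array([0.12, 0.37, -0.81])
print(stochastic_round(x, eps=0.25))          # unbiased in expectation: E[round(x)] == x
```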
The necessity of stochastic rounding, and other requirements such as loss scaling, has been avoided using customized formats such as brain floating point (BFP; Kalamkar et al., 2019), which uses FP-16 with the same number of exponent bits as FP-32. Cambier et al. (2020) recently proposed a shifted and squeezed 8-bit FP format (S2FP-8) which also avoids the need for stochastic rounding and loss scaling, while providing dynamic ranges for gradients, weights and activations. Unlike other related 8-bit techniques (Mellempudi et al., 2019), the first and last layers do not need to be in FP-32 format, although the accumulator converts the outputs to FP-32.
Park et al. (2018a) exploit the fact that most of the weight and activation values are concentrated in a narrow region, while the larger values outside this region can be represented with higher precision. The distribution is demonstrated in Figure 19, which displays the weight distribution of the 2nd layer of the LeNet CNN. Instead of using the linear quantization shown in (c), a smaller bit interval is used for the region where most values lie (d), leading to smaller quantization errors.
They propose 3-bit activations for retraining quantized ResNet and Inception CNN architectures. For inference on the retrained low-precision network, weights are also quantized to 4 bits, with 1% of the network consisting of 16-bit scaling-factor scalars, achieving accuracy within 1% of the original network. This was also shown to be effective for LSTM language models, achieving similar perplexities for bitwidths of 2, 3 and 4.
Migacz (2017) uses relative entropy to measure the loss of information between two encodings and aims to minimize the KL divergence between activation output values. For each layer they store histograms of activations, generate quantized distributions with different saturation thresholds and choose the threshold that minimizes the KL divergence between the original distribution and the quantized distribution.
Banner et al. (2018) analyze the tradeoff between quantization noise and clipping distortion and derive an expression for the mean-squared-error degradation due to clipping. Optimizing this results in clipping values that improve accuracy by 40% over standard 4-bit integer quantization of VGG16-BN.
Another approach is to use scaling factors per group of weights (e.g channels in the case
of CNNs or internal gates in LSTMs) as opposed to whole layers, particularly useful when
the variance in weight distribution between the weight groupings is relatively high.
$$\text{Forward:}\quad r_o = \frac{1}{2^n - 1}\,\text{round}\big((2^n - 1)\, r_i\big) \qquad (73)$$
$$\text{Backward:}\quad \frac{\partial \mathcal{L}}{\partial r_i} = \frac{\partial \mathcal{L}}{\partial r_o} \qquad (74)$$
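In PyTorch, this rounding forward pass with a straight-through backward pass can be sketched as follows (assuming inputs in [0, 1]).

```python
import torch

class FakeQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, r_i, n_bits):
        levels = 2 ** n_bits - 1
        return torch.round(r_i * levels) / levels      # Equation 73

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                        # Equation 74: identity gradient

x = torch.rand(4, requires_grad=True)
y = FakeQuantize.apply(x, 3)
y.sum().backward()                                      # x.grad is all ones
```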
Figure 20: Quantized Knowledge Distillation (original source: (Zhou et al., 2017))
To compute the integer dot product of $r_o$ with another n-bit vector, they use Equation 75, with a computational complexity of $O(MK)$, directly proportional to the bitwidths of x and y. Furthermore, bitwise kernels can also be used for faster training and inference.
$$x \cdot y = \sum_{m=0}^{M-1}\sum_{k=0}^{K-1} 2^{m+k}\, \text{bitcount}\big[\text{and}\big(c_m(x), c_k(y)\big)\big] \qquad (75)$$
$$c_m(x)_i,\; c_k(y)_i \in \{0, 1\} \quad \forall\, i, m, k \qquad (76)$$
Figure 21: Mixed Precision Training (original source: Micikevicius et al. (2017))

training. It has also been observed that activations are more sensitive to quantization than weights (Zhou et al., 2016).
Micikevicius et al. (2017) use half-precision (16-bit) floating point accuracy to represent
weights, activations and gradients, without losing model accuracy or having to modify
hyperparameters, almost halving the memory requirements. They round a single-precision
copy of the weights for forward and backward passes after performing gradient-updates,
use loss-scaling to preserve small magnitude gradient values and perform half-precision
computation that accumulates into single-precision outputs before storing again as half-
precision in memory.
Figure 21 illustrates MPT, where the forward and backward passes are performed with
FP-16 precision copies. Once the backward pass is performed the computed FP-16 gradients
are used to update the original FP-32 precision master weight. After training, the quantized
weights are used for inference along with quantized activation units. This can be used in any
type of layer, convolutional or fully-connected.
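A condensed, illustrative sketch of the FP-32 master-weight scheme with loss scaling is shown below for a toy element-wise model; in practice this is handled by library support such as automatic mixed precision, and the scale value, learning rate and model are assumptions.

```python
import torch

w_master = torch.randn(32)      # FP-32 "master" copy of the weights, used for the update
loss_scale = 1024.0
lr = 0.01

def mixed_precision_step(x, y):
    global w_master
    # forward/backward run on an FP-16 working copy of the master weights
    w16 = w_master.half().requires_grad_(True)
    pred = (x.half() * w16).sum(dim=1)            # toy FP-16 "model"
    loss = ((pred.float() - y) ** 2).mean()       # accumulate the loss in FP-32
    # scale the loss so small FP-16 gradients do not underflow to zero
    (loss * loss_scale).backward()
    grad = w16.grad.float() / loss_scale          # unscale before the update
    w_master = w_master - lr * grad               # update applied to the FP-32 master copy
    return loss.item()

loss = mixed_precision_step(torch.randn(8, 32), torch.randn(8))
```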
Others have focused solely on quantizing weights, keeping the activations at FP-32 (Li et al., 2016a; Zhu et al., 2016). During gradient descent, Zhu et al. (2016) learn both the quantized ternary weight values and the assignment of each weight to one of these values, represented in a codebook.
Das et al. (2018) propose using Integer Fused-Multiply-and-Accumulate (FMA) operations
to accumulate results of multiplied INT-16 values into INT-32 outputs and use dynamic
fixed point scheme to use in tensor operations. This involves the use of a shared tensor-wide
exponent and down-conversion on the maximum value of an output tensor at each given
training iteration using stochastic, nearest and biased rounding. They also deal with overflow
by proposing a scheme that accumulates INT-32 intermediate results to FP-32 and can trade
off between precision and length of the accumulate chain to improve accuracy on the image
classification tasks. They argue that previously reported results on mixed-precision integer training use non-SoTA architectures and less difficult image tasks, and hence they also report their technique on SoTA architectures for the ImageNet-1K dataset.
ResNet network both from scratch, (2) a trained full-precision network is transferred to train a low-precision network from scratch and (3) a trained full-precision network guides a smaller, randomly initialized full-precision student network which gradually becomes lower precision throughout training. They find that (2) converges faster when supervised by an already trained network and that (3) outperforms (1), setting what was at the time the SoTA for ResNet classifiers at ternary and 4-bit precision.
Lin et al. (2017b) replace FP-32 convolutions with multiple binary convolutions with
various scaling factors for each convolution, overall resulting in a large range.
Zhou et al. (2016) and Choi et al. (2018) have both reported that the first and last
convolutional layers are most sensitive to quantization and hence many works have avoided
quantization on such layers. However, Choi et al. (2018) find that if the quantization is not
very low (e.g 8-bit integers) then these layers are expressive enough to maintain accuracy.
Zhou et al. (2017) overcome this problem by iteratively quantizing the network instead of quantizing the whole model at once. During the retraining of an FP-32 model, each
layer is iteratively quantized over consecutive epochs. They also consider using supervision
from a teacher network to learn a smaller quantized student network, combining knowledge
distillation with quantization for further reductions.
Equation 77 shows the proximal Newton update step, where $w_l^t$ is the weight update at iteration t for layer l and D is an approximation to the diagonal of the Hessian, which is already available as the second moment of the adaptive moment estimation (Adam) optimizer. In the t-th iteration of the proximal Newton update, the loss ℓ with respect to the binarized version of the weights is expressed in terms of the second-order Taylor expansion of $\ell(w_t)$, using a diagonal approximation $H_{t-1}$ of the Hessian, which estimates the Hessian at $w_{t-1}$. As with the second-order approximations discussed in Section 3.3.1, the Hessian is essential since ℓ is often flat in some directions but highly curved in others.
$$\min_{\hat{W}_l}\; a_l\, L_p\big(W_l, \hat{W}_l\big) + E\big(W_l, \hat{W}_l\big) \quad \text{s.t.}\;\; \hat{W}_l \in \{a_l c_k \,|\, 1 \le k \le K\},\; 1 \le l \le L \qquad (78)$$
where $L_p$ is the loss difference between the quantized and the original model, $\|L(W_l) - L(\hat{W}_l)\|$, $E$ is the reconstruction error between the quantized and original weights, $\|W_l - \hat{W}_l\|^2$, $a_l$ is a regularization coefficient for the l-th layer, $c_k$ is an integer and $K$ is the number of weight centroids.
Value-aware quantization Park et al. (2018a), like prior work mentioned above, reduce precision by narrowing the dynamic range to the region where most of the weight values concentrate. In contrast to other work, they assign higher precision to the outliers rather than mapping them to the extremum of the reduced range. This small difference allows 3-bit activations to be used in ResNet-152 and DenseNet-201, leading to 41.6% and 53.7% reductions in network size respectively.
Differentiable Soft Quantization Gong et al. (2019) propose differentiable soft quantization (DSQ), which learns clipping ranges in the forward pass and approximates gradients in the backward pass. To approximate the derivative of a binary quantization function, they propose a differentiable asymptotic (i.e., smooth) function which is closer to the quantization function than it is to a full-precision tanh function, and therefore results in less degradation in accuracy when converted to the binary quantization function post-training.
For multi-bit uniform quantization, given the bit width b and a floating-point activation/weight x lying in the range (l, u), the complete quantize–dequantize process of uniform quantization can be defined as $Q_U(x) = \text{round}(x/\Delta)\,\Delta$, where the original range (l, u) is divided into $2^b - 1$ intervals $\mathcal{P}_i$, $i \in (0, 1, \ldots, 2^b - 1)$, and $\Delta = \frac{u-l}{2^b - 1}$ is the interval length.
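The uniform quantize–dequantize step itself is a one-liner; the sketch below offsets by the lower bound l, which is a minor generalization of the formula above.

```python
import numpy as np

def uniform_quantize(x, l, u, b):
    """Quantize x in the range (l, u) onto 2^b - 1 uniform intervals, then dequantize."""
    delta = (u - l) / (2 ** b - 1)
    x = np.clip(x, l, u)
    return l + np.round((x - l) / delta) * delta

x = np.random.uniform(-1, 1, size=5)
print(uniform_quantize(x, l=-1.0, u=1.0, b=3))
```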
The DSQ function, shown in Equation 79, handles the point x depending on which interval $\mathcal{P}_i$ it lies in. DSQ can be viewed as aligning the data with the quantization values with minimal quantization error, since the bit spacing is chosen to reflect the weight and activation distributions. Figure 23b shows the DSQ curve without [-1, 1] scaling, noting that standard quantization is near-perfectly approximated when the largest value on the curve bounded by +1 is small.

Figure 23: original source: Gong et al. (2019), panels (a) and (b).

They introduce a characteristic variable $\alpha := 1 - \tanh(0.5 k \Delta) = 1 - s$ and, given that
$$\Delta = \frac{u - l}{2^b - 1} \qquad (81)$$
$$\varphi(0.5\Delta) = 1 \;\Rightarrow\; k = \frac{1}{\Delta}\log\big(2/\alpha - 1\big) \qquad (82)$$
DSQ can be used as a piecewise uniform quantizer, and when only one interval is used it is equivalent to using DSQ for binarization.
Soft-to-hard vector quantization Agustsson et al. (2017) propose to compress both the feature representations and the model by gradually transitioning from soft to hard quantization during retraining; the approach is end-to-end differentiable. They jointly learn the quantization levels with the weights and show that vector quantization can be improved over scalar quantization.
$$c \leftarrow c - \eta\, \frac{1}{|J_c|}\sum_{(k,l)\in J_c} \frac{\partial \mathcal{L}}{\partial b_{kl}} \qquad \text{where } J_c = \{(k, l) \mid c[I_{kl}] = c\} \qquad (85)$$
where $\mathcal{L}$ is the loss function, $I_{kl}$ is an index for the (k, l) subvector and $\eta > 0$ is the codebook learning rate. This adapts the upper layers to the drift appearing in their inputs, reducing the impact of the quantization approximation on the overall performance.
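A sketch of this codebook update, averaging the gradients of the subvectors assigned to each centroid, could look as follows (NumPy, with toy shapes).

```python
import numpy as np

def update_codebook(codebook, assignments, grad_b, lr=1e-3):
    """Equation 85: move each centroid c by the average gradient of the
    subvectors b_kl currently assigned to it."""
    for c_idx in range(codebook.shape[0]):
        members = np.flatnonzero(assignments == c_idx)
        if members.size > 0:
            codebook[c_idx] -= lr * grad_b[members].mean(axis=0)
    return codebook

# toy example: 16 codewords of dimension 8, 200 quantized subvectors
codebook = np.random.randn(16, 8)
assignments = np.random.randint(0, 16, size=200)       # index I_kl for each subvector
grad_b = np.random.randn(200, 8)                        # dL/db_kl from backpropagation
codebook = update_codebook(codebook, assignments, grad_b)
```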
Quantization-Aware Training Instead of iPQ, Jacob et al. (2018) use a straight-through estimator (STE; Bengio et al., 2013) to backpropagate through quantized weights and activations of convolutional layers during training. Figure 24 shows the 8-bit weights and activations, while the accumulator is represented as a 32-bit integer.
Figure 24: original source (Jacob et al., 2018): Integer-arithmetic-only quantization

They also note that in order to have a challenging architecture to compress, experiments should move towards compressing architectures which already have a minimal number of parameters and perform relatively well compared to much larger predecessor architectures, e.g. EfficientNet, SqueezeNet and ShuffleNet.
Quantization Noise Fan et al. (2020) argue that both iPQ and QAT are less suitable for very low precision such as INT4, ternary and binary. They instead propose to randomly simulate quantization noise on a subset of the network and only perform backward passes on the remaining weights of the network. Essentially this is a combination of DropConnect (with the Bernoulli function replaced by a quantization noise function) and a straight-through estimator used to backpropagate through the sample of subvectors chosen for quantization in a given mini-batch.
Estimating quantization noise by randomly sampling blocks of weights to be quantized allows the model to become robust to very low precision quantization without the noise being too severe, as is the case with previous quantization-aware training (Jacob et al., 2018). The authors show that this iterative quantization approach allows large compression rates in comparison to QAT while staying close to the uncompressed model in performance (within a few perplexity points for language modelling and a few accuracy points for image classification). They reach SoTA compression–accuracy tradeoffs for language modelling (compressing Transformers such as RoBERTa on WikiText) and image classification (compressing EfficientNet-B3 by 80% on ImageNet).
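The core idea — quantizing only a random subset of weight blocks in each forward pass, with a straight-through gradient for the quantized blocks — can be sketched as below; the block shape, noise rate and stand-in quantizer are illustrative assumptions.

```python
import torch

def quant_noise(weight, quantize_fn, p=0.1, block_size=8):
    """Apply quantize_fn to a random fraction p of contiguous weight blocks,
    keeping a straight-through gradient for the quantized blocks."""
    out_features, in_features = weight.shape
    n_blocks = in_features // block_size
    # choose which blocks of each row receive (simulated) quantization noise
    mask = (torch.rand(out_features, n_blocks) < p).repeat_interleave(block_size, dim=1)
    q = quantize_fn(weight)
    # forward uses quantized values on the masked blocks;
    # the detached delta makes the backward pass an identity (straight-through)
    return weight + (mask * (q - weight)).detach()

w = torch.randn(16, 32, requires_grad=True)
int8_sim = lambda t: torch.round(t * 127) / 127          # stand-in quantizer
w_noisy = quant_noise(w, int8_sim, p=0.25)
w_noisy.sum().backward()                                  # gradients flow to all of w
```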
Hessian-Based Quantization The precision and order (by layer) of quantization have been chosen using second-order information from the Hessian (Dong et al., 2019). They show, on already relatively small CNNs (ResNet-20, Inception-V3, SqueezeNext), that Hessian-aware quantization (HAWQ) training leads to SoTA compression on CIFAR-10 and ImageNet with a compression ratio of 8, and in some cases exceeds the accuracy of the original unquantized network.
Similarly, Shen et al. (2019) quantize transformer-based models such as BERT with mixed precision by also using second-order information from the Hessian matrix. They show that each layer exhibits a varying amount of information and use a sensitivity measure based on the mean and variance of the top eigenvalues. They visualize the loss landscape as the two most dominant eigenvectors of the Hessian are perturbed and suggest that layers showing a smoother curvature can undergo lower bit precision. On the MNLI and CoNLL datasets, upper layers closer to the output show flatter curvature in comparison to lower layers. From this observation, they are motivated to perform a group-wise quantization scheme whereby blocks of a matrix are quantized by different amounts with unique quantization ranges and look-up tables. A Hessian-based mixed-precision scheme is then used to decide which blocks of each matrix are assigned the corresponding low-bit precisions of varying ranges. They analyse the differences found when quantizing different parts of the self-attention block (the self-attention matrices and fully-connected feedforward layers) and their inputs (embeddings), and find that the highest compression ratios can be attributed to most of the parameters residing in the self-attention blocks.
7. Summary
The above sections have provided descriptions of old and new compression methods and
techniques. We finish by providing general recommendations for this field and future research
directions that I deem to be important in the coming years.
7.1 Recommendations
Old Baselines May Still Be Competitive Evidently, there has been an extensive
amount of work in pruning, quantization, knowledge distillation and combinations of the
aforementioned for neural networks. We note that many of these approaches, particularly
pruning, were proposed in decades past (Cleary and Witten, 1984; Mozer and Smolensky,
1989; Hassibi et al., 1994; LeCun et al., 1990; Whitley et al., 1990; Karnin, 1990; Reed, 1993;
Fahlman and Lebiere, 1990). The current trend of deep neural networks growing ever larger
means that keeping track of new innovations on the topic of reducing network size becomes
increasingly important. Therefore, we suggest that comparisons of past and present compression techniques should be standardized across models, datasets and evaluation metrics so that these comparisons are direct. Ideally this would be carried out using the same
libraries in the same language (e.g PyTorch or Tensorflow in Python) to further minimize
any implementation differences that naturally occur.
7.2 Future Research Directions
The field of neural network compression has seen a resurgence in activity given the growing
size of state of the art of models that are pushing the boundaries of hardware and practitioners
resources. However, compression techniques are still in a relatively early stage of development.
Below, I discuss a few research directions I think are worth exploring for the future of model
compression.
Few-Shot Knowledge Distillation In cases where large pretrained models are required for a set of target tasks (or a single one) with only a few samples, knowledge distillation can be used to distil the knowledge of the teacher specifically for that transfer domain. The advantage of doing so is that we benefit from the transferability of the teacher network while also distilling these large feature sets into a smaller network.
Further Theoretical Analysis Recent work has aided in our understanding of general-
ization in deep neural networks (Neyshabur et al., 2018; Wei et al., 2018; Nakkiran et al.,
2019; Belkin et al., 2019a; Derezinski et al., 2019; Saxe et al., 2019) and proposed measures
for tracking generalization performance while training DNNs. Further theoretical analysis of
compression generalization is a worthwhile endeavour considering the growing importance
and usage of compressing already trained neural networks. This is distinctly different from training models from random initialization and requires a new generalization paradigm to understand how compression works for each type (i.e., pruning, quantization, etc.).
References
Prem Raj Adhikari and Jaakko Hollmen. Multiresolution mixture modeling using merging of mixture
components. In Asian Conference on Machine Learning, pages 17–32, 2012.
Angeline Aguinaldo, Ping-Yeh Chiang, Alex Gain, Ameya Patil, Kolten Pearson, and Soheil Feizi.
Compressing gans using knowledge distillation. arXiv preprint arXiv:1902.00159, 2019.
Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca
Benini, and Luc V Gool. Soft-to-hard vector quantization for end-to-end learning compressible
representations. In Advances in Neural Information Processing Systems, pages 1141–1151, 2017.
Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul,
Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient
descent. In Advances in neural information processing systems, pages 3981–3989, 2016.
Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural
networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3):32, 2017.
Anubhav Ashok, Nicholas Rhinehart, Fares Beainy, and Kris M Kitani. N2n learning: Network to
network compression via policy gradient reinforcement learning. arXiv preprint arXiv:1709.06030,
2017.
Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in neural information
processing systems, pages 2654–2662, 2014.
Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Trellis networks for sequence modeling. arXiv
preprint arXiv:1810.06682, 2018.
Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in Neural
Information Processing Systems, pages 688–699, 2019.
Ron Banner, Yury Nahshan, Elad Hoffer, and Daniel Soudry. Aciq: Analytical clipping for integer
quantization of neural networks. openreview.net, 2018.
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning
practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences,
116(32):15849–15854, 2019a.
Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. arXiv
preprint arXiv:1903.07571, 2019b.
Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training
very sparse deep networks. arXiv preprint arXiv:1711.05136, 2017.
Yoshua Bengio, Nicholas Leonard, and Aaron Courville. Estimating or propagating gradients through
stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel
Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler,
Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott
Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya
Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
Charles G Broyden. A class of methods for solving nonlinear simultaneous equations. Mathematics
of computation, 19(92):577–593, 1965.
Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings
of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining,
pages 535–541. ACM, 2006.
Andres Buzo, A Gray, R Gray, and John Markel. Speech coding based upon vector quantization.
IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(5):562–574, 1980.
Leopold Cambier, Anahita Bhiwandiwalla, Ting Gong, Mehran Nekuii, Oguz H Elibol, and Hanlin
Tang. Shifted and squeezed 8-bit floating point format for low-precision training of deep neural
networks. arXiv preprint arXiv:2001.05674, 2020.
Erick Cantu-Paz. Pruning neural networks with distribution estimation algorithms. In Genetic and
Evolutionary Computation Conference, pages 790–800. Springer, 2003.
Carlos M Carvalho, Nicholas G Polson, and James G Scott. The horseshoe estimator for sparse
signals. Biometrika, 97(2):465–480, 2010.
Giovanna Castellano, Anna Maria Fanelli, and Marcello Pelillo. An iterative pruning algorithm for
feedforward neural networks. IEEE transactions on Neural networks, 8(3):519–531, 1997.
Yevgen Chebotar and Austin Waters. Distilling knowledge from ensembles of neural networks for
speech recognition. In Interspeech, pages 3439–3443, 2016.
Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big
self-supervised models are strong semi-supervised learners, 2020.
Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing
neural networks with the hashing trick. In International conference on machine learning, pages
2285–2294, 2015.
Yen-Chun Chen, Zhe Gan, Yu Cheng, Jingzhou Liu, and Jingjing Liu. Distilling the knowledge of
bert for text generation. arXiv preprint arXiv:1911.03829, 2019.
Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In Proceedings of
the IEEE International Conference on Computer Vision, pages 4794–4802, 2019.
Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan,
and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks.
arXiv preprint arXiv:1805.06085, 2018.
Francois Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
John Cleary and Ian Witten. Data compression using adaptive coding and partial string matching.
IEEE transactions on Communications, 32(4):396–402, 1984.
Yaim Cooper. The loss landscape of overparameterized neural networks. arXiv preprint
arXiv:1804.10200, 2018.
Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural networks with
low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.
Jia Cui, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, Tom Sercu, Kartik Audhkhasi,
Abhinav Sethy, Markus Nussbaum-Thom, and Andrew Rosenberg. Knowledge distillation across
ensembles of multilingual models for low-resource languages. In 2017 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pages 4825–4829. IEEE, 2017.
Raj Dabre and Atsushi Fujita. Recurrent stacking of layers for compact neural machine translation
models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages
6292–6299, 2019.
Bin Dai, Chen Zhu, and David Wipf. Compressing neural networks using the variational information
bottleneck. arXiv preprint arXiv:1802.10399, 2018.
Xiaoliang Dai, Hongxu Yin, and Niraj K Jha. Nest: A neural network synthesis tool based on a
grow-and-prune paradigm. IEEE Transactions on Computers, 68(10):1487–1497, 2019.
William Dally. High-performance hardware for machine learning. NIPS Tutorial, 2015.
Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha,
Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas,
et al. Mixed precision training of convolutional neural networks using integer operations. arXiv
preprint arXiv:1802.00930, 2018.
Lieven De Lathauwer. Decompositions of a higher-order tensor in block terms—part ii: Definitions
and uniqueness. SIAM Journal on Matrix Analysis and Applications, 30(3):1033–1066, 2008.
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal
transformers. arXiv preprint arXiv:1807.03819, 2018.
Michal Derezinski, Feynman Liang, and Michael W Mahoney. Exact expressions for double descent
and implicit regularization via surrogate random design. arXiv preprint arXiv:1912.04533, 2019.
Tim Dettmers. 8-bit approximations for parallelism in deep learning. arXiv preprint arXiv:1511.04561,
2015.
Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing
performance. arXiv preprint arXiv:1907.04840, 2019.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise
optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 4857–4867,
2017.
Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Hawq: Hessian aware
quantization of neural networks with mixed-precision. In Proceedings of the IEEE International
Conference on Computer Vision, pages 293–302, 2019.
Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes
over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
David Eigen, Jason Rolfe, Rob Fergus, and Yann LeCun. Understanding deep architectures using a
recursive convolutional network. arXiv preprint arXiv:1312.1847, 2013.
Erich Elsen, Marat Dukhan, Trevor Gale, and Karen Simonyan. Fast sparse convnets, 2019.
Andries Petrus Engelbrecht. A new pruning heuristic based on variance analysis of sensitivity
information. IEEE transactions on Neural Networks, 12(6):1386–1399, 2001.
Scott E Fahlman and Christian Lebiere. The cascade-correlation learning architecture. In Advances
in neural information processing systems, pages 524–532, 1990.
Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Remi Gribonval, Herve Jegou, and
Armand Joulin. Training with quantization noise for extreme model compression, 2020.
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable
neural networks. arXiv preprint arXiv:1803.03635, 2018.
Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. The lottery ticket
hypothesis at scale. arXiv preprint arXiv:1903.01611, 8, 2019.
Markus Freitag, Yaser Al-Onaizan, and Baskaran Sankaran. Ensemble distillation for neural machine
translation. arXiv preprint arXiv:1702.01802, 2017.
Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of
pattern recognition unaffected by shift in position. Biological cybernetics, 36(4):193–202, 1980.
Adam Gaier and David Ha. Weight agnostic neural networks. In Advances in Neural Information
Processing Systems, pages 5364–5378, 2019.
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional
sequence to sequence learning. In Proceedings of the 34th International Conference on Machine
Learning-Volume 70, pages 1243–1252. JMLR. org, 2017.
David E Goldberg and Kalyanmoy Deb. A comparative analysis of selection schemes used in genetic
algorithms. In Foundations of genetic algorithms, volume 1, pages 69–93. Elsevier, 1991.
Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu, and
Junjie Yan. Differentiable soft quantization: Bridging full-precision and low-bit neural networks.
In Proceedings of the IEEE International Conference on Computer Vision, pages 4852–4861, 2019.
Ian J Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number
recognition from street view imagery using deep convolutional neural networks. arXiv preprint
arXiv:1312.6082, 2013.
Qiushan Guo, Zhipeng Yu, Yichao Wu, Ding Liang, Haoyu Qin, and Junjie Yan. Dynamic recur-
sive neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 5147–5156, 2019a.
Yong Guo, Yin Zheng, Mingkui Tan, Qi Chen, Jian Chen, Peilin Zhao, and Junzhou Huang. Nat:
Neural architecture transformer for accurate and compact architectures. In Advances in Neural
Information Processing Systems, pages 735–747, 2019b.
Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with
limited numerical precision. In International Conference on Machine Learning, pages 1737–1746,
2015.
Philipp Gysel, Jon Pimentel, Mohammad Motamedi, and Soheil Ghiasi. Ristretto: A framework for
empirical study of resource-efficient inference in convolutional neural networks. IEEE transactions
on neural networks and learning systems, 29(11):5784–5789, 2018.
Masafumi Hagiwara. Removal of hidden units and weights for back propagation networks. In
Proceedings of 1993 International Conference on Neural Networks (IJCNN-93-Nagoya, Japan),
volume 1, pages 351–354. IEEE, 1993.
Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness:
Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):
217–288, 2011.
Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi
Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In
Advances in neural information processing systems, pages 8527–8537, 2018.
Hong-Gui Han and Jun-Fei Qiao. A structure optimisation algorithm for feedforward neural network
construction. Neurocomputing, 99:347–357, 2013.
Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks
with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain
surgeon. In Advances in neural information processing systems, pages 164–171, 1993.
Babak Hassibi, David G Stork, and Gregory Wolff. Optimal brain surgeon: Extensions and per-
formance comparisons. In Advances in neural information processing systems, pages 263–270,
1994.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 770–778, 2016a.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual
networks. In European conference on computer vision, pages 630–645. Springer, 2016b.
Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model
compression and acceleration on mobile devices. In Proceedings of the European Conference on
Computer Vision (ECCV), pages 784–800, 2018.
Srinidhi Hegde, Ranjitha Prasad, Ramya Hebbalaguppe, and Vishwajith Kumar. Variational student:
Learning compact and sparser networks in knowledge distillation framework. arXiv preprint
arXiv:1910.12061, 2019.
Byeongho Heo, Minsik Lee, Sangdoo Yun, and Jin Young Choi. Knowledge transfer via distillation
of activation boundaries formed by hidden neurons. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 33, pages 3779–3787, 2019.
Geoffrey Hinton, Nitsh Srivastava, and Kevin Swersky. Neural networks for machine learning.
Coursera, video lectures, 264:1, 2012a.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv
preprint arXiv:1503.02531, 2015.
Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov.
Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint
arXiv:1207.0580, 2012b.
Frank L Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of
Mathematics and Physics, 6(1-4):164–189, 1927.
Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural computation, 9(8):
1735–1780, 1997.
Lu Hou, Quanming Yao, and James T Kwok. Loss-aware binarization of deep networks. arXiv
preprint arXiv:1611.01600, 2016.
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand,
Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for
mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. Convolutional neural network architectures
for matching natural language sentences. In Advances in neural information processing systems,
pages 2042–2050, 2014.
Yiming Hu, Siyang Sun, Jianquan Li, Xingang Wang, and Qingyi Gu. A novel channel pruning
method for deep neural network compression. arXiv preprint arXiv:1805.11394, 2018.
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected
convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 4700–4708, 2017.
Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt
Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size.
arXiv preprint arXiv:1602.07360, 2016.
Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A
loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.
Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig
Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient
integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 2704–2713, 2018.
Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks
with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search.
IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu.
Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351,
2019.
Thomas Kailath. Linear systems, volume 156. Prentice-Hall Englewood Cliffs, NJ, 1980.
Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth
Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, et al. A
study of bfloat16 for deep learning training. arXiv preprint arXiv:1905.12322, 2019.
Ehud D Karnin. A simple procedure for pruning back-propagation trained neural networks. IEEE
transactions on neural networks, 1(2):239–242, 1990.
James Kennedy and Russell Eberhart. Particle swarm optimization. In Proceedings of ICNN’95-
International Conference on Neural Networks, volume 4, pages 1942–1948. IEEE, 1995.
Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image
super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 1637–1645, 2016.
Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882,
2014.
Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. arXiv preprint
arXiv:1606.07947, 2016.
Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local repa-
rameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583,
2015.
Okan Köpüklü, Maryam Babaee, Stefan Hörmann, and Gerhard Rigoll. Convolutional neural networks
with layer reuse. In 2019 IEEE International Conference on Image Processing (ICIP), pages
345–349. IEEE, 2019.
Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A
whitepaper. arXiv preprint arXiv:1806.08342, 2018.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolu-
tional neural networks. In Advances in neural information processing systems, pages 1097–1105,
2012.
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading
comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.
Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint
arXiv:1610.02242, 2016.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu
Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint
arXiv:1909.11942, 2019.
Philippe Lauret, Eric Fock, and Thierry Alex Mara. A node pruning algorithm based on a fourier
amplitude sensitivity test method. IEEE transactions on neural networks, 17(2):273–293, 2006.
Vadim Lebedev and Victor Lempitsky. Fast convnets using group-wise brain damage. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2554–2564, 2016.
Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural
information processing systems, pages 598–605, 1990.
Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning
based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018.
Asriel U Levin, Todd K Leen, and John E Moody. Fast pruning using principal components. In
Advances in neural information processing systems, pages 35–42, 1994.
Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711,
2016a.
Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for
efficient convnets. arXiv preprint arXiv:1608.08710, 2016b.
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape
of neural nets. In Advances in Neural Information Processing Systems, pages 6389–6399, 2018.
Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from
noisy labels with distillation. In Proceedings of the IEEE International Conference on Computer
Vision, pages 1910–1918, 2017.
Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization of deep convolu-
tional networks. In International Conference on Machine Learning, pages 2849–2858, 2016.
Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In Advances in Neural
Information Processing Systems, pages 2181–2191, 2017a.
Shaohui Lin, Rongrong Ji, Yuchao Li, Yongjian Wu, Feiyue Huang, and Baochang Zhang. Accelerating
convolutional networks via global & dynamic filter pruning. In IJCAI, pages 2425–2432, 2018.
Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang Ye, Feiyue Huang,
and David Doermann. Towards optimal structured cnn pruning via generative adversarial learning.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
2790–2799, 2019.
Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In
Advances in Neural Information Processing Systems, pages 345–353, 2017b.
Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv
preprint arXiv:1806.09055, 2018a.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Improving multi-task deep neu-
ral networks via knowledge distillation for natural language understanding. arXiv preprint
arXiv:1904.09482, 2019a.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining
approach. arXiv preprint arXiv:1907.11692, 2019b.
Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of
network pruning. arXiv preprint arXiv:1810.05270, 2018b.
Stuart Lloyd. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):
129–137, 1982.
Raphael Gontijo Lopes, Stefano Fenu, and Thad Starner. Data-free knowledge distillation for deep
neural networks. arXiv preprint arXiv:1710.07535, 2017.
Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In
Advances in Neural Information Processing Systems, pages 3288–3298, 2017.
Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture optimization.
In Advances in neural information processing systems, pages 7816–7827, 2018.
Arun Mallya and Svetlana Lazebnik. Piggyback: Adding multiple tasks to a single, fixed network by
learning to mask. arXiv preprint arXiv:1801.06519, 6(8), 2018.
Naveen Mellempudi, Sudarshan Srinivasan, Dipankar Das, and Bharat Kaul. Mixed precision training
with 8-bit floating point. arXiv preprint arXiv:1905.12334, 2019.
Paul Merolla, Rathinakumar Appuswamy, John Arthur, Steve K Esser, and Dharmendra Modha.
Deep neural networks are robust to weight binarization and other non-linear distortions. arXiv
preprint arXiv:1606.01981, 2016.
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia,
Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision
training. arXiv preprint arXiv:1710.03740, 2017.
Szymon Migacz. 8-bit inference with tensorrt. In GPU technology conference, volume 2, page 5, 2017.
Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, and Hassan Ghasemzadeh. Improved knowledge
distillation via teacher assistant: Bridging the gap between student and teacher. arXiv preprint
arXiv:1902.03393, 2019.
Asit Mishra and Debbie Marr. Apprentice: Using knowledge distillation techniques to improve
low-precision network accuracy. arXiv preprint arXiv:1711.05852, 2017.
Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook, and Debbie Marr. Wrpn: wide reduced-precision
networks. arXiv preprint arXiv:1709.01134, 2017.
Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and
Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity
inspired by network science. Nature communications, 9(1):1–12, 2018.
Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural
networks. arXiv preprint arXiv:1701.05369, 2017.
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional
neural networks for resource efficient transfer learning. arXiv preprint arXiv:1611.06440, 3, 2016.
Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks
by dynamic sparse reparameterization. arXiv preprint arXiv:1902.05967, 2019.
Michael C Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a
network via relevance assessment. In Advances in neural information processing systems, pages
107–115, 1989.
Rafael Muller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? In
Advances in Neural Information Processing Systems, pages 4696–4705, 2019.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep
double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292, 2019.
Pramod L Narasimha, Walter H Delashmit, Michael T Manry, Jiang Li, and Francisco Maldonado.
An integrated growing-pruning method for feedforward network training. Neurocomputing, 71
(13-15):2831–2847, 2008.
James O’ Neill, Greg Ver Steeg, and Aram Galstyan. Compressing deep neural networks via layer
fusion, 2020.
Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards
understanding the role of over-parametrization in generalization of neural networks. arXiv preprint
arXiv:1805.12076, 2018.
Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural
networks. In Advances in neural information processing systems, pages 442–450, 2015.
Steven J Nowlan and Geoffrey E Hinton. Simplifying neural networks by soft weight-sharing. Neural
computation, 4(4):473–493, 1992.
Asaf Noy, Niv Nayman, Tal Ridnik, Nadav Zamir, Sivan Doveh, Itamar Friedman, Raja Giryes, and
Lihi Zelnik-Manor. Asap: Architecture search, anneal and prune. arXiv preprint arXiv:1904.04123,
2019.
Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):
2295–2317, 2011.
Daniel S. Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, and Quoc V.
Le. Improved noisy student training for automatic speech recognition, 2020.
Eunhyeok Park, Sungjoo Yoo, and Peter Vajda. Value-aware quantization for training and inference
of neural networks. In Proceedings of the European Conference on Computer Vision (ECCV),
pages 580–595, 2018a.
Sungrae Park, JunKeon Park, Su-Jin Shin, and Il-Chul Moon. Adversarial dropout for supervised
and semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018b.
Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3967–3976, 2019.
Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and
Dustin Tran. Image transformer. arXiv preprint arXiv:1802.05751, 2018.
Mary Phuong and Christoph Lampert. Towards understanding knowledge distillation. In International
Conference on Machine Learning, pages 5142–5151, 2019.
Bryan A Plummer, Nikoli Dryden, Julius Frost, Torsten Hoefler, and Kate Saenko. Shapeshifter
networks: Cross-layer parameter sharing for scalable and effective deep learning. arXiv preprint
arXiv:2006.10598, 2020.
Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quanti-
zation. arXiv preprint arXiv:1802.05668, 2018.
Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep
convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations
toward training trillion parameter models. arXiv preprint arXiv:1910.02054, 2019.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions
for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari.
What’s hidden in a randomly weighted neural network? In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 11893–11902, 2020.
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet
classification using binary convolutional neural networks. In European conference on computer
vision, pages 525–542. Springer, 2016.
Russell Reed. Pruning algorithms-a survey. IEEE transactions on Neural Networks, 4(5):740–747,
1993.
Roberto Rigamonti, Amos Sironi, Vincent Lepetit, and Pascal Fua. Learning separable filters. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2754–2761,
2013.
Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and
Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
C Rosset. Turing-nlg: A 17-billion-parameter language model by microsoft. Microsoft Blog, 2019.
David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations
by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive
Science, 1985.
Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran.
Low-rank matrix factorization for deep neural network training with high-dimensional output
targets. In 2013 IEEE international conference on acoustics, speech and signal processing, pages
6655–6659. IEEE, 2013.
Victor Sanh. Smaller, faster, cheaper, lighter: Introducing distilbert, a distilled version of bert, 2019.
URL https://medium.com/huggingface/distilbert-8cf3380435b5.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of
bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
Bharat Bhusan Sau and Vineeth N Balasubramanian. Deep model compression: Distilling knowledge
from noisy teachers. arXiv preprint arXiv:1610.09650, 2016.
Pedro Savarese and Michael Maire. Learning implicitly recurrent cnns through parameter sharing.
arXiv preprint arXiv:1902.09701, 2019.
Andrew M Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan D
Tracey, and David D Cox. On the information bottleneck theory of deep learning. Journal of
Statistical Mechanics: Theory and Experiment, 2019(12):124020, 2019.
Benjamin Scellier and Yoshua Bengio. Equilibrium propagation: Bridging the gap between energy-
based models and backpropagation. Frontiers in computational neuroscience, 11:24, 2017.
Jurgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn:
the meta-meta-... hook. PhD thesis, Technische Universitat Munchen, 1987.
Rudy Setiono and Wee Kheng Leow. Pruned neural networks for regression. In Pacific Rim
International Conference on Artificial Intelligence, pages 500–509. Springer, 2000.
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney,
and Kurt Keutzer. Q-bert: Hessian based ultra low precision quantization of bert. arXiv preprint
arXiv:1909.05840, 2019.
Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, and SVN Vishwanathan.
Hash kernels for structured data. Journal of Machine Learning Research, 10(Nov):2615–2637,
2009.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan
Catanzaro. Megatron-lm: Training multi-billion parameter language models using gpu model
parallelism. arXiv preprint arXiv:1909.08053, 2019.
Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information.
arXiv preprint arXiv:1703.00810, 2017.
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller.
Deterministic policy gradient algorithms. In Journal of Machine Learning Research, 2014.
Xuemeng Song, Fuli Feng, Xianjing Han, Xin Yang, Wei Liu, and Liqiang Nie. Neural compatibility
modeling with attentive knowledge distillation. In The 41st International ACM SIGIR Conference
on Research & Development in Information Retrieval, pages 5–14, 2018.
Pierre Stock, Armand Joulin, Remi Gribonval, Benjamin Graham, and Herve Jegou. And the bit
goes down: Revisiting the quantization of neural networks. arXiv preprint arXiv:1907.05686, 2019.
Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for bert model
compression. arXiv preprint arXiv:1908.09355, 2019.
Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient
methods for reinforcement learning with function approximation. In Advances in neural information
processing systems, pages 1057–1063, 2000.
Ying Tai, Jian Yang, and Xiaoming Liu. Image super-resolution via deep recursive residual network.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3147–3155,
2017.
Mingxing Tan and Quoc V. Le. Efficientnet: Improving accuracy and efficiency
through automl and model scaling, 2019a. URL https://ai.googleblog.com/2019/05/
efficientnet-improving-accuracy-and.html.
Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural
networks. arXiv preprint arXiv:1905.11946, 2019b.
Hidenori Tanaka, Daniel Kunin, Daniel LK Yamins, and Surya Ganguli. Pruning neural networks
without any data by iteratively conserving synaptic flow. arXiv preprint arXiv:2006.05467, 2020.
Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency
targets improve semi-supervised deep learning results. In Advances in neural information processing
systems, pages 1195–1204, 2017.
Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszar. Faster gaze prediction with
dense networks and fisher pruning. arXiv preprint arXiv:1801.05787, 2018.
Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. arXiv
preprint arXiv:1910.10699, 2019.
Juanjuan Tu, Yongzhao Zhan, and Fei Han. A neural network pruning method optimized with
pso algorithm. In 2010 Second International Conference on Computer Modeling and Simulation,
volume 3, pages 257–259. IEEE, 2010.
Ledyard R Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):
279–311, 1966.
Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In Proceedings of the
IEEE International Conference on Computer Vision, pages 1365–1374, 2019.
Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression.
arXiv preprint arXiv:1702.04008, 2017.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information
Processing Systems, pages 5998–6008, 2017.
Christopher A Walsh. Peter Huttenlocher (1931–2013), 2013.
Weishui Wan, Shingo Mabu, Kaoru Shimada, Kotaro Hirasawa, and Jinglu Hu. Enhancing the
generalization ability of neural networks through controlling the hidden layers. Applied Soft
Computing, 9(1):404–414, 2009.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue:
A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint
arXiv:1804.07461, 2018a.
Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. Training
deep neural networks with 8-bit floating point numbers. In Advances in neural information
processing systems, pages 7675–7684, 2018b.
Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong Qi. Kdgan: Knowledge distillation with generative
adversarial networks. In Advances in Neural Information Processing Systems, pages 775–786,
2018c.
Colin Wei, Jason Lee, Qiang Liu, and Tengyu Ma. On the margin theory of feedforward neural
networks. OpenReview, 2018.
Andreas S Weigend, David E Rumelhart, and Bernardo A Huberman. Generalization by weight-
elimination with application to forecasting. In Advances in neural information processing systems,
pages 875–882, 1991.
Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature
hashing for large scale multitask learning. In Proceedings of the 26th annual international conference
on machine learning, pages 1113–1120, 2009.
Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in
deep neural networks. In Advances in neural information processing systems, pages 2074–2082,
2016.
Darrell Whitley, Timothy Starkweather, and Christopher Bogart. Genetic algorithms and neural
networks: Optimizing connections and connectivity. Parallel computing, 14(3):347–361, 1990.
Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, and Tongran Liu. Sharing attention weights for
fast transformer. arXiv preprint arXiv:1906.11024, 2019.
Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual
transformations for deep neural networks. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1492–1500, 2017.
Jian Xue, Jinyu Li, and Yifan Gong. Restructuring of deep neural network acoustic models with
singular value decomposition. In Interspeech, pages 2365–2369, 2013.
Jian Xue, Jinyu Li, Dong Yu, Mike Seltzer, and Yifan Gong. Singular value decomposition based
low-footprint speaker adaptation and personalization for deep neural network. In 2014 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6359–6363,
2014.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le.
Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural
information processing systems, pages 5754–5764, 2019.
Amir Yazdanbakhsh, Ahmed T Elthakeb, Prannoy Pilligundla, FatemehSadat Mireshghallah, and
Hadi Esmaeilzadeh. Releq: An automatic reinforcement learning approach for deep quantization
of neural networks. arXiv preprint arXiv:1811.01704, 2018.
Jinmian Ye, Linnan Wang, Guangxi Li, Di Chen, Shandian Zhe, Xinqi Chu, and Zenglin Xu. Learning
compact recurrent neural networks with block-term tensor decomposition. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 9378–9387, 2018.
Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Yingyan Lin,
Zhangyang Wang, and Richard G Baraniuk. Drawing early-bird tickets: Towards more efficient
training of deep networks. arXiv preprint arXiv:1909.11957, 2019.
Rose Yu, Stephan Zheng, Anima Anandkumar, and Yisong Yue. Long-term forecasting using
tensor-train rnns. Arxiv, 2017a.
Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao,
Ching-Yung Lin, and Larry S Davis. Nisp: Pruning networks using neuron importance score
propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 9194–9203, 2018.
Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low
rank and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 7370–7379, 2017b.
Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the perfor-
mance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928,
2016a.
Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146,
2016b.
Dejiao Zhang, Haozhu Wang, Mario Figueiredo, and Laura Balzano. Learning to share: Simultaneous
parameter tying and sparsification in deep learning. OpenReview.net, 2018a.
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient
convolutional neural network for mobile devices. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 6848–6856, 2018b.
Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for
image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 2472–2481, 2018c.
Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization:
Towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044, 2017.
Aojun Zhou, Anbang Yao, Kuan Wang, and Yurong Chen. Explicit loss-error-aware quantization
for low-bit deep neural networks. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 9426–9435, 2018.
Chunting Zhou, Graham Neubig, and Jiatao Gu. Understanding knowledge distillation in non-
autoregressive machine translation. arXiv preprint arXiv:1911.02727, 2019a.
Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros,
signs, and the supermask. In Advances in Neural Information Processing Systems, pages 3597–3607,
2019b.
Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Train-
ing low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint
arXiv:1606.06160, 2016.
Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv
preprint arXiv:1612.01064, 2016.
Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for
model compression. arXiv preprint arXiv:1710.01878, 2017.
Xiatian Zhu, Shaogang Gong, et al. Knowledge distillation by on-the-fly native ensemble. In Advances
in neural information processing systems, pages 7517–7527, 2018.
Appendices
Appendix A. Low Resource and Efficient CNN Architectures
A.0.1 MobileNet
Howard et al. (2017) propose compressing convolutional neural networks for embedded and
mobile vision applications using depthwise separable convolutions (DSCs), together with two
hyperparameters that trade off latency and accuracy. DSCs factorize a standard convolution
into a depthwise convolution and a 1 × 1 pointwise convolution: each input channel is filtered
by its own depthwise filter, and a pointwise 1 × 1 convolution then combines the depthwise
outputs. Unlike standard convolutions, DSCs split the convolution into two steps, first
filtering and then combining the outputs of each depthwise filter, which is why this is referred
to as a factorization approach.
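As a minimal sketch of this factorization (the class name, batch-norm placement and layer sizes below are illustrative choices, not MobileNet's exact implementation), a depthwise separable block can be written in PyTorch as follows:

```python
import torch.nn as nn

# Illustrative depthwise separable convolution block (hypothetical names).
class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Depthwise step: one 3x3 filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        # Pointwise step: a 1x1 convolution combines the depthwise outputs.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))  # filter each channel
        x = self.relu(self.bn2(self.pointwise(x)))  # combine channels
        return x
```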
Experiments on ImageNet image classification demonstrated that these smaller networks
can achieve accuracies similar to those of much larger networks.
A.0.2 SqueezeNet
Iandola et al. (2016) shrink the network by replacing many 3 × 3 filters with 1 × 1 filters,
reducing the number of input channels to the remaining 3 × 3 filters using squeeze layers,
and downsampling late in the network so that information is not bottlenecked too early,
which in turn leads to better performance. A fire module consists of a squeeze layer
(1 × 1 convolutions) feeding into an expand layer that mixes 1 × 1 and 3 × 3 convolution
filters, and the number of filters per fire module increases towards the final layers.
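A minimal sketch of such a fire module is given below; the class name and channel counts are illustrative, and the original paper tunes the squeeze and expand widths per module:

```python
import torch
import torch.nn as nn

# Illustrative SqueezeNet-style fire module (hypothetical names and sizes).
class FireModule(nn.Module):
    def __init__(self, in_channels, squeeze_channels, expand_channels):
        super().__init__()
        # Squeeze layer: 1x1 convolutions reduce the number of input channels.
        self.squeeze = nn.Conv2d(in_channels, squeeze_channels, kernel_size=1)
        # Expand layer: a mix of 1x1 and 3x3 filters, concatenated on channels.
        self.expand1x1 = nn.Conv2d(squeeze_channels, expand_channels,
                                   kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_channels, expand_channels,
                                   kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)
```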
With these architectural design decisions, SqueezeNet matches AlexNet accuracy with a
network 50 times smaller and even outperforms layer decomposition and pruning approaches
to deep compression. When combined with INT8 quantization, SqueezeNet yields a 0.66 MB
model, 363 times smaller than 32-bit AlexNet, while still maintaining performance.
A.0.3 ShuffleNet
ShuffleNet (Zhang et al., 2018b) uses pointwise group convolutions (Krizhevsky et al., 2012),
i.e. applying a different set of convolution filter groups to the same input features, which
also allows for model parallelization, together with channel shuffles that help information
flow across feature channels, to reduce compute while maintaining accuracy. ShuffleNet is
built from economical 3 × 3 depthwise convolution filters, and each 1 × 1 layer is replaced
with a pointwise group convolution followed by a channel shuffle. Unlike its predecessors
(Xie et al., 2017; Chollet, 2017), ShuffleNet remains efficient for small networks: the authors
report large improvements on ImageNet classification and MS COCO object detection at a
budget of 40 MFLOPs, and a roughly 13 times speedup over AlexNet without sacrificing much accuracy.
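The channel shuffle itself is a simple reshape-and-transpose; a minimal sketch under the assumption of a fixed group count (the function name and `groups` argument are illustrative) is:

```python
import torch

# Illustrative channel shuffle between grouped 1x1 convolutions.
def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    batch, channels, height, width = x.size()
    channels_per_group = channels // groups
    # Reshape to (batch, groups, channels_per_group, H, W), swap the two
    # channel axes, then flatten back so channels are interleaved across groups.
    x = x.view(batch, groups, channels_per_group, height, width)
    x = x.transpose(1, 2).contiguous()
    return x.view(batch, channels, height, width)

# Usage sketch: shuffled = channel_shuffle(feature_map, groups=3)
```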
A.0.4 DenseNet
Gradients can vanish in very deep networks because the error becomes more difficult to
backpropagate as the number of matrix multiplications increases. DenseNets (Huang et al.,
2017) address vanishing gradients by connecting the feature maps of all preceding layers to
the inputs of each subsequent layer, similar to ResNet skip connections. This reuse of
features makes the network efficient in its use of parameters. Although deep and thin
DenseNets are parameter efficient, they trade off memory and speed efficiency compared to
shallower yet wider networks (Zagoruyko and Komodakis, 2016b), because all layer outputs
need to be stored to perform backpropagation. However, DenseNets too can be made wider
and shallower to become more memory efficient if required.
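A minimal sketch of this dense connectivity pattern, where each layer consumes the concatenation of all earlier feature maps, is given below; the class name, growth rate and layer count are illustrative assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn

# Illustrative dense block: layer i sees all previous feature maps.
class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # reuse all earlier maps
            features.append(out)
        return torch.cat(features, dim=1)
```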
Transformer Architecture Search Most neural architecture search (NAS) methods learn
to apply modules in the network with no regard for the computational cost of adding them,
such as neural architecture optimization (Luo et al., 2018), which uses an encoder-decoder
model to reconstruct an architecture from a continuous space. Guo et al. (2019b) instead
propose to learn a transformer architecture while minimizing the computational burden,
avoiding modules with a large number of parameters where necessary. However, solving such
a problem exactly is NP-hard. They therefore treat the optimization problem as a Markov
Decision Process (MDP) and optimize policies over the different architectures using
reinforcement learning. The resulting architectures replace redundant transformations
with more efficient ones, such as skip connections, or remove connections altogether.