
Pattern Recognition 115 (2021) 107899


Pruning by explaining: A novel criterion for deep neural network pruning
Seul-Ki Yeom a,i, Philipp Seegerer a,h, Sebastian Lapuschkin c, Alexander Binder d,e,
Simon Wiedemann c, Klaus-Robert Müller a,f,g,b,∗, Wojciech Samek c,b,∗
a Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany
b BIFOLD – Berlin Institute for the Foundations of Learning and Data, Berlin, Germany
c Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute, 10587 Berlin, Germany
d ISTD Pillar, Singapore University of Technology and Design, Singapore 487372, Singapore
e Department of Informatics, University of Oslo, 0373 Oslo, Norway
f Department of Artificial Intelligence, Korea University, Seoul 136–713, Korea
g Max Planck Institut für Informatik, 66123 Saarbrücken, Germany
h Aignostics GmbH, 10557 Berlin, Germany
i Nota AI GmbH, 10117 Berlin, Germany

∗ Corresponding authors. E-mail addresses: [email protected] (S.-K. Yeom), [email protected] (P. Seegerer), [email protected] (S. Lapuschkin), [email protected] (A. Binder), [email protected] (S. Wiedemann), [email protected] (K.-R. Müller), [email protected] (W. Samek).

Article info

Article history: Received 18 December 2019; Revised 28 January 2021; Accepted 8 February 2021; Available online 22 February 2021

Keywords: Pruning; Layer-wise relevance propagation (LRP); Convolutional neural network (CNN); Interpretation of models; Explainable AI (XAI)

Abstract

The success of convolutional neural networks (CNNs) in various applications is accompanied by a significant increase in computation and parameter storage costs. Recent efforts to reduce these overheads involve pruning and compressing the weights of various layers while at the same time aiming to not sacrifice performance. In this paper, we propose a novel criterion for CNN pruning inspired by neural network interpretability: The most relevant units, i.e. weights or filters, are automatically found using their relevance scores obtained from concepts of explainable AI (XAI). By exploring this idea, we connect the lines of interpretability and model compression research. We show that our proposed method can efficiently prune CNN models in transfer-learning setups in which networks pre-trained on large corpora are adapted to specialized tasks. The method is evaluated on a broad range of computer vision datasets. Notably, our novel criterion is not only competitive or better compared to state-of-the-art pruning criteria when successive retraining is performed, but clearly outperforms these previous criteria in the resource-constrained application scenario in which the data of the task to be transferred to is very scarce and one chooses to refrain from fine-tuning. Our method is able to compress the model iteratively while maintaining or even improving accuracy. At the same time, it has a computational cost in the order of gradient computation and is comparatively simple to apply without the need for tuning hyperparameters for pruning.

© 2021 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
https://doi.org/10.1016/j.patcog.2021.107899

1. Introduction

Deep CNNs have become an indispensable tool for a wide range of applications [1], such as image classification, speech recognition, natural language processing, chemistry, neuroscience, medicine, and are even applied to playing games such as Go, poker or Super Smash Bros. They have achieved high predictive performance, at times even outperforming humans. Furthermore, in specialized domains where limited training data is available, e.g., due to the cost and difficulty of data generation (medical imaging from fMRI, EEG, PET etc.), transfer learning can improve the CNN performance by extracting the knowledge from the source tasks and applying it to a target task which has limited training data.

However, the high predictive performance of CNNs often comes at the expense of high storage and computational costs, which are related to the energy expenditure of the fine-tuned network. These deep architectures are composed of millions of parameters to be trained, leading to overparameterization (i.e. having more parameters than training samples) of the model [2]. The run-times are typically dominated by the evaluation of convolutional layers, while dense layers are cheap but memory-heavy [3].

For instance, the VGG-16 model has approximately 138 million parameters, taking up more than 500MB in storage space, and needs 15.5 billion floating-point operations (FLOPs) to classify a single image. ResNet50 has approx. 23 million parameters and needs 4.1 billion FLOPs. Note that overparameterization is helpful for an efficient and successful training of neural networks; however, once the trained and well generalizing network structure is established, pruning can help to reduce redundancy while still maintaining good performance [4].

Reducing a model's storage requirements and computational cost becomes critical for a broader applicability, e.g., in embedded systems, autonomous agents, mobile devices, or edge devices [5]. Neural network pruning has a decades-long history with interest from both academia and industry [6], aiming to eliminate the subset of network units (i.e. weights or filters) which is the least important w.r.t. the network's intended task. For network pruning, it is crucial to decide how to identify the "irrelevant" subset of the parameters meant for deletion. To address this issue, previous research has proposed specific criteria based on Taylor expansion, weight, gradient, and others, to reduce complexity and computation costs in the network. Related works are introduced in Section 2.

From a practical point of view, the full capacity (in terms of weights and filters) of an overparameterized model may not be required, e.g., when
(1) parts of the model lie dormant after training (i.e., are permanently "switched off"),
(2) a user is not interested in the model's full array of possible outputs, which is a common scenario in transfer learning (e.g. the user only has use for 2 out of 10 available network outputs), or
(3) a user lacks data and resources for fine-tuning and running the overparameterized model.
In these scenarios the redundant parts of the model will still occupy space in memory, and information will be propagated through those parts, consuming energy and increasing runtime. Thus, criteria able to stably and significantly reduce the computational complexity of deep neural networks across applications are relevant for practitioners.

In this paper, we propose a novel pruning framework based on Layer-wise Relevance Propagation (LRP) [7]. LRP was originally developed as an explanation method to assign importance scores, so called relevance, to the different input dimensions of a neural network that reflect the contribution of an input dimension to the model's decision, and has been applied to different fields of computer vision (e.g., [8–10]). The relevance is backpropagated from the output to the input and hereby assigned to each unit of the deep model. Since relevance scores are computed for every layer and neuron from the model output to the input, these relevance scores essentially reflect the importance of every single unit of a model and its contribution to the information flow through the network — a natural candidate to be used as pruning criterion. The LRP criterion can be motivated theoretically through the concept of Deep Taylor Decomposition (DTD) (c.f. [11–13]). Moreover, LRP is scalable and easy to apply, and has been implemented in software frameworks such as iNNvestigate [14]. Furthermore, it has linear computational cost in terms of network inference cost, similar to backpropagation.

We systematically evaluate the compression efficacy of the LRP criterion compared to common pruning criteria for two different scenarios.

Scenario 1: We prune pre-trained CNNs followed by subsequent fine-tuning. This is the usual setting in CNN pruning and requires a sufficient amount of data and computational power.

Scenario 2: In this scenario a pretrained model needs to be transferred to a related problem as well, but the data available for the new task is too scarce for a proper fine-tuning and/or the time consumption, computational power or energy consumption is constrained. Such transfer learning with restrictions is common in mobile or embedded applications.

Our experimental results on various benchmark datasets and four different popular CNN architectures show that the LRP criterion for pruning is more scalable and efficient, and leads to better performance than existing criteria regardless of data types and model architectures if retraining is performed (Scenario 1). Especially, if retraining is prohibited due to external constraints after pruning, the LRP criterion clearly outperforms previous criteria on all datasets (Scenario 2). Finally, we would like to note that our proposed pruning framework is not limited to LRP and image data, but can also be used with other explanation techniques and data types.

The rest of this paper is organized as follows: Section 2 summarizes related works for network compression and introduces the typical criteria for network pruning. Section 3 describes the framework and details of our approach. The experimental results are illustrated and discussed in Section 4, while our approach is discussed in relation to previous studies in Section 5. Section 6 gives conclusions and an outlook to future work.

2. Related work

We start the discussion of related research in the field of network compression with network quantization methods, which have been proposed for storage space compression by decreasing the number of possible and unique values for the parameters [15,16]. Tensor decomposition approaches decompose network matrices into several smaller ones to estimate the informative parameters of the deep CNNs with low-rank approximation/factorization [17]. More recently, [18] also propose a framework of architecture distillation based on layer-wise replacement, called LightweightNet, for memory and time saving. Algorithms for designing efficient models focus more on acceleration instead of compression by optimizing convolution operations or architectures directly (e.g. [19]).

Network pruning approaches remove redundant or irrelevant units — i.e., nodes, filters, or layers — from the model which are not critical for performance [6,20]. Network pruning is robust to various settings and gives reasonable compression rates while not (or minimally) hurting the model accuracy. Also it can support both training from scratch and transfer learning from pre-trained models. Early works have shown that network pruning is effective in reducing network complexity and simultaneously addressing over-fitting problems. Current network pruning techniques make weights or channels sparse by removing non-informative connections and require an appropriate criterion for identifying which units of the model are not relevant for solving a problem. Thus, it is crucial to decide how to quantify the relevance of the parameters (i.e., weights or channels) in the current state of the learning process for deletion without sacrificing predictive performance. In previous studies, pruning criteria have been proposed based on the magnitude of their 1) weights, 2) gradients, 3) Taylor expansion/derivative, and 4) other criteria, as described in the following section.

Taylor expansion: Early approaches towards neural network pruning — optimal brain damage [4] and optimal brain surgeon [21] — leveraged a second-order Taylor expansion based on the Hessian matrix of the loss function to select parameters for deletion. However, computing the inverse of the Hessian is computationally expensive. The work of [22,23] used a first-order Taylor expansion as a criterion to approximate the change of loss in the objective function as an effect of pruning away network units. We contrast our novel criterion to the computationally more comparable first-order Taylor expansion from [22].
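To make the comparison point concrete: the first-order Taylor criterion of [22] rates a unit by the absolute change of the loss that a linearization predicts when the unit's activation is removed. In the notation used later in Section 3, it can be sketched as follows (our paraphrase for orientation, not a formula quoted from [22]):

```latex
% First-order Taylor pruning criterion (sketch): the importance of unit i is the
% absolute change in the loss L predicted when its activation a_i is set to zero.
\Theta_{\mathrm{Taylor}}(a_i)
  \;=\; \left|\, L(a_i \to 0) - L(a_i) \,\right|
  \;\approx\; \left|\, \frac{\partial L}{\partial a_i}\, a_i \,\right|
```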


Gradient: Liu and Wu [24] proposed a hierarchical global pruning strategy by calculating the mean gradient of feature maps in each layer. They adopt a hierarchical global pruning strategy between the layers with similar sensitivity. Sun et al. [25] propose a sparsified back-propagation approach for neural network training using the magnitude of the gradient to find essential and non-essential features in Multi-Layer Perceptron (MLP) and Long Short-Term Memory Network (LSTM) models, which can be used for pruning. We implement the gradient-based pruning criterion after [25].

Weight: A recent trend is to prune redundant, non-informative weights in pre-trained CNN models, based on the magnitude of the weights themselves. Han et al. [26] and Han et al. [27] proposed the pruning of weights for which the magnitude is below a certain threshold, and to subsequently fine-tune with an l_p-norm regularization. This pruning strategy has been used on fully-connected layers and introduced sparse connections with BLAS libraries, supporting specialized hardware to achieve its acceleration. In the same context, Structured Sparsity Learning (SSL) added group sparsity regularization to penalize unimportant parameters by removing some weights [28]. Li et al. [29], against which we compare in our experiments, proposed a one-shot channel pruning method using the l_p-norm of weights for filter selection, provided that those channels with smaller weights always produce weaker activations.

Other criteria: [30] proposed the Neuron Importance Score Propagation (NISP) algorithm to propagate the importance scores of final responses before the softmax, classification layer in the network. The method is based on — in contrast to our proposed metric — a per-layer pruning process which does not consider global importance in the network. Luo et al. [31] proposed ThiNet, a data-driven statistical channel pruning technique based on the statistics computed from the next layer. Further hybrid approaches can be found in, e.g. [32], which suggests a fusion approach to combine weight-based channel pruning and network quantization. More recently, Dai et al. [33] proposed an evolutionary paradigm for weight-based pruning and gradient-based growing to reduce the network heuristically.

3. LRP-based network pruning

A feedforward CNN consists of neurons established in a sequence of multiple layers, where each neuron receives the input data from one or more previous layers and propagates its output to every neuron in the succeeding layers, using a potentially non-linear mapping. Network pruning aims to sparsify these units by eliminating weights or filters that are non-informative (according to a certain criterion). We specifically focus our experiments on transfer learning, where the parameters of a network pre-trained on a source domain are subsequently fine-tuned on a target domain, i.e., the final data or prediction task. Here, the general pruning procedure is outlined in Algorithm 1.

Algorithm 1 Neural Network Pruning
1: Input: model net, reference data xr, training data xt,
2:    pruning threshold t, pruning criterion c, pruning ratio r
3: while t not reached do
4:    // Step 1: assess network substructure importance
5:    for all layers in net do
6:       for all units in layer do
7:          compute importance of unit w.r.t. c (and xr)
8:       end for
9:       if required for c then
10:         globally regularize importance per unit
11:      end if
12:   end for
13:   // Step 2: remove least important units in groups of r
14:   remove r units from net where importance is minimal
15:   remove orphaned connections of each removed unit
16:   if desired then
17:      // Step 2.1: optional fine-tuning to recover performance
18:      fine-tune net on xt
19:   end if
20: end while
21: // return the pruned network upon hitting threshold t (e.g., model performance or size)
22: return net

Even though most approaches use an identical process, choosing a suitable pruning criterion to quantify the importance of model parameters for deletion while minimizing performance drop (Step 1) is of critical importance, governing the success of the approach.

3.1. Layer-wise relevance propagation

In this paper, we propose a novel criterion for pruning neural network units: the relevance quantity computed with LRP [7]. LRP decomposes a classification decision into proportionate contributions of each network unit to the overall classification score, called "relevances".

When computed for the input dimensions of a CNN and visualized as a heatmap, these relevances highlight parts of the input that are important for the classification decision. LRP thus originally served as a tool for interpreting non-linear learning machines and has been applied as such in various fields, amongst others for general image recognition, medical imaging and natural language processing, cf. [34]. The direct linkage of the relevances to the classifier output, as well as the conservativity constraint imposed on the propagation of relevance between layers, makes LRP not only attractive for model explaining, but lets it also naturally serve as pruning criterion (see Section 4.1).

The main characteristic of LRP is a backward pass through the network during which the network output is redistributed to all units of the network in a layer-by-layer fashion. This backward pass is structurally similar to gradient backpropagation and has therefore a similar runtime. The redistribution is based on a conservation principle such that the relevances can immediately be interpreted as the contribution that a unit makes to the network output, hence establishing a direct connection to the network output and thus its predictive performance. Therefore, as a pruning criterion, the method is efficient and easily scalable to generic network structures. Independent of the type of neural network layer — that is pooling, fully-connected, convolutional layers — LRP allows to quantify the importance of units throughout the network, given a global prediction context.

3.2. LRP-based pruning

The procedure of LRP-based pruning is summarized in Fig. 1. In the first phase, a standard forward pass is performed by the network and the activations at each layer are collected. In the second phase, the score f(x) obtained at the output of the network is propagated backwards through the network according to LRP propagation rules [7]. In the third phase, the current model is pruned by eliminating the irrelevant (w.r.t. the "relevance" quantity R obtained via LRP) units and is (optionally) further fine-tuned.
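The iterative procedure of Algorithm 1 can be summarized in a short Python sketch. The functions passed in (criterion, remove_units, threshold_reached, fine_tune) are hypothetical placeholders standing in for the criterion-specific scoring (for the proposed method, relevance aggregated per filter from one LRP backward pass), the structural surgery on the network, the stopping test, and the optional retraining step; this is not the released implementation.

```python
def prune_network(net, x_ref, x_train, criterion, remove_units, ratio,
                  threshold_reached, fine_tune=None):
    """Generic iterative pruning loop following Algorithm 1 (sketch).

    criterion(net, x_ref) is assumed to return a dict mapping each prunable
    unit (e.g., a filter identifier) to an importance score; remove_units()
    is assumed to delete the chosen units together with their orphaned
    connections in place.
    """
    while not threshold_reached(net):
        # Step 1: assess the importance of every unit w.r.t. the criterion
        importance = criterion(net, x_ref)
        # Step 2: remove the fraction `ratio` of least important units
        n_remove = max(1, int(len(importance) * ratio))
        victims = sorted(importance, key=importance.get)[:n_remove]
        remove_units(net, victims)
        # Step 2.1: optional fine-tuning to recover performance
        if fine_tune is not None:
            fine_tune(net, x_train)
    # the pruned network is returned once the threshold t (e.g., a target
    # model size or a performance bound) has been reached
    return net
```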


Fig. 1. Illustration of the LRP-based sequential process for pruning. A. Forward propagation of a given image (i.e. here, of a cat) through a pre-trained model. B. Evaluation of relevance for weights/filters using LRP. C. Iterative pruning by eliminating the least relevant units (depicted by circles) and fine-tuning if necessary. The units can be individual neurons, filters, or other arbitrary groupings of parameters, depending on the model architecture.

LRP is based on a layer-wise conservation principle that allows the propagated quantity (e.g. relevance for a predicted class) to be preserved between neurons of two adjacent layers. Let R_i^(l) be the relevance of neuron i at layer l and R_j^(l+1) be the relevance of neuron j at the next layer l+1. Stricter definitions of conservation that involve only subsets of neurons can further impose that relevance is locally redistributed in the lower layers, and we define R_{i←j}^(l) as the share of R_j^(l+1) that is redistributed to neuron i in the lower layer. The conservation property always satisfies

\[
\sum_i R_{i \leftarrow j}^{(l)} = R_j^{(l+1)}, \tag{1}
\]

where the sum runs over all neurons i of the (during inference) preceding layer l. When using relevance as a pruning criterion, this property helps to preserve its quantity layer-by-layer, regardless of hidden layer size and the number of iteratively pruned neurons for each layer. At each layer l, we can extract the global importance of node i as its attributed relevance R_i^(l).

In this paper, we specifically adopt relevance quantities computed with the LRP-α1β0-rule as pruning criterion. The LRP-αβ-rule was developed with feedforward-DNNs with ReLU activations in mind and assumes positive (pre-softmax) logit activations f_logit(x) > 0 for decomposition. The rule has been shown to work well in practice in such a setting [35]. This particular variant of LRP is tightly rooted in DTD [11], and other than the criteria based on network derivatives we compare against [22,25], it always produces continuous explanations, even if backpropagation is performed through the discontinuous (and commonly used) ReLU nonlinearity [12]. When used as a criterion for pruning, its assessment of network unit importance will change less abruptly with (small) changes in the choice of reference samples, compared to gradient-based criteria.

The propagation rule performs two separate relevance propagation steps per layer: one exclusively considering activatory parts of the forward propagated quantities (i.e., all a_i^(l) w_ij > 0) and another only processing the inhibitory parts (a_i^(l) w_ij < 0), which are subsequently merged in a sum with components weighted by α and β (s.t. α + β = 1) respectively. By selecting α = 1, the propagation rule simplifies to

\[
R_i^{(l)} = \sum_j \frac{\left(a_i^{(l)} w_{ij}\right)^{+}}{\sum_{i'} \left(a_{i'}^{(l)} w_{i'j}\right)^{+}} \, R_j^{(l+1)}, \tag{2}
\]

where R_i^(l) denotes the relevance attributed to the ith neuron at layer l, as an aggregation of downward-propagated relevance messages R_{i←j}^(l,l+1). The terms (·)^+ indicate the positive part of the forward propagated pre-activation from layer l to layer (l+1), and i' is a running index over all input activations a^(l). Note that a choice of α = 1 only decomposes w.r.t. the parts of the inference signal supporting the model decision for the class of interest.

Equation (2) is locally conservative, i.e., no quantity of relevance gets lost or injected during the distribution of R_j, where each term of the sum corresponds to a relevance message R_{j←k}. For this reason, LRP has the following technical advantages over other pruning techniques such as gradient-based or activation-based methods:

(1) Localized relevance conservation implicitly ensures layer-wise regularized global redistribution of importances from each network unit.

(2) By summing relevance within each (convolutional) filter channel, the LRP-based criterion is directly applicable as a measure of total relevance per node/filter, without requiring a post-hoc layer-wise renormalization, e.g., via l_p-norm.

(3) The use of relevance scores is not restricted to a global application of pruning but can be easily applied to locally and (neuron- or filter-)group-wise constrained pruning without regularization. Different strategies for selecting (sub-)parts of the model might still be considered, e.g., applying different weightings/priorities for pruning different parts of the model: should the aim of pruning be the reduction of FLOPs required during inference, one would prefer to focus on primarily pruning units of the convolutional layers. In case the aim is a reduction of the memory requirement, pruning should focus on the fully-connected layers instead.
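To make Eq. (2) concrete, a minimal α1β0 propagation step through a single dense layer can be written in a few lines of PyTorch. This is a didactic sketch under simplifying assumptions (no bias term, a small epsilon as the only stabilizer), not the LRP implementation used for the experiments:

```python
import torch

def lrp_alpha1beta0_dense(a, w, relevance_out, eps=1e-9):
    """One LRP-alpha1beta0 step through a dense layer, cf. Eq. (2).

    a:             input activations of the layer, shape (batch, n_in)
    w:             weight matrix, shape (n_out, n_in) as in torch.nn.Linear
    relevance_out: relevance of the layer outputs, shape (batch, n_out)
    returns:       relevance attributed to the layer inputs, shape (batch, n_in)
    """
    # positive part of the forward pre-activations a_i * w_ij
    z_pos = (a.unsqueeze(1) * w.unsqueeze(0)).clamp(min=0)   # (batch, n_out, n_in)
    denom = z_pos.sum(dim=2, keepdim=True) + eps             # sum over inputs i'
    # each output j redistributes its relevance proportionally to z_pos
    messages = z_pos / denom * relevance_out.unsqueeze(2)
    return messages.sum(dim=1)                               # aggregate per input i
```

For a convolutional layer, the per-filter pruning score used in this paper is then obtained by summing the relevance attributed to all activations within a filter channel.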


In the context of Algorithm 1, Step 1 of the LRP-based assessment of neuron and filter importance is performed as a single LRP backward pass through the model, with an aggregation of relevance per filter channel as described above for convolutional layers, and does not require additional normalization or regularization. We would like to point out that instead of backpropagating the model output f_c(x) for the true class c of any given sample x (as it is commonly done when LRP is used for explaining a prediction [7,8]), we initialize the algorithm with R_c^(L) = 1 at the output layer L. We thus gain robustness against the model's (in)confidence in its predictions on the previously unseen reference samples x and ensure an equal weighting of the influence of all reference samples in the identification of relevant neural pathways.

4. Experiments

We start by an attempt to intuitively illuminate the properties of different pruning criteria, namely weight magnitude, Taylor, gradient and LRP, via a series of toy datasets. We then show the effectiveness of the LRP criterion for pruning on widely-used image recognition benchmark datasets — i.e., the Scene 15 [36], Event 8 [37], Cats & Dogs [38], Oxford Flower 102 [39], CIFAR-10 (1), and ILSVRC 2012 [40] datasets — and four pre-trained feed-forward deep neural network architectures: AlexNet and VGG-16 with only a single sequence of layers, and ResNet-18 and ResNet-50 [41], which both contain multiple parallel branches of layers and skip connections.

The first scenario focuses specifically on pruning of pre-trained CNNs with subsequent fine-tuning, as it is common in pruning research [22]. We compare our method with several state-of-the-art criteria to demonstrate the effectiveness of LRP as a pruning criterion in CNNs.

In the second scenario, we tested whether the proposed pruning criterion also works well if only a very limited number of samples is available for pruning the model. This is relevant in case of devices with limited computational power, energy and storage such as mobile devices or embedded applications.

4.1. Pruning toy models

First, we systematically compare the properties and effectiveness of the different pruning criteria on several toy datasets in order to foster an intuition about the properties of all approaches, in a controllable and computationally inexpensive setting. To this end we evaluate all four criteria on different toy data distributions qualitatively and quantitatively. We generated three k-class toy datasets ("moon" (k = 2), "circle" (k = 2) and "multi" (k = 4)), using respective generator functions (2,3).

1 https://www.cs.toronto.edu/~kriz/cifar.html
2 https://scikit-learn.org/stable/datasets
3 https://github.com/seulkiyeom/LRP_Pruning_toy_example

Each generated 2D dataset consists of 1000 training samples per class. We constructed and trained the models as a sequence of three consecutive ReLU-activated dense layers with 1000 hidden neurons each. After the first linear layer, we have added a DropOut layer with a dropout probability of 50%. The model receives inputs from R^2 and has — depending on the toy problem set — k ∈ {2, 4} output neurons:

Dense(1000) → ReLU → DropOut(0.5) → Dense(1000) → ReLU → Dense(1000) → ReLU → Dense(k)

We then sample a number of new datapoints (unseen during training) for the computation of the pruning criteria. During pruning, we removed a fixed number of 1000 of the 3000 hidden neurons that have the least relevance for prediction according to each criterion. This is equivalent to removing 1000 learned (yet insignificant, according to the criterion) filters from the model. After pruning, we observed the changes in the decision boundaries and re-evaluated for classification accuracy using the original training samples and re-sampled datapoints across criteria. This experiment is performed with n ∈ [1, 2, 5, 10, 20, 50, 100, 200] reference samples for testing and the computation of pruning criteria. Each setting is repeated 50 times, using the same set of random seeds (depending on the repetition index) for each n across all pruning criteria to uphold comparability.

Fig. 2 shows the data distributions of the generated toy datasets, an exemplary set of n = 5 samples generated for criteria computation, as well as the qualitative impact on the models' decision boundary when removing a fixed set of 1000 neurons as selected via the compared criteria. Fig. 3 investigates how the pruning criteria preserve the models' problem solving capabilities as a function of the number of samples selected for computing the criteria. Fig. 4 then quantitatively summarizes the results for specific numbers of unseen samples (n ∈ [1, 5, 20, 100]) for computing the criteria. Here we report the model accuracy on the training set in order to relate the preservation of the decision function as learned from data between unpruned (2nd column) and pruned models and pruning criteria (remaining columns).

The results in Fig. 4 show that, among all criteria based on reference samples for the computation of relevance, the LRP-based measure consistently outperforms all other criteria in all reference set sizes and datasets. Only in the case of n = 1 reference sample per class, the weight criterion preserves the model the best. Note that using the weight magnitude as a measure of network unit importance is a static approach, independent from the choice of reference samples. Given n = 5 points of reference per class, the LRP-based criterion already outperforms also the weight magnitude as a criterion for pruning unimportant neural network structures, while successfully preserving the functional core of the predictor. Fig. 2 demonstrates how the toy models' decision boundaries change under the influence of pruning with all four criteria. We can observe that the weight criterion and LRP preserve the models' learned decision boundary well. Both the Taylor and gradient measures degrade the model significantly. Compared to weight- and LRP-based criteria, models pruned by gradient-based criteria misclassify a large part of samples.
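For reference, the toy setup described at the beginning of this subsection (a three-hidden-layer ReLU network on 2-D inputs, scored on a handful of reference samples) can be sketched in PyTorch as follows. The scoring shown is the simple gradient-times-activation (Taylor-style) stand-in; for the proposed criterion, this scoring step would be replaced by relevance from an LRP backward pass. Function names are illustrative and not taken from the released toy-example repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_toy_model(k: int) -> nn.Sequential:
    # Dense(1000) -> ReLU -> DropOut(0.5) -> Dense(1000) -> ReLU
    # -> Dense(1000) -> ReLU -> Dense(k), inputs from R^2
    return nn.Sequential(
        nn.Linear(2, 1000), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(1000, 1000), nn.ReLU(),
        nn.Linear(1000, 1000), nn.ReLU(),
        nn.Linear(1000, k),
    )

def hidden_unit_scores(model, x_ref, y_ref):
    """Importance of each of the 3000 hidden neurons, computed on a few
    reference samples via |activation * gradient| (a first-order stand-in)."""
    model.eval()
    acts, hooks = {}, []
    relu_ids = [i for i, m in enumerate(model) if isinstance(m, nn.ReLU)]
    for i in relu_ids:
        hooks.append(model[i].register_forward_hook(
            lambda mod, inp, out, i=i: acts.__setitem__(i, out)))
    loss = F.cross_entropy(model(x_ref), y_ref)
    for h in hooks:
        h.remove()
    grads = torch.autograd.grad(loss, [acts[i] for i in relu_ids])
    return torch.cat([(acts[i] * g).abs().mean(dim=0)
                      for i, g in zip(relu_ids, grads)])

# pruning step of Section 4.1: mark the 1000 lowest-scoring hidden neurons
# scores = hidden_unit_scores(model, x_ref, y_ref)
# prune_idx = torch.argsort(scores)[:1000]
```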


Fig. 2. Qualitative comparison of the impact of the pruning criteria on the decision function on three toy datasets. 1st column: scatter plot of the training data and decision
boundary of the trained model, 2nd column: data samples randomly selected for computing the pruning criteria, 3rd to 6th columns: changed decision boundaries after the
application of pruning w.r.t. different criteria.

Fig. 3. Pruning performance (accuracy) comparison of criteria depending on the number of reference samples per class used for criterion computation. 1st row: Model
evaluation on the training data. 2nd row: Model evaluation on an unseen test dataset with added Gaussian noise (N (0, 0.3 )), which has not been used for the computation
of pruning criteria. Columns: Results over different datasets. Solid lines show the average post-pruning performance of the models pruned w.r.t. to the evaluated criteria
weight (black), Taylor (blue), grad(ient) (green) and LRP (red) over 50 repetitions of the experiment. The dashed black line indicates the model’s evaluation performance
without pruning. Shaded areas around the lines show the standard deviation over the repetitions of experiments. Further results for noise levels N (0, 0.1 ) and N (0, 0.01 )
are available on github3 . (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

The first row of Fig. 3 shows that all (data dependent) measures benefit from increasing the number of reference points. LRP is able to find and preserve the functionally important network components with only very little data, while at the same time being considerably less sensitive to the choice of reference points than other metrics, visible in the measures' standard deviations. Both the gradient and Taylor-based measures do not reach the performance of LRP-based pruning, even with 200 reference samples for each class. The performance of pruning with the weight magnitude based measure is constant, as it does only depend on the learned weights itself. The bottom row of Fig. 3 shows the test performance of the pruned models as a function of the number of samples used for criteria computation. Here, we tested on 500 samples per class, drawn from the datasets' respective distributions and perturbed with additional Gaussian noise (N(0, 0.3)) added after data generation. Due to the large amounts of noise added to the data, we see the prediction performance of the pruned and unpruned models decrease in all settings. Here we can observe that two out of three times the LRP-pruned models outperform all other criteria. Only once, on the "moon" dataset, pruning based on the weight criterion yields a higher performance than the LRP-pruned model. Most remarkably though, only the models pruned with the LRP-based criterion exhibit prediction performance and behavior — measured in mean and standard deviation of accuracies over all 50 random seeds per n reference samples on the deliberately heavily noisy data — highly similar to the original and unpruned model, from only n = 5 reference samples per class on, on all datasets. This yields another strong indicator that LRP is, among the compared criteria, most capable at preserving the relevant core of the learned network function, and at dismissing unimportant parts of the model during pruning.

The strong results of LRP, and the partial similarity between the results on the training datasets between LRP and weight, raise the question where and how both metrics (and Taylor and gradient) deviate, as it can be expected that both metrics at least select highly overlapping sets of network units for pruning and preservation. We therefore investigate in all three toy settings — across the different numbers of reference samples and random seeds — the (dis)similarities and (in)consistencies in neuron selection and ranking by measuring the set similarities (S1 ∩ S2)/min(|S1|, |S2|) of the k neurons selected for pruning (ranked first) and preservation (ranked last) between and within criteria.
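The overlap measure used here has a direct implementation; a small sketch (the helper name is ours, not from the released code):

```python
def selection_overlap(s1, s2):
    """Set similarity |S1 ∩ S2| / min(|S1|, |S2|) between two selections of
    neuron indices, e.g. the first-k (pruned) or last-k (preserved) neurons
    of two criterion-induced rankings."""
    s1, s2 = set(s1), set(s2)
    return len(s1 & s2) / min(len(s1), len(s2))
```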


Fig. 4. Comparison of training accuracy after pruning one third of all filters w.r.t one of the four metrics on toy datasets, with n ∈ [1, 5, 20, 100] reference samples used for
criteria computation for Weight, Gradient, Taylor and LRP. The experiment is repeated 50 times. Note that the Weight criterion is not influenced by the number of reference
samples n. Compare to Supplementary Table 1.

Table 1
Similarity analysis of neuron selection between LRP and the other criteria, computed over 50 different random seeds. Higher values indicate higher similarity
in neuron selection of the first/last k neurons for pruning compared to LRP. Note that the table below reports results only for n = 10 reference samples for criteria
computation (Weight, Taylor, Gradient and LRP) and k = 250 and k = 1000. Similar observations have been made for n ∈ [1, 2, 5, 20, 50, 100, 200] and k ∈ [125, 500]
and can be found on github3 .

Dataset first-250 last-250 first-1000 last-1000

W T G L W T G L W T G L W T G L

moon 0.002 0.006 0.006 1.000 0.083 0.361 0.369 1.000 0.381 0.639 0.626 1.000 0.409 0.648 0.530 1.000
circle 0.033 0.096 0.096 1.000 0.086 0.389 0.405 1.000 0.424 0.670 0.627 1.000 0.409 0.623 0.580 1.000
mult 0.098 0.220 0.215 1.000 0.232 0.312 0.299 1.000 0.246 0.217 0.243 1.000 0.367 0.528 0.545 1.000

Table 2
A consistency comparison of neuron selection and ranking for network pruning with criteria (Weight, Taylor, Gradient and
LRP), averaged over all 1225 unique random seed combinations. Higher values indicate higher consistency in selecting the
same sets of neurons and generating neuron rankings for different sets of reference samples. We report results for n = 10
reference samples and k = 250. Observations for n ∈ [1, 2, 5, 20, 50, 100, 200] and k ∈ [125, 500, 1000] are available on github3 .

Dataset first-250 last-250 Spearman Correlation

W T G L W T G L W T G L

moon 1.000 0.920 0.918 0.946 1.000 0.508 0.685 0.926 1.000 0.072 0.146 0.152
circle 1.000 0.861 0.861 0.840 1.000 0.483 0.635 0.936 1.000 0.074 0.098 0.137
mult 1.000 0.827 0.829 0.786 1.000 0.463 0.755 0.941 1.000 0.080 0.131 0.155

Since the weight criterion is not influenced by the choice of reference samples for computation, it is expected that the resulting neuron order is perfectly consistent with itself in all settings (cf. Table 2). What is unexpected however, given the results in Fig. 3 and Fig. 4 indicating similar model behavior after pruning to be expected between LRP- and weight-based criteria, at least on the training data, is the minimal set overlap between LRP and weight, given the higher set similarities between LRP and the gradient and Taylor criteria, as shown in Table 1. Overall, the set overlap between the neurons ranked in the extremes of the orderings shows that LRP-derived pruning strategies have very little in common with the ones originating from the other criteria. This observation can also be made on more complex networks at hand of Fig. 7, as shown and discussed later in this section.

Table 2 reports the self-similarity in neuron selection in the extremes of the ranking across random seeds (and thus sets of reference samples), for all criteria and toy settings. While LRP yields a high consistency in neuron selection for both the pruning (first-k) and the preservation (last-k) of neural network units, both gradient and, more so, Taylor exhibit lower self-similarities. The lower consistency of both latter criteria in the model components ranked last (i.e., preserved in the model the longest during pruning) yields an explanation for the large variation in results observed earlier: although gradient and Taylor are highly consistent in the removal of neurons rated as irrelevant, their volatility in the preservation of neurons which constitute the functional core of the network after pruning yields dissimilarities in the resulting predictor function. The high consistency reported for LRP in terms of neuron sets selected for pruning and preservation, given the relatively low Spearman correlation coefficient, points out only minor local perturbations of the pruning order due to the selection of reference samples. We find a direct correspondence between the here reported (in)consistency of pruning behavior for the three data-dependent criteria and the "explanation continuity" observed for LRP [12] (and discontinuity observed for gradient and Taylor) in neural networks containing the commonly used ReLU activation function, which provides an explanation for the high pruning consistency obtained with LRP and the extreme volatility for gradient and Taylor.

A supplementary analysis of the neuron selection consistency of LRP over different counts of reference samples n, demonstrating the requirement of only very few reference samples per class in order to obtain stable pruning results, can be found in Supplementary Results 1.


Taken together, the results of Tables 1 and 2 and Supplementary Tables 1 and 2 elucidate that LRP constitutes — compared to the other methods — an orthogonal pruning criterion which is very consistent in its selection of (un)important neural network units, while remaining adaptive to the selection of reference samples for criterion computation. Especially the similarity in post-pruning model performance to the static weight criterion indicates that both metrics are able to find valid, yet completely different pruning solutions. However, since LRP can still benefit from the influence of reference samples, we will show in Section 4.2.2 that our proposed criterion is able to outperform not only weight, but all other criteria in Scenario 2, where pruning is used instead of fine-tuning as a means of domain adaptation. This will be discussed in the following sections.

4.2. Pruning deep image classifiers for large-scale benchmark data

We now evaluate the performance of all pruning criteria on VGG-16 and AlexNet as well as ResNet-18 and ResNet-50 — popular CNNs in compression research [42] — all of which are pre-trained on ILSVRC 2012 (ImageNet). VGG-16 consists of 13 convolutional layers with 4224 filters and 3 fully-connected layers, and AlexNet contains 5 convolutional layers with 1552 filters and 3 fully-connected layers. In the dense layers, there exist 4,096+4,096+k neurons (i.e. filters), respectively, where k is the number of output classes. In terms of complexity of the model, the pre-trained VGG-16 and AlexNet on ImageNet originally consist of 138.36/60.97 million parameters and 154.7/7.27 Giga Multiply-Accumulate Operations (GMACs) (as a measure of FLOPs), respectively. ResNet-18 and ResNet-50 consist of 20/53 convolutional layers with 4,800/26,560 filters. In terms of complexity of the model, the pre-trained ResNet-18 and ResNet-50 on ImageNet originally consist of 11.18/23.51 million parameters and 1.82/4.12 GMACs, respectively.

Furthermore, since the LRP scores are not implementation-invariant and depend on the LRP rules used for the batch normalization (BN) layers, we convert a trained ResNet into a canonized version, which yields the same predictions up to numerical errors. The canonization fuses a sequence of a convolution and a BN layer into a convolution layer with updated weights (4) and resets the BN layer to be the identity function. This removes the BN layer effectively by rewriting a sequence of two affine mappings into one updated affine mapping [43]. The second change replaced calls to torch.nn.functional methods and the summation in the residual connection by classes derived from torch.nn.Module, which then were wrapped by calls to torch.autograd.function to enable custom backward computations suitable for LRP rule computations.

4 See bnafterconv_overwrite_intoconv(conv,bn) in the file lrp_general6.py in https://github.com/AlexBinder/LRP_Pytorch_Resnets_Densenet.

Experiments are performed within the PyTorch and torchvision frameworks under an Intel(R) Xeon(R) CPU E5-2660 2.20GHz and an NVIDIA Tesla P100 with 12GB for GPU processing. We evaluated the criteria on six public datasets (Scene 15 [36], Event 8, Cats and Dogs [38], Oxford Flower 102 [39], CIFAR-10, and ILSVRC 2012 [40]). For more detail on the datasets and the pre-processing, see Supplementary Methods 1. Our complete experimental setup covering these datasets is publicly available at https://github.com/seulkiyeom/LRP_pruning.

In order to prepare the models for evaluation, we first fine-tuned the models for 200 epochs with constant learning rate 0.001 and batch size of 20. We used the Stochastic Gradient Descent (SGD) optimizer with momentum of 0.9. In addition, we also apply dropout to the fully-connected layers with probability of 0.5. Fine-tuning and pruning are performed on the training set, while results are evaluated on each test dataset. Throughout the experiments, we iteratively prune 5% of all the filters in the network by eliminating units including their input and output connections. In Scenario 1, we subsequently fine-tune and re-evaluate the model to account for dependency across parameters and regain performance, as it is common.

4.2.1. Scenario 1: Pruning with fine-tuning

In the first scenario, we retrain the model after each iteration of pruning in order to regain lost performance. We then evaluate the performance of the different pruning criteria after each pruning-retraining step.

That is, we quantify the importance of each filter by the magnitude of the respective criterion and iteratively prune 5% of all filters (w.r.t. the original number of filters in the model) rated least important in each pruning step. Then, we compute and record the training loss, test accuracy, number of remaining parameters and total estimated FLOPs. We assume that the least important filters should have only little influence on the prediction and thus incur the lowest performance drop if they are removed from the network.

Fig. 5 (and Supplementary Fig. 2) depict test accuracies with increasing pruning rate in VGG-16 and ResNet-50 (and AlexNet and ResNet-18, respectively) after fine-tuning for each dataset and each criterion. It is observed that LRP achieves higher test accuracies compared to other criteria in a large majority of cases (see Fig. 6 and Supplementary Fig. 1). These results demonstrate that the performance of LRP-based pruning is stable and independent of the chosen dataset.

Apart from performance, regularization by layer is a critical constraint which obstructs the expansion of some of the criteria toward several pruning strategies such as local pruning, global pruning, etc. Except for the LRP criterion, all criteria perform substantially worse without l_p regularization compared to those with l_p regularization and result in unexpected network interruptions during the pruning process due to the biased redistribution of importance in the network (cf. top rows of Fig. 5 and Supplementary Fig. 2).

Table 3 shows the predictive performance of the different criteria in terms of training loss, test accuracy, number of remaining parameters and FLOPs, for the VGG-16 and ResNet-50 models. Similar results for AlexNet and ResNet-18 can be found in Supplementary Table 2. Except for CIFAR-10, the highest compression rate (i.e., lowest number of parameters) could be achieved by the proposed LRP-based criterion (row "Params") for VGG-16, but not for ResNet-50. However, in terms of FLOPs, the proposed criterion only outperformed the weight criterion, but not the Taylor and gradient criteria (row "FLOPs"). This is due to the fact that a reduction in the number of FLOPs depends on the location where pruning is applied within the network: Fig. 7 shows that the LRP and weight criteria focus the pruning on upper layers closer to the model output, whereas the Taylor and gradient criteria focus more on the lower layers.

Throughout the pruning process usually a gradual decrease in performance can be observed. However, with the Event 8, Oxford Flower 102 and CIFAR-10 datasets, pruning leads to an initial performance increase, until a pruning rate of approx. 30% is reached. This behavior has been reported before in the literature and might stem from improvements of the model structure through elimination of filters related to classes in the source dataset (i.e., ILSVRC 2012) that are not present in the target dataset anymore [44].

Supplementary Table 3 and Supplementary Fig. 2 similarly show that LRP achieves the highest test accuracy in AlexNet and ResNet-18 for nearly all pruning ratios with almost every dataset.
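As an illustration of the canonization step described above (folding a convolution followed by batch normalization into a single convolution with updated weights, after which the BN layer can be reset to the identity), a minimal PyTorch sketch is given below. It follows the standard BN-folding identity and is not the bnafterconv_overwrite_intoconv function of the referenced repository.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a convolution computing bn(conv(x)) exactly (up to numerics)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # per output channel
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused
```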


Fig. 5. Comparison of test accuracy in different criteria as pruning rate increases on VGG-16 (top) and ResNet-50 (bottom) with five datasets. Pruning with fine-tuning.
Prematurely terminated lines in above row of panels indicate that during pruning, the respective criterion removed filters vital to the network structure by disconnecting
the model input from the output.

Fig. 6. Performance comparison of the proposed method (i.e., LRP) and other criteria on VGG-16 and ResNet-50 with five datasets. Each point in the scatter plot corresponds
to the performance at a specific pruning rate of two criteria, where the vertical axis shows the performance of our LRP criterion and the horizontal axis the performance of
a single other criterion (compare to Fig. 5 that displays the same data for more than two criteria). The black dashed line shows the set of points where models pruned by
one of the compared criteria would exhibit identical performance to LRP. For accuracy, higher values are better. For loss, lower values are better.

[Figure 7: two panels, A. Cats and Dogs and B. Oxford Flower 102, plotting the remaining filters (%) against the index of the convolutional layer for the LRP, Weight, Gradient and Taylor criteria.]

Fig. 7. An observation of per-layer pruning performed w.r.t the different evaluated criteria on VGG-16 and two datasets. Each colored line corresponds to a specific (global)
ratio of filters pruned from the network (black (top) : 0%, red : 15%, green: 30%, blue: 45%, violet: 75% and black (bottom) 90%). The dots on each line identify the ratio of
pruning applied to specific convolutional layers, given a global ratio of pruning, depending on the pruning criterion. (For interpretation of the references to colour in this
figure legend, the reader is referred to the web version of this article.)


Table 3
A performance comparison between criteria (Weight, Taylor, Gradient with l2 -norm each and LRP) and the Unpruned model for VGG-16 (top) and
ResNet-50 (bottom) on five different image benchmark datasets. Criteria are evaluated at fixed pruning rates per model and dataset, identified as
dataset @ percent_pruned_filters %. We report test accuracy (in %), (training) loss (×10−2 ), number of remaining parameters (×107 ) and FLOPs (in
GMAC) per forward pass. For all measures except accuracy, lower outcomes are better.

VGG-16 Scene 15 @ 55% Event 8 @ 55% Cats & Dogs @ 60%

U W T G L U W T G L U W T G L

Loss 2.09 2.27 1.76 1.90 1.62 0.85 1.35 1.01 1.18 0.83 0.19 0.50 0.51 0.57 0.44
Accuracy 88.59 82.07 83.00 82.72 83.99 95.95 90.19 91.79 90.55 93.29 99.36 97.90 97.54 97.19 98.24
Params 119.61 56.17 53.10 53.01 49.67 119.58 56.78 48.48 50.25 47.35 119.55 47.47 51.19 57.27 43.75
FLOPs 15.50 8.03 4.66 4.81 6.94 15.50 8.10 5.21 5.05 7.57 15.50 7.02 3.86 3.68 6.49

Oxford Flower 102 @ 70% CIFAR-10 @ 30%

U W T G L U W T G L

Loss 3.69 3.83 3.27 3.54 2.96 1.57 1.83 1.76 1.80 1.71
Accuracy 82.26 71.84 72.11 70.53 74.59 91.04 93.36 93.29 93.05 93.42
Params 119.96 39.34 41.37 42.68 37.54 119.59 74.55 97.30 97.33 89.20
FLOPs 15.50 5.48 2.38 2.45 4.50 15.50 11.70 8.14 8.24 9.93

ResNet-50 Scene 15 @ 55% Event 8 @ 55% Cats & Dogs @ 60%

U W T G L U W T G L U W T G L

Loss 0.81 1.32 1.08 1.32 0.50 0.33 1.07 0.63 0.85 0.28 0.01 0.05 0.06 0.21 0.02
Accuracy 88.28 80.17 80.26 78.71 85.38 96.17 88.27 87.55 86.38 94.22 98.42 97.02 96.33 93.13 98.03
Params 23.54 14.65 12.12 11.84 13.73 23.52 13.53 11.85 11.93 14.05 23.51 12.11 10.40 10.52 12.48
FLOPs 4.12 3.22 2.45 2.42 3.01 4.12 3.16 2.48 2.47 3.10 4.12 3.04 2.40 2.27 2.89

Oxford Flower 102 @ 70% CIFAR-10 @ 30%

U W T G L U W T G L

Loss 0.82 3.04 2.18 2.69 0.83 0.003 0.002 0.004 0.009 0.003
Accuracy 77.82 51.88 58.62 53.96 76.83 93.55 93.37 93.15 92.76 93.23
Params 23.72 9.24 8.82 8.48 9.32 23.52 19.29 18.10 17.96 18.11
FLOPs 4.12 2.55 1.78 1.81 2.38 1.30 1.14 1.06 1.05 1.16

Fig. 7 shows the number of the remaining convolutional filters for each iteration. We observe that, on the one hand, as the pruning rate increases, the convolutional filters in earlier layers that are associated with very generic features, such as edge and blob detectors, tend to generally be preserved as opposed to those in latter layers which are associated with abstract, task-specific features. On the other hand, the LRP and weight criteria first keep the filters in early layers in the beginning, but later aggressively prune filters near the input which have then lost functionality as input to later layers, compared to the gradient-based criteria such as the gradient and Taylor-based approaches. Although gradient-based criteria also adopt the greedy layer-by-layer approach, we can see that gradient-based criteria pruned the less important filters almost uniformly across all the layers due to re-normalization of the criterion in each iteration. However, this result contrasts with previous gradient-based works [22,25] that have shown that units deemed unimportant in earlier layers contribute significantly compared to units deemed important in latter layers. In contrast to this, LRP can efficiently preserve units in the early layers — as long as they serve a purpose — despite iterative global pruning.

4.2.2. Scenario 2: Pruning without fine-tuning

In this section, we evaluate whether pruning works well if only a (very) limited number of samples is available for quantifying the pruning criteria. To the best of our knowledge, there are no previous studies that show the performance of pruning approaches when acting w.r.t. very small amounts of data. With large amounts of data available (and even though we can expect reasonable performance after pruning), an iterative pruning and fine-tuning procedure of the network can amount to a very time consuming and computationally heavy process. From a practical point of view, this issue becomes a significant problem, e.g., with limited computational resources (mobile devices or, in general, consumer-level hardware) and reference data (e.g., private photo collections), where capable and effective pruning approaches are desired.

To investigate whether pruning is possible also in these scenarios, we performed experiments with a relatively small number of data on 1) the Cats & Dogs dataset and 2) subsets of the ILSVRC 2012 classes.

On the Cats & Dogs dataset, we only used 10 samples each from the "cat" and "dog" classes to prune the (on ImageNet) pre-trained AlexNet, VGG-16, ResNet-18 and ResNet-50 networks with the goal of domain/dataset adaptation. The binary classification (i.e., "cat" vs. "dog") is a subtask within the ImageNet taxonomy and the corresponding output neurons can be identified by their WordNet associations (5). This experiment implements the task of domain adaptation.

5 http://www.image-net.org/archive/wordnet.is_a.txt

In a second experiment on the ILSVRC 2012 dataset, we randomly chose k = 3 classes for the task of model specialization, selected only n = 10 images per class from the training set and used them to compare the different pruning criteria. For each criterion, we used the same selection of classes and samples. In both experimental settings, we do not fine-tune the models after each pruning iteration, in contrast to Scenario 1 in Section 4.2.1. The obtained post-pruning model performance is averaged over 20 random selections of classes (ImageNet) and samples (Cats & Dogs) to account for randomness. Please note that before pruning, we first restructured the models' fully connected output layers to only preserve the task-relevant k network outputs by eliminating the 1000 − k redundant output neurons.

Furthermore, as our target datasets are relatively small and only have an extremely reduced set of target classes, the pruned models could still be very heavy w.r.t. memory requirements if the pruning process were limited to the convolutional layers, as in Section 4.2.1. More specifically, while convolutional layers dominantly constitute the source of computation cost (FLOPs), fully connected layers are proven to be more redundant [29]. In this respect, we applied pruning procedures in both fully connected layers and convolutional layers in combination for VGG-16.
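Restructuring the fully connected output layer so that only the k task-relevant classes remain amounts to slicing the final linear layer; a minimal sketch is shown below. The helper name and the classifier[-1] attribute (as in a torchvision VGG-16) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def restrict_outputs(fc: nn.Linear, keep_class_ids) -> nn.Linear:
    """Keep only the output neurons of the given class indices and drop the
    remaining 1000 - k outputs of an ImageNet classifier head."""
    keep = torch.as_tensor(list(keep_class_ids), dtype=torch.long)
    new_fc = nn.Linear(fc.in_features, len(keep), bias=fc.bias is not None)
    new_fc.weight.copy_(fc.weight[keep])
    if fc.bias is not None:
        new_fc.bias.copy_(fc.bias[keep])
    return new_fc

# e.g., for a torchvision VGG-16 with task-relevant ImageNet class indices keep_ids:
# model.classifier[-1] = restrict_outputs(model.classifier[-1], keep_ids)
```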


Fig. 8. Test accuracy after pruning of n% of convolutional (rows) and m% of fully connected (columns) filters on VGG-16 without fine-tuning, for a random subset of the classes from ILSVRC 2012 (k = 3), based on different criteria (averaged over 20 repetitions). Each color represents a range of 5% in test accuracy; the brighter the color, the better the performance after a given degree of pruning.
Fig. 9. Test accuracy after pruning of n% of convolutional filters on ResNet-18 and ResNet-50 without fine-tuning, for a random subset of the classes from ILSVRC 2012 (k = 3), based on the criteria weight, Taylor and gradient (each with l2-norm) and LRP (averaged over 20 repetitions). Compare to Fig. 8.
For pruning, we iterate a sequence of first pruning filters from the convolutional layers, followed by a step of pruning neurons from the model's fully connected layers. Note that both evaluated ResNet architectures mainly consist of convolutional and pooling layers and conclude in a single dense layer, whose input neurons are only affected indirectly, via their inputs, by pruning the convolutional stack below. We therefore restrict the iterative pruning of neurons from a sequence of dense layers to the feed-forward architecture of VGG-16.

The model performance after the application of each criterion for classifying a small number of classes (k = 3) from the ILSVRC 2012 dataset is shown in Fig. 8 for VGG-16 and in Fig. 9 for the ResNets (please note again that the ResNets do not have a sequence of fully connected layers). During pruning of the fully connected layers, no significant difference across different pruning ratios can be observed. Without further fine-tuning, pruning weights/filters at the fully connected layers can retain performance efficiently. However, there is a clear difference between LRP and the other criteria with increasing pruning ratio of the convolutional layers for VGG-16/ResNet-18/ResNet-50, respectively (LRP vs. Taylor with l2-norm: up to 9.6/61.8/51.8%; LRP vs. gradient with l2-norm: up to 28.0/63.6/54.5%; LRP vs. weight with l2-norm: up to 27.1/48.3/30.2%). Moreover, pruning convolutional layers needs to be managed more carefully than pruning fully connected layers. We can observe that LRP is applicable for pruning any layer type (i.e., fully connected, convolutional, pooling, etc.) efficiently.
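To make the alternating procedure above concrete, the following sketch illustrates one possible loop structure. It is a simplified stand-in rather than the authors' implementation: it scores units by the l2-norm of their weights (instead of LRP relevance accumulated over reference samples) and merely masks the lowest-ranked convolutional filters and dense-layer neurons instead of physically removing them from the architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

def score_conv_filters(model: nn.Module):
    """Stand-in criterion: l2-norm of each convolutional filter's weights.
    In the paper, these scores would be LRP relevances gathered over reference samples."""
    scores = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            per_filter = module.weight.detach().flatten(1).norm(p=2, dim=1)
            scores += [(name, i, float(v)) for i, v in enumerate(per_filter)]
    return scores

def score_fc_neurons(model: nn.Module):
    """Stand-in criterion for neurons of the fully connected layers
    (the small task-specific output layer is skipped)."""
    scores = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and module.out_features > 10:
            per_neuron = module.weight.detach().norm(p=2, dim=1)
            scores += [(name, i, float(v)) for i, v in enumerate(per_neuron)]
    return scores

def mask_lowest(model: nn.Module, scores, ratio: float):
    """Zero out the globally lowest-ranked units (masking stands in for
    physically removing filters/neurons from the architecture)."""
    modules = dict(model.named_modules())
    n_prune = int(len(scores) * ratio)
    for name, idx, _ in sorted(scores, key=lambda t: t[2])[:n_prune]:
        with torch.no_grad():
            modules[name].weight[idx].zero_()
            if modules[name].bias is not None:
                modules[name].bias[idx] = 0.0

model = models.vgg16(pretrained=False)  # stand-in; the paper starts from ImageNet-pretrained weights
for step in range(1, 6):                # iterative pruning without fine-tuning in between
    ratio = 0.05 * step                 # the pruned fraction grows with every iteration
    mask_lowest(model, score_conv_filters(model), ratio)  # first the convolutional filters ...
    mask_lowest(model, score_fc_neurons(model), ratio)    # ... then the dense-layer neurons
```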
Fig. 10. Performance comparison of pruning without fine-tuning for AlexNet, VGG-16, ResNet-18 and ResNet-50, based on only few (10) samples per class from the Cats & Dogs dataset, as a means of domain adaptation. Additional results on further target domains can be found in the Supplement (Supplementary Figure 3).
Additionally, as mentioned in Section 3.1, our method can be applied to general network architectures because it can automatically measure the importance of weights or filters in a global (network-wise) context without further normalization.

Fig. 10 shows the test accuracy as a function of the pruning ratio for the domain adaptation task from ImageNet to Cats & Dogs. As the pruning ratio increases, we can see that even without fine-tuning, using LRP as pruning criterion can keep the test accuracy not only stable, but close to 100%, given the extreme scarcity of data in this experiment. In contrast, the performance decreases significantly when using the other criteria requiring an application of the l2-norm. Initially, the performance is even slightly increasing when pruning with LRP. During iterative pruning, unexpected changes in accuracy with LRP (for 2 out of 20 repetitions of the experiment) have been observed at around a 50–55% pruning ratio, but accuracy is quickly regained. However, only the VGG-16 model seems to be affected for this task. For both ResNet models, this phenomenon occurs for the other criteria instead. A series of in-depth investigations of this momentary decrease in performance did not lead to any insights and will be the subject of future work.⁶

⁶ We consequently have to assume that this phenomenon marks the downloaded pre-trained VGG-16 model as an outlier in this respect. A future line of research will investigate the circumstances leading to the intermediate loss and later recovery of model performance during pruning.

By pruning close to 99% of convolutional filters in the networks using our proposed method, we obtain 1) greatly reduced computational cost, 2) faster forward and backward processing (e.g., for the purpose of further training, inference or the computation of attribution maps), and 3) a lighter model even in the small-sample case, all while adapting off-the-shelf pre-trained ImageNet models towards a dog-vs.-cat classification task.

5. Discussion

Our experiments demonstrate that the novel LRP criterion consistently performed well compared to other criteria across various datasets, model architectures and experimental settings, and oftentimes outperformed the competing criteria. This is especially pronounced in our Scenario 2 (cf. Section 4.2.2), where only limited resources are available for criterion computation and no fine-tuning after pruning is allowed. Here, LRP considerably outperformed the other metrics on toy data (cf. Section 4.1) and image processing benchmark data (cf. Section 4.2.2). The strongly similar results between criteria observed in Scenario 1 (cf. Section 4.2.1) are also not surprising, as an additional fine-tuning step after pruning may allow the pruned neural network model to recover its original performance, as long as the model has the capacity to do so [22]. From the results of Table 3 and Supplementary Table 3 we can observe that with a fixed pruning target of n% filters removed, LRP might not always result in the cheapest sub-network after pruning in terms of parameter count and FLOPs per inference; however, it consistently is able to identify the network components for removal and preservation leading to the best performing model after pruning. The latter results also resonate strongly in our experiments of Scenario 2 on both image and toy data, where, without the additional fine-tuning step, the LRP-pruned models vastly outperform their competitors. The results obtained in multiple toy settings verify that only the LRP-based pruning criterion is able to preserve the original structure of the prediction function (cf. Figs. 2 and 3).

Unlike the weight criterion, which is a static quantity once the network is no longer being trained, the criteria Taylor, gradient and LRP require reference samples for computation, which in turn may affect the estimation of neuron importance. From the latter three criteria, however, only LRP provides a continuous measure of network structure importance (cf. Section 7.2 in [12]) which does not suffer from abrupt changes in the estimated importance measures with only marginal steps between reference samples. This quality of continuity is reflected in the stability and quality of the LRP results reported in Section 4.1, compared to the high volatility in neuron selection for pruning and model performance after pruning observable for the gradient and Taylor criteria. From this observation it can also be deduced that LRP requires relatively few data points to converge to a pruning solution that possesses a similar prediction behavior to the original model. Hence, we conclude that LRP is a robust pruning criterion that is broadly applicable in practice. Especially in a scenario where no fine-tuning is applied after pruning (see Section 4.2.2), the LRP criterion allows for pruning of a large part of the model without significant accuracy drops.

In terms of computational cost, LRP is comparable to the Taylor and gradient criteria because these criteria require both a forward and a backward pass for all reference samples. The weight criterion is substantially cheaper to compute since it does not require the evaluation of any reference samples; however, its performance falls short in most of our experiments. Additionally, our experiments demonstrate that LRP requires fewer reference samples than the other criteria (cf. Fig. 3 and Fig. 4), thus the required computational cost is lower in practical scenarios, and better performance can be expected if only low numbers of reference samples are available (cf. Fig. 10).

Unlike all other criteria, LRP does not require explicit regularization via lp-normalization, as it is naturally normalized via its enforced relevance conservation principle during relevance backpropagation, which leads to the preservation of important network substructures and bottlenecks in a global model context. In line with the findings by [22], our results in Fig. 5 and Supplementary Fig. 2 show that additional normalization after criterion computation for weight, gradient and Taylor is not only vital to obtain good performance, but also to avoid disconnected model segments — something which is prevented out-of-the-box with LRP.
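As a small illustration of the per-layer normalization discussed above, the sketch below rescales hypothetical per-filter criterion scores by each layer's l2-norm before a global ranking; raw_scores and its contents are made-up placeholder values, not results from the paper. LRP scores would skip this step, as relevance conservation already makes them comparable across layers.

```python
import torch

def l2_normalize_per_layer(raw_scores: dict) -> dict:
    """Rescale the criterion values of each layer by that layer's l2-norm so that
    scores from different layers become comparable before a global ranking.
    Criteria such as weight, gradient or Taylor rely on this step."""
    normalized = {}
    for layer, scores in raw_scores.items():
        norm = scores.norm(p=2) + 1e-12              # avoid division by zero
        normalized[layer] = scores / norm
    return normalized

# toy example: two layers whose raw scores live on very different scales; without
# normalization a global ranking would almost exclusively prune filters from conv2
raw_scores = {
    "conv1": torch.tensor([3.0, 2.5, 4.0]),          # hypothetical per-filter scores
    "conv2": torch.tensor([0.020, 0.015, 0.030]),
}
normalized = l2_normalize_per_layer(raw_scores)
global_ranking = sorted(
    ((layer, i, float(s)) for layer, t in normalized.items() for i, s in enumerate(t)),
    key=lambda entry: entry[2],
)
print(global_ranking[:2])                            # the two globally least important filters
```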
However, our proposed criterion still leaves several open questions that deserve a deeper investigation in future work. First of all, LRP is not implementation invariant, i.e., the structure and composition of the analyzed network might affect the computation of the LRP-criterion, and "network canonization" — a functionally equivalent restructuring of the model — might be required for optimal results, as discussed earlier in Section 4 and in [43]. Furthermore, while our LRP-criterion does not require additional hyperparameters, e.g., for normalization, the pruning result might still depend on the chosen LRP variant. In this paper, we chose the α1β0-rule in all layers, because this particular parameterization identifies the network's neural pathways positively contributing to the selected output neurons for which reference samples are provided, is robust against the detrimental effects of shattered gradients affecting especially very deep CNNs [11] (i.e., other than gradient-based methods, it does not suffer from potential discontinuities in the backpropagated quantities), and has a mathematically well-motivated foundation in DTD [11,12]. However, other works from the literature provide [14] or suggest [8,9] alternative parameterizations to optimize the method for explanatory purposes. It is an interesting direction for future work to examine whether these findings also apply to LRP as a pruning criterion.

6. Conclusion

Modern CNNs typically have a high capacity with millions of parameters, as this allows to obtain good optimization results in the training process. After training, however, high inference costs remain, despite the fact that the number of effective parameters in the deep model is actually significantly lower (see e.g. [45]). To alleviate this, pruning aims at compressing and accelerating the given models without sacrificing much predictive performance. In this paper, we have proposed a novel criterion for the iterative pruning of CNNs based on the explanation method LRP, linking for the first time two so far disconnected lines of research.

LRP has a clearly defined meaning, namely the contribution of an individual network unit, i.e. weight or filter, to the network output. Removing units according to low LRP scores thus means discarding all aspects in the model that do not contribute relevance to its decision making. Hence, as a criterion, the computed relevance scores can easily and cheaply give efficient compression rates without further postprocessing, such as per-layer normalization.

Besides, technically LRP is scalable to general network structures and its computational cost is similar to that of a gradient backward pass.

In our experiments, the LRP criterion has shown favorable compression performance on a variety of datasets both with and without retraining after pruning. Especially when pruning without retraining, our results for small datasets suggest that the LRP criterion outperforms the state of the art and therefore its application is especially recommended in transfer learning settings where only a small target dataset is available.

In addition to pruning, the same method can be used to visually interpret the model and explain individual decisions as intuitive relevance heatmaps. Therefore, for future work, we propose to use these heatmaps to elucidate and explain which image features are most strongly affected by pruning, in order to additionally avoid that the pruning process leads to undesired Clever Hans phenomena [8].

Acknowledgements

This work was supported by the German Ministry for Education and Research (BMBF) through BIFOLD (refs. 01IS18025A and 01IS18037A), MALT III (ref. 01IS17058), Patho234 (ref. 031L0207D) and TraMeExCo (ref. 01IS18056A), as well as the Grants 01GQ1115 and 01GQ0850; and by Deutsche Forschungsgesellschaft (DFG) under Grant Math+, EXC 2046/1, Project ID 390685689; by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea Government (No. 2019-0-00079, Artificial Intelligence Graduate School Program, Korea University); and by ST-SUTD Cyber Security Corporate Laboratory; the AcRF Tier2 grant MOE2016-T2-2-154; the TL project Intent Inference; and the SUTD internal grant Fundamentals and Theory of AI Systems. The authors would like to express their thanks to Christopher J Anders for insightful discussions.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Supplementary material

Supplementary material associated with this article can be found in the online version, at doi:10.1016/j.patcog.2021.107899.

References

[1] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, T. Chen, Recent advances in convolutional neural networks, Pattern Recognit 77 (2018) 354–377.
[2] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, N. de Freitas, Predicting parameters in deep learning, in: Advances in Neural Information Processing Systems (NIPS), 2013, pp. 2148–2156.
[3] V. Sze, Y. Chen, T. Yang, J.S. Emer, Efficient processing of deep neural networks: a tutorial and survey, Proc. IEEE 105 (12) (2017) 2295–2329.
[4] Y. LeCun, J.S. Denker, S.A. Solla, Optimal brain damage, in: Advances in Neural Information Processing Systems (NIPS), 1989, pp. 598–605.
[5] Y. Tu, Y. Lin, Deep neural network compression technique towards efficient digital signal modulation recognition in edge device, IEEE Access 7 (2019) 58113–58119.
[6] Y. Cheng, D. Wang, P. Zhou, T. Zhang, Model compression and acceleration for deep neural networks: the principles, progress, and challenges, IEEE Signal Process Mag 35 (1) (2018) 126–136.
[7] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PLoS ONE 10 (7) (2015) e0130140.
[8] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, K.-R. Müller, Unmasking Clever Hans predictors and assessing what machines really learn, Nat Commun 10 (1) (2019) 1096.
[9] M. Hägele, P. Seegerer, S. Lapuschkin, M. Bockmayr, W. Samek, F. Klauschen, K.-R. Müller, A. Binder, Resolving challenges in deep learning-based analyses of histopathological images using explanation methods, Sci Rep 10 (2020) 6423.
[10] P. Seegerer, A. Binder, R. Saitenmacher, M. Bockmayr, M. Alber, P. Jurmeister, F. Klauschen, K.-R. Müller, Interpretable deep neural network to predict estrogen receptor status from haematoxylin-eosin images, in: Artificial Intelligence and Machine Learning for Digital Pathology: State-of-the-Art and Future Challenges, Springer International Publishing, Cham, 2020, pp. 16–37.
[11] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, K.-R. Müller, Explaining nonlinear classification decisions with deep Taylor decomposition, Pattern Recognit 65 (2017) 211–222.
[12] G. Montavon, W. Samek, K.-R. Müller, Methods for interpreting and understanding deep neural networks, Digit Signal Process 73 (2018) 1–15.
[13] W. Samek, G. Montavon, S. Lapuschkin, C.J. Anders, K.-R. Müller, Explaining deep neural networks and beyond: a review of methods and applications, Proceedings of the IEEE 109 (3) (2021) 1–32, doi:10.1109/JPROC.2021.3060483.
[14] M. Alber, S. Lapuschkin, P. Seegerer, M. Hägele, K.T. Schütt, G. Montavon, W. Samek, K.-R. Müller, S. Dähne, P.-J. Kindermans, iNNvestigate neural networks!, Journal of Machine Learning Research 20 (2019) 93:1–93:8.
[15] S. Wiedemann, K.-R. Müller, W. Samek, Compact and computationally efficient representation of deep neural networks, IEEE Trans Neural Netw Learn Syst 31 (3) (2020) 772–785.
[16] F. Tung, G. Mori, Deep neural network compression by in-parallel pruning-quantization, IEEE Trans Pattern Anal Mach Intell 42 (3) (2020) 568–579.
[17] K. Guo, X. Xie, X. Xu, X. Xing, Compressing by learning in a low-rank and sparse decomposition form, IEEE Access 7 (2019) 150823–150832.
[18] T. Xu, P. Yang, X. Zhang, C. Liu, LightweightNet: toward fast and lightweight convolutional neural networks via architecture distillation, Pattern Recognit 88 (2019) 272–284.
[19] X. Zhang, X. Zhou, M. Lin, J. Sun, ShuffleNet: an extremely efficient convolutional neural network for mobile devices, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6848–6856.
[20] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, J. Kautz, Importance estimation for neural network pruning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11264–11272.
[21] B. Hassibi, D.G. Stork, Second order derivatives for network pruning: optimal brain surgeon, in: Advances in Neural Information Processing Systems (NIPS), 1992, pp. 164–171.
[22] P. Molchanov, S. Tyree, T. Karras, T. Aila, J. Kautz, Pruning convolutional neural networks for resource efficient transfer learning, in: Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[23] C. Yu, J. Wang, Y. Chen, X. Qin, Transfer channel pruning for compressing deep domain adaptation models, Int. J. Mach. Learn. Cybern. 10 (11) (2019) 3129–3144.
[24] C. Liu, H. Wu, Channel pruning based on mean gradient for accelerating convolutional neural networks, Signal Processing 156 (2019) 84–91.
[25] X. Sun, X. Ren, S. Ma, H. Wang, meProp: sparsified back propagation for accelerated deep learning with reduced overfitting, in: International Conference on Machine Learning (ICML), 2017, pp. 3299–3308.
[26] S. Han, J. Pool, J. Tran, W.J. Dally, Learning both weights and connections for efficient neural network, in: Advances in Neural Information Processing Systems (NIPS), 2015, pp. 1135–1143.
[27] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M.A. Horowitz, W.J. Dally, EIE: efficient inference engine on compressed deep neural network, in: International Symposium on Computer Architecture (ISCA), 2016, pp. 243–254.
[28] W. Wen, C. Wu, Y. Wang, Y. Chen, H. Li, Learning structured sparsity in deep neural networks, in: Advances in Neural Information Processing Systems (NIPS), 2016, pp. 2074–2082.
[29] H. Li, A. Kadav, I. Durdanovic, H. Samet, H.P. Graf, Pruning filters for efficient convnets, in: International Conference on Learning Representations (ICLR), 2017.
[30] R. Yu, A. Li, C. Chen, J. Lai, V.I. Morariu, X. Han, M. Gao, C. Lin, L.S. Davis, NISP: pruning networks using neuron importance score propagation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9194–9203.
[31] J.-H. Luo, H. Zhang, H.-Y. Zhou, C.-W. Xie, J. Wu, W. Lin, ThiNet: pruning CNN filters for a thinner net, IEEE Trans Pattern Anal Mach Intell 41 (10) (2019) 2525–2538.
[32] J. Gan, W. Wang, K. Lu, Compressing the CNN architecture for in-air handwritten Chinese character recognition, Pattern Recognit Lett 129 (2020) 190–197.
[33] X. Dai, H. Yin, N.K. Jha, NeST: a neural network synthesis tool based on a grow-and-prune paradigm, IEEE Trans. Comput. 68 (10) (2019) 1487–1497.
[34] Explainable AI: interpreting, explaining and visualizing deep learning, in: W. Samek, G. Montavon, A. Vedaldi, L.K. Hansen, K.-R. Müller (Eds.), Lecture Notes in Computer Science, vol. 11700, Springer, 2019.
[35] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, K.-R. Müller, Evaluating the visualization of what a deep neural network has learned, IEEE Trans Neural Netw Learn Syst 28 (11) (2017) 2660–2673.
[36] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 2169–2178.
[37] L. Li, F. Li, What, where and who? Classifying events by scene and object recognition, in: IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–8.
[38] J. Elson, J.R. Douceur, J. Howell, J. Saul, Asirra: a CAPTCHA that exploits interest-aligned manual image categorization, in: Proceedings of the 2007 ACM Conference on Computer and Communications Security (CCS), 2007, pp. 366–374.
[39] M. Nilsback, A. Zisserman, Automated flower classification over a large number of classes, in: Sixth Indian Conference on Computer Vision, Graphics & Image Processing (ICVGIP), 2008, pp. 722–729.
[40] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M.S. Bernstein, A.C. Berg, F. Li, ImageNet large scale visual recognition challenge, Int J Comput Vis 115 (3) (2015) 211–252.
[41] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[42] H. Wang, Q. Zhang, Y. Wang, H. Hu, Structured probabilistic pruning for convolutional neural network acceleration, in: British Machine Vision Conference (BMVC), 2018, p. 149.
[43] M. Guillemot, C. Heusele, R. Korichi, S. Schnebert, L. Chen, Breaking batch normalization for better explainability of deep neural networks through layer-wise relevance propagation, CoRR abs/2002.11018 (2020).
[44] J. Liu, Y. Wang, Y. Qiao, Sparse deep transfer learning for convolutional neural network, in: AAAI Conference on Artificial Intelligence, 2017, pp. 2245–2251.
[45] N. Murata, S. Yoshizawa, S. Amari, Network information criterion — determining the number of hidden units for an artificial neural network model, IEEE Trans. Neural Networks 5 (6) (1994) 865–872.

Seul-Ki Yeom received a Ph.D. degree in Brain-Computer Interfacing from Korea University in 2018. From 2018 to 2020, he was associated with the Machine Learning Group at Technische Universität Berlin. Since 2020, Seul-Ki holds a position as Senior Research Engineer at Nota.ai. His research interests include brain-computer interfaces, machine learning, and model compression.

Philipp Seegerer received an M.Sc. degree in Medical Image and Data Processing from Friedrich-Alexander-Universität Erlangen-Nürnberg in 2017. He is currently a Doctoral Researcher in the Machine Learning Group at Technische Universität Berlin and, since 2019, has been associated with Aignostics as a Machine Learning Engineer. His research interests are machine learning and medical image and data analysis, in particular computational pathology.

Sebastian Lapuschkin received an M.Sc. degree in Computer Science in 2013 and a Ph.D. degree from Technische Universität Berlin in 2018. He is currently the Head of the Explainable AI Group at the Fraunhofer Heinrich Hertz Institute. His research interests are explainability and efficiency in computer vision, machine learning and data analysis.

Alexander Binder obtained a Dr. rer. nat. degree from Technical University Berlin in 2013. He is currently an Associate Professor in the Institute of Informatics at the University of Oslo. He was an Assistant Professor at SUTD from 2015 to 2020. His research interests include computer vision, machine learning, explaining non-linear predictions and medical applications.

Simon Wiedemann received an M.Sc. degree in applied mathematics from Technische Universität Berlin in 2018. He is currently a Research Associate in the Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute. His research interests include information theory and efficient machine learning, in particular compression, efficient inference and training of neural networks.

Klaus-Robert Müller (Ph.D. 1992) has been a Professor of computer science at TU Berlin since 2006 and is co-director of the Berlin Big Data Center. He won the 1999 Olympus Prize of the German Pattern Recognition Society, the 2006 SEL Alcatel Communication Award, and the 2014 Science Prize of Berlin. Since 2012, he has been an elected member of the German National Academy of Sciences Leopoldina.

Wojciech Samek received a Diploma degree in Computer Science from Humboldt University Berlin in 2010 and a Ph.D. degree in Machine Learning from Technische Universität Berlin in 2014. Currently, he directs the Department of Artificial Intelligence at Fraunhofer Heinrich Hertz Institute. His research interests include neural networks, interpretability and federated learning.