
Soft Filter Pruning for Accelerating Deep Convolutional Neural Networks

Yang He1,2, Guoliang Kang2, Xuanyi Dong2, Yanwei Fu3∗, Yi Yang1,2∗

1 SUSTech-UTS Joint Centre of CIS, Southern University of Science and Technology
2 CAI, University of Technology Sydney
3 The School of Data Science, Fudan University
{yang.he-1, guoliang.kang, xuanyi.dong}@student.uts.edu.au,
[email protected], [email protected]
∗ Corresponding Author
arXiv:1808.06866v1 [cs.CV] 21 Aug 2018

Abstract

This paper proposes a Soft Filter Pruning (SFP) method to accelerate the inference procedure of deep Convolutional Neural Networks (CNNs). Specifically, the proposed SFP enables the pruned filters to be updated when training the model after pruning. SFP has two advantages over previous works: (1) Larger model capacity. Updating previously pruned filters provides our approach with a larger optimization space than fixing the filters to zero. Therefore, the network trained by our method has a larger model capacity to learn from the training data. (2) Less dependence on the pre-trained model. The large capacity enables SFP to train from scratch and prune the model simultaneously. In contrast, previous filter pruning methods have to be conducted on the basis of a pre-trained model to guarantee their performance. Empirically, SFP from scratch outperforms the previous filter pruning methods. Moreover, our approach has been demonstrated to be effective for many advanced CNN architectures. Notably, on ILSVRC-2012, SFP reduces more than 42% FLOPs on ResNet-101 with even a 0.2% top-5 accuracy improvement, which advances the state-of-the-art. Code is publicly available on GitHub: https://github.com/he-y/soft-filter-pruning

Figure 1: Hard Filter Pruning v.s. Soft Filter Pruning. We mark the pruned filter as the green dashed box. For hard filter pruning, the pruned filters are always fixed during the whole training procedure; the model capacity is therefore reduced, which harms the performance, because the dashed blue box is useless during training. On the contrary, our SFP allows the pruned filters to be updated during the training procedure. In this way, the model capacity is recovered from the pruned model, which leads to a better accuracy.

1 Introduction

The superior performance of deep CNNs usually comes from deeper and wider architectures, which cause prohibitively expensive computation cost. Even if we use more efficient architectures, such as residual connections [He et al., 2016a] or the inception module [Szegedy et al., 2015], it is still difficult to deploy state-of-the-art CNN models on mobile devices. For example, ResNet-152 has 60.2 million parameters and 231MB of storage space; besides, it also needs more than 380MB of memory footprint and six seconds (11.3 billion floating-point operations, FLOPs) to process a single image on a CPU. The storage, memory, and computation of this cumbersome model significantly exceed the computing limitation of current mobile devices. Therefore, it is essential to keep deep CNN models small, with relatively low computational cost but high accuracy, for real-world applications.

Recent efforts have been made either on directly deleting weight values of filters [Han et al., 2015b] (i.e., weight pruning) or on totally discarding some filters (i.e., filter pruning) [Li et al., 2017; He et al., 2017; Luo et al., 2017]. However, weight pruning may result in unstructured sparsity of the filters, which may still be less efficient in saving memory usage and computational cost, since the unstructured model cannot leverage the existing high-efficiency BLAS libraries. In contrast, filter pruning gives the model structured sparsity and more efficient memory usage than weight pruning, and thus takes full advantage of BLAS libraries to achieve a more realistic acceleration. Therefore, filter pruning is more advocated for accelerating networks.

Nevertheless, most of the previous works on filter pruning still suffer from the problems of (1) model capacity reduction and (2) dependence on the pre-trained model. Specifically, as shown in Fig. 1, most previous works conduct "hard filter pruning", which directly deletes the pruned filters. The discarded filters reduce the model capacity of the original model, and thus inevitably harm the performance. Moreover, to maintain a reasonable performance with respect to the full models, previous works [Li et al., 2017; He et al., 2017; Luo et al., 2017] always fine-tune the hard-pruned model after pruning the filters of a pre-trained model, which however has low training efficiency and often requires much more training time than the traditional training schema.

To address the above two problems, we propose a novel Soft Filter Pruning (SFP) approach. SFP dynamically prunes the filters in a soft manner. Particularly, before the first training epoch, the filters of almost all layers with small ℓ2-norm are selected and set to zero. Then the training data is used to update the pruned model. Before the next training epoch, our SFP prunes a new set of filters of small ℓ2-norm. This training process is continued until convergence. Finally, some filters are selected and pruned without further updating. The SFP algorithm enables the compressed network to have a larger model capacity, and thus achieve a higher accuracy than others.

Contributions. We highlight three contributions: (1) We propose SFP to allow the pruned filters to be updated during the training procedure. This soft manner can dramatically maintain the model capacity and thus achieves superior performance. (2) Our acceleration approach can train a model from scratch and achieve better performance compared to the state-of-the-art. In this way, the fine-tuning procedure and the overall training time are saved. Moreover, using the pre-trained model can further enhance the performance of our approach to advance the state-of-the-art in model acceleration. (3) Extensive experiments on two benchmark datasets demonstrate the effectiveness and efficiency of our SFP. We accelerate ResNet-110 by two times with about 4% relative accuracy improvement on CIFAR-10, and also achieve state-of-the-art results on ILSVRC-2012.

2 Related Works

Most previous works on accelerating CNNs can be roughly divided into three categories, namely, matrix decomposition, low-precision weights, and pruning. In particular, matrix decomposition approximates the tensors of a deep CNN by the product of two low-rank matrices [Jaderberg et al., 2014; Zhang et al., 2016; Tai et al., 2016], which can save computational cost. Some works [Zhu et al., 2017; Zhou et al., 2017] focus on compressing CNNs by using low-precision weights. Pruning-based approaches aim to remove the unnecessary connections of the neural network [Han et al., 2015b; Li et al., 2017]. Essentially, the work of this paper is based on the idea of pruning techniques; the approaches of matrix decomposition and low-precision weights are orthogonal but potentially useful here – it may still be worth simplifying the weight matrix after pruning filters, which we leave as future work.

Weight Pruning. Many recent works [Han et al., 2015b; Han et al., 2015a; Guo et al., 2016] prune weights of the neural network, resulting in small models. For example, [Han et al., 2015b] proposed an iterative weight pruning method that discards the small weights whose values are below a threshold. [Guo et al., 2016] proposed dynamic network surgery to reduce the training iterations while maintaining a good prediction accuracy. [Wen et al., 2016; Lebedev and Lempitsky, 2016] leveraged the sparsity property of feature maps or weight parameters to accelerate the CNN models. A special case of weight pruning is neuron pruning. However, pruning weights always leads to unstructured models, so the model cannot leverage the existing efficient BLAS libraries in practice. Therefore, it is difficult for weight pruning to achieve realistic speedup.

Filter Pruning. Concurrently with our work, some filter pruning strategies [Li et al., 2017; Liu et al., 2017; He et al., 2017; Luo et al., 2017] have been explored. Pruning filters leads to the removal of the corresponding feature maps, which not only reduces the storage usage on devices but also decreases the memory footprint to accelerate the inference. [Li et al., 2017] uses the ℓ1-norm to select unimportant filters and explores the sensitivity of layers for filter pruning. [Liu et al., 2017] introduces ℓ1 regularization on the scaling factors in batch normalization (BN) layers as a penalty term, and prunes channels with small scaling factors in BN layers. [Molchanov et al., 2017] proposes a Taylor-expansion-based pruning criterion to approximate the change in the cost function induced by pruning. [Luo et al., 2017] adopts the statistics of the next layer to guide the importance evaluation of filters. [He et al., 2017] proposes a LASSO-based channel selection strategy and a least-squares reconstruction algorithm to prune filters. However, for all these filter pruning methods, the representative capacity of the neural network after pruning is seriously limited by the smaller optimization space.

Discussion. To the best of our knowledge, there is only one approach that uses a soft manner to prune weights [Guo et al., 2016]. We would like to highlight our advantages compared to this approach as follows: (1) Our SFP focuses on filter pruning, whereas they focus on weight pruning. As discussed above, weight pruning approaches lack practical implementations to achieve realistic acceleration. (2) [Guo et al., 2016] paid more attention to model compression, whereas our approach achieves both compression and acceleration of the model. (3) Extensive experiments have been conducted to validate the effectiveness of our proposed approach both on large-scale datasets and on state-of-the-art CNN models. In contrast, [Guo et al., 2016] only reported experiments on AlexNet, which is more redundant than advanced models such as ResNet.

3 Methodology
3.1 Preliminaries

We formally introduce the symbols and notations in this section. The deep CNN network can be parameterized by {W(i) ∈ R^{N_{i+1} × N_i × K × K}, 1 ≤ i ≤ L}, where W(i) denotes the matrix of connection weights in the i-th layer, N_i denotes the number of input channels for the i-th convolutional layer, and L denotes the number of layers. The shapes of the input tensor U and the output tensor V are N_i × H_i × W_i and N_{i+1} × H_{i+1} × W_{i+1}, respectively. The convolutional operation of the i-th layer can be written as:

    V_{i,j} = F_{i,j} ∗ U,  for 1 ≤ j ≤ N_{i+1},    (1)

where F_{i,j} ∈ R^{N_i × K × K} represents the j-th filter of the i-th layer, W(i) consists of {F_{i,j}, 1 ≤ j ≤ N_{i+1}}, and V_{i,j} represents the j-th output feature map of the i-th layer.

Pruning filters removes the corresponding output feature maps. In this way, the computational cost of the neural network is reduced remarkably. Let us assume the pruning rate of SFP is P_i for the i-th layer. The number of filters of this layer is then reduced from N_{i+1} to N_{i+1}(1 − P_i), and the size of the output tensor V is thereby reduced to N_{i+1}(1 − P_i) × H_{i+1} × W_{i+1}. As the output tensor of the i-th layer is the input tensor of the (i+1)-th layer, the input size of the (i+1)-th layer is reduced as well, which achieves a higher acceleration ratio.
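To make the notation concrete, the following PyTorch sketch (our own illustration; the layer sizes and the pruning rate are arbitrary example values, not taken from the paper) shows how the weight tensor of a convolutional layer maps onto the symbols above and how the pruning rate P_i shrinks the number of effective filters.

```python
import torch
import torch.nn as nn

# i-th convolutional layer: N_i = 64 input channels, N_{i+1} = 128 filters, K = 3
conv_i = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)

# W(i) has shape N_{i+1} x N_i x K x K; each filter F_{i,j} has shape N_i x K x K
print(conv_i.weight.shape)      # torch.Size([128, 64, 3, 3])
print(conv_i.weight[0].shape)   # torch.Size([64, 3, 3])  -> F_{i,1}

# Input tensor U: N_i x H_i x W_i (a batch dimension is added in practice)
U = torch.randn(1, 64, 32, 32)
V = conv_i(U)                   # output tensor V: N_{i+1} x H_{i+1} x W_{i+1}
print(V.shape)                  # torch.Size([1, 128, 32, 32])

# With pruning rate P_i = 0.3, only N_{i+1} * (1 - P_i) filters stay non-zero
P_i = 0.3
n_pruned = int(128 * P_i)       # 38 filters are zeroized
n_kept = 128 - n_pruned         # 90 filters remain effective
print(n_pruned, n_kept)
```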
3.2 Soft Filter Pruning (SFP)

Most previous filter pruning works [Li et al., 2017; Liu et al., 2017; He et al., 2017; Luo et al., 2017] compress deep CNNs in a hard manner; we call them hard filter pruning. Typically, these algorithms first prune the filters of a single layer of a pre-trained model and fine-tune the pruned model to compensate for the degradation of performance. Then they prune the next layer and fine-tune the model again, until the last layer of the model is pruned. However, once filters are pruned, these approaches never update them again. Therefore, the model capacity is drastically reduced due to the removed filters, and such a hard pruning manner affects the performance of the compressed models negatively.

As summarized in Alg. 1, the proposed SFP algorithm dynamically removes the filters in a soft manner. Specifically, the key is to keep updating the pruned filters in the training stage. Such an updating manner brings several benefits. It not only keeps the model capacity of the compressed deep CNN model as large as that of the original model, but also avoids the greedy layer-by-layer pruning procedure and enables pruning almost all layers at the same time. More specifically, our approach can prune a model either in the process of training from scratch, or from a pre-trained model. In each training epoch, the full model is optimized and trained on the training data. After each epoch, the ℓ2-norm of every filter is computed for each weighted layer and used as the criterion of our filter selection strategy. Then we prune the selected filters by setting the corresponding filter weights to zero, which is followed by the next training epoch. Finally, the original deep CNN is pruned into a compact and efficient model. The details of SFP are explained in Alg. 1 and can be divided into the following four steps.

Algorithm 1 Algorithm Description of SFP
Input: training data X, pruning rate P_i, the model with parameters W = {W(i), 0 ≤ i ≤ L}
Initialize the model parameter W
for epoch = 1; epoch ≤ epoch_max; epoch++ do
    Update the model parameter W based on X
    for i = 1; i ≤ L; i++ do
        Calculate the ℓ2-norm of each filter ‖F_{i,j}‖_2, 1 ≤ j ≤ N_{i+1}
        Zeroize N_{i+1}·P_i filters by ℓ2-norm filter selection
    end for
end for
Obtain the compact model with parameters W* from W
Output: The compact model and its parameters W*
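A minimal PyTorch sketch of the schedule in Alg. 1 is given below. The function names (soft_prune_conv, soft_prune_model, sfp_train, train_one_epoch) and the hyper-parameter values are our own illustration, not the authors' released implementation; train_one_epoch stands for an ordinary supervised epoch (forward, loss, backward, optimizer step).

```python
import torch
import torch.nn as nn

def soft_prune_conv(weight: torch.Tensor, P: float) -> None:
    """Zeroize the N_{i+1}*P filters with the smallest l2-norm (in place)."""
    n_filters = weight.size(0)                            # N_{i+1}
    n_pruned = int(n_filters * P)
    if n_pruned == 0:
        return
    norms = weight.view(n_filters, -1).norm(p=2, dim=1)   # Eq. (2) with p = 2
    _, idx = torch.topk(norms, n_pruned, largest=False)   # smallest-norm filters
    with torch.no_grad():
        weight[idx] = 0.0      # soft pruning: zeroed, but still a trainable parameter

def soft_prune_model(model: nn.Module, P: float) -> None:
    # All weighted (convolutional) layers are pruned simultaneously.
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            soft_prune_conv(m.weight, P)

def sfp_train(model, loader, train_one_epoch, epochs=200, P=0.3):
    # Hypothetical training schedule mirroring Alg. 1 (one shared pruning rate P).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=5e-4)
    for epoch in range(epochs):
        train_one_epoch(model, loader, optimizer)   # update W on the training data X
        soft_prune_model(model, P)                  # zeroize filters after every epoch
    return model                                    # zero filters are removed afterwards
```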
Filter selection. We use the ℓp-norm to evaluate the importance of each filter, as in Eq. (2). In general, the convolutional results of a filter with a smaller ℓp-norm lead to relatively lower activation values, and thus have less numerical impact on the final prediction of the deep CNN model. In terms of this understanding, filters with small ℓp-norm are given higher priority of being pruned than those with higher ℓp-norm. Particularly, we use a pruning rate P_i to select N_{i+1}·P_i unimportant filters for the i-th weighted layer. In other words, the N_{i+1}·P_i filters with the lowest norm are selected, e.g., the blue filters in Fig. 2. In practice, the ℓ2-norm is used based on empirical analysis.

    \|F_{i,j}\|_p = \Big( \sum_{n=1}^{N_i} \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} |F_{i,j}(n, k_1, k_2)|^p \Big)^{1/p}    (2)
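As a quick numerical check of Eq. (2) (our own example), a filter F_{i,j} of shape N_i × K × K = 2 × 3 × 3 whose entries are all 0.5 has ℓ2-norm sqrt(18 × 0.25) ≈ 2.1213, which matches the norm of the flattened weight tensor used in practice:

```python
import torch

# Example filter F_{i,j} with N_i = 2 input channels and K = 3
F_ij = torch.full((2, 3, 3), 0.5)

# Eq. (2) with p = 2: sum |F_{i,j}(n, k1, k2)|^2 over all entries, then take the square root
l2_manual = F_ij.pow(2).sum().sqrt()
l2_builtin = F_ij.flatten().norm(p=2)

print(l2_manual.item(), l2_builtin.item())   # both approx. 2.1213 = sqrt(4.5)
```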
Filter Pruning. We set the values of the selected N_{i+1}·P_i filters to zero (see the filter pruning step in Fig. 2). This temporarily eliminates their contribution to the network output. Nevertheless, in the following training stage, we still allow these selected filters to be updated, in order to keep the representative capacity and the high performance of the model.

In the filter pruning step, we simply prune all the weighted layers at the same time. In this way, we can prune each filter in parallel, which costs negligible computation time. In contrast, previous filter pruning methods always conduct layer-by-layer greedy pruning. After pruning the filters of one single layer, existing methods always require training until the network converges [Luo et al., 2017; He et al., 2017]. This procedure costs much extra computation time, especially when the depth increases. Moreover, we use the same pruning rate for all weighted layers. Therefore, we need only one hyper-parameter P_i = P to balance acceleration and accuracy. This avoids the inconvenient hyper-parameter search or the complicated sensitivity analysis [Li et al., 2017]. As we allow the pruned filters to be updated, the model has a large model capacity and becomes more flexible, and thus can well balance the contribution of each filter to the final prediction.

Reconstruction. After the pruning step, we train the network for one epoch to reconstruct the pruned filters. As shown in Fig. 2, the pruned filters are updated to non-zero values by back-propagation. In this way, SFP allows the pruned model to have the same capacity as the original model during training. In contrast, hard filter pruning decreases the number of feature maps; the reduction of feature maps would dramatically reduce the model capacity and further harm the performance. Previous pruning methods usually require a pre-trained model and then fine-tune it. However, as we integrate the pruning step into the normal training schema, our approach can train the model from scratch, so the fine-tuning stage is no longer necessary for SFP. As we will show in the experiments, the network trained from scratch by SFP obtains results competitive with those of others trained from a well-trained model. By leveraging the pre-trained model, SFP obtains a much higher performance and advances the state-of-the-art.
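The toy check below (our own construction, with a single hypothetical convolutional layer and a dummy loss) illustrates this reconstruction property: because the zeroized filters remain ordinary trainable parameters and their gradients are not masked, one forward-backward pass is enough to make them non-zero again.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
optimizer = torch.optim.SGD(conv.parameters(), lr=0.1)

# "Filter pruning": zeroize filters 0 and 1 in place (soft manner, no gradient masking)
with torch.no_grad():
    conv.weight[[0, 1]] = 0.0
print(conv.weight[0].abs().sum().item())    # 0.0 -> the filter is currently pruned

# "Reconstruction": one ordinary training step also updates the pruned filters
x = torch.randn(4, 3, 16, 16)
loss = conv(x).pow(2).mean()                # dummy loss, for illustration only
loss.backward()
optimizer.step()
print(conv.weight[0].abs().sum().item())    # > 0 -> the filter has been updated from zero
```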
Figure 2: Overview of SFP. At the end of each training epoch, we prune the filters based on their importance evaluations. The filters are ranked by their ℓp-norms (purple rectangles) and the small ones (blue circles) are selected to be pruned. After filter pruning, the model undergoes a reconstruction process in which the pruned filters are capable of being reconstructed (i.e., updated from zeros) by the forward-backward process. (a): filter instantiations before pruning. (b): filter instantiations after pruning. (c): filter instantiations after reconstruction.

Obtaining Compact Model. SFP iterates over the filter selection, filter pruning and reconstruction steps. After the model has converged, we obtain a sparse model containing many "zero filters". One "zero filter" corresponds to one feature map; the feature maps corresponding to those "zero filters" will always be zero during the inference procedure, so removing these filters as well as the corresponding feature maps has no influence. Specifically, for the pruning rate P_i in the i-th layer, only N_{i+1}(1 − P_i) filters are non-zero and have an effect on the final prediction. Considering that the previous layer is pruned as well, the number of input channels of the i-th layer changes from N_i to N_i(1 − P_{i−1}). We can thus re-build the i-th layer into a smaller one. Finally, a compact model {W*(i) ∈ R^{N_{i+1}(1−P_i) × N_i(1−P_{i−1}) × K × K}} is obtained.
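A sketch of how the compact model could be read off after convergence (our own helper, not the authors' released code): filters whose weights are entirely zero are counted per layer, and the surviving filters define the shape N_{i+1}(1 − P_i) × N_i(1 − P_{i−1}) × K × K of the rebuilt layer. The input-channel bookkeeping below is a simplification that ignores branching structures such as residual shortcuts.

```python
import torch
import torch.nn as nn

def compact_shapes(model: nn.Module):
    """Report, for every Conv2d layer, how many filters survive soft pruning."""
    shapes = []
    kept_prev = None                                    # N_i(1 - P_{i-1}) of the previous layer
    for name, m in model.named_modules():
        if not isinstance(m, nn.Conv2d):
            continue
        w = m.weight.detach()
        n_out, n_in, k, _ = w.shape
        nonzero = (w.view(n_out, -1).abs().sum(dim=1) > 0)
        kept = int(nonzero.sum())                       # N_{i+1}(1 - P_i)
        in_kept = kept_prev if kept_prev is not None else n_in
        shapes.append((name, (kept, in_kept, k, k)))    # shape of W*(i)
        kept_prev = kept                                # simplification: plain layer chain only
    return shapes

# Usage sketch, after SFP training has converged:
# for name, shape in compact_shapes(model):
#     print(name, shape)
```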
3.3 Computation Complexity Analysis

Theoretical speedup analysis. Suppose the filter pruning rate of the i-th layer is P_i, which means that N_{i+1} × P_i filters are set to zero and pruned from the layer while the other N_{i+1} × (1 − P_i) filters remain unchanged, and suppose the sizes of the input and output feature maps of the i-th layer are H_i × W_i and H_{i+1} × W_{i+1}. After filter pruning, the dimension of the useful output feature maps of the i-th layer decreases from N_{i+1} × H_{i+1} × W_{i+1} to N_{i+1}(1 − P_i) × H_{i+1} × W_{i+1}. Note that the output of the i-th layer is the input of the (i+1)-th layer. If we further prune the (i+1)-th layer with a filter pruning rate P_{i+1}, the calculation of the (i+1)-th layer decreases from N_{i+2} × N_{i+1} × k^2 × H_{i+2} × W_{i+2} to N_{i+2}(1 − P_{i+1}) × N_{i+1}(1 − P_i) × k^2 × H_{i+2} × W_{i+2}. In other words, a proportion of 1 − (1 − P_{i+1}) × (1 − P_i) of the original calculation is reduced, which makes the neural network inference much faster.
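The reduction can be made concrete with a small calculation (our own example numbers, not figures from the paper): for two consecutive layers both pruned at rate 0.3, roughly 1 − 0.7 × 0.7 = 51% of the second layer's convolution computation is removed.

```python
def conv_flops(n_out, n_in, k, h_out, w_out):
    # Multiply-accumulates of one convolutional layer:
    # N_{i+2} * N_{i+1} * k^2 * H_{i+2} * W_{i+2} in the paper's notation
    return n_out * n_in * k * k * h_out * w_out

# Example layer: 256 -> 256 channels, 3x3 kernels, 14x14 output maps
full = conv_flops(256, 256, 3, 14, 14)

P_i, P_ip1 = 0.3, 0.3
pruned = conv_flops(int(256 * (1 - P_ip1)), int(256 * (1 - P_i)), 3, 14, 14)

reduction = 1 - pruned / full
print(full, pruned, round(reduction, 4))   # reduction approx. 1 - (1-P_i)(1-P_{i+1}) = 0.51
```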
Realistic speedup analysis. In the theoretical speedup analysis, operations other than convolution, such as batch normalization (BN) and pooling, are negligible compared to the convolution operations. Therefore, we consider the FLOPs of convolution operations for the computation complexity comparison, which is common in previous work [Li et al., 2017; Luo et al., 2017]. However, reduced FLOPs cannot bring the same level of realistic speedup because non-tensor layers (e.g., BN and pooling layers) also need inference time on the GPU [Luo et al., 2017]. In addition, the limitations of IO delay, buffer switching and the efficiency of BLAS libraries also lead to a wide gap between the theoretical and realistic speedup ratios. We compare the theoretical and realistic speedup in Section 4.3.

4 Evaluation and Results

4.1 Benchmark Datasets and Experimental Setting

Our method is evaluated on two benchmarks: CIFAR-10 [Krizhevsky and Hinton, 2009] and ILSVRC-2012 [Russakovsky et al., 2015]. The CIFAR-10 dataset contains 50,000 training images and 10,000 testing images, which are categorized into 10 classes. ILSVRC-2012 is a large-scale dataset containing 1.28 million training images and 50k validation images of 1,000 classes. Following the common setting in [Luo et al., 2017; He et al., 2017; Dong et al., 2017a], we focus on pruning the challenging ResNet models in this paper. SFP should also be effective on different computer vision tasks, such as [Kang et al., 2017; Ren et al., 2015; Dong et al., 2018; Shen et al., 2018b; Yang et al., 2010; Shen et al., 2018a; Dong et al., 2017b], and we will explore this in the future.

In the CIFAR-10 experiments, we use the default parameter setting of [He et al., 2016b] and follow the training schedule in [Zagoruyko and Komodakis, 2016]. On ILSVRC-2012, we follow the same parameter settings as [He et al., 2016a; He et al., 2016b]. We use the same data augmentation strategies as the PyTorch official examples [Paszke et al., 2017]. We conduct our SFP operation at the end of every training epoch. For pruning a model from scratch, we use the normal training schedule. For pruning a pre-trained model, we reduce the learning rate by a factor of 10 compared to the schedule for the scratch model. We run each experiment three times and report the "mean ± std". We compare the performance with other state-of-the-art acceleration algorithms, e.g., [Dong et al., 2017a; Li et al., 2017; He et al., 2017; Luo et al., 2017].
| Depth | Method | Fine-tune? | Baseline Accu. (%) | Accelerated Accu. (%) | Accu. Drop (%) | FLOPs | Pruned FLOPs (%) |
|---|---|---|---|---|---|---|---|
| 20 | [Dong et al., 2017a] | N | 91.53 | 91.43 | 0.10 | 3.20E7 | 20.3 |
| 20 | Ours (10%) | N | 92.20 ± 0.18 | 92.24 ± 0.33 | -0.04 | 3.44E7 | 15.2 |
| 20 | Ours (20%) | N | 92.20 ± 0.18 | 91.20 ± 0.30 | 1.00 | 2.87E7 | 29.3 |
| 20 | Ours (30%) | N | 92.20 ± 0.18 | 90.83 ± 0.31 | 1.37 | 2.43E7 | 42.2 |
| 32 | [Dong et al., 2017a] | N | 92.33 | 90.74 | 1.59 | 4.70E7 | 31.2 |
| 32 | Ours (10%) | N | 92.63 ± 0.70 | 93.22 ± 0.09 | -0.59 | 5.86E7 | 14.9 |
| 32 | Ours (20%) | N | 92.63 ± 0.70 | 92.63 ± 0.37 | 0.00 | 4.90E7 | 28.8 |
| 32 | Ours (30%) | N | 92.63 ± 0.70 | 92.08 ± 0.08 | 0.55 | 4.03E7 | 41.5 |
| 56 | [Li et al., 2017] | N | 93.04 | 91.31 | 1.75 | 9.09E7 | 27.6 |
| 56 | [Li et al., 2017] | Y | 93.04 | 93.06 | -0.02 | 9.09E7 | 27.6 |
| 56 | [He et al., 2017] | N | 92.80 | 90.90 | 1.90 | - | 50.0 |
| 56 | [He et al., 2017] | Y | 92.80 | 91.80 | 1.00 | - | 50.0 |
| 56 | Ours (10%) | N | 93.59 ± 0.58 | 93.89 ± 0.19 | -0.30 | 1.070E8 | 14.7 |
| 56 | Ours (20%) | N | 93.59 ± 0.58 | 93.47 ± 0.24 | 0.12 | 8.98E7 | 28.4 |
| 56 | Ours (30%) | N | 93.59 ± 0.58 | 93.10 ± 0.20 | 0.49 | 7.40E7 | 41.1 |
| 56 | Ours (30%) | Y | 93.59 ± 0.58 | 93.78 ± 0.22 | -0.19 | 7.40E7 | 41.1 |
| 56 | Ours (40%) | N | 93.59 ± 0.58 | 92.26 ± 0.31 | 1.33 | 5.94E7 | 52.6 |
| 56 | Ours (40%) | Y | 93.59 ± 0.58 | 93.35 ± 0.31 | 0.24 | 5.94E7 | 52.6 |
| 110 | [Li et al., 2017] | N | 93.53 | 92.94 | 0.61 | 1.55E8 | 38.6 |
| 110 | [Li et al., 2017] | Y | 93.53 | 93.30 | 0.20 | 1.55E8 | 38.6 |
| 110 | [Dong et al., 2017a] | N | 93.63 | 93.44 | 0.19 | - | 34.2 |
| 110 | Ours (10%) | N | 93.68 ± 0.32 | 93.83 ± 0.19 | -0.15 | 2.16E8 | 14.6 |
| 110 | Ours (20%) | N | 93.68 ± 0.32 | 93.93 ± 0.41 | -0.25 | 1.82E8 | 28.2 |
| 110 | Ours (30%) | N | 93.68 ± 0.32 | 93.38 ± 0.30 | 0.30 | 1.50E8 | 40.8 |
| 110 | Ours (30%) | Y | 93.68 ± 0.32 | 93.86 ± 0.21 | -0.18 | 1.50E8 | 40.8 |

Table 1: Comparison of pruning ResNet on CIFAR-10. In the "Fine-tune?" column, "Y" and "N" indicate whether or not the pre-trained model is used as initialization. The "Accu. Drop" is the accuracy of the baseline model minus that of the accelerated model, so a negative number means the accelerated model has a higher accuracy than the baseline model. A smaller "Accu. Drop" is better.

4.2 ResNet on CIFAR-10

Settings. For the CIFAR-10 dataset, we test our SFP on ResNet-20, 32, 56 and 110. We use several different pruning rates, and also analyze the difference between using the pre-trained model and training from scratch.

Results. Tab. 1 shows the results. Our SFP achieves a better performance than the other state-of-the-art hard filter pruning methods. For example, [Li et al., 2017] use the hard pruning method to accelerate ResNet-110 at a 38.6% speedup ratio with a 0.61% accuracy drop without fine-tuning; when using the pre-trained model and fine-tuning, the accuracy drop becomes 0.20%. In contrast, we can accelerate the inference of ResNet-110 to a 40.8% speed-up with only a 0.30% accuracy drop without fine-tuning. When using the pre-trained model, we even outperform the original model by 0.18% with more than 40% FLOPs reduced.

These results validate the effectiveness of SFP, which can produce a more compressed model with comparable performance to the original model.

4.3 ResNet on ILSVRC-2012

Settings. For the ILSVRC-2012 dataset, we test our SFP on ResNet-18, 34, 50 and 101, and we use the same pruning rate of 30% for all the models. All the convolutional layers of ResNet are pruned with the same pruning rate at the same time. (We do not prune the projection shortcuts for simplification, as they only need negligible time and do not affect the overall cost.)

Results. Tab. 2 shows that SFP outperforms other state-of-the-art methods. For ResNet-34, SFP without fine-tuning achieves more inference speedup than the hard pruning method [Luo et al., 2017], and the accuracy of our pruned model exceeds theirs by 2.57%. Moreover, for pruning a pre-trained ResNet-101, SFP reduces more than 40% FLOPs of the model with even a 0.2% top-5 accuracy increase, which is the state-of-the-art result. In contrast, performance degradation is inevitable for hard filter pruning methods. The maintained model capacity of SFP is the main reason for the superior performance. In addition, the non-greedy all-layer pruning method may have a better performance than the locally optimal solution obtained from previous greedy pruning, which seems to be another reason. Occasionally, large performance degradation happens for the pre-trained model (e.g., a 14.01% top-1 accuracy drop for ResNet-50). This will be explored in our future work.

To test the realistic speedup ratio, we measure the forward time of the pruned models on one GTX 1080 GPU with a batch size of 64 (shown in Tab. 3). The gap between the theoretical and realistic speedup may come from the limitation of IO delay, buffer switching and the efficiency of BLAS libraries.
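The sketch below shows one plausible way such forward-time numbers could be measured (our own script; the paper's exact measurement protocol is not specified beyond the GPU and batch size): timed forward passes with batch size 64, warm-up iterations excluded, and CUDA synchronization around the timed region.

```python
import time
import torch
import torchvision

def forward_time_ms(model, batch=64, size=224, iters=50, warmup=10, device="cuda"):
    model = model.to(device).eval()
    x = torch.randn(batch, 3, size, size, device=device)
    with torch.no_grad():
        for _ in range(warmup):                # warm-up passes, not timed
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()               # wait for all kernels to finish
    return (time.time() - start) / iters * 1000.0

# Usage sketch: compare the original model with its rebuilt compact counterpart
# print(forward_time_ms(torchvision.models.resnet18()))
# print(forward_time_ms(compact_resnet18))    # hypothetical pruned, rebuilt model
```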

4.4 Ablation Study

We conducted extensive ablation studies to further analyze each component of SFP.
| Depth | Method | Fine-tune? | Top-1 Accu. Baseline (%) | Top-1 Accu. Accelerated (%) | Top-5 Accu. Baseline (%) | Top-5 Accu. Accelerated (%) | Top-1 Accu. Drop (%) | Top-5 Accu. Drop (%) | Pruned FLOPs (%) |
|---|---|---|---|---|---|---|---|---|---|
| 18 | [Dong et al., 2017a] | N | 69.98 | 66.33 | 89.24 | 86.94 | 3.65 | 2.30 | 34.6 |
| 18 | Ours (30%) | N | 70.28 | 67.10 | 89.63 | 87.78 | 3.18 | 1.85 | 41.8 |
| 34 | [Dong et al., 2017a] | N | 73.42 | 72.99 | 91.36 | 91.19 | 0.43 | 0.17 | 24.8 |
| 34 | [Li et al., 2017] | Y | 73.23 | 72.17 | - | - | 1.06 | - | 24.2 |
| 34 | Ours (30%) | N | 73.92 | 71.83 | 91.62 | 90.33 | 2.09 | 1.29 | 41.1 |
| 50 | [He et al., 2017] | Y | - | - | 92.20 | 90.80 | - | 1.40 | 50.0 |
| 50 | [Luo et al., 2017] | Y | 72.88 | 72.04 | 91.14 | 90.67 | 0.84 | 0.47 | 36.7 |
| 50 | Ours (30%) | N | 76.15 | 74.61 | 92.87 | 92.06 | 1.54 | 0.81 | 41.8 |
| 50 | Ours (30%) | Y | 76.15 | 62.14 | 92.87 | 84.60 | 14.01 | 8.27 | 41.8 |
| 101 | Ours (30%) | N | 77.37 | 77.03 | 93.56 | 93.46 | 0.34 | 0.10 | 42.2 |
| 101 | Ours (30%) | Y | 77.37 | 77.51 | 93.56 | 93.71 | -0.14 | -0.20 | 42.2 |

Table 2: Comparison of pruning ResNet on ImageNet. "Fine-tune?" and "Accu. Drop" have the same meaning as in Tab. 1.

| Model | Baseline time (ms) | Pruned time (ms) | Realistic Speed-up (%) | Theoretical Speed-up (%) |
|---|---|---|---|---|
| ResNet-18 | 37.10 | 26.97 | 27.4 | 41.8 |
| ResNet-34 | 63.97 | 45.14 | 29.4 | 41.1 |
| ResNet-50 | 135.01 | 94.66 | 29.8 | 41.8 |
| ResNet-101 | 219.71 | 148.64 | 32.3 | 42.2 |

Table 3: Comparison of the theoretical and realistic speedup. We only count the time consumption of the forward procedure.
performance if we fine-tune this parameter.
Figure 3: Accuracy of ResNet-110 on CIFAR-10 regarding different hyper-parameters: (a) different pruning rates; (b) different SFP intervals. (Solid lines and shadows denote the mean and standard deviation of three experiments, respectively.)
improvement, although more hyper-parameters are needed.
Filter Selection Criteria. Magnitude-based criteria such as the ℓp-norm are widely used for filter selection because their computational cost is small [Li et al., 2017]. We compare the ℓ2-norm and the ℓ1-norm. For the ℓ1-norm criterion, the accuracy of the model under pruning rates 10%, 20% and 30% is 93.68±0.60%, 93.68±0.76% and 93.34±0.12%, respectively, while for the ℓ2-norm criterion it is 93.89±0.19%, 93.93±0.41% and 93.38±0.30%, respectively. The performance of the ℓ2-norm criterion is slightly better than that of the ℓ1-norm criterion. The result of the ℓ2-norm is dominated by the largest elements, while the result of the ℓ1-norm is also largely affected by the other small elements. Therefore, filters with some large weights are preserved by the ℓ2-norm criterion, so the corresponding discriminative features are kept and the performance of the pruned model is better.
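A tiny numerical illustration of this point (our own example values): a filter with one large weight and a filter with several small weights can be ranked differently by the two norms, so the ℓ2-norm criterion preserves the filter containing the large, potentially discriminative weight.

```python
import torch

a = torch.tensor([3.0, 0.0, 0.0])   # one large weight
b = torch.tensor([1.2, 1.2, 1.2])   # several small weights

print(a.norm(p=1).item(), b.norm(p=1).item())   # 3.0 vs 3.6 -> l1 would prune filter a first
print(a.norm(p=2).item(), b.norm(p=2).item())   # 3.0 vs approx. 2.08 -> l2 would prune filter b first
```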
Varying pruning rates. To comprehensively understand SFP, we test the accuracy of different pruning rates for ResNet-110, as shown in Fig. 3(a). As the pruning rate increases, the accuracy of the pruned model first rises above that of the baseline model and then drops approximately linearly. For pruning rates between 0% and about 23%, the accuracy of the accelerated model is higher than that of the baseline model. This shows that SFP has a regularization effect on the neural network, because SFP reduces the over-fitting of the model.

Sensitivity of SFP interval. By default, we conduct our SFP operation at the end of every training epoch. However, different SFP intervals may lead to different performance, so we explore the sensitivity of the SFP interval. We use ResNet-110 under a pruning rate of 30% as a baseline, and change the SFP interval from one epoch to ten epochs, as shown in Fig. 3(b). The model accuracy shows no large fluctuation over the different SFP intervals. Moreover, the model accuracy of most (80%) of the intervals surpasses that of the one-epoch interval. Therefore, we could even achieve a better performance by fine-tuning this parameter.

Selection of pruned layers. Previous works always prune a portion of the layers of the network, and different layers always have different pruning rates. For example, [Li et al., 2017] only prunes insensitive layers, [Luo et al., 2017] skips the last layer of every block of the ResNet, and [Luo et al., 2017] prunes more aggressively for shallower layers and less for deep layers. Similarly, we compare the performance of pruning the first and the second layer of all basic blocks of ResNet-110, with the pruning rate set to 30%. The model with all the first layers of the blocks pruned has an accuracy of 93.96 ± 0.13%, while that with the second layers of the blocks pruned has an accuracy of 93.38 ± 0.44%. Therefore, different layers have different sensitivity to SFP, and a careful selection of pruned layers could potentially lead to performance improvement, although more hyper-parameters would be needed.

5 Conclusion and Future Work

In this paper, we propose a soft filter pruning (SFP) approach to accelerate deep CNNs. During the training procedure, SFP allows the pruned filters to be updated. This soft manner can maintain the model capacity and thus achieve superior performance. Remarkably, SFP can achieve competitive performance compared to the state-of-the-art without the pre-trained model. Moreover, by leveraging the pre-trained model, SFP achieves a better result and advances the state-of-the-art. Furthermore, SFP can be combined with other acceleration algorithms, e.g., matrix decomposition and low-precision weights, to further improve the performance.
Acknowledgments

Yi Yang is the recipient of a Google Faculty Research Award. We acknowledge the Data to Decisions CRC (D2D CRC), the Cooperative Research Centres Programme and ARC's DECRA (project DE170101415) for funding this research. We thank Amazon for the AWS Cloud Credits.
References

[Dong et al., 2017a] Xuanyi Dong, Junshi Huang, Yi Yang, and Shuicheng Yan. More is less: A more complicated network with less inference complexity. In CVPR, 2017.

[Dong et al., 2017b] Xuanyi Dong, Deyu Meng, Fan Ma, and Yi Yang. A dual-network progressive approach to weakly supervised object detection. In ACM Multimedia, 2017.

[Dong et al., 2018] Xuanyi Dong, Shoou-I Yu, Xinshuo Weng, Shih-En Wei, Yi Yang, and Yaser Sheikh. Supervision-by-Registration: An unsupervised approach to improve the precision of facial landmark detectors. In CVPR, 2018.

[Guo et al., 2016] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In NIPS, 2016.

[Han et al., 2015a] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2015.

[Han et al., 2015b] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.

[He et al., 2016a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[He et al., 2016b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.

[He et al., 2017] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.

[Jaderberg et al., 2014] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.

[Kang et al., 2017] Guoliang Kang, Jun Li, and Dacheng Tao. Shakeout: A new approach to regularized deep neural network training. IEEE T-PAMI, 2017.

[Krizhevsky and Hinton, 2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[Lebedev and Lempitsky, 2016] Vadim Lebedev and Victor Lempitsky. Fast ConvNets using group-wise brain damage. In CVPR, 2016.

[Li et al., 2017] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient ConvNets. In ICLR, 2017.

[Liu et al., 2017] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.

[Luo et al., 2017] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In ICCV, 2017.

[Molchanov et al., 2017] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient transfer learning. In ICLR, 2017.

[Paszke et al., 2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

[Ren et al., 2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.

[Russakovsky et al., 2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.

[Shen et al., 2018a] Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. In AAAI, 2018.

[Shen et al., 2018b] Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. Bi-directional block self-attention for fast and memory-efficient sequence modeling. In ICLR, 2018.

[Szegedy et al., 2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.

[Tai et al., 2016] Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with low-rank regularization. In ICLR, 2016.

[Wen et al., 2016] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.

[Yang et al., 2010] Yi Yang, Dong Xu, Feiping Nie, Shuicheng Yan, and Yueting Zhuang. Image clustering using local discriminant models and global integration. IEEE T-IP, 2010.

[Zagoruyko and Komodakis, 2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.

[Zhang et al., 2016] Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. IEEE T-PAMI, 2016.

[Zhou et al., 2017] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. In ICLR, 2017.

[Zhu et al., 2017] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. In ICLR, 2017.
