Figure 2: Overview of SFP. At the end of each training epoch, we prune the filters based on their importance evaluations. The filters are ranked by their ℓp-norms (purple rectangles) and the small ones (blue circles) are selected to be pruned. After filter pruning, the model undergoes a reconstruction process in which the pruned filters can be reconstructed (i.e., updated from zeros) by the forward-backward process. (a): filter instantiations before pruning. (b): filter instantiations after pruning. (c): filter instantiations after reconstruction.
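To make the pruning step in Figure 2 concrete, below is a minimal PyTorch-style sketch (not the authors' released code): it ranks the filters of a convolutional layer by their ℓ2-norms and sets the smallest fraction to zero, leaving them in the model so they can still be updated in later epochs. The helper names soft_prune_layer and soft_prune_model, the uniform pruning rate across layers, and the choice p = 2 are our assumptions for illustration.

```python
import torch
import torch.nn as nn

def soft_prune_layer(conv: nn.Conv2d, prune_rate: float, p: float = 2.0) -> None:
    """Zero out the filters of `conv` with the smallest l_p-norms (soft pruning).

    Filters are only set to zero, not removed, so they can still be
    "reconstructed" (updated from zero) by subsequent forward-backward passes.
    """
    with torch.no_grad():
        weight = conv.weight                      # shape: (N_out, N_in, K, K)
        num_filters = weight.size(0)
        num_pruned = int(num_filters * prune_rate)
        if num_pruned == 0:
            return
        # l_p-norm of each filter, computed over its (N_in, K, K) entries.
        norms = weight.view(num_filters, -1).norm(p=p, dim=1)
        _, smallest = torch.topk(norms, num_pruned, largest=False)
        weight[smallest] = 0.0                    # soft-pruned filters stay in the model

def soft_prune_model(model: nn.Module, prune_rate: float) -> None:
    """Apply soft pruning to every convolutional layer with the same rate."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            soft_prune_layer(module, prune_rate)
```

In a training loop, soft_prune_model would be called once at the end of each epoch, matching the schedule described in Section 4.1; the zeroed filters are free to become non-zero again during the next epoch.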
fine-tuning stage is no longer necessary for SFP. As we will show in the experiments, the network trained from scratch by SFP can obtain results competitive with those of other methods that start from a well-trained model. By leveraging the pre-trained model, SFP obtains a much higher performance and advances the state-of-the-art.

Obtaining Compact Model. SFP iterates over the filter selection, filter pruning and reconstruction steps. After the model converges, we obtain a sparse model containing many "zero filters". One "zero filter" corresponds to one feature map, and the feature maps corresponding to those "zero filters" are always zero during inference. Removing these filters, together with the corresponding feature maps, therefore has no influence on the output. Specifically, for the pruning rate P_i in the i-th layer, only N_{i+1}(1 − P_i) filters are non-zero and have an effect on the final prediction. If the previous layer is pruned as well, the number of input channels of the i-th layer changes from N_i to N_i(1 − P_{i−1}). We can thus re-build the i-th layer into a smaller one. Finally, a compact model {W^∗(i) ∈ R^{N_{i+1}(1−P_i) × N_i(1−P_{i−1}) × K × K}} is obtained.

3.3 Computation Complexity Analysis

Theoretical speedup analysis. Suppose the filter pruning rate of the i-th layer is P_i, which means that N_{i+1} × P_i filters are set to zero and pruned from the layer while the other N_{i+1} × (1 − P_i) filters remain unchanged, and suppose the sizes of the input and output feature maps of the i-th layer are H_i × W_i and H_{i+1} × W_{i+1}. After filter pruning, the dimension of the useful output feature map of the i-th layer decreases from N_{i+1} × H_{i+1} × W_{i+1} to N_{i+1}(1 − P_i) × H_{i+1} × W_{i+1}. Note that the output of the i-th layer is the input of the (i+1)-th layer. If we further prune the (i+1)-th layer with a filter pruning rate P_{i+1}, the computation of the (i+1)-th layer decreases from N_{i+2} × N_{i+1} × k^2 × H_{i+2} × W_{i+2} to N_{i+2}(1 − P_{i+1}) × N_{i+1}(1 − P_i) × k^2 × H_{i+2} × W_{i+2}. In other words, a proportion of 1 − (1 − P_{i+1}) × (1 − P_i) of the original computation is reduced, which makes the neural network inference much faster.

Realistic speedup analysis. In the theoretical speedup analysis, other operations such as batch normalization (BN) and pooling are negligible compared to convolution operations. Therefore, we consider the FLOPs of convolution operations for the computation complexity comparison, which is commonly used in previous work [Li et al., 2017; Luo et al., 2017]. However, reduced FLOPs cannot bring the same level of realistic speedup because non-tensor layers (e.g., BN and pooling layers) also consume inference time on the GPU [Luo et al., 2017]. In addition, the limitations of IO delay, buffer switching and the efficiency of BLAS libraries also lead to a wide gap between the theoretical and realistic speedup ratios. We compare the theoretical and realistic speedup in Section 4.3.

4 Evaluation and Results

4.1 Benchmark Datasets and Experimental Setting

Our method is evaluated on two benchmarks: CIFAR-10 [Krizhevsky and Hinton, 2009] and ILSVRC-2012 [Russakovsky et al., 2015]. The CIFAR-10 dataset contains 50,000 training images and 10,000 testing images, categorized into 10 classes. ILSVRC-2012 is a large-scale dataset containing 1.28 million training images and 50k validation images of 1,000 classes. Following the common setting in [Luo et al., 2017; He et al., 2017; Dong et al., 2017a], we focus on pruning the challenging ResNet models in this paper. SFP should also be effective on other computer vision tasks, such as [Kang et al., 2017; Ren et al., 2015; Dong et al., 2018; Shen et al., 2018b; Yang et al., 2010; Shen et al., 2018a; Dong et al., 2017b], and we will explore this in future work.

In the CIFAR-10 experiments, we use the default parameter setting of [He et al., 2016b] and follow the training schedule in [Zagoruyko and Komodakis, 2016]. On ILSVRC-2012, we follow the same parameter settings as [He et al., 2016a; He et al., 2016b]. We use the same data augmentation strategies as the PyTorch official examples [Paszke et al., 2017]. We conduct our SFP operation at the end of every training epoch. For pruning a scratch model, we use the normal training schedule. For pruning a pre-trained model, we reduce the learning rate by a factor of 10 compared to the schedule for the scratch model. We run each experiment three times and report "mean ± std". We compare the performance with other state-of-the-art acceleration algorithms, e.g., [Dong et al., 2017a; Li et al., 2017; He et al., 2017; Luo et al., 2017].
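As a concrete check of the arithmetic in the "Obtaining Compact Model" paragraph and the theoretical speedup analysis of Section 3.3, the following small helper computes the compact weight shape and the FLOPs reduction of one layer. The function pruned_layer_stats is our own, hypothetical name, and truncating channel counts to integers is our assumption; the paper simply writes the products N_{i+1}(1 − P_i), etc.

```python
def pruned_layer_stats(n_in, n_out, k, h_out, w_out, p_prev, p_cur):
    """Weight shape and conv FLOPs of one layer before/after removing zero filters.

    n_in, n_out   : input / output channel counts (N_i, N_{i+1})
    k             : kernel size (k x k)
    h_out, w_out  : spatial size of the output feature map
    p_prev, p_cur : pruning rates of the previous and current layer (P_{i-1}, P_i)
    """
    in_kept = int(n_in * (1 - p_prev))     # N_i (1 - P_{i-1}) surviving input channels
    out_kept = int(n_out * (1 - p_cur))    # N_{i+1} (1 - P_i) surviving filters
    flops_full = n_out * n_in * k * k * h_out * w_out
    flops_compact = out_kept * in_kept * k * k * h_out * w_out
    return {
        "compact_weight_shape": (out_kept, in_kept, k, k),
        "flops_reduction": 1.0 - flops_compact / flops_full,  # ~ 1 - (1 - P_i)(1 - P_{i-1})
    }

# Example: a 3x3 layer mapping 64 -> 128 channels on a 16x16 output map,
# with both this layer and the previous one soft-pruned at 30%.
stats = pruned_layer_stats(n_in=64, n_out=128, k=3, h_out=16, w_out=16,
                           p_prev=0.3, p_cur=0.3)
print(stats["compact_weight_shape"])        # (89, 44, 3, 3)
print(round(stats["flops_reduction"], 3))   # 0.522, close to 1 - 0.7 * 0.7 = 0.51
```

The small gap between 0.522 and 0.51 in this example comes only from rounding the surviving channel counts down to integers.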
Depth  Method                Fine-tune?  Baseline Accu. (%)  Accelerated Accu. (%)  Accu. Drop (%)  FLOPs    Pruned FLOPs (%)
20     [Dong et al., 2017a]  N           91.53               91.43                  0.10            3.20E7   20.3
20     Ours(10%)             N           92.20 ± 0.18        92.24 ± 0.33           -0.04           3.44E7   15.2
20     Ours(20%)             N           92.20 ± 0.18        91.20 ± 0.30           1.00            2.87E7   29.3
20     Ours(30%)             N           92.20 ± 0.18        90.83 ± 0.31           1.37            2.43E7   42.2
32     [Dong et al., 2017a]  N           92.33               90.74                  1.59            4.70E7   31.2
32     Ours(10%)             N           92.63 ± 0.70        93.22 ± 0.09           -0.59           5.86E7   14.9
32     Ours(20%)             N           92.63 ± 0.70        92.63 ± 0.37           0.00            4.90E7   28.8
32     Ours(30%)             N           92.63 ± 0.70        92.08 ± 0.08           0.55            4.03E7   41.5
56     [Li et al., 2017]     N           93.04               91.31                  1.75            9.09E7   27.6
56     [Li et al., 2017]     Y           93.04               93.06                  -0.02           9.09E7   27.6
56     [He et al., 2017]     N           92.80               90.90                  1.90            -        50.0
56     [He et al., 2017]     Y           92.80               91.80                  1.00            -        50.0
56     Ours(10%)             N           93.59 ± 0.58        93.89 ± 0.19           -0.30           1.07E8   14.7
56     Ours(20%)             N           93.59 ± 0.58        93.47 ± 0.24           0.12            8.98E7   28.4
56     Ours(30%)             N           93.59 ± 0.58        93.10 ± 0.20           0.49            7.40E7   41.1
56     Ours(30%)             Y           93.59 ± 0.58        93.78 ± 0.22           -0.19           7.40E7   41.1
56     Ours(40%)             N           93.59 ± 0.58        92.26 ± 0.31           1.33            5.94E7   52.6
56     Ours(40%)             Y           93.59 ± 0.58        93.35 ± 0.31           0.24            5.94E7   52.6
110    [Li et al., 2017]     N           93.53               92.94                  0.61            1.55E8   38.6
110    [Li et al., 2017]     Y           93.53               93.30                  0.20            1.55E8   38.6
110    [Dong et al., 2017a]  N           93.63               93.44                  0.19            -        34.2
110    Ours(10%)             N           93.68 ± 0.32        93.83 ± 0.19           -0.15           2.16E8   14.6
110    Ours(20%)             N           93.68 ± 0.32        93.93 ± 0.41           -0.25           1.82E8   28.2
110    Ours(30%)             N           93.68 ± 0.32        93.38 ± 0.30           0.30            1.50E8   40.8
110    Ours(30%)             Y           93.68 ± 0.32        93.86 ± 0.21           -0.18           1.50E8   40.8
Table 1: Comparison of pruning ResNet on CIFAR-10. In the "Fine-tune?" column, "Y" and "N" indicate whether the pre-trained model is used as the initialization or not, respectively. "Accu. Drop" is the accuracy of the baseline model minus that of the accelerated model, so a negative number means the accelerated model has a higher accuracy than the baseline model. A smaller "Accu. Drop" is better.
4.2 ResNet on CIFAR-10

Settings. For the CIFAR-10 dataset, we test our SFP on ResNet-20, 32, 56 and 110. We use several different pruning rates, and also analyze the difference between starting from a pre-trained model and training from scratch.

Results. Tab. 1 shows the results. Our SFP achieves a better performance than the other state-of-the-art hard filter pruning methods. For example, [Li et al., 2017] use hard pruning to accelerate ResNet-110 by pruning 38.6% of its FLOPs, at a 0.61% accuracy drop without fine-tuning. When using the pre-trained model and fine-tuning, their accuracy drop becomes 0.20%. In contrast, we can prune 40.8% of the FLOPs of ResNet-110 with only a 0.30% accuracy drop and without fine-tuning. When using the pre-trained model, we can even outperform the original model by 0.18% while reducing more than 40% of the FLOPs. These results validate the effectiveness of SFP, which can produce a more compressed model with performance comparable to the original model.

method [Luo et al., 2017], but the accuracy of our pruned model exceeds their model by 2.57%. Moreover, for pruning a pre-trained ResNet-101, SFP reduces more than 40% of the FLOPs of the model with even a 0.2% top-5 accuracy increase, which is the state-of-the-art result. In contrast, performance degradation is inevitable for hard filter pruning methods. The maintained model capacity of SFP is the main reason for its superior performance. In addition, the non-greedy all-layer pruning method may have a better performance than the locally optimal solution obtained from previous greedy pruning methods, which seems to be another reason. Occasionally, a large performance degradation happens for the pre-trained model (e.g., a 14.01% top-1 accuracy drop for ResNet-50). This will be explored in our future work.

To test the realistic speedup ratio, we measure the forward time of the pruned models on one GTX 1080 GPU with a batch size of 64 (shown in Tab. 3). The gap between the theoretical and realistic speedup may come from the limitations of IO delay, buffer switching and the efficiency of BLAS libraries.
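The realistic speedup reported in Tab. 3 is obtained by timing the forward pass on the GPU. A minimal sketch of such a measurement, assuming a CUDA device, ImageNet-sized inputs and the batch size of 64 used in the paper (the function name measure_forward_ms is ours, not part of the paper's code):

```python
import time
import torch

@torch.no_grad()
def measure_forward_ms(model, input_size=(64, 3, 224, 224), warmup=10, iters=50):
    """Average forward time in milliseconds on the current CUDA device."""
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(input_size, device=device)
    for _ in range(warmup):        # warm up cuDNN and the memory allocator
        model(x)
    torch.cuda.synchronize()       # make sure all queued kernels have finished
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()       # wait for the timed forward passes to complete
    return (time.time() - start) / iters * 1000.0
```

Comparing the measured times of the baseline and the compact model then gives the realistic speedup; the percentages in Tab. 3 are consistent with the relative time saving, i.e., 1 − (pruned time / baseline time).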
Table 2: Comparison of pruning ResNet on ImageNet. "Fine-tune?" and "Accu. Drop" have the same meaning as in Tab. 1.
Model       Baseline time (ms)  Pruned time (ms)  Realistic Speed-up (%)  Theoretical Speed-up (%)
ResNet-18   37.10               26.97             27.4                    41.8
ResNet-34   63.97               45.14             29.4                    41.1
ResNet-50   135.01              94.66             29.8                    41.8
ResNet-101  219.71              148.64            32.3                    42.2

Table 3: Comparison of the theoretical and realistic speedup. We only count the time consumption of the forward procedure.

different SFP intervals may lead to different performance; so we explore the sensitivity of the SFP interval. We use ResNet-110 under a pruning rate of 30% as a baseline, and change the SFP interval from one epoch to ten epochs, as shown in Fig. 3(b). The model accuracy shows no large fluctuation across the different SFP intervals. Moreover, the model accuracy of most (80%) of the intervals surpasses that of the one-epoch interval. Therefore, we can achieve an even better performance if we tune this parameter.

Selection of pruned layers. Previous works always prune a portion of the layers of the network. Besides, different lay-