
Neurocomputing 609 (2024) 128461


Building efficient CNNs using Depthwise Convolutional Eigen-Filters (DeCEF)

Yinan Yu a,c,d,∗, Samuel Scheidegger c,d, Tomas McKelvey b

a Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden
b Department of Electrical Engineering, Chalmers University of Technology, Gothenburg, Sweden
c Asymptotic AI, Gothenburg, Sweden
d Lumilogic, Gothenburg, Sweden

ARTICLE INFO

Communicated by A. Mukherjee

Keywords:
Convolutional neural network
Low rank approximation
Subspace method
Network complexity
Efficient network
Deep learning

ABSTRACT

Deep Convolutional Neural Networks (CNNs) have been widely used in various domains due to their impressive capabilities. These models are typically composed of a large number of 2D convolutional (Conv2D) layers with numerous trainable parameters. To manage the complexity of such networks, compression techniques can be applied, which typically rely on the analysis of trained deep learning models. However, in certain situations, training a new CNN from scratch may be infeasible due to resource limitations. In this paper, we propose an alternative parameterization to Conv2D filters with significantly fewer parameters without relying on compressing a pre-trained CNN. Our analysis reveals that the effective rank of the vectorized Conv2D filters decreases with respect to the increasing depth in the network. This leads to the development of the Depthwise Convolutional Eigen-Filter (DeCEF) layer, which is a low rank version of the Conv2D layer with significantly fewer trainable parameters and floating point operations (FLOPs). The way we define the effective rank is different from previous work, and it is easy to implement and interpret. Applying this technique is straightforward – one can simply replace any standard convolutional layer with a DeCEF layer in a CNN. To evaluate the effectiveness of DeCEF layers, experiments are conducted on the benchmark datasets CIFAR-10 and ImageNet for various network architectures. The results have shown a similar or higher accuracy using about 2/3 of the original parameters and reducing the number of FLOPs to 2/3 of the base network. Additionally, analyzing the patterns in the effective rank provides insights into the inner workings of CNNs and highlights opportunities for future research.

1. Introduction

Deep CNN is one of the most commonly used data-driven techniques. Typically, the large number of trainable parameters in deep learning models results in high demands on the computational power and memory capacities, which requires renting or purchasing expensive infrastructure for training. The high power consumption during training and inference is not environmentally friendly [1]. Moreover, the size of the network and the number of FLOPs play an important role for the inference process, where a small edge device may be used with restrictions on the complexity of the runtime. Therefore, building an efficient network is beneficial in terms of saving computational resources and reducing the overall cost for deep learning while achieving similar performances.

One topic on constructing an efficient CNN is Neural Architecture Search (NAS), where the focus is to search for an optimal architecture given certain criteria. In this paper, however, we assume that the wiring of the layers is pre-determined. Our focus is on how to improve the efficiency of a CNN for a given architecture.

The literature primarily highlights two strategies to achieve this. The first strategy is to take a trained network and remove the most insignificant parameters. This is referred to as compression or pruning in the literature. This is often a reasonable approach since many applications use pre-trained networks as backbones.

However, this strategy depends on the availability of a reusable pre-trained network. Since the significance of network weights is often data-dependent, factors such as variations in data, restrictive licensing,

∗ Corresponding author at: Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden.
E-mail address: [email protected] (Y. Yu).
1 Although not being the main focus of this work, the proposed method can also be applied as a compression technique derived from a pre-trained network. This aspect is elaborated in Appendix B.

https://doi.org/10.1016/j.neucom.2024.128461
Received 25 October 2022; Received in revised form 11 April 2023; Accepted 21 August 2024
Available online 3 September 2024

Fig. 1. The density histogram (y-axis∈ [0, 1]) of the effective rank (x-axis∈ [1, 9]) estimated using Eq. (2) in Procedure 1 for DenseNet-121 trained on ImageNet. The statistics are
computed over all input channels 𝑖 in that layer. We see the decreasing trend of the effective ranks with respect to the depth of the network.

or other constraints can make it impractical to reuse a pre-trained network. This makes training a neural network from scratch unavoidable. Nonetheless, training the original network from scratch may not be feasible due to the potentially high resource requirements. In this case, after the overall architecture is established, one may re-parameterize the CNN to make it more efficient before training. That is, the network is still aimed to accomplish what the original CNN is supposed to achieve, but with significantly fewer trainable parameters and FLOPs. The approximation is on the functional level instead of relying on trained parameters. This is the main focus of this work.1

The main hypothesis for finding an efficient re-parameterization strategy is that there is significant redundancy in Conv2D layers, which means that it may be sufficient to express a Conv2D layer with fewer parameters in order to achieve similar performances. One of the most commonly used function approximation techniques is the subspace low rank representation [2–4]. It is a family of very well studied and widely used techniques in the area of signal processing and machine learning. To put it in the context of CNN, the main idea is to rearrange the trainable variables into a vector space and find a subspace spanned by the most significant singular vectors of these variables. This new representation typically results in fewer trainable variables and inference FLOPs during runtime with potentially better robustness.

There are two key steps involved to achieve this approximation: (1) find a representative vector space for each layer, and (2) estimate the effective rank without training. To find a representative vector space for Conv2D filters in a CNN, we have designed experiments where we observe that (1) vectorized Conv2D filters exhibit low rank behaviors, and (2) the effective ranks are different for each layer and they have a decreasing tendency with respect to the depth of the network. Given these observations, we propose a new convolutional filter DeCEF. DeCEF is parameterized by a new hyperparameter we call rank, where a full rank DeCEF is equivalent to a Conv2D filter, whereas a rank one DeCEF is equivalent to a depthwise separable convolutional layer. To avoid the common problem of over-tuning, we use a rule-based approach for finding the ranks, where the rules are pre-determined by cross-validation on a small dataset trained on a small network. The rules are then applied to larger datasets and networks without further adjustments or tuning.

The paper is organized as follows. First, to motivate our work, we present the experiments and methodologies being used to observe and analyze the low rank behaviors in several trained CNNs in Section 2.1. We then propose the definition of a new type of filter parameterization, DeCEF, in Section 2.2. To further illustrate the advantages of using a DeCEF layer, we show two key properties, robustness and complexity, in Section 2.3. In Section 2.4, we present the training strategies for DeCEF. In Section 4, we show experiments to evaluate the effectiveness of DeCEF. First, we run ablation studies on the smaller dataset CIFAR-10 using DeCEF to gain empirical insights of its behaviors in Section 4.2. To further evaluate the two properties of DeCEF, we conduct experiments using the benchmark network ResNet-50 on ImageNet for comparing complexity versus accuracy. Moreover, in Section 4.3, we run further experiments on two additional popular network architectures, DenseNet and HRNet. These results are also compared to other state-of-the-art model reduction techniques in Section 4.3.

2. DeCEF layers

2.1. Motivation

First, let us formally define what a layer is in this context.

Definition 1. In the scope of this paper, a Conv2D layer (or a layer for short)

ℒ = {w_j^(i) ∈ R^(h×h) : i = 1, …, c_in, j = 1, …, c_out}

is a set of trainable units that are characterized by the following attributes: (1) the number of input channels c_in; (2) the number of output channels c_out; and (3) the parameterization w_j^(i) ∈ R^(h×h), i.e. the Conv2D filter.

Note that there are multiple layers in a network, but we ignore the layer index in this definition for simplicity. When multiple layers appear in the same context, we use ℒ_l to denote the indexed layer, where the subscript l ∈ {1, …, L} is the layer index and L is the depth2 of the network.

2 To clarify, this depth refers to the depth of the network. The depthwise in DeCEF refers to the depth (i.e. input channels) of a layer, which is a different concept.


Fig. 2. Effective rank (cf. Eq. (3)) versus layer depth. In these networks, we observe a decreasing trend of the effective ranks as the network goes deeper. In this figure, we show this effect for the networks VGG, ResNet and DenseNet.

In addition, we denote K := h². Note that in practice, the filter shape may be rectangular. Moreover, for the sake of both consistency and convenience, we use i and j to denote the input channel index and the output channel index, respectively.

Our motivation for this work has originated from the low rank behaviors we have observed in the vectorized filter parameters, so let us start with this experimental procedure to illustrate our findings.

Procedure 1. Observing low rank behaviors

• Apply vectorization w̄_j^(i) := vec(w_j^(i)) ∈ R^K and compute the truncated Singular Value Decomposition (SVD):

Ū^(i) S^(i) V^(i)T = [w̄_1^(i) ⋯ w̄_{c_out}^(i)],   (1)

where matrices Ū^(i) and V^(i) are the left and right singular matrix, respectively, and S^(i) is a diagonal matrix that contains the singular values in descending order. The implementation of this procedure is well supported by the linear algebra libraries of most programming languages.

• Identify the effective rank for each input channel i:

r_i = |{S^(i)[k, k] : S^(i)[k, k] ≥ γ S^(i)[1, 1], k = 1, …, min(K, c_out), γ ∈ [0, 1]}|   (2)

where |·| denotes the cardinality of a set and S[k, k] is the kth diagonal element of matrix S.

• The effective rank of one layer l:

r_l = |{s_k^l : s_k^l ≥ γ, k = 1, …, min(K, c_out), γ ∈ [0, 1]}|   (3)

where s_k^l = E_i(S^(i)[k, k] / S^(i)[1, 1]) and the expected value can be estimated by averaging over all input channels i.

To illustrate the empirical values, examples can be found in Figs. 1 and 2. Fig. 1 shows the density histogram of singular values computed using Eq. (1). The histogram is calculated from all input channels in each convolutional layer with K > 1. The maximum ranks of the layers in these example networks are min(K, c_out) = K, where K = 9. Similar low rank behaviors can be observed in Fig. 2.

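Procedure 1 is straightforward to reproduce with any linear algebra library. Below is a minimal NumPy sketch (an illustration, not the authors' code) that estimates the per-channel effective ranks of Eq. (2) and the layer effective rank of Eq. (3) from a Conv2D weight tensor; the (c_out, c_in, h, w) weight layout and the function name are assumptions, chosen to match common deep learning frameworks.

```python
import numpy as np

def effective_ranks(weights: np.ndarray, gamma: float = 0.3):
    """Estimate effective ranks of a Conv2D layer (cf. Eqs. (1)-(3)).

    weights: array of shape (c_out, c_in, h, w), one spatial filter per
             (output channel, input channel) pair (PyTorch-style layout).
    gamma:   relative singular value threshold (the paper uses 0.3).
    """
    c_out, c_in, h, w = weights.shape
    K = h * w
    n = min(K, c_out)
    per_channel_ranks = []
    ratio_sum = np.zeros(n)
    for i in range(c_in):
        # Eq. (1): stack the vectorized filters of input channel i column-wise
        W_bar = weights[:, i, :, :].reshape(c_out, K).T          # K x c_out
        s = np.linalg.svd(W_bar, compute_uv=False)[:n]           # descending
        # Eq. (2): count singular values above gamma * largest singular value
        per_channel_ranks.append(int(np.sum(s >= gamma * s[0])))
        ratio_sum += s / s[0]
    # Eq. (3): threshold the spectrum averaged over input channels
    s_bar = ratio_sum / c_in
    layer_rank = int(np.sum(s_bar >= gamma))
    return per_channel_ranks, layer_rank
```

For a trained PyTorch layer, for example, one could pass conv.weight.detach().cpu().numpy(); the default gamma = 0.3 follows the threshold choice stated later in rule h3.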

To summarize what we have observed:

(1) the vectorized Conv2D filters in a trained CNN exhibit low rank properties (cf. Fig. 1);
(2) the effective ranks of vectorized filters show a decreasing tendency when the network goes deeper (cf. Fig. 2);
(3) the effective ranks of vectorized filters converge over training steps (see video in supplementary material).

Given these observations, we propose a new layer called DeCEF as an alternative parameterization to Conv2D layers for the purpose of reducing the redundancy.

2.2. Definition

In this section, we introduce the definition of DeCEF followed by its two properties. Generally speaking, subspace techniques bring better robustness to the learning system due to their reduced model complexity. Motivated by these observations and analyses, we define a DeCEF layer as follows:

Definition 2 (DeCEF Layer). A DeCEF layer is defined by

Θ = {w_j^(i) : w_j^(i) ∈ R^(h×h), i = 1, …, c_in, j = 1, …, c_out}

with the following parameterization

w_j^(i) = Σ_{k=1}^{r} a_{k,j}^(i) u_k^(i),  r ∈ [1, h²]   (4)

where a_{k,j}^(i) ∈ R and u_k^(i) ∈ R^(h×h), which satisfies

ū_l^(i)T ū_m^(i) = 1 if l = m, and 0 otherwise,

for ū_k^(i) = vec(u_k^(i)) ∈ R^(h²). The parameters u_k^(i) ∈ R^(h×h) are called the eigen-filters.

Note that for the sake of clarity, we use Θ to denote the DeCEF layer, instead of the generic notation ℒ in Definition 1.

2.3. Properties

In this section, we present two key properties of the DeCEF layer. These properties are then empirically evaluated in the experiment section.

Property 1. Complexity (one layer)

• Number of trainable parameters (N)
 – N(Conv2D): c_in c_out h²
 – N(DeCEF): N_u + N_a, where
  – Eigen-filters: N_u = c_in h² r
  – Coefficients: N_a = c_in c_out r

For r = h², it is trivial to randomly initialize eigen-filters that span the whole h²-dimensional vector space and hence the eigen-filters do not need to be trainable, i.e. N_u = 0 and N_a = c_in c_out h². Therefore, Conv2D and DeCEF are equivalent for r = h².

For r < h², N(DeCEF) < N(Conv2D) if r ≤ ⌊c_out h² / (c_out + h²)⌋.

Example. Given c_in = c_out = 128 and h = 3, we have N(Conv2D) = 147456. If r ≤ 8 < h² = 9, then N(DeCEF) < N(Conv2D). For r = 8, N(DeCEF) = 140288 and for r = 4, N(DeCEF) = 70144.

• FLOPs (F)

We count the multiply–accumulate operations (macc) and we do not include bias in our calculations. Given the dimension of the input layer H × W × c_in, let t = ⌊H/stride⌋ × ⌊W/stride⌋,
 – F(Conv2D): t h² c_in c_out
 – F(DeCEF): t c_in r (h² + c_out)

Example. Given H = W = 100, c_in = 128, c_out = 128 and h = 3 with stride = 1, we have F(Conv2D) = 1.47 GFLOPs. For r = 8, F(DeCEF) = 1.40 GFLOPs. For r = 4, F(DeCEF) = 0.70 GFLOPs.

Property 2. Robustness

Lemma 1. Let ΔI_i be an additive perturbation matrix and w_j^(i) ∈ R^(h×h) be a filter parameterized by Eq. (4), which is learned from some training process. Let

Ū^(i) = [ū_1^(i), …, ū_r^(i)].   (5)

If Ū^(i)T Ū^(i) = I and ‖a_j^(i)‖_2 ≤ ε, ∀i, j, then

‖Σ_i ΔI_i ∗ w_j^(i)‖_∞ ≤ εhr Σ_i ‖ΔI_i‖_2.   (6)

Proof. See Appendix A. □

Robustness in this context is indicated by the propagation of the additive perturbation between input and output feature maps. Lemma 1 shows that when (1) Ū^(i)T Ū^(i) = I, i.e. the vectorized filters are orthonormal, and (2) ‖a_j^(i)‖_2 ≤ ε, i.e. the coefficients are bounded by ε, the effect of the perturbation on the output is bounded by Eq. (6).

The rank r of the eigen-filters is a hyperparameter that yields a trade-off between the robustness and the representational power of a DeCEF layer. In this work, we use a rule based approach for choosing this hyperparameter.

2.4. Training algorithms

In this section, we show how to construct and train a network composed of DeCEF layers.

2.4.1. The optimization problem

Given a network architecture with a set of layers 𝒩 = {ℒ_1, …, ℒ_L}, denote the index set of the network 𝒩 using ℐ = {1, …, L}. Let 𝒟 = {Θ_{m_1}, …, Θ_{m_S}} ⊆ 𝒩 be a set of DeCEF layers with index set ℳ = {m_1, …, m_S}. Let 𝒩̃ = 𝒩 ∖ 𝒟 be the rest of the layers in the network. Let f(𝒩) be an objective function and λΦ(𝒩̃) be a regularization term applied to the set 𝒩̃, where λ > 0 is the multiplier. The optimization problem is formulated as:

min f(𝒩) + λΦ(𝒩̃)
subject to Ū_l^(i)T Ū_l^(i) = I,  ‖a_{l,j}^(i)‖_2 ≤ ε,  ∀l ∈ ℳ   (7)

2.4.2. Relaxed regularization

Finding an exact optimal DeCEF layer is an NP-hard problem due to the orthonormality constraint. Therefore, we approximate the constraint by the following regularizations. For a given DeCEF layer l, we have:

Φ1: λ1 ‖Ū^(i)T Ū^(i) − I‖_2   (8)
Φ2: λ2 ‖a_j^(i)‖_2,  a_j^(i) = [a_{1,j}^(i), …, a_{r,j}^(i)]   (9)

Note that the layer index l is neglected.

The loss function of the whole network is then written as:

f(𝒩) + λΦ(𝒩̃) + λ1 Φ1(𝒟) + λ2 Φ2(𝒟)   (10)

where Φ_i(𝒟) = Σ_{l∈ℳ} Φ_i^l, i = 1, 2.

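Since Eq. (4) expresses every filter of input channel i as a combination of the same r eigen-filters, a DeCEF layer maps directly onto a depthwise convolution with depth multiplier r (holding the eigen-filters u_k^(i)) followed by a 1 × 1 convolution (holding the coefficients a_{k,j}^(i)), as the paper also notes in its conclusion. The sketch below is a hedged PyTorch illustration of this structure (the class name and padding choice are assumptions, not the authors' implementation); its parameter count matches Property 1.

```python
import torch
import torch.nn as nn

class DeCEFLayer(nn.Module):
    """Sketch of a DeCEF layer (Eq. (4)): r eigen-filters per input channel,
    realized as a depthwise Conv2d with depth multiplier r, followed by a
    1x1 Conv2d holding the coefficients a_{k,j}^(i)."""

    def __init__(self, c_in: int, c_out: int, h: int = 3, r: int = 4, stride: int = 1):
        super().__init__()
        # Eigen-filters u_k^(i): N_u = c_in * h^2 * r parameters
        self.eigen_filters = nn.Conv2d(
            c_in, c_in * r, kernel_size=h, stride=stride,
            padding=h // 2, groups=c_in, bias=False)
        # Coefficients a_{k,j}^(i): N_a = c_in * c_out * r parameters
        self.coefficients = nn.Conv2d(c_in * r, c_out, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FLOPs per Property 1: t * c_in * r * (h^2 + c_out) multiply-accumulates
        return self.coefficients(self.eigen_filters(x))

# Parameter count check against Property 1 (c_in = c_out = 128, h = 3, r = 4):
layer = DeCEFLayer(128, 128, h=3, r=4)
n_params = sum(p.numel() for p in layer.parameters())
assert n_params == 128 * 9 * 4 + 128 * 128 * 4   # N_u + N_a = 70144
```

With this realization, the frozen-basis variant studied later in Experiment 2 of Section 4.2 simply amounts to disabling gradients on the depthwise weights, e.g. layer.eigen_filters.weight.requires_grad_(False), after an orthonormal initialization.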

2.4.3. Deterministic rule-based hyperparameters

Hyperparameters are chosen based on deterministic rules to avoid complex hyperparameter tuning and to increase reproducibility. These rules are determined using a transfer learning approach. First, we find the hyperparameters in DeCEF using cross-validation on the small dataset CIFAR-10, where cross-validation is affordable. Then we establish a deterministic rule for each hyperparameter. These rules are then directly applied to the larger dataset ImageNet without tuning. There are three sets of hyperparameters h1 ∼ h3:

h1: Ranks r (Algorithm 1): The observation of singular values from several networks shows that the effective ranks typically have a decreasing trend with respect to the depth, i.e., layers at the beginning of the network often have higher rank, and vice versa. The idea of choosing the rank before training a network is to find a monotonically decreasing function given the increasing depth (cf. Fig. 2). In this paper, we adopt two alternative routines for choosing the rank in each layer: linear decay (simple) and logarithmic decay (aggressive). Let l be the depth index of a layer and K = h². Denote l_max = max(l) and l_min = min(l).
 – Linear decay: r̂_l = K − ⌊l(K−1) / (l_max − l_min)⌋.
 – Logarithmic decay: r̂_l = ⌊(K−1) / log2(l+1)⌋.

h2: Regularization coefficients (Algorithm 1, cf. Eqs. (8), (9)): λ1 = 10^-4 r and λ2 = 10^-4.

h3: Singular value threshold to determine the effective rank (cf. Eqs. (2), (3)): γ = 0.3.

A note on the rank function. Our observation indicates that the effective ranks of Conv2D layers have a decreasing tendency, which motivates us to choose a rank function that is decreasing over the depth of the network to approximate this behavior and enable the possibility of training a DeCEF network from scratch. Interestingly, this observation seems to be related to recent research on layer convergence bias [5], which states that shallower layers carry lower frequency information and tend to converge sooner than deeper layers. A possible explanation for the decreasing rank function is that allowing a broader range of frequency components to pass from earlier layers to deeper layers grants the latter more freedom to learn high-frequency information. Since the most significant eigen-filters can be interpreted as low frequency components, if the ranks of shallower layers were set too low, high frequency information would be filtered out by the low rank approximation, and therefore would not be passed onto deeper layers for learning.

2.4.4. Training algorithm

The training algorithm is summarized in Algorithm 1.

Algorithm 1 (DeCEF Training Strategy).

• Step 1: Choose a network topology fully or partially composed of DeCEF layers. For example, one can replace all Conv2D layers with DeCEF layers.
• Step 2: Choose hyperparameters r, λ1, λ2.
• Step 3: Initialization for each DeCEF layer (k = 1, …, r):
 – Eigen-filters u_k^(i):
  – Generate random matrices: A^(i) ∈ R^(K×r).
  – Compute the truncated SVD: A^(i) = Ū^(i) S̄^(i) V̄^(i)T.
  – Reshape each column of Ū^(i) into a matrix u_k^(i) ∈ R^(h×h).
 – Coefficients a_{k,j}^(i): randomly initialized from a normal distribution.
• Step 4: Forward and backward paths:
 – Forward I^l → I^(l+1): for each output channel j,

  I_j^(l+1) = Σ_{i=1}^{c_in} Σ_{k=1}^{r} a_{k,j}^(i),l u_k^(i),l ∗ I_i^l

 – Backward: backpropagation with the loss function described in Eq. (10).

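To make Algorithm 1 and the rules h1 and h2 concrete, the following is a hedged Python sketch of three of its ingredients: the depth-dependent rank schedules of h1 (the exact decay formulas are reconstructions of the expressions above), the SVD-based orthonormal initialization of the eigen-filters in Step 3, and the relaxed regularizers Φ1 and Φ2 of Eqs. (8) and (9) evaluated on the DeCEFLayer sketch given earlier. Function names and aggregation choices are assumptions, not the authors' code.

```python
import math
import torch

def rank_schedule(l: int, l_min: int, l_max: int, K: int = 9, mode: str = "linear") -> int:
    """Rule h1: a rank that decreases with the depth index l.
    NOTE: reconstruction of the decay rules; the paper's exact formulas may differ."""
    if mode == "linear":
        r = K - l * (K - 1) // max(1, l_max - l_min)
    else:  # logarithmic decay (more aggressive)
        r = int((K - 1) / max(1.0, math.log2(l + 1)))
    return max(1, min(K, r))

def init_eigen_filters(c_in: int, h: int, r: int) -> torch.Tensor:
    """Algorithm 1, Step 3: orthonormal eigen-filters obtained from the SVD of a
    random K x r matrix, one basis per input channel."""
    K = h * h
    filters = []
    for _ in range(c_in):
        A = torch.randn(K, r)
        U, _, _ = torch.linalg.svd(A, full_matrices=False)   # U: K x r, orthonormal columns
        filters.append(U.T.reshape(r, 1, h, h))              # one h x h filter per column
    return torch.cat(filters, dim=0)                          # shape (c_in * r, 1, h, h)

def decef_regularizers(layer, r: int, lambda1: float = None, lambda2: float = 1e-4):
    """Relaxed constraints of Eqs. (8)-(9) for the DeCEFLayer sketch above,
    summed over input/output channels (one possible reading of the norms)."""
    lambda1 = 1e-4 * r if lambda1 is None else lambda1        # rule h2
    c_in = layer.eigen_filters.groups
    W = layer.eigen_filters.weight                            # (c_in * r, 1, h, h)
    U = W.reshape(c_in, r, -1).transpose(1, 2)                # per-channel K x r bases
    gram = U.transpose(1, 2) @ U                              # (c_in, r, r) Gram matrices
    eye = torch.eye(r, device=W.device)
    # Eq. (8): spectral-norm deviation from orthonormality, summed over channels i
    phi1 = torch.linalg.matrix_norm(gram - eye, ord=2).sum()
    # Eq. (9): sum over i, j of the coefficient vector norms ||a_j^(i)||_2
    A = layer.coefficients.weight.reshape(layer.coefficients.out_channels, c_in, r)
    phi2 = A.norm(dim=2).sum()
    return lambda1 * phi1 + lambda2 * phi2
```

In use, one would add decef_regularizers(layer, r) for every DeCEF layer to the task loss, which corresponds to the relaxed objective of Eq. (10).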

2.5. Refactor a Conv2D network into DeCEF

There are use cases where a pre-trained CNN is available and one needs to reduce the runtime complexity of the network. This is not the focus of this work, but we also propose a compression algorithm presented in Appendix B.

3. Related work

To compare to the state-of-the-art techniques, in this section, we list the following existing approaches.

Subspace techniques: The first category is the Low-Rank Approximation (LRA) technique. There are mainly two different approaches in the existing literature: (1) Separable bases: Jaderberg et al. [6] decomposes the d × d filters into 1 × d and d × 1 filters to construct rank-1 bases in the spatial domain. In later work, Tai et al. [7] and Lin et al. [8] present closed form solutions that significantly improve the efficiency over previous iterative optimization solvers. Ioannou et al. [9] introduces a novel weight initialization that allows small basis filters to be trained from scratch, which has achieved similar or higher accuracy than the conventional CNNs. Yu et al. [10] proposes an SVD-free algorithm that uses the idea that filters usually share smooth components in a low-rank subspace. Alvarez and Salzmann [11] introduce a regularizer that encourages the weights of the layers to have low rank during the training. More recently, Yang et al. [12] introduced SVD training, a method to achieve low-rank Deep Neural Networks (DNNs) that avoids costly singular value decomposition at every step. The Generalized Depthwise-Separable convolution [13] is an efficient post-training approximation for 2D convolutions in CNNs, improving throughput while preserving robustness. Yin et al. [14] proposes an ADMM-based framework for tensor decomposition in model compression, formulating tensor train decomposition as an optimization problem with tensor rank constraints and iteratively solving it to obtain high-accuracy tensor train-format DNN models for CNNs and Recurrent Neural Networks (RNNs). Li et al. [15] introduces a two-phase progressive genetic algorithm, PSTRN, which leverages the discovery of interest regions in rank elements to efficiently determine optimal ranks in tensor ring networks. Recently, Chen et al. [16] proposes joint matrix decomposition for CNN compression, leveraging shared structures to project weights into the same subspace, and offers three decomposition schemes with SVD-based optimization for improved compression results.

(2) Filter vectorization: Some existing work implements the low rank approximation by vectorizing the filters. For instance, Denton et al. [17] stacks all filters for each output channel into a high dimensional vector space and approximates the trained filters using SVD. Wen et al. [18] presents a regularization to enforce filters to coordinate into a lower-rank space, where the subspaces are constructed from all the input channels for each given output channel. Later, Peng et al. [19] proposed a decomposition focusing on exploiting the filter group structure for each layer.

Pruning: Pruning refers to techniques that aim at reducing the number of parameters in a pre-trained network by identifying and removing redundant weights. This is a heavily investigated topic in the attempt to reduce the model complexity. Although being different from our use case, we list the state-of-the-art pruning techniques in this section to have a more complete view on model reduction techniques. In Optimal Brain Damage by LeCun et al. [20], and later in Optimal Brain Surgeon by Hassibi et al. [21], redundant weights are defined by their impact on the objective function, which is identified using the Hessian of the loss function. Other definitions of redundancy have been proposed in subsequent work. For instance, Anwar et al. [22] applies pruning on the filter level of CNNs by using particle filters to propose pruning candidates. Han et al. [23] introduces a simpler pruning method using a strong L2 regularization term, where weights under a certain threshold are removed. Molchanov et al. [24] uses Taylor expansion to approximate the influence on the loss function of removing each filter. Hu et al. [25] iteratively optimizes the network by pruning unimportant neurons based on analysis of their outputs on a large dataset. Li et al. [26] identifies and removes filters having a small effect on the accuracy. Aghasi et al. [27] prunes a trained network layer-wise by solving a convex optimization program. Liu et al. [28] takes wide and large networks as input models, but during training insignificant channels are automatically identified and pruned afterwards. More recently, Luo and Wu [29], Luo et al. [30] analyze the redundancy of filters in a trained network by looking at statistics computed from its next layer. He et al. [31] proposes an iterative LASSO regression based channel selection algorithm. Huang et al. [32] removes filters by training a pruning agent to make decisions for a given reward function. Yu et al. [33] poses the pruning problem as a binary integer optimization and derives a closed-form solution based on final response importance. Lin et al. [34] prunes filters across all layers by proposing a global discriminative function based on prior knowledge of each filter. Tung and Mori [35] combine network pruning and weight quantization in a single learning framework that performs pruning and quantization jointly. Zhang et al. [36] first formulate the weight pruning problem as a nonconvex optimization problem with constraints specifying the sparsity requirements and then optimize it using the alternating direction method of multipliers. Other work, such as Zhuang et al. [37], uses discrimination-aware losses in the network to increase the discriminative power of intermediate layers. Huang and Wang [38] add a scaling factor to the outputs and then add sparsity regularizations on these factors. He et al. [39] compresses CNN models by pruning filters with redundancy, rather than those with ''relatively less'' importance. Lin et al. [40] proposes a scheme that incorporates two different regularizers which fully coordinate the global output and local pruning operations to adaptively prune filters. Later, Lin et al. [41] proposed an effective structured pruning approach that jointly prunes filters as well as other structures in an end-to-end manner by defining a new objective function with sparsity regularization, which is solved by generative adversarial learning. Ding et al. [42] proposes a novel optimization method, which can train several filters to collapse into a single point in the parameter hyperspace that can be trimmed with no performance loss. Liu et al. [43] proposes a meta network, which is able to generate weight parameters for any pruned structure given the target network and can be used to search for good-performing pruned networks. You et al. [44] introduce gate decorators to identify unimportant filters to prune. Molchanov et al. [45] prunes filters by using Taylor expansions to approximate a filter's contribution. Ding et al. [46] finds the least important filters to prune by a binary search. Luo and Wu [47] propose an efficient channel selection layer to find less important filters automatically in a joint training manner. Lin et al. [48] proposes a method that is mathematically formulated to prune filters with low-rank feature maps. He et al. [49] introduces a differentiable pruning criteria sampler. Ding et al. [50] proposes a re-parameterization of CNNs into a remembering part and a forgetting part. The former learns to maintain the performance and the latter learns for efficiency. Liu et al. [51] proposes a layer grouping algorithm to find coupled channels automatically. Shi et al. [52] uses an effective estimation of each filter, i.e., saliency, to measure filters from two aspects: the importance for prediction performance and the consumed computational resources. This can be used to preserve the prediction performance while zeroing out more computation-heavy filters.

Architectural design: Effort has been put into designing smaller network architectures without loss of generalization ability. For instance, He et al. [53] achieves a higher accuracy in [53,54] compared to other more complex networks by introducing the residual building block. The residual building block adds an identity mapping that allows the signals to be directly propagated between the layers. Iandola et al. [55] introduces SqueezeNet and the Fire module, which is designed to reduce the number of parameters in a network by introducing 1 × 1 filters. By utilizing a dense connection pattern between blocks, Huang et al. [56] manages to reduce the number of required parameters. Xie et al. [57] proposed a multi-branch architecture which exposes a new hyperparameter for each block to control the capacity of the network. Other work, like MobileNet [58,59] and EfficientNet [60], specifically focuses on building architectures suitable for devices with low compute capacity, such as mobile phones. By a design that maintains a high-resolution representation throughout the whole network, Wang et al. [61] achieves good accuracy and performance in HRNet.

Compression: Deep Compression, by Han et al. [62], reduces the storage size of the model using quantization and Huffman encoding to compress the weights in the network. Other work on reducing the memory size of models is done by binarization. In XNOR-Net by Rastegari et al. [63], the weights are reduced to a binary representation and the convolutions are replaced by XNOR operations. More recently, Suau et al. [64] proposed to analyze filter responses to automatically select compression methods for each layer.

Weight sharing: Another approach to reduce the number of parameters in a network is to share weights between the filters and layers. Boulch [65] shares weights between the layers in a residual network operating on the same scale.

Depthwise separable convolutions: Introduced by Chollet [66], depthwise separable convolutions have been shown to be a more efficient use of parameters compared to regular Conv2D layers in Inception-like architectures. They have also been used in other work, e.g., [31], where they were used to gain a computational speed-up of ResNet networks.

Our focus: We observe and analyze the Conv2D layer from a different perspective compared to the previous subspace techniques. More specifically, (i) we vectorize the filters instead of using separable bases in the original vector space [6,7,9,10]; (ii) we do not concatenate these vectorized filters into a large vector space [17–19], which achieves a better modularity compared to the concatenated vectors. Our perspective is motivated by the empirical evidence from our experiments. This opens up new opportunities and provides new analytical tools for understanding the design of convolutional networks with respect to their subspace redundancies. In our experiments, we choose a popular base network (ResNet) and compare our experimental results to various modifications of the same base network. We also conduct tests on other more recent network architectures such as HRNet-W18-C and DenseNet-121 for further comparison and validation.

4. Experiments and results

4.1. Hardware

For training and experiments, Nvidia Tesla V100 SXM2 GPUs with 32 GB of GPU memory are used.

4.2. Dataset CIFAR-10: Ablation study

Dataset: To empirically study the behavior of DeCEF, we conduct various experiments on the standard image recognition dataset CIFAR-10 by Krizhevsky and Hinton [67].

Benchmark: We use ResNet-32 as the base net for comparison. ResNet-32 has three blocks, where the last block (block-3) in ResNet-32 has the most filters. Since our goal is to reduce the amount of trainable parameters and FLOPs, we mainly vary the structure in block-3 in our experiments.

Experiments: We design four experiments as follows.

Experiment 1. Varying rank r and c_out. For a layer with input channels i = 1, …, c_in and output channels j = 1, …, c_out, the filters in the DeCEF layer are expressed as w_j^(i) = Σ_{k=1}^{r} a_{k,j}^(i) u_k^(i). We empirically show that DeCEF layers achieve higher accuracy with a significantly lower number of parameters. In this experiment, we vary two hyperparameters: (1) the rank r of each filter in the DeCEF layer, and (2) the number of output channels c_out.


We compare the accuracy versus the number of parameters for different types of layers (Conv2D and DeCEF with different hyperparameters). As shown in Fig. 3, with a lower number of parameters, DeCEF achieves a better accuracy with low rank techniques. Moreover, when we increase the number of output channels, DeCEF shows an even more promising result with fewer parameters in total.

Fig. 3. Accuracy versus number of parameters on CIFAR-10.

Experiment 2. Trainable vs. frozen eigen-filters. In Algorithm 1, the eigen-filters in DeCEF layers are trained simultaneously using backpropagation. In this experiment, we investigate the impact of this training process and try to understand if it is sufficient to use random basis vectors as eigen-filters. We initialize the eigen-filters according to Algorithm 1 and freeze them during training. The comparison between the accuracies achieved by frozen and trainable eigen-filters can be found in Fig. 4. By using frozen eigen-filters, the network has fewer trainable parameters for the same rank. With a low rank (r < 5), the accuracy is degraded without training.

Fig. 4. DeCEF layer with trainable vs. frozen bases on CIFAR-10.

Experiment 3. With or without Φ1 regularization. To study the effect of Φ1 introduced in Eq. (8), some experiments can be found in Fig. 4. We can see that with a high rank, the regularization needs to be applied. In our experiments, we use λ1 = 10^-4 r and λ2 = 10^-4, where λ1 is the multiplier of the constraint on the eigen-filters and λ2 is on the subspace coefficients. The reason for having the multiplier r in λ1 is to suppress the growth of the cost when r becomes large.

Experiment 4. Comparison to related work. In this experiment, we implement Algorithm 1 (DeCEF-ResNet-32) to compare to the state-of-the-art techniques. We vary the number of output channels c_out in the last ResNet block for comparison, where we see that having fewer eigen-filters with more output channels yields a better result.

Results: The results are presented in terms of the estimated mean and the standard deviation of the classification accuracy on the testing set with 10 runs for each experimental setup, which are shown in Figs. 3 and 4. The accuracy is then presented with respect to the number of trainable parameters for each network structure. For DeCEF layers, there are nine data points in each presented result, which correspond to different layer ranks in block-3, r3 ∈ {1, …, 9}. In addition, the number of trainable parameters in DeCEF layers is also varied by using different numbers of output channels in block-3, i.e., c_out ∈ {64, 96, 128}. We then vary c_out in ResNet-32 block-3 (c_out ∈ {16, 20, 24, …, 128}) to have a comparable result. We compare the accuracy achieved by DeCEF-ResNet-32 in Fig. 5.

4.3. Dataset ImageNet (ILSVRC-2012)

To further compare our algorithms to the state-of-the-art, we use the standard dataset ImageNet (ILSVRC-2012) by Deng et al. [69]. ImageNet has 1.2 M training images and 50 k validation images of 1000 object classes, commonly evaluated by Top-1 and Top-5 accuracy. We use the networks ResNet-50 v2 [54], DenseNet-121 [56] and HRNet-W18-C [61] as the base networks. The results are visualized in Figs. 6 and 7 for Top-1 and Top-5 accuracy, respectively.

The hyperparameters used in DeCEF-ResNet-50 are determined by the deterministic rules presented in h1, h2 and h3. For each setup, we have five runs and report the average accuracy and its standard deviation. From the experiments, we see the trade-off between the two rank decay mechanisms: linear decay is less aggressive, which yields a better accuracy, whereas logarithmic decay reduces a greater number of FLOPs while still having a decent accuracy. To further validate DeCEF, we run the same experiments on three commonly used base networks. The results are reported in Tables 2 and 3 to compare with the corresponding base networks and state-of-the-art model reduction techniques.

4.4. Comparison to related work

Various configurations of the DeCEF method are compared to related work on two different datasets, CIFAR-10 and ImageNet.

For CIFAR-10, the DeCEF-ResNet-32 (32, 64, 128) configuration achieves competitive accuracy while having fewer parameters and lower computational complexity than comparable competitive methods, such as HRank ResNet-110 [48] and SASL ResNet-110 [52]. The DeCEF-ResNet-32 (24, 48, 96) configuration also demonstrates comparable accuracy to SASL ResNet-56 and HRank ResNet-56 with significantly fewer parameters and lower computational requirements, and superior accuracy compared to GBN-40 [44] with similar complexities. Among pruning and subspace approximation techniques, ResRep ResNet-110 [50] outperforms DeCEF in terms of accuracy and complexity (94.19%, 108 MFLOPs vs. 94.62%, 105.68 MFLOPs). However, as ResRep is a channel pruning technique, it can easily be combined with DeCEF to achieve better efficiency.

On the ImageNet dataset, the DeCEF-ResNet-50 (log decay) configuration showcases competitive accuracy while maintaining fewer parameters and lower computational complexities compared to alternative methods such as Taylor-FO-BN-91% [45], fewer parameters than ShaResNet-101 and ShaResNet-152 (FLOPs not reported), and lower computational complexities than GFP ResNet-50 1 [51] (parameters not reported). The model has similar complexities as GBN-60 while achieving a higher accuracy. Additionally, the DeCEF-HRNet-W18-C (log decay) configuration achieves competitive accuracy with significantly fewer parameters and superior computational efficiency compared to ResRep ResNet-50 1, ResRep ResNet-50 2, GBN-50, and SSS-ResNetXt-38 [38].

Based on these observations, the DeCEF method shows promise in enhancing model efficiency, making it a valuable approach to consider in deep learning model development.


Table 1
Comparison to state-of-the-art model reduction techniques on CIFAR-10.
Network Acc. Std. No. param. MFLOPs
(a) DeCEF vs. baseline network
DeCEF-ResNet-32 (32, 64, 128)0 94.19% (0.18%) 533.00 k 108.00
DeCEF-ResNet-32 (24, 48, 96)1 93.64% (0.16%) 311.00 k 64.72
ResNet-1102 [53] 93.57% 1.72 M 252.89
ResNet-563 [53] 93.03% 850.00 k 125.49
ResNet-324 [53] 92.49% 467.00 k 69.00
DeCEF-ResNet-32 (16, 32, 64)5 92.45% (0.17%) 148.00 k 32.42
(b) Related work
ResRep ResNet-1106 [50] 94.62% 105.68
C-SGD-5/8 ResNet-1107 [42] 94.44% 98.91
HRank ResNet-110 18 [48] 94.23% 1.04 M 148.70
SASL ResNet-1109 [52] 93.99% 1.17 M 122.15
SFP ResNet-110 20%10 [68] 93.93% 182.00
SFP ResNet-56 10%11 [68] 93.89% 107.00
SASL ResNet-5612 [52] 93.88% 689.35 k 80.44
SFP ResNet-110 30%13 [68] 93.86% 150.00
Bi-JSVD0.7 ResNet-16 11.9914 [16] 93.84% (0.09%) 930.78 k 373.00
SFP ResNet-110 10%15 [68] 93.83% 216.00
SASL* ResNet-11016 [52] 93.80% 786.04 k 75.36
ShaResNet-16417 [65] 93.80% 930.00 k
LFPC ResNet-11018 [49] 93.79% 101.00
SFP ResNet-56 30%19 [68] 93.78% 74.00
FPGM-only 40% ResNet-11020 [39] 93.74% 121.00
ResRep ResNet-56 121 [50] 93.73% 59.09
LFPC ResNet-56 122 [49] 93.72% 66.40
GAL-0.1 ResNet-11023 [41] 93.59% 1.65 M 205.70
SASL* ResNet-5624 [52] 93.58% 538.90 k 53.84
ResNet-110-pruned-A25 [26] 93.55% 1.68 M 213.00
HRank ResNet-56 126 [48] 93.52% 710.00 k 88.72
FPGM-only 40% ResNet-5627 [39] 93.49% 59.40
SFP ResNet-56 20%28 [68] 93.47% 89.80
C-SGD-5/8 ResNet-5629 [42] 93.44% 49.13
GBN-4030 [44] 93.43% 395.25 k 50.07
GAL-0.6 ResNet-5631 [41] 93.38% 750.00 k 78.30
NISP-11032 [33] 93.38% 976.10 k
HRank ResNet-110 233 [48] 93.36% 700.00 k 105.70
SFP ResNet-56 40%34 [68] 93.35% 59.40
LFPC ResNet-56 235 [49] 93.34% 59.10
SFP ResNet-32 10%36 [68] 93.22% 58.60
RJSVD-1 ResNet-16 17.7637 [16] 93.19% (0.04 %) 628.38 k 350.00
HRank ResNet-56 238 [48] 93.17% 490.00 k 62.72
ResNet-56-pruned-A39 [26] 93.10% 770.10 k 112.00
GBN-3040 [44] 93.07% 283.05 k 37.27
ResNet-56-pruned-B41 [26] 93.06% 733.55 k 90.90
ResNet-110-pruned-B42 [26] 93.00% 1.16 M 115.00
NISP-5643 [33] 92.99% 487.90 k 81.00
ADMM TT ResNet-3244 [14] 92.87% 97.29 k
FPGM-mix 40% ResNet-3245 [39] 92.82% 32.30
GAL-0.5 ResNet-11046 [41] 92.74% 950.00 k 130.20
ResRep ResNet-56 247 [50] 92.67% 27.82
SVD ResNet-32 Spatial Hoyer48 [12] 92.66% 26.72
HRank ResNet-110 349 [48] 92.65% 530.00 k 79.30
LFPC ResNet-3250 [49] 92.12% 32.70
nin-c3-lr51 [9] 91.78% 438.00 k 104.00
GAL-0.8 ResNet-5652 [41] 91.58% 290.00 k 49.99
PSTRN-S ResNet-3253 [15] 91.44% 180.00 k
HRank ResNet-56 354 [48] 90.72% 270.00 k 32.52
SFP ResNet-32 20%55 [68] 90.63% 49.00
SFP ResNet-32 30%56 [68] 90.08% 40.30

Table 2
Comparison to the base networks on ImageNet.
Network Layers Rank decay Top-1 Top-5 Params GFLOPs
ResNet-50 Conv2D None 76.47% 93.21% 25.56M 3.80
ResNet-50 DeCEF Linear 76.61% 93.22% 17.27M 2.90
ResNet-50 DeCEF Logarithmic 76.46% 93.24% 16.64M 2.50
DenseNet-121 Conv2D None 74.81% 92.32% 79.79M 2.83
DenseNet-121 DeCEF Linear 74.85% 92.61% 72.10M 2.81
DenseNet-121 DeCEF Logarithmic 74.40% 91.89% 62.92M 2.11
HRNet-W18-C Conv2D None 77.00% 93.50% 21.30M 3.99
HRNet-W18-C DeCEF Linear 76.17% 92.99% 9.490M 2.55
HRNet-W18-C DeCEF Logarithmic 75.11% 92.47% 7.05M 1.27


Fig. 5. Ball chart for CIFAR-10, where the size of the ball indicates the number of trainable parameters. For papers that have not reported the FLOPs, we use a cross instead of
a ball to represent them. The exact values are reported in Table 1. The number in each ball is the network ID, which is indicated as the superscript of each entry in Table 1.

Fig. 6. Ball chart for ImageNet Top-1 accuracy with the same set up as Fig. 5. The corresponding values can be found in Table 3.

Fig. 7. Ball chart for ImageNet Top-5 accuracy with the same set up as Fig. 5. The corresponding values can be found in Table 3.

Combining DeCEF with state-of-the-art deep learning models that achieve high accuracy with relatively low computational requirements (GFLOPs), such as the EfficientNet family [60], InceptionResNetV2 [71], Xception [66], and Inception V3 [72] (by, for example, replacing the depthwise convolutional layers with DeCEF layers), could potentially improve both performance and efficiency, as these models primarily focus on architectural designs.

Moreover, exploring the integration of channel pruning techniques like those employed in ResRep with DeCEF could serve as a potential future direction, further boosting the efficiency and accuracy of these configurations.

5. Conclusion and future work

In this paper, we propose a new methodology to observe and analyze the redundancy in a CNN. Motivated by our observations of the low rank behaviors in vectorized Conv2D filters, we present a layer structure, DeCEF, as an alternative parameterization to Conv2D filters for the purpose of reducing their complexity in terms of trainable parameters and FLOPs. Our experiments have shown that in a convolutional layer with filter size h × h, it is not necessary to have more than h² eigen-filters given the training strategy in Section 2.4.

In terms of the accuracy-to-complexity ratio, it is beneficial to use more coefficients (i.e. output channels) with fewer eigen-filters in DeCEF layers. The DeCEF layer is simple to implement in most deep learning frameworks using depthwise separable convolutions with a new training strategy. With the deterministic rules for choosing the effective ranks, it is easy to design and reproduce the results. From our observations, the underlying subspace structure is a commonly shared property among different network architectures and topologies, which provides insights into the design and analysis of CNNs.


Table 3
Comparison to state-of-the-art model reduction techniques on ImageNet.
Network Top-5 Acc. Std. Top-1 Acc. Std. No. param. GFLOPs
(a) DeCEF vs. baseline network
HRNet-W18-C0 [61] 93.50% 77.00% 21.30 M 3.99
DeCEF-ResNet-50 (lin decay)1 93.22% (0.07%) 76.61% (0.06%) 17.27 M 2.90
ResNet-502 [54] 93.21% 76.47% 25.56 M 3.80
DeCEF-ResNet-50 (log decay)3 93.24% (0.05%) 76.46% (0.05%) 16.64 M 2.50
DeCEF-HRNet-W18-C (inferred rank)4 93.04% 76.30% 12.06 M 2.61
DeCEF-HRNet-W18-C (lin decay)5 92.99% 76.17% 9.49 M 2.55
DeCEF-HRNet-W18-C (log decay)6 92.47% 75.11% 7.05 M 1.27
(b) Related work
EfficientNet-B77 [60] 96.84% 84.43% 64.10 M
EfficientNet-B68 [60] 96.90% 84.08% 41.00 M
EfficientNet-B59 [60] 96.71% 83.70% 28.50 M
EfficientNet-B410 [60] 96.26% 82.96% 17.70 M
NASNetLarge11 [70] 96.00% 82.50% 84.90 M
EfficientNet-B312 [60] 95.68% 81.58% 10.80 M
InceptionResNetV213 [71] 95.25% 80.26% 54.30 M
EfficientNet-B214 [60] 94.95% 80.18% 7.80 M
EfficientNet-B115 [60] 94.45% 79.13% 6.60 M
Xception16 [66] 94.50% 79.00% 22.86 M
ResNet152V217 [54] 94.16% 78.03% 58.30 M
InceptionV318 [72] 93.72% 77.90% 21.80 M
ShaResNet-15219 [65] 93.86% 77.77% 36.80 M
DenseNet20120 [56] 93.62% 77.32% 18.30 M
ResNet101V221 [54] 93.82% 77.23% 42.60 M
EfficientNet-B022 [60] 93.49% 77.19% 4.00 M
ShaResNet-10123 [65] 93.45% 77.09% 29.40 M
GFP ResNet-50 124 [51] 76.95% 3.06
ResNet15225 [53] 93.12% 76.60% 58.40 M
Taylor-FO-BN-91%26 [45] 76.43% 22.60 M 3.27
ResNet10127 [53] 92.79% 76.42% 42.70 M
GFP ResNet-50 228 [51] 76.42% 2.04
MetaPruning 0.85 ResNet-5029 [43] 76.20% 3.00
GBN-6030 [44] 92.83% 76.19% 17.42 M 2.25
DenseNet16931 [56] 93.18% 76.18% 12.60 M
ResNet-50 GAL-0.5-joint32 [41] 90.82% 76.15% 19.31 M 1.84
ResRep ResNet-50 133 [50] 92.90% 76.15% 1.67
ResNet50V234 [54] 93.03% 75.96% 23.60 M
SSS-ResNetXt-4135 [38] 93.00% 75.93% 12.40 M 3.23
SASL36 [52] 92.82% 75.76% 1.91
AOFP-C137 [46] 92.69% 75.63% 2.58
ResRep ResNet-50 238 [50] 92.55% 75.49% 1.44
Taylor-FO-BN-81%39 [45] 75.48% 17.90 M 2.66
SSS-ResNet-4140 [38] 92.61% 75.44% 25.30 M 3.47
MetaPruning 0.75 ResNet-5041 [43] 75.40% 2.00
ShaResNet-5042 [65] 92.59% 75.39% 20.50 M
MobileNetV2(alpha = 1.4)43 [59] 92.42% 75.23% 4.40 M
ResNet-50 Variational44 [73] 92.10% 75.20% 15.30 M
GBN-5045 [44] 92.41% 75.18% 11.91 M 1.71
SASL*46 [52] 92.47% 75.15% 1.67
AOFP-C247 [46] 92.28% 75.11% 1.66
ResNet-50 FPGM-only 30%48 [39] 92.40% 75.03% 2.23
SSS-ResNetXt-3849 [38] 92.50% 74.98% 10.70 M 2.43
ResNet-50 HRank 150 [48] 92.33% 74.98% 16.15 M 2.30
DenseNet12151 [56] 92.26% 74.97% 7.00 M
DCP52 [37] 92.32% 74.95% 12.41 M 1.69
ResNet5053 [53] 92.06% 74.93% 23.60 M
MobilenetV254 [59] 74.70% 6.90 M 0.58
MobileNetV2(alpha = 1.3)55 [59] 92.12% 74.68% 3.80 M
SFP56 [68] 92.06% 74.61% 2.19
SSS-ResNetXt-35-A57 [38] 92.17% 74.57% 10.00 M 2.07
C-SGD-5058 [42] 92.09% 74.54% 1.71
Taylor-FO-BN-72%59 [45] 74.50% 14.20 M 2.25
LFPC60 [49] 92.04% 74.46% 1.60
NASNetMobile61 [70] 91.85% 74.37% 4.30 M
SSS-ResNet-3262 [38] 91.91% 74.18% 18.60 M 2.82
GFP ResNet-50 363 [51] 73.94% 1.02
Pruned-9064 [29] 91.60% 73.56% 23.89 M 3.58
MetaPruning 0.5 ResNet-5065 [43] 73.40% 1.00
ShaResNet-3466 [65] 90.58% 73.27% 13.60 M
SSS-ResNetXt-35-B67 [38] 91.58% 73.17% 8.50 M 1.55
Pruned-7568 [29] 91.27% 72.89% 21.47 M 3.19
NISP-50-A69 [33] 72.75% 18.63 M 2.76


Table 3 (continued).
GDP 0.770 [34] 91.05% 72.61% 2.24
ResNet-34-pruned-A71 [26] 72.56% 19.90 M 3.08
ResNet-34-pruned-C72 [26] 72.48% 20.10 M 3.37
NISP-34-A73 [33] 72.29% 15.74 M 2.62
ResNet-50 SSR-L2,0 A74 [40] 91.73% 72.29% 15.50 M 1.90
ResNet-34-pruned-B75 [26] 72.17% 19.30 M 2.76
ResNet-50 SSR-L2,1 A76 [40] 91.57% 72.13% 15.90 M 1.90
NISP-50-B77 [33] 72.07% 14.36 M 2.13
ThiNet-7078 [30] 90.67% 72.04% 16.94 M 4.88
ResNet-50 HRank 279 [48] 91.01% 71.98% 13.77 M 1.55
GDP 0.680 [34] 90.71% 71.89% 1.88
SSS-ResNet-2681 [38] 90.79% 71.82% 15.60 M 2.33
Taylor-FO-BN-56%82 [45] 71.69% 7.90 M 1.34
NISP-34-B83 [33] 71.65% 12.17 M 2.02
ResNet-50 SSR-L2,0 B84 [40] 91.19% 71.47% 12.00 M 1.70
MobileNetV2(alpha = 1.0)85 [59] 90.14% 71.34% 2.30 M
ResNet-50 SSR-L2,1 B86 [40] 91.29% 71.15% 12.20 M 1.70
ThiNet-5087 [30] 90.02% 71.01% 12.38 M 3.41
GDP 0.588 [34] 90.14% 70.93% 1.57
Pruned-5089 [29] 90.03% 70.84% 17.38 M 2.52
MobileNet(alpha = 1.0)90 [58] 89.50% 70.42% 3.20 M
MobileNetV2(alpha = 0.75)91 [59] 89.18% 69.53% 1.40 M
ResNet-50 GAL-1-joint92 [41] 89.12% 69.31% 10.21 M 1.11
ResNet-50 HRank 393 [48] 89.58% 69.10% 8.27 M 0.98
GreBdec (VGG-16)94 [10] 89.06% 68.75% 9.70 M
ThiNet-3095 [30] 88.30% 68.42% 8.66 M 2.20
MobileNet(alpha = 0.75)96 [58] 88.24% 68.41% 1.80 M
GreBdec (GoogLeNet)97 [10] 88.11% 68.30% 1.50 M

For future work, we plan to delve deeper into this low-rank structure to optimize the selection of effective ranks by exploring more advanced strategies, such as integrating learnable rank functions. For instance, in certain network architectures, the effective rank shows an initial increase followed by a rapid decrease – a phenomenon we aim to account for in our approach. Moreover, since the DeCEF layer can be implemented by depthwise separable convolutions with a new training strategy, a second future direction is to modify and train the traditional depthwise separable convolutional layers in well-known networks using DeCEF to reduce the model complexity. Finally, during the experiments, we have come up with several hypotheses regarding the low rank behaviors in deep neural networks that we plan to explore. In particular, we will investigate the convergence properties, such as speed and stability with respect to initialization for the subspaces in different layers, to better understand and interpret a CNN from this perspective.

CRediT authorship contribution statement

Yinan Yu: Methodology development, Software implementation, Experimental design and analysis, Writing. Samuel Scheidegger: Methodology development, Software implementation, Experimental design and analysis, Writing. Tomas McKelvey: Methodology development, Results analysis, Review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgment

This work is partially funded by the Chalmers Artificial Intelligence Research Centre (CHAIR) through the Vermillion project.

Appendix A. Proof of Lemma 1

Proof. For each input channel i, given an additive perturbation matrix ΔI_i, let Ĩ_i = I_i + ΔI_i. Given optimal parameters of kernel j expressed as w_j^(i) = Σ_{k=1}^{r} a_{k,j}^(i) u_k^(i), which are learned from the training data, the output of the convolutional layer is

I_j = Σ_i (I_i + ΔI_i) ∗ w_j^(i)
    = Σ_i (I_i + ΔI_i) ∗ Σ_{k=1}^{r} a_{k,j}^(i) u_k^(i)
    = Σ_i I_i ∗ Σ_{k=1}^{r} a_{k,j}^(i) u_k^(i) + Σ_i ΔI_i ∗ Σ_{k=1}^{r} a_{k,j}^(i) u_k^(i)
    = I_j^∗ + Σ_i ΔI_i ∗ Σ_{k=1}^{r} a_{k,j}^(i) u_k^(i),

where the last term is the perturbation term and I_j^∗ denotes the optimal feature map. By using the infinity norm to characterize the effect of the perturbation, we have:

‖I_j − I_j^∗‖_∞ = ‖Σ_i ΔI_i ∗ Σ_{k=1}^{r} a_{k,j}^(i) u_k^(i)‖_∞   (A.1)
              ≤ Σ_i ‖ΔI_i ∗ Σ_{k=1}^{r} a_{k,j}^(i) u_k^(i)‖_∞   (A.2)

From Young's inequality:

(A.1) ≤ Σ_i ‖ΔI_i‖_2 ‖Σ_{k=1}^{r} a_{k,j}^(i) u_k^(i)‖_2
      ≤ Σ_i ‖ΔI_i‖_2 Σ_{k=1}^{r} |a_{k,j}^(i)| ‖u_k^(i)‖_2
      ≤ Σ_i ‖ΔI_i‖_2 Σ_{k=1}^{r} |a_{k,j}^(i)| ‖u_k^(i)‖_F   (A.3)
      ≤ Σ_i ‖ΔI_i‖_2 Σ_{k=1}^{r} ‖u_k^(i)‖_F ‖a_j^(i)‖_1


[ ]T
(𝑖)
where 𝐚𝑗(𝑖) = 𝑎1,𝑗 ⋯ 𝑎(𝑖)
𝑟,𝑗 and ‖ ⋅ ‖𝐹 denotes the Frobenius norm. [7] C. Tai, T. Xiao, Y. Zhang, X. Wang, et al., Convolutional neural networks with
(𝑖) (𝑖) low-rank regularization, 2015, arXiv preprint arXiv:1511.06067.
Let 𝐮̄ 𝑘 = vect(𝐮𝑘 ), where vect(⋅) denotes the vectorization of a matrix. [8] S. Lin, R. Ji, C. Chen, D. Tao, J. Luo, Holistic cnn compression via low-rank
‖ ‖
If 𝐔̄ (𝑖)T 𝐔̄ (𝑖) = 𝐈, we have ‖𝐮(𝑖) ‖ = 1 and hence decomposition with knowledge transfer, IEEE Trans. Pattern Anal. Mach. Intell.
‖ 𝑘 ‖𝐹 41 (12) (2018) 2889–2905.
∑ ‖ (𝑖) ‖ ∑ ‖ (𝑖) ‖
(A.3) ≤ 𝑟 ‖𝐚𝑗 ‖ ‖ 𝛥𝐈 ‖ ≤ 𝑟ℎ ‖𝐚𝑗 ‖ ‖𝛥𝐈 ‖ [9] Y. Ioannou, D. Robertson, J. Shotton, R. Cipolla, A. Criminisi, Training cnns
‖ ‖1 ‖ 𝑖 ‖2 ‖ ‖2 ‖ 𝑖 ‖2 with low-rank filters for efficient image classification, 2015, arXiv preprint
𝑖 𝑖
arXiv:1511.06744.
‖ ‖
where ℎ is the kernel size. If ‖𝐚(𝑖) ‖ ≤ 𝜖, ∀𝑖, 𝑗, then [10] X. Yu, T. Liu, X. Wang, D. Tao, On compressing deep models by low rank and
‖ 𝑗 ‖2 sparse decomposition, in: The IEEE Conference on Computer Vision and Pattern
‖∑ ‖ ∑ Recognition, CVPR, 2017.
‖ (𝑖) ‖ ‖𝛥𝐈𝑖 ‖
‖ 𝛥𝐈𝑖 ∗ 𝐰𝑗 ‖ ≤ 𝜖ℎ𝑟 ‖ ‖2 □ [11] J.M. Alvarez, M. Salzmann, Compression-aware training of deep networks, in:
‖ ‖
‖ 𝑖 ‖∞ 𝑖 Advances in Neural Information Processing Systems, 2017, pp. 856–867.
[12] H. Yang, M. Tang, W. Wen, F. Yan, D. Hu, A. Li, H. Li, Y. Chen, Learning
Appendix B. Network compression using DeCEF layers

One application of DeCEF is to use it as a model reduction technique for a pre-trained network. As discussed in the paper, this is not the main focus of DeCEF. Nevertheless, we propose the following algorithm for this type of application.

Algorithm 2 (DeCEFC-basenet).

Step 1: Analysis described in Procedure 1.

Step 2: For each layer, let $\bar{\mathbf{u}}^{(i)}_k$ be the columns of $\bar{\mathbf{U}}^{(i)}$. Approximate $\mathbf{w}^{(i)}_j$ by $\mathbf{w}^{(i)}_j \approx \sum_{k=1}^{r_i} a^{(i)}_{k,j} \mathbf{u}^{(i)}_k$, where $\mathbf{u}^{(i)}_k$ is obtained by reshaping $\bar{\mathbf{u}}^{(i)}_k$ into an $h \times h$ matrix.

Step 3 (optional): Network fine-tuning by freezing the eigen-filters and training the other trainable parameters.
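For concreteness, a minimal sketch of Step 2 for a single Conv2D layer is given below (NumPy, assuming the (out_channels, in_channels, h, h) weight layout). The function name and the fixed rank argument are illustrative assumptions; in Algorithm 2 the per-layer rank $r_i$ is obtained from the analysis in Step 1.

```python
import numpy as np

def decef_decompose_layer(weights: np.ndarray, rank: int):
    """Approximate each per-input-channel filter bank with `rank` eigen-filters.

    `weights` is assumed to have shape (c_out, c_in, h, h). For every input
    channel i the function returns eigen-filters U_i of shape (rank, h, h) and
    coefficients A_i of shape (rank, c_out) such that
        weights[j, i] ~= sum_k A_i[k, j] * U_i[k].
    """
    c_out, c_in, h, w = weights.shape
    assert h == w, "square kernels assumed"
    eigen_filters, coefficients = [], []
    for i in range(c_in):
        # Columns of W_bar are the vectorized filters of input channel i.
        W_bar = weights[:, i].reshape(c_out, h * h).T          # (h*h, c_out)
        # Eigen-filters: leading left singular vectors of W_bar.
        U, _, _ = np.linalg.svd(W_bar, full_matrices=False)
        U_r = U[:, :rank]                                      # orthonormal, (h*h, rank)
        A_r = U_r.T @ W_bar                                    # coefficients, (rank, c_out)
        eigen_filters.append(U_r.T.reshape(rank, h, h))
        coefficients.append(A_r)
    return eigen_filters, coefficients
```

One way to realize the approximated layer is as a depthwise convolution with the (frozen) eigen-filters followed by a $1 \times 1$ convolution holding the coefficients; the optional fine-tuning in Step 3 then trains the coefficients together with the rest of the network.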
Appendix C. Convergence video

In this supplementary material, we include videos (in the file called "effective_rank_converge_video.zip") to show some examples of how the effective ranks (cf. Eq. (2) in the paper) in each layer converge over training epochs. The title of each video indicates the layer index, i.e. the larger the index, the deeper the layer is. The network and data used here are DenseNet-121 and ImageNet, respectively.

In each video, the leftmost rectangular box shows the singular values computed from all the output channels (filters) for each input channel; each row in this figure contains the singular values for one input channel. The image in the middle is the histogram density of the effective rank. Each frame in the video shows the singular values and effective ranks computed from one epoch. Finally, the image on the right shows the convergence of this effective rank over training epochs. Note that, due to the limit on the file size, we only show the convergence for every fourth layer.
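To reproduce this kind of diagnostic, a rough sketch is given below. Since Eq. (2) is not restated in this appendix, the effective rank is approximated here by counting normalized singular values above a threshold; this surrogate definition, the threshold value and the function names are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def singular_values_per_input_channel(weights: np.ndarray) -> np.ndarray:
    """One row of singular values per input channel.

    `weights` is assumed to have shape (c_out, c_in, h, h); row i contains the
    singular values of the matrix whose columns are the vectorized h-by-h
    filters of input channel i across all output channels.
    """
    c_out, c_in, h, w = weights.shape
    rows = []
    for i in range(c_in):
        W_bar = weights[:, i].reshape(c_out, h * w).T           # (h*h, c_out)
        rows.append(np.linalg.svd(W_bar, compute_uv=False))     # descending order
    return np.vstack(rows)

def approx_effective_rank(sv: np.ndarray, tau: float = 0.01) -> np.ndarray:
    """Illustrative surrogate for Eq. (2): per input channel, count singular
    values larger than a fraction `tau` of the largest one."""
    normalized = sv / (sv[:, :1] + 1e-12)
    return (normalized > tau).sum(axis=1)

# Tracking the statistic over training (saved checkpoints are assumed to exist):
# for epoch, w in enumerate(saved_conv_weights):   # each w: (c_out, c_in, h, h)
#     ranks = approx_effective_rank(singular_values_per_input_channel(w))
#     print(epoch, ranks.mean())
```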
networks for resource efficient inference, 2016, arXiv preprint arXiv:1611.06440.
Appendix D. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.neucom.2024.128461.
Dr. Yinan Yu is an Assistant Professor in the Computer Science and Engineering Department at Chalmers University of Technology, Gothenburg, Sweden. Dr. Yu holds a Master's degree in Communication Engineering and a Ph.D. in Machine Learning and Signal Processing. Her current research focuses on automated machine learning for efficient and interpretable training and human-in-the-loop natural language processing, applied primarily to the automotive, smart manufacturing and healthcare industries.

Samuel Scheidegger received his B.Sc. degree in Mechatronics and his M.Sc. degree in Systems, Control, and Mechatronics from Chalmers University of Technology in Gothenburg, Sweden, in 2013 and 2015, respectively. Since then, he has worked in the automotive industry and co-founded two AI start-up companies, including his current position as CEO of Asymptotic AI and Lumilogic. He has been actively conducting research in the fields of deep learning, computer vision, robotics, and autonomous systems. His research interests include optimizing deep learning models for computer vision tasks, with a focus on network optimization, model compression, and alternative parameterizations, as well as advancing technologies in various domains of autonomous systems to enhance their performance.

Tomas McKelvey (Senior Member, IEEE) received the M.Sc. degree in electrical engineering from Lund University, Lund, Sweden, in 1991, and the Ph.D. degree in automatic control from Linköping University, Linköping, Sweden, in 1995. He held research and teaching positions with Linköping University from 1995 to 1999, where he became a Docent in 1999. From 1999 to 2000, he was a Visiting Researcher with the University of Newcastle, Newcastle, NSW, Australia. Since 2000, he has been with Chalmers University of Technology, Gothenburg, Sweden, where he has been a Full Professor since 2006 and the head of the Signal Processing Group since 2011. His research interests include model-based and statistical signal processing, system identification, and optimal control with applications to radar systems, electrical power systems, and vehicle propulsion systems.
