
Strengthening Layer Interaction via Dynamic Layer Attention

Kaishen Wang1, Xun Xia2, Jian Liu2, Zhang Yi1, Tao He1,*
1 College of Computer Science, Sichuan University
2 Clinical Medical College and The First Affiliated Hospital of Chengdu Medical College
[email protected], [email protected], [email protected], {zhangyi, tao he}@scu.edu.cn
*Corresponding author.
arXiv:2406.13392v1 [cs.CV] 19 Jun 2024

Abstract

In recent years, employing layer attention to enhance interaction among hierarchical layers has proven to be a significant advancement in building network structures. In this paper, we delve into the distinction between layer attention and the general attention mechanism, noting that existing layer attention methods achieve layer interaction on fixed feature maps in a static manner. These static layer attention methods limit the ability for context feature extraction among layers. To restore the dynamic context representation capability of the attention mechanism, we propose a Dynamic Layer Attention (DLA) architecture. The DLA comprises dual paths, where the forward path utilizes an improved recurrent neural network block, named Dynamic Sharing Unit (DSU), for context feature extraction. The backward path updates features using these shared context representations. Finally, the attention mechanism is applied to these dynamically refreshed feature maps among layers. Experimental results demonstrate the effectiveness of the proposed DLA architecture, which outperforms other state-of-the-art methods in image recognition and object detection tasks. Additionally, the DSU block has been evaluated as an efficient plugin in the proposed DLA architecture. The code is available at https://github.com/tunantu/Dynamic-Layer-Attention.

Figure 1: The comparison between static and dynamic layer attention architectures. (a) Static Layer Attention; (b) Dynamic Layer Attention.

1 Introduction

Numerous studies have highlighted the significance of enhancing interaction among hierarchical layers in Deep Convolutional Neural Networks (DCNNs), which have made substantial progress across various tasks. For instance, ResNet [He et al., 2016] introduced a straightforward and highly effective implementation by incorporating skip connections between two consecutive layers. DenseNet [Huang et al., 2017] further improved inter-layer interaction by recycling information from all previous layers. Meanwhile, attention mechanisms are playing an increasingly crucial role in DCNNs. The evolution of attention mechanisms in DCNNs has progressed through several stages, including channel attention ([Hu et al., 2018], [Wang et al., 2020a]), spatial attention ([Carion et al., 2020], [Wang et al., 2018]), branch attention ([Srivastava et al., 2015], [Li et al., 2019]), and temporal attention ([Xu et al., 2017], [Chen et al., 2018a]).

Recently, attention mechanisms have also been successfully applied in another direction (e.g., DIANet [Huang et al., 2020], RLANet [Zhao et al., 2021], MRLA [Fang et al., 2023]), indicating the feasibility of strengthening interaction among layers through attention mechanisms. Compared with the simple interactions in ResNet and DenseNet, the introduction of attention mechanisms makes the interaction among layers more effective. DIANet employed a parameter-shared LSTM module along the network's depth to facilitate interaction among layers. RLANet proposed a layer aggregation structure that reuses features of previous layers to enhance layer interaction. MRLA first introduced the concept of layer attention, treating each feature as a token that learns useful information from the others through attention mechanisms.

However, we have identified a common drawback in existing layer attention mechanisms: they are applied in a static manner, which limits inter-layer information interaction. In channel and spatial attention, for an input x ∈ R^{C×H×W}, the tokens fed into the attention module are all generated from x concurrently. In existing layer attention, by contrast, features generated at different time steps are treated as tokens and passed into the attention module, as shown in Figure 1(a). Since tokens generated earlier do not change once produced, the input tokens are relatively static, leading to a reduction in information interaction between the current layer and previous layers. Figure 2(a) visualizes the MRLA attention scores from stage 3 of ResNet-56 trained on CIFAR-100. When the first 5 layers reuse information from previous layers through static layer attention, the key values from only one specific layer are activated, with almost zero attention assigned to other layers. This observation verifies that static layer attention compromises the efficiency of information interaction among layers.
Figure 2: Visualization of attention scores from stage 3 of ResNet-56 with static and with the proposed dynamic layer attention on the CIFAR-100 dataset. (a) Static Layer Attention; (b) Dynamic Layer Attention.

To solve the static problem of layer attention, we propose a novel Dynamic Layer Attention (DLA) architecture to improve information flow among layers, where the information of previous layers can be dynamically modified during feature interaction. As shown in Figure 2(b), during the reutilization of information from preceding layers, the attention of the current feature shifts from exclusively focusing on a particular layer to incorporating information from various layers. DLA facilitates a more thorough exploitation of information, enhancing the efficiency of inter-layer information interaction. Experimental results demonstrate the effectiveness of the proposed DLA architecture, which outperforms other state-of-the-art methods in image recognition and object detection tasks. The contributions of this paper are summarized as follows:

• We propose a novel DLA architecture consisting of dual paths, where the forward path extracts context features among layers using a Recurrent Neural Network (RNN) and the backward path refreshes the original feature at each layer using these shared context representations.

• A novel RNN block, named Dynamic Sharing Unit (DSU), is proposed as a suitable component for DLA. It effectively promotes the dynamic modification of information within DLA and demonstrates commendable performance in layer-wise information integration as well.

2 Related Work

Layer Interaction. Layer interaction has consistently been an intriguing aspect of DCNNs. Initially, the implementation of layer interaction was relatively simple. ResNet [He et al., 2016] introduced skip connections between consecutive layers, mitigating issues of gradient vanishing and exploding to some extent. DenseNet [Huang et al., 2017] further enhanced layer interaction by reusing information generated from all preceding layers. In U-Net [Ronneberger et al., 2015], a commonly utilized architecture in medical segmentation, the encoder and decoder are connected through skip connections to improve feature extraction and achieve higher accuracy.

Recent studies have explored more effective methods for implementing layer interaction. DREAL [Li and Chen, 2020] optimized parameters by introducing arbitrary attention modules and employed a Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] to incorporate previous attention weights; the parameters of both the LSTM and the attention layers were updated through deep reinforcement learning. DIANet [Huang et al., 2020] incorporated an LSTM module at the layer level, constructing a DIA block shared by all layers of the entire network to facilitate inter-layer interaction. RealFormer [He et al., 2020] integrated residual connections between adjacent attention modules, adding the attention scores from the previous layer to the current one. RLANet [Zhao et al., 2021] introduced a lightweight recurrent layer aggregation module to describe how information from previous layers can be efficiently reused for better feature extraction in the current layer. [Zhao et al., 2022] proposed a straightforward and versatile approach to strike a balance between effectively utilizing neural network information and maintaining high computational efficiency; by seamlessly integrating features from preceding layers, it fosters effective interaction of information. MRLA [Fang et al., 2023] treated the features of each layer as tokens, enabling interaction among different hierarchical layers through an attention mechanism and further strengthening the interaction among layers.

Dynamic Network Architecture. Dynamic networks represent a type of neural network structure whose topology or parameters can change dynamically at runtime, providing the network with considerable flexibility or other benefits. [Wang et al., 2020b] proposed a universally adaptable inference framework for the majority of DCNNs, whose cost can be easily adjusted dynamically without additional training. RANet [Yang et al., 2020] introduced an adaptive network that can effectively reduce the spatial redundancy involved in inferring high-resolution inputs. [Han et al., 2021] have shown that dynamic neural networks can enhance the efficiency of deep networks. CondenseNetV2 [Yang et al., 2021] utilized a Sparse Feature Reactivation (SFR) module to reactivate pruned features from CondenseNet [Huang et al., 2018], thereby enhancing feature utilization efficiency. To the best of our knowledge, this paper is the first attempt to build a dynamic network architecture for strengthening layer interaction.
3 Dynamic Layer Attention

We will start by reconsidering the current layer attention architecture and elucidating its static nature. Subsequently, we will introduce the Dynamic Layer Attention (DLA). Finally, we will present an enhanced RNN plugin block named the Dynamic Sharing Unit (DSU), integrated within the DLA architecture.

3.1 Rethinking Layer Attention

Layer attention was recently defined by [Fang et al., 2023] and is illustrated in Figure 1(a), where the attention mechanism enhances layer interaction. [Fang et al., 2023] focused on reducing the computational cost of layer attention and proposed the Recurrent Layer Attention (RLA) architecture. In RLA, features from distinct layers are treated as tokens and undergo computations, ultimately producing the attention output.

Let the feature output of the l-th layer be x^l ∈ R^{C×W×H}. The vectors Q^l, K^l, and V^l are calculated as follows:

Q^l = f_q^l(x^l), \quad K^l = \mathrm{Concat}\big[f_k^1(x^1), \ldots, f_k^l(x^l)\big], \quad V^l = \mathrm{Concat}\big[f_v^1(x^1), \ldots, f_v^l(x^l)\big],   (1)

where f_q^l is a mapping function that extracts information from the l-th layer, while f_k^i and f_v^i are the corresponding mapping functions intended to extract information from the 1st to the l-th layers, respectively. The attention output o^l is given as follows:

o^l = \mathrm{Softmax}\Big(\frac{Q^l (K^l)^{\mathsf T}}{\sqrt{D_k}}\Big) V^l = \sum_{i=1}^{l} \mathrm{Softmax}\Big(\frac{Q^l \big[f_k^i(x^i)\big]^{\mathsf T}}{\sqrt{D_k}}\Big) f_v^i(x^i),   (2)

where D_k serves as the scaling factor. A lightweight version of RLA was proposed to recurrently update the attention output o^l as follows:

o^l = \lambda_o^l \odot o^{l-1} + \mathrm{Softmax}\Big(\frac{Q^l \big[f_k^l(x^l)\big]^{\mathsf T}}{\sqrt{D_k}}\Big) f_v^l(x^l),   (3)

where \lambda_o^l is a learnable vector and ⊙ denotes element-wise multiplication. With a multi-head structure design, Multi-head RLA (MRLA) is introduced.
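To make the formulation above concrete, here is a minimal PyTorch-style sketch of Eqs. (1)-(2): the current layer provides the query, while the keys and values are built from the pooled tokens of layers 1 to l, which stay fixed once produced (the "static" behaviour discussed below). This is an illustration written for this text rather than the authors' released code; the class name, the use of global average pooling to form tokens, and the single shared f_k/f_v mappings are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StaticLayerAttention(nn.Module):
    """Sketch of layer attention, Eqs. (1)-(2): query from layer l,
    keys/values from the fixed tokens of layers 1..l."""

    def __init__(self, channels: int):
        super().__init__()
        self.f_q = nn.Linear(channels, channels)
        # One shared mapping is used for brevity; Eq. (1) allows per-layer f_k^i, f_v^i.
        self.f_k = nn.Linear(channels, channels)
        self.f_v = nn.Linear(channels, channels)
        self.scale = channels ** -0.5              # 1 / sqrt(D_k)

    def forward(self, x_l, prev_tokens):
        # x_l: (B, C, H, W) feature map of layer l.
        # prev_tokens: list of (B, C) tokens from layers 1..l-1 that never
        # change once produced, which is exactly the static issue above.
        y_l = x_l.mean(dim=(2, 3))                 # pool the feature map to a token
        tokens = prev_tokens + [y_l]
        q = self.f_q(y_l)                                          # (B, C)
        k = torch.stack([self.f_k(t) for t in tokens], dim=1)      # (B, l, C)
        v = torch.stack([self.f_v(t) for t in tokens], dim=1)      # (B, l, C)
        scores = torch.einsum("bc,blc->bl", q, k) * self.scale     # Q^l (K^l)^T / sqrt(D_k)
        o_l = torch.einsum("bl,blc->bc", F.softmax(scores, dim=-1), v)  # Eq. (2)
        return o_l, tokens
```

DLA keeps this attention computation; what changes is that the tokens handed to it are refreshed by the dual-path update introduced in Section 3.3, instead of being frozen at the moment they were produced.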
3.2 Motivation

MRLA successfully integrated the attention mechanism to enhance layer interaction while effectively controlling computational cost. However, when MRLA is applied at the l-th layer, a preceding feature output x^m (m < l) has already been generated in the m-th layer, with no subsequent changes. Consequently, the information processed by MRLA comprises fixed features from the previous layers. In contrast, widely used attention-based models, such as channel attention [Hu et al., 2018; Wang et al., 2020a], spatial attention [Carion et al., 2020; Huang et al., 2019], and Transformers [Vaswani et al., 2017; Liu et al., 2021; Chen et al., 2023; Jiao et al., 2023], pass tokens into the attention module that are generated simultaneously. Applying the attention module to freshly generated tokens ensures that each token consistently learns up-to-date features. Therefore, we categorize MRLA as a static layer attention mechanism, which limits interaction between the current layer and shallower layers.

In a general self-attention mechanism, the feature x^m serves two purposes: conveying essential information and representing context. The essential information extracted at the current layer distinguishes it from that at other layers. Meanwhile, the context representation captures the changes and evolution of features along the temporal axis, a critical aspect in determining feature freshness. In the general attention mechanism, essential information is generated at each layer, and the context representation is transferred to the next layer for calculating the attention output. In layer attention, by contrast, once tokens are generated, attention is calculated with a fixed context representation, diminishing the efficiency of the attention mechanism. Therefore, this paper aims to establish a novel method to restore the context representation, ensuring that the information fed into the layer attention is consistently dynamic.

3.3 Dynamic Layer Attention Architecture

To address the static issue of MRLA, we propose a dynamic updating rule that extracts the context representation and promptly updates the features of previous layers, resulting in a Dynamic Layer Attention (DLA) architecture. As illustrated in Figure 1(b), DLA consists of dual paths: forward and backward. In the forward path, a Recurrent Neural Network (RNN) is employed for context feature extraction. Let the RNN block be denoted as "Dyn" and the initial context as c^0, which is randomly initialized. Given an input x^m ∈ R^{C×W×H} with m < l, Global Average Pooling (GAP) is applied to extract global features at the m-th layer:

y^m = \mathrm{GAP}(x^m), \quad y^m \in \mathbb{R}^C.   (4)

The context representation is then extracted as follows:

c^m = \mathrm{Dyn}(y^m, c^{m-1}; \theta^l),   (5)

where \theta^l represents the shared trainable parameters of "Dyn". Once the context c^l is calculated, the features of each layer are simultaneously updated in the backward path as follows:

d^m = \mathrm{Dyn}(y^m, c^l; \theta^l), \quad x^m \leftarrow x^m \odot d^m.   (6)

Referring to Equation (5), the forward context feature extraction is a step-by-step process with a computation complexity of O(n). Meanwhile, the feature updating in Equation (6) can be performed in parallel, resulting in a computation complexity of O(1). After updating x^m, the basic version of DLA uses Equation (2) to compute the layer attention, abbreviated as DLA-B. For the lightweight version of DLA, we simply update o^{l-1} and then use Equation (3) to obtain DLA-L.

Computation Efficiency. DLA possesses several advantages in its structural design. Firstly, global information is condensed to compute the context information, a utility that has been validated in [Huang et al., 2020]. Secondly, DLA employs shared parameters within the RNN block. Thirdly, the context c^l is fed into the feature maps of each layer separately and in parallel, and both the forward and backward paths share the same parameters throughout the entire network. Finally, we introduce an efficient RNN block for calculating the context representation, which will be elucidated in the following subsection. With these efficiently designed structural rules, the computation cost and network capacity are kept under control.
Figure 3: A comparison of the LSTM, DIA, and the proposed DSU blocks. (a) LSTM; (b) DIA; (c) DSU.

3.4 Dynamic Sharing Unit

LSTM, as depicted in Figure 3(a), is designed for processing sequential data and learning temporal features, enabling it to capture and store information over long sequences. However, the fully connected linear transformations in LSTM significantly increase the network capacity when LSTM is embedded as the recurrent block in DLA. To mitigate this capacity increase, a variant LSTM block named the DIA unit was proposed by [Huang et al., 2020], as illustrated in Figure 3(b). Before feeding data into the network, DIA first utilizes a linear transformation followed by a ReLU activation function to reduce the input dimension. Additionally, DIA replaces the Tanh function with a Sigmoid function at the output layer.

LSTM and DIA generate two outputs: a hidden vector h^m and a cell state vector c^m. Typically, h^m is used as the output vector, while c^m serves as the memory vector. DLA is exclusively focused on extracting context characteristics from different layers, so the RNN block has no duty to expose its internal state feature to the outside. Consequently, we discard the output gate and merge the memory and hidden vectors by omitting the h^m symbol. The proposed simplified RNN block is named the Dynamic Sharing Unit (DSU), and its workflow is illustrated in Figure 3(c). Specifically, before combining c^{m-1} with y^m, we first normalize c^{m-1} using an activation function σ(·); here, we opt for the Sigmoid function (σ(z) = 1/(1 + e^{-z})). The input to the DSU is therefore compressed as follows:

s^m = \mathrm{ReLU}\big(W_1 \left[\sigma(c^{m-1}), y^m\right]\big).   (7)

The hidden transformation, the input gate, and the forget gate are given by the following formulas:

\tilde{c}^m = \mathrm{Tanh}(W_2^c \cdot s^m + b^c), \quad i^m = \sigma(W_2^i \cdot s^m + b^i), \quad f^m = \sigma(W_2^f \cdot s^m + b^f).   (8)

Subsequently, we obtain

c^m = f^m \odot c^{m-1} + i^m \odot \tilde{c}^m.   (9)

To decrease the number of network parameters, let W_1 ∈ R^{(C/r)×2C} and W_2 ∈ R^{C×(C/r)}, where r is the reduction ratio. DSU reduces the parameters to 5C²/r, which is fewer than the 8C² of LSTM and the 10C²/r of DIA.
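The DSU update of Eqs. (7)-(9) maps directly onto a few linear layers; a minimal sketch follows. The class and attribute names are ours, and biases are left at PyTorch defaults, so treat it as an illustration of the formulas rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class DynamicSharingUnit(nn.Module):
    """Sketch of the DSU block, Eqs. (7)-(9): the previous context is squashed
    by a sigmoid, concatenated with the pooled layer feature, reduced by W1,
    and then drives a candidate state plus input/forget gates."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        self.w1 = nn.Linear(2 * channels, hidden)   # W1: (C/r) x 2C
        self.w2_c = nn.Linear(hidden, channels)     # W2^c, hidden transformation
        self.w2_i = nn.Linear(hidden, channels)     # W2^i, input gate
        self.w2_f = nn.Linear(hidden, channels)     # W2^f, forget gate
        self.relu = nn.ReLU(inplace=True)

    def forward(self, y_m: torch.Tensor, c_prev: torch.Tensor) -> torch.Tensor:
        # y_m, c_prev: (B, C)
        s = self.relu(self.w1(torch.cat([torch.sigmoid(c_prev), y_m], dim=1)))  # Eq. (7)
        c_tilde = torch.tanh(self.w2_c(s))          # candidate state, Eq. (8)
        i_gate = torch.sigmoid(self.w2_i(s))        # input gate
        f_gate = torch.sigmoid(self.w2_f(s))        # forget gate
        return f_gate * c_prev + i_gate * c_tilde   # Eq. (9)
```

Counting weights in this sketch, W_1 contributes 2C·C/r and the three W_2 heads contribute 3·C·C/r, i.e. roughly 5C²/r in total, which matches the figure quoted above. An instance of this block can play the role of `dyn` in the dual-path sketch of Section 3.3.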
4 Experiments

4.1 Image Classification

Experimental Setup. We conducted experiments on the CIFAR-10, CIFAR-100, and ImageNet-1K datasets, using ResNets [He et al., 2016] as the backbone networks for image classification. For the CIFAR-10 and CIFAR-100 datasets, we employed standard data augmentation strategies [Huang et al., 2016]: images were randomly flipped horizontally, padded by 4 pixels on each side, and then randomly cropped to 32×32, followed by normalization with mean and standard deviation adjustment. Training hyperparameters such as batch size, initial learning rate, and weight decay followed the recommendations of the original ResNets [He et al., 2016]. To address hyperparameter uncertainty, we ran each experiment five times. For the ImageNet-1K dataset, we adopted the same data augmentation strategy and hyperparameter settings outlined in [He et al., 2016] and [He et al., 2019]. During training, images were randomly cropped to 224×224 with horizontal flipping; in the testing phase, images were resized to 256×256 and then centrally cropped to 224×224. The optimization process used an SGD optimizer with a momentum of 0.9 and a weight decay of 1e-4. The initial learning rate was set to 0.1 and decreased according to the MultiStepLR schedule over 100 epochs with a batch size of 256, consistent with the ResNet approach [He et al., 2016]. Meanwhile, the reduction ratio r was set to 4 for the CIFAR-10 and CIFAR-100 datasets and to 20 for the ImageNet-1K dataset, consistent with the settings in DIANet [Huang et al., 2020].
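As a reading aid, the sketch below assembles the CIFAR-style augmentation and the SGD + MultiStepLR loop described above in PyTorch/torchvision; it is not the authors' training script. The normalization statistics, the milestone epochs, and the torchvision ResNet-18 stand-in (the paper uses CIFAR-style ResNet-20/56/110 backbones with DLA inserted) are assumptions made only for illustration.

```python
import torch
import torchvision
import torchvision.transforms as T

# CIFAR augmentation as described: random horizontal flip, 4-pixel padding,
# random 32x32 crop, then normalization (CIFAR-10 statistics assumed here).
train_tf = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomCrop(32, padding=4),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True,
                                          transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True,
                                     num_workers=4)

model = torchvision.models.resnet18(num_classes=10)   # stand-in backbone for the sketch
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# MultiStepLR over 100 epochs; the milestones below are an assumption, not from the paper.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 75], gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(100):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```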
Results on CIFAR. The experimental results, reported as accuracy ± std, are presented in Table 1. These results underscore the significant superiority of our DLA-B and DLA-L models over other challenging networks, including SE [Hu et al., 2018], ECA [Wang et al., 2020a], DIANet [Huang et al., 2020], and MRLA [Fang et al., 2023], on the CIFAR-10 and CIFAR-100 datasets. In comparison with the baselines, DLA-L's top-1 accuracy surpasses the ResNets by 1.32%, 1.60%, and 1.62% on CIFAR-10, and by 4.96%, 2.94%, and 3.41% on CIFAR-100. Furthermore, both DLA-B and DLA-L outperform MRLA-B and MRLA-L, which are typical static layer attention models. It is noteworthy that DLA-L-20 (embedding ResNet-20) has fewer parameters than ResNet-56 while maintaining comparable top-1 accuracy on the CIFAR-10 (92.96% vs. 92.95%) and CIFAR-100 (72.26% vs. 72.32%) datasets. Moreover, DLA-L-56 outperforms ResNet-110 by 1.51% and 2.46% on the CIFAR-10 and CIFAR-100 datasets, respectively. As depicted in Figure 4, ResNet, DIANet, and MRLA-L exhibit relatively slow growth in capability with increasing network depth. In contrast, the proposed DLA demonstrates a faster increase in test accuracy on the CIFAR-10 and CIFAR-100 datasets as the network depth increases. This observation verifies that strengthening layer interaction through DLA is more beneficial in deeper network structures.

Model          #P(M)  Top-1 (CIFAR-10)   #P(M)  Top-1 (CIFAR-100)
ResNet-20      0.22   91.64±0.18         0.24   67.30±0.28
SE             0.24   91.29±0.24         0.27   68.93±0.35
ECA            0.22   91.63±0.16         0.24   67.23±0.24
DIANet         0.44   91.43±0.14         0.46   67.67±0.22
MRLA-B         0.23   92.15±0.23         0.25   71.44±0.49
DLA-B (ours)   0.41   92.47±0.10         0.43   72.01±0.37
MRLA-L         0.23   92.65±0.08         0.25   71.46±0.27
DLA-L (ours)   0.41   92.96±0.18         0.43   72.26±0.29
ResNet-56      0.59   92.95±0.18         0.61   72.32±0.36
SE             0.66   93.60±0.18         0.68   73.51±0.28
ECA            0.59   93.72±0.14         0.61   72.63±0.35
DIANet         0.81   93.88±0.21         0.83   73.87±0.27
MRLA-L         0.62   94.28±0.26         0.65   74.18±0.17
DLA-L (ours)   0.80   94.55±0.13         0.82   75.46±0.26
ResNet-110     1.15   93.04±0.33         1.17   73.00±0.36
SE             1.28   94.09±0.11         1.30   75.01±0.20
ECA            1.15   93.76±0.31         1.17   73.97±0.36
DIANet         1.37   94.48±0.08         1.39   75.31±0.16
MRLA-L         1.21   94.49±0.31         1.24   75.16±0.24
DLA-L (ours)   1.39   94.66±0.23         1.41   76.41±0.36

Table 1: Testing accuracy (%) on the CIFAR-10 and CIFAR-100 datasets. "#P(M)" denotes the number of parameters (million).

Figure 4: Comparison of testing accuracy of different models at different depths. (a) Testing accuracy on CIFAR-10; (b) testing accuracy on CIFAR-100.

Results on ImageNet-1K. We compare the DLA-L architecture with other challenging methods using ResNets as baselines. The experimental results, shown in Table 2, indicate that our model significantly outperforms the others. Firstly, our DLA-L improves top-1 accuracy over ResNet-50 and ResNet-101 by 1.9% and 1.5%, respectively. Then, compared with channel attention methods, DLA-L-50 has 0.9M fewer parameters than SE-50 and CBAM-50 [Woo et al., 2018], yet its top-1 accuracy is higher by 1.3% and 0.7%; DLA-L-101 has 1.5M fewer parameters than SE-101 and CBAM-101 [Woo et al., 2018] while achieving 1.3% and 0.4% higher top-1 accuracy, respectively. Meanwhile, compared with the lightweight models, DLA-L-50 and DLA-L-101 retain 0.5% and 0.2% higher top-1 accuracy than ECA-50 and ECA-101. DLA-L also outperforms other popular layer interaction models: with 1.2M fewer parameters than DIANet-50, DLA-L-50 achieves 0.8% higher top-1 accuracy than DIANet-50, and although DLA-L introduces more parameters than RLAg [Zhao et al., 2021], it achieves 0.8% and 0.4% higher top-1 accuracy than RLAg-50 and RLAg-101, respectively. Additionally, DLA-L surpasses recent models such as A2 [Chen et al., 2018b], AA [Bello et al., 2019], and GC [Cao et al., 2019]. Finally, in comparison with static layer attention, DLA-L shows increases of 0.3% and 0.2% in top-1 accuracy over MRLA-L-50 and MRLA-L-101 with a tolerable increase in the number of parameters. In summary, through comparisons with well-known channel attention models, layer interaction models, and various other challenging models, we have validated that the proposed DLA architecture serves as a more effective layer interaction model, outperforming other models in the domain of image classification.

Model          Params  FLOPs  Top-1  Top-5
ResNet-50      25.6M   4.1B   76.1   92.9
SE             28.1M   4.1B   76.7   93.4
CBAM           28.1M   4.2B   77.3   93.7
A2             34.6M   7.0B   77.0   93.5
AA             27.1M   4.5B   77.7   93.8
GC             29.4M   4.2B   77.7   93.7
ECA            25.6M   4.1B   77.5   93.7
DIANet         28.4M   -      77.2   -
RLAg           25.9M   4.5B   77.2   93.4
MRLA-L         25.7M   4.2B   77.7   93.8
DLA-L (ours)   27.2M   4.3B   78.0   94.0
ResNet-101     44.5M   7.8B   77.4   93.5
SE             49.3M   7.8B   77.6   93.9
CBAM           49.3M   7.9B   78.5   94.3
AA             47.6M   8.6B   78.7   94.4
ECA            44.5M   7.8B   78.7   94.3
RLAg           45.0M   8.4B   78.5   94.2
MRLA-L         44.9M   7.9B   78.7   94.4
DLA-L (ours)   47.8M   8.1B   78.9   94.5

Table 2: Comparisons of accuracy (%) on the ImageNet-1K validation set (all results of the compared methods are taken from their original papers).
4.2 Object Detection

Experimental Setup. For object detection, our approach was evaluated on the COCO2017 dataset using the Faster R-CNN [Ren et al., 2015] and Mask R-CNN [He et al., 2017] frameworks as detectors. All detectors were implemented with the MMDetection toolkit [Chen et al., 2019], following the default settings. In the preprocessing stage, the shorter side of the input images was resized to 800 pixels. The optimization process employed SGD with a weight decay of 1e-4, a momentum of 0.9, and a batch size of 8. The models were trained for a total of 12 epochs, starting with an initial learning rate of 0.01, which was reduced by a factor of 10 at the 8th and 11th epochs.

Results on COCO2017. As shown in Table 3, when using Faster R-CNN as the detector, our proposed DLA-L demonstrates a remarkable improvement in average precision (AP) of 4.2% and 3.6% over ResNet-50 [He et al., 2016] and ResNet-101 [He et al., 2016], respectively, which validates the outstanding capability of the DLA architecture in enhancing object detection performance. Notably, DLA-L outperforms the other models: compared with the challenging channel attention block, DLA-L-50 has a parameter count similar to SE-50 [Hu et al., 2018] but achieves a 2.9% higher AP and a 3.3% improvement on AP75. Compared with representative layer interaction models, DLA-L continues to excel: despite introducing more parameters than RLAg [Zhao et al., 2021], it increases AP by 1.8% and 1.1% for DLA-L-50 and DLA-L-101, respectively. Meanwhile, in contrast to the static layer attention model MRLA-L [Fang et al., 2023], our DLA-L achieves respective increases of 0.2% and 0.3% in AP. When utilizing Mask R-CNN as the detector, our DLA-L also outperforms the aforementioned models, and it additionally surpasses NL [Wang et al., 2018], GC [Cao et al., 2019], BA [Zhao et al., 2022], and other models, showcasing the remarkable potential of the DLA architecture in facilitating dynamic modification of inter-layer information, even with the introduction of a tolerable parameter increment.

Faster R-CNN:
Methods                        Params  AP    AP50  AP75  APS   APM   APL
ResNet-50 [He et al., 2016]    41.5M   36.4  58.2  39.2  21.8  40.0  46.2
SE [Hu et al., 2018]           44.0M   37.7  60.1  40.9  22.9  41.9  48.2
ECA [Wang et al., 2020a]       41.5M   38.0  60.6  40.9  23.4  42.1  48.0
RLAg [Zhao et al., 2021]       41.8M   38.8  59.6  42.0  22.5  42.9  49.5
BA [Zhao et al., 2022]         44.7M   39.5  61.3  43.0  24.5  43.2  50.6
MRLA-L [Fang et al., 2023]     41.7M   40.4  61.5  44.0  24.2  44.1  52.7
DLA-L (ours)                   44.2M   40.6  61.6  44.2  24.5  44.2  52.9
ResNet-101 [He et al., 2016]   60.5M   38.7  60.6  41.9  22.7  43.2  50.4
SE [Hu et al., 2018]           65.2M   39.6  62.0  43.1  23.7  44.0  51.4
ECA [Wang et al., 2020a]       60.5M   40.3  62.9  44.0  24.5  44.7  51.3
RLAg [Zhao et al., 2021]       60.9M   41.2  61.8  44.9  23.7  45.7  53.8
MRLA-L [Fang et al., 2023]     60.9M   42.0  63.1  45.7  25.0  45.8  55.4
DLA-L (ours)                   63.4M   42.3  63.3  45.8  25.2  46.0  55.5

Mask R-CNN:
Methods                        Params  AP    AP50  AP75  APS   APM   APL
ResNet-50 [He et al., 2016]    44.2M   37.2  58.9  40.3  34.1  55.5  36.2
SE [Hu et al., 2018]           46.7M   38.7  60.9  42.1  35.4  57.4  37.8
ECA [Wang et al., 2020a]       44.2M   39.0  61.3  42.1  35.6  58.1  37.7
NL [Wang et al., 2018]         46.5M   38.0  59.8  41.0  34.7  56.7  36.6
GC [Cao et al., 2019]          54.4M   39.9  62.2  42.9  36.2  58.7  38.3
RLAg [Zhao et al., 2021]       44.4M   39.5  60.1  43.4  35.6  56.9  38.0
BA [Zhao et al., 2022]         47.3M   40.5  61.7  44.2  36.6  58.7  38.6
MRLA-L [Fang et al., 2023]     44.3M   41.2  62.3  45.1  37.1  59.1  39.6
DLA-L (ours)                   46.9M   41.4  62.5  45.3  37.2  59.3  39.7
ResNet-101 [He et al., 2016]   63.2M   39.4  60.9  43.3  35.9  57.7  38.4
SE [Hu et al., 2018]           67.9M   40.7  62.5  44.3  36.8  59.3  39.2
ECA [Wang et al., 2020a]       63.2M   41.3  63.1  44.8  37.4  59.9  39.8
NL [Wang et al., 2018]         65.5M   40.8  63.1  44.5  37.1  59.9  39.2
GC [Cao et al., 2019]          82.2M   41.7  63.7  45.5  37.6  60.5  39.8
RLAg [Zhao et al., 2021]       63.6M   41.8  62.3  46.2  37.3  59.2  40.1
MRLA-L [Fang et al., 2023]     63.5M   42.8  63.6  46.5  38.4  60.6  41.0
DLA-L (ours)                   66.1M   42.9  63.8  46.7  38.6  60.9  41.2

Table 3: Object detection results on COCO val2017 with Faster R-CNN and Mask R-CNN.

5 Ablation Study

5.1 Evaluating Different RNN Blocks in DLA-L

To validate the effectiveness of the proposed DSU block in implementing dynamic layer attention, we incorporated the original RNN block, the DIA block, and the LSTM block as plugins into DLA in place of the DSU block for comparative experiments. Due to limitations in computational resources, our ablation experiments were conducted using ResNet-110 as the baseline on CIFAR-100. Each experiment was run five times, and the results are expressed as accuracy ± std.

Block          Params  Top-1 acc. (%)
Original RNN   1.41M   73.65±0.23
LSTM           1.93M   75.32±0.34
DIA            1.46M   75.60±0.36
DSU            1.39M   76.41±0.36

Table 4: Testing accuracy of different RNN blocks in DLA-110 on the CIFAR-100 dataset.

As illustrated in Table 4, the dynamic layer attention implemented with the four RNN blocks exhibits varying performance. The original RNN block shows the poorest performance, with only a small top-1 accuracy increase over the baseline. The LSTM block and the DIA block demonstrate comparable performance, outperforming the baseline by 2.32% and 2.60%, respectively. However, the LSTM block has an additional parameter count of 0.47M compared to the DIA block, making it impractical to apply LSTM in DLA. On the other hand, our proposed DSU block achieves superior results with even fewer parameters: compared with the DIA block, it has 0.07M fewer parameters yet yields a 0.81% improvement in top-1 accuracy. It can therefore be inferred that the proposed DSU is the most effective block among the examined RNN blocks for implementing DLA.

5.2 Evaluating the Introduced σ Function in DSU

Compared with other RNN blocks, our DSU block first normalizes c^{m-1} using an activation function, for which we employ the Sigmoid function σ(z) = 1/(1 + e^{-z}). To examine the role of the σ function in DSU, we substitute it with various alternatives, including the identity mapping, Tanh, and ReLU functions.

Model        Function           Top-1 acc. (%)
DLA-L-56     Identity mapping   74.63±0.14
DLA-L-56     Tanh               74.88±0.21
DLA-L-56     ReLU               74.95±0.40
DLA-L-56     Sigmoid            75.46±0.26
DLA-L-110    Identity mapping   75.37±0.38
DLA-L-110    Tanh               75.67±0.20
DLA-L-110    ReLU               75.99±0.29
DLA-L-110    Sigmoid            76.41±0.36

Table 5: Testing accuracy of different activation functions used for normalizing c^{m-1} in DLA-L on the CIFAR-100 dataset.

As shown in Table 5, we conducted experiments on the CIFAR-100 dataset using DLA-L-56 and DLA-L-110. Firstly, we observed that normalizing c^{m-1} in DSU is essential: including an activation function results in significantly higher top-1 accuracy than the identity mapping. Secondly, among the various activation functions, the Sigmoid function proves effective in scaling c^{m-1} to the range (0, 1), facilitating better fusion with y^m as input to the DSU block. The experimental results confirm this observation, achieving impressive top-1 accuracy of 75.46% and 76.41% on DLA-L-56 and DLA-L-110, respectively, far surpassing the baseline.

5.3 Evaluating the DSU Block in Layer-wise Information Integration

We also evaluate whether the proposed DSU block can be deployed in other layer interaction methods. [Huang et al., 2020] introduced a Layer-wise Information Integration (LII) architecture that shares an RNN block across different network layers, and DIANet serves as a simple and effective form of the LII architecture, demonstrating notable achievements in image classification. We evaluated the performance of the proposed DSU, LSTM, and DIA blocks when integrating them into LII, employing ResNets as the backbone and conducting experimental comparisons on the CIFAR-100 dataset.

Block        Params  Top-1 acc. (%)
ResNet-56    0.61M   72.32±0.36
LII (LSTM)   1.31M   69.28±0.44
LII (DIA)    0.83M   73.87±0.27
LII (DSU)    0.79M   74.23±0.22
ResNet-110   1.17M   73.00±0.36
LII (LSTM)   1.86M   71.31±0.33
LII (DIA)    1.39M   75.31±0.16
LII (DSU)    1.35M   75.14±0.21

Table 6: Testing accuracy of different RNN blocks in layer-wise information integration on the CIFAR-100 dataset.

As shown in Table 6, the experimental results demonstrate that the RNN blocks exhibit favorable performance when applied to LII. On the CIFAR-100 dataset, our DSU block shows improvements of 1.91% and 2.14% over ResNet-56 and ResNet-110, respectively. While introducing 0.04M fewer parameters than the DIA block, the DSU block outperforms the DIA block by 0.36% on ResNet-56 and performs comparably with it on ResNet-110. On the contrary, the LSTM block has the highest number of parameters yet performs the worst. It can therefore be concluded that the DSU block is also a competitive choice for LII, achieving results comparable to the DIA block in the LII architecture while having fewer parameters.

6 Conclusion

In this paper, we first unveil the inherently static nature of existing layer attention mechanisms and analyze their limitations in facilitating feature extraction through layer interaction. To address these limitations and restore the dynamic features of attention mechanisms, we propose and construct a framework called Dynamic Layer Attention (DLA). To implement DLA, we design a novel RNN block named the Dynamic Sharing Unit (DSU). Experimental evaluations in the domains of image classification and object detection demonstrate that our framework significantly outperforms static layer attention in promoting layer interaction.
Acknowledgments

This work was supported by the National Major Science and Technology Projects of China under Grant 2018AAA0100201, the National Natural Science Foundation of China under Grant 62206189, and the China Postdoctoral Science Foundation under Grant 2023M732427.

References

[Bello et al., 2019] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V Le. Attention augmented convolutional networks. In ICCV, pages 3286-3295, 2019.
[Cao et al., 2019] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In ICCV, pages 0-0, 2019.
[Carion et al., 2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213-229, 2020.
[Chen et al., 2018a] Dapeng Chen, Hongsheng Li, Tong Xiao, Shuai Yi, and Xiaogang Wang. Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In CVPR, pages 1169-1178, 2018.
[Chen et al., 2018b] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A^2-Nets: Double attention networks. NeurIPS, 2018.
[Chen et al., 2019] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[Chen et al., 2023] Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, Xiaokang Yang, and Fisher Yu. Dual aggregation transformer for image super-resolution. In ICCV, pages 12312-12321, 2023.
[Fang et al., 2023] Yanwen Fang, Yuxi Cai, Jintai Chen, Jingyu Zhao, Guangjian Tian, and Guodong Li. Cross-layer retrospective retrieving via layer attention. arXiv preprint arXiv:2302.03985, 2023.
[Han et al., 2021] Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. Dynamic neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 7436-7456, 2021.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[He et al., 2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961-2969, 2017.
[He et al., 2019] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. In CVPR, pages 558-567, 2019.
[He et al., 2020] Ruining He, Anirudh Ravula, Bhargav Kanagal, and Joshua Ainslie. RealFormer: Transformer likes residual attention. arXiv preprint arXiv:2012.11747, 2020.
[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, pages 1735-1780, 1997.
[Hu et al., 2018] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132-7141, 2018.
[Huang et al., 2016] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, pages 646-661, 2016.
[Huang et al., 2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, pages 4700-4708, 2017.
[Huang et al., 2018] Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q Weinberger. CondenseNet: An efficient DenseNet using learned group convolutions. In CVPR, pages 2752-2761, 2018.
[Huang et al., 2019] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-cross attention for semantic segmentation. In ICCV, pages 603-612, 2019.
[Huang et al., 2020] Zhongzhan Huang, Senwei Liang, Mingfu Liang, and Haizhao Yang. DIANet: Dense-and-implicit attention network. In AAAI, pages 4206-4214, 2020.
[Jiao et al., 2023] Jiayu Jiao, Yu-Ming Tang, Kun-Yu Lin, Yipeng Gao, Jinhua Ma, Yaowei Wang, and Wei-Shi Zheng. DilateFormer: Multi-scale dilated transformer for visual recognition. IEEE Transactions on Multimedia, 2023.
[Li and Chen, 2020] Duo Li and Qifeng Chen. Deep reinforced attention learning for quality-aware visual recognition. In ECCV, pages 493-509, 2020.
[Li et al., 2019] Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang. Selective kernel networks. In CVPR, pages 510-519, 2019.
[Liu et al., 2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012-10022, 2021.
[Ren et al., 2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. NeurIPS, 2015.
[Ronneberger et al., 2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234-241, 2015.
[Srivastava et al., 2015] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. NeurIPS, 2015.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
[Wang et al., 2018] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, pages 7794-7803, 2018.
[Wang et al., 2020a] Qilong Wang, Banggu Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, and Qinghua Hu. ECA-Net: Efficient channel attention for deep convolutional neural networks. In CVPR, pages 11534-11542, 2020.
[Wang et al., 2020b] Yulin Wang, Kangchen Lv, Rui Huang, Shiji Song, Le Yang, and Gao Huang. Glance and focus: A dynamic approach to reducing spatial redundancy in image classification. NeurIPS, pages 2432-2444, 2020.
[Woo et al., 2018] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In ECCV, pages 3-19, 2018.
[Xu et al., 2017] Shuangjie Xu, Yu Cheng, Kang Gu, Yang Yang, Shiyu Chang, and Pan Zhou. Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In ICCV, pages 4733-4742, 2017.
[Yang et al., 2020] Le Yang, Yizeng Han, Xi Chen, Shiji Song, Jifeng Dai, and Gao Huang. Resolution adaptive networks for efficient inference. In CVPR, pages 2369-2378, 2020.
[Yang et al., 2021] Le Yang, Haojun Jiang, Ruojin Cai, Yulin Wang, Shiji Song, Gao Huang, and Qi Tian. CondenseNet V2: Sparse feature reactivation for deep networks. In CVPR, pages 3569-3578, 2021.
[Zhao et al., 2021] Jingyu Zhao, Yanwen Fang, and Guodong Li. Recurrence along depth: Deep convolutional neural networks with recurrent layer aggregation. NeurIPS, pages 10627-10640, 2021.
[Zhao et al., 2022] Yue Zhao, Junzhou Chen, Zirui Zhang, and Ronghui Zhang. BA-Net: Bridge attention for deep convolutional neural networks. In ECCV, pages 297-312, 2022.
