
SDA-xNet: Selective Depth Attention Networks for Adaptive Multi-scale Feature Representation

Qingbei Guo∗, Xiao-Jun Wu†, Zhiquan Feng∗, Tianyang Xu† and Cong Hu†

Abstract—Existing multi-scale solutions lead to a risk of just increasing the receptive field sizes while neglecting small receptive fields. Thus, it is a challenging problem to effectively construct adaptive neural networks for recognizing various spatial-scale objects. To tackle this issue, we first introduce a new attention dimension, i.e., depth, in addition to existing attention dimensions such as channel, spatial, and branch, and present a novel selective depth attention network to symmetrically handle multi-scale objects in various vision tasks. Specifically, the blocks within each stage of a given neural network, i.e., ResNet, output hierarchical feature maps sharing the same resolution but with different receptive field sizes. Based on this structural property, we design a stage-wise building module, namely SDA, which includes a trunk branch and an SE-like attention branch. The block outputs of the trunk branch are fused to globally guide their depth attention allocation through the attention branch. According to the proposed attention mechanism, we can dynamically select different depth features, which contributes to adaptively adjusting the receptive field sizes for the variable-sized input objects. In this way, the cross-block information interaction leads to a long-range dependency along the depth direction. Compared with other multi-scale approaches, our SDA method combines multiple receptive fields from previous blocks into the stage output, thus offering a wider and richer range of effective receptive fields. Moreover, our method can serve as a pluggable module for other multi-scale networks as well as attention networks, coined as SDA-xNet. Their combination further extends the range of the effective receptive fields towards small receptive fields, enabling interpretable neural networks. Extensive experiments demonstrate that the proposed SDA method achieves state-of-the-art (SOTA) performance, outperforming other multi-scale and attention counterparts on numerous computer vision tasks, e.g., image classification, object detection, and instance segmentation. Our source code is available at https://github.com/QingbeiGuo/SDA-xNet.git.

Index Terms—Deep Neural Network, Multi-Scale, Attention Mechanism, Computer Vision

Figure 1. Image classification performance comparison of different multi-scale and attention methods with ResNet-50 backbone on ImageNet-1k in terms of Top-1 accuracy, computational cost, and model size. The scale of the circles indicates GFLOPs. The proposed SDA-ResNet-86 outperforms the much larger ResNet-152 architecture on the ImageNet-1k dataset with over 2× fewer parameters and close to 3× fewer FLOPs.

Q. Guo and Z. Feng are with the Shandong Provincial Key Laboratory of Network based Intelligent Computing, University of Jinan, Jinan 250022, China. (e-mail: {ise guoqb; ise fengzq}@ujn.edu.cn)
X.-J. Wu, T. Xu and C. Hu are with the School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, P.R. China. (e-mail: {xiaojun wu jnu; tianyang xu}@163.com; [email protected])

I. INTRODUCTION

The involvement of multi-scale feature representations is of great practical significance for various computer vision (CV) tasks, including image classification [1], object detection [2], semantic segmentation [3], and instance segmentation [4], due to diversely sized target objects. Therefore, multi-scale feature representations have been widely used in the design of convolutional neural networks (CNNs) such as Res2Net [5], PyConv [6] and EPSANet [7]. Among those, multi-kernel convolution is one of the most commonly used methods, which exploits different kernel sizes or multiple kernel cascades to extract features at different scales for the aggregation of rich multi-scale information.

Although such multi-kernel methods can perceive different scales, they tend to rapidly increase the receptive field sizes, which is a disadvantage when performing recognition for small target objects [8], [9]. For instance, the recently proposed EPSANet [7] is a variant of ResNet [10] that replaces the 3×3 convolution with the Pyramid Split Attention (PSA) module to improve the ability of multi-scale feature representation. The PSA module is composed of several groups of convolutions with different kernel sizes, e.g., 3, 5, 7, and 9. Thus, the network output combines multiple components with different receptive field sizes. The 3×3 convolution contributes the smallest part of the receptive fields, which is equivalent to ResNet in terms of the receptive field settings. Other convolutions with larger kernels increase the receptive field sizes. In other words, EPSANet achieves a wide range of effective receptive fields, starting from the smallest part equivalent to ResNet. However, such large receptive fields prefer to recognize large target objects while neglecting small target objects. Therefore, how to design an adaptive multi-scale neural network is the key to extracting a wide range of receptive fields from small to large.

In this work, our goal is to design a novel neural network as an adaptive multi-scale feature selector for the prediction of scale-varying objects.
To this end, we first introduce a new attention dimension, i.e., depth, in addition to existing attention dimensions such as channel, spatial, and branch. Based on this, we present a novel depth attention network that adaptively fuses the multi-scale features from different depth blocks for various scales of objects. Concretely, in each stage of the neural network, we build a depth attention module, which consists of two branches, i.e., a trunk branch and an attention branch. We use the unit of ResNet as the trunk branch, which is responsible for producing the hierarchical features from the blocks of each stage. The feature hierarchy consists of a series of feature maps with the same number of channels and the same resolution but different receptive field sizes, which helps to align the features for fusion in the channel and spatial dimensions. Similar to SENet [11], the attention branch is implemented as a lightweight network module, acting as a feature selector for the hierarchical features of the trunk branch. Global embedded information is obtained by global average pooling (GAP), followed by two simple fully-connected (FC) layers, with a softmax operation being applied to obtain the attention weights of the corresponding hierarchical features. Based on the weighted multiplication and element-wise summation, we get a rich multi-scale feature representation that corresponds to a wide range of receptive fields. Small receptive field components from previous blocks are included and delivered across stages. In this way, a cross-block information interaction is built by adaptively adjusting the receptive field sizes of different blocks, and the long-range dependency is captured along the depth direction. The proposed SDA-Net is constructed by stacking multiple SDA modules, similar to the residual blocks in the ResNet-like manner.

Moreover, our SDA method is orthogonal to other multi-scale and attention methods, so SDA-Net can be integrated with their neural networks, improving the model performance for image-related vision tasks with a slight computation burden. Such a combination merges their receptive field ranges, thus extending the range of effective receptive fields in the direction of small ones. A wider range of effective receptive fields can significantly boost the performance and enhances the interpretability of neural networks, which is verified by our experiments. Extensive experiments on numerous computer vision tasks, including image classification, object detection, and instance segmentation, have been conducted in this work. The results demonstrate that our SDA-Net achieves superior performance to the state-of-the-art multi-scale and attention networks under similar computation efficiency. Compared to other attention and multi-scale networks, as shown in Fig. 1, the proposed SDA-Net provides the best trade-off between accuracy and efficiency.

The proposed SDA method reflects the following advantages:
• We first analyze a new attention dimension, i.e., depth, in addition to other existing attention dimensions such as channel, spatial, and branch, thus enabling us to capture the long-range dependency along the depth direction.
• Based on this, we construct a novel SDA-Net that adaptively adjusts the receptive field sizes for rich multi-scale feature representations.
• The proposed SDA method is orthogonal to other multi-scale and attention methods, which extends the range of effective receptive fields towards small receptive fields for better performance with a slight computation increase, providing a new scope of interpretation of neural networks.
• Extensive experiments confirm that our SDA-Net outperforms the state-of-the-art multi-scale and attention networks on numerous computer vision tasks, i.e., image classification, object detection, and instance segmentation, under a similar model complexity.

The rest of this paper is organized as follows: we first introduce the related work in Section II. We then present the SDA method in Section III. Through extensive experiments, we demonstrate that the proposed SDA-Net delivers superior performance on numerous computer vision tasks in Section IV. In Section V, we conduct an ablation study to investigate the effect of different factors on our SDA-Net. Subsequently, we analyze the computational cost, the range of effective receptive fields, the multi-scale representation adaptability, and class activation mapping (CAM) in Section VI. Finally, we draw the paper to a conclusion in Section VII.

II. RELATED WORK

Multi-scale feature representations. It has been widely studied that multi-scale feature representations are essential for various computer vision tasks. TridentNet [9] utilized the dilated convolution with different dilation rates to generate multiple scale-specific feature maps to solve the scale variation problem of input objects in object detection, while exploring the relationship between the receptive field and the scale variation. A feature pyramid network (FPN) was built by exploiting the inherent multi-scale feature hierarchy of CNNs for object detection [12]. A multi-scale attention model was constructed with multi-scale input images to improve the performance of semantic segmentation [13]. Some works designed multiple-branch networks to aggregate multi-scale information in different scale branches, achieving strong representational power for image classification [14], [15].

Multi-kernel convolution is one of the commonly used methods for multi-scale feature representation. The early InceptionNets [16]–[18] used the inception module with different kernel sizes to extract feature maps at multiple scales, improving the representation ability of models in image classification. PyConv [6] and MixNet [19] exploited lightweight group convolution with multiple kernel sizes for better model accuracy and efficiency. Res2Net [5] designed a novel building block with hierarchical residual-like patterns to capture in-layer multi-scale features, increasing the range of receptive fields. SKNet [20] introduced a novel selective kernel convolution to adaptively adjust different kernel selections in a soft-attention manner. Similarly, EPSANet [7] used different kernel sizes to extract rich spatial features to achieve strong multi-scale representation ability.
In this work, we exploit the inherent multi-scale representation of CNNs to extract the hierarchical features instead of multi-kernel convolutions, and the small receptive fields of low layers are taken into account for a wide range of multi-scale representations. Furthermore, the proposed SDA-Net differs from FPN, which also uses the multi-scale feature hierarchy of CNNs, because the effective receptive field is adaptively captured in an in-stage pattern by using a novel depth attention mechanism.

Attention mechanisms. Attention mechanisms enable neural networks to strengthen the allocation of the most relevant features of input objects while weakening the less useful ones. Many attention methods have been proposed to explore the attention space. According to the difference in attention dimensions, these methods can be categorized into three distinct groups for image-related vision tasks: channel attention, spatial attention, and branch attention. Note that temporal-wise attention is beyond the scope of this paper.

SENet [11] proposed a channel attention method, which adaptively adjusts the channel-wise features by modeling the cross-channel dependency with the squeeze and excitation (SE) operations. ECANet [21] introduced a local cross-channel attention method to reduce the attention network complexity. GSoP [22] introduced higher-order global information embedding methods, i.e., global second-order pooling, to model the pair-wise channel correlations of previous image information. FcaNet [23] proposed another global information embedding method instead of global average pooling in the attention module, which exploits the discrete cosine transform (DCT) to incorporate different frequency components, guiding the selection of channel-wise features.

To overcome the limitation of local convolutions, existing attempts propose to capture the long-range feature interactions at the spatial-wise attention stage. AANet [24] proposed an attention augmented convolutional network by combining both convolution and self-attention. GCNet [25] modeled the global context by unifying both the simplified non-local (NL) block [26] and the SE block [11]. A2Net [27] proposed a double attention block to capture the long-range spatial feature dependencies via global information gathering and distribution functions.

The branch attention networks, such as SKNet [20] and EPSANet [7], used a dynamic branch selection mechanism to adaptively adjust the sizes of their receptive fields according to the variable-sized input objects. AFFNet [28] proposed a multi-scale attention module, which fuses the features from different layers or branches by aggregating local and global contexts. Similarly, ResNeSt [29] presented a multi-path architecture to capture cross-channel feature interactions by leveraging the split-attention mechanism.

Besides, the combination of two or more attention mechanisms has also been discussed recently, achieving further improvement compared to the use of a single attention mechanism. For instance, SANet [30] adopted an efficient shuffle attention module to combine channel attention and spatial attention effectively for additional performance gains. Similar works such as CBAM [31] and BAM [32] also proposed to combine channel attention and spatial attention for significant performance improvements.

Recently, the transformer architecture has attracted wide attention in many CV tasks, such as ViT [33], DETR [34] and Deformable DETR [35], but the self-attention mechanism they use is out of the scope of this paper.

In this work, we focus on a novel depth attention method to explore the attention space in the depth dimension, aiming at promoting further research on attention mechanisms.

III. METHODOLOGY

A. Selective Depth Attention Module

Figure 2. Selective Depth Attention Module. [Diagram: the block outputs Z1, Z2, ..., Zm of a stage are fused by element-wise summation into F (h×w×c), squeezed by GAP and two FC layers (1×1×c → 1×1×d → 1×1×m·c), passed through a softmax to produce the depth attention weights s1, ..., sm, which select and re-weight the block outputs to give the stage output O.]

The SDA module consists of two branches, a trunk branch and an attention branch, as shown in Fig. 2. The trunk branch is driven to learn features based on typical network architectures. This approach is independent of the backbone architecture of CNNs, and we use ResNet [10], SENet [11], and Res2Net [5] as the building units of our SDA-Net in this paper. The bottleneck blocks of these three networks are divided into multiple stages according to the difference in output feature resolutions. The block sequence of each stage serves as the trunk branch of our SDA module. These blocks output the same number of feature maps with the same resolution but different receptive field sizes. This allows us to easily merge the multi-scale features from the intermediate blocks without the problem of spatial alignment.

The attention branch is implemented as an SE-like lightweight network structure. This branch gathers the feature information from the trunk branch, adaptively weighs the features of the intermediate blocks in the depth dimension, and then passes the resulting multi-scale features into the next stage. Unlike EPSANet, which extracts the multi-scale features using the branch attention mechanism with different kernel sizes, we exploit the intrinsic feature pyramid representation of CNNs and explore an alternative attention mechanism, i.e., depth attention, for strong multi-scale feature representation capacity. Thus, the attention branch is used as a feature selector of the trunk branch, which decides the attendance of the involved blocks in the feature learning, based on the requirement of feature scales. It is worth noting that we creatively explore a new depth dimension for the attention space, differently from other existing attention dimensions such as channel, spatial, and branch.

Specifically, for a given SDA module, its trunk branch extracts a sequence of feature maps Z = [Z_1, Z_2, ..., Z_m] from the intermediate blocks. Here, m is the number of blocks, and Z_i ∈ R^{h×w×c} denotes the output of the i-th block, where h, w and c are the height, width and number of channels, respectively. Thanks to the same number of channels and the same resolution but different receptive field sizes, Z is essentially a feature hierarchy, which has gradual semantics from low to high levels. Our goal is to adaptively weight the feature hierarchy via the attention branch to achieve the multi-scale representation for better predictions.

To achieve this goal, we first merge the hierarchical features by using an element-wise summation, as follows,

F = \sum_{i=1}^{m} Z_i,   (1)

where F ∈ R^{h×w×c}.

Then, we aggregate the spatial information of the feature maps F by using GAP to generate the global spatial context descriptor u ∈ R^{1×1×c}. In other words, u is generated by shrinking F through its spatial dimensions h × w. Specifically, the k-th element of the descriptor u is calculated by

u_k = \mathrm{GAP}(F_{:,:,k}) = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} F_{i,j,k}.   (2)

Subsequently, u is fed into two 1×1 convolutions, as follows,

v = W_2(\gamma(\beta(W_1 u))),   (3)

where β and γ are the BN [36] and ReLU [37] functions, respectively. W_1 ∈ R^{d×1×1×c} is a learned linear transformation that maps u to a low-dimensional space for better efficiency, and W_2 ∈ R^{m·c×1×1×d} is used to resume the channel dimension. Here, d = max(c/r, L), where r is the reduction ratio and L is the threshold value.

Further, we align v along the depth dimension to obtain v^⊤ by a reshaping operation, and employ the soft weighting mechanism with a softmax activation, as follows,

s_i^⊤ = \delta(v^⊤)_i = \frac{\exp(v_i^⊤)}{\sum_{j=1}^{m} \exp(v_j^⊤)},   (4)

where δ denotes the softmax function, and s^⊤ = [s_1^⊤, s_2^⊤, ..., s_m^⊤] is the set of attention vectors.

Finally, at the end of each SDA module, we adaptively fuse different scales of semantic information by softly weighting the cross-block features according to the scale of the input objects. Thus, the output of the SDA module can be formulated as:

O = \gamma\left(\sum_{i=1}^{m} s_i Z_i\right),   (5)

where s^⊤ is reshaped to return to the alignment s along the channel dimension, and O denotes the final output of the proposed SDA module. Note that the attention model captures the channel-wise relationships across blocks and decides how much attention to pay to the features at different depths, thus achieving the powerful multi-scale feature representation ability.
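To make the module concrete, the following is a minimal PyTorch sketch of the SDA module described by Eqns. (1)–(5). It is an illustrative reimplementation derived from the equations above, not the authors' released code (which is available at the repository linked in the abstract); the class and variable names, and the way the block outputs are passed as a Python list, are our assumptions.

```python
import torch
import torch.nn as nn


class SDAModule(nn.Module):
    """Selective depth attention over the m block outputs of one stage.

    A sketch of Eqns. (1)-(5): sum the block outputs, squeeze with global
    average pooling, excite with two 1x1 convolutions, apply a softmax over
    the depth dimension, and re-weight the block outputs.
    """

    def __init__(self, channels: int, num_blocks: int, reduction: int = 16, threshold: int = 64):
        super().__init__()
        d = max(channels // reduction, threshold)           # d = max(c/r, L)
        self.m = num_blocks
        self.gap = nn.AdaptiveAvgPool2d(1)                  # Eq. (2)
        self.fc1 = nn.Conv2d(channels, d, kernel_size=1)    # W1
        self.bn = nn.BatchNorm2d(d)                         # beta
        self.relu = nn.ReLU(inplace=True)                   # gamma
        self.fc2 = nn.Conv2d(d, num_blocks * channels, kernel_size=1)  # W2
        self.softmax = nn.Softmax(dim=1)                    # over the m blocks

    def forward(self, block_outputs):
        # block_outputs: list of m tensors Z_i, each of shape (B, C, H, W)
        z = torch.stack(block_outputs, dim=1)               # (B, m, C, H, W)
        f = z.sum(dim=1)                                    # Eq. (1)
        u = self.gap(f)                                     # (B, C, 1, 1)
        v = self.fc2(self.relu(self.bn(self.fc1(u))))       # Eq. (3), (B, m*C, 1, 1)
        v = v.view(v.size(0), self.m, -1, 1, 1)             # align along the depth dimension
        s = self.softmax(v)                                 # Eq. (4)
        return torch.relu((s * z).sum(dim=1))               # Eq. (5)
```

With r = 16 and L = 64 (the values used in Table I), the only learned parameters added per stage are the two 1×1 convolutions of the attention branch.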
B. Neural Network Architecture

We construct our proposed SDA-Net by stacking multiple SDA modules. Based on ResNet-50, we adjust the number of blocks in each stage from (3, 4, 6, 3) to (5, 6, 12, 5), and set 8 groups in the second convolution of each block to align the FLOPs with ResNet-50. Thanks to the proposed depth attention mechanism, SDA-Net enables a wide range of effective receptive fields for flexible multi-scale representation ability.

To be specific, if SDA-Net pays more attention to the deep blocks of each stage, then it outputs a high level of semantic features with large receptive fields, which allows it to favor large-scale objects. On the contrary, if the shallow blocks are paid more attention to in each stage, then a low level of features is obtained with small receptive fields, which is suitable for learning small-scale objects. The original ResNet can be viewed as a special case of our SDA-Net, which generates fixed-scale features for single-scale objects through only the last block. In contrast, our SDA-Net adaptively learns dynamic multi-scale contexts to achieve better performance for various scales of objects by leveraging the intermediate features instead of just the features of the last block. More importantly, because of the newly proposed depth attention dimension, our method is orthogonal to other multi-scale and attention methods such as EPSANet, SENet, CBAM, PyConv, and Res2Net. Thus, the proposed SDA method further extends the range of effective receptive fields towards small receptive fields, enhancing the model representation ability at complexities similar to EPSANet-101, SENet-101, CBAM-101, PyConv-101, and Res2Net-101, respectively. A comparison of the ResNet-50, SDA-ResNet-86, SDA-SENet-86 and SDA-Res2Net-86 neural architectures can be found in Table I.

In addition to the above attention mechanism, the composition of Z is also critical for our SDA-Net to generate features with rich semantic information according to Eqns. (1) and (5). The key lies in the choice of features from the intermediate blocks. There are several candidate features, including x + F(x) and γ(x + F(x)), where x is the input tensor of the block and F(x) represents the multiple convolutional layers which learn a residual mapping function. In this paper, we select the output of each block before activation, i.e., x + F(x), to constitute the sequence of feature map subsets Z, given the information loss caused by the activation function γ. The choice and effect of these intermediate features are further discussed in the Ablation Study section.
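As a sketch of how an SDA module wraps the trunk branch of one stage (Section III-B), the fragment below collects the pre-activation output x + F(x) of every bottleneck block and feeds the list to the SDA module sketched above. It is illustrative only: it assumes each block returns just the residual mapping F(x), and it omits the stride/channel projection of the identity at the first block of a stage, which the real implementation has to handle.

```python
import torch
import torch.nn as nn


class SDAStage(nn.Module):
    """One stage of SDA-ResNet: a sequence of bottleneck blocks (the trunk
    branch) followed by a depth attention module (the attention branch)."""

    def __init__(self, blocks, channels: int):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)              # e.g., 5, 6, 12 or 5 blocks per stage
        self.sda = SDAModule(channels, num_blocks=len(blocks))

    def forward(self, x):
        outputs = []
        for block in self.blocks:
            x = block(x) + x          # pre-activation output x + F(x) of each block
            outputs.append(x)         # hierarchical features Z_1, ..., Z_m
            x = torch.relu(x)         # the activated feature feeds the next block
        return self.sda(outputs)      # adaptively fused stage output O
```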

Table I
ARCHITECTURES FOR IMAGENET CLASSIFICATION. THE FOUR COLUMNS REFER TO RESNET-50, SDA-RESNET-86, SDA-SENET-86 AND SDA-RES2NET-86, RESPECTIVELY.

stage | output | ResNet-50 | SDA-ResNet-86 | SDA-SENet-86 | SDA-Res2Net-86
conv | 112×112 | 7×7, 64, stride 2 (all models)
pooling | 56×56 | 3×3 max pool, stride 2 (all models)
stage1 | 56×56 | [1×1, 64; 3×3, 64; 1×1, 256] ×3 | [1×1, 64; 3×3, 64, G=8; 1×1, 256] ×5, SDA[r=16, L=64], 256 | [1×1, 64; SE, 64; 1×1, 256] ×5, SDA[r=16, L=64], 256 | [1×1, 64; Res2, 104; 1×1, 256] ×5, SDA[r=16, L=64], 256
stage2 | 28×28 | [1×1, 128; 3×3, 128; 1×1, 512] ×4 | [1×1, 128; 3×3, 128, G=8; 1×1, 512] ×6, SDA[r=16, L=64], 512 | [1×1, 128; SE, 128; 1×1, 512] ×6, SDA[r=16, L=64], 512 | [1×1, 128; Res2, 208; 1×1, 512] ×6, SDA[r=16, L=64], 512
stage3 | 14×14 | [1×1, 256; 3×3, 256; 1×1, 1024] ×6 | [1×1, 256; 3×3, 256, G=8; 1×1, 1024] ×12, SDA[r=16, L=64], 1024 | [1×1, 256; SE, 256; 1×1, 1024] ×12, SDA[r=16, L=64], 1024 | [1×1, 256; Res2, 416; 1×1, 1024] ×12, SDA[r=16, L=64], 1024
stage4 | 7×7 | [1×1, 512; 3×3, 512; 1×1, 2048] ×3 | [1×1, 512; 3×3, 512, G=8; 1×1, 2048] ×5, SDA[r=16, L=64], 2048 | [1×1, 512; SE, 512; 1×1, 2048] ×5, SDA[r=16, L=64], 2048 | [1×1, 512; Res2, 832; 1×1, 2048] ×5, SDA[r=16, L=64], 2048
classifier | 1×1 | 7×7 global average pool, BN, 1000-d fc, softmax (all models)
FLOPs | | 4.12G | 3.88G | 6.72G | 6.95G
Params | | 25.56M | 27.22M | 49.13M | 45.26M

IV. EXPERIMENTS

In this section, we conduct extensive experiments to investigate the effectiveness of our SDA-Net on numerous benchmark CV tasks, i.e., image classification, object detection, and instance segmentation.

A. Implementation Details

To evaluate the effectiveness of our SDA-Net on the image classification task, we carry out experiments on the large-scale ImageNet-1k [38] dataset and deploy the typical ResNet-50 as the trunk branch. Considering the combination with other multi-scale and attention methods, we also deploy several other trunk branches, including EPSANet, SENet, CBAM, PyConv, and Res2Net. The ImageNet-1K dataset contains 1000 object classes with 1.28M training images and 50K validation images used as testing samples. We adopt the same data augmentation and hyper-parameter settings as in [10]. Specifically, the training images are resized to 256×256 and then randomly cropped to 224×224, followed by random horizontal flipping. All the models are trained from scratch by stochastic gradient descent (SGD) with weight decay 1e-4, momentum 0.9, and mini-batch size 256. The initial learning rate is set to 0.1 and decreased by a factor of 10 every 30 epochs until 120 epochs. During testing, we resize the validation images to 256×256 and then use the center crop of 224×224 for evaluation. Label smoothing [17] is applied to regularize the network.

To evaluate our SDA method on MS COCO [39] for other downstream CV tasks, we use multiple detectors such as Faster R-CNN [2] and Mask R-CNN [4]. The COCO-2017 dataset consists of about 118K training images and 5K validation images used as testing ones, covering 80 foreground object classes and one background class. These detectors are implemented with the MMDetection toolbox [40] and are configured with its default settings. Concretely, during training, the input images are resized such that the short edge is 800 pixels. We train all the models for 12 epochs using SGD with an initial learning rate of 0.02, which is decreased by a factor of 10 at the 8th and 11th epochs. The other hyper-parameters are set as follows: a weight decay of 1e-4, a momentum of 0.9, and a mini-batch size of 16 (4 GPUs with 4 images per GPU).

B. Image Classification

We compare our SDA-Net with the SOTA multi-scale networks on ImageNet-1K, including bL-ResNet [15], ScaleNet [14], EPSANet [7], PyConv [6], and Res2Net [5], to evaluate the proposed method on the image classification task. Table II shows the comparison with the SOTA multi-scale methods and the result of combining our method with some of those multi-scale methods.

Specifically, our SDA-Net outperforms other multi-scale networks by a large margin with similar computational complexity. Our method achieves 1.45%, 0.78%, 1.27%, 0.88% and 0.77% higher top-1 accuracy than bL-ResNet, ScaleNet, EPSANet, PyConv, and Res2Net, respectively. Moreover, the models integrating our method with other multi-scale methods significantly improve the recognition performance, even consistently surpassing their 101-layer variant networks, i.e., EPSANet-101, PyConv-101, and Res2Net-101, with a smaller computational cost. This fully demonstrates the superiority of adaptive aggregation over different receptive fields.

To further evaluate the proposed method on the image classification task, our SDA-Net is compared with the SOTA attention methods of different attention dimensions on ImageNet-1K, such as SENet [11], FcaNet [23], GCNet [25], CBAM [31] and EPSANet [7]. We present the comparison with the SOTA attention methods and the result of combining our method with some of those attention methods in Table III.

In particular, our SDA-Net achieves the best performance among these attention methods with similar complexities. For the channel attention dimension, our method outperforms SENet by more than 2% in terms of top-1 accuracy. For the spatial-related attention dimension, our SDA-Net obtains a 1.42% higher gain than CBAM. Our method boosts top-1 accuracy by 1.27% over the branch attention method, i.e., EPSANet. The performance gap between our SDA-Net and FcaNet is the smallest (0.24%). Additionally, we observe 0.66%, 0.69%, and 1.09% improvements when combining with the typical networks of different attention dimensions, i.e., SENet, CBAM, and EPSANet, respectively. Our SDA-Net even consistently surpasses their 101-layer network variants with less memory and computation cost. This fully shows the effectiveness of our proposed depth attention mechanism.

Table II
CLASSIFICATION COMPARISON AMONG SEVERAL SOTA MULTI-SCALE METHODS ON IMAGENET-1K.

Method  Backbone Model  Multi-scale  Params (M)  FLOPs (G)  Top-1 (%)  Top-5 (%)
ResNet-50 [10] multi-scale 25.56 4.12 75.20 92.52
bL-ResNet [15] multi-scale 26.69 2.85 77.31 -
ScaleNet [14] multi-scale 31.48 3.82 77.98 93.95
EPSANet [7] ResNet multi-scale 22.59 3.60 77.49 93.54
PyConv [6] multi-scale 24.85 3.88 77.88 93.80
Res2Net [5] multi-scale 25.70 4.26 77.99 93.85
SDA-ResNet-86 (ours) multi-scale 27.22 3.88 78.76 94.37
EPSANet-101 [7] multi-scale 38.90 6.82 78.43 94.11
EPSANet-86 (our impl.) EPSANet multi-scale 36.67 5.84 77.71 93.83
SDA-EPSANet-86 (ours) multi-scale 39.45 5.85 78.80 94.34
PyConv-101 [6] multi-scale 44.63 8.42 79.22 94.43
PyConv-86 (our impl.) PyConv multi-scale 40.55 6.22 78.63 94.26
SDA-PyConv-86 (ours) multi-scale 43.33 6.22 79.27 94.65
Res2Net-101 [5] multi-scale 45.21 8.10 79.19 94.43
Res2Net-86 (our impl.) Res2Net multi-scale 42.47 6.94 78.63 94.21
SDA-Res2Net-86 (ours) multi-scale 45.26 6.95 79.36 94.70

Table III
CLASSIFICATION COMPARISON AMONG SEVERAL SOTA ATTENTION METHODS ON IMAGENET-1K.

Method  Backbone Model  Attention Dimension  Params (M)  FLOPs (G)  Top-1 (%)  Top-5 (%)
ResNet-50 [10] - 25.56 4.12 75.20 92.52
SENet [11] channel 28.09 4.09 76.71 93.38
ECANet [21] channel 25.56 4.09 77.48 93.68
GSoP [22] channel 28.05 6.18 77.68 93.98
FcaNet [23] channel 28.07 4.09 78.52 94.14
A2 Net [27] spatial 33.00 6.50 77.00 93.50
AANet [24] ResNet spatial 25.80 4.15 77.70 93.80
GCNet [25] spatial 28.11 4.13 77.70 93.66
BAM [32] channel+spatial 25.92 4.17 75.98 92.82
CBAM [31] channel+spatial 30.62 4.10 77.34 93.69
SANet [30] channel+spatial 25.56 4.09 77.72 93.80
EPSANet [7] branch 22.59 3.60 77.49 93.54
SDA-ResNet-86 (ours) depth 27.22 3.88 78.76 94.37
SENet-101 [11] channel 49.33 7.81 77.62 93.93
SENet-86 (our impl.) SENet channel 46.35 6.71 78.29 93.92
SDA-SENet-86 (ours) depth+channel 49.13 6.72 78.95 94.50
CBAM-101 [31] channel+spatial 54.04 7.81 78.49 94.31
CBAM-86 (our impl.) CBAM channel+spatial 50.75 6.71 78.36 94.09
SDA-CBAM-86 (ours) depth+channel+spatial 53.53 6.72 79.05 94.44
EPSANet-101 [7] branch 38.90 6.82 78.43 94.11
EPSANet-86 (our impl.) EPSANet branch 36.67 5.84 77.71 93.83
SDA-EPSANet-86 (ours) depth+branch 39.45 5.85 78.80 94.34

C. Object Detection

We apply our method to the object detection task to explore its generalization ability. We evaluate our SDA-Net as the backbone of Faster R-CNN and Mask R-CNN. All models are trained along with FPN [12] and are tested on the MS COCO dataset. Table IV shows the object detection results. The SDA-Net based models outperform their counterparts by a significant margin with both Faster R-CNN and Mask R-CNN. Compared with ResNet-50, SENet-50, and Res2Net-50, their SDA-Net based models achieve 3.4%, 3.4%, and 8.4% higher AP for Faster R-CNN, respectively, and 3.4%, 2.8%, and 2.7% higher AP for Mask R-CNN, respectively, surpassing even their 101-layer variant models in terms of almost all indicators. In the Appendix section, some visual results of object detection on challenging examples are illustrated in Fig. 6.

D. Instance Segmentation

Besides the object detection task, we also evaluate our method on the instance segmentation task to further verify its generalization ability. Similarly, we use our SDA-Net with FPN [12] as the backbone of Mask R-CNN and test its performance on the MS COCO dataset. The instance segmentation performance is shown in Table V. Compared with other methods, our proposed method achieves the best performance. The improvements in AP are 2.4%, 1.6%, and 2.3% for ResNet-50, SENet-50, and Res2Net-50, respectively, surpassing even their 101-layer variant models in terms of almost all indicators. In the Appendix section, some visual results of instance segmentation on challenging examples are illustrated in Fig. 7.

V. ABLATION STUDY

As discussed in Section III, there are two major factors affecting the multi-scale information flow across stages: the feature sequence in trunk branches and the adaptive depth attention mechanism.

Table IV
OBJECT DETECTION RESULTS OF DIFFERENT METHODS ON COCO VAL2017.

Detector  Model  Params (M)  FLOPs (G)  AP (%)  AP50 (%)  AP75 (%)  APS (%)  APM (%)  APL (%)
ResNet-50 [10] 41.53 207.07 36.4 58.2 39.2 21.8 40.0 46.2
EPSANet [7] 38.56 197.07 39.2 60.3 42.3 22.8 42.4 51.1
SENet [11] 44.02 207.18 37.7 60.1 40.9 22.9 41.9 48.2
ECANet [21] 41.53 207.18 38.0 60.6 40.9 23.4 42.1 48.0
FcaNet [23] 44.02 215.63 39.0 61.1 42.3 23.7 42.8 49.6
Faster-RCNN Res2Net [5] 41.67 215.52 33.7 53.6 - 14.0 38.3 51.1
SDA-ResNet-86 (ours) 43.11 202.93 39.8 60.7 43.2 22.9 43.9 51.4
SENet-101 [11] 65.24 295.58 39.6 62.0 43.1 23.7 44.0 51.4
SDA-SENet-86 (ours) 64.81 260.93 41.0 62.1 44.5 23.9 44.9 53.4
Res2Net-101 [5] 61.18 293.68 - - - - - -
SDA-Res2Net-86 (ours) 60.98 265.70 42.0 62.7 45.5 24.1 45.8 55.0
ResNet-50 [10] 44.17 261.81 37.2 58.9 40.3 22.2 40.7 48.0
EPSANet [7] 41.20 248.53 40.0 60.9 43.3 22.3 43.2 52.8
SENet [11] 46.66 261.93 38.7 60.9 42.1 23.4 42.7 50.0
ECANet [21] 44.17 261.93 39.0 61.3 42.1 24.2 42.8 49.9
FcaNet [23] 46.66 261.93 40.3 62.0 44.1 25.2 43.9 52.0
Mask-RCNN Res2Net [5] 44.31 268.59 39.6 60.9 43.1 22.0 42.3 52.8
SDA-ResNet-86 (ours) 45.75 256.01 40.6 61.1 44.4 23.4 44.3 53.8
SENet-101 [11] 67.88 348.65 40.7 62.5 44.3 23.9 45.2 52.8
SDA-SENet-86 (ours) 67.46 314.00 41.5 62.8 45.2 24.8 45.6 53.7
Res2Net-101 [5] 63.82 346.74 41.8 62.6 45.6 23.4 45.5 55.6
SDA-Res2Net-86 (ours) 63.62 318.78 42.3 62.9 46.4 24.5 46.2 55.9

Table V
INSTANCE SEGMENTATION RESULTS OF DIFFERENT METHODS ON COCO VAL2017.

Detector  Model  Params (M)  FLOPs (G)  AP (%)  AP50 (%)  AP75 (%)  APS (%)  APM (%)  APL (%)
ResNet-50 [10] 44.17 261.81 34.1 55.5 36.2 16.1 36.7 50.0
EPSANet [7] 41.20 248.53 35.9 57.7 38.1 18.5 38.8 49.2
SENet [11] 46.66 261.93 35.4 57.4 37.8 17.1 38.6 51.8
ECANet [21] 44.17 261.93 35.6 58.1 37.7 17.6 39.0 51.8
FcaNet [23] 46.66 261.93 36.2 58.6 38.1 - - -
Mask-RCNN Res2Net [5] 44.31 268.59 35.6 57.6 37.6 15.7 37.9 53.7
SDA-ResNet-86 (ours) 45.75 256.01 36.5 58.2 39.0 16.9 39.5 53.5
SENet-101 [11] 67.88 348.65 36.8 59.3 39.2 17.2 40.3 53.6
SDA-SENet-86 (ours) 67.46 314.00 37.0 59.6 39.3 18.2 40.6 53.4
Res2Net-101 [5] 63.82 346.74 37.1 59.4 39.4 16.6 40.0 55.6
SDA-Res2Net-86 (ours) 63.62 318.78 37.9 59.8 40.7 18.3 41.0 55.2

Table VI
THE RESULT OF USING DIFFERENT FEATURES Z IN TRUNK BRANCHES ON IMAGENET-1K.

Model | Params (M) | FLOPs (G) | Top-1 (%) | Top-5 (%)
ResNet-86 (our impl.) | 24.44 | 3.87 | 77.43 | 93.60
SDA-ResNet-86 & γ(x + F(x)) | 27.22 | 3.88 | 78.53 | 94.15
SDA-ResNet-86 & x + F(x) (ours) | 27.22 | 3.88 | 78.76 | 94.35

Table VII
COMPARISON BETWEEN ADAPTIVE ATTENTION AND NON-ADAPTIVE ATTENTION ON IMAGENET-1K.

Model | Params (M) | FLOPs (G) | Top-1 (%) | Top-5 (%)
ResNet-86 (our impl.) | 24.44 | 3.87 | 77.43 | 93.60
SDA-ResNet-86 & non-adaptive | 24.44 | 3.87 | 78.01 | 94.02
SDA-ResNet-86 & adaptive (ours) | 27.22 | 3.88 | 78.76 | 94.35

A. The Effect of the Feature Sequence in Trunk Branches

We investigate the effect of the feature sequence in trunk branches. Based on Eqn. (5), we find that the hierarchical features are essential factors affecting the information flow across stages. Empirically, we select two candidate feature elements, i.e., γ(x + F(x)) and x + F(x), which contain rich semantic information together with the identity features.

The comparison result is shown in Table VI. It is easily seen that x + F(x) achieves higher accuracy than γ(x + F(x)). We argue that this is because the ReLU function causes information loss by zeroing the negative activations, and stacking it on every collected feature further increases the potential loss. Therefore, we select x + F(x) as the component of the feature sequence Z in trunk branches for better performance.

B. The Effect of Adaptive Attention vs. Non-adaptive Attention

Next, we compare our adaptive attention with non-adaptive attention to investigate the effect of our depth attention mechanism. These two types of networks have similar architectures except for the attention branches. Different from our adaptive SDA-Net, the non-adaptive attention network aggregates the features from different blocks by a simple average, i.e., O = \gamma\left(\frac{1}{m}\sum_{i=1}^{m} Z_i\right), instead of the adaptive weighting operation, so the features of each block are equally important for the multi-scale feature representations.

As shown in Table VII, using our adaptive depth attention method yields better performance, achieving close to 0.8% higher accuracy than the non-adaptive attention competitor, which verifies the effectiveness of our depth attention mechanism.
More interestingly, the non-adaptive attention network achieves better performance than the ResNet-86 model, improving the accuracy from 77.43% to 78.01%. Therefore, the result fully verifies the success of our SDA-Net design, which exploits the hierarchical feature representation of CNNs and the depth attention mechanism for richer information fusion.

VI. ANALYSIS AND INTERPRETATION

A. Computational Complexity

In this part, we consider the additional parameters introduced by the proposed SDA modules. These additional parameters come from the two 1×1 convolution layers of each attention branch. Thus, the total number of additional parameters, which also determines the additional computational complexity, can be calculated by

\frac{1}{r}\sum_{i=1}^{n}(m_i + 1)c_i^2,   (6)

where n is the number of stages, r refers to the reduction ratio, m_i denotes the number of blocks, and c_i denotes the number of output channels of the i-th stage.

Our SDA-Net offers a good trade-off between improved accuracy and increased computation complexity. To illustrate the model complexity, we make a comparison between ResNet-86 and SDA-ResNet-86. For a fair comparison, we use the same crop size of 224 for the input images. ResNet-86 requires ∼24.44M Params (the number of parameters) and ∼3.87G FLOPs (floating-point operations) for a single input image. In comparison, our SDA-ResNet-86 requires ∼27.22M Params and ∼3.88G FLOPs, corresponding to an acceptable increase in the number of parameters and a slight increase in computation complexity. At the cost of this acceptable additional model complexity, SDA-ResNet-86 significantly surpasses ResNet-101 and ResNet-152 as well as ResNet-50, achieving 3.56%, 1.93% and 1.18% higher top-1 accuracy, respectively.
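As a quick numerical check of Eq. (6), the following small script estimates the extra parameters introduced by the SDA modules of SDA-ResNet-86, using the block counts (5, 6, 12, 5) and stage output channels (256, 512, 1024, 2048) from Table I with r = 16 and L = 64. It is an illustrative approximation: biases and BN parameters are ignored.

```python
def sda_extra_params(blocks=(5, 6, 12, 5), channels=(256, 512, 1024, 2048),
                     r=16, L=64):
    """Approximate additional parameters of the SDA attention branches.

    Per stage: W1 contributes d*c weights and W2 contributes m*c*d weights,
    with d = max(c // r, L), i.e., roughly (m + 1) * c^2 / r as in Eq. (6).
    """
    total = 0
    for m, c in zip(blocks, channels):
        d = max(c // r, L)
        total += d * c + m * c * d
    return total


print(sda_extra_params() / 1e6)  # ~2.75M, close to the 27.22M - 24.44M gap in Table VI
```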
B. The Range of Effective Receptive Fields

We have demonstrated that the range of the effective receptive field is a crucial issue for objects at all scales in many CV tasks. For simplicity, we take EPSANet as an example to show its essential difference from our proposed SDA-Net through a comparative analysis. According to the theory of the effective receptive field, the effective receptive field increases linearly with the kernel size K and sub-linearly with the depth L of the network [8]. In other words, the effective receptive field is proportional to O(K√L), which theoretically shows that our proposed method effectively extends the range of effective receptive fields towards small receptive fields because it aggregates shallow features.

In Fig. 3, we show the range of effective receptive fields in the second stage for EPSANet-86 and our SDA-ResNet-86, taking only the two factors K and L into consideration.

Figure 3. The range of adaptive effective receptive fields in the second stage (vertical axis: K√L, over kernel size K and depth L). (a) EPSANet. (b) Our SDA-ResNet-86.

The outputs of these two networks contain multiple feature components with different receptive fields, in which the large receptive fields contribute to capturing semantic information for large-scale objects, and the small receptive fields help to extract local details for small-scale patterns. For EPSANet, it is easily seen that the smallest and largest components of the receptive fields depend on the branches with kernel sizes 3 and 9, respectively. By contrast, the largest receptive field of our SDA-ResNet-86 depends on kernel size 3, which is equivalent to the smallest one of EPSANet, and the other receptive fields gradually shrink as the network depth decreases. Therefore, it is worth noting that our method seeks to strengthen the power of multi-scale representations towards small receptive fields, and the combination with other methods further extends the range of effective receptive fields to achieve stronger representational power.

C. Attention Distribution

To understand how our SDA-Net performs adaptive multi-scale learning by dynamically adjusting receptive fields, we analyze the attention distribution for the same input objects at different scales. Concretely, we enlarge all the samples of the ImageNet validation set by factors from 0.14× to 2.0×, followed by a 224×224 central crop. For the validation sets at different scales, we record the attention values of each block and analyze their distributions to interpret the adaptive adjustment of receptive fields.

Fig. 4 shows the attention distributions of each block in the second stage. On one hand, it is seen that the blocks are not attended to equally for the multi-scale feature representations. On the other hand, as the input objects enlarge, the attention mean gradually decreases for the first block; on the contrary, for the fourth block, the attention mean tends to increase. For the third block, when the enlargement factor is less than 1×, the attention mean dramatically increases; at the 1× scale it reaches its peak, and it gradually decreases when the enlargement factor is greater than 1×. The second block follows a similar pattern to the third block. The fifth and last blocks show an approximately opposite trend to the second and third blocks. As a result, this suggests that the larger the input objects are, the more attention is assigned to the features of the deeper blocks, which fully verifies that our SDA-Net adaptively adjusts the receptive fields according to the object scale sizes.
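The multi-scale evaluation protocol described above (enlarging each validation image by a factor between 0.14× and 2.0× and then center-cropping to 224×224) can be sketched with standard torchvision transforms as below. The paper does not specify the resizing policy beyond the scale factors, so scaling relative to the usual 256-pixel resize is an assumption made for illustration.

```python
import torchvision.transforms as T


def scaled_val_transform(scale: float, crop_size: int = 224, base_size: int = 256):
    """Validation transform for one enlargement factor: rescale the image
    relative to the usual 256-pixel resize, then take a 224x224 central crop
    (recent torchvision CenterCrop zero-pads images smaller than the crop)."""
    resize_to = max(1, int(round(base_size * scale)))
    return T.Compose([
        T.Resize(resize_to),
        T.CenterCrop(crop_size),
        T.ToTensor(),
    ])


# One dataloader per scale; the attention weights s_i are then recorded per block.
scales = [k / 7 for k in range(1, 15)]  # 0.14x, 0.29x, ..., 2.0x as in Fig. 4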
Figure 4. Attention distribution (mean, variance) of each block in the second stage, which consists of six blocks. (a) The first block. (b) The second block. (c) The third block. (d) The fourth block. (e) The fifth block. (f) The last block.

D. Class Activation Mapping

To further understand the adaptive multi-scale representation ability of our SDA-Net, we visualize the class activation mapping (CAM) by applying Grad-CAM [41] to images from the ImageNet validation set. Grad-CAM is a commonly used visualization method to localize the class-specific discriminative regions.

Fig. 5 shows the CAM visualization results using ResNet, SENet, Res2Net, and SDA-Net as the backbone. We randomly sample images with three different object sizes (small, medium, and large) from the ImageNet validation set. For small target objects such as 'balloon', 'ladybug', and 'baseball', the SDA-Net based CAM regions tend to be more precise than those of the other networks. Similarly, SDA-Net provides more precise CAM localizations on medium target objects, such as 'street sign', 'beacon', and 'goldfish'. Our SDA-Net covers almost all parts of the large target objects, such as 'ballpoint', 'bulbul' and 'red wine', whereas the other networks, especially Res2Net, cover much irrelevant context. Accordingly, the target class scores of our SDA-Net also tend to be larger. In the Appendix section, more CAM visualization results are illustrated for the different methods in Fig. 8. As can be seen, the proposed SDA-Net localizes object regions of various scale sizes more precisely than the other networks.

VII. CONCLUSION AND FUTURE WORK

A new attention dimension, i.e., depth, is introduced for the first time, which is an essential factor in addition to existing attention dimensions such as channel, spatial, and branch. Based on this, we present a simple yet effective SDA-Net to explore the depth attention mechanism for multi-scale feature representations. The network exploits the inherent hierarchical features of CNNs and the long-range block dependency to adaptively adjust the effective receptive fields for objects of different scale sizes. The proposed method is orthogonal to other multi-scale and attention methods, so SDA-Net can be integrated with those networks to extend the range of effective receptive fields towards small receptive fields for better performance. We have achieved state-of-the-art performance in a broad range of CV tasks, including image classification, object detection, and instance segmentation.

We hope that the proposed SDA mechanism will promote further research on attention networks. There are many potential improvements for future work, such as designing more efficient attention structures, adopting stronger trunk branches, and combining with the transformer architecture.

ACKNOWLEDGMENT

This work is supported by the National Key R&D Program of China (Grant No. 2018YFB1004901), by the National Natural Science Foundation of China (Grant No. 61672265, U1836218, 62006097), by the 111 Project of Ministry of Education of China (Grant No. B12018), by UK EPSRC GRANT EP/N007743/1, MURI/EPSRC/DSTL GRANT EP/R018456/1, by the Science and Technology Program of University of Jinan (Grant No. XKY1913, XKY1915), and in part by the Natural Science Foundation of Jiangsu Province (Grant No. BK20200593).

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Advances in neural information processing systems, vol. 25, pp. 1097–1105, 2012.
[2] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in neural information processing systems, 2015, pp. 91–99.
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[4] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
[5] S. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. H. Torr, "Res2net: A new multi-scale backbone architecture," IEEE transactions on pattern analysis and machine intelligence, 2019.
[6] I. C. Duta, L. Liu, F. Zhu, and L. Shao, "Pyramidal convolution: Rethinking convolutional neural networks for visual recognition," arXiv preprint arXiv:2006.11538, 2020.
[7] H. Zhang, K. Zu, J. Lu, Y. Zou, and D. Meng, "Epsanet: An efficient pyramid split attention block on convolutional neural network," arXiv preprint arXiv:2105.14447, 2021.
[8] W. Luo, Y. Li, R. Urtasun, and R. Zemel, "Understanding the effective receptive field in deep convolutional neural networks," in Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016, pp. 4905–4913.
[Figure 5: Grad-CAM maps for nine ImageNet classes ('balloon', 'traffic light', 'baseball'; 'street sign', 'beacon', 'goldfish'; 'ballpoint', 'warplane', 'red wine'), grouped into small, medium, and large objects, comparing the original image with the ResNet-50, SENet-50, Res2Net-50, and SDA-ResNet-86 backbones.]
Figure 5. Grad-CAM visualization results. The softmax scores for the target class are displayed at the bottom of each image.
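For readers who want to reproduce this kind of visualization, a minimal Grad-CAM [41] sketch using forward/backward hooks is given below. It is a generic reimplementation of Grad-CAM rather than the authors' visualization code; the choice of target layer (e.g., the output of the last SDA stage) and the preprocessing are assumptions.

```python
import torch
import torch.nn.functional as F


def grad_cam(model, image, target_layer, class_idx=None):
    """Compute a Grad-CAM heatmap for `image` (1, 3, H, W) at `target_layer`."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    model.eval()
    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()          # gradients of the target class score
    h1.remove(); h2.remove()

    weights = grads["a"].mean(dim=(2, 3), keepdim=True)          # pooled gradients
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))  # weighted feature maps
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalized heatmap
```

The softmax score displayed under each map in Fig. 5 corresponds to torch.softmax(logits, dim=1)[0, class_idx] for the target class.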

[9] Y. Li, Y. Chen, N. Wang, and Z. Zhang, "Scale-aware trident networks for object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6054–6063.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[11] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
[12] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
[13] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, "Attention to scale: Scale-aware semantic image segmentation," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3640–3649.
[14] Y. Li, Z. Kuang, Y. Chen, and W. Zhang, "Data-driven neuron allocation for scale aggregation networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11526–11534.
[15] C. F. Chen, Q. Fan, N. Mallinar, T. Sercu, and R. Feris, "Big-little net: An efficient multi-scale feature representation for visual and speech recognition," in Proceedings of the International Conference on Learning Representations, 2019.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
[17] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
[18] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," in Proceedings of the thirty-first AAAI conference on artificial intelligence, 2017.
[19] M. Tan and Q. V. Le, "Mixconv: Mixed depthwise convolutional kernels," in 30th British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, September 9-12, 2019. BMVA Press, 2019, p. 74.
[20] X. Li, W. Wang, X. Hu, and J. Yang, "Selective kernel networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 510–519.
[21] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, "Eca-net: Efficient channel attention for deep convolutional neural networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11534–11542.
[22] Z. Gao, J. Xie, Q. Wang, and P. Li, "Global second-order pooling convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3024–3033.
[23] Z. Qin, P. Zhang, F. Wu, and X. Li, "Fcanet: Frequency channel attention networks," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 783–792.
[24] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, "Attention augmented convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3286–3295.
[25] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, "Gcnet: Non-local networks meet squeeze-excitation networks and beyond," in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019, pp. 1971–1980.
[26] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7794–7803.
[27] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng, "A2-nets: Double attention networks," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 350–359.
[28] Y. Dai, F. Gieseke, S. Oehmcke, Y. Wu, and K. Barnard, "Attentional feature fusion," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 3560–3569.
[29] H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, Z. Zhang, Y. Sun, T. He, J. Mueller, R. Manmatha et al., "Resnest: Split-attention networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2736–2746.
[30] Q.-L. Zhang and Y.-B. Yang, "Sa-net: Shuffle attention for deep convolutional neural networks," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 2235–2239.
[31] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, "Cbam: Convolutional block attention module," in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19.
[32] J. Park, S. Woo, J.-Y. Lee, and I.-S. Kweon, "Bam: Bottleneck attention module," in British Machine Vision Conference (BMVC). British Machine Vision Association (BMVA), 2018.
[33] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," in International Conference on Learning Representations, 2020, pp. 13 713–13 722.
[34] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in European Conference on Computer Vision. Springer, 2020, pp. 213–229.
[35] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable detr: Deformable transformers for end-to-end object detection," in International Conference on Learning Representations, 2020.
[36] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015, pp. 448–456.
[37] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proceedings of the fourteenth international conference on artificial intelligence and statistics, 2011, pp. 315–323.
[38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "Imagenet large scale visual recognition challenge," International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.
[39] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755.
[40] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin, "MMDetection: Open mmlab detection toolbox and benchmark," arXiv preprint arXiv:1906.07155, 2019.
[41] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-cam: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.

Qingbei Guo received the M.S. degree from the School of Computer Science and Technology, Shandong University, Jinan, China, in 2006. He is a member of the Shandong Provincial Key Laboratory of Network based Intelligent Computing and an associate professor in the School of Information Science and Engineering, University of Jinan. He is now a Ph.D. student at Jiangnan University, Wuxi, China. His current research interests include wireless sensor networks, deep learning/machine learning, computer vision and neural networks.

Xiao-jun Wu received his B.S. degree in mathematics from Nanjing Normal University, Nanjing, PR China in 1991, and his M.S. degree in 1996 and Ph.D. degree in Pattern Recognition and Intelligent System in 2002, both from Nanjing University of Science and Technology, Nanjing, PR China. He was a fellow of United Nations University, International Institute for Software Technology (UNU/IIST) from 1999 to 2000. From 1996 to 2006, he taught in the School of Electronics and Information, Jiangsu University of Science and Technology, where he was an exceptionally promoted professor. He joined the School of Information Engineering, Jiangnan University in 2006, where he is a professor. He won the most outstanding postgraduate award from Nanjing University of Science and Technology. He has published more than 300 papers in his fields of research. He was a visiting researcher in the Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey, UK from 2003 to 2004. His current research interests are pattern recognition, computer vision, fuzzy systems, neural networks and intelligent systems.

Zhiquan Feng received his M.S. degree in Computer Software from Northwestern Polytechnical University, China, in 1995, and his Ph.D. degree in Computer Science & Engineering from Shandong University, China, in 2006. He is currently a Professor at University of Jinan, China. Dr. Feng is a visiting Professor of Sichuan Mianyang Normal University. As the first author or corresponding author he has published more than 100 papers in international journals and conference proceedings, 2 books, and 30 patents in the areas of human hand recognition and human-computer interaction. He has served as the Deputy Director of the Shandong Provincial Key Laboratory of Network based Intelligent Computing, group leader of Human Computer Interaction based on natural hand, editorial board member of Computer Aided Drafting Design and Manufacturing (CADDM), and editorial board member of The Open Virtual Reality Journal. He is a deputy editor of World Research Journal of Pattern Recognition and a member of the Computer Graphics professional committee. Dr. Feng's research interests are in human hand tracking/recognition/interaction, virtual reality, human-computer interaction, and image processing. His research has been extensively supported by the Key R&D Projects of the Ministry of Science and Technology, the Natural Science Foundation of China, Key Projects of the Natural Science Foundation of Shandong Province, and Key R&D Projects of Shandong Province, with total grant funding over three million RMB. For more information, please refer to http://nbic.ujn.edu.cn/nbic/index.php.

Tianyang Xu received the B.Sc. degree in electronic science and engineering from Nanjing University, Nanjing, China, in 2011, and the Ph.D. degree from the School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China, in 2019. He is currently an Associate Professor at the School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China. His research interests include visual tracking and deep learning. He has published several scientific papers in venues including IJCV, ICCV, TIP, TIFS, TKDE, TMM, and TCSVT. He achieved top-1 tracking performance in several competitions, including the VOT2018 public dataset (ECCV18), the VOT2020 RGBT challenge (ECCV20), and the Anti-UAV challenge (CVPR20).

Cong Hu received the Ph.D. degree from Jiangnan University, Wuxi, China, in 2019. He is currently an Associate Professor with the School of Artificial Intelligence and Computer Science, Jiangnan University. He was a visiting Researcher from July 2018 to July 2019 with the Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, UK. His research interests include pattern recognition, machine learning, and computer vision.

APPENDIX
Figure 6. Visualization of object detection results using ResNet-50, SDA-ResNet-86, SDA-SENet-86, and SDA-Res2Net-86 as backbone networks, respectively. (Columns, left to right: original image, ResNet-50, SDA-ResNet-86, SDA-SENet-86, SDA-Res2Net-86.)

Figure 7. Visualization of instance segmentation results using ResNet-50, SDA-ResNet-86, SDA-SENet-86, and SDA-Res2Net-86 as backbone networks, respectively. (Columns, left to right: original image, ResNet-50, SDA-ResNet-86, SDA-SENet-86, SDA-Res2Net-86.)
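As a rough illustration of how detection and segmentation overlays like those in Figures 6 and 7 can be rendered, the sketch below runs a stock torchvision Mask R-CNN with a ResNet-50-FPN backbone and draws its predicted boxes and instance masks. This is not the authors' pipeline: the backbone is a stand-in for the SDA variants, and the input file name ("demo.jpg") and confidence threshold (0.5) are assumptions made only for the example.

import torch
from torchvision import transforms
from torchvision.io import read_image
from torchvision.models.detection import maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights
from torchvision.utils import draw_bounding_boxes, draw_segmentation_masks

weights = MaskRCNN_ResNet50_FPN_Weights.DEFAULT
model = maskrcnn_resnet50_fpn(weights=weights).eval()

img = read_image("demo.jpg")                          # uint8 tensor of shape (3, H, W); placeholder file name
x = transforms.ConvertImageDtype(torch.float)(img)    # the detector expects float images in [0, 1]

with torch.no_grad():
    pred = model([x])[0]                              # dict with boxes, labels, scores, masks

keep = pred["scores"] > 0.5                           # assumed confidence threshold
boxes = pred["boxes"][keep]
masks = pred["masks"][keep, 0] > 0.5                  # binarize the predicted soft masks
names = [weights.meta["categories"][int(i)] for i in pred["labels"][keep]]

overlay = draw_segmentation_masks(img, masks, alpha=0.5)               # instance-segmentation view (cf. Figure 7)
overlay = draw_bounding_boxes(overlay, boxes, labels=names, width=2)   # detection view (cf. Figure 6)

Reproducing the paper's qualitative results would instead require the SDA backbones together with the detection setup built on the MMDetection toolbox [40].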
Figure 8. More Grad-CAM visualization results for different attention methods. (a) Small objects. (b) Medium objects. (c) Large objects. The softmax scores for the target class are displayed at the bottom of each image; red scores denote wrong classification results. The target-class softmax scores are reproduced below (rows: backbones; columns: target classes).

(a) Small objects
Class:         balloon / traffic_light / baseball / jellyfish / king_penguin / Egretta caerulea / assault_rifle / ladybug / padlock / oboe
ResNet-50:     0.99731 / 0.99661 / 0.10068 / 0.99975 / 0.81601 / 0.60039 / 0.21402 / 0.46512 / 0.99946 / 0.88737
SENet-50:      0.84935 / 0.97624 / 0.14728 / 0.27532 / 0.87783 / 0.22988 / 0.75929 / 0.44553 / 0.77475 / 0.45638
Res2Net-50:    0.76307 / 0.92784 / 0.38324 / 0.76601 / 0.94254 / 0.36123 / 0.27084 / 0.28484 / 0.92993 / 0.87614
SDA-ResNet-86: 0.93182 / 0.99855 / 0.35910 / 0.98820 / 0.92832 / 0.24809 / 0.78529 / 0.40901 / 0.99862 / 0.97373

(b) Medium objects
Class:         street_sign / beacon / goldfish / water_bottle / yawl / screw / cucumber / sorrel / ruddy_turnstone / mailbox
ResNet-50:     0.94413 / 0.69897 / 0.99998 / 0.99190 / 0.55770 / 0.99897 / 0.17088 / 0.88220 / 0.99875 / 0.99567
SENet-50:      0.76514 / 0.68921 / 0.88649 / 0.94517 / 0.86779 / 0.92298 / 0.15317 / 0.46802 / 0.91372 / 0.96849
Res2Net-50:    0.59625 / 0.57756 / 0.93651 / 0.95919 / 0.82728 / 0.96016 / 0.15816 / 0.75633 / 0.93039 / 0.91310
SDA-ResNet-86: 0.97091 / 0.81356 / 0.99791 / 0.99280 / 0.85883 / 0.99334 / 0.15147 / 0.81115 / 0.97085 / 0.99539

(c) Large objects
Class:         ballpoint / warplane / red_wine / space_heater / ice_cream / bulbul / spider_web / basset / tank / aircraft carrier
ResNet-50:     0.13955 / 0.37219 / 0.96822 / 0.95052 / 0.98367 / 0.99915 / 0.85813 / 0.29911 / 0.95586 / 0.98854
SENet-50:      0.18997 / 0.84351 / 0.97776 / 0.95094 / 0.75245 / 0.69576 / 0.78495 / 0.28536 / 0.88822 / 0.78986
Res2Net-50:    0.37201 / 0.75306 / 0.82614 / 0.99971 / 0.92802 / 0.92979 / 0.99871 / 0.31646 / 0.88847 / 0.91051
SDA-ResNet-86: 0.37297 / 0.90670 / 0.99489 / 0.98584 / 0.97545 / 0.95978 / 0.81562 / 0.80364 / 0.93569 / 0.94333
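As a minimal sketch of how Grad-CAM heatmaps and the target-class softmax scores summarized above can be generated, the snippet below implements the standard Grad-CAM procedure [41] on a plain torchvision ResNet-50. The SDA variants and the paper's exact preprocessing are not reproduced here, so the hooked stage ("layer4"), the 224x224 input size, and the class-index handling are assumptions made only for illustration.

import torch
import torch.nn.functional as F
from torchvision import models

# Plain torchvision ResNet-50 as a stand-in for the backbones compared in Figure 8.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["value"] = out.detach()

def bwd_hook(module, grad_in, grad_out):
    gradients["value"] = grad_out[0].detach()

# Hook the last convolutional stage, whose output feeds global average pooling.
model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

def grad_cam(x, class_idx=None):
    # x: preprocessed image tensor of shape (1, 3, 224, 224).
    logits = model(x)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()
    # Channel weights are the spatially averaged gradients (Selvaraju et al. [41]).
    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)        # (1, C, 1, 1)
    cam = F.relu((weights * activations["value"]).sum(dim=1))          # (1, H, W)
    cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                        mode="bilinear", align_corners=False)[0, 0]
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)           # normalize to [0, 1]
    score = F.softmax(logits, dim=1)[0, class_idx].item()              # target-class softmax score
    return cam, score

Overlaying the normalized map on the input image at roughly 40-50% opacity yields heatmaps comparable in spirit to those shown in Figure 8.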
