Selective Depth Attention Networks for Adaptive Multi-scale Feature Representation
new attention dimension, i.e., depth, in addition to existing attention dimensions such as channel, spatial, and branch. Based on this, we present a novel depth attention network that adaptively fuses the multi-scale features from different depth blocks for various scales of objects. Concretely, in each stage of neural networks, we build a depth attention module, which consists of two branches, i.e., a trunk branch and an attention branch. We use the unit of ResNet as the trunk branch, which is responsible for producing the hierarchical features from the blocks of each stage. The feature hierarchy consists of a series of feature maps with the same number of channels and the same resolution but different receptive field sizes, which makes them easy to align for feature fusion in the channel and spatial dimensions. Similar to SENet [11], the attention branch is implemented as a lightweight network module, acting as a feature selector for the hierarchical features of the trunk branch. Global embedded information is obtained by global average pooling (GAP), followed by two simple fully-connected (FC) layers, with a softmax operation applied to obtain the attention weights of the corresponding hierarchical features. Based on the weighted multiplication and element-wise summation, we obtain a rich multi-scale feature representation that corresponds to a wide range of receptive fields. Small receptive field components from previous blocks are included and delivered across stages. In this way, a cross-block information interaction is built by adaptively adjusting the receptive field sizes of different blocks, and the long-range dependency is captured along the depth direction. The proposed SDA-Net is constructed by stacking multiple SDA modules, similar to the residual blocks in a ResNet-like manner.

Moreover, our SDA method is orthogonal to other multi-scale and attention methods, so SDA-Net can be integrated with their neural networks, improving the model performance on image-related vision tasks with a slight computation burden. Such a combination merges their receptive field ranges, thus extending the range of effective receptive fields in the direction of small ones. A wider range of effective receptive fields can significantly boost the performance and enhance the interpretability of neural networks, which is verified by our experiments. Extensive experiments on numerous computer vision tasks, including image classification, object detection, and instance segmentation, have been conducted in this work. The results demonstrate that our SDA-Net achieves superior performance to the state-of-the-art multi-scale and attention networks under similar computation efficiency. Compared to other attention and multi-scale networks, as shown in Fig. 1, the proposed SDA-Net provides the best trade-off between accuracy and efficiency.

The proposed SDA method offers the following advantages:

• We are the first to analyze a new attention dimension, i.e., depth, in addition to other existing attention dimensions such as channel, spatial, and branch, thus enabling us to capture the long-range dependency along the depth direction.

• Based on this, we construct a novel SDA-Net that adaptively adjusts the receptive field sizes for rich multi-scale feature representations.

• The proposed SDA method is orthogonal to other multi-scale and attention methods, which extends the range of effective receptive fields towards small receptive fields for better performance with a slight computation increase, providing a new scope of interpretation of neural networks.

• Extensive experiments confirm that our SDA-Net outperforms the state-of-the-art multi-scale and attention networks on numerous computer vision tasks, i.e., image classification, object detection, and instance segmentation, under a similar model complexity.

The rest of this paper is organized as follows: we first introduce the related work in Section II. We then present the SDA method in Section III. Through extensive experiments, we demonstrate that the proposed SDA-Net delivers superior performance on numerous computer vision tasks in Section IV. In Section V, we conduct an ablation study to investigate the effect of different factors on our SDA-Net. Subsequently, we analyze the computational cost, the range of effective receptive fields, the multi-scale representation adaptability, and class activation mapping (CAM) in Section VI. Finally, we draw the paper to a conclusion in Section VII.

II. RELATED WORK

Multi-scale feature representations. It has been widely studied that multi-scale feature representations are essential for various computer vision tasks. TridentNet [9] utilized dilated convolutions with different dilation rates to generate multiple scale-specific feature maps to address the scale variation of objects in object detection, while exploring the relationship between the receptive field and the scale variation. A feature pyramid network (FPN) was built by exploiting the inherent multi-scale feature hierarchy of CNNs for object detection [12]. A multi-scale attention model was constructed with multi-scale input images to improve the performance of semantic segmentation [13]. Some works designed multi-branch networks to aggregate multi-scale information in different scale branches, achieving strong representational power for image classification [14], [15].

Multi-kernel convolution is one of the commonly used methods for multi-scale feature representation. The early InceptionNets [16]-[18] used the inception module with different kernel sizes to extract feature maps at multiple scales, improving the representation ability of models in image classification. PyConv [6] and MixNet [19] exploited lightweight group convolutions with multiple kernel sizes for better model accuracy and efficiency. Res2Net [5] designed a novel building block with hierarchical residual-like patterns to capture in-layer multi-scale features, increasing the range of receptive fields. SKNet [20] introduced a novel selective kernel convolution to adaptively adjust different kernel selections in a soft-attention manner. Similarly, EPSANet [7] used different kernel sizes to extract rich spatial features to achieve strong multi-scale representation ability.

In this work, we exploit the inherent multi-scale representation of CNNs to extract the hierarchical features instead of multi-kernel convolutions, and the small receptive fields of low layers are taken into account for a wide range of multi-scale representations. Furthermore, the proposed SDA-Net differs
[Figure: depth-wise attention module. Hierarchical features Z1, ..., Zm (each of size h × w × c) are summed into F, squeezed by GAP and two FC layers (1×1×c and 1×1×d), and normalized by a softmax to produce the attention vectors s that re-weight the Zi and yield the output O.]

[...] of CNNs, because the effective receptive field is adaptively captured in the in-stage pattern by using a novel depth attention mechanism.

Attention mechanisms. The attention mechanisms enable [...] ones. Many attention methods have been proposed to explore [...]
[...] hierarchy, which has gradual semantics from low to high levels. Our goal is to adaptively weight the feature hierarchy via the attention branch to achieve the multi-scale representation for better predictions.

To achieve this goal, we first merge the hierarchical features by using an element-wise summation, as follows,

F = \sum_{i=1}^{m} Z_i, \qquad (1)

where F ∈ R^{h×w×c}.

Then, we aggregate the spatial information of the feature maps F by using GAP to generate the global spatial context descriptor u ∈ R^{1×1×c}. In other words, u is generated by shrinking F through its spatial dimensions h × w. Specifically, the k-th element of the descriptor u is calculated by

u_k = \mathrm{GAP}(F_{:,:,k}) = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} F_{i,j,k}. \qquad (2)

Subsequently, u is fed into two 1×1 convolutions, as follows,

v = W_2(\gamma(\beta(W_1 u))), \qquad (3)

where β and γ are the functions of BN [36] and ReLU [37], respectively. W_1 ∈ R^{d×1×1×c} is a learned linear transformation that maps u to a low-dimensional space for better efficiency, and W_2 ∈ R^{m·c×1×1×d} is used to resume the channel dimension. Here, d = max(c/r, L), where r is the reduction ratio and L is the threshold value.

Further, we align v along the depth dimension to obtain v^⊤ by a reshaping operation, and employ the softly weighted mechanism with a softmax activation, as follows,

s_i^{\top} = \delta(v^{\top})_i = \frac{\exp(v_i^{\top})}{\sum_{j=1}^{m} \exp(v_j^{\top})}, \qquad (4)

where δ denotes the softmax function, and s^⊤ is the set of attention vectors.

Finally, at the end of each SDA module, we adaptively fuse different scales of semantic information by softly weighting the cross-block features according to the scale of the input objects. Thus, the output of the SDA module can be formulated as:

O = \gamma\Big(\sum_{i=1}^{m} s_i Z_i\Big), \qquad (5)

where s^⊤ is reshaped back to the alignment s along the channel dimension, and O denotes the final output of the proposed SDA module. Note that the attention model captures the channel-wise relationships across blocks, and decides how much attention to pay to the features at different depths, thus achieving the powerful multi-scale feature representation ability.
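To make the computation of Eqns. (1)-(5) concrete, the following PyTorch sketch implements one depth attention module over the pre-activation block outputs Z_1, ..., Z_m of a stage. It is a minimal illustration of our reading of the equations; the class name, the default reduction ratio r = 16, and the threshold L = 32 are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SDAttention(nn.Module):
    """Sketch of the depth attention branch (Eqns. (1)-(5)); hyper-parameters assumed."""

    def __init__(self, channels, num_blocks, reduction=16, min_dim=32):
        super().__init__()
        d = max(channels // reduction, min_dim)                        # d = max(c/r, L)
        self.num_blocks = num_blocks
        self.fc1 = nn.Conv2d(channels, d, kernel_size=1, bias=False)   # W1
        self.bn = nn.BatchNorm2d(d)                                    # beta (BN)
        self.relu = nn.ReLU(inplace=True)                              # gamma (ReLU)
        self.fc2 = nn.Conv2d(d, channels * num_blocks, kernel_size=1)  # W2
        self.softmax = nn.Softmax(dim=1)                               # over the depth (block) axis

    def forward(self, hierarchy):
        # hierarchy: list of m pre-activation block outputs Z_i, each of shape (B, c, h, w)
        z = torch.stack(hierarchy, dim=1)                 # (B, m, c, h, w)
        f = z.sum(dim=1)                                  # Eqn. (1): element-wise summation
        u = f.mean(dim=(2, 3), keepdim=True)              # Eqn. (2): GAP -> (B, c, 1, 1)
        v = self.fc2(self.relu(self.bn(self.fc1(u))))     # Eqn. (3): two 1x1 convolutions
        v = v.view(v.size(0), self.num_blocks, -1, 1, 1)  # align along the depth dimension
        s = self.softmax(v)                               # Eqn. (4): softmax across blocks
        return self.relu((s * z).sum(dim=1))              # Eqn. (5): weighted fusion O

# Toy usage: five blocks of a stage with 256 channels and 56x56 resolution.
blocks = [torch.randn(2, 256, 56, 56) for _ in range(5)]
out = SDAttention(channels=256, num_blocks=5)(blocks)     # -> (2, 256, 56, 56)
```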
B. Neural Network Architecture

We construct our proposed SDA-Net by stacking multiple SDA modules. Based on ResNet-50, we adjust the number of blocks in each stage from (3, 4, 6, 3) to (5, 6, 12, 5), and set 8 groups in the second convolution of each block to align the FLOPs with ResNet-50. Thanks to the proposed depth attention mechanism, SDA-Net enables a wide range of effective receptive fields for flexible multi-scale representation ability.

To be specific, if SDA-Net pays more attention to the deep blocks of each stage, then it outputs high-level semantic features with large receptive fields, which allows it to favor large-scale objects. On the contrary, if the shallow blocks are paid more attention in each stage, then low-level features are obtained with small receptive fields, which are suitable for learning small-scale objects. The original ResNet can be viewed as a special case of our SDA-Net, which generates fixed-scale features for single-scale objects through only the last block. In contrast, our SDA-Net adaptively learns dynamic multi-scale contexts to achieve better performance for various scales of objects by leveraging the intermediate features instead of just the features of the last block. More importantly, because of the newly proposed depth attention dimension, our method is orthogonal to other multi-scale attention methods such as EPSANet, SENet, CBAM, PyConv, and Res2Net. Thus, the proposed SDA method further extends the range of effective receptive fields towards small receptive fields, enhancing the model representation ability at complexities similar to those of EPSANet-101, SENet-101, CBAM-101, PyConv-101, and Res2Net-101, respectively. A comparison of the ResNet-50, SDA-ResNet-86, SDA-SENet-86 and SDA-Res2Net-86 neural architectures can be found in Table I.
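As a companion sketch, one SDA stage could wrap the residual blocks and the attention branch as below, reusing the SDAttention module sketched earlier. The Bottleneck placeholder, its constructor signature, and the handling of the last activation are assumptions; only the block counts (5, 6, 12, 5) and the 8-group second convolution follow the description above.

```python
import torch
import torch.nn as nn

SDA_RESNET86_BLOCKS = (5, 6, 12, 5)   # per-stage block counts, vs. (3, 4, 6, 3) in ResNet-50

class SDAStage(nn.Module):
    """One stage of SDA-ResNet-86: residual blocks plus a depth attention module."""

    def __init__(self, bottleneck_cls, in_channels, out_channels, num_blocks):
        super().__init__()
        # bottleneck_cls is a placeholder for a ResNet bottleneck whose 3x3 conv uses 8 groups
        # and which returns x + F(x) *before* the final ReLU (see the ablation in Section V).
        self.blocks = nn.ModuleList(
            [bottleneck_cls(in_channels if i == 0 else out_channels, out_channels, groups=8)
             for i in range(num_blocks)])
        self.attention = SDAttention(out_channels, num_blocks)

    def forward(self, x):
        hierarchy = []
        for block in self.blocks:
            x = block(x)               # pre-activation output x + F(x) of this block
            hierarchy.append(x)
            x = torch.relu(x)          # the activated tensor feeds the next block
        return self.attention(hierarchy)   # adaptively fused stage output O
```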
In addition to the above attention mechanism, the composition of Z is also critical for our SDA-Net to generate features with rich semantic information according to Eqns. (1) and (5). The key lies in the choice of features from the intermediate blocks. There are several candidate features, including x + F(x) and γ(x + F(x)), where x is the input tensor of a block and F(x) represents the multiple convolutional layers that learn a residual mapping function. In this paper, we select the output of each block before activation, i.e., x + F(x), to constitute the sequence of feature map subsets Z, given the information loss caused by the activation function γ. The choice and effect of these intermediate features are further discussed in the Ablation Study.

IV. EXPERIMENTS

In this section, we conduct extensive experiments to investigate the effectiveness of our SDA-Net on numerous benchmark CV tasks, i.e., image classification, object detection, and instance segmentation.

A. Implementation Details

To evaluate the effectiveness of our SDA-Net for the image classification task, we carry out experiments on the large-scale ImageNet-1K [38] dataset and deploy the typical ResNet-50 as the trunk branch. Considering the combination with other multi-scale and attention methods, we also deploy several other trunk branches, including EPSANet, SENet, CBAM, PyConv, and Res2Net. The ImageNet-1K dataset contains 1000 object classes with 1.28M training images and 50K validation images used for testing.
Table I
ARCHITECTURES FOR IMAGENET CLASSIFICATION. THE FOUR COLUMNS REFER TO RESNET-50, SDA-RESNET-86, SDA-SENET-86 AND SDA-RES2NET-86, RESPECTIVELY.
We adopt the same data augmentation and hyper-parameter settings as in [10]. Specifically, the training images are resized to 256×256 and then randomly cropped to 224×224, followed by random horizontal flipping. All the models are trained from scratch by stochastic gradient descent (SGD) with a weight decay of 1e-4, a momentum of 0.9, and a mini-batch size of 256. The initial learning rate is set to 0.1 and decreased by a factor of 10 every 30 epochs until 120 epochs. During testing, we resize the validation images to 256×256 and then use the center crop of 224×224 for evaluation. Label smoothing [17] is applied to regularize the network.
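For reference, this ImageNet recipe maps onto standard PyTorch components roughly as sketched below; the label-smoothing coefficient of 0.1 and the ResNet-50 stand-in are assumptions, not values stated in the paper.

```python
import torch
from torchvision import transforms
from torchvision.models import resnet50

train_tf = transforms.Compose([
    transforms.Resize((256, 256)),        # resize to 256x256
    transforms.RandomCrop(224),           # random 224x224 crop
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
val_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),           # center crop for evaluation
    transforms.ToTensor(),
])

model = resnet50()                        # stand-in; replace with an SDA-Net variant
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)   # smoothing value assumed

for epoch in range(120):
    # ... one pass over the training set with mini-batch size 256 ...
    scheduler.step()
```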
To evaluate our SDA method on MS COCO [39] for other downstream CV tasks, we use multiple detectors such as Faster R-CNN [2] and Mask R-CNN [4]. The COCO-2017 dataset consists of about 118K training images and 5K validation images used for testing, covering 80 foreground object classes and one background class. These detectors are implemented with the MMDetection toolbox [40] and are configured with its default settings. Concretely, during training, the input images are resized such that the short edge is 800 pixels. We train all the models for 12 epochs using SGD with an initial learning rate of 0.02, which is decreased by a factor of 10 at the 8th and 11th epochs. The other hyper-parameters are set as follows: a weight decay of 1e-4, a momentum of 0.9, and a mini-batch size of 16 (4 GPUs with 4 images per GPU).
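This detection schedule is the standard MMDetection 1× setting; a config fragment in MMDetection 2.x style is sketched below, where the 'SDAResNet' backbone type is hypothetical and the worker count is assumed.

```python
# Sketch of the COCO detection schedule in MMDetection 2.x config style (1x schedule).
model = dict(backbone=dict(type='SDAResNet', depth=86))      # hypothetical backbone type
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
lr_config = dict(policy='step', step=[8, 11])                # decay by 10x at epochs 8 and 11
runner = dict(type='EpochBasedRunner', max_epochs=12)
data = dict(samples_per_gpu=4, workers_per_gpu=2)            # 4 GPUs x 4 images = batch 16
train_pipeline = [dict(type='Resize', img_scale=(1333, 800), keep_ratio=True)]  # short edge 800
```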
B. Image Classification

We compare our SDA-Net with the SOTA multi-scale networks on ImageNet-1K, including bL-ResNet [15], ScaleNet [14], EPSANet [7], PyConv [6], and Res2Net [5], to evaluate the proposed method on the image classification task. Table II compares our results with those of the SOTA multi-scale methods and with the results of combining our method with some of those multi-scale methods.

Specifically, our SDA-Net outperforms the other multi-scale networks by a large margin with similar computational complexity. Our method achieves 1.45%, 0.78%, 1.27% and 0.77% higher top-1 accuracy than bL-ResNet, ScaleNet, EPSANet, PyConv, and Res2Net, respectively. Moreover, the models integrating our method with other multi-scale methods significantly improve the recognition performance, even consistently surpassing their 101-layer variant networks, i.e., EPSANet-101, PyConv-101, and Res2Net-101, with a smaller computational cost. This fully demonstrates the superiority of adaptively aggregating different receptive fields.

To further evaluate the proposed method for the image classification task, our SDA-Net is compared with the SOTA attention methods of different attention dimensions on ImageNet-1K, such as SENet [11], FcaNet [23], GCNet [25], CBAM [31] and EPSANet [7]. Table III presents the comparison with the SOTA attention methods and the results of combining our method with some of those attention methods.

In particular, our SDA-Net achieves the best performance among these attention methods with similar complexities. For the channel attention dimension, our method outperforms SENet by more than 2% in terms of top-1 accuracy. For the spatial-related attention dimension, our SDA-Net obtains a 1.42% higher gain than CBAM. Our method boosts top-1 accuracy by 1.27% over the branch attention method, i.e., EPSANet. The performance gap between our SDA-Net and FcaNet is the smallest (0.24%). Additionally, we observe 0.66%, 0.69%, and 1.09% improvements when combining with the typical networks of different attention dimensions, i.e., SENet, CBAM, and EPSANet, respectively. Our SDA-Net even consistently surpasses their 101-layer network variants with less memory and computation cost. This fully shows the effectiveness of our proposed depth attention mechanism.
Table II
CLASSIFICATION COMPARISON AMONG SEVERAL SOTA MULTI-SCALE METHODS ON IMAGENET-1K.

Table III
CLASSIFICATION COMPARISON AMONG SEVERAL SOTA ATTENTION METHODS ON IMAGENET-1K.
C. Object Detection

We apply our method to the object detection task to explore its generalization ability. We evaluate our SDA-Net as the backbone of Faster R-CNN and Mask R-CNN. All models are trained along with FPN [12] and are tested on the MS COCO dataset. Table IV shows the object detection results. The SDA-Net based models outperform their counterparts by a significant margin with both Faster R-CNN and Mask R-CNN. Compared with ResNet-50, SENet-50, and Res2Net-50, their SDA-Net based models achieve 3.4%, 3.4%, and 8.4% higher AP for Faster R-CNN, respectively, and 3.4%, 2.8%, and 2.7% higher AP for Mask R-CNN, respectively, surpassing even their 101-layer variant models in terms of almost all indicators. In the Appendix, some visual results of object detection on challenging examples are illustrated in Fig. 6.

D. Instance Segmentation

Besides the object detection task, we also evaluate our method on the instance segmentation task to further verify its generalization ability. Similarly, we use our SDA-Net with FPN [12] as the backbone of Mask R-CNN and test the performance on the MS COCO dataset. The instance segmentation results are shown in Table V. Compared with other methods, our proposed method achieves the best performance. The improvements in AP are 2.4%, 1.6%, and 2.3% for ResNet-50, SENet-50, and Res2Net-50, respectively, surpassing even their 101-layer variant models in terms of almost all indicators. In the Appendix, some visual results of instance segmentation on challenging examples are illustrated in Fig. 7.

V. ABLATION STUDY

As discussed in Section III, there are two major factors affecting the multi-scale information flows across stages: the feature sequence in the trunk branches and the adaptive depth attention mechanism.
Table IV
OBJECT DETECTION RESULTS OF DIFFERENT METHODS ON COCO VAL2017.

Table V
INSTANCE SEGMENTATION RESULTS OF DIFFERENT METHODS ON COCO VAL2017.
A. The Effect of the Feature Sequence in Trunk Branch

We investigate the effect of the feature sequence in the trunk branches. Based on Eqn. (5), we find that the hierarchical features are essential factors affecting the information flow across stages. Empirically, we select two crucial feature elements, i.e., γ(x + F(x)) and x + F(x), which contain rich semantic information together with the identity features.

The comparison result is shown in Table VI. It is easily seen that x + F(x) achieves higher accuracy than γ(x + F(x)). We argue that this is because the ReLU function causes information loss by zeroing the negative activations, and their combination further increases the potential loss. Therefore, we select x + F(x) as the component of the feature sequence Z in the trunk branches for better performance.

B. The Effect of Adaptive Attention vs. Non-adaptive Attention

Next, we compare our adaptive attention with non-adaptive attention to investigate the effect of our depth attention mechanism. These two types of networks have similar architectures except for the attention branches. Different from our adaptive SDA-Net, the non-adaptive attention network aggregates the features from different blocks by simple averaging, i.e., O = γ((1/m) Σ_{i=1}^{m} Z_i), instead of the adaptive weighting operation, so the features of each block are equally important for the multi-scale feature representations.

As shown in Table VII, using our adaptive depth attention method yields better performance, achieving close to 0.8% higher accuracy than the non-adaptive attention competitor, which verifies the effectiveness of our depth attention mechanism.
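For clarity, the non-adaptive baseline used in this comparison can be written as a one-line fusion, in contrast to the learned softmax weighting of the SDAttention sketch above.

```python
import torch

def non_adaptive_fusion(hierarchy):
    """O = gamma((1/m) * sum_i Z_i): every block gets the same weight 1/m."""
    z = torch.stack(hierarchy, dim=1)      # (B, m, c, h, w)
    return torch.relu(z.mean(dim=1))       # equal weighting, no attention branch
```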
More interestingly, the non-adaptive attention network achieves better performance than the ResNet-86 model, improving the accuracy from 77.43% to 78.01%. Therefore, the result fully verifies the success of our SDA-Net design, which exploits the hierarchical feature representation of CNNs and the depth attention mechanism for richer information fusion.

VI. ANALYSIS AND INTERPRETATION

A. Computational Complexity

In this part, we consider the additional parameters introduced by the proposed SDA module. These additional parameters come from the two 1×1 convolution layers of the attention branch. Thus, the total number of additional parameters, which equals the additional computational complexity since the 1×1 convolutions operate on 1×1 spatial inputs, can be calculated by

\frac{1}{r} \sum_{i=1}^{n} (m_i + 1) c_i^2, \qquad (6)

where n is the number of stages, r refers to the reduction ratio, m_i denotes the number of blocks and c_i denotes the number of output channels of the i-th stage.

Our SDA-Net offers a good trade-off between improved accuracy and increased computational complexity. To illustrate the model complexity, we make a comparison between ResNet-86 and SDA-ResNet-86. For a fair comparison, we use the same crop size of 224 for the input images. ResNet-86 requires ~24.44M Params (the number of parameters) and ~3.87G FLOPs (floating-point operations) for a single input image. In comparison, our SDA-ResNet-86 requires ~27.22M Params and ~3.88G FLOPs, corresponding to an acceptable increase in the number of parameters and a slight increase in computational complexity. At the cost of this acceptable additional model complexity, SDA-ResNet-86 significantly surpasses ResNet-101 and ResNet-152 as well as ResNet-50, achieving 3.56%, 1.93% and 1.18% higher top-1 accuracy, respectively.
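As a quick sanity check of Eqn. (6), the snippet below evaluates it for a ResNet-50-style channel layout and the (5, 6, 12, 5) block counts; the per-stage output widths and the reduction ratio r = 16 are assumptions made for the example.

```python
def extra_params(stage_blocks, stage_channels, reduction=16):
    """Additional attention-branch parameters: (1/r) * sum_i (m_i + 1) * c_i^2 (Eqn. (6))."""
    return sum((m + 1) * c * c for m, c in zip(stage_blocks, stage_channels)) / reduction

# SDA-ResNet-86 example: block counts (5, 6, 12, 5), assumed stage output channels as below.
print(extra_params((5, 6, 12, 5), (256, 512, 1024, 2048)))   # ~2.56M extra parameters
```

With these assumed widths, the overhead is on the order of a few million parameters, in line with the gap between the ~24.44M of ResNet-86 and the ~27.22M of SDA-ResNet-86 quoted above.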
B. The Range of Effective Receptive Field

We have demonstrated that the range of the effective receptive field is a crucial issue for objects at all scales in many CV tasks. For simplicity, we take EPSANet as an example to show the essential difference from our proposed SDA-Net through a comparative analysis. According to the theory of effective receptive fields, the effective receptive field grows linearly with the kernel size K and sub-linearly with the depth L of networks [8]. In other words, the effective receptive field is proportional to K√L (i.e., KL^{1/2}). It follows that our proposed method effectively extends the range of effective receptive fields towards small receptive fields because it aggregates shallow features.

In Fig. 3, we show the range of effective receptive fields in the second stage for EPSANet-86 and our SDA-ResNet-86, taking only two factors, K and L, into consideration. The output of these two networks contains multiple feature components with different receptive fields, in which the large receptive fields contribute to capturing semantic information for large-scale objects, and the small receptive fields help to extract local details for small-scale patterns. For EPSANet, it is easily seen that the smallest and largest components of the receptive fields depend on the branches with kernel sizes 3 and 9, respectively. By contrast, the largest receptive field of our SDA-ResNet-86 depends on kernel size 3, which is equivalent to the smallest one of EPSANet, and the other receptive fields gradually shrink as the network depth decreases. Therefore, it is worth noting that our method seeks to strengthen the power of multi-scale representations towards small receptive fields, and the combination with other methods further extends the range of effective receptive fields to achieve stronger representational power.

Figure 3. The range of the adaptive effective receptive field in the second stage. (a) EPSANet. (b) Our SDA-ResNet-86.
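A toy calculation of the K√L scaling makes the direction of this extension visible; the six-block depth of the second stage follows the architecture above, and the absolute values are illustrative only, not the exact quantities plotted in Fig. 3.

```python
import math

K = 3                                   # kernel size of the trunk 3x3 convolutions
for depth in range(1, 7):               # the six blocks of the second stage
    erf = K * math.sqrt(depth)          # effective receptive field ~ K * sqrt(L) [8]
    print(f"block {depth}: ERF ~ {erf:.1f}")
```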
C. Attention Distribution

To understand how our SDA-Net works for adaptive multi-scale learning by dynamically adjusting receptive fields, we analyze the attention distribution for the same input objects at different scales. Concretely, we enlarge all the samples of the ImageNet validation set from 0.14× to 2.0×, followed by a 224×224 central cropping. For each scaled version of the validation set, we record the attention values of each block and analyze their distributions to mathematically interpret the adaptive adjustment of the receptive fields.

Fig. 4 shows the attention distributions of each block in the second stage. On the one hand, it is seen that the blocks are not attended to equally for the multi-scale feature representations. On the other hand, as the input objects are enlarged, the attention mean gradually decreases for the first block; on the contrary, for the fourth block, the attention mean tends to increase. For the third block, when the enlargement factor is less than 1×, the attention mean increases dramatically. At the 1× scale, the attention mean reaches its peak, and it gradually decreases when the enlargement factor is greater than 1×. The second block follows a similar pattern to the third block. The fifth and last blocks show an approximately opposite trend to the second and third blocks. As a result, this suggests that the larger the input objects are, the more attention will be assigned to the features of the deeper blocks, which fully verifies that our SDA-Net adaptively adjusts the receptive fields according to the object scale sizes.

Figure 4. Attention distribution (mean, variance) of each block in the second stage, which consists of six blocks. (a) The first block. (b) The second block. (c) The third block. (d) The fourth block. (e) The fifth block. (f) The last block.
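A sketch of this scale-sweep protocol is given below: it hooks the softmax of the second-stage attention branch (as in the SDAttention sketch earlier) and logs the mean weight per block at each input scale. The attribute path, the 1/7-step scale grid, and the resize-then-crop helper are assumptions.

```python
import torch

scales = [round(k / 7, 2) for k in range(1, 15)]      # ~0.14x, 0.29x, ..., 2.0x
per_block_attention = {}                              # scale -> mean weight of each block
current_scale = None

def log_attention(module, inputs, output):
    # output: (B, m, c, 1, 1) softmax weights produced by the depth attention branch
    per_block_attention[current_scale] = output.mean(dim=(0, 2, 3, 4)).tolist()

# handle = sda_net.layer2.attention.softmax.register_forward_hook(log_attention)
# for current_scale in scales:
#     sda_net(resize_and_center_crop(val_images, current_scale))   # 224x224 central crop
```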
D. Class Activation Mapping

To further understand the adaptive multi-scale representation ability of our SDA-Net, we visualize the class activation mapping (CAM) by applying Grad-CAM [41] to images from the ImageNet validation set. Grad-CAM is a commonly used visualization method to localize the class-specific discriminative regions.

Fig. 5 shows the CAM visualization results using ResNet, SENet, Res2Net, and SDA-Net as the backbone. We randomly sample images containing objects of three different sizes (small, medium, and large) from the ImageNet validation set. For small target objects such as 'balloon', 'ladybug', and 'baseball', the SDA-Net based CAM regions tend to be more precise than those of the other networks. Similarly, SDA-Net provides more precise CAM localizations on medium target objects, such as 'street sign', 'beacon', and 'goldfish'. Our SDA-Net covers almost all parts of the large target objects, such as 'ballpoint', 'bulbul' and 'red wine', whereas the other networks, especially Res2Net, cover much irrelevant context. Accordingly, the target class scores of our SDA-Net also tend to be larger. In the Appendix, more CAM visualization results for different methods are illustrated in Fig. 8. As can be seen, the proposed SDA-Net localizes the object regions with various scale sizes more precisely than the other networks.

Figure 5. Grad-CAM visualization results. The softmax scores for the target class are displayed at the bottom of each image.
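A minimal Grad-CAM sketch, in the spirit of [41], is shown below. The stand-in ResNet-50 backbone and the choice of the last stage as the target layer are assumptions; the SDA-Net maps in Fig. 5 would be produced in the same way with the corresponding backbone.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50().eval()                # stand-in backbone; swap in an SDA-Net for Fig. 5
feats, grads = {}, {}
target_layer = model.layer4              # last stage assumed as the CAM target
target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

def grad_cam(image, class_idx):
    logits = model(image)                                   # (1, num_classes)
    model.zero_grad()
    logits[0, class_idx].backward()                         # gradient of the target class score
    weights = grads['a'].mean(dim=(2, 3), keepdim=True)     # GAP over the gradients
    cam = F.relu((weights * feats['a']).sum(dim=1, keepdim=True))
    return F.interpolate(cam, size=image.shape[-2:], mode='bilinear', align_corners=False)

heatmap = grad_cam(torch.randn(1, 3, 224, 224), class_idx=0)   # replace 0 with the target class
```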
VII. CONCLUSION AND FUTURE WORK

A new attention dimension, i.e., depth, is introduced for the first time; it is an essential factor in addition to the existing attention dimensions such as channel, spatial, and branch. Based on this, we present a simple yet effective SDA-Net to explore the depth attention mechanism for multi-scale feature representations. The network exploits the inherent hierarchical features of CNNs and the long-range block dependency to adaptively adjust the effective receptive fields for objects of different scale sizes. The proposed method is orthogonal to other multi-scale and attention methods, thus SDA-Net can be integrated with those networks to extend the range of effective receptive fields towards small receptive fields for better performance. We have achieved state-of-the-art performance in a broad range of CV tasks, including image classification, object detection, and instance segmentation.

ACKNOWLEDGMENT

This work is supported by the National Key R&D Program of China (Grant No. 2018YFB1004901), by the National Natural Science Foundation of China (Grant No. 61672265, U1836218, 62006097), by the 111 Project of Ministry of Education of China (Grant No. B12018), by UK EPSRC Grant EP/N007743/1, MURI/EPSRC/DSTL Grant EP/R018456/1, by the Science and Technology Program of University of Jinan (Grant No. XKY1913, XKY1915), and in part by the Natural Science Foundation of Jiangsu Province (Grant No. BK20200593).

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, pp. 1097-1105, 2012.
[2] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834-848, 2017.
[4] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961-2969.
[5] S. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. H. Torr, "Res2Net: A new multi-scale backbone architecture," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[6] I. C. Duta, L. Liu, F. Zhu, and L. Shao, "Pyramidal convolution: Rethinking convolutional neural networks for visual recognition," arXiv preprint arXiv:2006.11538, 2020.
[7] H. Zhang, K. Zu, J. Lu, Y. Zou, and D. Meng, "EPSANet: An efficient pyramid split attention block on convolutional neural network," arXiv preprint arXiv:2105.14447, 2021.
[8] W. Luo, Y. Li, R. Urtasun, and R. Zemel, "Understanding the effective receptive field in deep convolutional neural networks," in Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016, pp. 4905-4913.
[9] Y. Li, Y. Chen, N. Wang, and Z. Zhang, "Scale-aware trident networks for object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6054-6063.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[11] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132-7141.
[12] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117-2125.
[13] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, "Attention to scale: Scale-aware semantic image segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3640-3649.
[14] Y. Li, Z. Kuang, Y. Chen, and W. Zhang, "Data-driven neuron allocation for scale aggregation networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11526-11534.
[15] C. F. Chen, Q. Fan, N. Mallinar, T. Sercu, and R. Feris, "Big-little net: An efficient multi-scale feature representation for visual and speech recognition," in Proceedings of the International Conference on Learning Representations, 2019.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
[17] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818-2826.
[18] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[19] M. Tan and Q. V. Le, "MixConv: Mixed depthwise convolutional kernels," in 30th British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, September 9-12, 2019. BMVA Press, 2019, p. 74.
[20] X. Li, W. Wang, X. Hu, and J. Yang, "Selective kernel networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 510-519.
[21] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, "ECA-Net: Efficient channel attention for deep convolutional neural networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11534-11542.
[22] Z. Gao, J. Xie, Q. Wang, and P. Li, "Global second-order pooling convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3024-3033.
[23] Z. Qin, P. Zhang, F. Wu, and X. Li, "FcaNet: Frequency channel attention networks," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 783-792.
[24] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, "Attention augmented convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3286-3295.
[25] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, "GCNet: Non-local networks meet squeeze-excitation networks and beyond," in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019, pp. 1971-1980.
[26] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794-7803.
[27] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng, "A2-Nets: Double attention networks," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 350-359.
[28] Y. Dai, F. Gieseke, S. Oehmcke, Y. Wu, and K. Barnard, "Attentional feature fusion," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 3560-3569.
[29] H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, Z. Zhang, Y. Sun, T. He, J. Mueller, R. Manmatha et al., "ResNeSt: Split-attention networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2736-2746.
[30] Q.-L. Zhang and Y.-B. Yang, "SA-Net: Shuffle attention for deep convolutional neural networks," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 2235-2239.
[31] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, "CBAM: Convolutional block attention module," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3-19.
[32] J. Park, S. Woo, J.-Y. Lee, and I.-S. Kweon, "BAM: Bottleneck attention module," in British Machine Vision Conference (BMVC). British Machine Vision Association (BMVA), 2018.
Figure 6. Visualization of object detection results using ResNet-50, SDA-ResNet-86, SDA-SENet-86, and SDA-Res2Net-86 as backbone networks, respectively.
Figure 7. Visualization of instance segmentation results using ResNet-50, SDA-ResNet-86, SDA-SENet-86, and SDA-Res2Net-86 as backbone networks, respectively.
Figure 8. More Grad-CAM visualization results for different attention methods. (a) Small objects. (b) Medium objects. (c) Large objects. The softmax scores for the target class are displayed at the bottom of each image. The red scores denote the wrong classification results.