TIP21-SAMNet
Abstract—Recent progress on salient object detection (SOD) mostly benefits from the explosive development of Convolutional Neural Networks (CNNs). However, much of the improvement comes with larger network size and heavier computation overhead, which, in our view, is not mobile-friendly and thus difficult to deploy in practice. To promote more practical SOD systems, we introduce a novel Stereoscopically Attentive Multi-scale (SAM) module, which adopts a stereoscopic attention mechanism to adaptively fuse features of various scales. Building on this module, we propose an extremely lightweight network, namely SAMNet, for SOD. Extensive experiments on popular benchmarks demonstrate that the proposed SAMNet yields comparable accuracy with state-of-the-art methods while running at a GPU speed of 343fps and a CPU speed of 5fps for 336 × 336 inputs with only 1.33M parameters. Therefore, SAMNet paves a new path towards practical SOD. The source code is available on the project page https://mmcheng.net/SAMNet/.

Index Terms—Lightweight salient object detection, lightweight saliency detection, multi-scale learning.

Y. Liu and M.M. Cheng are with the College of Computer Science, Nankai University. X.Y. Zhang is with the School of Mathematical Sciences, Nankai University. J.W. Bian is with the School of Computer Science, The University of Adelaide, Australia. L. Zhang is with the Agency for Science, Technology and Research (A*STAR), Singapore. The first two authors contributed equally to this paper. M.M. Cheng is the corresponding author ([email protected]).

I. INTRODUCTION

Salient object detection (SOD), also known as saliency detection, aims at detecting the most visually distinctive objects or regions in natural images [1]. Progress in SOD has benefited a wide range of computer vision applications, including image retrieval [2], image segmentation [3], object detection [4], visual tracking [5], scene classification [6], content-aware image editing [7], etc. Conventional methods for this task mainly rely on hand-crafted low-level features and heuristic priors [1], [8], [9], but the lack of high-level semantic information usually leads to limited accuracy. Recently, thanks to the unprecedented success of Convolutional Neural Networks (CNNs), especially Fully Convolutional Networks (FCNs), deep learning based methods have refreshed the state-of-the-art performance of SOD [10]–[33].

However, those improvements do not come without cost: they usually rely on large network sizes and substantial computational overhead [14], [16], [21], [30], [31], [33]–[36]. For example, EGNet [37] with the VGG16 backbone has 108M parameters and needs ∼432 MB of disk space to store its pretrained model. Moreover, EGNet [37] can only run at 0.09fps on a powerful i7-8700K CPU and 12.7fps on an NVIDIA TITAN XP GPU for 336 × 336 images. Such ponderousness undoubtedly makes it less practical for real-time and resource-constrained applications such as autonomous driving, robots, augmented reality, and so on. In those scenarios, mobile devices have reduced computational capabilities, restrictive memory constraints, and limited energy budgets.

Designing lightweight CNNs is a natural solution to this problem and has been well studied for other tasks such as image classification [38]–[41]. Although we draw inspiration from these works, we are the first to make such an effort for SOD. This is nontrivial due to the following challenges: i) SOD requires both high-level abstract semantic features and low-level fine-grained features to locate salient objects and refine object details, respectively; ii) SOD needs multi-scale information to process salient objects with various sizes and aspect ratios in natural scenes. Since lightweight networks usually have shallow depths and simplified operations, they are less potent in multi-level and multi-scale learning than traditional large networks [42], [43]. Therefore, naively applying existing lightweight backbone networks such as MobileNets [38], [39] and ShuffleNets [40], [41] to SOD leads to suboptimal performance, as will be demonstrated in our experiments.

It is well known that CNNs learn high-level semantic information at their top sides and low-level fine details at their bottom sides, so different side-outputs of CNNs contain multi-scale information. Hence, to learn multi-level and multi-scale information, current state-of-the-art SOD methods (with large networks) adopt encoder-decoder network architectures [10]–[29] to integrate the multi-level side-output features of backbone networks. Recent development of SOD mainly comes from new strategies and modules for the effective fusion of multi-level backbone features.

Based on the above analyses, the key to lightweight SOD is how to effectively learn multi-level and multi-scale information within a limited parameter budget. Instead of integrating different side-outputs of backbone networks [14], [19], [21], [23], [28], [30], [37] or summarizing the convolutional features of different dilation rates [15], [44] as done in previous studies, we propose a novel Stereoscopically Attentive Multi-scale (SAM) module for multi-scale learning. It adopts a stereoscopic attention mechanism to automatically control the learning at different scales, so it can effectively learn the necessary information at various levels of deep CNNs. Using the SAM module as the basic unit, we build a lightweight encoder-decoder network, namely SAMNet, to integrate the multi-level and multi-scale features learned by SAM modules.
SAMNet achieves comparable accuracy with state-of-the-art SOD methods while running at 5fps on an i7-8700K CPU and 343fps on an NVIDIA TITAN XP GPU for 336 × 336 images. Moreover, SAMNet has only 1.33M parameters. These lightweight properties make it well suited to practical mobile applications.

In summary, our contributions are threefold:
• We propose a novel Stereoscopically Attentive Multi-scale (SAM) module that adopts a stereoscopic attention mechanism for effective and efficient multi-scale learning.
• Using the SAM module as the basic unit, we propose SAMNet, a lightweight encoder-decoder architecture for SOD, which is, to the best of our knowledge, the first lightweight SOD model.
• We empirically evaluate SAMNet on six popular SOD datasets and demonstrate its comparable accuracy, much higher efficiency, and much smaller network size.
II. RELATED WORK

a) Salient Object Detection: Over the past two decades, numerous methods have been proposed to detect salient objects in an image. Traditional methods are mainly based on hand-crafted features, such as image contrast [1], texture [45], central prior [8], background prior [46], etc. Despite the efficiency of these approaches, hand-crafted features intrinsically lack the capacity for high-level representation, leading to limited performance.

With the rapid development of deep learning, CNN-based SOD methods have surpassed traditional counterparts by a considerable margin. Early CNN-based methods [24], [47]–[49] include several fully-connected layers, resulting in the loss of essential spatial information of the whole image. Since the seminal work [50] proposed FCNs to predict semantic labels at the pixel level, FCN-based SOD approaches [11]–[16], [19]–[23], [25] have dominated this field by exploring multi-level and multi-scale deep features, as described above.

Although previous approaches have achieved high accuracy by employing powerful CNNs, they are relatively slow, hungry for energy, and occupy large memory space. These shortcomings make it difficult to deploy state-of-the-art methods in real-world applications. This is our motivation for this work, i.e., towards lightweight SOD that trades off accuracy, model size, and speed. Our technical motivation comes from the fact that most previous CNN-based methods [10]–[29] improve performance through the exploration of multi-scale and multi-level deep learning. In this paper, we propose a novel SAM module that adopts stereoscopic attention for lightweight multi-scale learning.

b) Attention Mechanism: The attention mechanism plays an essential role in human perception [51], [52]. Instead of processing the whole image at once, the human visual system adaptively filters less essential information, such as image background, to enhance the capture of visual structure. This has inspired research in deep learning.

The computer vision community has recently witnessed numerous successes of attention mechanisms in a wide range of tasks, including SOD [19], [26], [34], sequence learning [53], person re-ID [54], and image recovery [55]. For image classification, Wang et al. [56] proposed to use an hourglass module to generate attention maps for hidden features. Furthermore, Hu et al. [57] proposed a "Squeeze-and-Excitation" module to explicitly exploit inter-channel relationships and adaptively recalibrate feature maps in a channel-wise manner. Beyond channel-wise attention, CBAM [58] introduces spatial attention in a similar way. These methods fall into the self-attention category: spatial and channel-wise self-attention can adaptively emphasize the most informative feature patches and channels, respectively. Different from these methods, we introduce a stereoscopic attention mechanism to adaptively recalibrate the information flow from multiple branches based on both channel inter-dependencies and spatial contextual clues. Hence the proposed SAM module can effectively learn multi-scale information under a lightweight setting.

III. METHODOLOGY

In this section, we elaborate on the proposed framework for SOD. In Section III-A, we present a simple multi-scale module. In Section III-B, we propose the SAM module for effective multi-scale learning. Finally, in Section III-C, we incorporate the proposed SAM module into the encoder-decoder network and elaborate on the full network architecture.

A. Multi-scale Learning

Based on the above analyses, the multi-scale feature representations of CNNs are of great importance for SOD [14], [19], [23], [28], [30], [37]. Inspired by this, we first propose a simple multi-scale module to process visual information at different scales. Being lightweight is central to our design. We adopt dilated convolutions with different dilation rates to capture multi-scale information and use the depthwise separable convolution to reduce floating-point operations and model parameters. We call the resulting operator the dilated depthwise separable convolution and use it as the basic convolutional operator to scale up the network dimensionality in terms of depth and width.

Formally, let I ∈ R^{C×H×W} be the input feature map whose number of channels, height, and width are C, H, and W, respectively. With the input I, we first apply a single depthwise separable conv3×3 (DSConv3×3 for short) to extract common information F_0 for each branch, namely,

F_0 = K_0(I),  (1)

where K_0 denotes a DSConv3×3 operation. At different branches, dilated DSConv3×3 with different dilation rates are applied to F_0, i.e.,

F_i = K_i(F_0),  i = 1, 2, · · · , N,  (2)

where K_i denotes the dilated DSConv3×3 operation at branch i, and N is the number of branches. Then, contextual information at different scales is aggregated by a single element-wise summation with a residual connection, namely,

F = Σ_{i=0}^{N} F_i.  (3)

Here, we use element-wise summation instead of concatenation, because concatenation would greatly increase the number of channels, leading to heavier computational complexity and more network parameters.
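To make the basic operator concrete, the following Python (PyTorch) snippet gives a minimal sketch of a dilated depthwise separable conv3×3: a depthwise 3×3 convolution with a chosen dilation rate followed by a pointwise 1×1 convolution. It is only an illustration of the operator described above, not the authors' released implementation; the BatchNorm/ReLU placement and the class name DSConv3x3 are our own assumptions.

import torch
import torch.nn as nn

class DSConv3x3(nn.Module):
    # Minimal dilated depthwise separable 3x3 convolution:
    # a depthwise 3x3 (optionally dilated) followed by a pointwise 1x1.
    def __init__(self, channels: int, dilation: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(
            channels, channels, kernel_size=3,
            padding=dilation, dilation=dilation,
            groups=channels, bias=False)      # one 3x3 filter per channel
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

With the padding set equal to the dilation rate, the spatial resolution is preserved while the receptive field grows with the dilation.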
Fig. 1. Illustration of the proposed SAM module. Given the input feature map I, SAM first extracts multi-scale features F_i (i = 0, 1, · · · , N) through multiple branches. The multi-scale features F_i are aggregated via element-wise summation to produce M. Then, we calculate channel-wise attention logits d and spatial attention logits s based on M. d and s are multiplied to get stereoscopic attention logits v, which are transformed into stereoscopic attention weights W_i^V (i = 0, 1, · · · , N) using a softmax function. W_i^V (i = 0, 1, · · · , N) are used as weights to aggregate F_i to obtain F. At last, a residual connection [43] from I is added to get the output O. In the figure, ⊗ indicates element-wise multiplication, and the two multiplied feature maps are replicated to the same shape before multiplication. The symbol r denotes the dilation rate of the dilated DSConv3×3. Best viewed in color.
Finally, the aggregated features are further rearranged by a vanilla conv1×1, i.e.,

O = K_fuse(F) + I,  (4)

where K_fuse denotes the conv1×1 operation that fuses contextual information at various scales. The summation with I denotes a residual connection [43], which has been demonstrated to be effective for CNN training.

The dilation rates and the number of branches are hyper-parameters of our multi-scale module. Empirically, larger dilation rates and more branches are desirable when the input features I are of high resolution, as large feature maps usually possess contextual information at various scales.
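Putting Eqs. (1)-(4) together, a compact PyTorch sketch of this simple multi-scale module could look as follows. It re-implements the dilated DSConv3×3 from the previous sketch as a small helper so that the block stays self-contained; the branch count and the dilation rates (1, 2, 4) are illustrative hyper-parameters, not the paper's exact configuration.

import torch
import torch.nn as nn

def dsconv3x3(channels: int, dilation: int = 1) -> nn.Sequential:
    # Dilated depthwise separable 3x3 convolution (same operator as sketched above).
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=dilation,
                  dilation=dilation, groups=channels, bias=False),
        nn.Conv2d(channels, channels, 1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )

class SimpleMultiScaleModule(nn.Module):
    # Plain multi-scale module of Eqs. (1)-(4): a shared DSConv3x3 (K_0),
    # several dilated DSConv3x3 branches (K_i), element-wise summation,
    # a fusing conv1x1, and a residual connection from the input.
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.k0 = dsconv3x3(channels)                          # Eq. (1)
        self.branches = nn.ModuleList(
            dsconv3x3(channels, d) for d in dilations)         # Eq. (2)
        self.fuse = nn.Conv2d(channels, channels, 1)           # K_fuse in Eq. (4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f0 = self.k0(x)
        f = f0 + sum(branch(f0) for branch in self.branches)   # Eq. (3)
        return self.fuse(f) + x                                 # Eq. (4)

For example, SimpleMultiScaleModule(32)(torch.randn(1, 32, 56, 56)) returns a tensor of the same shape, since every branch preserves the spatial resolution.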
B. Stereoscopically Attentive Multi-scale Module

A potential drawback of the multi-scale module in Section III-A lies in the element-wise summation. When contextual information from different branches is directly summed together, informative branches may be weakened or even overwhelmed by non-informative ones. On the other hand, layers at different network depths may prefer information from different scales, while the element-wise summation assigns equal importance to all scales. To alleviate this problem, we propose a novel stereoscopic attention mechanism that allows each channel at each spatial location to adaptively adjust the weight of each branch with a soft attention mechanism. Fig. 1 provides an illustration of the proposed SAM module with four branches.

a) Stereoscopic Attention Mechanism: Without loss of generality, we consider the input features as a 4-dimensional tensor F ∈ R^{(N+1)×C×H×W}, in which each branch i ∈ {0, · · · , N} generates features F_i ∈ R^{C×H×W} of a different scale and semantic level. It is widely believed that layers at different depths of a network may prefer information from different scales, and thus blindly feeding all features F may lead to severe overfitting. A natural solution is to attentively suppress non-informative branches and promote discriminative ones automatically, and we achieve this with the well-established attention mechanism.

In our setting, an ideal attention module should have the following properties. i) Due to the independence of each channel, the final attention should have a strong intra-channel dependency. More specifically, different feature channels are usually obtained by convolving independent filters with the input features. In this case, if a feature from a particular channel is informative for the final prediction, the features in the same channel of the same branch are likely to be informative as well. ii) The final attention should have a strong spatial-wise dependency. The reason is that, as a mid-level task, SOD requires a certain level of reasoning in the local neighbourhood of each pixel. iii) The computation should be efficient. In this respect, the naïve solution of independently learning a group (e.g., C × H × W) of branch-wise attention weights is suboptimal due to its heavy computational overhead.

The first two requirements regularize the attention mechanism in a global and a local manner, respectively. This naturally motivates us to factorize the final attention weights v into two individual weights as follows:

v = d ⊗ s,  (5)

where d and s denote the channel-wise attention and the spatial-wise attention, respectively. ⊗ indicates element-wise multiplication, and d and s are replicated to the same shape of (N+1)×C×H×W before multiplication. More specifically, the channel-wise attention d ∈ R^{(N+1)×C} shrinks the features spatially and attentively suppresses non-informative features and promotes discriminative ones in a channel-wise global manner. In the same way, the spatial-wise attention s ∈ R^{(N+1)×H×W} absorbs the features at a particular spatial location across different channels. Finally, d and s are broadcast to the same dimensionality of (N+1)×C×H×W to obtain the final attention weights via Eq. (5).
Fig. 2. Overall encoder-decoder architecture of the proposed SAMNet. S_i and R_i represent the output feature maps of the encoder and the decoder at the i-th stage, respectively. P_i is the predicted saliency map at the i-th stage, and P_1 is the final prediction of SAMNet. G is the ground-truth saliency map. PPM denotes the well-known Pyramid Pooling Module [59]. Best viewed in color.
b) Preprocessing by scale fusion: In order to reduce the computational overhead, we aggregate the multi-scale contextual information with element-wise summation, namely,

M = Σ_{i=0}^{N} F_i.  (6)

The fused features M are used to compute the gating statistics for each branch. Different from self-attention, the attention for each module branch is decided by the learning of all branches, instead of a single branch as in self-attention mechanisms. Therefore, the SAM module can extract the informative features from all branches within a "global" view. As a result, the SAM module has the capability to learn features at its preferred scales.

c) Channel-wise Attention Mechanism: The channel attention mechanism aims at calculating a channel-wise attention vector W_i^D ∈ R^C for each branch i, i = 0, 1, · · · , N. To explore the inter-channel relationships among different channels [57], we embed the global information using global average pooling (GAP) on the fused feature map M, i.e.,

z_c = F_GAP(M) = (1/(HW)) Σ_{i=1}^{H} Σ_{j=1}^{W} M_{c,i,j},  c = 0, 1, · · · , C−1,  (7)

where z ∈ R^C is the latent vector encoding the channel-wise information of M. Then, we apply a multi-layer perceptron (MLP) with two layers¹ to the latent vector and extract channel-wise information at different scales as follows:

d = F_MLP(z),  (8)

where d ∈ R^{(N+1)C} is further reshaped to R^{(N+1)×C}. The softmax function is applied to d along the branch-wise dimension to obtain the channel-wise attention, i.e.,

W_{i,c}^D = e^{d_{i,c}} / Σ_{j=0}^{N} e^{d_{j,c}},  i = 0, 1, · · · , N;  c = 1, 2, · · · , C.  (9)

With channel-wise attention incorporated, the feature aggregation in Eq. (3) is rewritten as

F^D = Σ_{i=0}^{N} W_i^D ⊗ F_i,  (10)

where W_i^D is replicated to the same shape as F_i (i.e., R^{C×H×W}) before element-wise multiplication. The fusion of Eq. (4) is then applied to F^D to produce the output.

¹We insert batch normalization and a ReLU activation function between the two linear transformations.

d) Spatial Attention Mechanism: The spatial attention mechanism aims at computing a spatial attention map W_i^S ∈ R^{H×W} to highlight or suppress the activation at specific locations. It is well acknowledged that large receptive fields can better capture contextual information, which is crucial for learning location-wise attention [58]. Based on this, we adopt dilated DSConv3×3 to enlarge the receptive fields while maintaining low computational complexity. Specifically, the fused features M are first projected into a low-dimensional space R^{C/4×H×W} by a conv1×1 to reduce parameters and computational cost. Then, two dilated DSConv3×3 are applied to the reduced features for efficient contextual information aggregation. Finally, the features are again reduced to R^{(N+1)×H×W} using a conv1×1. Mathematically, we have

s = F_4^{1×1}(F_3^{3×3}(F_2^{3×3}(F_1^{1×1}(M)))),  (11)

where F_i^{k×k} denotes the i-th (depthwise separable or vanilla) k×k convolution. Similar to the channel-wise attention mechanism, the softmax function is applied along the branch-wise dimension to obtain the spatial attention.
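The PyTorch sketch below assembles the pieces above — scale fusion (Eq. (6)), the GAP-plus-MLP channel branch, the convolutional spatial branch, the stereoscopic combination of Eq. (5), and the branch-wise softmax with weighted aggregation — into one attention block. It is a hedged reading of the description and of Fig. 1 rather than the authors' code: the bottleneck width, the dilation rate of the two DSConv3×3 layers, and the class and argument names are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StereoscopicAttention(nn.Module):
    # Sketch of the stereoscopic attention of Eqs. (5)-(10) and Fig. 1.
    # Given branch features F_i (i = 0..N), it computes channel-wise logits d
    # from the fused map M via GAP + a two-layer MLP, spatial logits s via a
    # bottleneck of 1x1 and dilated depthwise 3x3 convolutions, combines them
    # as v = d * s (broadcast), applies a softmax over the branch dimension,
    # and returns the attention-weighted sum of the branches.
    def __init__(self, channels: int, num_branches: int, dilation: int = 2):
        super().__init__()
        n = num_branches                      # N + 1 branches in total
        mid = max(channels // 4, 4)           # assumed MLP/bottleneck width
        # channel-wise branch: z -> d in R^{(N+1) x C}
        self.mlp = nn.Sequential(
            nn.Linear(channels, mid),
            nn.BatchNorm1d(mid),
            nn.ReLU(inplace=True),
            nn.Linear(mid, n * channels),
        )
        # spatial branch: M -> s in R^{(N+1) x H x W}
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation,
                      groups=mid, bias=False),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation,
                      groups=mid, bias=False),
            nn.Conv2d(mid, n, 1),
        )

    def forward(self, branch_feats):                       # list of (B, C, H, W) tensors
        feats = torch.stack(branch_feats, dim=1)           # (B, N+1, C, H, W)
        b, n, c, h, w = feats.shape
        m = feats.sum(dim=1)                                # Eq. (6): fused map M
        z = F.adaptive_avg_pool2d(m, 1).flatten(1)          # Eq. (7): GAP -> (B, C)
        d = self.mlp(z).view(b, n, c, 1, 1)                 # channel-wise logits
        s = self.spatial(m).view(b, n, 1, h, w)             # spatial logits
        v = d * s                                           # Eq. (5): stereoscopic logits
        weights = torch.softmax(v, dim=1)                   # branch-wise softmax, cf. Eq. (9)
        return (weights * feats).sum(dim=1)                 # cf. Eq. (10): weighted aggregation

A full SAM module would apply such a block to the branch features produced by the multi-scale module of Section III-A and then add the conv1×1 fusion and the residual connection of Eq. (4).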
TABLE II
COMPARISON WITH EXISTING METHODS IN TERMS OF THE NUMBER OF PARAMETERS (#PARAM), GPU MEMORY USAGE, F_β, AND MAE.
The optimization of all experiments is conducted using the Adam optimizer [69] with parameters β_1 = 0.9, β_2 = 0.999, a weight decay of 10^{−4}, and a batch size of 20. Our model is pretrained on the ImageNet dataset [70] as in [43]. We adopt a poly learning rate scheduler so that the learning rate for the n-th epoch is init_lr × (1 − n/#epochs)^{power}, where init_lr = 5 × 10^{−4} and power = 0.9. We train the proposed model for 50 epochs, i.e., #epochs = 50.

b) Datasets: We extensively evaluate the proposed method on six datasets, namely the DUTS [71], ECSSD [72], SOD [73], HKU-IS [47], THUR15K [74], and DUT-OMRON [45] datasets. These six datasets consist of 15572, 1000, 300, 4447, 6232, and 5168 natural images with corresponding pixel-level labels, respectively. Following recent studies [14], [21], [31], [35], [75], we train the proposed model on the DUTS training set and evaluate it on the DUTS testing set (DUTS-TE) and the other five datasets.

c) Evaluation Criteria: We evaluate the accuracy of SAMNet against previous state-of-the-art methods with regard to four widely-used metrics, i.e., the F_β-measure score (F_β), the mean absolute error (MAE), the weighted F_β^ω-measure (F_β^ω) [76], and the structure similarity measure (S_β) [77]. Given a threshold in the range of [0, 1), we can binarize the predicted saliency probability map and then calculate precision and recall values by comparing the binarized prediction map with the binary ground-truth map. With precision and recall calculated, the F_β-measure is the weighted harmonic mean of precision and recall, i.e.,

F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall),  (18)

where we set β² = 0.3 to emphasize the importance of precision, as in previous works [21], [23], [28], [30], [31]. Note that each threshold corresponds to one F_β score, and we report the maximum F_β score across all thresholds. MAE measures the difference between the predicted saliency map P and the ground-truth saliency map G, which can be computed as

MAE(P, G) = (1/(HW)) Σ_{i=1}^{H} Σ_{j=1}^{W} |P_{ij} − G_{ij}|,  (19)

where H and W denote the height and width of the saliency map, respectively. The weighted F_β^ω-measure [76] is designed to amend the interpolation flaw, the dependency flaw, and the equal-importance flaw of traditional evaluation metrics, and we use it with default settings to evaluate SAMNet and other competitors. As for S_β [77], it is also a widely-used saliency evaluation metric, as it measures the structural similarity between predictions and ground truths. S_β consists of region-aware and object-aware structural similarity measures, in which the former is achieved using the well-known SSIM [78] and the latter is based on probability theory. We adopt the official code with default settings in our experiments.

d) Efficiency Measures: This paper targets a lightweight yet powerful solution for SOD, so we also evaluate the efficiency and flexibility of various methods, including the number of model parameters (#Param), GPU memory usage, the number of floating-point operations (FLOPs), and the inference speed (FPS). GPU memory usage measures the amount of memory required for a model to test an image. The number of FLOPs measures the computational cost of a model, and a smaller number of FLOPs leads to lower energy consumption. Following lightweight backbone networks [38]–[41] and efficient semantic segmentation methods [65]–[67], the speed denotes the number of images for which a model can perform network inference per second, tested on a single NVIDIA TITAN Xp GPU. Since deep supervision is only used in training, we omit the calculation of {P_i, i = 2, 3, 4, 5} in the speed testing of SAMNet. For saliency detectors, memory usage, FLOPs, and speed are tested using a 336 × 336 input image unless a method specifies its own input dimensions. Since efficient semantic segmentation methods [65]–[67] are usually designed for high-resolution images, we use a 672 × 672 input to ensure accuracy; otherwise, we would get very low accuracy. For reformed baselines based on lightweight backbones [38]–[41], we use a 336 × 336 input. Here, #Param and GPU memory usage are measured in millions (M), and the number of FLOPs is measured in giga (G).
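As a concrete reference for the two simplest metrics above, the NumPy sketch below computes the MAE of Eq. (19) and the maximum F_β of Eq. (18) over a sweep of binarization thresholds with β² = 0.3. It is only an illustration; the threshold grid, the small epsilon that avoids division by zero, and the function names are our own choices, and the weighted F_β^ω and S_β scores should be computed with the official toolboxes cited above.

import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    # Eq. (19): mean absolute error between a predicted saliency map and
    # its binary ground truth, both scaled to [0, 1].
    return float(np.mean(np.abs(pred - gt)))

def max_f_measure(pred: np.ndarray, gt: np.ndarray,
                  beta2: float = 0.3, steps: int = 255) -> float:
    # Maximum F-beta score (Eq. (18)) over evenly spaced binarization
    # thresholds in [0, 1), with beta^2 = 0.3 as in the paper.
    gt_bin = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, steps, endpoint=False):
        pred_bin = pred >= t
        tp = np.logical_and(pred_bin, gt_bin).sum()
        precision = tp / (pred_bin.sum() + 1e-8)
        recall = tp / (gt_bin.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, float(f))
    return best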
TABLE III
COMPARISON WITH EXISTING METHODS IN TERMS OF THE NUMBER OF FLOPS, SPEED, F_β^ω, AND S_β.
Fig. 3. Illustration of the trade-off between performance and computational cost. The three sub-figures plot the F-measure (F_β), averaged over six datasets, against #Param (M), FLOPs (G), and FPS, respectively. Note that the horizontal axes are logarithmic.
B. Performance Analysis

In this part, we compare the proposed SAMNet with 20 state-of-the-art SOD methods, including DRFI [8], DCL [62], DHSNet [24], RFCN [10], NLDF [12], DSS [30], Amulet [23], UCF [11], SRM [35], PiCANet [21], BRN [14], C2S [16], RAS [34], DNA [31], CPD [63], BASNet [18], AFNet [27], PoolNet [28], EGNet [37], and BANet [64]. Besides existing SOD methods, we also compare with several state-of-the-art lightweight backbone networks that are widely used for image classification, including MobileNet [38], MobileNetV2 [39], ShuffleNet [40], and ShuffleNetV2 [41]. To adapt them to the SOD task, we attach the same decoder as in the proposed SAMNet to them, and the resulting baselines are trained with the same settings for a fair comparison. In addition, we compare with some efficient semantic segmentation methods, including ICNet [65], BiSeNet [66], and DFANet [67]. We reform them for SOD by replacing their final softmax activation function with the standard sigmoid activation function.
Fig. 4. Qualitative comparison of saliency maps (columns: Image, RFCN, DSS, Amulet, UCF, SRM, PiCANet, BRN, C2S, RAS, CPD, EGNet, Ours, GT).
For BiSeNet [66], we report its results with the ResNet-18 [43] (i.e., BiSeNet R18) and Xception-39 [79] (i.e., BiSeNet X39) backbones, respectively.

a) Comparison with Existing SOD Methods: Table II shows the evaluation results of the proposed SAMNet compared with previous state-of-the-art alternatives in terms of the number of parameters, GPU memory usage, F_β, and MAE. Table III shows the evaluation results in terms of the number of FLOPs, speed, F_β^ω, and S_β. Note that the number of FLOPs is highly related to the energy consumption of network inference. The results clearly show that SAMNet achieves on-par accuracy with state-of-the-art SOD solutions, especially in terms of F_β, MAE, and S_β, while requiring one or two orders of magnitude fewer computational resources. For example, compared with the best-performing BANet [64], SAMNet shows a slightly lower average F_β (0.848 vs. 0.868) over six datasets, but it has 42× fewer parameters, 5.5× less GPU memory consumption, 243× fewer FLOPs, and 27× faster running speed than BANet [64]. This has a significant impact on mobile devices, where the limited computational resources, reduced energy supply, and restrictive running memory and storage space cannot afford the heavy overhead of traditional cumbersome SOD methods.

We also illustrate the comparison in Fig. 3, where the trade-offs between the accuracy and efficiency of various methods are shown more clearly. In the sub-figures of F_β vs. #Param and F_β vs. FLOPs, SAMNet lies at the top-left corner; in the sub-figure of F_β vs. FPS, SAMNet lies at the top-right corner. This implies that SAMNet achieves comparable accuracy to previous state-of-the-art methods with far fewer parameters and FLOPs and much faster speed. Therefore, we can conclude that SAMNet achieves a good trade-off among accuracy, the number of parameters, GPU memory usage, the number of FLOPs, and speed.

b) Comparison with Lightweight Backbones: Table II and Table III also present the comparison of SAMNet to lightweight-backbone-based baselines, i.e., MobileNet [38], MobileNetV2 [39], ShuffleNet [40], and ShuffleNetV2 [41]. Although these baselines have similar or even faster speeds than SAMNet, SAMNet achieves substantially better accuracy in terms of all metrics. This suggests that it is suboptimal to apply existing lightweight backbones to SOD directly. It also demonstrates the importance of carefully designing network architectures for the lightweight SOD task and the advantage of the proposed multi-scale attention mechanism.

c) Comparison with Efficient Semantic Segmentation Methods: From Table II and Table III, we can see that SAMNet outperforms efficient semantic segmentation methods [65]–[67] by a large margin, with fewer parameters, fewer FLOPs, and faster speed. It is interesting to find that efficient semantic segmentation methods perform worse than the baselines based on lightweight backbones. This implies that efficient semantic segmentation methods are heavily tuned for semantic segmentation and are thus unsuitable for direct application to SOD. Hence, lightweight SOD is an essential problem and deserves more attention from the community.

d) Qualitative Comparison: In Fig. 4, we provide some visualization examples to exhibit the superiority of SAMNet. Although SAMNet performs slightly worse than traditional cumbersome SOD methods, it can segment salient objects with coherent boundaries in many challenging circumstances, such as complicated scenarios (1st and 3rd rows), low contrast between foreground and background (2nd and 4th rows), large objects (5th and 6th rows), scenarios with abnormal brightness (7th and 8th rows), and confusing natural scenarios (9th row). Combined with the lightweight and efficient nature of SAMNet, this gives it the potential to promote real-world SOD applications.
TABLE IV
ABLATION STUDY FOR THE DESIGN CHOICES OF SAMNET.
TABLE V
ABLATION STUDY FOR THE CONFIGURATIONS OF SAMNET.
C. Ablation Study

In this section, we conduct an ablation study to demonstrate the effectiveness of the proposed module components and the parameter configurations of SAMNet. The experimental settings follow those in Section IV-B.

a) Proposed Module Components: Table IV shows the ablation study results for the proposed module components. The proposed SAM module is designed by carefully combining these basic components into a nontrivial module for effective and efficient multi-scale learning. Table IV suggests that the performance is gradually boosted by adding each component into the framework. Besides, the comparison between No. 0 and No. 5 demonstrates the superiority of the proposed solution over the baseline, where the performance gap comes solely from our contributions because both models are trained from scratch without ImageNet [70] pretraining.

b) Configurations of SAMNet: Table V shows the ablation study results for different network configurations. It is interesting to find that the proposed SAMNet is robust to slight changes in its configuration. Introducing more parameters, such as increasing the number of convolution filters, leads to better performance, but this is orthogonal to our goal of lightweight SOD. The default setting of SAMNet is chosen by considering the trade-off between effectiveness and the lightweight restriction.

V. CONCLUSION

Instead of only considering accuracy, this paper focuses on lightweight SOD that trades off among accuracy, efficiency, parameter count, and FLOPs. We propose a novel SAM module, which enables small networks to effectively encode both high-level features and low-level details. Incorporating the SAM module, the proposed SAMNet achieves comparable performance with state-of-the-art SOD methods that use significantly more parameters, while saving several orders of magnitude of overhead. Such an excellent trade-off between performance and efficiency makes it possible for SAMNet to provide high-accuracy SOD in resource-limited environments, e.g., on mobile devices. SAMNet also clearly outperforms other well-known lightweight networks for image classification [38]–[41] and semantic segmentation [65]–[67], suggesting that lightweight SOD is worth studying and should be established as a separate research direction. To the best of our knowledge, SAMNet is the first lightweight SOD method and is expected to pave a new path for SOD. Through this work, we hope to stimulate research on lightweight SOD that will promote more practical SOD applications. In the future, we plan to apply weight quantization and network compression techniques to boost SAMNet's CPU speed of 5fps towards real-time performance.
ACKNOWLEDGMENT

This research was supported by the Major Project for New Generation of AI under Grant No. 2018AAA0100400, NSFC (61922046), the S&T innovation project from the Chinese Ministry of Education, and the Tianjin Natural Science Foundation (17JCJQJC43700).

REFERENCES

[1] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu, "Global contrast based salient region detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, 2015.
[2] Y. Gao, M. Wang, Z.-J. Zha, J. Shen, X. Li, and X. Wu, "Visual-textual joint relevance learning for tag-based social image search," IEEE Trans. Image Process., vol. 22, no. 1, pp. 363–376, 2012.
[3] M. Donoser, M. Urschler, M. Hirzer, and H. Bischof, "Saliency driven total variation segmentation," in Int. Conf. Comput. Vis., 2009, pp. 817–824.
[4] U. Rutishauser, D. Walther, C. Koch, and P. Perona, "Is bottom-up attention useful for object recognition?" in IEEE Conf. Comput. Vis. Pattern Recog., 2004, pp. 37–44.
[5] V. Mahadevan and N. Vasconcelos, "Saliency-based discriminant tracking," in IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 1007–1013.
[6] Z. Ren, S. Gao, L.-T. Chia, and I. W.-H. Tsang, "Region-based saliency detection and its application in object recognition," IEEE Trans. Circ. Syst. Video Technol., vol. 24, no. 5, pp. 769–779, 2013.
[7] M.-M. Cheng, F.-L. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, "RepFinder: Finding approximately repeated scene elements for image editing," ACM Trans. Graph., vol. 29, no. 4, pp. 83:1–83:8, 2010.
[8] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, "Salient object detection: A discriminative regional feature integration approach," in IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 2083–2090.
[9] D. A. Klein and S. Frintrop, "Center-surround divergence of feature statistics for salient object detection," in Int. Conf. Comput. Vis., 2011, pp. 2214–2219.
[10] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan, "Saliency detection with recurrent fully convolutional networks," in Eur. Conf. Comput. Vis., 2016, pp. 825–841.
[11] P. Zhang, D. Wang, H. Lu, H. Wang, and B. Yin, "Learning uncertain convolutional features for accurate saliency detection," in Int. Conf. Comput. Vis., 2017, pp. 212–221.
[12] Z. Luo, A. K. Mishra, A. Achkar, J. A. Eichel, S. Li, and P.-M. Jodoin, "Non-local deep features for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 6609–6617.
[13] P. Hu, B. Shuai, J. Liu, and G. Wang, "Deep level sets for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 2300–2309.
[14] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, and A. Borji, "Detect globally, refine locally: A novel approach to saliency detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 3127–3135.
[15] L. Zhang, J. Dai, H. Lu, Y. He, and G. Wang, "A bi-directional message passing model for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 1741–1750.
[16] X. Li, F. Yang, H. Cheng, W. Liu, and D. Shen, "Contour knowledge transfer for salient object detection," in Eur. Conf. Comput. Vis., 2018, pp. 355–370.
[17] N. D. Bruce, C. Catton, and S. Janjic, "A deeper look at saliency: Feature contrast, semantics, and beyond," in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 516–524.
[18] X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand, "BASNet: Boundary-aware salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 7479–7489.
[19] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang, "Progressive attention guided recurrent network for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 714–722.
[20] W. Wang, J. Shen, X. Dong, and A. Borji, "Salient object detection driven by fixation prediction," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 1711–1720.
[21] N. Liu, J. Han, and M.-H. Yang, "PiCANet: Learning pixel-wise contextual attention for saliency detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 3089–3098.
[22] M. A. Islam, M. Kalash, and N. D. Bruce, "Revisiting salient object detection: Simultaneous detection, ranking, and subitizing of multiple salient objects," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 7142–7150.
[23] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan, "Amulet: Aggregating multi-level convolutional features for salient object detection," in Int. Conf. Comput. Vis., 2017, pp. 202–211.
[24] N. Liu and J. Han, "DHSNet: Deep hierarchical saliency network for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 678–686.
[25] S. He, J. Jiao, X. Zhang, G. Han, and R. W. Lau, "Delving into salient object subitizing and detection," in Int. Conf. Comput. Vis., 2017, pp. 1059–1067.
[26] W. Wang, S. Zhao, J. Shen, S. C. Hoi, and A. Borji, "Salient object detection with pyramid attention and salient edges," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 1448–1457.
[27] M. Feng, H. Lu, and E. Ding, "Attentive feedback network for boundary-aware salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 1623–1632.
[28] J.-J. Liu, Q. Hou, M.-M. Cheng, J. Feng, and J. Jiang, "A simple pooling-based design for real-time salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 3917–3926.
[29] R. Wu, M. Feng, W. Guan, D. Wang, H. Lu, and E. Ding, "A mutual learning method for salient object detection with intertwined multi-supervision," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 8150–8159.
[30] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. Torr, "Deeply supervised salient object detection with short connections," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 4, pp. 815–828, 2019.
[31] Y. Liu, M.-M. Cheng, X.-Y. Zhang, G.-Y. Nie, and M. Wang, "DNA: Deeply-supervised nonlinear aggregation for salient object detection," IEEE Trans. Cybernetics, 2021.
[32] Y. Liu, Y.-C. Gu, X.-Y. Zhang, W. Wang, and M.-M. Cheng, "Lightweight salient object detection via hierarchical visual perception learning," IEEE Trans. Cybernetics, 2020.
[33] Y. Qiu, Y. Liu, H. Yang, and J. Xu, "A simple saliency detection approach via automatic top-down feature fusion," Neurocomputing, vol. 388, pp. 124–134, 2020.
[34] S. Chen, X. Tan, B. Wang, and X. Hu, "Reverse attention for salient object detection," in Eur. Conf. Comput. Vis., 2018, pp. 234–250.
[35] T. Wang, A. Borji, L. Zhang, P. Zhang, and H. Lu, "A stagewise refinement model for detecting salient objects in images," in Int. Conf. Comput. Vis., 2017, pp. 4019–4028.
[36] Y. Qiu, Y. Liu, X. Ma, L. Liu, H. Gao, and J. Xu, "Revisiting multi-level feature fusion: A simple yet effective network for salient object detection," in IEEE Int. Conf. Image Process., 2019, pp. 4010–4014.
[37] J.-X. Zhao, J. Liu, D.-P. Fan, Y. Cao, J. Yang, and M.-M. Cheng, "EGNet: Edge guidance network for salient object detection," in Int. Conf. Comput. Vis., 2019, pp. 8779–8788.
[38] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint, 2017.
[39] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 4510–4520.
[40] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 6848–6856.
[41] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, "ShuffleNet v2: Practical guidelines for efficient CNN architecture design," in Eur. Conf. Comput. Vis., 2018, pp. 116–131.
[42] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Int. Conf. Learn. Represent., 2015.
[43] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 770–778.
[44] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, 2018.
[45] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 3166–3173.
[46] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang, "Saliency detection via dense and sparse reconstruction," in Int. Conf. Comput. Vis., 2013, pp. 2976–2983.
[47] G. Li and Y. Yu, "Visual saliency based on multiscale deep features," in IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 5455–5463.
[48] R. Zhao, W. Ouyang, H. Li, and X. Wang, "Saliency detection by multi-context deep learning," in IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 1265–1274.
[49] L. Wang, H. Lu, X. Ruan, and M.-H. Yang, "Deep networks for saliency detection via local estimation and global search," in IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 3183–3192.
[50] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 3431–3440.
[51] R. A. Rensink, "The dynamic representation of scenes," Visual Cognition, vol. 7, no. 1-3, pp. 17–42, 2000.
[52] M. Corbetta and G. L. Shulman, "Control of goal-directed and stimulus-driven attention in the brain," Nature Reviews Neuroscience, vol. 3, no. 3, pp. 201–215, 2002.
[53] A. Miech, I. Laptev, and J. Sivic, "Learnable pooling with context gating for video classification," arXiv preprint, 2017.
[54] D. Chen, S. Zhang, W. Ouyang, J. Yang, and Y. Tai, "Person search via a mask-guided two-stream CNN model," in Eur. Conf. Comput. Vis., 2018, pp. 734–750.
[55] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, "Image super-resolution using very deep residual channel attention networks," in Eur. Conf. Comput. Vis., 2018, pp. 286–301.
[56] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, "Residual attention network for image classification," in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 3156–3164.
[57] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 7132–7141.
[58] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, "CBAM: Convolutional block attention module," in Eur. Conf. Comput. Vis., 2018, pp. 3–19.
[59] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 2881–2890.
[60] X. Chen, A. Zheng, J. Li, and F. Lu, "Look, perceive and segment: Finding the salient objects in images via two-stream fixation-semantic CNNs," in Int. Conf. Comput. Vis., 2017, pp. 1050–1058.
[61] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, "Deeply-supervised nets," in Artif. Intell. Stat., 2015, pp. 562–570.
[62] G. Li and Y. Yu, "Deep contrast learning for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 478–487.
[63] Z. Wu, L. Su, and Q. Huang, "Cascaded partial decoder for fast and accurate salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 3907–3916.
[64] J. Su, J. Li, Y. Zhang, C. Xia, and Y. Tian, "Selectivity or invariance: Boundary-aware salient object detection," in Int. Conf. Comput. Vis., 2019, pp. 3799–3808.
[65] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, "ICNet for real-time semantic segmentation on high-resolution images," in Eur. Conf. Comput. Vis., 2018, pp. 405–420.
[66] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, "BiSeNet: Bilateral segmentation network for real-time semantic segmentation," in Eur. Conf. Comput. Vis., 2018, pp. 325–341.
[67] H. Li, P. Xiong, H. Fan, and J. Sun, "DFANet: Deep feature aggregation for real-time semantic segmentation," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 9522–9531.
[68] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," in Adv. Neural Inform. Process. Syst., 2019, pp. 8026–8037.
[69] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Int. Conf. Learn. Represent., 2015.
[70] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 248–255.
[71] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan, "Learning to detect salient objects with image-level supervision," in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 136–145.
[72] Q. Yan, L. Xu, J. Shi, and J. Jia, "Hierarchical saliency detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 1155–1162.
[73] V. Movahedi and J. H. Elder, "Design and perceptual validation of performance measures for salient object segmentation," in IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2010, pp. 49–56.
[74] M.-M. Cheng, N. J. Mitra, X. Huang, and S.-M. Hu, "SalientShape: Group saliency in image collections," The Vis. Comput., vol. 30, no. 4, pp. 443–453, 2014.
[75] Y. Zeng, H. Lu, L. Zhang, M. Feng, and A. Borji, "Learning to promote saliency detectors," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 1644–1653.
[76] R. Margolin, L. Zelnik-Manor, and A. Tal, "How to evaluate foreground maps?" in IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 248–255.
[77] D.-P. Fan, M.-M. Cheng, Y. Liu, T. Li, and A. Borji, "Structure-measure: A new way to evaluate foreground maps," in Int. Conf. Comput. Vis., 2017, pp. 4548–4557.
[78] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004.
[79] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 1251–1258.

Yun Liu is a PhD candidate at the College of Computer Science, Nankai University. He received his bachelor degree from Nankai University in 2016. His research interests include computer vision and machine learning.

Xin-Yu Zhang is an undergraduate student at the School of Mathematical Sciences, Nankai University. His research interests include computer vision and deep learning.

Jia-Wang Bian is a PhD student at the University of Adelaide and an Associated PhD researcher with the Australian Centre for Robotic Vision (ACRV). He is advised by Prof. Ian Reid and Prof. Chunhua Shen. His research interests lie in the field of computer vision and robotics. Jiawang received his B.Eng. degree from Nankai University, where he was advised by Prof. Ming-Ming Cheng. He was a research assistant at the Singapore University of Technology and Design (SUTD). Jiawang also did a trainee engineer job at the Advanced Digital Sciences Center (ADSC), Huawei Technologies Co., Ltd, and Tusimple.

Le Zhang received the B.Eng degree from the University of Electronic Science and Technology of China in 2011. He received his M.Sc and PhD degrees from Nanyang Technological University (NTU) in 2012 and 2016, respectively. Currently, he is a scientist at the Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore. He has served as a TPC member for several conferences such as AAAI and IJCAI, and as a Guest Editor for Pattern Recognition and Neurocomputing. His current research interests include deep learning and computer vision.

Ming-Ming Cheng received his PhD degree from Tsinghua University in 2012. He then spent two years as a research fellow with Prof. Philip Torr at Oxford. He is now a professor at Nankai University, leading the Media Computing Lab. His research interests include computer graphics, computer vision, and image processing. He has received research awards including the ACM China Rising Star Award, the IBM Global SUR Award, and the CCF-Intel Young Faculty Researcher Program. He is on the editorial board of IEEE TIP.