TIP21-SAMNet
Abstract—Recent progress on salient object detection (SOD) mostly benefits from the explosive development of Convolutional Neural Networks (CNNs). However, much of the improvement comes with larger network size and heavier computation overhead, which, in our view, is not mobile-friendly and thus difficult to deploy in practice. To promote more practical SOD systems, we introduce a novel Stereoscopically Attentive Multi-scale (SAM) module, which adopts a stereoscopic attention mechanism to adaptively fuse features of various scales. Building on this module, we propose an extremely lightweight network, namely SAMNet, for SOD. Extensive experiments on popular benchmarks demonstrate that the proposed SAMNet yields comparable accuracy with state-of-the-art methods while running at a GPU speed of 343fps and a CPU speed of 5fps for 336 × 336 inputs with only 1.33M parameters. Therefore, SAMNet paves a new path towards practical SOD. The source code is available on the project page https://mmcheng.net/SAMNet/.

Index Terms—Lightweight salient object detection, lightweight saliency detection, multi-scale learning.

Y. Liu and M.M. Cheng are with the College of Computer Science, Nankai University. X.Y. Zhang is with the School of Mathematical Sciences, Nankai University. J.W. Bian is with the School of Computer Science, The University of Adelaide, Australia. L. Zhang is with the Agency for Science, Technology and Research (A*STAR), Singapore. The first two authors contributed equally to this paper. M.M. Cheng is the corresponding author ([email protected]).

I. INTRODUCTION

Salient object detection (SOD), also known as saliency detection, aims at detecting the most visually distinctive objects or regions in natural images [1]. Progress in SOD has benefited a wide range of computer vision applications, including image retrieval [2], image segmentation [3], object detection [4], visual tracking [5], scene classification [6], content-aware image editing [7], etc. Conventional methods for this task mainly rely on hand-crafted low-level features and heuristic priors [1], [8], [9], but the lack of high-level semantic information usually leads to limited accuracy. Recently, thanks to the unprecedented success of Convolutional Neural Networks (CNNs), especially Fully Convolutional Networks (FCNs), deep learning based methods have refreshed the state-of-the-art performance of SOD [10]–[33].

However, those improvements do not come without cost: they usually rely on large network sizes and substantial computational overhead [14], [16], [21], [30], [31], [33]–[36]. For example, EGNet [37] with the VGG16 backbone has 108M parameters and needs ∼432 MB of disk space to store its pretrained model. Moreover, EGNet [37] can only run at 0.09fps on a powerful i7-8700K CPU and 12.7fps on an NVIDIA TITAN XP GPU for 336 × 336 images. Such ponderousness undoubtedly makes it less practical for real-time and resource-constrained applications such as autonomous driving, robots, augmented reality, and so on. In those scenarios, mobile devices have reduced computational capabilities, restrictive memory constraints, and limited energy budgets.

Designing lightweight CNNs is a natural solution to this problem and has been well studied for other tasks such as image classification [38]–[41]. Although we draw inspiration from these works, we are the first to make such an effort for SOD. This is nontrivial due to the following challenges: i) SOD requires both high-level abstract semantic features and low-level fine-grained features to locate salient objects and refine object details, respectively; ii) SOD needs multi-scale information to process salient objects with various sizes and aspect ratios in natural scenes. Since lightweight networks usually have shallow depths and simplified operations, they are less potent in multi-level and multi-scale learning than traditional large networks [42], [43]. Therefore, naively applying existing lightweight backbone networks such as MobileNets [38], [39] and ShuffleNets [40], [41] to SOD leads to suboptimal performance, as will be demonstrated in our experiments.

It is well known that CNNs learn high-level semantic information at their top sides and low-level fine details at their bottom sides, so different side-outputs of CNNs contain multi-scale information. Hence, to learn multi-level and multi-scale information, current state-of-the-art SOD methods (with large networks) adopt encoder-decoder network architectures [10]–[29] to integrate the multi-level side-output features of backbone networks. Recent development of SOD mainly comes from new strategies and modules for the effective fusion of multi-level backbone features.

Based on the above analyses, the key to lightweight SOD is how to effectively learn multi-level and multi-scale information within a limited parameter budget. Instead of integrating different side-outputs of backbone networks [14], [19], [21], [23], [28], [30], [37] or summarizing the convolutional features of different dilation rates [15], [44] as done in previous studies, we propose a novel Stereoscopically Attentive Multi-scale (SAM) module for multi-scale learning. It adopts a stereoscopic attention mechanism to automatically control the learning at different scales, so it can effectively learn the necessary information at various levels of deep CNNs. Using the SAM module as the basic unit, we build a lightweight encoder-decoder network, namely SAMNet, to integrate the multi-level and multi-scale features learned by SAM modules.
SAMNet achieves comparable accuracy with state-of-the-art SOD methods while running at 5fps on an i7-8700K CPU and 343fps on an NVIDIA TITAN XP GPU for 336 × 336 images. Moreover, SAMNet has only 1.33M parameters. These lightweight properties make it well suited to practical mobile applications.

In summary, our contributions are threefold:
• We propose a novel Stereoscopically Attentive Multi-scale (SAM) module that adopts a stereoscopic attention mechanism for effective and efficient multi-scale learning.
• Using the SAM module as the basic unit, we propose SAMNet, a lightweight encoder-decoder architecture for SOD, which is, to the best of our knowledge, the first lightweight SOD model.
• We empirically evaluate SAMNet on six popular SOD datasets and demonstrate its comparable accuracy, much higher efficiency, and much smaller network size.
II. RELATED WORK

a) Salient Object Detection: Over the past two decades, numerous methods have been proposed to detect salient objects in an image. Traditional methods are mainly based on hand-crafted features, such as image contrast [1], texture [45], central prior [8], background prior [46], etc. Despite the efficiency of these approaches, hand-crafted features intrinsically lack the capacity for high-level representation, leading to limited performance.

With the rapid development of deep learning, CNN-based SOD methods have surpassed traditional counterparts by a considerable margin. Early CNN-based methods [24], [47]–[49] include several fully-connected layers, resulting in the loss of essential spatial information of the whole image. Since the seminal work [50] proposed FCNs to predict semantic labels at the pixel level, FCN-based SOD approaches [11]–[16], [19]–[23], [25] have dominated this field by exploring multi-level and multi-scale deep features, as described above.

Although previous approaches have achieved high accuracy by employing powerful CNNs, they are relatively slow, hungry for energy, and occupy large memory space. These shortcomings make it difficult to deploy state-of-the-art methods in real-world applications. This is our motivation for this work, i.e., towards lightweight SOD that trades off accuracy, model size, and speed. Our technical motivation comes from the fact that most previous CNN-based methods [10]–[29] improve performance through the exploration of multi-scale and multi-level deep learning. In this paper, we propose a novel SAM module that adopts stereoscopic attention for lightweight multi-scale learning.

b) Attention Mechanism: The attention mechanism plays an essential role in human perception [51], [52]. Instead of processing the whole image at once, the human visual system adaptively filters less essential information, such as image background, to enhance the capture of visual structure. This has inspired research in deep learning.

The computer vision community has recently witnessed numerous successes of attention mechanisms in a wide range of tasks, including SOD [19], [26], [34], sequence learning [53], person re-ID [54], and image recovery [55]. For image classification, Wang et al. [56] proposed to use an hourglass module to generate attention maps for hidden features. Furthermore, Hu et al. [57] proposed a "Squeeze-and-Excitation" module to explicitly exploit inter-channel relationships and adaptively recalibrate feature maps in a channel-wise manner. Beyond channel-wise attention, CBAM [58] introduces spatial attention in a similar way. These methods fall into the self-attention category: spatial and channel-wise self-attention can adaptively emphasize the most informative feature patches and channels, respectively. Different from these methods, we introduce a stereoscopic attention mechanism to adaptively recalibrate the information flow from multiple branches based on both channel inter-dependencies and spatial contextual clues. Hence the proposed SAM module can effectively learn multi-scale information under a lightweight setting.

III. METHODOLOGY

In this section, we elaborate on the proposed framework for SOD. In Section III-A, we present a simple multi-scale module. In Section III-B, we propose the SAM module for effective multi-scale learning. Finally, in Section III-C, we incorporate the proposed SAM module into the encoder-decoder network and elaborate on the full network architecture.

A. Multi-scale Learning

Based on the above analyses, the multi-scale feature representations of CNNs are of great importance for SOD [14], [19], [23], [28], [30], [37]. Inspired by this, we first propose a simple multi-scale module to process visual information at different scales. Being lightweight is central to our design. We adopt dilated convolutions with different dilation rates to capture multi-scale information and use the depthwise separable convolution to reduce floating-point operations and model parameters. We call the resulting operator the dilated depthwise separable convolution and use it as the basic convolutional operator to scale up the network dimensionality in terms of depth and width.

Formally, let I ∈ R^{C×H×W} be the input feature map whose number of channels, height, and width are C, H, and W, respectively. With the input I, we first apply a single depthwise separable conv3×3 (DSConv3×3 for short) to extract common information F_0 for each branch, namely,

F_0 = K_0(I),  (1)

where K_0 denotes a DSConv3×3 operation. At different branches, dilated DSConv3×3 with different dilation rates are applied to F_0, i.e.,

F_i = K_i(F_0),  i = 1, 2, · · · , N,  (2)

where K_i denotes the dilated DSConv3×3 operation at branch i, and N is the number of branches. Then, contextual information at different scales is aggregated by a single element-wise summation with a residual connection, namely,

F = Σ_{i=0}^{N} F_i.  (3)

Here, we use element-wise summation instead of concatenation, because concatenation would greatly increase the number of channels, leading to heavier computational complexity and more network parameters.
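To make the basic operator concrete, the following Python (PyTorch) snippet gives a minimal sketch of a dilated depthwise separable conv3×3: a depthwise 3×3 convolution with a chosen dilation rate followed by a pointwise 1×1 convolution. It is only an illustration of the operator described above, not the authors' released implementation; the BatchNorm/ReLU placement and the class name DSConv3x3 are our own assumptions.

import torch
import torch.nn as nn

class DSConv3x3(nn.Module):
    # Minimal dilated depthwise separable 3x3 convolution:
    # a depthwise 3x3 (optionally dilated) followed by a pointwise 1x1.
    def __init__(self, channels: int, dilation: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(
            channels, channels, kernel_size=3,
            padding=dilation, dilation=dilation,
            groups=channels, bias=False)      # one 3x3 filter per channel
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

With the padding set equal to the dilation rate, the spatial resolution is preserved while the receptive field grows with the dilation.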
Fig. 1. Illustration of the proposed SAM module. Given the input feature map I, SAM first extracts multi-scale features F_i (i = 0, 1, · · · , N) through multiple branches. The multi-scale features F_i are aggregated via element-wise summation to produce M. Then, we calculate channel-wise attention logits d and spatial attention logits s based on M. d and s are multiplied to get stereoscopic attention logits v, which are transformed into stereoscopic attention weights W_i^V (i = 0, 1, · · · , N) using a softmax function. W_i^V (i = 0, 1, · · · , N) are used as weights to aggregate F_i to obtain F. At last, a residual connection [43] from I is added to get the output O. In the figure, ⊗ indicates element-wise multiplication, and the two multiplied feature maps are replicated to the same shape before multiplication. The symbol r denotes the dilation rate of the dilated DSConv3×3. Best viewed in color.
Finally, the aggregated features are further rearranged by a vanilla conv1×1, i.e.,

O = K_fuse(F) + I,  (4)

where K_fuse denotes the conv1×1 operation that fuses contextual information at various scales. The summation with I denotes a residual connection [43], which has been demonstrated to be effective for CNN training.

The dilation rates and the number of branches are hyper-parameters of our multi-scale module. Empirically, larger dilation rates and more branches are desirable when the input features I are of high resolution, as large feature maps usually possess contextual information at various scales.
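Putting Eqs. (1)-(4) together, a compact PyTorch sketch of this simple multi-scale module could look as follows. It re-implements the dilated DSConv3×3 from the previous sketch as a small helper so that the block stays self-contained; the branch count and the dilation rates (1, 2, 4) are illustrative hyper-parameters, not the paper's exact configuration.

import torch
import torch.nn as nn

def dsconv3x3(channels: int, dilation: int = 1) -> nn.Sequential:
    # Dilated depthwise separable 3x3 convolution (same operator as sketched above).
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=dilation,
                  dilation=dilation, groups=channels, bias=False),
        nn.Conv2d(channels, channels, 1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )

class SimpleMultiScaleModule(nn.Module):
    # Plain multi-scale module of Eqs. (1)-(4): a shared DSConv3x3 (K_0),
    # several dilated DSConv3x3 branches (K_i), element-wise summation,
    # a fusing conv1x1, and a residual connection from the input.
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.k0 = dsconv3x3(channels)                          # Eq. (1)
        self.branches = nn.ModuleList(
            dsconv3x3(channels, d) for d in dilations)         # Eq. (2)
        self.fuse = nn.Conv2d(channels, channels, 1)           # K_fuse in Eq. (4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f0 = self.k0(x)
        f = f0 + sum(branch(f0) for branch in self.branches)   # Eq. (3)
        return self.fuse(f) + x                                 # Eq. (4)

For example, SimpleMultiScaleModule(32)(torch.randn(1, 32, 56, 56)) returns a tensor of the same shape, since every branch preserves the spatial resolution.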
B. Stereoscopically Attentive Multi-scale Module

A potential drawback of the multi-scale module in Section III-A lies in the element-wise summation. When contextual information from different branches is directly summed together, informative branches may be weakened or even overwhelmed by non-informative ones. On the other hand, layers at different network depths may prefer information from different scales, while the element-wise summation assigns equal importance to all scales. To alleviate this problem, we propose a novel stereoscopic attention mechanism that allows each channel at each spatial location to adaptively adjust the weight of each branch with a soft attention mechanism. Fig. 1 provides an illustration of the proposed SAM module with four branches.

a) Stereoscopic Attention Mechanism: Without loss of generality, we consider the input features as a 4-dimensional tensor F ∈ R^{(N+1)×C×H×W}, in which each branch i ∈ {0, · · · , N} generates features F_i ∈ R^{C×H×W} of a different scale and semantic level. It is widely believed that layers at different depths of a network may prefer information from different scales, and thus blindly feeding all features F may lead to severe overfitting. A natural solution is to attentively suppress non-informative branches and promote discriminative ones automatically, and we achieve this with the well-established attention mechanism.

In our setting, an ideal attention module should have the following properties. i) Due to the independence of each channel, the final attention should have a strong intra-channel dependency. More specifically, different feature channels are usually obtained by convolving independent filters with the input features. In this case, if a feature from a particular channel is informative for the final prediction, the features in the same channel of the same branch are likely to be informative as well. ii) The final attention should have a strong spatial-wise dependency. The reason is that, as a mid-level task, SOD requires a certain level of reasoning in the local neighbourhood of each pixel. iii) The computation should be efficient. In this respect, the naïve solution of independently learning a group (e.g., C × H × W) of branch-wise attention weights is suboptimal due to its heavy computational overhead.

The first two requirements regularize the attention mechanism in a global and a local manner, respectively. This naturally motivates us to factorize the final attention weights v into two individual weights as follows:

v = d ⊗ s,  (5)

where d and s denote the channel-wise attention and the spatial-wise attention, respectively. ⊗ indicates element-wise multiplication, and d and s are replicated to the same shape of (N+1)×C×H×W before multiplication. More specifically, the channel-wise attention d ∈ R^{(N+1)×C} shrinks the features spatially and attentively suppresses non-informative features and promotes discriminative ones in a channel-wise global manner. In the same way, the spatial-wise attention s ∈ R^{(N+1)×H×W} absorbs the features at a particular spatial location across different channels. Finally, d and s are broadcast to the same dimensionality of (N+1)×C×H×W to obtain the final attention weights via Eq. (5).
Fig. 2. Overall encoder-decoder architecture of the proposed SAMNet. S_i and R_i represent the output feature maps of the encoder and the decoder at the i-th stage, respectively. P_i is the predicted saliency map at the i-th stage, and P_1 is the final prediction of SAMNet. G is the ground-truth saliency map. PPM denotes the well-known Pyramid Pooling Module [59]. Best viewed in color.
b) Preprocessing by scale fusion: In order to reduce the computational overhead, we aggregate the multi-scale contextual information with element-wise summation, namely,

M = Σ_{i=0}^{N} F_i.  (6)

The fused features M are used to compute the gating statistics for each branch. Different from self-attention, the attention for each module branch is decided by the learning of all branches, instead of a single branch as in self-attention mechanisms. Therefore, the SAM module can extract the informative features from all branches within a "global" view. As a result, the SAM module has the capability to learn features at its preferred scales.

c) Channel-wise Attention Mechanism: The channel attention mechanism aims at calculating a channel-wise attention vector W_i^D ∈ R^C for each branch i, i = 0, 1, · · · , N. To explore the inter-channel relationships among different channels [57], we embed the global information using global average pooling (GAP) on the fused feature map M, i.e.,

z_c = F_GAP(M) = (1/(HW)) Σ_{i=1}^{H} Σ_{j=1}^{W} M_{c,i,j},  c = 0, 1, · · · , C−1,  (7)

where z ∈ R^C is the latent vector encoding the channel-wise information of M. Then, we apply a multi-layer perceptron (MLP) with two layers¹ to the latent vector and extract channel-wise information at different scales as follows:

d = F_MLP(z),  (8)

where d ∈ R^{(N+1)C} is further reshaped to R^{(N+1)×C}. The softmax function is applied to d along the branch-wise dimension to obtain the channel-wise attention, i.e.,

W_{i,c}^D = e^{d_{i,c}} / Σ_{j=0}^{N} e^{d_{j,c}},  i = 0, 1, · · · , N;  c = 1, 2, · · · , C.  (9)

With channel-wise attention incorporated, the feature aggregation in Eq. (3) is rewritten as

F^D = Σ_{i=0}^{N} W_i^D ⊗ F_i,  (10)

where W_i^D is replicated to the same shape as F_i (i.e., R^{C×H×W}) before element-wise multiplication. The fusion of Eq. (4) is then applied to F^D to produce the output.

¹We insert batch normalization and a ReLU activation function between the two linear transformations.

d) Spatial Attention Mechanism: The spatial attention mechanism aims at computing a spatial attention map W_i^S ∈ R^{H×W} to highlight or suppress the activation at specific locations. It is well acknowledged that large receptive fields can better capture contextual information, which is crucial for learning location-wise attention [58]. Based on this, we adopt dilated DSConv3×3 to enlarge the receptive fields while maintaining low computational complexity. Specifically, the fused features M are first projected into a low-dimensional space R^{C/4×H×W} by a conv1×1 to reduce parameters and computational cost. Then, two dilated DSConv3×3 are applied to the reduced features for efficient contextual information aggregation. Finally, the features are again reduced to R^{(N+1)×H×W} using a conv1×1. Mathematically, we have

s = F_4^{1×1}(F_3^{3×3}(F_2^{3×3}(F_1^{1×1}(M)))),  (11)

where F_i^{k×k} denotes the i-th (depthwise separable or vanilla) k×k convolution. Similar to the channel-wise attention mechanism, the softmax function is applied along the branch-wise dimension to obtain the spatial attention.
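The PyTorch sketch below assembles the pieces above — scale fusion (Eq. (6)), the GAP-plus-MLP channel branch, the convolutional spatial branch, the stereoscopic combination of Eq. (5), and the branch-wise softmax with weighted aggregation — into one attention block. It is a hedged reading of the description and of Fig. 1 rather than the authors' code: the bottleneck width, the dilation rate of the two DSConv3×3 layers, and the class and argument names are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StereoscopicAttention(nn.Module):
    # Sketch of the stereoscopic attention of Eqs. (5)-(10) and Fig. 1.
    # Given branch features F_i (i = 0..N), it computes channel-wise logits d
    # from the fused map M via GAP + a two-layer MLP, spatial logits s via a
    # bottleneck of 1x1 and dilated depthwise 3x3 convolutions, combines them
    # as v = d * s (broadcast), applies a softmax over the branch dimension,
    # and returns the attention-weighted sum of the branches.
    def __init__(self, channels: int, num_branches: int, dilation: int = 2):
        super().__init__()
        n = num_branches                      # N + 1 branches in total
        mid = max(channels // 4, 4)           # assumed MLP/bottleneck width
        # channel-wise branch: z -> d in R^{(N+1) x C}
        self.mlp = nn.Sequential(
            nn.Linear(channels, mid),
            nn.BatchNorm1d(mid),
            nn.ReLU(inplace=True),
            nn.Linear(mid, n * channels),
        )
        # spatial branch: M -> s in R^{(N+1) x H x W}
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation,
                      groups=mid, bias=False),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation,
                      groups=mid, bias=False),
            nn.Conv2d(mid, n, 1),
        )

    def forward(self, branch_feats):                       # list of (B, C, H, W) tensors
        feats = torch.stack(branch_feats, dim=1)           # (B, N+1, C, H, W)
        b, n, c, h, w = feats.shape
        m = feats.sum(dim=1)                                # Eq. (6): fused map M
        z = F.adaptive_avg_pool2d(m, 1).flatten(1)          # Eq. (7): GAP -> (B, C)
        d = self.mlp(z).view(b, n, c, 1, 1)                 # channel-wise logits
        s = self.spatial(m).view(b, n, 1, h, w)             # spatial logits
        v = d * s                                           # Eq. (5): stereoscopic logits
        weights = torch.softmax(v, dim=1)                   # branch-wise softmax, cf. Eq. (9)
        return (weights * feats).sum(dim=1)                 # cf. Eq. (10): weighted aggregation

A full SAM module would apply such a block to the branch features produced by the multi-scale module of Section III-A and then add the conv1×1 fusion and the residual connection of Eq. (4).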
TABLE II
COMPARISON WITH EXISTING METHODS IN TERMS OF THE NUMBER OF PARAMETERS (#PARAM), GPU MEMORY USAGE, F_β, AND MAE.
The optimization of all experiments is conducted using the Adam optimizer [69] with parameters β_1 = 0.9, β_2 = 0.999, a weight decay of 10^{−4}, and a batch size of 20. Our model is pretrained on the ImageNet dataset [70] as in [43]. We adopt a poly learning rate scheduler so that the learning rate for the n-th epoch is init_lr × (1 − n/#epochs)^{power}, where init_lr = 5 × 10^{−4} and power = 0.9. We train the proposed model for 50 epochs, i.e., #epochs = 50.

b) Datasets: We extensively evaluate the proposed method on six datasets, namely the DUTS [71], ECSSD [72], SOD [73], HKU-IS [47], THUR15K [74], and DUT-OMRON [45] datasets. These six datasets consist of 15572, 1000, 300, 4447, 6232, and 5168 natural images with corresponding pixel-level labels, respectively. Following recent studies [14], [21], [31], [35], [75], we train the proposed model on the DUTS training set and evaluate it on the DUTS testing set (DUTS-TE) and the other five datasets.

c) Evaluation Criteria: We evaluate the accuracy of SAMNet against previous state-of-the-art methods with regard to four widely-used metrics, i.e., the F_β-measure score (F_β), the mean absolute error (MAE), the weighted F_β^ω-measure (F_β^ω) [76], and the structure similarity measure (S_β) [77]. Given a threshold in the range of [0, 1), we can binarize the predicted saliency probability map and then calculate precision and recall values by comparing the binarized prediction map with the binary ground-truth map. With precision and recall calculated, the F_β-measure is the weighted harmonic mean of precision and recall, i.e.,

F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall),  (18)

where we set β² = 0.3 to emphasize the importance of precision, as in previous works [21], [23], [28], [30], [31]. Note that each threshold corresponds to one F_β score, and we report the maximum F_β score across all thresholds. MAE measures the difference between the predicted saliency map P and the ground-truth saliency map G, which can be computed as

MAE(P, G) = (1/(HW)) Σ_{i=1}^{H} Σ_{j=1}^{W} |P_{ij} − G_{ij}|,  (19)

where H and W denote the height and width of the saliency map, respectively. The weighted F_β^ω-measure [76] is designed to amend the interpolation flaw, the dependency flaw, and the equal-importance flaw of traditional evaluation metrics, and we use it with default settings to evaluate SAMNet and other competitors. As for S_β [77], it is also a widely-used saliency evaluation metric, as it measures the structural similarity between predictions and ground truths. S_β consists of region-aware and object-aware structural similarity measures, in which the former is achieved using the well-known SSIM [78] and the latter is based on probability theory. We adopt the official code with default settings in our experiments.

d) Efficiency Measures: This paper targets a lightweight yet powerful solution for SOD, so we also evaluate the efficiency and flexibility of various methods, including the number of model parameters (#Param), GPU memory usage, the number of floating-point operations (FLOPs), and the inference speed (FPS). GPU memory usage measures the amount of memory required for a model to test an image. The number of FLOPs measures the computational cost of a model, and a smaller number of FLOPs leads to lower energy consumption. Following lightweight backbone networks [38]–[41] and efficient semantic segmentation methods [65]–[67], the speed denotes the number of images for which a model can perform network inference per second, tested on a single NVIDIA TITAN Xp GPU. Since deep supervision is only used in training, we omit the calculation of {P_i, i = 2, 3, 4, 5} in the speed testing of SAMNet. For saliency detectors, memory usage, FLOPs, and speed are tested using a 336 × 336 input image unless a method specifies its own input dimensions. Since efficient semantic segmentation methods [65]–[67] are usually designed for high-resolution images, we use a 672 × 672 input to ensure accuracy; otherwise, we would get very low accuracy. For reformed baselines based on lightweight backbones [38]–[41], we use a 336 × 336 input. Here, #Param and GPU memory usage are measured in millions (M), and the number of FLOPs is measured in giga (G).
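As a concrete reference for the two simplest metrics above, the NumPy sketch below computes the MAE of Eq. (19) and the maximum F_β of Eq. (18) over a sweep of binarization thresholds with β² = 0.3. It is only an illustration; the threshold grid, the small epsilon that avoids division by zero, and the function names are our own choices, and the weighted F_β^ω and S_β scores should be computed with the official toolboxes cited above.

import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    # Eq. (19): mean absolute error between a predicted saliency map and
    # its binary ground truth, both scaled to [0, 1].
    return float(np.mean(np.abs(pred - gt)))

def max_f_measure(pred: np.ndarray, gt: np.ndarray,
                  beta2: float = 0.3, steps: int = 255) -> float:
    # Maximum F-beta score (Eq. (18)) over evenly spaced binarization
    # thresholds in [0, 1), with beta^2 = 0.3 as in the paper.
    gt_bin = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, steps, endpoint=False):
        pred_bin = pred >= t
        tp = np.logical_and(pred_bin, gt_bin).sum()
        precision = tp / (pred_bin.sum() + 1e-8)
        recall = tp / (gt_bin.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, float(f))
    return best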
TABLE III
COMPARISON WITH EXISTING METHODS IN TERMS OF THE NUMBER OF FLOPS, SPEED, F_β^ω, AND S_β.
Fig. 3. Illustration of the trade-off between performance and computational cost. The three sub-figures plot the F-measure (F_β), averaged over six datasets, against #Param (M), FLOPs (G), and FPS, respectively. Note that the horizontal axes are logarithmic.
B. Performance Analysis

In this part, we compare the proposed SAMNet with 20 state-of-the-art SOD methods, including DRFI [8], DCL [62], DHSNet [24], RFCN [10], NLDF [12], DSS [30], Amulet [23], UCF [11], SRM [35], PiCANet [21], BRN [14], C2S [16], RAS [34], DNA [31], CPD [63], BASNet [18], AFNet [27], PoolNet [28], EGNet [37], and BANet [64]. Besides existing SOD methods, we also compare with several state-of-the-art lightweight backbone networks that are widely used for image classification, including MobileNet [38], MobileNetV2 [39], ShuffleNet [40], and ShuffleNetV2 [41]. To adapt them to the SOD task, we attach the same decoder as in the proposed SAMNet to them, and the resulting baselines are trained with the same settings for a fair comparison. In addition, we compare with some efficient semantic segmentation methods, including ICNet [65], BiSeNet [66], and DFANet [67]. We reform them for SOD by replacing their final softmax activation function with the standard sigmoid activation function.
Fig. 4. Qualitative comparison of saliency maps (columns: Image, RFCN, DSS, Amulet, UCF, SRM, PiCANet, BRN, C2S, RAS, CPD, EGNet, Ours, GT).
For BiSeNet [66], we report its results with the ResNet-18 [43] (i.e., BiSeNet R18) and Xception-39 [79] (i.e., BiSeNet X39) backbones, respectively.

a) Comparison with Existing SOD Methods: Table II shows the evaluation results of the proposed SAMNet compared with previous state-of-the-art alternatives in terms of the number of parameters, GPU memory usage, F_β, and MAE. Table III shows the evaluation results in terms of the number of FLOPs, speed, F_β^ω, and S_β. Note that the number of FLOPs is highly related to the energy consumption of network inference. The results clearly show that SAMNet achieves on-par accuracy with state-of-the-art SOD solutions, especially in terms of F_β, MAE, and S_β, while requiring one or two orders of magnitude fewer computational resources. For example, compared with the best-performing BANet [64], SAMNet shows a slightly lower average F_β (0.848 vs. 0.868) over six datasets, but it has 42× fewer parameters, 5.5× less GPU memory consumption, 243× fewer FLOPs, and 27× faster running speed than BANet [64]. This has a significant impact on mobile devices, where the limited computational resources, reduced energy supply, and restrictive running memory and storage space cannot afford the heavy overhead of traditional cumbersome SOD methods.

We also illustrate the comparison in Fig. 3, where the trade-offs between the accuracy and efficiency of various methods are shown more clearly. In the sub-figures of F_β vs. #Param and F_β vs. FLOPs, SAMNet lies at the top-left corner; in the sub-figure of F_β vs. FPS, SAMNet lies at the top-right corner. This implies that SAMNet achieves comparable accuracy to previous state-of-the-art methods with far fewer parameters and FLOPs and much faster speed. Therefore, we can conclude that SAMNet achieves a good trade-off among accuracy, the number of parameters, GPU memory usage, the number of FLOPs, and speed.

b) Comparison with Lightweight Backbones: Table II and Table III also present the comparison of SAMNet to lightweight-backbone-based baselines, i.e., MobileNet [38], MobileNetV2 [39], ShuffleNet [40], and ShuffleNetV2 [41]. Although these baselines have similar or even faster speeds than SAMNet, SAMNet achieves substantially better accuracy in terms of all metrics. This suggests that it is suboptimal to apply existing lightweight backbones to SOD directly. It also demonstrates the importance of carefully designing network architectures for the lightweight SOD task and the advantage of the proposed multi-scale attention mechanism.

c) Comparison with Efficient Semantic Segmentation Methods: From Table II and Table III, we can see that SAMNet outperforms efficient semantic segmentation methods [65]–[67] by a large margin, with fewer parameters, fewer FLOPs, and faster speed. It is interesting to find that efficient semantic segmentation methods perform worse than the baselines based on lightweight backbones. This implies that efficient semantic segmentation methods are heavily tuned for semantic segmentation and are thus unsuitable for direct application to SOD. Hence, lightweight SOD is an essential problem and deserves more attention from the community.

d) Qualitative Comparison: In Fig. 4, we provide some visualization examples to exhibit the superiority of SAMNet. Although SAMNet performs slightly worse than traditional cumbersome SOD methods, it can segment salient objects with coherent boundaries in many challenging circumstances, such as complicated scenarios (1st and 3rd rows), low contrast between foreground and background (2nd and 4th rows), large objects (5th and 6th rows), scenarios with abnormal brightness (7th and 8th rows), and confusing natural scenarios (9th row). Combined with the lightweight and efficient nature of SAMNet, this gives it the potential to promote real-world SOD applications.
TABLE IV
ABLATION STUDY FOR THE DESIGN CHOICES OF SAMNET.
TABLE V
ABLATION STUDY FOR THE CONFIGURATIONS OF SAMNET.
C. Ablation Study

In this section, we conduct an ablation study to demonstrate the effectiveness of the proposed module components and the parameter configurations of SAMNet. The experimental settings follow those in Section IV-B.

a) Proposed Module Components: Table IV shows the ablation study results for the proposed module components. The proposed SAM module is designed by carefully combining these basic components into a nontrivial module for effective and efficient multi-scale learning. Table IV suggests that the performance is gradually boosted by adding each component into the framework. Besides, the comparison between No. 0 and No. 5 demonstrates the superiority of the proposed solution over the baseline, where the performance gap comes solely from our contributions because both models are trained from scratch without ImageNet [70] pretraining.

b) Configurations of SAMNet: Table V shows the ablation study results for different network configurations. It is interesting to find that the proposed SAMNet is robust to slight changes in its configuration. Introducing more parameters, such as increasing the number of convolution filters, leads to better performance, but this is orthogonal to our goal of lightweight SOD. The default setting of SAMNet is chosen by considering the trade-off between effectiveness and the lightweight restriction.

V. CONCLUSION

Instead of only considering accuracy, this paper focuses on lightweight SOD that trades off among accuracy, efficiency, parameter count, and FLOPs. We propose a novel SAM module, which enables small networks to effectively encode both high-level features and low-level details. Incorporating the SAM module, the proposed SAMNet achieves comparable performance with state-of-the-art SOD methods that use significantly more parameters, while saving several orders of magnitude of overhead. Such an excellent trade-off between performance and efficiency makes it possible for SAMNet to provide high-accuracy SOD in resource-limited environments, e.g., on mobile devices. SAMNet also clearly outperforms other well-known lightweight networks for image classification [38]–[41] and semantic segmentation [65]–[67], suggesting that lightweight SOD is worth studying and should be established as a separate research direction. To the best of our knowledge, SAMNet is the first lightweight SOD method and is expected to pave a new path for SOD. Through this work, we hope to stimulate research on lightweight SOD that will promote more practical SOD applications. In the future, we plan to apply weight quantization and network compression techniques to boost SAMNet's CPU speed of 5fps towards real-time performance.
ACKNOWLEDGMENT

This research was supported by the Major Project for New Generation of AI under Grant No. 2018AAA0100400, NSFC (61922046), the S&T innovation project from the Chinese Ministry of Education, and the Tianjin Natural Science Foundation (17JCJQJC43700).

REFERENCES

[1] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu, "Global contrast based salient region detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, 2015.
[2] Y. Gao, M. Wang, Z.-J. Zha, J. Shen, X. Li, and X. Wu, "Visual-textual joint relevance learning for tag-based social image search," IEEE Trans. Image Process., vol. 22, no. 1, pp. 363–376, 2012.
[3] M. Donoser, M. Urschler, M. Hirzer, and H. Bischof, "Saliency driven total variation segmentation," in Int. Conf. Comput. Vis., 2009, pp. 817–824.
[4] U. Rutishauser, D. Walther, C. Koch, and P. Perona, "Is bottom-up attention useful for object recognition?" in IEEE Conf. Comput. Vis. Pattern Recog., 2004, pp. 37–44.
[5] V. Mahadevan and N. Vasconcelos, "Saliency-based discriminant tracking," in IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 1007–1013.
[6] Z. Ren, S. Gao, L.-T. Chia, and I. W.-H. Tsang, "Region-based saliency detection and its application in object recognition," IEEE Trans. Circ. Syst. Video Technol., vol. 24, no. 5, pp. 769–779, 2013.
[7] M.-M. Cheng, F.-L. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, "RepFinder: Finding approximately repeated scene elements for image editing," ACM Trans. Graph., vol. 29, no. 4, pp. 83:1–83:8, 2010.
[8] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, "Salient object detection: A discriminative regional feature integration approach," in IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 2083–2090.
[9] D. A. Klein and S. Frintrop, "Center-surround divergence of feature statistics for salient object detection," in Int. Conf. Comput. Vis., 2011, pp. 2214–2219.
[10] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan, "Saliency detection with recurrent fully convolutional networks," in Eur. Conf. Comput. Vis., 2016, pp. 825–841.
[11] P. Zhang, D. Wang, H. Lu, H. Wang, and B. Yin, "Learning uncertain convolutional features for accurate saliency detection," in Int. Conf. Comput. Vis., 2017, pp. 212–221.
[12] Z. Luo, A. K. Mishra, A. Achkar, J. A. Eichel, S. Li, and P.-M. Jodoin, "Non-local deep features for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 6609–6617.
[13] P. Hu, B. Shuai, J. Liu, and G. Wang, "Deep level sets for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 2300–2309.
[14] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, and A. Borji, "Detect globally, refine locally: A novel approach to saliency detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 3127–3135.
[15] L. Zhang, J. Dai, H. Lu, Y. He, and G. Wang, "A bi-directional message passing model for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 1741–1750.
[16] X. Li, F. Yang, H. Cheng, W. Liu, and D. Shen, "Contour knowledge transfer for salient object detection," in Eur. Conf. Comput. Vis., 2018, pp. 355–370.
[17] N. D. Bruce, C. Catton, and S. Janjic, "A deeper look at saliency: Feature contrast, semantics, and beyond," in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 516–524.
[18] X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand, "BASNet: Boundary-aware salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 7479–7489.
[19] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang, "Progressive attention guided recurrent network for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 714–722.
[20] W. Wang, J. Shen, X. Dong, and A. Borji, "Salient object detection driven by fixation prediction," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 1711–1720.
[21] N. Liu, J. Han, and M.-H. Yang, "PiCANet: Learning pixel-wise contextual attention for saliency detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 3089–3098.
[22] M. A. Islam, M. Kalash, and N. D. Bruce, "Revisiting salient object detection: Simultaneous detection, ranking, and subitizing of multiple salient objects," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 7142–7150.
[23] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan, "Amulet: Aggregating multi-level convolutional features for salient object detection," in Int. Conf. Comput. Vis., 2017, pp. 202–211.
[24] N. Liu and J. Han, "DHSNet: Deep hierarchical saliency network for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 678–686.
[25] S. He, J. Jiao, X. Zhang, G. Han, and R. W. Lau, "Delving into salient object subitizing and detection," in Int. Conf. Comput. Vis., 2017, pp. 1059–1067.
[26] W. Wang, S. Zhao, J. Shen, S. C. Hoi, and A. Borji, "Salient object detection with pyramid attention and salient edges," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 1448–1457.
[27] M. Feng, H. Lu, and E. Ding, "Attentive feedback network for boundary-aware salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 1623–1632.
[28] J.-J. Liu, Q. Hou, M.-M. Cheng, J. Feng, and J. Jiang, "A simple pooling-based design for real-time salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 3917–3926.
[29] R. Wu, M. Feng, W. Guan, D. Wang, H. Lu, and E. Ding, "A mutual learning method for salient object detection with intertwined multi-supervision," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 8150–8159.
[30] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. Torr, "Deeply supervised salient object detection with short connections," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 4, pp. 815–828, 2019.
[31] Y. Liu, M.-M. Cheng, X.-Y. Zhang, G.-Y. Nie, and M. Wang, "DNA: Deeply-supervised nonlinear aggregation for salient object detection," IEEE Trans. Cybernetics, 2021.
[32] Y. Liu, Y.-C. Gu, X.-Y. Zhang, W. Wang, and M.-M. Cheng, "Lightweight salient object detection via hierarchical visual perception learning," IEEE Trans. Cybernetics, 2020.
[33] Y. Qiu, Y. Liu, H. Yang, and J. Xu, "A simple saliency detection approach via automatic top-down feature fusion," Neurocomputing, vol. 388, pp. 124–134, 2020.
[34] S. Chen, X. Tan, B. Wang, and X. Hu, "Reverse attention for salient object detection," in Eur. Conf. Comput. Vis., 2018, pp. 234–250.
[35] T. Wang, A. Borji, L. Zhang, P. Zhang, and H. Lu, "A stagewise refinement model for detecting salient objects in images," in Int. Conf. Comput. Vis., 2017, pp. 4019–4028.
[36] Y. Qiu, Y. Liu, X. Ma, L. Liu, H. Gao, and J. Xu, "Revisiting multi-level feature fusion: A simple yet effective network for salient object detection," in IEEE Int. Conf. Image Process., 2019, pp. 4010–4014.
[37] J.-X. Zhao, J. Liu, D.-P. Fan, Y. Cao, J. Yang, and M.-M. Cheng, "EGNet: Edge guidance network for salient object detection," in Int. Conf. Comput. Vis., 2019, pp. 8779–8788.
[38] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint, 2017.
[39] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 4510–4520.
[40] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 6848–6856.
[41] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, "ShuffleNet v2: Practical guidelines for efficient CNN architecture design," in Eur. Conf. Comput. Vis., 2018, pp. 116–131.
[42] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Int. Conf. Learn. Represent., 2015.
[43] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 770–778.
[44] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, 2018.
[45] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 3166–3173.
[46] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang, "Saliency detection via dense and sparse reconstruction," in Int. Conf. Comput. Vis., 2013, pp. 2976–2983.
[47] G. Li and Y. Yu, "Visual saliency based on multiscale deep features," in IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 5455–5463.
[48] R. Zhao, W. Ouyang, H. Li, and X. Wang, "Saliency detection by multi-context deep learning," in IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 1265–1274.
[49] L. Wang, H. Lu, X. Ruan, and M.-H. Yang, "Deep networks for saliency detection via local estimation and global search," in IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 3183–3192.
[50] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 3431–3440.
[51] R. A. Rensink, "The dynamic representation of scenes," Visual Cognition, vol. 7, no. 1-3, pp. 17–42, 2000.
[52] M. Corbetta and G. L. Shulman, "Control of goal-directed and stimulus-driven attention in the brain," Nature Reviews Neuroscience, vol. 3, no. 3, pp. 201–215, 2002.
[53] A. Miech, I. Laptev, and J. Sivic, "Learnable pooling with context gating for video classification," arXiv preprint, 2017.
[54] D. Chen, S. Zhang, W. Ouyang, J. Yang, and Y. Tai, "Person search via a mask-guided two-stream CNN model," in Eur. Conf. Comput. Vis., 2018, pp. 734–750.
[55] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, "Image super-resolution using very deep residual channel attention networks," in Eur. Conf. Comput. Vis., 2018, pp. 286–301.
[56] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, "Residual attention network for image classification," in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 3156–3164.
[57] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 7132–7141.
[58] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, "CBAM: Convolutional block attention module," in Eur. Conf. Comput. Vis., 2018, pp. 3–19.
[59] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 2881–2890.
[60] X. Chen, A. Zheng, J. Li, and F. Lu, "Look, perceive and segment: Finding the salient objects in images via two-stream fixation-semantic CNNs," in Int. Conf. Comput. Vis., 2017, pp. 1050–1058.
[61] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, "Deeply-supervised nets," in Artif. Intell. Stat., 2015, pp. 562–570.
[62] G. Li and Y. Yu, "Deep contrast learning for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 478–487.
[63] Z. Wu, L. Su, and Q. Huang, "Cascaded partial decoder for fast and accurate salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 3907–3916.
[64] J. Su, J. Li, Y. Zhang, C. Xia, and Y. Tian, "Selectivity or invariance: Boundary-aware salient object detection," in Int. Conf. Comput. Vis., 2019, pp. 3799–3808.
[65] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, "ICNet for real-time semantic segmentation on high-resolution images," in Eur. Conf. Comput. Vis., 2018, pp. 405–420.
[66] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, "BiSeNet: Bilateral segmentation network for real-time semantic segmentation," in Eur. Conf. Comput. Vis., 2018, pp. 325–341.
[67] H. Li, P. Xiong, H. Fan, and J. Sun, "DFANet: Deep feature aggregation for real-time semantic segmentation," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 9522–9531.
[68] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," in Adv. Neural Inform. Process. Syst., 2019, pp. 8026–8037.
[69] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Int. Conf. Learn. Represent., 2015.
[70] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 248–255.
[71] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan, "Learning to detect salient objects with image-level supervision," in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 136–145.
[72] Q. Yan, L. Xu, J. Shi, and J. Jia, "Hierarchical saliency detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 1155–1162.
[73] V. Movahedi and J. H. Elder, "Design and perceptual validation of performance measures for salient object segmentation," in IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2010, pp. 49–56.
[74] M.-M. Cheng, N. J. Mitra, X. Huang, and S.-M. Hu, "SalientShape: Group saliency in image collections," The Vis. Comput., vol. 30, no. 4, pp. 443–453, 2014.
[75] Y. Zeng, H. Lu, L. Zhang, M. Feng, and A. Borji, "Learning to promote saliency detectors," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 1644–1653.
[76] R. Margolin, L. Zelnik-Manor, and A. Tal, "How to evaluate foreground maps?" in IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 248–255.
[77] D.-P. Fan, M.-M. Cheng, Y. Liu, T. Li, and A. Borji, "Structure-measure: A new way to evaluate foreground maps," in Int. Conf. Comput. Vis., 2017, pp. 4548–4557.
[78] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004.
[79] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 1251–1258.

Yun Liu is a PhD candidate at the College of Computer Science, Nankai University. He received his bachelor degree from Nankai University in 2016. His research interests include computer vision and machine learning.

Xin-Yu Zhang is an undergraduate student at the School of Mathematical Sciences, Nankai University. His research interests include computer vision and deep learning.

Jia-Wang Bian is a PhD student at the University of Adelaide and an Associated PhD researcher with the Australian Centre for Robotic Vision (ACRV). He is advised by Prof. Ian Reid and Prof. Chunhua Shen. His research interests lie in the field of computer vision and robotics. Jiawang received his B.Eng. degree from Nankai University, where he was advised by Prof. Ming-Ming Cheng. He was a research assistant at the Singapore University of Technology and Design (SUTD). Jiawang also did a trainee engineer job at the Advanced Digital Sciences Center (ADSC), Huawei Technologies Co., Ltd, and Tusimple.

Le Zhang received the B.Eng degree from the University of Electronic Science and Technology of China in 2011. He received his M.Sc and PhD degrees from Nanyang Technological University (NTU) in 2012 and 2016, respectively. Currently, he is a scientist at the Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore. He has served as a TPC member for several conferences such as AAAI and IJCAI, and as a Guest Editor for Pattern Recognition and Neurocomputing. His current research interests include deep learning and computer vision.

Ming-Ming Cheng received his PhD degree from Tsinghua University in 2012. He then spent two years as a research fellow with Prof. Philip Torr at Oxford. He is now a professor at Nankai University, leading the Media Computing Lab. His research interests include computer graphics, computer vision, and image processing. He has received research awards including the ACM China Rising Star Award, the IBM Global SUR Award, and the CCF-Intel Young Faculty Researcher Program. He is on the editorial board of IEEE TIP.