
Information Sciences 546 (2021) 835–857

Contents lists available at ScienceDirect

Information Sciences
journal homepage: www.elsevier.com/locate/ins

CNN-based encoder-decoder networks for salient object detection: A comprehensive review and recent advances
Yuzhu Ji a, Haijun Zhang a,*, Zhao Zhang b, Ming Liu c

a Department of Computer Science, Harbin Institute of Technology Shenzhen, PR China
b Department of Computer Science, Hefei University of Technology, Hefei, PR China
c School of Astronautics, Harbin Institute of Technology, Harbin, PR China

a r t i c l e   i n f o

Article history:
Received 4 July 2020
Received in revised form 19 August 2020
Accepted 1 September 2020
Available online 9 September 2020

Keywords:
Salient object detection
Encoder-decoder model
Pixel-level classification
Video saliency
Empirical study

a b s t r a c t

Convolutional neural network (CNN)-based encoder-decoder models have profoundly inspired recent works in the field of salient object detection (SOD). With the rapid development of encoder-decoder models with respect to most pixel-level dense prediction tasks, an empirical study still does not exist that evaluates performance by applying a large body of encoder-decoder models on SOD tasks. In this paper, instead of limiting our survey to SOD methods, a broader view is further presented from the perspective of fundamental architectures of key modules and structures in CNN-based encoder-decoder models for pixel-level dense prediction tasks. Moreover, we focus on performing SOD by leveraging deep encoder-decoder models, and present an extensive empirical study on baseline encoder-decoder models in terms of different encoder backbones, loss functions, training batch sizes, and attention structures. Moreover, state-of-the-art encoder-decoder models adopted from semantic segmentation and deep CNN-based SOD models are also investigated. New baseline models that can outperform state-of-the-art performance were discovered. In addition, these newly discovered baseline models were further evaluated on three video-based SOD benchmark datasets. Experimental results demonstrate the effectiveness of these baseline models on both image- and video-based SOD tasks. This empirical study is concluded by a comprehensive summary which provides suggestions on future perspectives.

© 2020 Elsevier Inc. All rights reserved.

1. Introduction

The human visual attention system intends to extract the most informative objects and regions in a scene, and then combines this local information to efficiently understand the whole scene. This kind of visual attention mechanism has prompted many researchers to simulate this ability in computer vision tasks [61,6]. Salient object detection (SOD) aims at finding the most attractive object(s) in a scene in order to simulate the functionality of the biological visual attention system [5]. In the past decade, remarkable success of deep convolutional neural networks (CNNs) has been achieved in a large number of computer vision tasks. Owing to their powerful generalization capability, deep CNN models have been developed and applied not only to image-level classification tasks [70,36] but also to pixel-level classification tasks [58,110].

* Corresponding author.
E-mail address: [email protected] (H. Zhang).

https://doi.org/10.1016/j.ins.2020.09.003
0020-0255/© 2020 Elsevier Inc. All rights reserved.

Concretely, fully convolutional network (FCN)-based encoder-decoder models have dramatically improved performance
on pixel-wise image-to-image learning tasks, including semantic segmentation [58,110,12], edge detection [93], SOD
[51,82,84,108], and crowd counting [109,49]. In essence, the trend of mainstream SOD methods developed in recent years
indicates that most of them work under the encoder-decoder framework. Some researchers have developed structures based
on an encoder-decoder model for the SOD task and achieved state-of-the-art performance [51,82,84]. Specifically, CNN-based
encoder-decoder models play an important role in continuously updating the SOD performance on benchmark datasets
[5]. Techniques, including multi-scale or multi-level structures [24], attention layers [51], etc., are also developed and intro-
duced into SOD models.
However, an important issue is whether a generic routine exists to augment performance by determining which components are key factors under the encoder-decoder framework. To the best of our knowledge, an empirical study does not yet exist that thoroughly evaluates the performance of this kind of generic framework on the SOD task. In this work, we focus on investigating the profound influence of the CNN-based encoder-decoder model on SOD, and providing an empirical study on the performance obtained by applying encoder-decoder models to the SOD task. Moreover, we also provide a literature review in terms of key components of the encoder-decoder framework on a broad range of pixel-level classification or regression tasks. According to our experimental results, baseline models and their variants composed of an encoder with ResNet [22], and decoders with the pyramid parsing module (PPM) [110], the atrous spatial pyramid pooling module (ASPP) [11], and the feature pyramid network (FPN) [46], have been found. The new baseline models outperform state-of-the-art deep SOD models. To further understand these results, we performed a thorough ablation study on each key module and technique under the encoder-decoder framework.
The main idea of this paper lies in broadening the research exploration of SOD by introducing modules and techniques in
other similar pixel-level dense prediction tasks of computer vision, such as semantic segmentation [11] and edge detection
[93]. In particular, CNN-based encoder-decoder models have been widely used in semantic segmentation. In essence, the
trend of mainstream SOD methods developed in recent years indicates that most of them work under the encoder-
decoder framework. Some researchers have developed structures based on an encoder-decoder model for SOD task and
achieved state-of-the-art performance [51,82,84]. However, no article has fully performed cross-domain module validation.
Therefore, our paper aims at quantifying model effectiveness by introducing key techniques into an encoder-decoder model, exploring potential network structures and learning strategies for developing a baseline SOD model, and providing some insights for researchers in SOD.
In this work, the reviewed papers largely cover topics including SOD, semantic segmentation, and the encoder-decoder
model and its key techniques and sub-modules. We also noticed that several survey papers [5,83] exist that review the lit-
erature with respect to SOD methods. Specifically, according to the literature review [5,83] on CNN-based SOD models pro-
posed in recent years, CNN-based encoder-decoder models play an important role in continuously updating the SOD
performance on benchmark datasets. Other techniques, including multi-scale or multi-level structures [24], attention struc-
tures [51], etc., are also developed and introduced into SOD models. It can be observed that most of the techniques in SOD are
inspired or derived from deep CNN-based encoder-decoder models for other similar tasks. Different from the above survey papers, in this work, we focus on solving SOD by leveraging deep CNN-based encoder-decoder models, and present a thorough empirical study on baseline encoder-decoder models and a comparison of state-of-the-art deep CNN models for SOD
on seven image-based benchmark datasets. In addition, the discovered new baseline models were further evaluated on three
video-based SOD datasets in comparison to 18 state-of-the-art methods. Instead of limiting our survey to SOD methods, a
broader view is presented from the perspective of fundamental architectures of key modules and techniques in CNN-
based encoder-decoder models for pixel-level dense prediction tasks.
The remainder of this paper is organized as follows. In Section 2, we review a large body of SOD models proposed in the deep learning era. It has become a trend to construct CNN-based encoder-decoder models for the SOD task to achieve higher performance. The rationale behind the techniques in each component is highlighted, including the backbone network with powerful generalization capacity, header structures for rich feature mining, attention structures for feature recalibration, and multi-scale and multi-level feature integration structures for recovering the details of a salient object segmentation mask. Section 3 further outlines the techniques in CNN-based encoder-decoder models for image-to-image learning tasks, which are tightly coupled with similar tasks. Details of the empirical study and experimental results on validating the efficacy of each key module are given in Section 4. Finally, in Section 5, we conclude the paper with directions for future work.

2. Salient Object Detection (SOD)

This section outlines the taxonomy of SOD from the perspectives of traditional methods and deep CNN-based models, respectively. In particular, relations between state-of-the-art deep SOD models and encoder-decoder models are also provided.

2.1. Traditional methods

Traditional methods of SOD can be categorized into two main classes: bottom-up methods and top-down methods. The
concepts of ‘‘bottom-up” and ‘‘top-down” are mainly based on the theory of experimental psychology and cognitive

neuroscience [59]. Specifically, bottom-up methods define saliency by focusing on low-level features as exogenous factors triggering the visual attention mechanism. Top-down methods define saliency from an endogenous perspective, which means that visual attention is closely related to an individual's experience, memory, and emotion [59].
In practice, the design of bottom-up methods largely depends on saliency priors [64,89,31,45,15,94,68], including center
surround prior, foreground prior, boundary connectivity prior, local and global contrast prior, focusness prior, and geodesic
prior. Those priors can serve as a kind of semi-supervised information to design heuristic bottom-up methods based on con-
straints. Zhu et al. [114] introduced a saliency optimization method to find the backgroundness probability of superpixels by
considering geodesic saliency. Yang et al. [94] proposed a graph-based model by using manifold ranking, in which a query
sequence is constructed by considering that the boundary nodes can be generally treated as background or non-salient.
However, bottom-up methods are heuristically designed and rely on specific saliency priors. Consequently, they may not work well when the images are mismatched with the corresponding priors. Thus, relying on unsupervised methods based on preset constraints constitutes a performance bottleneck. Top-down approaches place more effort into feature extraction
and classifier design [45] before the era of deep learning. Jiang et al. [32] proposed a discriminative regional feature integra-
tion approach to map regional feature vectors to saliency scores by using a random forest regressor. Liu et al. [55] presented a
method by modeling groups of saliency features through conditional random field (CRF) learning for detecting salient
objects, which was also applied to video saliency detection tasks by modeling extra spatial–temporal information.

2.2. Deep CNN-based models

Attributed to the success of FCN-based encoder-decoder models achieved in pixel-level dense prediction tasks [58,110,12,93], ideas inspired by encoder-decoder models for solving saliency detection have been presented. Thus, the performance of SOD has been updated continuously [51,82,84,108,50,62,81,105,106,24]. These deep CNN-based models can be
further classified into three categories: CNN-based encoder-decoder models, deep context-saliency models, and other
models.

2.2.1. CNN-based encoder-decoder models


In recent years, most deep CNN-based models can be unified under an encoder-decoder framework based on FCN, and can be trained in an end-to-end fashion by using pixel-wise annotated saliency maps. More specifically, most of these
kinds of models work under an FCN-like structure, which was initially proposed to solve other image-to-image learning
tasks, including semantic segmentation [12,11,110], and edge detection [93]. Certain techniques, such as skip connection
[22,71], atrous convolution [12], deeply supervised net with holistically-nested side-way output [93], and pyramid pooling
module [110] can be implicitly or explicitly introduced to develop new deep CNN models for SOD. Concretely, Liu et al. [50]
developed a deep hierarchical saliency detection network under a multi-scale learning architecture. Hou et al. [24] proposed
a deeply supervised salient object detection model (DSS), in which short connections were introduced to make the original
holistically-nested edge detection network (HED [93]) densely connected. Moreover, Wang et al. [84] proposed an attentive
saliency network (ASNet) for both fixation prediction and SOD. Zhang et al. [103] proposed a FCN-based bi-directional mes-
sage passing model for SOD. Liu et al. [51] proposed a pixel-wise contextual attention network, namely, PiCANet, as an atten-
tion generation module learning to selectively attend to informative context locations for each pixel by using the Bi-LSTM
module within a feature map.

2.2.2. Deep context-saliency models


Deep context-saliency models regard saliency as a local-to-global context relationship. It can be treated as a kind of
region-based saliency detection method to some extent. For example, Zhao et al. [112] proposed a deep CNN-based two-
stream model for global and local-context saliency modeling. A key operation involved in such region-based methods is superpixel-centered window cropping and padding. In addition, Lee et al. [37] proposed to integrate the encoded
low-level distance (ELD) map and the extracted high-level global feature under a two-stream unified network. Li and Yu
[40] developed a multi-scale deep features fusion framework, in which saliency of a region is determined by incorporating
three scales of input features, including a query region, a sub-region containing the query and the whole image. Essentially,
deep context-saliency models attempt to model saliency by classifying local patches. However, the superpixel or local-patch
generation procedure for region-based SOD methods must be performed separately or offline during the training process. It thus becomes time-consuming to produce the final saliency map for an image, because local-to-global region samples must be inferred many times.

2.2.3. Other models


Many researchers in recent years have explored the possibility of broadening the research scope of SOD, such as unsupervised methods [102,98], weakly supervised methods [80], and multi-task learning [44]. Specifically, Zhang et al. [102] proposed an unsupervised learning framework by integrating an intra-image fusion stream and an inter-image fusion stream
to generate the learning curriculum and pseudo ground-truth. Wang et al. [80] introduced a weakly supervised SOD method
by leveraging image-level tags. Li et al. [44] developed an architecture for multi-task learning under a fully convolutional
network (FCN) structure and designed a Laplacian regularized nonlinear regression scheme to refine the final saliency map.

Table 1
Overview of deep CNN-based models for SOD. Each entry lists the method, publication venue, backbone network, and core ideas.

PFA [113] (CVPR2019, VGGNet): Pyramid feature attention network, channel- and spatial-wise attention module, context-aware pyramid feature extraction.
BASNet [67] (CVPR2019, ResNet34): Deeply supervised encoder-decoder prediction module, residual refinement module, hybrid loss integrating BCE, SSIM and IoU.
BiMCFEM [103] (CVPR2018, VGGNet): Bi-directional message passing between multi-scale feature maps (multi-scale context-aware feature extraction module (MCFEM), and gated bi-directional message passing).
DGRL [82] (CVPR2018, ResNet50): Recurrent localization network, Inception-like contextual weighting module, recurrent module, boundary refinement network (BRN [107]), local saliency map refinement module.
PiCANet [51] (CVPR2018, VGGNet): Horizontal and vertical biLSTM attention-guided convolutional layers.
ASNet [84] (CVPR2018, VGGNet): Multi-task learning, fixation prediction and SOD, multi-level ConvLSTM, and metrics loss.
DSS [24] (CVPR2017, VGGNet): Short connections in decoders, multiple side-way outputs, dense CRF refinement.
SRM [81] (ICCV2017, ResNet50): Stage-wise saliency map refinement, spatial pyramid pooling, atrous convolution.
Amulet [105] (ICCV2017, VGGNet): Recursively embedding multi-level feature maps.
UCF [106] (ICCV2017, –): R-dropout operation for uncertainty in the encoder, and hybrid upsampling for smoothing.
DHSNet [50] (CVPR2016, VGGNet): Recurrent convolutional layer for hierarchical saliency map refinement.
DCL [39] (CVPR2016, VGGNet): Multi-scale fully convolutional network, integrating multi-scale feature maps.
MDF [38] (CVPR2015, VGGNet): Multi-scale deep feature fusion network; three scales of input features, including query, sub-region, and whole image.
MC [112] (CVPR2015, GoogLeNet): Two-stream CNN model for global and local feature learning based on superpixels; multi-context learning framework.
ELD [37] (CVPR2016, VGGNet): Two-stream feature encoding network, encoded low-level distance maps, global and local feature concatenation.
WSS [80] (CVPR2017, VGGNet): Foregroundness activation map; weakly supervised saliency detection.
DNM [102] (CVPR2018, ResNet101): Multiple noisy labeling, noise modeling, ‘‘latent” saliency prediction module, unsupervised method.
DS [44] (TIP2016, VGGNet): Multi-task learning, Laplacian regularized nonlinear regression, FCN-based model.
UFNet [25] (CoRR2018, VGGNet): Unified encoder-decoder framework for salient object segmentation, edge detection and skeleton extraction; deeply supervised framework, multi-level feature fusion, multi-task.

CNN-based state-of-the-art SOD models are shown in Table 1. Core ideas and network architectures are also listed. Table 1
reveals that techniques, including atrous convolution, spatial pyramid pooling, multi-scale and multi-level feature fusion,
and CRF refinement [12,24], are closely coupled with encoder-decoder models proposed in other pixel-level prediction tasks.
For clarity, Fig. 1 presents key techniques and structures in SOD and encoder-decoder models, and illustrates the relations
between SOD and encoder-decoder models, as well as connections among sub-level topics.

Fig. 1. Taxonomy of surveys with respect to key techniques and structures in SOD and encoder-decoder models.


3. CNN-based encoder-decoder model

In this section, we briefly review state-of-the-art encoder-decoder models for pixel-level dense prediction tasks
[58,12,93,109,49,11]. Key components and techniques will be introduced in the following subsections.

3.1. Background

Initially, FCN [58] can be treated as a prototype of the encoder-decoder framework by using CNN for semantic segmen-
tation. In FCN, the fully connected layer for image-level classification is removed and replaced by multiple layers of trans-
posed convolution operation or bilinear up-sampling operations. FCN-based or FCN-like models have been proposed in
recent years. Consequently, semantic segmentation performance has been continuously improved. Most of the CNN-based
models can be reduced to the encoder-decoder model. Models, e.g., SegNet [1,34], UNet [71], and DeepLab series
[8,12,9,11,48], have been recently advanced. Specifically, Badrinarayanan et al. [1] proposed an encoder-decoder model
for semantic segmentation, which is based on the FCN structure and a symmetrical decoder. In [34], this network structure
is expanded to model segmentation uncertainty. The atrous convolution operation from DeepLab [12,11] and early down-
sampling have been proven to be effective in this framework. Moreover, Ronneberger et al. [71] designed a symmetric net-
work into a U-shape by recursively compensating information between encoder and decoder layers with the same scales.
However, CNN-based encoder-decoder models are not only developed in semantic segmentation, but have also been
adopted in many other pixel-level prediction fields, such as edge detection [93], crowd counting [109,49], fixation prediction
[52,17], and SOD [51,84,105,106,24,29]. Actually, these aforementioned image-to-image learning tasks are similar and clo-
sely related. For example, SOD and semantic segmentation can be categorized into the same kind of vision task, i.e., pixel-
wise dense classification, from the perspective of output forms. This makes the relationship between SOD and semantic seg-
mentation quite close. Generally, a semantic segmentation model aims at accurately and efficiently classifying each pixel
into a category to facilitate applications, such as autonomous driving, indoor navigation, and virtual or augmented reality
systems to name a few [35]. However, from the perspective of task objectives, SOD tends to predict the most attractive object
in a scene by simulating the visual attention mechanism. In addition, edge detection is also a pixel-level binary classification
task, but it has to address extremely unbalanced positive and negative samples. Moreover, crowd counting and fixation pre-
diction can be reduced to pixel-level regression tasks.
In the following subsections, key techniques and components involved in each important part of encoder-decoder models
will be briefly discussed, including encoder backbone networks, ‘‘header” structures and their variants, multi-scale and
multi-level feature fusion, attention structures, and loss functions.

3.2. Encoder backbone network for feature extraction

3.2.1. Network structures


Many successful networks, including VGGNet [74], ResNet [22] and Inception [77,78,76], have gradually improved generalization ability with respect to multi-class object recognition tasks. The influence of the encoder backbone network, built by stacking convolutional layers, on final prediction accuracy is significant and fundamental. Many studies in object detection, semantic segmentation, etc., utilize off-the-shelf networks as their backbone networks for feature extraction. In practice, the generalization ability of backbone networks in the encoder part can be transferred by loading parameters that are pre-trained on other large-scale image classification datasets, such as the ImageNet dataset [19].
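As a concrete illustration, the sketch below (assuming PyTorch and torchvision are available; the variable names are ours, not those of any specific model) shows one common way to reuse an ImageNet-pretrained classification network as an encoder by dropping its global pooling and fully connected head.

```python
import torch
import torchvision.models as models

# Minimal sketch: reuse an ImageNet-pretrained ResNet-50 as an encoder by
# stripping the global average pooling and fully connected classification head.
backbone = models.resnet50(pretrained=True)           # loads ImageNet weights
encoder = torch.nn.Sequential(*list(backbone.children())[:-2])

x = torch.randn(1, 3, 224, 224)                       # dummy input image
features = encoder(x)                                  # 2048-channel map at 1/32 resolution
print(features.shape)                                  # torch.Size([1, 2048, 7, 7])
```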
Research on networks for basic vision tasks is still making progress and influencing the development of other high-level vision tasks. Many insightful ideas with respect to the development of network building blocks have significantly inspired other researchers to rethink the ability of the convolution and pooling operations, which pushes the boundary of CNN models.

3.2.2. Convolutional blocks


Ideas on building basic blocks (unit structure), including the ResNet block (basic/bottleneck structure), Inception block,
squeeze-and-excitation (SE) block, and separable convolutional layer, have been reported. The core idea of the ResNet block
is the residual connection, which makes it possible to construct an extremely deep network [22]. Intuitively, the information
of previous layers with the same scale will be reused and directly compensate the current layer of feature maps. In practice,
regular convolutional operation couples the filter learning process in a 3D space. However, the Inception block attempts to
decouple cross-channel correlations and spatial correlations with respect to two spatial dimensions (i.e., width and height)
and a channel dimension [77]. Szegedy et al. [76] introduced a hybrid structure by combining Inception and ResNet blocks,
which is termed the ‘Inception-ResNet’ block. In [76], a residual connection was introduced into the inception structure.
Depth-wise separable convolution is an extreme form of Inception structure introduced in Xception [16]. It is based on a
stronger hypothesis that cross-channel correlations and spatial correlations can be mapped and decoupled completely.
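To make the two ideas above concrete, the PyTorch-style sketch below (our own simplified re-implementation, not the exact blocks used in any cited paper) shows a residual block, where the input is added back to the transformed features, and a depthwise separable convolution, where spatial and cross-channel correlations are handled by two decoupled convolutions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of the ResNet idea: output = F(x) + x, so information from the
    previous layer at the same scale directly compensates the current features."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))

    def forward(self, x):
        return torch.relu(self.body(x) + x)   # residual (skip) connection

class DepthwiseSeparableConv(nn.Module):
    """Sketch of an Xception-style depthwise separable convolution: a depthwise
    3x3 conv models spatial correlations, and a pointwise 1x1 conv models
    cross-channel correlations."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```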

3.2.3. Convolutional operations


Ideas on building basic convolutional operations, such as atrous convolution [12] and deformable convolution [18], pro-
vide insights into the fundamental operation of CNNs. The motivation behind these ideas lies in constructing more generic
operations with respect to variants of receptive field and kernel size to improve the model’s generalization power [18].
Among these works, atrous convolution is proposed and applied in DeepLab series models for semantic segmentation
[12,11]. In practice, the atrous convolution is a special form of feature selection on feature maps with fixed sampling posi-
tions within a relatively large receptive field. Thus, in a later study, Dai et al. [18] proposed a deformable convolutional oper-
ation, which is a more generic form of convolution by using the following two operation steps: (1) sampling in an irregular
grid; and (2) performing a convolutional operation on the sampling positions. By stacking such deformable convolution layers, the model can perceive more context information with respect to a relatively large receptive field within an irregular grid.
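The snippet below is a minimal sketch of atrous convolution using standard PyTorch (the dilation argument of nn.Conv2d); a deformable counterpart is available, for example, as torchvision.ops.DeformConv2d, which additionally takes learned offsets.

```python
import torch
import torch.nn as nn

# Atrous (dilated) convolution: the same 3x3 kernel samples the feature map on
# a sparser fixed grid, enlarging the receptive field without extra parameters
# or loss of spatial resolution (padding = dilation keeps the size unchanged).
x = torch.randn(1, 64, 32, 32)

conv_regular = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
conv_atrous  = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(conv_regular(x).shape)  # torch.Size([1, 64, 32, 32])
print(conv_atrous(x).shape)   # same spatial size, larger receptive field
```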
Most of the above-mentioned works propose basic convolutional operations by considering context information in CNNs. Fig. 2 summarizes some variants and a generic form of context-based operations. Concretely, Jaderberg et al. [28] proposed a spatial transformer network to simulate affine transformation in image processing. Jia et al. [30] presented a dynamic filtering network for image filtering and motion simulation in video frame prediction. Fig. 2(d) illustrates a generic form of such context-related operations, where u(·) and w(·) denote the transformation functions with respect to different motivations and tasks. It shows that we can design specific CNN-based structures for simulating a series of traditional image processing techniques in low-level vision tasks, such as image filtering, image transformation, image deblurring, and noise removal.

3.3. Attention mechanism

An attention mechanism was initially proposed for neural machine translation (NMT) in natural language processing
(NLP) [2]. The attention module proposed in Ref. [2] aims at learning the alignment between the source language sentence
and the current word in the target language in a machine translation model. Intuitively, different parts of the source sentence
make different contributions to translating a current word in a target language. Such weights for each word in the source
sentence can be learned by using an attention model.
Attention mechanisms have largely influenced the development of sequence modeling tasks in computer vision, such as
image captioning, visual question answering (VQA), and video classification [10]. The motivation behind attention structures
in CNNs lies in that the diversity of feature maps can be improved by introducing such implicit attention structures into con-
volutional layers. Fig. 3 presents the generic form of attention structures. In practice, the difference of such attention struc-
tures in the literature is the specific implementation of an attention generation module. Among these structures, Fig. 3(a)
and (b) show the channel-wise and spatial-wise attention modules, respectively. Fig. 3(d)–(f) illustrate three different kinds
of combinations of spatial and channel-wise attention layers. Fig. 3(e) presents a generic form of attention module with a
residual connection. Concretely, Hu et al. [26] proposed a SE network, in which a self-gated channel-wise attention structure
is introduced for capturing channel dependence. Moreover, Roy et al. [72] proposed a concurrent structure of a spatial- and
channel-wise attention module organized into a parallel form based on the SE-block, which is termed the scSE-block. Sim-
ilarly, Park et al. [63] also presented another form of combination of spatial and channel-wise attention module, and applied
it to the bottleneck block.
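The following PyTorch-style sketch (a simplified re-implementation under our own naming, not the exact SE-block of [26]) illustrates the channel-wise attention pattern: global pooling squeezes each channel to a scalar, a bottleneck MLP produces gates in (0, 1), and the input is recalibrated channel by channel. A spatial-wise counterpart can be built analogously with a 1-channel gating map, and an scSE-style combination simply runs both branches and merges their outputs.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Sketch of a squeeze-and-excitation style channel attention module."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # squeeze: (B, C) channel descriptors
        w = self.fc(w).view(b, c, 1, 1)   # excitation: per-channel gates in (0, 1)
        return x * w                      # channel-wise feature recalibration
```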
From the perspective of the source and target with respect to an attention module, we can roughly classify attention mod-
ules into two main categories: (1) self-attention; and (2) guided attention modules. Generally, the self-attention module can
be designed and applied in the encoder part, as an implicit attention structure, for robust feature learning. However, for the
guided attention module, attention is generated from a source feature map and applied to a target for feature recalibration
and integration.
Actually, the concept of self-attention was originally proposed in NLP tasks. In practice, instead of learning to perform
inter-alignment between encoder and decoder, self-attention, which is also called intra-attention, is a special form of atten-
tion mechanism for intra-relation reasoning between tokens within both encoder and decoder parts [79]. It has been widely
used in a variety of tasks in NLP, e.g., reading comprehension, abstractive summarization, textual entailment, and learning
task-independent sentence representations [79,14]. Specifically, Vaswani et al. [79] proposed to integrate a generic form of
self-attention function into a simple but efficient network architecture, namely, the transformer, for language sequence

Fig. 2. Variants of context-based convolutional operations.


Fig. 3. Channel and spatial-wise attention layer and its variants.

modeling. By considering that the self-attention module is essentially a mapping function from a query and a set of key-
value pairs to an output, Wang et al. [87] argued that self-attention module is a special case of non-local operations in
the embedded Gaussian version. It thus relates self-attention module to the classic non-local means, and extends the
sequential self-attention to a generic space/spacetime non-local network for both image and video recognition.
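As an illustration of the self-attention (non-local) idea discussed above, the sketch below (our simplified embedded-Gaussian-style variant with an added scaling factor, not the exact block of [87]) lets every spatial position attend to every other position and adds the result back through a residual connection.

```python
import torch
import torch.nn as nn

class NonLocalAttention(nn.Module):
    """Sketch of a non-local (self-attention) block over spatial positions."""
    def __init__(self, channels, inter_channels=None):
        super().__init__()
        inter = inter_channels or channels // 2
        self.query = nn.Conv2d(channels, inter, 1)
        self.key   = nn.Conv2d(channels, inter, 1)
        self.value = nn.Conv2d(channels, inter, 1)
        self.out   = nn.Conv2d(inter, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)          # (B, HW, C')
        k = self.key(x).flatten(2)                            # (B, C', HW)
        v = self.value(x).flatten(2).transpose(1, 2)          # (B, HW, C')
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, HW)
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)   # (B, C', H, W)
        return x + self.out(y)                                # residual connection
```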
Attention modules generally play a key role to perform feature recalibration and feature enhancement [29]. Moreover, the
attention mechanism has been widely used in each part of encoder-decoder models for dense prediction tasks, and achieved
performance improvement [26,42,91,99]. However, the attention module usually requires more delicate and sophisticated feature operations. The introduced attention structure needs more computation on input tensors, and network structures for generating attention need to be designed carefully in order to avoid introducing too many parameters. On the
other hand, the attention module may increase the number of layers of the proposed model alongside the trunk branch of a
deep CNN model. A residual connection is often applied to the attention module for preserving the information flow in a
deeper neural network (see Fig. 3(c)).

3.4. Header structure for rich feature mining

In practice, a ‘‘header” can be regarded as a connection structure which aims at producing rich feature maps and mining context-aware information based on the feature maps extracted from the encoder part.
In general, a ‘‘header” component can be located at the end of the encoder, before the decoder part or dense pixel-wise classifiers. Multi-scale or multi-level pyramid feature generation modules [110,11] can be treated as representative ‘‘header” structures. Among them, PPM [110] is developed for rich high-level concept abstraction by using multi-scale pooling operations. The feature maps at the end of the encoder are pooled by using several different sizes of pooling kernels to obtain different
sub-region representations. Similarly, the ASPP module [11] aims at obtaining rich feature representations with respect to
different receptive fields by leveraging multiple atrous convolution operations with different dilation rates. The above-
mentioned two kinds of efficient ‘‘header” components were introduced for obtaining rich feature representations, but they
were developed from different perspectives based on pooling and convolutional operations, respectively. The core idea aims
at harvesting multi-level feature representation by gradually increasing the receptive field under a feature pyramid
framework.
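A minimal PyTorch-style sketch of a PPM-like ‘‘header” is given below (the bin sizes and channel widths are illustrative choices, not the exact configuration of [110]); an ASPP-style header would replace the pooling branches with parallel atrous convolutions of different dilation rates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    """Sketch of a PPM-style header: pool the encoder's last feature map to
    several grid sizes, reduce each with a 1x1 conv, upsample back, and
    concatenate with the original features before a fusion conv."""
    def __init__(self, in_ch, out_ch=512, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, in_ch // len(bins), 1, bias=False),
                          nn.ReLU(inplace=True))
            for b in bins])
        self.fuse = nn.Conv2d(in_ch * 2, out_ch, 3, padding=1)

    def forward(self, x):
        h, w = x.shape[2:]
        pyramids = [F.interpolate(stage(x), size=(h, w), mode='bilinear',
                                  align_corners=False) for stage in self.stages]
        return self.fuse(torch.cat([x] + pyramids, dim=1))
```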

Moreover, modules that combine feature reuse via dense connections, multi-level pooling, and atrous convolution operations with attention structures have also been developed. Several variants of PPM and ASPP structures have been proposed
in recent years. For example, Yang et al. [95] designed an ASPP module with dense connections, which is largely inspired by
the idea of rich feature mining in DenseNet [27]. These models bring the idea of feature enhancement by leveraging attention
mechanisms and dense connections into multi-scale feature map pooling and atrous convolution. High-level feature maps
with respect to accurate semantic meaning and context information can be learned for improving segmentation
performance.

3.5. Decoder

This section summarizes the various structures and key components applied in the decoder part. The most commonly used models aim at multi-level and multi-scale feature learning and integration. Rich hierarchical representations with different levels of visual perception can be utilized for parsing multi-level visual concepts and recovering local details [93,24,92]. In
particular, skip connection, multi-scale and multi-level structures are commonly considered for multi-scale response fusion
under a feature pyramid hierarchy.

3.5.1. Skip connection


Intuitively, skip connection can be regarded as a kind of information compensation strategy in CNNs when constructing
deeper CNN models. For clarity, we categorize the commonly used form of network structure for information compensation
into four classes, including residual block, concatenation, gated network module, and attention module with residual con-
nection (see Fig. 4).
The concept of information compensation was proposed in long short-term memory (LSTM) units for preserving informa-
tion flow with respect to long-range dependency in sequence modeling [23]. In ResNet [22], it has been demonstrated that
the loss of information flow can cause a severe degradation problem of training accuracy when constructing deeper net-
works. Skip connection was initially proposed in ResNet, which makes it possible to construct an extremely deep network.
Motivated by ResNet, Huang et al. [27] proposed a densely connected convolutional network (DenseNet) for rich feature
mining by densely connecting and concatenating multi-level feature maps. By considering the skip connection within ResNet
and DenseNet, the structure with skip connections can be reduced to a generic framework under high order RNN [13].
According to this, Chen et al. [13] presented a dual path network (DPN) to integrate both the power of rich feature mining
and feature reuse into a unified framework.
Summation and concatenation, as well as variants with attention mechanisms, are common forms of skip connections, which have been widely applied in CNN models for information compensation and feature integration. In practice, they are usually applied to connect feature maps within a convolutional layer. Moreover, they can also be utilized to connect feature maps that come from multi-level layers between the encoder and decoder parts for multi-scale or multi-level feature fusion. Additional discussions are presented in Section 3.5.2.

3.5.2. Multi-scale and multi-level structure


The downscaling of spatial resolution with respect to feature maps in each level of convolutional layers can be regarded as
a process of information abstraction, because along with feature learning for high-level abstract concepts, irrelevant infor-
mation, including low-level features, context information, local details, etc., will be abandoned to reduce the influence on the
training of an image-level classifier. However, the loss of spatial resolution can largely influence per-pixel prediction tasks for producing accurate dense prediction maps [92]. Furthermore, both high- and low-level features from multi-scale or multi-level feature maps are essential for pixel-level prediction tasks.
In practice, multi-scale or multi-level structures, including multi-stream, skip-net, single model with multiple inputs,
separated network integration, and holistically-nested architectures, have been developed and widely applied in most
pixel-level image-to-image learning tasks [93]. For clarity, we summarize eight well-known multi-scale and multi-level
structures applied in the decoder part in Fig. 5. Specifically, Fig. 5(a) and (b) are the initial decoder architectures of FCN mod-
els [58]. Fig. 5(c) shows the UNet architecture with multi-level skip connections under a symmetric form of encoder-decoder
model [71]. Fig. 5(f) and (g) illustrate holistically-nested architecture and its variant under a deeply supervised framework
with multi-scale side-way outputs [93,24]. Fig. 5(h) and (i) present the FPN-based decoder part [46,92] and its variants with
attention modules [108,42].

Fig. 4. Skip connection and its variants.


Fig. 5. Multi-level structures for feature integration.

Initially, in FCN, the decoder part is a pioneer in providing a straightforward method for utilizing multi-level feature maps
[58]. In [46], the FPN structure was introduced as a pyramid-based feature integration scheme for multi-scale RoI pooling
and feature extraction. In FPN, multi-level feature maps are integrated with high- and low-level features by utilizing up-
sampling and pixel-wise add-up operations. Moreover, FPN and its variants, with sophisticated refinement modules and
attention mechanisms, have been proposed and widely used in a number of networks for object detection, semantic segmen-
tation, etc. Concretely, Xiao et al. [46] presented a unified encoder-decoder model for multi-task learning by exploiting the
different levels of features for different tasks. In addition, Liu et al. [54] proposed a progressive dual path version of FPN for
accurate instance segmentation. A dual path with top-down and bottom-up directions of low- and high-level feature inte-
gration structure is introduced to shorten information flow within a deep CNN model. In addition, variants of FPN-based
decoders with attention modules [51,84,103,42] have also been proposed for SOD in recent years.
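The sketch below gives a minimal FPN-style decoder for SOD (our own simplified version with illustrative channel widths, not necessarily the configuration evaluated in Section 4): lateral 1x1 convolutions project the multi-level encoder features to a common width, and a top-down pass upsamples and adds higher-level maps to lower-level ones before a 1-channel saliency prediction. In a full model, the output would be bilinearly upsampled to the input resolution and passed through a sigmoid to obtain the saliency map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNDecoder(nn.Module):
    """Sketch of an FPN-style decoder producing single-channel saliency logits."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), width=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, width, 1) for c in in_channels])
        self.smooth = nn.Conv2d(width, width, 3, padding=1)
        self.predict = nn.Conv2d(width, 1, 1)           # saliency logits

    def forward(self, feats):                            # feats: low -> high level
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # top-down pass: upsample the coarser map and add it to the finer one
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[2:], mode='nearest')
        return self.predict(self.smooth(laterals[0]))    # finest-resolution logits
```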

3.6. Loss function

For different tasks, various loss functions should be considered according to different objectives. For pixel-level dense pre-
diction tasks, for instance, most of the deep CNN-based segmentation methods rely on logistic regression, optimizing cross-
entropy loss between prediction and ground-truth. On the other hand, for pixel-wise regression tasks, l2 loss is often applied
to fixation prediction [17] and crowd counting tasks [109,49]. For SOD, which can be treated as a binary classification task
[24], several types of loss functions, including binary cross entropy (BCE) loss, dice loss, and metric-based loss, have been
applied.
BCE loss, a special form of cross entropy loss function, has been widely used in binary classification tasks (including SOD
[24], edge detection [93], medical image segmentation [71], etc.) in the form of:
L_{\mathrm{BCE}} = -\sum_{j \in G} \big[ G_j \log P(G_j = 1 \mid X; W) + (1 - G_j) \log P(G_j = 0 \mid X; W) \big],  (1)
where G represents the ground-truth label list; X is the input training image; j indicates the index of pixel location; and P
represents the probability of activation of the j-th pixel in the output map.
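In PyTorch, Eq. (1) corresponds to the standard binary cross entropy; the sketch below (with illustrative tensor shapes) applies it to raw saliency logits using the numerically stable logits variant.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 1, 64, 64)                 # raw network outputs (B, 1, H, W)
gt = torch.randint(0, 2, (4, 1, 64, 64)).float()   # binary saliency masks

bce = nn.BCEWithLogitsLoss()                       # applies sigmoid internally;
loss = bce(logits, gt)                             # mean over pixels (use reduction='sum'
print(loss.item())                                 # for the summed form in Eq. (1))
```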
However, the measure of cross-entropy loss is often a poor indicator of segmentation quality [3], because the loss function penalizes the classification error at each pixel instead of directly optimizing evaluation metrics, e.g., the Jaccard index (also called the intersection-over-union (IoU) score) between produced segmentation masks and ground-truth. Thus, several metric-based loss functions have been proposed in recent literature, including dice loss for binary classification in medical image segmentation [71], Lovász-Softmax loss for semantic segmentation and binary segmentation [3], and precision-, recall-, F-measure- and mean absolute error (MAE)-based loss functions for SOD [84,17,111].
Concretely, the dice loss function is proposed in V-net [60] for volumetric medical image segmentation. It is based on the
dice coefficient for binary classification. The dice coefficient between a binary mask and a ground-truth mask can be defined
as:

D = \frac{2 \sum_{i}^{N} p_i g_i}{\sum_{i}^{N} p_i^2 + \sum_{i}^{N} g_i^2},  (2)

where p_i and g_i denote the prediction and ground-truth label of the i-th pixel, respectively. For binary classification, the dice coefficient is equivalent to the F1 score, which ranges from 0 to 1. It describes the similarity between the predicted mask and the ground-truth mask. Thus, the dice loss function can be written as L_{dice} = 1 - D.
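A direct implementation of Eq. (2) and L_{dice} = 1 - D is sketched below (a per-sample soft dice loss with a small epsilon for numerical stability; the function name and shapes are ours).

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft dice loss: 1 - 2*sum(p*g) / (sum(p^2) + sum(g^2)), per sample.
    `pred` holds per-pixel saliency probabilities in [0, 1], `target` the
    binary ground-truth mask; both are flattened per sample."""
    pred = pred.flatten(1)
    target = target.flatten(1)
    numer = 2 * (pred * target).sum(dim=1)
    denom = (pred ** 2).sum(dim=1) + (target ** 2).sum(dim=1) + eps
    return (1 - numer / denom).mean()

# usage sketch
probs = torch.sigmoid(torch.randn(4, 1, 64, 64))
gt = torch.randint(0, 2, (4, 1, 64, 64)).float()
print(dice_loss(probs, gt))
```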
Moreover, Lovász-Softmax loss [3] is proposed for directly optimizing the mean IoU (mIoU) for both foreground-background segmentation and multi-class semantic segmentation. In practice, the Jaccard index of class c can be calculated by:

J_c(P, G) = \frac{|\{P = c\} \cap \{G = c\}|}{|\{P = c\} \cup \{G = c\}|},  (3)

where P and G represent the predicted and ground-truth label lists, respectively. Thus, a corresponding loss function can be defined as L_c(P, G) = 1 - J_c(P, G). However, computing the convex closure of set functions is NP-hard. Accordingly, Berman et al. [3] constructed two kinds of piecewise linear convex surrogate losses based on the Lovász extension of submodular set functions, i.e., the Lovász hinge and Lovász-Softmax loss, in order to train deep CNN models for binary segmentation and multi-class segmentation, respectively.
In addition, a metrics-based loss function for saliency detection was also proposed for fixation prediction [17] and SOD
[84,111], respectively. Specifically, in Ref. [84], a joint form of weighted cross entropy and metrics-based loss function is
applied to SOD. The corresponding metrics include precision, recall, F-measure, and MAE. Similarly, Cornia et al. [17] pro-
posed to apply a metric-based loss function to fixation prediction, including Normalized Scan path Saliency (NSS), the Linear
Correlation Coefficient (CC), and the Kullback–Leibler Divergence (KL-Div). A network structure, which consists of a ResNet
backbone with dilation convolution and attentive ConvLSTM in the decoder part, was trained by leveraging learned fixation
maps. Zhao et al. [111] proposed a relaxed F-measure loss function to overcome the non-differentiability of the standard F-measure.
Other forms of loss functions, e.g., auxiliary loss, can be applied to shallow layers to optimize and stabilize the training
process [110]. Another form of auxiliary loss is multiple side-way output loss under a deeply supervised framework [93,24].
Moreover, in Ref. [33], an auxiliary affinity loss based on adaptive affinity field is proposed, which aims at learning pixel-wise
discriminative feature representations by leveraging adversarial training [33]. The idea of adversarial training is to avoid trivializing the affinity loss of pixel pairs in an affinity field. This is quite close to the motivation of hard example mining for training models with powerful generalization capability.
This kind of auxiliary loss function can assist in training CNN models, such that more stability and higher performance in
pixel-wise prediction tasks can be achieved. In practice, these sub-modules for calculating auxiliary loss are usually removed
at the testing stage. Moreover, additional supervision information can be generated by utilizing the pixel-wisely annotated
ground-truth, e.g., edge [97], affinity between neighbor pixels, etc. The resulting supervision information can also be used as
‘ground-truth’ to calculate auxiliary loss for training CNN models in the tasks of semantic segmentation and instance seg-
mentation [53].

4. Empirical study

In this section, we evaluate a series of encoder-decoder models and their variants for the SOD task. Seven widely used SOD benchmark datasets are utilized in our empirical analysis. To evaluate the performance, the three most widely used evaluation metrics for SOD, i.e., F-measure, MAE, and area under the curve (AUC), are applied. Experimental results quantifying the effectiveness of different components and learning strategies, including encoder backbone networks, header structures, attention modules, loss functions, and batch sizes, are presented. In addition, all of the experiments follow a cross-dataset evaluation strategy, i.e., models are trained on one dataset and evaluated on the other datasets in the following subsections. Due to space limits, we summarize the commonly utilized benchmark datasets, evaluation metrics, implementation details, hyper-parameter settings, and the analysis of batch sizes of baseline architectures in Appendices A–C.
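For reference, the sketch below (NumPy, with our own function names; the max F-measure reported in the paper additionally sweeps the binarization threshold over [0, 255]) shows how MAE and the F-measure with the conventional beta^2 = 0.3 can be computed for a single saliency map.

```python
import numpy as np

def mae(saliency_map, gt_mask):
    """Mean absolute error between a [0, 1] saliency map and the binary mask."""
    return np.abs(saliency_map - gt_mask).mean()

def f_measure(saliency_map, gt_mask, beta2=0.3, threshold=0.5):
    """F-measure at a single binarization threshold (beta^2 = 0.3 by convention)."""
    pred = (saliency_map >= threshold).astype(np.float32)
    tp = (pred * gt_mask).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / (gt_mask.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

# usage sketch with a random map and mask
sal = np.random.rand(256, 256).astype(np.float32)
gt = (np.random.rand(256, 256) > 0.7).astype(np.float32)
print(mae(sal, gt), f_measure(sal, gt))
```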

4.1. Comparison of encoder-decoder frameworks

In this section, we explored baseline encoder-decoder models with a ResNet backbone, named ‘‘BENDer” in the experimental results, by utilizing different combinations of commonly used sub-modules in the encoder and decoder
parts, including atrous convolution, PPM, ASPP, and FPN. We thus harvested a series of model variants by changing the enco-
der, header, and decoder parts. In addition, we also investigated the contributions of setting different loss functions, batch
sizes, and attention structures. Superior performance has been achieved by following such a flow of model analysis. Due to
the space limit, analysis on batch sizes of baseline architectures is summarized in the Appendix C.

4.1.1. Analysis of ‘‘BENDer” models and its variants


For a complete analysis, we evaluated encoder parts based on ResNet50 and ResNet101, with/without fixing the feature map resolution by using atrous convolution at the 1/8 scale, and decoder parts based on PPM + bilinear up-sampling, one convolutional layer with kernel size 1-by-1 (conv1x1) + bilinear up-sampling, PPM + FPN (Upernet), FPN, and ASPP + FPN (see Table 2). Concretely, ResNet50 with the setting 'atrous8' denotes that the encoder layers of the original ResNet50 are modified by using atrous convolution from the 1/8 scale of feature maps. Resolutions of feature maps will not be further down-sampled from the 1/8 scale to the final output layer. The decoder part with the 'deepsup' setting indicates that an auxiliary loss is employed before the last convolution layer during the training process (see Appendix B). In our implementation, we only evaluated the auxiliary loss on models with simple decoder parts, i.e., an upsampling operation and a 1x1 convolutional layer.
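One possible way to realize the 'atrous8' encoder setting (assuming torchvision's ResNet implementation; this is our own sketch, not necessarily the authors' code) is to replace the strides of the last two stages with dilation, which keeps the feature maps at 1/8 of the input resolution.

```python
import torch
import torchvision.models as models

# Replace the strides of the last two ResNet stages with dilation so the
# deepest features stay at output stride 8 instead of 32.
backbone = models.resnet50(pretrained=False,
                           replace_stride_with_dilation=[False, True, True])
encoder = torch.nn.Sequential(*list(backbone.children())[:-2])

x = torch.randn(1, 3, 256, 256)
print(encoder(x).shape)   # torch.Size([1, 2048, 32, 32]) -> output stride 8
```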
From the experimental results, we can observe that superior performance was achieved by using ResNet101 as the encoder backbone network together with decoder parts containing the ASPP+FPN module and the PPM+FPN module, respectively (see Table 3). A slight performance gap is found between ASPP+FPN and PPM+FPN (about 0.002 on MAE and 0.4%–0.6% on max F-measure). The ablation study of the ResNet101+FPN model demonstrates that the performance achieved by the ResNet101+PPM+FPN model is largely contributed by the FPN module. On the other hand, simple models with the encoder part ResNet101(atrous8) and the decoder part PPM (+bilinear up-sampling) can also achieve competitive results, showing only a 0.5%–1.0% performance gap with respect to max F-measure and 0.02–0.05 on MAE.

4.1.2. Analysis of loss functions


To investigate the performance by using different loss functions, we performed experiments on different choices of loss
functions, including Lovász-softmax loss [3], dice loss [60], BCE loss [24], F-measure (FM) loss [111], focal loss [47] and
online hard example mining with cross entropy (OHEM-CE) loss [73,96]. Specifically, Upernet50 (ResNet50 with PPM and
FPN) was selected as the baseline model in this part of the experiments. We trained models by using the above-mentioned loss functions. For implementation details, the training strategy for evaluating models with respect to different loss functions is based on the settings used for evaluating performance against different batch sizes in Appendix C. For a fair comparison, all the models were trained with the same hyper-parameters, e.g., setting the batch size to 16, the number of training epochs to 20, and the number of iterations in each epoch to 2,500. The DUTS-TR [80] dataset was adopted as the training set, and the other datasets were utilized as testing sets. It is worth noting that, for the different parts of the evaluation in terms of the backbone network and loss functions, we chose different backbone networks. For example, in Section 4.1.1, we mainly selected ResNet50 and ResNet101 to illustrate the contribution of different backbones in the encoder part to performance improvement, and we also chose ResNet with atrous convolutions to evaluate the performance gain. Experimental results show that the Upernet can achieve better performance in comparison with other network structures. Therefore, in Table 4, to evaluate the performance of different loss functions, we selected Upernet50, i.e., ResNet50 with the PPM and FPN structure, as the baseline model in this part of the experiment, considering both previous experimental results and the memory limitation of our GPUs.
Table 4 lists the quantitative results of baseline models trained by using different loss functions. Fig. 6 illustrates the performance of each model across all examined datasets in terms of four evaluation metrics. It shows that the models trained by using Lovász-softmax and dice loss consistently achieve the best performance. According to our observation, the models trained by utilizing Lovász-softmax loss and dice loss, which aim at optimizing the Jaccard index and F1 score, respectively, can produce saliency maps which are quite similar to the ground-truth; that is, the saliency values predicted by these models are largely distributed around 0 or 1. Therefore, the predicted saliency maps show more stable performance when varying the binary threshold from 0 to 255.
On the other hand, OHEM-CE loss can produce higher performance in terms of mean and max F-measure on DUT-OMRON
and MSRA10K datasets, but relatively lower performance on the other datasets in comparison with BCE loss. However, the baseline model with F-measure loss produced relatively lower performance on max F-measure and MAE. Interestingly, the model

Table 2
Structures of ‘BENDer’ models for SOD.

ID Encoder Header Decoder Loss function


BENDer#1 ResNet50(atrous8) PPM upsample BCE
BENDer#2 ResNet50(atrous8) PPM upsample BCE+deepsup
BENDer#3 ResNet50(atrous8) – conv1x1+upsample BCE
BENDer#4 ResNet50(atrous8) – conv1x1+upsample BCE+deepsup
BENDer#5 ResNet101(atrous8) PPM upsample BCE+deepsup
BENDer#6 ResNet50 PPM FPN BCE
BENDer#7 ResNet50 ASPP FPN BCE
BENDer#8 ResNet101 ASPP FPN BCE
BENDer#9 ResNet101 PPM FPN BCE
BENDer#10 ResNet101 – FPN BCE
BENDer#11 ResNet50 PPM FPN Lovász loss
BENDer#12 ResNet101 PPM FPN Lovász loss

Table 3
Quantitative results for evaluating ‘BENDer’ models. (Upward arrows indicate that higher values are better; downward arrows indicate that lower values are better.)

            BENDer#1 BENDer#2 BENDer#3 BENDer#4 BENDer#5 BENDer#6 BENDer#7 BENDer#8 BENDer#9 BENDer#10 BENDer#11 BENDer#12
ECSSD
  AUC↑      0.9880 0.9854 0.9834 0.9821 0.9855 0.9887 0.9854 0.9887 0.9878 0.9865 0.9670 0.9655
  MeanF↑    0.9068 0.9144 0.9036 0.9025 0.9202 0.9178 0.9165 0.9259 0.9198 0.9188 0.9302 0.9338
  MaxF↑     0.9355 0.9386 0.9275 0.9266 0.9437 0.9435 0.9413 0.9502 0.9441 0.9433 0.9394 0.9418
  MAE↓      0.0541 0.0509 0.0557 0.0583 0.0493 0.0467 0.0502 0.0456 0.0472 0.0485 0.0435 0.0422
HKU-IS
  AUC↑      0.9857 0.9846 0.9810 0.9805 0.9799 0.9876 0.9876 0.9848 0.9879 0.9811 0.9672 0.9662
  MeanF↑    0.8882 0.8983 0.8877 0.8878 0.9015 0.9019 0.9020 0.9063 0.9054 0.9008 0.9217 0.9283
  MaxF↑     0.9250 0.9281 0.9169 0.9177 0.9311 0.9331 0.9330 0.9360 0.9366 0.9324 0.9320 0.9370
  MAE↓      0.0468 0.0425 0.0453 0.0461 0.0424 0.0403 0.0402 0.0401 0.0389 0.0419 0.0326 0.0309
PASCAL-S
  AUC↑      0.9675 0.9662 0.9606 0.9599 0.9599 0.9686 0.9679 0.9654 0.9678 0.9613 0.9420 0.9413
  MeanF↑    0.8383 0.8446 0.8312 0.8334 0.8334 0.8440 0.8450 0.8524 0.8454 0.8506 0.8517 0.8559
  MaxF↑     0.8642 0.8686 0.8550 0.8580 0.8580 0.8694 0.8712 0.8756 0.8713 0.8754 0.8645 0.8671
  MAE↓      0.0787 0.0744 0.0798 0.0801 0.0801 0.0754 0.0741 0.0724 0.0742 0.0734 0.0714 0.0688
DUT-OMRON
  AUC↑      0.9492 0.9386 0.9235 0.9206 0.9211 0.9532 0.9528 0.9421 0.9561 0.9236 0.9240 0.9274
  MeanF↑    0.7508 0.7616 0.7246 0.7274 0.7700 0.7736 0.7722 0.7859 0.7836 0.7689 0.7895 0.8120
  MaxF↑     0.7908 0.7937 0.7573 0.7627 0.8037 0.8092 0.8072 0.8194 0.8189 0.8065 0.8073 0.8257
  MAE↓      0.0673 0.0644 0.0746 0.0755 0.0613 0.0600 0.0603 0.0571 0.0582 0.0607 0.0559 0.0525
DUTS-TR (Training)
  AUC↑      0.9985 0.9989 0.9990 0.9989 0.9989 0.9989 0.9991 0.9991 0.9991 0.9990 0.9897 0.9883
  MeanF↑    0.9584 0.9661 0.9648 0.9646 0.9665 0.9657 0.9666 0.9678 0.9658 0.9662 0.9755 0.9745
  MaxF↑     0.9784 0.9817 0.9816 0.9814 0.9825 0.9831 0.9840 0.9843 0.9830 0.9836 0.9807 0.9795
  MAE↓      0.0252 0.0210 0.0215 0.0219 0.0207 0.0199 0.0196 0.0194 0.0198 0.0200 0.0131 0.0138
DUTS-TE
  AUC↑      0.9773 0.9749 0.9674 0.9666 0.9654 0.9814 0.9808 0.9767 0.9815 0.9683 0.9550 0.9511
  MeanF↑    0.8217 0.8358 0.8124 0.8154 0.8438 0.8411 0.8411 0.8575 0.8492 0.8445 0.8578 0.8734
  MaxF↑     0.8629 0.8698 0.8478 0.8510 0.8803 0.8803 0.8805 0.8951 0.8896 0.8843 0.8775 0.8882
  MAE↓      0.0529 0.0480 0.0540 0.0535 0.0469 0.0459 0.0459 0.0435 0.0448 0.0461 0.0415 0.0395
MSRA10K
  AUC↑      0.9769 0.9726 0.9702 0.9678 0.9710 0.9785 0.9786 0.9768 0.9806 0.9734 0.9618 0.9629
  MeanF↑    0.8848 0.8855 0.8782 0.8741 0.8973 0.8967 0.8967 0.9027 0.9034 0.8962 0.9104 0.9145
  MaxF↑     0.9112 0.9075 0.9006 0.8964 0.9174 0.9190 0.9188 0.9235 0.9258 0.9184 0.9192 0.9234
  MAE↓      0.0628 0.0631 0.0657 0.0694 0.0576 0.0559 0.0557 0.0539 0.0521 0.0567 0.0490 0.0465

Bold values in the original indicate the best results.



Table 4
Quantitative results for evaluating baseline models trained by using different loss functions. (The upward arrow means higher values represent better results,
while downward arrow denotes lower values indicate better performance.)

Lovász loss Dice loss BCE loss FM loss Focal loss OHEM-CE
ECSSD AUC" 0.9749 0.9751 0.9867 0.9242 0.9917 0.976
MeanF" 0.9364 0.9372 0.9233 0.9078 0.8305 0.9276
MaxF" 0.9456 0.9461 0.9441 0.9209 0.9362 0.9406
MAE# 0.0379 0.0379 0.0449 0.0614 0.0844 0.0443
HKU-IS AUC" 0.9698 0.9703 0.9863 0.9037 0.9916 0.9728
MeanF" 0.9272 0.927 0.9082 0.8886 0.8016 0.9137
MaxF" 0.9363 0.9365 0.9341 0.9065 0.9269 0.9271
MAE# 0.0305 0.0307 0.0378 0.0554 0.0752 0.0365
PASCAL-S AUC" 0.9532 0.9546 0.9678 0.9043 0.9766 0.9475
MeanF" 0.8639 0.8631 0.853 0.8433 0.7713 0.8438
MaxF" 0.876 0.8755 0.8756 0.8553 0.8688 0.8584
MAE# 0.0643 0.066 0.07 0.0852 0.104 0.0774
DUT-OMRON AUC" 0.9243 0.9275 0.9459 0.8718 0.9615 0.9326
MeanF" 0.7925 0.7942 0.7724 0.7671 0.6638 0.7849
MaxF" 0.8065 0.8117 0.8 0.7785 0.7849 0.8018
MAE# 0.0552 0.0565 0.06 0.0652 0.0958 0.0572
DUTS-TR (Training) AUC" 0.9916 0.9917 0.9991 0.9386 0.9991 0.9937
MeanF" 0.9809 0.9801 0.9684 0.9483 0.8994 0.971
MaxF" 0.9853 0.9847 0.9847 0.9634 0.9788 0.9784
MAE# 0.0107 0.0114 0.0188 0.0426 0.0551 0.0159
DUTS-TE AUC" 0.9595 0.9609 0.9796 0.8864 0.9857 0.9592
MeanF" 0.8751 0.8741 0.8503 0.8293 0.7313 0.8472
MaxF" 0.8926 0.8925 0.8826 0.8407 0.866 0.8667
MAE# 0.0364 0.0375 0.0427 0.0549 0.0769 0.0454
MSRA10K AUC" 0.961 0.9638 0.9753 0.9124 0.9786 0.9723
MeanF" 0.9075 0.9097 0.8949 0.8881 0.8073 0.9116
MaxF" 0.9165 0.9195 0.9139 0.8984 0.9033 0.9223
MAE# 0.0503 0.0488 0.0565 0.0677 0.0952 0.0483

Bold values indicate the best results.
Fig. 6. Overall performance of baseline models trained by utilizing different loss functions.

trained by using focal loss achieved unstable performance in terms of mean and max F-measure, but relatively high performance in terms of AUC score over all testing sets. For clarity, we also provide comparative PR curves in Fig. 7. According to our observations, the precision and recall points tend to be distributed within a small range when the models are trained with metric-based loss functions, which indicates that the produced saliency maps are not sensitive to the binary thresholds used to compute the saliency masks. As a result, the corresponding AUC scores may be lower than those of other models. In contrast, saliency maps that are sensitive to binary thresholds produce relatively smooth PR curves, in which case higher AUC scores can be achieved. A visual comparison of saliency maps produced by baseline models trained with different loss functions can be found in Appendix D.

Fig. 7. PR curves of baseline models using different loss functions.
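To make the threshold sensitivity discussed above concrete, the following minimal sketch shows how PR points, mean/max F-measure, and MAE can be computed from a single predicted saliency map by sweeping binary thresholds. It is an illustrative sketch written for this review rather than the evaluation code of [4]; the number of thresholds, the choice of beta^2 = 0.3, and the definition of MeanF as the average over all thresholds are assumptions that may differ from the benchmark toolbox.

```python
import numpy as np

def pr_curve(pred, gt, num_thresholds=256, beta2=0.3):
    """pred: saliency map in [0, 1]; gt: ground-truth mask (same shape)."""
    gt = gt > 0.5
    precisions, recalls, fmeasures = [], [], []
    for t in np.linspace(0.0, 1.0, num_thresholds, endpoint=False):
        mask = pred > t                                  # binary saliency mask at threshold t
        tp = np.logical_and(mask, gt).sum()
        p = tp / (mask.sum() + 1e-8)                     # precision
        r = tp / (gt.sum() + 1e-8)                       # recall
        precisions.append(p)
        recalls.append(r)
        fmeasures.append((1 + beta2) * p * r / (beta2 * p + r + 1e-8))
    return np.array(precisions), np.array(recalls), np.array(fmeasures)

def evaluate(pred, gt):
    _, _, f = pr_curve(pred, gt)
    return {"MaxF": f.max(),                             # best F-measure over all thresholds
            "MeanF": f.mean(),                           # average F-measure over thresholds
            "MAE": np.abs(pred - (gt > 0.5)).mean()}     # mean absolute error
```

Under this view, a model whose PR points cluster in a small region of the PR plane yields a short PR curve and hence a lower AUC, even though its MaxF and MAE can still be excellent.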

4.1.3. Analysis of attention modules


Considering both the rationale for and the feasibility of incorporating an attention mechanism into a baseline model, two representative attention structures were involved in our experiments. Concretely, SE attention [26] and the self-attention structure [87,100] were incorporated into different parts of the encoder-decoder models. Fig. 8 illustrates eight kinds of attention blocks obtained by applying attention structures to different parts of a given encoder-decoder model. Among them, Fig. 8(a) illustrates the SE-ResNet block exploited in a backbone network, namely SE-ResNet50. Fig. 8(b) and (c) show the PPM and FPN with SE attention structures, respectively. Owing to its computational complexity, the self-attention module was only applied to the high-level feature maps produced by a PPM for modeling pixel-wise relations (see Fig. 8(d)). Fig. 8(e)–(h) show four different feature fusion strategies that use the SE attention structure under the FPN framework in the decoder part. In Fig. 8(e) and (f), SE attention modules are applied to the high- and low-level branches, respectively, to perform feature recalibration. Fig. 8(g) shows a structure that performs feature recalibration after high- and low-level feature integration. Fig. 8(h) shows a guided SE attention module that recalibrates low-level features according to attention weights produced from high-level feature maps.
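As a concrete illustration of these building blocks, the following PyTorch sketch implements a plain SE block (channel recalibration, as in Fig. 8(a)–(c)) and a guided variant in the spirit of Fig. 8(h), in which channel weights computed from the high-level feature recalibrate the low-level feature before top-down fusion. It is a simplified sketch: the reduction ratio, the assumption that both branches share the same channel width, and fusion by element-wise addition are illustrative choices rather than the exact configurations of our models.

```python
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global pooling -> bottleneck MLP -> channel-wise rescaling."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))                  # squeeze: global average pooling
        return x * w.unsqueeze(-1).unsqueeze(-1)         # excitation: recalibrate channels

class GuidedSEFusion(nn.Module):
    """Guided SE fusion: high-level features produce the weights that gate low-level features."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, low, high):                        # both assumed to have `channels` channels
        w = self.fc(high.mean(dim=(2, 3)))               # attention weights from the high-level map
        low = low * w.unsqueeze(-1).unsqueeze(-1)        # recalibrate the low-level feature
        high = F.interpolate(high, size=low.shape[2:],
                             mode='bilinear', align_corners=False)
        return low + high                                # top-down fusion, as in an FPN branch
```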
Considering different compositions of the above-mentioned attention structures, as well as the feasibility of adopting an attention module, ten kinds of models with attention mechanisms designed with different motivations were implemented. For clarity, all of the model structures involved in this part of the experiments on attention mechanisms are listed in Table 5, which gives the specific settings of the different parts of an encoder-decoder model with the attention structures corresponding to Fig. 8. All models were trained with the Lovász-softmax loss and a batch size of 16 on the DUTS-TR [80] dataset; the other datasets were used for testing. In this section, we selected SE-ResNet50, which contains a widely used attention structure for improving the generalization capacity of the backbone part. In Tables 5 and 6, since the attention structure can be applied to different parts of an encoder-decoder model, SE blocks are also applied in the "header" structure and the decoder part for a thorough comparison. Another reason for selecting SE-ResNet50 as the backbone network of the baseline models is that the effectiveness of the SE block has been proven and pre-trained models are available, which allows a fair comparison with other baselines by loading a pre-trained model for training. During the training process, we loaded the weights of the backbone networks, i.e., ResNet50 and SE-ResNet50, pre-trained on the ImageNet dataset.
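The sketch below illustrates this pre-trained initialization step for the ResNet50 case using the torchvision model zoo; SE-ResNet50 weights would be loaded analogously from a third-party model zoo (an assumption, since torchvision does not ship SE-ResNet50). Only the encoder is initialized from ImageNet; the PPM/FPN decoder is trained from random initialization.

```python
import torch.nn as nn
from torchvision import models

# Encoder: ResNet50 pre-trained on ImageNet; the classification head is discarded.
backbone = models.resnet50(pretrained=True)
stem_and_stages = list(backbone.children())[:-2]        # drop the average pooling and fc layers
encoder = nn.Sequential(*stem_and_stages)

# In practice, an FPN decoder taps the outputs of the four residual stages rather than a
# single Sequential; the decoder weights are randomly initialized and the whole model is
# then trained end to end (here, with the Lovász-softmax loss and a batch size of 16).
```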
Quantitative results are listed in Table 6. To visually illustrate the performance of each model with different attention structures, Fig. 9 summarizes the comparative results in terms of the evaluation metrics on the six testing datasets. It shows that the performance can be further improved by introducing attention modules.
Fig. 8. Illustration of different structures with commonly used attention modules in each part of an encoder-decoder model.

Table 5
Model structures for attention mechanism analysis.

Abbr. Encoder Header Decoder


None ResNet50 PPM FPN
w/ d ResNet50 PPM w/ (d) self-attention FPN
w/ d&ds ResNet50 PPM w/ (d) & deeply supervised FPN
w/ b ResNet50 PPM w/ (b) SE-attention FPN
w/ a SE-ResNet50 (a) PPM FPN
w/ a + b SE-ResNet50 (a) PPM w/ (b) SE-attention FPN
w/ a + e SE-ResNet50 (a) PPM FPN w/ (e)
w/ a + g SE-ResNet50 (a) PPM FPN w/ (g)
w/ a + h SE-ResNet50 (a) PPM FPN w/ (h)
w/ a + f SE-ResNet50 (a) PPM FPN w/ (f)
w/ a + b + g SE-ResNet50 (a) PPM w/ (b) SE-attention FPN w/ (g)
w/ a + c SE-ResNet50 (a) PPM FPN w/ (c)

Specifically, for models with ResNet50 as the encoder backbone, the variants with self-attention (w/ d) and with self-attention plus an auxiliary loss (w/ d&ds) show performance comparable to that of the baseline model without any attention module (None). A performance improvement can be observed on the DUT-OMRON and MSRA10K datasets when an auxiliary loss is introduced after the PPM with the self-attention module (see Figs. 8(d) and 9). On the other hand, introducing a PPM with SE-attention structures into the baseline model (i.e., Upernet50) did not yield a consistent performance improvement.
Interestingly, the backbone network with SE-ResNet blocks consistently achieved a 0.13%–2.43% improvement in terms of max F-measure on the six benchmark datasets. Moreover, for models with SE-ResNet50 as the encoder backbone, the quantitative results show that performance can be further improved by introducing extra attention modules in the decoder part, but the additional gains are not significant compared with those brought by the SE-ResNet blocks themselves. Regarding the effect of introducing attention modules into different parts, SE attention structures were integrated into the encoder backbone (w/ a), the header (w/ b), and the FPN decoder (w/ a + e and w/ a + g); the results show that a backbone with SE-ResNet blocks further improves performance. Moreover, when implementing different types of attention structures, we observe that the self-attention module, with or without an auxiliary loss under a deeply supervised framework, produces a relatively higher improvement than the SE attention structure in the header (w/ b). In contrast, some attention placements may even degrade performance slightly.

Table 6
Quantitative results for evaluating baseline models with different attention structures (higher values are better for AUC, MeanF, and MaxF; lower values are better for MAE).
None w/ d w/d&ds w/ b w/ a w/ a + b w/ a + e w/ a + g w/ a + h w/ a + f w/ a + b + g w/ a + c
ECSSD AUC" 0.9749 0.9722 0.9739 0.9739 0.9789 0.9775 0.9762 0.9771 0.9777 0.9756 0.9783 0.9748
MeanF" 0.9364 0.9387 0.9395 0.9371 0.9414 0.9413 0.9385 0.9406 0.9373 0.9358 0.9421 0.9380
MaxF" 0.9456 0.9471 0.9480 0.9454 0.9503 0.9500 0.9472 0.9490 0.9458 0.9447 0.9507 0.9465
MAE# 0.0379 0.0384 0.0362 0.0378 0.0333 0.0344 0.0361 0.0346 0.0357 0.0372 0.0336 0.0366
HKU-IS AUC" 0.9698 0.9687 0.9701 0.9698 0.9739 0.9742 0.9712 0.9734 0.9726 0.9717 0.9739 0.9718
MeanF" 0.9272 0.9286 0.9277 0.9268 0.9327 0.9318 0.9287 0.9310 0.9301 0.9291 0.9324 0.9296
MaxF" 0.9363 0.9374 0.9367 0.9360 0.9421 0.9410 0.9381 0.9405 0.9395 0.9386 0.9418 0.9387
MAE# 0.0305 0.0293 0.0299 0.0307 0.0277 0.0277 0.0292 0.0284 0.0287 0.0299 0.0277 0.0291
PASCAL-S AUC" 0.9532 0.9479 0.9495 0.9521 0.9526 0.9552 0.9489 0.9531 0.9515 0.9503 0.9545 0.9497
MeanF" 0.8639 0.8647 0.8636 0.8627 0.8639 0.8689 0.8583 0.8621 0.8611 0.8605 0.8642 0.8609
MaxF" 0.8760 0.8772 0.8758 0.8754 0.8773 0.8813 0.8706 0.8759 0.8747 0.8722 0.8779 0.8745
MAE# 0.0643 0.0661 0.0667 0.0654 0.0645 0.0612 0.0676 0.0644 0.0670 0.0670 0.0630 0.0673
DUT-OMRON AUC" 0.9243 0.9218 0.9434 0.9255 0.9401 0.9373 0.9397 0.9394 0.9412 0.9401 0.9374 0.9394
MeanF" 0.7925 0.7978 0.7987 0.7928 0.8129 0.8152 0.8129 0.8144 0.8147 0.8131 0.8113 0.8120
MaxF" 0.8065 0.8113 0.8203 0.8088 0.8308 0.8320 0.8311 0.8337 0.8328 0.8323 0.8303 0.8296
MAE# 0.0552 0.0540 0.0565 0.0556 0.0500 0.0486 0.0502 0.0496 0.0502 0.0490 0.0494 0.0505
DUTS-TR (Training) AUC" 0.9916 0.9914 0.9907 0.9914 0.9914 0.9915 0.9907 0.9912 0.9912 0.9912 0.9915 0.9907
MeanF" 0.9809 0.9802 0.9785 0.9810 0.9804 0.9805 0.9793 0.9800 0.9794 0.9793 0.9805 0.9796
MaxF" 0.9853 0.9847 0.9831 0.9855 0.9850 0.9851 0.9840 0.9845 0.9841 0.9840 0.9850 0.9841
MAE# 0.0107 0.0109 0.0117 0.0106 0.0108 0.0107 0.0114 0.0110 0.0113 0.0114 0.0108 0.0114
DUTS-TE AUC" 0.9595 0.9571 0.9626 0.9588 0.9650 0.9644 0.9617 0.9642 0.9613 0.9621 0.9650 0.9618
MeanF" 0.8751 0.8748 0.8691 0.8722 0.8812 0.8818 0.8750 0.8778 0.8739 0.8767 0.8798 0.8763
MaxF" 0.8926 0.8919 0.8895 0.8902 0.8995 0.8990 0.8929 0.8966 0.8918 0.8937 0.8976 0.8928
MAE# 0.0364 0.0364 0.0384 0.0369 0.0349 0.0338 0.0363 0.0351 0.0370 0.0362 0.0344 0.0362
MSRA10K AUC" 0.9610 0.9597 0.9650 0.9610 0.9666 0.9651 0.9667 0.9663 0.9662 0.9668 0.9661 0.9664
MeanF" 0.9075 0.9109 0.9148 0.9087 0.9194 0.9183 0.9207 0.9194 0.9193 0.9203 0.9193 0.9212
MaxF" 0.9165 0.9190 0.9247 0.9177 0.9289 0.9271 0.9297 0.9289 0.9281 0.9295 0.9282 0.9297
MAE# 0.0503 0.0482 0.0455 0.0496 0.0431 0.0441 0.0426 0.0434 0.0436 0.0432 0.0433 0.0427

Bold values indicate the best results.
Fig. 9. Comparative results of baseline models against different attention mechanisms.

According to our experiments, the encoder backbone network plays a key role in performance enhancement. Setting different types of attention structures for feature recalibration in the PPM, or in the top-down feature fusion path of the FPN, does not consistently improve performance. A visual comparison of saliency maps produced by baseline models with different attention structures can be found in Appendix E.

4.2. Comparison of state-of-the-art SOD models

To evaluate the performance of state-of-the-art models for SOD, we compared the most recent works. In Table 7, we compare 14 state-of-the-art deep CNN-based models, namely, DCL [39], DS [44], DSS [24], ELD [37], MDF [38], MC [112], Amulet [105], UCF [106], SRM [81], ASNet [84], BiMCFEM [103], BASNet [67], PFA [113], and PiCANet [51], on six benchmark datasets. For BiMCFEM and BASNet, the saliency maps were provided by the authors for four benchmark datasets.
Table 7
Comparison of state-of-the-art deep CNN-based models (higher values are better for AUC, MeanF, and MaxF; lower values are better for MAE).
DCL DS DSS ELD MDF MC Amulet UCF SRM ASNet BiMCFEM BASNet PFA PiCA-R PiCA-RC w/a w/a + b
ECSSD AUC" 0.9743 0.9846 0.9704 0.9573 0.9381 0.8343 0.9799 0.9826 0.9819 0.9873 0.9811 0.9666 0.9811 0.9891 0.9671 0.9789 0.9775
MeanF" 0.8479 0.8354 0.8811 0.8312 0.7321 0.8115 0.8824 0.8522 0.8962 0.8985 0.9001 0.9272 0.8956 0.9001 0.9299 0.9414 0.9413
MaxF" 0.8958 0.8999 0.9062 0.8654 0.8075 0.8135 0.9127 0.9081 0.9159 0.9320 0.9284 0.9425 0.9220 0.9317 0.9371 0.9503 0.9500
MAE# 0.0800 0.0803 0.0647 0.0809 0.1376 0.0965 0.0607 0.0797 0.0564 0.0468 0.0446 0.0370 0.0449 0.0484 0.0370 0.0333 0.0344
HKU-IS AUC" 0.9800 0.9816 0.9761 0.9562 0.9481 0.8220 0.9829 0.9839 0.9831 0.9858 0.9808 0.9634 0.9859 0.9894 0.9648 0.9739 0.9742
MeanF" 0.8346 0.7935 0.8676 0.7920 0.7263 0.7626 0.8582 0.8341 0.8787 0.8832 0.8884 0.9088 0.8968 0.8796 0.9194 0.9327 0.9318
MaxF" 0.8903 0.8662 0.8984 0.8377 0.8066 0.7647 0.8974 0.8877 0.9031 0.9217 0.9207 0.9269 0.9261 0.9185 0.9281 0.9421 0.9410
MAE# 0.0637 0.0781 0.0509 0.0742 0.1148 0.0898 0.0507 0.0620 0.0469 0.0417 0.0387 0.0330 0.0326 0.0433 0.0308 0.0277 0.0277
PASCAL-S AUC" 0.9532 0.9670 0.9338 0.9275 0.9155 0.7777 0.9565 0.9586 0.9599 0.9746 0.9540 0.9329 0.9684 0.9734 0.9400 0.9526 0.9552
MeanF" 0.7621 0.7574 0.7952 0.7379 0.6609 0.7057 0.7871 0.7510 0.8154 0.8386 0.8199 0.8343 0.8310 0.8218 0.8498 0.8639 0.8689
MaxF" 0.8058 0.8282 0.8193 0.7684 0.7285 0.7074 0.8287 0.8179 0.8386 0.8743 0.8501 0.8539 0.8696 0.8573 0.8615 0.8773 0.8813
MAE# 0.1141 0.1080 0.1023 0.1215 0.1635 0.1397 0.1002 0.1278 0.0841 0.0668 0.0737 0.0758 0.0655 0.0756 0.0642 0.0645 0.0612
DUT-OMRON AUC" 0.9341 0.9681 0.9283 0.9316 0.9239 0.8060 0.9497 0.9459 0.9454 0.9782 0.9238 0.9263 0.9719 0.9630 0.9138 0.9401 0.9373
MeanF" 0.6907 0.6865 0.7272 0.6611 0.6151 0.6588 0.6932 0.6628 0.7441 0.8208 0.7450 0.7905 0.8185 0.7619 0.8084 0.8129 0.8152
MaxF" 0.7333 0.7734 0.7604 0.7164 0.6795 0.6606 0.7429 0.7297 0.7690 0.8615 0.7742 0.8053 0.8565 0.8029 0.8183 0.8308 0.8320
MAE# 0.0949 0.0843 0.0745 0.0923 0.1147 0.0889 0.0976 0.1204 0.0694 0.0411 0.0636 0.0565 0.0415 0.0653 0.0543 0.0500 0.0486
DUTS-TE AUC" 0.9583 0.9683 0.9573 0.9356 0.9327 0.7871 0.9620 0.9610 0.9692 0.9668 0.9694 0.9460 0.9752 0.9821 0.9444 0.9650 0.9644
MeanF" 0.7276 0.6937 0.7730 0.6793 0.6323 0.6477 0.7263 0.6866 0.7960 0.7906 0.8134 0.8420 0.8292 0.8139 0.8587 0.8812 0.8818
MaxF" 0.7857 0.7757 0.8131 0.7368 0.7090 0.6495 0.7784 0.7710 0.8263 0.8354 0.8515 0.8595 0.8708 0.8598 0.8691 0.8995 0.8990
MAE# 0.0819 0.0901 0.0651 0.0923 0.1139 0.1004 0.0851 0.1173 0.0587 0.0607 0.0490 0.0476 0.0409 0.0506 0.0404 0.0349 0.0338
MSRA10K AUC" 0.9837 0.9900 0.9784 0.9880 0.9737 0.9173 0.9983 0.9959 0.9788 0.9947 – 0.9679 – 0.9819 0.9629 0.9666 0.9651
MeanF" 0.8800 0.8543 0.9031 0.9195 0.8264 0.8978 0.9435 0.9104 0.8864 0.9248 – 0.9112 – 0.8865 0.9111 0.9194 0.9183
MaxF" 0.9174 0.9155 0.9256 0.9450 0.8808 0.9003 0.9687 0.9524 0.9069 0.9552 – 0.9278 – 0.9165 0.9221 0.9289 0.9271
MAE# 0.0602 0.0655 0.0470 0.0293 0.0906 0.0422 0.0222 0.0389 0.0548 0.0306 – 0.0406 – 0.0564 0.0461 0.0431 0.0441

Bold values indicate the best results.
The released codes of the other methods were adopted, and the trained models were directly utilized for saliency map generation. Then, the evaluation codes provided by [4] were adopted to calculate the evaluation metrics for fully quantifying the experimental results. Quantitative results in italics in Table 7 denote that performance is evaluated on the training set of the corresponding methods, i.e., ELD, Amulet, UCF, and ASNet. PiCA-R denotes PiCANet with a ResNet50 backbone, and PiCA-RC refers to the PiCA-R model with dense CRF inference as post-processing. For clarity, the best results achieved by our trained baseline models and the other state-of-the-art methods are marked in bold.
The quantitative results in Table 7 show that the SE-ResNet50 (w/ a) and SE-ResNet50 with PPM+SE attention (w/ a + b) models, trained with the Lovász loss and a batch size of 16, outperform the other state-of-the-art deep CNN models with respect to max F-measure on five benchmark datasets, the exception being DUT-OMRON. Specifically, the trained SE-ResNet50 (w/ a) model yields a 0.69% to 3.04% improvement in terms of maximum F-measure over the state-of-the-art DL model, PiCA-RC, across the six benchmark datasets. Moreover, the trained SE-ResNet50 with PPM+SE attention (w/ a + b) model also yields a 0.5% to 2.99% gain in max F-measure compared with PiCA-RC. A qualitative comparison of saliency maps produced by the baseline models and state-of-the-art CNN-based SOD models is provided in Appendix F.

4.3. Extension study on video-based SOD

In this section, we present an extension study that analyzes the performance of our trained baseline models on three video-based SOD benchmark datasets. A brief review of related work on video SOD is first outlined, and experimental results on both image- and video-based state-of-the-art SOD methods are then presented.

4.3.1. Overview of video-based SOD methods


Traditional methods attempt to treat the video SOD task as an extension of saliency detection in static images [55,7,56]. For example, Liu et al. [55] introduced a CRF model for learning to detect a salient object in images and videos. Chen et al. [7] proposed to model saliency coherency under a low-rank analysis framework, establishing a cross-frame super-pixel low-rank coherency correspondence for intra-batch saliency diffusion. Furthermore, Liu et al. [56] developed a spatiotemporal propagation (SGSP) method based on a superpixel-level graph.
The core ideas of recent works on video SOD concentrate on modeling spatial–temporal information to preserve inter-frame saliency detection accuracy and consistency. Specifically, Wang et al. [86] developed a two-stage FCN-based model: the first stage models static saliency, and the second stage produces spatial–temporal saliency by taking two consecutive frames and static saliency cues as network input. Subsequently, Wang et al. [85] proposed an RNN-based framework for video saliency detection and fixation heat-map prediction, in which a ConvLSTM module is utilized for spatial–temporal modeling over a fixed-length clip of video frames. In addition, Song et al. [75] introduced a deeper bidirectional ConvLSTM framework to implicitly characterize the long- and short-term saliency dependencies of video sequences, with a pyramid dilated convolution (PDC) module designed for extracting spatial features at multiple scales.
In this part of the experiment, the image-based models are directly applied to produce a saliency mask for each video frame, without using any video data for additional training.
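The sketch below makes this per-frame protocol explicit: an image-trained model is run independently on every frame, with no temporal module and no fine-tuning on video data. The single-channel-logit output interface and the preprocessing function are assumptions made for illustration, not the interface of any specific model above.

```python
import torch

@torch.no_grad()
def predict_video(model, frames, preprocess):
    """frames: iterable of H x W x 3 images; returns one saliency map per frame."""
    model.eval()
    saliency_maps = []
    for frame in frames:
        x = preprocess(frame).unsqueeze(0)               # 1 x 3 x H x W network input
        prob = torch.sigmoid(model(x))[0, 0]             # assumed single-channel logit output
        saliency_maps.append(prob.cpu().numpy())         # no information is shared across frames
    return saliency_maps
```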

4.3.2. Experimental results on video-based SOD


To evaluate the performance of the new baseline models, an additional experiment was performed on three video-based SOD benchmark datasets, i.e., DAVIS [65], UVSD [56], and VOS [43]. Specifically, the DAVIS dataset contains 50 high-resolution videos with high-quality pixel-wise ground-truth annotations for each frame. UVSD is a relatively challenging dataset for video salient object detection: it consists of 18 unconstrained videos with cluttered backgrounds and lower resolution, and a pixel-wise annotated ground-truth mask is provided for each frame; motion blur and complex scene contexts add to its difficulty. VOS is the largest dataset constructed for video SOD. It contains 200 videos with 116,103 frames in total, with per-video lengths ranging from 71 to 2,249 frames; among these, 7,467 frames are annotated with pixel-wise binary saliency masks.
We compared 18 state-of-the-art SOD methods, including wCtr [114], GMR [94], BSCA [68], DRFI [32], MB+ [101], DCL [39], DHSNet [50], ELD [37], MDF [38], MC [112], Amulet [105], UCF [106], SRM [81], DSS [24], SGSP [56], VFCN [86], SSA [43], and PDB [75]. Among these methods, there are five image-based non-DL SOD methods, nine image-based DL models, and four video-based methods (see Table 8). For the image-based non-DL methods, the codes and parameter settings provided by the authors were adopted directly to produce saliency maps for each frame. For the image-based DL methods, the trained models and codes provided by the authors were utilized to generate saliency maps without any extra fine-tuning or training on video data.
For performance evaluation, we assessed the overall performance in terms of maximum F-measure and MAE by directly computing the corresponding metrics on all annotated keyframes. The evaluation process is similar to that for image-based salient object detection described in Sections 4.1 and 4.2, treating all annotated frames as testing images in a dataset.
The quantitative results shown in Table 8 suggest that the newly discovered baseline models in this empirical study deliver promising performance on the three video-based benchmark datasets. Concretely, the BENDer#12 model exhibits relatively strong generalization capability, bringing 6.24%, 2.24%, and 3.81% gains in max F-measure in comparison with the state-of-the-art image-based DL models on the three benchmark datasets, respectively.
Table 8
Comparison of state-of-the-art SOD models on video datasets (higher values are better for AUC, MeanF, and MaxF; lower values are better for MAE).
DAVIS2016 (1080p) UVSD VOS


AUC" MeanF" MaxF" MAE# AUC" MeanF" MaxF" MAE# AUC" MeanF" MaxF" MAE#
wCtr 0.8860 0.3854 0.4829 0.1456 0.8601 0.3208 0.3959 0.1084 0.8988 0.5835 0.6343 0.1347
GMR 0.8678 0.3901 0.4980 0.1708 0.8771 0.3259 0.4346 0.1831 0.8541 0.5528 0.6044 0.1856
BSCA 0.8572 0.3593 0.4638 0.1953 0.8308 0.2350 0.3046 0.2225 0.8885 0.5342 0.6031 0.1844
DRFI 0.9269 0.4602 0.6056 0.1361 0.9107 0.3697 0.5018 0.1285 0.9331 0.5237 0.6345 0.1345
MB+ 0.8997 0.3549 0.5187 0.2265 0.8719 0.3332 0.4257 0.1782 0.9170 0.5944 0.6239 0.1480
DCL 0.9662 0.6928 0.7697 0.0590 0.9340 0.5175 0.5908 0.0616 0.9291 0.6873 0.7188 0.0820
DHSNet 0.9611 0.7297 0.7689 0.0460 0.9244 0.5944 0.6321 0.0445 0.9329 0.7414 0.7652 0.0556
ELD 0.9342 0.5956 0.6841 0.0789 0.9088 0.4525 0.5325 0.0712 0.9395 0.6683 0.7235 0.0811
MCDL 0.8088 0.6015 0.6059 0.0689 0.7240 0.3930 0.3943 0.0778 0.8969 0.6023 0.6417 0.0835
MDF 0.9435 0.6110 0.6984 0.0900 0.9038 0.4472 0.5198 0.0627 0.9236 0.5898 0.6378 0.0995
Amulet 0.9661 0.6632 0.7395 0.0720 0.9201 0.4551 0.5151 0.0992 0.9529 0.6720 0.7098 0.0848
UCF 0.9636 0.6241 0.7411 0.1060 0.9329 0.4473 0.5321 0.1299 0.9634 0.6665 0.7260 0.1104
SRM 0.9759 0.7549 0.7970 0.0408 0.9302 0.5524 0.5939 0.0502 0.9490 0.7193 0.7379 0.0607
DSS 0.9729 0.7147 0.7832 0.0492 0.9360 0.5702 0.6213 0.0496 0.9141 0.6937 0.7265 0.0771
SGSP 0.9470 0.5154 0.6945 0.1410 0.9517 0.4211 0.6061 0.1559 – – – –
VFCN 0.9631 0.6675 0.7469 0.0592 0.9401 0.4948 0.5854 0.0561 0.9531 0.6534 0.7059 0.0788
SSA 0.8723 0.6940 0.7081 0.0673 0.8171 0.5279 0.5406 0.0800 0.8098 0.6857 0.6895 0.1009
PDB – – – – 0.9878 0.8030 0.8607 0.0173 0.9701 0.7579 0.7924 0.0534
BENDer#11 0.9642 0.8200 0.8587 0.0315 0.8603 0.5994 0.6275 0.0387 0.9238 0.7571 0.7677 0.0559
BENDer#12 0.9638 0.8261 0.8594 0.0314 0.8238 0.6306 0.6545 0.0360 0.9157 0.7944 0.8033 0.0433
Lovász 0.9682 0.8241 0.8655 0.0304 0.8685 0.6326 0.6672 0.0425 0.9288 0.7965 0.8081 0.0450
Dice 0.9651 0.8305 0.8620 0.0310 0.8634 0.6361 0.6631 0.0431 0.9289 0.7836 0.7914 0.0490
w/a 0.9665 0.8352 0.8613 0.0286 0.8768 0.6753 0.7004 0.0419 0.9372 0.7975 0.8076 0.0418

Bold values indicate the best results.

The Upernet50 model trained with the Lovász-softmax loss and a batch size of 16 (the 'Lovász' row in Table 8) outperforms the other baseline models on the DAVIS2016 dataset. In addition, the SE-ResNet50 model (w/ a) consistently outperforms the other baseline models on the UVSD and VOS datasets.
Among the video SOD models, PDB [75] outperforms the trained baseline models by a significant margin on the UVSD dataset, but shows slightly lower performance on the VOS dataset. On the other hand, all of the evaluated baseline models show lower performance on UVSD. This may be caused by the lower resolution and cluttered scenes of its video sequences; in addition, the lack of spatial–temporal saliency modeling in our baseline models also affects performance. The inconsistent performance improvement also reflects the differing generalization abilities of the image-based models across video datasets. For the VOS dataset, frames were annotated discretely with large intervals within each recording. When dealing with frames exhibiting long-range dependency, models with spatial–temporal modules, e.g., PDB with ConvLSTM and VFCN, may degenerate to performing single-frame saliency prediction independently; indeed, the performance is similar for models with and without long-range saliency coherence.
Regarding failure cases, since the image-based models were neither trained nor fine-tuned on video data, they may not work well on some challenging video sequences, including videos with complex backgrounds, objects surrounded by multiple other objects in cluttered scene contexts, and videos with low resolution. Moreover, failure cases may be caused by the different distributions of the training data (static images) and testing data (video frames), and the lack of a spatial–temporal saliency modeling structure can greatly affect the stability and performance of an image-based SOD model. Nevertheless, the experimental results also indicate that powerful image-based DL models, without any training on video data, can achieve competitive performance in comparison with video-based models. Thus, to obtain higher accuracy on the video-based SOD task, we are motivated to develop an encoder-decoder model that incorporates spatial–temporal modeling modules for SOD consistency and stability in our future work.

4.4. Discussion

4.4.1. Dataset distribution bias


The standard evaluation scheme in SOD trains the proposed model on a single dataset and performs cross-dataset evaluation on other datasets. However, dataset bias may largely influence performance on the testing datasets: if the distributions of the training and testing sets differ, performance decreases dramatically (see Table 7). In our experiments, we found that detection accuracy cannot be assured when the distributions of image samples across datasets differ considerably in terms of the pixel proportion of salient regions. Additional details and analysis of dataset distribution bias can be found in Appendix G.
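As a simple illustration of the statistic referred to above, the snippet below computes the proportion of salient pixels in each ground-truth mask; comparing histograms of this quantity between a training set and a testing set gives a rough picture of the distribution gap. This is an illustrative sketch, not the analysis code behind Appendix G.

```python
import numpy as np

def salient_pixel_ratios(gt_masks):
    """gt_masks: iterable of ground-truth arrays (H x W), foreground > 0.5."""
    return np.array([(m > 0.5).mean() for m in gt_masks])

# Example: compare the two distributions with simple histograms.
# train_hist, _ = np.histogram(salient_pixel_ratios(train_masks), bins=20, range=(0, 1))
# test_hist, _ = np.histogram(salient_pixel_ratios(test_masks), bins=20, range=(0, 1))
```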

4.4.2. Category annotation bias


Annotation bias toward specific objects leads to more failure cases on the testing datasets. For example, in the DUTS-TR dataset, a large number of samples containing a person are annotated with the person as salient. This may cause the trained models to conclude that a person should always be the salient object whenever one appears in an image, which produces more failure cases when the same trained models are tested on the DUT-OMRON dataset. Moreover, certain objects, such as trees and roads, occur infrequently in a training set, so the trained models tend to predict such objects as non-salient. Meanwhile, multi-object annotations encourage the trained models to mark all objects appearing in a scene as salient, which favors testing on the HKU-IS dataset but degrades performance on the DUT-OMRON dataset. Furthermore, semantic information plays an important role in saliency detection to some extent. In fact, although the annotated datasets do not explicitly provide any semantic information, semantic meaning may be given implicitly by annotating the same kind of object many times in a large dataset.

4.4.3. Influence of sub-module structures in encoder and decoder parts


Our results show that, in encoder-decoder networks for saliency detection, the influence of the backbone network in the encoder is more significant than that of the decoder. Since the decoder plays the role of a classifier, it largely depends on the features extracted by the encoder. The decoder is nonetheless important: according to the experimental results, the sub-structures used in the decoder, including the FPN and its variants with attention structures, as well as more complicated structures aimed at integrating multi-level or multi-scale feature maps, can also considerably improve the performance of a baseline model.
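For reference, the sketch below outlines the FPN-style top-down decoder path discussed in this subsection: lateral 1x1 convolutions project the encoder stages to a common width, deeper maps are upsampled and summed into shallower ones, and a final convolution predicts single-channel saliency logits. The channel widths and nearest-neighbour upsampling are illustrative assumptions rather than the exact baseline configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNDecoder(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), width=256):
        super().__init__()
        # Lateral 1x1 convolutions project each encoder stage to a common channel width.
        self.laterals = nn.ModuleList([nn.Conv2d(c, width, 1) for c in in_channels])
        self.predict = nn.Conv2d(width, 1, 3, padding=1)  # single-channel saliency logits

    def forward(self, feats):
        # feats: encoder stage outputs ordered from high resolution (shallow) to low (deep).
        feats = [lat(f) for lat, f in zip(self.laterals, feats)]
        x = feats[-1]                                     # start from the deepest (coarsest) map
        for f in reversed(feats[:-1]):                    # top-down: upsample, then add the lateral
            x = f + F.interpolate(x, size=f.shape[2:], mode='nearest')
        return self.predict(x)                            # logits at 1/4 of the input resolution
```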

5. Conclusion and future directions

In this paper, we presented an empirical study on encoder-decoder models for SOD, together with a review of the key components of encoder-decoder models for pixel-wise dense prediction tasks. Experimental results indicate that the newly discovered baseline models achieve state-of-the-art performance with respect to F-measure and MAE on both image- and video-based benchmark datasets when compared with existing competitive DL and non-DL models for SOD. An ablation study suggests the effectiveness of encoder-decoder models built with a ResNet backbone network, PPM, FPN, and ASPP modules. Moreover, numerous techniques, including attention mechanisms, loss functions, and evaluation schemes, can be applied when designing models to achieve higher performance. Based on our study, future research directions for SOD may focus on the following aspects:

1) Object relation and image understanding: According to our observations, recent works on SOD that leverage CNN-based encoder-decoder models have treated SOD as a pixel-level classification task. At present, the relationship between a salient object and the image context, which is a high-level concept, is not well captured in most recent encoder-decoder models. In addition, the object-to-object and object-to-context relationships can be treated as implicit saliency prior cues for SOD. It would be interesting to characterize the saliency relationship between local and global context at both the pixel and region levels by using recurrent neural networks (RNNs) or graph convolutional neural networks (GCNNs). Moreover, recent advances in visual attention and image understanding also show promising performance on SOD and semantic segmentation by explicitly modeling semantics from language information [69]. Techniques from VQA and image captioning have been exploited to learn the alignment between language descriptions and the visual semantics of salient objects [69,66,104]. Specifically, Qian et al. [66] proposed to detect salient objects from natural language, introducing a language-aware weakly supervised method for SOD. Zhang et al. [104] proposed the CapSal model, which leverages captioning as extra semantics to improve SOD performance in complex scenarios. These ideas may provide insights into speech-to-visual-attention guidance for people with visual impairment. Thus, improving the performance and interpretability of SOD models by considering both implicit and explicit information from the perspective of image understanding deserves further investigation.
2) Robustness of SOD models: Recent advances have explored the robustness of SOD models against adversarial attacks and non-salient cases. Concretely, Fernandez [21] explored the effectiveness of state-of-the-art SOD models in complex scenes by comparing their performance on original natural images and on adversarial examples, demonstrating the vulnerability of deep learning-based saliency models to adversarial examples. Li et al. [41] proposed ROSA, the first end-to-end trainable framework that launches adversarial attacks in order to boost the robustness of arbitrary FCN-based SOD models. Moreover, Fan et al. [20] recently proposed a high-quality dataset for detecting salient objects in clutter, in which images with salient and non-salient objects are collected to avoid the design bias stemming from the assumption that each image contains at least one salient object in a relatively low-clutter context. Furthermore, Liu et al. [57] proposed a multi-task framework for predicting both the saliency mask and the existence of a salient object in an image. Therefore, it is promising to develop robust SOD models from the perspectives of adversarial attacks, images without a salient object, and salient objects in complex scene contexts, for both image- and video-based SOD tasks.
3) Saliency-assisted single-object segmentation and weakly supervised semantic segmentation: According to our observations, current SOD methods can achieve high performance when dealing with simple images. This motivates us to apply off-the-shelf SOD models to assist single-object image segmentation in specific application scenarios, e.g., clothing segmentation for street and online-store clothing images. Moreover, saliency detection has also been applied to facilitate weakly supervised semantic segmentation by regarding saliency as a kind of backgroundness cue for generating 'fake' ground-truth maps [88,90]. Encouraging results demonstrate the potential of applying saliency detection techniques to image parsing in certain applications.
4) Video saliency detection: This is a more challenging task because video data contain more complex scene contexts, motion cues, cluttered backgrounds, etc. In addition, both spatial and temporal information should be considered in practice when modeling video sequence data. Encoder-decoder models, which have exhibited promising performance on static images, can also be transferred to video saliency detection by leveraging spatial–temporal information such as inter-frame constraints and motion cues (optical flow). It would be interesting to apply an encoder-decoder model with a well-designed spatial–temporal module to video SOD in future work.

CRediT authorship contribution statement

Haijun Zhang played a role in Conceptualization, Funding acquisition, Investigation, Methodology, Project administra-
tion, Resources, Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

Acknowledgement

This work was supported in part by the National Key Research and Development Program of China under Grant
2018YFB1003800 and Grant 2018YFB1003805, in part by the National Natural Science Foundation of China under Grant
61972112 and Grant 61832004, and in part by the Shenzhen Science and Technology Program under Grant
JCYJ20170413105929681 and Grant JCYJ20170811161545863.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.ins.2020.09.003.

References

[1] V. Badrinarayanan, A. Kendall, et al, SegNet: A deep convolutional encoder-decoder architecture for image segmentation, IEEE Trans. Pattern Anal.
Mach. Intell. 39 (12) (2017) 2481–2495.
[2] D. Bahdanau, K. Cho, et al., Neural machine translation by jointly learning to align and translate, CoRR abs/1409.0473.
[3] M. Berman, A.R. Triki, et al., The Lovász-Softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural
networks, in: CVPR, 2018, pp. 4413–4421.
[4] A. Borji, M. Cheng, et al, Salient object detection: a survey, Computat. Visual Media 5 (2) (2019) 117–150.
[5] A. Borji, M.-M. Cheng, et al, Salient object detection: a benchmark, IEEE Trans. Image Proc. 24 (12) (2015) 5706–5722.
[6] A. Borji, L. Itti, State-of-the-art in visual attention modeling, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 185–207.
[7] C. Chen, S. Li, et al, Video saliency detection via spatial-temporal fusion and low-rank coherency diffusion, IEEE Trans. Image Proc. 26 (7) (2017) 3156–
3170.
[8] L. Chen, G. Papandreou, et al., Semantic image segmentation with deep convolutional nets and fully connected CRFs, CoRR abs/1412.7062.
[9] L. Chen, G. Papandreou, et al., Rethinking atrous convolution for semantic image segmentation, CoRR abs/1706.05587.
[10] L. Chen, H. Zhang, et al., Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning, in: CVPR, 2017, pp. 6298–6306.
[11] L. Chen, Y. Zhu, et al., Encoder-decoder with atrous separable convolution for semantic image segmentation, in: ECCV, 2018, pp. 833–851.
[12] L.-C. Chen, G. Papandreou, et al, Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,
IEEE Trans. Pattern Anal. Mach. Intell. 40 (4) (2018) 834–848.
[13] Y. Chen, J. Li, et al., Dual path networks, in: NIPS, 2017, pp. 4467–4475.
[14] J. Cheng, L. Dong, et al., Long short-term memory-networks for machine reading, in: EMNLP, 2016, pp. 551–561.
[15] M.-M. Cheng, N.J. Mitra, et al, Global contrast based salient region detection, IEEE Trans. Pattern Anal. Mach. Intell. 37 (3) (2015) 569–582.
[16] F. Chollet, Xception: deep learning with depthwise separable convolutions, in: CVPR, 2017, pp. 1800–1807.
[17] M. Cornia, L. Baraldi, et al, Predicting human eye fixations via an LSTM-based saliency attentive model, IEEE Trans. Image Proc. 27 (10) (2018) 5142–
5154.
[18] J. Dai, H. Qi, et al., Deformable convolutional networks, in: ICCV, 2017, pp. 764–773.
[19] J. Deng, W. Dong, et al., Imagenet: A large-scale hierarchical image database, in: CVPR, 2009, pp. 248–255.
[20] D.-P. Fan, M.-M. Cheng, et al., Salient objects in clutter: bringing salient object detection to the foreground, in: ECCV, 2018, pp. 186–202.
[21] A. Fernandez, On the Salience of Adversarial Examples, in: ISVC, 2019, pp. 221–232.
[22] K. He, X. Zhang, et al., Deep residual learning for image recognition, in: CVPR, 2016, pp. 770–778.
[23] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.


[24] Q. Hou, M. Cheng, et al, Deeply supervised salient object detection with short connections, IEEE Trans. Pattern Anal. Mach. Intell. 41 (4) (2019) 815–
828.
[25] Q. Hou, J. Liu, et al., Three birds one stone: a unified framework for salient object segmentation, edge detection and skeleton extraction, CoRR abs/1803.09860.
[26] J. Hu, L. Shen, et al., Squeeze-and-excitation networks, in: CVPR, 2018, pp. 7132–7141.
[27] G. Huang, Z. Liu, et al., Densely Connected Convolutional Networks, in: CVPR, 2017, pp. 2261–2269.
[28] M. Jaderberg, K. Simonyan, et al., Spatial transformer networks, in: NIPS, 2015, pp. 2017–2025.
[29] Y. Ji, H. Zhang, et al, Salient object detection via multi-scale attention CNN, Neurocomputing 322 (2018) 130–140.
[30] X. Jia, B. De Brabandere, et al., Dynamic filter networks, in: NIPS, 2016, pp. 667–675.
[31] B. Jiang, L. Zhang, et al., Saliency detection via absorbing markov chain, in: ICCV, 2013, pp. 1665–1672.
[32] H. Jiang, J. Wang, et al., Salient object detection: a discriminative regional feature integration approach, in: CVPR, 2013, pp. 2083–2090.
[33] T. Ke, J. Hwang, et al., Adaptive affinity fields for semantic segmentation, in: ECCV, 2018, pp. 605–621.
[34] A. Kendall, V. Badrinarayanan, et al., Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene
understanding, in: BMVC, 2017, pp. 1–12.
[35] F. Lateef, Y. Ruichek, Survey on semantic segmentation using deep learning techniques, Neurocomputing 338 (2019) 321–348.
[36] Y. LeCun, Y. Bengio, et al., Deep learning, Nature 521 (7553) (2015) 436.
[37] G. Lee, Y.-W. Tai, et al., Deep saliency with encoded low level distance map and high level features, in: CVPR, 2016, pp. 660–668.
[38] G. Li, Y. Yu, Visual saliency based on multiscale deep features, in: CVPR, 2015, pp. 5455–5463.
[39] G. Li, Y. Yu, Deep contrast learning for salient object detection, in: CVPR, 2016, pp. 478–487.
[40] G. Li, Y. Yu, Visual saliency detection based on multiscale deep CNN features, IEEE Trans. Image Proc. 25 (11) (2016) 5012–5024.
[41] H. Li, G. Li, et al., ROSA: robust salient object detection against adversarial attacks, CoRR abs/1905.03434.
[42] H. Li, P. Xiong, et al., Pyramid attention network for semantic segmentation, in: BMVC, 2018, p. 285.
[43] J. Li, C. Xia, et al, A benchmark dataset and saliency-guided stacked autoencoders for video-based salient object detection, IEEE Trans. Image Proc. 27
(1) (2018) 349–364.
[44] X. Li, L. Zhao, et al, DeepSaliency: Multi-task deep neural network model for salient object detection, IEEE Trans. Image Proc. 25 (8) (2016) 3919–3930.
[45] Y. Li, X. Hou, et al., The secrets of salient object segmentation, in: CVPR, 2014, pp. 280–287.
[46] T. Lin, P. Dollár, et al., Feature pyramid networks for object detection, in: CVPR, 2017, pp. 936–944.
[47] T. Lin, P. Goyal, et al., Focal loss for dense object detection, in: ICCV, 2017, pp. 2999–3007.
[48] C. Liu, L. Chen, et al., Auto-DeepLab: hierarchical neural architecture search for semantic image segmentation, in: CVPR, 2019, pp. 82–92.
[49] L. Liu, H. Wang, et al., Crowd counting using deep recurrent spatial-aware network, in: IJCAI, 2018, pp. 849–855.
[50] N. Liu, J. Han, DHSNet: Deep hierarchical saliency network for salient object detection, in: CVPR, 2016, pp. 678–686.
[51] N. Liu, J. Han, et al., PiCANet: Learning pixel-wise contextual attention for saliency detection, in: CVPR, 2018, pp. 3089–3098.
[52] N. Liu, J. Han, et al., Predicting eye fixations using convolutional neural networks, in: CVPR, 2015, pp. 362–370.
[53] S. Liu, S.D. Mello, et al., Learning affinity via spatial propagation networks, in: NIPS, 2017, pp. 1519–1529.
[54] S. Liu, L. Qi, et al., Path aggregation network for instance segmentation, in: CVPR, 2018, pp. 8759–8768.
[55] T. Liu, Z. Yuan, et al, Learning to detect a salient object, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2) (2011) 353–367.
[56] Z. Liu, J. Li, et al, Saliency detection for unconstrained videos using superpixel-level graph and spatiotemporal propagation, IEEE Trans. Circ. Syst.
Video Techn. 27 (12) (2017) 2527–2542.
[57] Z. Liu, Q. Xiang, et al, Robust salient object detection for RGB images, Vis. Comput. 36 (9) (2020) 1823–1835.
[58] J. Long, E. Shelhamer, et al., Fully convolutional networks for semantic segmentation, in: CVPR, 2015, pp. 3431–3440.
[59] M. Mancas, V.P. Ferrera, et al, From Human Attention to Computational Attention, vol. 2, Springer, 2016.
[60] F. Milletari, N. Navab, et al., V-net: Fully convolutional neural networks for volumetric medical image segmentation, in: 3DV, 2016, pp. 565–571.
[61] V. Mnih, N. Heess, et al., Recurrent models of visual attention, in: NIPS, 2014, pp. 2204–2212.
[62] J. Pan, E. Sayrol, et al., Shallow and deep convolutional networks for saliency prediction, in: CVPR, 2016, pp. 598–606.
[63] J. Park, S. Woo, et al., BAM: Bottleneck Attention Module, in: BMVC, 2018, p. 147.
[64] F. Perazzi, P. Krähenbühl, et al., Saliency filters: Contrast based filtering for salient region detection, in: CVPR, 2012, pp. 733–740.
[65] F. Perazzi, J. Pont-Tuset, et al., A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation, in: CVPR, 2016, pp. 724–732.
[66] M. Qian, J. Qi, et al, Language-aware weak supervision for salient object detection, Pattern Recognit. 96 (106955) (2019) 1–11.
[67] X. Qin, Z. Zhang, et al., BASNet: Boundary-aware salient object detection, in: CVPR, 2019, pp. 7479–7489.
[68] Y. Qin, H. Lu, et al., Saliency detection via cellular automata, in: CVPR, 2015, pp. 110–119.
[69] V. Ramanishka, A. Das, et al., Top-down visual saliency guided by captions, in: CVPR, 2017, pp. 7206–7215.
[70] S. Ren, K. He, et al., Faster r-cnn: Towards real-time object detection with region proposal networks, in: NIPS, 2015, pp. 91–99.
[71] O. Ronneberger, P. Fischer, et al., U-net: Convolutional networks for biomedical image segmentation, in: MICCAI, 2015, pp. 234–241.
[72] A. G. Roy, N. Navab, et al., Concurrent spatial and channel ’Squeeze & Excitation’ in fully convolutional networks, in: MICCAI, 2018, pp. 421–429.
[73] A. Shrivastava, A. Gupta, et al., Training region-based object detectors with online hard example mining, in: CVPR, 2016, pp. 761–769.
[74] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556.
[75] H. Song, W. Wang, et al., Pyramid dilated deeper ConvLSTM for video salient object detection, in: ECCV, 2018, pp. 744–760.
[76] C. Szegedy, S. Ioffe, et al., Inception-v4, inception-resnet and the impact of residual connections on learning., in: AAAI, vol. 4, 2017, p. 12.
[77] C. Szegedy, W. Liu, et al., Going deeper with convolutions, in: CVPR, 2015, pp. 1–9.
[78] C. Szegedy, V. Vanhoucke, et al., Rethinking the inception architecture for computer vision, in: CVPR, 2016, pp. 2818–2826.
[79] A. Vaswani, N. Shazeer, et al., Attention is all you need, in: NIPS, 2017, pp. 5998–6008.
[80] L. Wang, H. Lu, et al., Learning to detect salient objects with image-level supervision, in: Proceedings of the CVPR, 2017, pp. 3796–3805.
[81] T. Wang, A. Borji, et al., A stagewise refinement model for detecting salient objects in images, in: CVPR, 2017, pp. 4019–4028.
[82] T. Wang, L. Zhang, et al., Detect globally, refine locally: a novel approach to saliency detection, in: CVPR, 2018, pp. 3127–3135.
[83] W. Wang, Q. Lai, et al., Salient object detection in the deep learning era: an in-depth survey, CoRR abs/1904.09146.
[84] W. Wang, J. Shen, et al., Salient object detection driven by fixation prediction, in: CVPR, 2018, pp. 1711–1720.
[85] W. Wang, J. Shen, et al., Revisiting video saliency: a large-scale benchmark and a new model, in: CVPR, 2018, pp. 4894–4903.
[86] W. Wang, J. Shen, et al, Video salient object detection via fully convolutional networks, IEEE Trans. Image Proc. 27 (1) (2018) 38–49.
[87] X. Wang, R.B. Girshick, et al., Non-Local Neural Networks, in: CVPR, 2018, pp. 7794–7803.
[88] Y. Wei, X. Liang, et al, Stc: A simple to complex framework for weakly-supervised semantic segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 39
(11) (2017) 2314–2320.
[89] Y. Wei, F. Wen, et al., Geodesic saliency using background priors, in: ECCV, 2012, pp. 29–42.
[90] Y. Wei, H. Xiao, et al., Revisiting dilated convolution: a simple approach for weakly-and semi-supervised semantic segmentation, in: CVPR, 2018, pp.
7268–7277.
[91] S. Woo, J. Park, et al., CBAM: Convolutional block attention module, in: ECCV, 2018, pp. 3–19.
[92] T. Xiao, Y. Liu, et al., Unified perceptual parsing for scene understanding, in: ECCV, 2018, pp. 418–434.
[93] S. Xie, Z. Tu, Holistically-nested edge detection, in: ICCV, 2015, pp. 1395–1403.
[94] C. Yang, L. Zhang, et al., Saliency detection via graph-based manifold ranking, in: CVPR, 2013, pp. 3166–3173.
[95] M. Yang, K. Yu, et al., DenseASPP for semantic segmentation in street scenes, in: CVPR, 2018, pp. 3684–3692.


[96] C. Yu, J. Wang, et al., BiSeNet: Bilateral segmentation network for real-time semantic segmentation, in: ECCV, 2018, pp. 334–349.
[97] C. Yu, J. Wang, et al., Learning a discriminative feature network for semantic segmentation, in: CVPR, 2018, pp. 1857–1866.
[98] D. Zhang, J. Han, et al., Supervision by fusion: towards unsupervised learning of deep salient object detector, in: ICCV, 2017, pp. 4068–4076.
[99] H. Zhang, I.J. Goodfellow, et al., Self-attention generative adversarial networks, in: ICML, 2019, pp. 7354–7363.
[100] H. Zhang, I.J. Goodfellow, et al., Self-attention generative adversarial networks, in: ICML, 2019, pp. 7354–7363.
[101] J. Zhang, S. Sclaroff, et al., Minimum barrier salient object detection at 80 FPS, in: ICCV, 2015, pp. 1404–1412.
[102] J. Zhang, T. Zhang, et al., Deep unsupervised saliency detection: a multiple noisy labeling perspective, in: CVPR, 2018, pp. 9029–9038.
[103] L. Zhang, J. Dai, et al., A bi-directional message passing model for salient object detection, in: CVPR, 2018, pp. 1741–1750.
[104] L. Zhang, J. Zhang, et al., CapSal: Leveraging captioning to boost semantics for salient object detection, in: CVPR, 2019, pp. 6024–6033.
[105] P. Zhang, D. Wang, et al., Amulet: aggregating multi-level convolutional features for salient object detection, in: ICCV, 2017, pp. 202–211.
[106] P. Zhang, D. Wang, et al., Learning uncertain convolutional features for accurate saliency detection, in: ICCV, 2017, pp. 212–221.
[107] R. Zhang, S. Tang, et al., Global-residual and local-boundary refinement networks for rectifying scene parsing predictions, in: IJCAI, 2017, pp. 3427–
3433.
[108] X. Zhang, T. Wang, et al., Progressive attention guided recurrent network for salient object detection, in: CVPR, 2018, pp. 714–722.
[109] Y. Zhang, D. Zhou, et al., Single-image crowd counting via multi-column convolutional neural network, in: CVPR, 2016, pp.589–597.
[110] H. Zhao, J. Shi, et al., Pyramid scene parsing network, in: CVPR, 2017, pp. 2881–2890.
[111] K. Zhao, S. Gao, et al., Optimizing the F-Measure for threshold-free salient object detection, in: ICCV, 2019, pp. 8848–8856.
[112] R. Zhao, W. Ouyang, et al., Saliency detection by multi-context deep learning, in: CVPR, 2015, pp. 1265–1274.
[113] T. Zhao, X. Wu, Pyramid feature attention network for saliency detection, in: CVPR, 2019, pp. 3085–3094.
[114] W. Zhu, S. Liang, et al., Saliency optimization from robust background detection, in: CVPR, 2014, pp. 2814–2821.

