Comprehensive Review on CNN Encoder-Decoder
Information Sciences
Article info:
Article history: Received 4 July 2020; Received in revised form 19 August 2020; Accepted 1 September 2020; Available online 9 September 2020
Keywords: Salient object detection; Encoder-decoder model; Pixel-level classification; Video saliency; Empirical study

Abstract:
Convolutional neural network (CNN)-based encoder-decoder models have profoundly inspired recent works in the field of salient object detection (SOD). Despite the rapid development of encoder-decoder models for most pixel-level dense prediction tasks, an empirical study that evaluates performance by applying a large body of encoder-decoder models to SOD tasks does not yet exist. In this paper, instead of limiting our survey to SOD methods, a broader view is presented from the perspective of fundamental architectures of key modules and structures in CNN-based encoder-decoder models for pixel-level dense prediction tasks. Moreover, we focus on performing SOD by leveraging deep encoder-decoder models, and present an extensive empirical study on baseline encoder-decoder models in terms of different encoder backbones, loss functions, training batch sizes, and attention structures. State-of-the-art encoder-decoder models adopted from semantic segmentation and deep CNN-based SOD models are also investigated. New baseline models that outperform the state of the art were discovered, and these newly discovered baseline models were further evaluated on three video-based SOD benchmark datasets. Experimental results demonstrate the effectiveness of these baseline models on both image- and video-based SOD tasks. This empirical study concludes with a comprehensive summary that provides suggestions on future perspectives.

© 2020 Elsevier Inc. All rights reserved.
1. Introduction
The human visual attention system tends to extract the most informative objects and regions in a scene, and then combines this local information to efficiently understand the whole scene. This kind of visual attention mechanism has prompted many researchers to simulate this ability in computer vision tasks [61,6]. Salient object detection (SOD) aims at finding the most attractive object(s) in a scene in order to simulate the functionality of the biological visual attention system [5]. In the past decade, remarkable success of deep convolutional neural networks (CNNs) has been achieved in a large number of computer vision tasks. Due to their powerful generalization capability, deep CNN models have been developed and applied not only to image-level classification tasks [70,36] but also to pixel-level classification tasks [58,110].
⇑ Corresponding author.
E-mail address: [email protected] (H. Zhang).
Concretely, fully convolutional network (FCN)-based encoder-decoder models have dramatically improved performance
on pixel-wise image-to-image learning tasks, including semantic segmentation [58,110,12], edge detection [93], SOD
[51,82,84,108], and crowd counting [109,49]. In essence, the trend of mainstream SOD methods developed in recent years
indicates that most of them work under the encoder-decoder framework. Some researchers have developed structures based on encoder-decoder models for the SOD task and achieved state-of-the-art performance [51,82,84]. Specifically, CNN-based encoder-decoder models play an important role in continuously improving the SOD performance on benchmark datasets [5]. Techniques, including multi-scale or multi-level structures [24], attention layers [51], etc., have also been developed and introduced into SOD models.
However, an important issue is whether a generic routine exists to improve performance by determining which components are the key factors under the encoder-decoder framework. To the best of our knowledge, an empirical study does not yet exist that thoroughly evaluates the performance of this kind of generic framework on the SOD task. In this work, we focus on investigating the profound influence of the CNN-based encoder-decoder model on SOD, and providing an empirical study of the performance obtained by applying encoder-decoder models to the SOD task. Moreover, we also provide a literature review of the key components of the encoder-decoder framework across a broad range of pixel-level classification or regression tasks. According to our experimental results, baseline models and their variants composed of a ResNet encoder [22], and decoders with the pyramid pooling module (PPM) [110], the atrous spatial pyramid pooling module (ASPP) [11], and the feature pyramid network (FPN) [46], have been found. The new baseline models outperform state-of-the-art deep SOD models. To further understand these results, we performed a thorough ablation study on each key module and technique under the encoder-decoder framework.
The main idea of this paper lies in broadening the research exploration of SOD by introducing modules and techniques from other similar pixel-level dense prediction tasks in computer vision, such as semantic segmentation [11] and edge detection [93]. In particular, CNN-based encoder-decoder models have been widely used in semantic segmentation, and, as noted above, most mainstream SOD methods developed in recent years also work under the encoder-decoder framework and have achieved state-of-the-art performance [51,82,84]. However, no article has fully performed cross-domain module validation. Therefore, our paper aims at quantifying model effectiveness by introducing key techniques into an encoder-decoder model, exploring possible network structures and learning strategies for developing a baseline SOD model, and providing some insights for researchers in SOD.
In this work, the reviewed papers largely cover topics including SOD, semantic segmentation, and the encoder-decoder
model and its key techniques and sub-modules. We also noticed that several survey papers [5,83] exist that review the literature with respect to SOD methods. Specifically, according to the literature reviews [5,83] on CNN-based SOD models proposed in recent years, CNN-based encoder-decoder models play an important role in continuously improving the SOD performance on benchmark datasets. Other techniques, including multi-scale or multi-level structures [24], attention structures [51], etc., have also been developed and introduced into SOD models. It can be observed that most of the techniques in SOD are inspired by or derived from deep CNN-based encoder-decoder models for other similar tasks. Different from the above survey papers, in this work, we focus on solving SOD by leveraging deep CNN-based encoder-decoder models, and present a thorough empirical study on baseline encoder-decoder models and a comparison of state-of-the-art deep CNN models for SOD
on seven image-based benchmark datasets. In addition, the discovered new baseline models were further evaluated on three
video-based SOD datasets in comparison to 18 state-of-the-art methods. Instead of limiting our survey to SOD methods, a
broader view is presented from the perspective of fundamental architectures of key modules and techniques in CNN-
based encoder-decoder models for pixel-level dense prediction tasks.
The remainder of this paper is organized as follows. In Section 2, we review a large body of SOD models proposed in the deep learning era; constructing CNN-based encoder-decoder models for higher SOD performance has become a clear trend. The rationale behind the techniques in each component is highlighted, including backbone networks with powerful generalization capacity, header structures for rich feature mining, attention structures for feature recalibration, and multi-scale and multi-level feature integration structures for recovering the details of a salient object segmentation mask. Section 3 then outlines the techniques in CNN-based encoder-decoder models for image-to-image learning that are tightly coupled across similar tasks. Details of the empirical study and experimental results validating the efficacy of each key module are given in Section 4. Finally, in Section 5, we conclude the paper with directions for future work.
This section outlines the taxonomy of SOD from the perspectives of traditional methods and deep CNN-based models, respectively. In particular, the relations between state-of-the-art deep SOD models and encoder-decoder models are also provided.
Traditional methods of SOD can be categorized into two main classes: bottom-up methods and top-down methods. The concepts of "bottom-up" and "top-down" are mainly based on the theory of experimental psychology and cognitive
neuroscience [59]. Specifically, bottom-up methods define saliency by focusing on low-level features as exogenous factors triggering the visual attention mechanism. Top-down methods define saliency from an endogenous perspective, which means that visual attention is closely related to an individual's experience, memory, and emotion [59].
In practice, the design of bottom-up methods largely depends on saliency priors [64,89,31,45,15,94,68], including center
surround prior, foreground prior, boundary connectivity prior, local and global contrast prior, focusness prior, and geodesic
prior. Those priors can serve as a kind of semi-supervised information to design heuristic bottom-up methods based on con-
straints. Zhu et al. [114] introduced a saliency optimization method to find the backgroundness probability of superpixels by
considering geodesic saliency. Yang et al. [94] proposed a graph-based model by using manifold ranking, in which a query
sequence is constructed by considering that the boundary nodes can be generally treated as background or non-salient.
However, bottom-up methods are heuristically designed and rely on specific saliency priors. Consequently, they may not work well when the images are mismatched with the corresponding priors. Thus, unsupervised methods based on preset constraints face a performance bottleneck. Before the era of deep learning, top-down approaches placed more effort on feature extraction and classifier design [45]. Jiang et al. [32] proposed a discriminative regional feature integration approach to map regional feature vectors to saliency scores by using a random forest regressor. Liu et al. [55] presented a method that models groups of saliency features through conditional random field (CRF) learning for detecting salient objects, which was also applied to video saliency detection tasks by modeling extra spatial-temporal information.
Attributed to the success of FCN-based encoder-decoder models in pixel-level dense prediction tasks [58,110,12,93], ideas inspired by encoder-decoder models for solving saliency detection have been presented. Thus, the performance of SOD has been improved continuously [51,82,84,108,50,62,81,105,106,24]. These deep CNN-based models can be
further classified into three categories: CNN-based encoder-decoder models, deep context-saliency models, and other
models.
Table 1
Overview of deep CNN-based models for SOD.
CNN-based state-of-the-art SOD models are shown in Table 1. Core ideas and network architectures are also listed. Table 1
reveals that techniques, including atrous convolution, spatial pyramid pooling, multi-scale and multi-level feature fusion,
and CRF refinement [12,24], are closely coupled with encoder-decoder models proposed in other pixel-level prediction tasks.
For clarity, Fig. 1 presents key techniques and structures in SOD and encoder-decoder models, and illustrates the relations
between SOD and encoder-decoder models, as well as connections among sub-level topics.
Fig. 1. Taxonomy of surveys with respect to key techniques and structures in SOD and encoder-decoder models.
In this section, we briefly review state-of-the-art encoder-decoder models for pixel-level dense prediction tasks
[58,12,93,109,49,11]. Key components and techniques will be introduced in the following subsections.
3.1. Background
Initially, FCN [58] can be treated as a prototype of the encoder-decoder framework by using CNN for semantic segmen-
tation. In FCN, the fully connected layer for image-level classification is removed and replaced by multiple layers of trans-
posed convolution operation or bilinear up-sampling operations. FCN-based or FCN-like models have been proposed in
recent years. Consequently, semantic segmentation performance has been continuously improved. Most of the CNN-based
models can be reduced to the encoder-decoder model. Models, e.g., SegNet [1,34], UNet [71], and DeepLab series
[8,12,9,11,48], have been recently advanced. Specifically, Badrinarayanan et al. [1] proposed an encoder-decoder model
for semantic segmentation, which is based on the FCN structure and a symmetrical decoder. In [34], this network structure
is expanded to model segmentation uncertainty. The atrous convolution operation from DeepLab [12,11] and early down-sampling have been proven to be effective in this framework. Moreover, Ronneberger et al. [71] designed a symmetric, U-shaped network in which skip connections pass information between encoder and decoder layers at the same scale, compensating for details lost during down-sampling.
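For concreteness, the following is a minimal PyTorch-style sketch of such a symmetric, U-shaped encoder-decoder with skip connections; the layer widths, depths, and pooling choices are illustrative assumptions rather than the configuration of any particular model discussed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU: the basic building block of encoder and decoder stages.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """Minimal symmetric encoder-decoder with skip connections (illustrative only)."""
    def __init__(self, in_ch=3, num_classes=1, width=32):
        super().__init__()
        self.enc1 = conv_block(in_ch, width)
        self.enc2 = conv_block(width, width * 2)
        self.bottleneck = conv_block(width * 2, width * 4)
        self.dec2 = conv_block(width * 4 + width * 2, width * 2)
        self.dec1 = conv_block(width * 2 + width, width)
        self.head = nn.Conv2d(width, num_classes, 1)  # pixel-wise classifier

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(F.max_pool2d(e1, 2))
        b = self.bottleneck(F.max_pool2d(e2, 2))
        # Decoder: upsample and fuse with same-scale encoder features via skip connections.
        d2 = self.dec2(torch.cat(
            [F.interpolate(b, scale_factor=2, mode='bilinear', align_corners=False), e2], dim=1))
        d1 = self.dec1(torch.cat(
            [F.interpolate(d2, scale_factor=2, mode='bilinear', align_corners=False), e1], dim=1))
        return self.head(d1)  # logits; apply a sigmoid to obtain a saliency map
```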
However, CNN-based encoder-decoder models are not only developed in semantic segmentation, but have also been
adopted in many other pixel-level prediction fields, such as edge detection [93], crowd counting [109,49], fixation prediction
[52,17], and SOD [51,84,105,106,24,29]. Actually, these aforementioned image-to-image learning tasks are similar and clo-
sely related. For example, SOD and semantic segmentation can be categorized into the same kind of vision task, i.e., pixel-
wise dense classification, from the perspective of output forms. This makes the relationship between SOD and semantic seg-
mentation quite close. Generally, a semantic segmentation model aims at accurately and efficiently classifying each pixel
into a category to facilitate applications, such as autonomous driving, indoor navigation, and virtual or augmented reality
systems to name a few [35]. However, from the perspective of task objectives, SOD tends to predict the most attractive object
in a scene by simulating the visual attention mechanism. In addition, edge detection is also a pixel-level binary classification
task, but it has to address extremely unbalanced positive and negative samples. Moreover, crowd counting and fixation pre-
diction can be reduced to pixel-level regression tasks.
In the following subsections, key techniques and components involved in each important part of encoder-decoder models
will be briefly discussed, including encoder backbone networks, ‘‘header” structures and their variants, multi-scale and
multi-level feature fusion, attention structures, and loss functions.
An attention mechanism was initially proposed for neural machine translation (NMT) in natural language processing
(NLP) [2]. The attention module proposed in Ref. [2] aims at learning the alignment between the source language sentence
and the current word in the target language in a machine translation model. Intuitively, different parts of the source sentence
make different contributions to translating a current word in a target language. Such weights for each word in the source
sentence can be learned by using an attention model.
Attention mechanisms have largely influenced the development of sequence modeling tasks in computer vision, such as
image captioning, visual question answering (VQA), and video classification [10]. The motivation behind attention structures
in CNNs is that the diversity of feature maps can be improved by introducing such implicit attention structures into convolutional layers. Fig. 3 presents the generic form of attention structures. In practice, the difference between such attention structures in the literature lies in the specific implementation of the attention generation module. Among these structures, Fig. 3(a) and (b) show the channel-wise and spatial-wise attention modules, respectively. Fig. 3(d)-(f) illustrate three different kinds of combinations of spatial and channel-wise attention layers, and Fig. 3(c) presents a generic form of attention module with a residual connection. Concretely, Hu et al. [26] proposed an SE network, in which a self-gated channel-wise attention structure
is introduced for capturing channel dependence. Moreover, Roy et al. [72] proposed a concurrent structure of a spatial- and
channel-wise attention module organized into a parallel form based on the SE-block, which is termed the scSE-block. Sim-
ilarly, Park et al. [63] also presented another form of combination of spatial and channel-wise attention module, and applied
it to the bottleneck block.
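As an illustration of the channel-wise attention idea, a compact sketch of an SE-style block is given below; the reduction ratio of 16 follows the common default, and the rest of the wiring is our own simplification rather than an exact reproduction of [26].

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention: global pooling -> bottleneck MLP -> sigmoid gating."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global spatial average per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())  # excitation: channel weights in (0, 1)

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # recalibrate feature maps channel-wise
```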
From the perspective of the source and target with respect to an attention module, we can roughly classify attention mod-
ules into two main categories: (1) self-attention; and (2) guided attention modules. Generally, the self-attention module can
be designed and applied in the encoder part, as an implicit attention structure, for robust feature learning. However, for the
guided attention module, attention is generated from a source feature map and applied to a target for feature recalibration
and integration.
Actually, the concept of self-attention was originally proposed in NLP tasks. In practice, instead of learning to perform
inter-alignment between encoder and decoder, self-attention, which is also called intra-attention, is a special form of atten-
tion mechanism for intra-relation reasoning between tokens within both encoder and decoder parts [79]. It has been widely
used in a variety of tasks in NLP, e.g., reading comprehension, abstractive summarization, textual entailment, and learning
task-independent sentence representations [79,14]. Specifically, Vaswani et al. [79] proposed to integrate a generic form of
self-attention function into a simple but efficient network architecture, namely, the transformer, for language sequence
modeling. By considering that the self-attention module is essentially a mapping function from a query and a set of key-value pairs to an output, Wang et al. [87] argued that the self-attention module is a special case of non-local operations in the embedded Gaussian version. This work thus relates the self-attention module to classic non-local means, and extends sequential self-attention to a generic space/spacetime non-local network for both image and video recognition.
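A minimal sketch of a spatial self-attention (non-local) block in the embedded Gaussian form is shown below; the 1x1 projections and the halved inner channel width are common choices and assumptions on our part, not a faithful reproduction of any specific implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Self-attention over spatial positions (embedded Gaussian non-local block) with a residual connection."""
    def __init__(self, channels, inner=None):
        super().__init__()
        inner = inner or channels // 2
        self.query = nn.Conv2d(channels, inner, 1)
        self.key = nn.Conv2d(channels, inner, 1)
        self.value = nn.Conv2d(channels, inner, 1)
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, hw, inner)
        k = self.key(x).flatten(2)                     # (b, inner, hw)
        v = self.value(x).flatten(2).transpose(1, 2)   # (b, hw, inner)
        attn = F.softmax(q @ k, dim=-1)                # pairwise relation between all spatial positions
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection preserves the original signal
```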
Attention modules generally play a key role in performing feature recalibration and feature enhancement [29]. Moreover, the attention mechanism has been widely used in each part of encoder-decoder models for dense prediction tasks, and has achieved performance improvements [26,42,91,99]. However, the attention module usually requires more delicate and sophisticated feature operations and introduces additional computation on input tensors. Network structures for generating attention need to be designed carefully in order to avoid introducing too many extra parameters. On the
other hand, the attention module may increase the number of layers of the proposed model alongside the trunk branch of a
deep CNN model. A residual connection is often applied to the attention module for preserving the information flow in a
deeper neural network (see Fig. 3(c)).
In practice, a "header" can be regarded as a connection structure that aims at producing rich feature maps and mining context-aware information based on the feature maps extracted from the encoder part.
In general, a "header" component is located at the end of the encoder, before the decoder or dense pixel-wise classifiers. Multi-scale or multi-level pyramid feature generation modules [110,11] can be treated as representative "header" structures. Among them, the PPM [110] was developed for rich high-level concept abstraction by using multi-scale pooling operations: the feature maps at the end of the encoder are pooled by using several different sizes of pooling kernels to obtain different sub-region representations. Similarly, the ASPP module [11] aims at obtaining rich feature representations with respect to different receptive fields by leveraging multiple atrous convolution operations with different dilation rates. These two kinds of efficient "header" components were both introduced for obtaining rich feature representations, but they were developed from different perspectives, based on pooling and convolutional operations, respectively. The core idea is to harvest multi-level feature representations by gradually increasing the receptive field under a feature pyramid framework.
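The following sketch illustrates both kinds of "header" components; the pooling bin sizes (1, 2, 3, 6) and dilation rates (1, 6, 12, 18) follow widely used defaults and are assumptions rather than the exact settings of [110,11].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid pooling: pool the encoder output at several grid sizes, project, upsample, and concatenate."""
    def __init__(self, in_ch, branch_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(in_ch, branch_ch, 1), nn.ReLU(inplace=True))
            for b in bins])

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(br(x), size=(h, w), mode='bilinear', align_corners=False)
                  for br in self.branches]
        return torch.cat([x] + pooled, dim=1)

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: parallel dilated convolutions enlarge the receptive field at full resolution."""
    def __init__(self, in_ch, branch_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, 3 if r > 1 else 1,
                          padding=r if r > 1 else 0, dilation=r),
                nn.ReLU(inplace=True))
            for r in rates])

    def forward(self, x):
        return torch.cat([br(x) for br in self.branches], dim=1)
```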
Moreover, modules that combine feature reuse via dense connections, multi-level pooling, and atrous convolution operations with attention structures have also been developed, and several variants of the PPM and ASPP structures have been proposed
in recent years. For example, Yang et al. [95] designed an ASPP module with dense connections, which is largely inspired by
the idea of rich feature mining in DenseNet [27]. These models bring the idea of feature enhancement by leveraging attention
mechanisms and dense connections into multi-scale feature map pooling and atrous convolution. High-level feature maps
with respect to accurate semantic meaning and context information can be learned for improving segmentation
performance.
3.5. Decoder
This section summarizes the various structures and key components applied in the decoder part. The most commonly used designs aim at multi-level and multi-scale feature learning and integration. Rich hierarchical representations with different levels of visual perception can be utilized for parsing multi-level visual concepts and recovering local details [93,24,92]. In particular, skip connections and multi-scale and multi-level structures are commonly considered for multi-scale response fusion under a feature pyramid hierarchy.
The decoder in FCN [58] pioneered a straightforward method for utilizing multi-level feature maps. In [46], the FPN structure was introduced as a pyramid-based feature integration scheme for multi-scale RoI pooling and feature extraction. In FPN, high- and low-level feature maps are integrated by utilizing up-sampling and element-wise addition operations. Moreover, FPN and its variants, with sophisticated refinement modules and
attention mechanisms, have been proposed and widely used in a number of networks for object detection, semantic segmen-
tation, etc. Concretely, Xiao et al. [46] presented a unified encoder-decoder model for multi-task learning by exploiting the
different levels of features for different tasks. In addition, Liu et al. [54] proposed a progressive dual path version of FPN for
accurate instance segmentation. A dual path with top-down and bottom-up directions of low- and high-level feature inte-
gration structure is introduced to shorten information flow within a deep CNN model. In addition, variants of FPN-based
decoders with attention modules [51,84,103,42] have also been proposed for SOD in recent years.
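A minimal sketch of the FPN-style top-down pathway is given below; the input channel counts correspond to typical ResNet stage outputs and are assumptions, as is the 3x3 smoothing convolution applied after fusion.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNDecoder(nn.Module):
    """FPN-style top-down pathway: 1x1 lateral projections, upsampling, and element-wise addition."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.laterals = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in in_channels])

    def forward(self, feats):
        # feats: backbone feature maps ordered from high resolution (shallow) to low resolution (deep).
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # propagate semantics from coarse to fine levels
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[2:], mode='bilinear', align_corners=False)
        return [s(p) for s, p in zip(self.smooth, laterals)]
```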
For different tasks, various loss functions should be considered according to different objectives. For pixel-level dense pre-
diction tasks, for instance, most of the deep CNN-based segmentation methods rely on logistic regression, optimizing cross-
entropy loss between prediction and ground-truth. On the other hand, for pixel-wise regression tasks, l2 loss is often applied
to fixation prediction [17] and crowd counting tasks [109,49]. For SOD, which can be treated as a binary classification task
[24], several types of loss functions, including binary cross entropy (BCE) loss, dice loss, and metric-based loss, have been
applied.
BCE loss, a special form of cross entropy loss function, has been widely used in binary classification tasks (including SOD
[24], edge detection [93], medical image segmentation [71], etc.) in the form of:
$$L_{\mathrm{BCE}} = -\sum_{j \in G} \big[\, G_j \log P(G_j = 1 \mid X, W) + (1 - G_j) \log P(G_j = 0 \mid X, W) \,\big], \qquad (1)$$
where G represents the ground-truth label list; X is the input training image; j indicates the index of pixel location; and P
represents the probability of activation of the j-th pixel in the output map.
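In practice, Eq. (1) is usually computed from raw logits for numerical stability; a minimal PyTorch sketch (our own illustration, not the training code used in this study) is:

```python
import torch.nn.functional as F

def bce_saliency_loss(logits, gt):
    """Per-pixel binary cross entropy (Eq. (1)) averaged over all pixels.

    logits: raw network output of shape (B, 1, H, W); gt: binary ground-truth mask of the same shape.
    """
    return F.binary_cross_entropy_with_logits(logits, gt.float())
```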
However, the measure of cross-entropy loss is often a poor indicator of the quality of segmentation [3]. It indicates that
the loss function aims at penalizing the error of classification at each pixel, instead of directly optimizing evaluation metrics,
e.g., the Jaccard index (which is also called the intersection-over-union (IoU) score) between produced segmentation masks
and ground-truth. Thus, several metric-based loss functions have been proposed in the recent literature, including dice loss for binary classification in medical image segmentation [71], Lovász-Softmax loss for semantic and binary segmentation [3], and precision-, recall-, F-measure-, and mean absolute error (MAE)-based loss functions for SOD [84,17,111].
Concretely, the dice loss function was proposed in V-Net [60] for volumetric medical image segmentation. It is based on the
dice coefficient for binary classification. The dice coefficient between a binary mask and a ground-truth mask can be defined
as:
$$D = \frac{2 \sum_{i}^{N} p_i g_i}{\sum_{i}^{N} p_i^2 + \sum_{i}^{N} g_i^2}, \qquad (2)$$
where $p_i$ and $g_i$ denote the prediction and ground-truth label of the i-th pixel, respectively. For binary classification, the dice coefficient is equivalent to the F1 score, which ranges from 0 to 1. It describes the similarity between the predicted mask and the ground-truth mask. Thus, the dice loss function can be written as $L_{\mathrm{dice}} = 1 - D$.
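A minimal sketch of the dice loss as defined by Eq. (2) is given below; the small epsilon term is a common smoothing assumption to guard against empty masks.

```python
import torch

def dice_loss(pred, gt, eps=1e-6):
    """Dice loss L_dice = 1 - D, with D as in Eq. (2); pred holds per-pixel probabilities in [0, 1]."""
    pred, gt = pred.flatten(1), gt.flatten(1).float()
    inter = (pred * gt).sum(dim=1)
    denom = (pred ** 2).sum(dim=1) + (gt ** 2).sum(dim=1)
    dice = (2 * inter + eps) / (denom + eps)   # eps guards against empty masks
    return (1 - dice).mean()
```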
Moreover, the Lovász-Softmax loss [3] was proposed for directly optimizing the mean IoU (mIoU) for both foreground-background segmentation and multi-class semantic segmentation. In practice, the Jaccard index of class $c$ can be calculated by:
$$J_c(P, G) = \frac{|\{P = c\} \cap \{G = c\}|}{|\{P = c\} \cup \{G = c\}|}, \qquad (3)$$
where $P$ and $G$ represent the predicted and ground-truth label lists, respectively. Thus, a corresponding loss function can be defined as $L_c(P, G) = 1 - J_c(P, G)$. However, computing the convex closure of set functions is NP-hard in general. Accordingly, Berman
et al. [3] constructed two kinds of piecewise linear convex surrogate losses based on the Lovász extension of submodular
set functions, i.e., Lovász hinge and Lovász-Softmax loss, in order to train deep CNN models for binary segmentation and
multi-class segmentation, respectively.
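The full Lovász extension is more involved than can be shown here; as a simpler illustration of directly optimizing Eq. (3), the sketch below uses a soft-Jaccard (soft-IoU) surrogate in which the hard sets are replaced by per-pixel probabilities. This is an assumption-laden simplification, not the loss of [3].

```python
import torch

def soft_jaccard_loss(pred, gt, eps=1e-6):
    """Soft-IoU surrogate of 1 - J_c(P, G) in Eq. (3) for binary segmentation (not the Lovász extension itself)."""
    pred, gt = pred.flatten(1), gt.flatten(1).float()
    inter = (pred * gt).sum(dim=1)                   # soft intersection
    union = pred.sum(dim=1) + gt.sum(dim=1) - inter  # soft union
    return (1 - (inter + eps) / (union + eps)).mean()
```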
In addition, metric-based loss functions for saliency detection were also proposed for fixation prediction [17] and SOD [84,111]. Specifically, in Ref. [84], a joint form of weighted cross entropy and metric-based loss function is applied to SOD; the corresponding metrics include precision, recall, F-measure, and MAE. Similarly, Cornia et al. [17] proposed to apply metric-based loss functions to fixation prediction, including the Normalized Scanpath Saliency (NSS), the Linear Correlation Coefficient (CC), and the Kullback-Leibler Divergence (KL-Div); a network structure, which consists of a ResNet backbone with dilated convolution and an attentive ConvLSTM in the decoder part, was trained by leveraging learned fixation maps. Zhao et al. [111] proposed a relaxed F-measure loss function to overcome the non-differentiability of the standard F-measure.
Other forms of loss functions, e.g., auxiliary loss, can be applied to shallow layers to optimize and stabilize the training
process [110]. Another form of auxiliary loss is multiple side-way output loss under a deeply supervised framework [93,24].
Moreover, in Ref. [33], an auxiliary affinity loss based on an adaptive affinity field is proposed, which aims at learning pixel-wise discriminative feature representations by leveraging adversarial training. Here, adversarial training is used to avoid trivializing the affinity loss of pixel pairs in an affinity field, which is quite close in motivation to hard example mining for training models with powerful generalization capability.
This kind of auxiliary loss function can assist in training CNN models, yielding more stability and higher performance in pixel-wise prediction tasks. In practice, the sub-modules for calculating auxiliary losses are usually removed at the testing stage. Moreover, additional supervision information can be generated from the pixel-wise annotated ground-truth, e.g., edges [97], affinity between neighboring pixels, etc. The resulting supervision information can also be used as
‘ground-truth’ to calculate auxiliary loss for training CNN models in the tasks of semantic segmentation and instance seg-
mentation [53].
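A minimal sketch of such a deeply supervised objective is shown below; the side-output weight of 0.4 is an illustrative assumption, not a value reported in the literature reviewed here.

```python
import torch.nn.functional as F

def deeply_supervised_loss(main_logits, side_logits_list, gt, side_weight=0.4):
    """Main loss plus weighted auxiliary losses on side outputs; side branches are dropped at test time.

    side_weight = 0.4 is an illustrative value, not a setting taken from any specific paper.
    """
    loss = F.binary_cross_entropy_with_logits(main_logits, gt.float())
    for side in side_logits_list:
        # Side outputs may come from shallower layers at lower resolution; match the ground-truth size.
        side = F.interpolate(side, size=gt.shape[2:], mode='bilinear', align_corners=False)
        loss = loss + side_weight * F.binary_cross_entropy_with_logits(side, gt.float())
    return loss
```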
4. Empirical study
In this section, we evaluate a series of encoder-decoder models and their variants for the SOD task. Seven widely used SOD benchmark datasets are utilized in our empirical analysis. To evaluate the performance, the three most widely used evaluation metrics for SOD, including F-measure, MAE, and area under the curve (AUC), are applied. Experimental results quantifying the effectiveness of different components and learning strategies, including encoder backbone networks, header structures, attention modules, loss functions, and batch sizes, are presented. In addition, all of the experiments in the following subsections are performed by following the cross-dataset evaluation strategy, i.e., training models on one dataset and evaluating the performance on the other datasets. Due to space limits, commonly utilized benchmark datasets, evaluation metrics, implementation details, hyper-parameter settings, and an analysis of batch sizes of baseline architectures are summarized in Appendices A-C.
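For reference, a minimal sketch of the MAE and F-measure computations is given below; beta^2 = 0.3 is the convention commonly used in the SOD literature, and the sketch is illustrative rather than the exact evaluation code of [4].

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error between a saliency map and a binary ground truth, both scaled to [0, 1]."""
    return np.abs(sal.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(sal, gt, threshold=0.5, beta2=0.3):
    """F-measure at a fixed binarization threshold; beta2 = 0.3 is the usual SOD convention."""
    pred = sal >= threshold
    gt = gt > 0.5
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```

Mean and max F-measure are then typically obtained by averaging or maximizing this score over a sweep of binarization thresholds.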
In this section, we explore the baseline encoder-decoder models with a ResNet backbone, named "BENDer" in the experimental results, by utilizing different combinations of commonly used sub-modules in the encoder and decoder parts, including atrous convolution, PPM, ASPP, and FPN. We thus harvested a series of model variants by changing the encoder, header, and decoder parts. In addition, we also investigated the contributions of different loss functions, batch sizes, and attention structures. Superior performance has been achieved by following this flow of model analysis. Due to space limits, the analysis of batch sizes of baseline architectures is summarized in Appendix C.
Table 2
Structures of ‘BENDer’ models for SOD.
Table 3
Quantitative results for evaluating ‘BENDer’ models. (The upward arrow means higher values represent better results, while downward arrow denotes lower values indicate better performance.)
BENDer#1 BENDer#2 BENDer#3 BENDer#4 BENDer#5 BENDer#6 BENDer#7 BENDer#8 BENDer#9 BENDer#10 BENDer#11 BENDer#12
ECSSD AUC" 0.9880 0.9854 0.9834 0.9821 0.9855 0.9887 0.9854 0.9887 0.9878 0.9865 0.9670 0.9655
MeanF" 0.9068 0.9144 0.9036 0.9025 0.9202 0.9178 0.9165 0.9259 0.9198 0.9188 0.9302 0.9338
MaxF" 0.9355 0.9386 0.9275 0.9266 0.9437 0.9435 0.9413 0.9502 0.9441 0.9433 0.9394 0.9418
MAE# 0.0541 0.0509 0.0557 0.0583 0.0493 0.0467 0.0502 0.0456 0.0472 0.0485 0.0435 0.0422
HKU-IS AUC" 0.9857 0.9846 0.9810 0.9805 0.9799 0.9876 0.9876 0.9848 0.9879 0.9811 0.9672 0.9662
MeanF" 0.8882 0.8983 0.8877 0.8878 0.9015 0.9019 0.9020 0.9063 0.9054 0.9008 0.9217 0.9283
MaxF" 0.9250 0.9281 0.9169 0.9177 0.9311 0.9331 0.9330 0.9360 0.9366 0.9324 0.9320 0.9370
MAE# 0.0468 0.0425 0.0453 0.0461 0.0424 0.0403 0.0402 0.0401 0.0389 0.0419 0.0326 0.0309
PASCAL-S AUC" 0.9675 0.9662 0.9606 0.9599 0.9599 0.9686 0.9679 0.9654 0.9678 0.9613 0.9420 0.9413
MeanF" 0.8383 0.8446 0.8312 0.8334 0.8334 0.8440 0.8450 0.8524 0.8454 0.8506 0.8517 0.8559
MaxF" 0.8642 0.8686 0.8550 0.8580 0.8580 0.8694 0.8712 0.8756 0.8713 0.8754 0.8645 0.8671
MAE# 0.0787 0.0744 0.0798 0.0801 0.0801 0.0754 0.0741 0.0724 0.0742 0.0734 0.0714 0.0688
DUT-OMRON AUC↑ 0.9492 0.9386 0.9235 0.9206 0.9211 0.9532 0.9528 0.9421 0.9561 0.9236 0.9240 0.9274
MeanF" 0.7508 0.7616 0.7246 0.7274 0.7700 0.7736 0.7722 0.7859 0.7836 0.7689 0.7895 0.8120
MaxF" 0.7908 0.7937 0.7573 0.7627 0.8037 0.8092 0.8072 0.8194 0.8189 0.8065 0.8073 0.8257
MAE# 0.0673 0.0644 0.0746 0.0755 0.0613 0.0600 0.0603 0.0571 0.0582 0.0607 0.0559 0.0525
DUTS-TR (Training) AUC↑ 0.9985 0.9989 0.9990 0.9989 0.9989 0.9989 0.9991 0.9991 0.9991 0.9990 0.9897 0.9883
MeanF" 0.9584 0.9661 0.9648 0.9646 0.9665 0.9657 0.9666 0.9678 0.9658 0.9662 0.9755 0.9745
MaxF" 0.9784 0.9817 0.9816 0.9814 0.9825 0.9831 0.9840 0.9843 0.9830 0.9836 0.9807 0.9795
MAE# 0.0252 0.0210 0.0215 0.0219 0.0207 0.0199 0.0196 0.0194 0.0198 0.0200 0.0131 0.0138
DUTS-TE AUC" 0.9773 0.9749 0.9674 0.9666 0.9654 0.9814 0.9808 0.9767 0.9815 0.9683 0.9550 0.9511
MeanF" 0.8217 0.8358 0.8124 0.8154 0.8438 0.8411 0.8411 0.8575 0.8492 0.8445 0.8578 0.8734
MaxF" 0.8629 0.8698 0.8478 0.8510 0.8803 0.8803 0.8805 0.8951 0.8896 0.8843 0.8775 0.8882
MAE# 0.0529 0.0480 0.0540 0.0535 0.0469 0.0459 0.0459 0.0435 0.0448 0.0461 0.0415 0.0395
Table 4
Quantitative results for evaluating baseline models trained by using different loss functions. (The upward arrow means higher values represent better results,
while downward arrow denotes lower values indicate better performance.)
Lovász loss Dice loss BCE loss FM loss Focal loss OHEM-CE
ECSSD AUC" 0.9749 0.9751 0.9867 0.9242 0.9917 0.976
MeanF" 0.9364 0.9372 0.9233 0.9078 0.8305 0.9276
MaxF" 0.9456 0.9461 0.9441 0.9209 0.9362 0.9406
MAE# 0.0379 0.0379 0.0449 0.0614 0.0844 0.0443
HKU-IS AUC" 0.9698 0.9703 0.9863 0.9037 0.9916 0.9728
MeanF" 0.9272 0.927 0.9082 0.8886 0.8016 0.9137
MaxF" 0.9363 0.9365 0.9341 0.9065 0.9269 0.9271
MAE# 0.0305 0.0307 0.0378 0.0554 0.0752 0.0365
PASCAL-S AUC" 0.9532 0.9546 0.9678 0.9043 0.9766 0.9475
MeanF" 0.8639 0.8631 0.853 0.8433 0.7713 0.8438
MaxF" 0.876 0.8755 0.8756 0.8553 0.8688 0.8584
MAE# 0.0643 0.066 0.07 0.0852 0.104 0.0774
DUT-OMRON AUC↑ 0.9243 0.9275 0.9459 0.8718 0.9615 0.9326
MeanF" 0.7925 0.7942 0.7724 0.7671 0.6638 0.7849
MaxF" 0.8065 0.8117 0.8 0.7785 0.7849 0.8018
MAE# 0.0552 0.0565 0.06 0.0652 0.0958 0.0572
DUTS-TR (Training) AUC↑ 0.9916 0.9917 0.9991 0.9386 0.9991 0.9937
MeanF" 0.9809 0.9801 0.9684 0.9483 0.8994 0.971
MaxF" 0.9853 0.9847 0.9847 0.9634 0.9788 0.9784
MAE# 0.0107 0.0114 0.0188 0.0426 0.0551 0.0159
DUTS-TE AUC" 0.9595 0.9609 0.9796 0.8864 0.9857 0.9592
MeanF" 0.8751 0.8741 0.8503 0.8293 0.7313 0.8472
MaxF" 0.8926 0.8925 0.8826 0.8407 0.866 0.8667
MAE# 0.0364 0.0375 0.0427 0.0549 0.0769 0.0454
MSRA10K AUC" 0.961 0.9638 0.9753 0.9124 0.9786 0.9723
MeanF" 0.9075 0.9097 0.8949 0.8881 0.8073 0.9116
MaxF" 0.9165 0.9195 0.9139 0.8984 0.9033 0.9223
MAE# 0.0503 0.0488 0.0565 0.0677 0.0952 0.0483
Fig. 6. Overall performance of baseline models trained by utilizing different loss functions.
Models trained by using focal loss achieved unstable performance in terms of mean and max F-measure but relatively high performance in terms of AUC score over all testing sets. For clarity, we also provide comparative PR curves in Fig. 7. According to our observation, the precision and recall points may distribute within a small range when the models were trained by utilizing metric-based loss functions. This indicates that the produced saliency maps are not sensitive to binary thresholds when calculating the saliency mask. As a result, AUC scores may be lower than those of other models. In contrast, saliency maps that are sensitive to binary thresholds can produce relatively smooth PR curves; in this case, higher AUC scores can be achieved. Visual comparisons of saliency maps produced by baseline models trained with different loss functions can be found in Appendix D.
In our experiments, the SE attention structure and the self-attention (non-local) structure [87,100] were incorporated into different parts of encoder-decoder models. Fig. 8 illustrates eight kinds of attention blocks obtained by applying attention structures to different parts of a given encoder-decoder model. Among them, Fig. 8(a) illustrates the SE-ResNet block, which was exploited in a backbone network, namely SE-ResNet50. Fig. 8(b) and (c) show the PPM and FPN with SE attention structures, respectively. Considering the computational complexity, we only exploited the self-attention module on the high-level feature maps produced by a PPM for modeling pixel-wise relations (see Fig. 8(d)). Fig. 8(e)-(h) show four different feature fusion strategies using the SE attention structure under the FPN framework in the decoder part. In Fig. 8(e) and (f), SE attention modules were applied to the high- and low-level branches, respectively, to perform feature recalibration. Fig. 8(g) shows a structure for feature recalibration after high- and low-level feature integration. Fig. 8(h) shows a guided SE attention model for low-level feature recalibration according to the attention weights produced by high-level feature maps.
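One plausible reading of the guided SE fusion in Fig. 8(h) is sketched below, where channel weights computed from the high-level branch recalibrate the low-level branch before the usual upsample-and-add fusion; both branches are assumed to share the same channel count (e.g., after the FPN lateral convolutions). This is a sketch under our own assumptions, not necessarily the exact design evaluated in this study.

```python
import torch.nn as nn
import torch.nn.functional as F

class GuidedSEFusion(nn.Module):
    """Fuse FPN branches with guided channel attention: high-level features gate the low-level features."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, low, high):
        # Channel weights are generated from the (semantically stronger) high-level branch...
        w = self.gate(high)
        # ...and applied to recalibrate the low-level branch before upsample-and-add fusion.
        high_up = F.interpolate(high, size=low.shape[2:], mode='bilinear', align_corners=False)
        return low * w + high_up
```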
By considering different compositions of the above-mentioned attention structures, as well as the feasibility of adopting an attention module, ten kinds of models with attention mechanisms reflecting different motivations were implemented. For clarity, all of the model structures involved in this set of experiments on attention mechanism analysis are listed in Table 5, which illustrates the specific settings in different parts of an encoder-decoder model with different attention structures corresponding to Fig. 8. All the models were trained by using the Lovász-Softmax loss with the batch size set to 16 on the DUTS-TR [80] dataset. Other datasets were utilized for testing. In this section, we selected SE-ResNet50, which employs a widely used attention structure, for improving the generalization capacity of the backbone part. In Tables 5 and 6, since the attention structure can be applied to different parts of an encoder-decoder model, SE blocks are also applied to the "header" structure and decoder part for a thorough comparison. Another reason for selecting SE-ResNet50 as the backbone network in baseline models is that the effectiveness of the SE block has been proven and pre-trained models are available, allowing a fair comparison with other baselines by loading a pre-trained model for training. During the training process, we loaded the weights of the backbone networks, i.e., ResNet50 and SE-ResNet50, which were pre-trained on the ImageNet dataset.
Quantitative results are listed in Table 6. In order to visually illustrate the performance of each model by using different
attention structures, Fig. 9 summarizes the comparative results in terms of the evaluation metrics on six testing datasets. It
shows that the performance can be further improved by introducing attention modules. Specifically, for models with
ResNet50 as a backbone of encoder part, models with self-attention (w/ d), and self-attention with auxiliary loss
Fig. 8. Illustration of different structures with commonly used attention modules in each part of an encoder-decoder model.
Table 5
Model structures for attention mechanism analysis.
(w/ d&ds) show comparable performance to the baseline model without attention modules (None). Performance improvement can be observed on the DUT-OMRON and MSRA10K datasets when introducing an auxiliary loss after the PPM with the self-attention module (see Figs. 8(d) and 9). On the other hand, introducing the PPM with SE-attention structures into the baseline model (i.e., Upernet50) did not achieve consistent performance improvement.
Interestingly, the backbone network with SE-ResNet blocks consistently achieved a 0.13%–2.43% performance improvement in terms of max F-measure on the six benchmark datasets. Moreover, for models with SE-ResNet50 as the encoder backbone, quantitative results show that the performance can be further improved by introducing extra attention modules in the decoder part, but the gains are not significant in comparison with the SE-ResNet block alone. Regarding the effects of introducing attention modules into different parts, SE attention structures were integrated into the encoder backbone (w/ a), the header (w/ b), as well as the FPN (w/ a + eg); the results show that a backbone with SE-ResNet blocks can further improve the performance. Moreover, when implementing different types of attention structures, we observe that the self-attention module and the self-attention module with auxiliary loss under a deeply supervised framework produce relatively higher improvement in comparison with the SE attention structures (w/ b). In contrast, some attention structure settings may degrade the performance slightly.
Table 6
Quantitative results for evaluating baseline models with different attention structures. (The upward arrow means higher values represent better results, while
downward arrow denotes lower values indicate better performance.)
None w/ d w/d&ds w/ b w/ a w/ a + b w/ a + e w/ a + g w/ a + h w/ a + f w/ a + b + g w/ a + c
ECSSD AUC" 0.9749 0.9722 0.9739 0.9739 0.9789 0.9775 0.9762 0.9771 0.9777 0.9756 0.9783 0.9748
MeanF" 0.9364 0.9387 0.9395 0.9371 0.9414 0.9413 0.9385 0.9406 0.9373 0.9358 0.9421 0.9380
MaxF" 0.9456 0.9471 0.9480 0.9454 0.9503 0.9500 0.9472 0.9490 0.9458 0.9447 0.9507 0.9465
MAE# 0.0379 0.0384 0.0362 0.0378 0.0333 0.0344 0.0361 0.0346 0.0357 0.0372 0.0336 0.0366
HKU-IS AUC" 0.9698 0.9687 0.9701 0.9698 0.9739 0.9742 0.9712 0.9734 0.9726 0.9717 0.9739 0.9718
MeanF" 0.9272 0.9286 0.9277 0.9268 0.9327 0.9318 0.9287 0.9310 0.9301 0.9291 0.9324 0.9296
MaxF" 0.9363 0.9374 0.9367 0.9360 0.9421 0.9410 0.9381 0.9405 0.9395 0.9386 0.9418 0.9387
MAE# 0.0305 0.0293 0.0299 0.0307 0.0277 0.0277 0.0292 0.0284 0.0287 0.0299 0.0277 0.0291
PASCAL-S AUC" 0.9532 0.9479 0.9495 0.9521 0.9526 0.9552 0.9489 0.9531 0.9515 0.9503 0.9545 0.9497
MeanF" 0.8639 0.8647 0.8636 0.8627 0.8639 0.8689 0.8583 0.8621 0.8611 0.8605 0.8642 0.8609
MaxF" 0.8760 0.8772 0.8758 0.8754 0.8773 0.8813 0.8706 0.8759 0.8747 0.8722 0.8779 0.8745
MAE# 0.0643 0.0661 0.0667 0.0654 0.0645 0.0612 0.0676 0.0644 0.0670 0.0670 0.0630 0.0673
DUT-OMRON AUC↑ 0.9243 0.9218 0.9434 0.9255 0.9401 0.9373 0.9397 0.9394 0.9412 0.9401 0.9374 0.9394
MeanF" 0.7925 0.7978 0.7987 0.7928 0.8129 0.8152 0.8129 0.8144 0.8147 0.8131 0.8113 0.8120
MaxF" 0.8065 0.8113 0.8203 0.8088 0.8308 0.8320 0.8311 0.8337 0.8328 0.8323 0.8303 0.8296
MAE# 0.0552 0.0540 0.0565 0.0556 0.0500 0.0486 0.0502 0.0496 0.0502 0.0490 0.0494 0.0505
DUTS-TR (Training) AUC↑ 0.9916 0.9914 0.9907 0.9914 0.9914 0.9915 0.9907 0.9912 0.9912 0.9912 0.9915 0.9907
MeanF" 0.9809 0.9802 0.9785 0.9810 0.9804 0.9805 0.9793 0.9800 0.9794 0.9793 0.9805 0.9796
MaxF" 0.9853 0.9847 0.9831 0.9855 0.9850 0.9851 0.9840 0.9845 0.9841 0.9840 0.9850 0.9841
MAE# 0.0107 0.0109 0.0117 0.0106 0.0108 0.0107 0.0114 0.0110 0.0113 0.0114 0.0108 0.0114
DUTS-TE AUC" 0.9595 0.9571 0.9626 0.9588 0.9650 0.9644 0.9617 0.9642 0.9613 0.9621 0.9650 0.9618
MeanF" 0.8751 0.8748 0.8691 0.8722 0.8812 0.8818 0.8750 0.8778 0.8739 0.8767 0.8798 0.8763
MaxF" 0.8926 0.8919 0.8895 0.8902 0.8995 0.8990 0.8929 0.8966 0.8918 0.8937 0.8976 0.8928
MAE# 0.0364 0.0364 0.0384 0.0369 0.0349 0.0338 0.0363 0.0351 0.0370 0.0362 0.0344 0.0362
MSRA10K AUC" 0.9610 0.9597 0.9650 0.9610 0.9666 0.9651 0.9667 0.9663 0.9662 0.9668 0.9661 0.9664
MeanF" 0.9075 0.9109 0.9148 0.9087 0.9194 0.9183 0.9207 0.9194 0.9193 0.9203 0.9193 0.9212
MaxF" 0.9165 0.9190 0.9247 0.9177 0.9289 0.9271 0.9297 0.9289 0.9281 0.9295 0.9282 0.9297
MAE# 0.0503 0.0482 0.0455 0.0496 0.0431 0.0441 0.0426 0.0434 0.0436 0.0432 0.0433 0.0427
According to our experiments, the encoder backbone network plays a key role in performance enhancement. Setting different types of attention structures for feature recalibration in the PPM or in the top-down feature fusion path of the FPN may not improve performance consistently. Visual comparisons of saliency maps produced by baseline models with different attention structures can be found in Appendix E.
To evaluate the performance of state-of-the-art models for SOD, we compared the most recent works. In Table 7, we compared 14 state-of-the-art deep CNN-based models, namely, DCL [39], DS [44], DSS [24], ELD [37], MDF [38], MC [112], Amulet [105], UCF [106], SRM [81], ASNet [84], BiMCFEM [103], BASNet [67], PFA [113], and PiCANet [51], on six benchmark datasets. For BiMCFEM and BASNet, the saliency maps were provided by the authors with respect to four benchmark datasets. The
Table 7
Comparison of state-of-the-art deep CNN based models. (The upward arrow means higher values represent better results, while downward arrow denotes lower values indicate better performance.)
DCL DS DSS ELD MDF MC Amulet UCF SRM ASNet BiMCFEM BASNet PFA PiCA-R PiCA-RC w/a w/a + b
ECSSD AUC" 0.9743 0.9846 0.9704 0.9573 0.9381 0.8343 0.9799 0.9826 0.9819 0.9873 0.9811 0.9666 0.9811 0.9891 0.9671 0.9789 0.9775
MeanF" 0.8479 0.8354 0.8811 0.8312 0.7321 0.8115 0.8824 0.8522 0.8962 0.8985 0.9001 0.9272 0.8956 0.9001 0.9299 0.9414 0.9413
MaxF" 0.8958 0.8999 0.9062 0.8654 0.8075 0.8135 0.9127 0.9081 0.9159 0.9320 0.9284 0.9425 0.9220 0.9317 0.9371 0.9503 0.9500
MAE# 0.0800 0.0803 0.0647 0.0809 0.1376 0.0965 0.0607 0.0797 0.0564 0.0468 0.0446 0.0370 0.0449 0.0484 0.0370 0.0333 0.0344
HKU-IS AUC" 0.9800 0.9816 0.9761 0.9562 0.9481 0.8220 0.9829 0.9839 0.9831 0.9858 0.9808 0.9634 0.9859 0.9894 0.9648 0.9739 0.9742
MeanF" 0.8346 0.7935 0.8676 0.7920 0.7263 0.7626 0.8582 0.8341 0.8787 0.8832 0.8884 0.9088 0.8968 0.8796 0.9194 0.9327 0.9318
MaxF" 0.8903 0.8662 0.8984 0.8377 0.8066 0.7647 0.8974 0.8877 0.9031 0.9217 0.9207 0.9269 0.9261 0.9185 0.9281 0.9421 0.9410
MAE# 0.0637 0.0781 0.0509 0.0742 0.1148 0.0898 0.0507 0.0620 0.0469 0.0417 0.0387 0.0330 0.0326 0.0433 0.0308 0.0277 0.0277
PASCAL-S AUC" 0.9532 0.9670 0.9338 0.9275 0.9155 0.7777 0.9565 0.9586 0.9599 0.9746 0.9540 0.9329 0.9684 0.9734 0.9400 0.9526 0.9552
MeanF" 0.7621 0.7574 0.7952 0.7379 0.6609 0.7057 0.7871 0.7510 0.8154 0.8386 0.8199 0.8343 0.8310 0.8218 0.8498 0.8639 0.8689
MaxF" 0.8058 0.8282 0.8193 0.7684 0.7285 0.7074 0.8287 0.8179 0.8386 0.8743 0.8501 0.8539 0.8696 0.8573 0.8615 0.8773 0.8813
MAE# 0.1141 0.1080 0.1023 0.1215 0.1635 0.1397 0.1002 0.1278 0.0841 0.0668 0.0737 0.0758 0.0655 0.0756 0.0642 0.0645 0.0612
DUT-OMRON AUC↑ 0.9341 0.9681 0.9283 0.9316 0.9239 0.8060 0.9497 0.9459 0.9454 0.9782 0.9238 0.9263 0.9719 0.9630 0.9138 0.9401 0.9373
MeanF" 0.6907 0.6865 0.7272 0.6611 0.6151 0.6588 0.6932 0.6628 0.7441 0.8208 0.7450 0.7905 0.8185 0.7619 0.8084 0.8129 0.8152
MaxF" 0.7333 0.7734 0.7604 0.7164 0.6795 0.6606 0.7429 0.7297 0.7690 0.8615 0.7742 0.8053 0.8565 0.8029 0.8183 0.8308 0.8320
MAE# 0.0949 0.0843 0.0745 0.0923 0.1147 0.0889 0.0976 0.1204 0.0694 0.0411 0.0636 0.0565 0.0415 0.0653 0.0543 0.0500 0.0486
DUTS-TE AUC" 0.9583 0.9683 0.9573 0.9356 0.9327 0.7871 0.9620 0.9610 0.9692 0.9668 0.9694 0.9460 0.9752 0.9821 0.9444 0.9650 0.9644
MeanF" 0.7276 0.6937 0.7730 0.6793 0.6323 0.6477 0.7263 0.6866 0.7960 0.7906 0.8134 0.8420 0.8292 0.8139 0.8587 0.8812 0.8818
MaxF" 0.7857 0.7757 0.8131 0.7368 0.7090 0.6495 0.7784 0.7710 0.8263 0.8354 0.8515 0.8595 0.8708 0.8598 0.8691 0.8995 0.8990
MAE# 0.0819 0.0901 0.0651 0.0923 0.1139 0.1004 0.0851 0.1173 0.0587 0.0607 0.0490 0.0476 0.0409 0.0506 0.0404 0.0349 0.0338
MSRA10K AUC" 0.9837 0.9900 0.9784 0.9880 0.9737 0.9173 0.9983 0.9959 0.9788 0.9947 – 0.9679 – 0.9819 0.9629 0.9666 0.9651
MeanF" 0.8800 0.8543 0.9031 0.9195 0.8264 0.8978 0.9435 0.9104 0.8864 0.9248 – 0.9112 – 0.8865 0.9111 0.9194 0.9183
MaxF" 0.9174 0.9155 0.9256 0.9450 0.8808 0.9003 0.9687 0.9524 0.9069 0.9552 – 0.9278 – 0.9165 0.9221 0.9289 0.9271
released codes of the other methods are adopted, and trained models are directly utilized for saliency map generation. Then,
evaluation codes provided by [4] were adopted to calculate the evaluation metrics for fully quantifying the experimental
results. Quantitative results in italics in Table 7 denote that performance is evaluated on the training set with respect to cor-
responding methods, i.e., ELD, Amulet, UCF and ASNet. PiCA-R denotes PiCANet with ResNet50 backbone, and PiCA-RC refers
to the PiCANet-R model with dense CRF inference as post-processing. For clarity, the best results achieved by our trained
baseline models and other state-of-the-art methods are marked in bold.
The quantitative results in Table 7 show that the SE-ResNet50 (w/a) and SE-ResNet50 with PPM+SE attention (w/a + b) models trained with Lovász loss and the batch size set to 16 outperform other state-of-the-art deep CNN models with respect to max F-measure on five of the benchmark datasets, the exception being the DUT-OMRON dataset. Specifically, the trained SE-ResNet50 (w/a)
model produces 0.69% to 3.04% performance improvement in terms of maximum F-measure in comparison to the state-of-
the-art DL model, PiCA-RC, over six benchmark datasets. Moreover, the trained SE-ResNet50 with PPM+SE attention (w/a + b)
model also yields a 0.5% to 2.99% performance gain in terms of max F-measure in comparison with the PiCA-RC model. Qualitative comparisons of saliency maps produced by baseline models and state-of-the-art CNN-based SOD models are provided in Appendix F.
In this section, we perform an extended study analyzing the performance of our trained baseline models on three video-based SOD benchmark datasets. A brief review of related work on video SOD is outlined, and experimental results for both image- and video-based state-of-the-art SOD methods are presented.
Table 8
Comparison of state-of-the-art SOD models on video datasets. (The upward arrow means higher values represent better results, while downward arrow denotes
lower values indicate better performance.)
in comparison with state-of-the-art image-based DL models over the three benchmark datasets, respectively. The Upernet50
model trained with the Lovász-Softmax loss and the batch size set to 16 (row 'Lovász' in Table 8) outperforms the other baseline models on the DAVIS2016 dataset. In addition, the SE-ResNet50 model (w/a) consistently outperforms the other baseline models on the UVSD and VOS datasets.
For video SOD models, PDB [75] outperforms the trained baseline models by a significant margin on the UVSD dataset, but shows slightly lower performance on the VOS dataset. On the other hand, all evaluated baseline models show lower performance on the UVSD dataset. This may be caused by the lower resolution and cluttered scenes of the video sequences in this
dataset. In addition, the lack of spatial–temporal saliency modeling in our baseline models also affects the performance. The
inconsistent performance improvement also reflects the generalization ability of different image-based models on different
video datasets. For VOS dataset, frames were annotated discretely with a large interval in a video recording. When dealing
with frames with long-range dependency, models with spatial–temporal modules, e.g., PDB with ConvLSTM and VFCN, may
degenerate to perform single-frame saliency map prediction independently. It can be observed that the performance looks
similar when using models with or without long-range saliency coherence.
Regarding failure case analysis, since the image-based models are neither trained nor fine-tuned on the video data, they may not work well on some challenging video sequences, including videos with complex backgrounds, objects surrounded by multiple objects in a cluttered scene context, and videos with low resolution. Moreover, failure cases may be caused by the different distributions between training data (static images) and testing data (video frames), and the lack of a spatial-temporal saliency modeling structure can greatly affect the stability and performance of an image-based SOD model. However, the experimental results also indicate that powerful image-based DL models, even without training on video data, can achieve competitive performance in comparison with video-based models. Thus, to obtain higher accuracy on the video-based SOD task, we are motivated to develop an encoder-decoder model incorporating spatial-temporal modeling modules for SOD consistency and stability in our future work.
4.4. Discussion
In this paper, we presented an empirical study on encoder-decoder models for SOD. We also performed a review of key
components in encoder-decoder models for pixel-wise dense prediction tasks. Experimental results indicate that newly dis-
covered baseline models can achieve state-of-the-art performance with respect to F-measures and MAE on both image- and
video-based benchmark datasets when compared with existing competitive DL and non-DL models for SOD. An ablation
study suggests the effectiveness of encoder-decoder models built with a ResNet backbone network together with PPM, FPN, and ASPP
modules. Moreover, numerous techniques, including attention mechanisms, loss functions, and evaluation schemes, can
be applied to designing models for achieving higher performance. Based on our study, some research directions in future
work with respect to SOD may focus on the following aspects:
1) Object relation and image understanding: According to our observation, recent works on SOD by leveraging CNN-
based encoder-decoder models have treated SOD as a pixel-level classification task. At present, the relationship
between a salient object and image context, which is a high-level concept, is not well captured in most recent
encoder-decoder models. In addition, the object-to-object, and object-to-context relationship can be treated as impli-
cit saliency prior cues for SOD. It would be interesting to characterize the saliency relationship between local and global context at both the pixel and region levels by using recurrent neural networks (RNNs) or graph convolutional neural networks (GCNNs). Moreover, recent advances in visual attention and image understanding also show promising performance on SOD and semantic segmentation by explicitly modeling semantics from language information [69]. Techniques from VQA and image captioning are exploited to learn the alignment between language descriptions and the visual semantics of salient objects [69,66,104]. Specifically, Qian et al. [66] proposed to detect salient objects from natural language, introducing a language-aware weakly supervised method for SOD. Zhang et al. [104] proposed a CapSal model for SOD that leverages captions, as extra semantics, to improve SOD performance in complex scenarios. These ideas may provide some insights into the provision of speech-to-visual-attention guidance for people with visual impairment. Thus, improving SOD performance and the interpretability of SOD models by considering both implicit and explicit information from the perspective of image understanding needs further investigation in future directions.
2) Robustness of SOD models: Recent advances have explored the robustness of SOD models against adversarial attacks
and non-saliency cases. Concretely, Fernandez [21] explored the effectiveness of state-of-the-art SOD models in complex scenes by comparing their performance on original natural images and on adversarial examples, demonstrating the vulnerability of deep learning-based saliency models to adversarial examples. Li et al. [41] proposed the first end-to-end trainable framework, ROSA, which launches adversarial attacks and boosts the robustness of arbitrary FCN-based SOD models. Moreover, Fan et al. [20] recently proposed a high-quality dataset for detecting salient objects in clutter. Images with salient and non-salient objects are collected to avoid the design bias arising from the assumption that each image contains at least one salient object in a relatively low-clutter context. Furthermore, Liu et al. [57] proposed a multi-task framework for predicting both the saliency mask and the existence of a salient object in an image. Therefore, it is promising to develop robust SOD models from the perspectives of adversarial attacks, images without a salient object, and salient objects in complex scene contexts in both image- and video-based SOD tasks.
3) Saliency-assisted single-object segmentation and weakly supervised semantic segmentation: According to our
observation, current SOD methods can achieve high performance on relatively simple images. This motivates the use
of off-the-shelf SOD models to assist single-object image segmentation in specific application scenarios, e.g., clothing
segmentation for street and online-store clothing images. Moreover, saliency detection has also been used to facilitate
weakly supervised semantic segmentation by regarding saliency as a backgroundness cue for generating 'fake' (pseudo)
ground-truth maps [88,90]. Encouraging results demonstrate the potential of applying saliency detection techniques
to image parsing in certain applications.
4) Video saliency detection: This is a more challenging task because video data contain more complex scene contexts,
motion cues, cluttered backgrounds, etc. In addition, both spatial and temporal information should be considered
when modeling video sequence data. Encoder-decoder models, which have exhibited promising performance on static
images, can also be transferred to video saliency detection by leveraging spatial–temporal information such as
inter-frame constraints and motion cues (e.g., optical flow). It would be interesting to apply an encoder-decoder
model with a well-designed spatial–temporal module to video SOD in future work; a hedged sketch of one such module is given after this list.
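For concreteness, the following is a minimal, illustrative sketch (in PyTorch, assuming torchvision's resnet50; it is not the exact configuration evaluated in this paper) of the kind of baseline summarized above: a ResNet-50 encoder, a pyramid pooling module (PPM) applied to the deepest features for global context, and an FPN-style top-down decoder that predicts a one-channel saliency map. Class names such as PPM and BaselineSOD, the pooling bins, and the channel widths are placeholders chosen for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class PPM(nn.Module):
    # Pyramid pooling over the deepest encoder features (the bins are an assumption).
    def __init__(self, in_ch, out_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, out_ch, 1, bias=False),
                          nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for b in bins])
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch + len(bins) * out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(s(x), size=(h, w), mode='bilinear',
                                align_corners=False) for s in self.stages]
        return self.fuse(torch.cat([x] + pooled, dim=1))

class BaselineSOD(nn.Module):
    # ResNet-50 encoder + PPM + FPN-style decoder for pixel-level saliency prediction.
    def __init__(self, fpn_ch=256):
        super().__init__()
        r = resnet50(pretrained=True)  # older torchvision API; newer versions use weights=...
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layers = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        enc_chs = [256, 512, 1024, 2048]
        self.ppm = PPM(2048, fpn_ch)
        self.laterals = nn.ModuleList([nn.Conv2d(c, fpn_ch, 1) for c in enc_chs[:3]])
        self.smooth = nn.ModuleList([nn.Conv2d(fpn_ch, fpn_ch, 3, padding=1)
                                     for _ in range(3)])
        self.head = nn.Conv2d(fpn_ch, 1, 1)  # one-channel saliency logits

    def forward(self, x):
        size = x.shape[2:]
        feats, y = [], self.stem(x)
        for layer in self.layers:
            y = layer(y)
            feats.append(y)               # feature maps at strides 4, 8, 16, 32
        top = self.ppm(feats[-1])         # global context on the deepest features
        for i in range(2, -1, -1):        # FPN top-down pathway
            lat = self.laterals[i](feats[i])
            top = lat + F.interpolate(top, size=lat.shape[2:], mode='bilinear',
                                      align_corners=False)
            top = self.smooth[i](top)
        logits = self.head(top)
        return F.interpolate(logits, size=size, mode='bilinear', align_corners=False)

# Training would typically minimize binary cross-entropy (optionally combined with an
# IoU-style loss) between torch.sigmoid(logits) and the ground-truth saliency map.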
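As a rough illustration of the spatial–temporal module mentioned in item 4), the sketch below (again in PyTorch, with hypothetical names ConvGRUCell and TemporalFusion) propagates per-frame encoder features through a convolutional GRU so that each frame's features are enriched with inter-frame context before being passed to a shared decoder. This is only one of many possible designs (e.g., ConvLSTM- or optical-flow-based fusion) and is not a model evaluated in this paper.

import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    # Convolutional GRU over spatial feature maps (the kernel size is an assumption).
    def __init__(self, ch, kernel=3):
        super().__init__()
        p = kernel // 2
        self.gates = nn.Conv2d(2 * ch, 2 * ch, kernel, padding=p)  # update/reset gates
        self.cand = nn.Conv2d(2 * ch, ch, kernel, padding=p)       # candidate state

    def forward(self, x, h):
        if h is None:
            h = torch.zeros_like(x)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

class TemporalFusion(nn.Module):
    # Runs the ConvGRU over the encoder features of a video clip, frame by frame.
    def __init__(self, ch=256):
        super().__init__()
        self.cell = ConvGRUCell(ch)

    def forward(self, clip_feats):              # clip_feats: (B, T, C, H, W)
        h, outs = None, []
        for t in range(clip_feats.size(1)):
            h = self.cell(clip_feats[:, t], h)
            outs.append(h)                      # temporally-enriched features per frame
        return torch.stack(outs, dim=1)         # fed to a (shared) per-frame decoder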
CRediT authorship contribution statement
Haijun Zhang: Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.
Acknowledgement
This work was supported in part by the National Key Research and Development Program of China under Grant
2018YFB1003800 and Grant 2018YFB1003805, in part by the National Natural Science Foundation of China under Grant
61972112 and Grant 61832004, and in part by the Shenzhen Science and Technology Program under Grant
JCYJ20170413105929681 and Grant JCYJ20170811161545863.
Appendix A. Supplementary data
Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.ins.2020.09.003.
References
[1] V. Badrinarayanan, A. Kendall, et al, SegNet: A deep convolutional encoder-decoder architecture for image segmentation, IEEE Trans. Pattern Anal.
Mach. Intell. 39 (12) (2017) 2481–2495.
[2] D. Bahdanau, K. Cho, et al., Neural machine translation by jointly learning to align and translate, CoRR abs/1409.0473.
[3] M. Berman, A.R. Triki, et al., The Lovász-Softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural
networks, in: CVPR, 2018, pp. 4413–4421.
[4] A. Borji, M. Cheng, et al, Salient object detection: a survey, Computat. Visual Media 5 (2) (2019) 117–150.
[5] A. Borji, M.-M. Cheng, et al, Salient object detection: a benchmark, IEEE Trans. Image Proc. 24 (12) (2015) 5706–5722.
[6] A. Borji, L. Itti, State-of-the-art in visual attention modeling, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 185–207.
[7] C. Chen, S. Li, et al, Video saliency detection via spatial-temporal fusion and low-rank coherency diffusion, IEEE Trans. Image Proc. 26 (7) (2017) 3156–
3170.
[8] L. Chen, G. Papandreou, et al., Semantic image segmentation with deep convolutional nets and fully connected CRFs, CoRR abs/1412.7062.
[9] L. Chen, G. Papandreou, et al., Rethinking atrous convolution for semantic image segmentation, CoRR abs/1706.05587.
[10] L. Chen, H. Zhang, et al., SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning, in: CVPR, 2017, pp. 6298–6306.
[11] L. Chen, Y. Zhu, et al., Encoder-decoder with atrous separable convolution for semantic image segmentation, in: ECCV, 2018, pp. 833–851.
[12] L.-C. Chen, G. Papandreou, et al, Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,
IEEE Trans. Pattern Anal. Mach. Intell. 40 (4) (2018) 834–848.
[13] Y. Chen, J. Li, et al., Dual path networks, in: NIPS, 2017, pp. 4467–4475.
[14] J. Cheng, L. Dong, et al., Long short-term memory-networks for machine reading, in: EMNLP, 2016, pp. 551–561.
[15] M.-M. Cheng, N.J. Mitra, et al, Global contrast based salient region detection, IEEE Trans. Pattern Anal. Mach. Intell. 37 (3) (2015) 569–582.
[16] F. Chollet, Xception: deep learning with depthwise separable convolutions, in: CVPR, 2017, pp. 1800–1807.
[17] M. Cornia, L. Baraldi, et al, Predicting human eye fixations via an LSTM-based saliency attentive model, IEEE Trans. Image Proc. 27 (10) (2018) 5142–
5154.
[18] J. Dai, H. Qi, et al., Deformable convolutional networks, in: ICCV, 2017, pp. 764–773.
[19] J. Deng, W. Dong, et al., Imagenet: A large-scale hierarchical image database, in: CVPR, 2009, pp. 248–255.
[20] D.-P. Fan, M.-M. Cheng, et al., Salient objects in clutter: bringing salient object detection to the foreground, in: ECCV, 2018, pp. 186–202.
[21] A. Fernandez, On the Salience of Adversarial Examples, in: ISVC, 2019, pp. 221–232.
[22] K. He, X. Zhang, et al., Deep residual learning for image recognition, in: CVPR, 2016, pp. 770–778.
[23] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[24] Q. Hou, M. Cheng, et al, Deeply supervised salient object detection with short connections, IEEE Trans. Pattern Anal. Mach. Intell. 41 (4) (2019) 815–
828.
[25] Q. Hou, J. Liu, et al., Three birds one stone: a unified framework for salient object segmentation, edge detection and skeleton extraction, CoRR abs/
1803.09860.
[26] J. Hu, L. Shen, et al., Squeeze-and-excitation networks, in: CVPR, 2018, pp. 7132–7141.
[27] G. Huang, Z. Liu, et al., Densely Connected Convolutional Networks, in: CVPR, 2017, pp. 2261–2269.
[28] M. Jaderberg, K. Simonyan, et al., Spatial transformer networks, in: NIPS, 2015, pp. 2017–2025.
[29] Y. Ji, H. Zhang, et al, Salient object detection via multi-scale attention CNN, Neurocomputing 322 (2018) 130–140.
[30] X. Jia, B. De Brabandere, et al., Dynamic filter networks, in: NIPS, 2016, pp. 667–675.
[31] B. Jiang, L. Zhang, et al., Saliency detection via absorbing markov chain, in: ICCV, 2013, pp. 1665–1672.
[32] H. Jiang, J. Wang, et al., Salient object detection: a discriminative regional feature integration approach, in: CVPR, 2013, pp. 2083–2090.
[33] T. Ke, J. Hwang, et al., Adaptive affinity fields for semantic segmentation, in: ECCV, 2018, pp. 605–621.
[34] A. Kendall, V. Badrinarayanan, et al., Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene
understanding, in: BMVC, 2017, pp. 1–12.
[35] F. Lateef, Y. Ruichek, Survey on semantic segmentation using deep learning techniques, Neurocomputing 338 (2019) 321–348.
[36] Y. LeCun, Y. Bengio, et al., Deep learning, Nature 521 (7553) (2015) 436.
[37] G. Lee, Y.-W. Tai, et al., Deep saliency with encoded low level distance map and high level features, in: CVPR, 2016, pp. 660–668.
[38] G. Li, Y. Yu, Visual saliency based on multiscale deep features, in: CVPR, 2015, pp. 5455–5463.
[39] G. Li, Y. Yu, Deep contrast learning for salient object detection, in: CVPR, 2016, pp. 478–487.
[40] G. Li, Y. Yu, Visual saliency detection based on multiscale deep CNN features, IEEE Trans. Image Proc. 25 (11) (2016) 5012–5024.
[41] H. Li, G. Li, et al., ROSA: robust salient object detection against adversarial attacks, CoRR abs/1905.03434.
[42] H. Li, P. Xiong, et al., Pyramid attention network for semantic segmentation, in: BMVC, 2018, p. 285.
[43] J. Li, C. Xia, et al, A benchmark dataset and saliency-guided stacked autoencoders for video-based salient object detection, IEEE Trans. Image Proc. 27
(1) (2018) 349–364.
[44] X. Li, L. Zhao, et al, DeepSaliency: Multi-task deep neural network model for salient object detection, IEEE Trans. Image Proc. 25 (8) (2016) 3919–3930.
[45] Y. Li, X. Hou, et al., The secrets of salient object segmentation, in: CVPR, 2014, pp. 280–287.
[46] T. Lin, P. Dollár, et al., Feature pyramid networks for object detection, in: CVPR, 2017, pp. 936–944.
[47] T. Lin, P. Goyal, et al., Focal loss for dense object detection, in: ICCV, 2017, pp. 2999–3007.
[48] C. Liu, L. Chen, et al., Auto-DeepLab: hierarchical neural architecture search for semantic image segmentation, in: CVPR, 2019, pp. 82–92.
[49] L. Liu, H. Wang, et al., Crowd counting using deep recurrent spatial-aware network, in: IJCAI, 2018, pp. 849–855.
[50] N. Liu, J. Han, DHSNet: Deep hierarchical saliency network for salient object detection, in: CVPR, 2016, pp. 678–686.
[51] N. Liu, J. Han, et al., PiCANet: Learning pixel-wise contextual attention for saliency detection, in: CVPR, 2018, pp. 3089–3098.
[52] N. Liu, J. Han, et al., Predicting eye fixations using convolutional neural networks, in: CVPR, 2015, pp. 362–370.
[53] S. Liu, S.D. Mello, et al., Learning affinity via spatial propagation networks, in: NIPS, 2017, pp. 1519–1529.
[54] S. Liu, L. Qi, et al., Path aggregation network for instance segmentation, in: CVPR, 2018, pp. 8759–8768.
[55] T. Liu, Z. Yuan, et al, Learning to detect a salient object, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2) (2011) 353–367.
[56] Z. Liu, J. Li, et al, Saliency detection for unconstrained videos using superpixel-level graph and spatiotemporal propagation, IEEE Trans. Circ. Syst.
Video Techn. 27 (12) (2017) 2527–2542.
[57] Z. Liu, Q. Xiang, et al, Robust salient object detection for RGB images, Vis. Comput. 36 (9) (2020) 1823–1835.
[58] J. Long, E. Shelhamer, et al., Fully convolutional networks for semantic segmentation, in: CVPR, 2015, pp. 3431–3440.
[59] M. Mancas, V.P. Ferrera, et al, From Human Attention to Computational Attention, vol. 2, Springer, 2016.
[60] F. Milletari, N. Navab, et al., V-net: Fully convolutional neural networks for volumetric medical image segmentation, in: 3DV, 2016, pp. 565–571.
[61] V. Mnih, N. Heess, et al., Recurrent models of visual attention, in: NIPS, 2014, pp. 2204–2212.
[62] J. Pan, E. Sayrol, et al., Shallow and deep convolutional networks for saliency prediction, in: CVPR, 2016, pp. 598–606.
[63] J. Park, S. Woo, et al., BAM: Bottleneck Attention Module, in: BMVC, 2018, p. 147.
[64] F. Perazzi, P. Krähenbühl, et al., Saliency filters: Contrast based filtering for salient region detection, in: CVPR, 2012, pp. 733–740.
[65] F. Perazzi, J. Pont-Tuset, et al., A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation, in: CVPR, 2016, pp. 724–732.
[66] M. Qian, J. Qi, et al, Language-aware weak supervision for salient object detection, Pattern Recognit. 96 (106955) (2019) 1–11.
[67] X. Qin, Z. Zhang, et al., BASNet: Boundary-aware salient object detection, in: CVPR, 2019, pp. 7479–7489.
[68] Y. Qin, H. Lu, et al., Saliency detection via cellular automata, in: CVPR, 2015, pp. 110–119.
[69] V. Ramanishka, A. Das, et al., Top-down visual saliency guided by captions, in: CVPR, 2017, pp. 7206–7215.
[70] S. Ren, K. He, et al., Faster r-cnn: Towards real-time object detection with region proposal networks, in: NIPS, 2015, pp. 91–99.
[71] O. Ronneberger, P. Fischer, et al., U-net: Convolutional networks for biomedical image segmentation, in: MICCAI, 2015, pp. 234–241.
[72] A. G. Roy, N. Navab, et al., Concurrent spatial and channel ’Squeeze & Excitation’ in fully convolutional networks, in: MICCAI, 2018, pp. 421–429.
[73] A. Shrivastava, A. Gupta, et al., Training region-based object detectors with online hard example mining, in: CVPR, 2016, pp. 761–769.
[74] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556.
[75] H. Song, W. Wang, et al., Pyramid dilated deeper ConvLSTM for video salient object detection, in: ECCV, 2018, pp. 744–760.
[76] C. Szegedy, S. Ioffe, et al., Inception-v4, inception-resnet and the impact of residual connections on learning., in: AAAI, vol. 4, 2017, p. 12.
[77] C. Szegedy, W. Liu, et al., Going deeper with convolutions, in: CVPR, 2015, pp. 1–9.
[78] C. Szegedy, V. Vanhoucke, et al., Rethinking the inception architecture for computer vision, in: CVPR, 2016, pp. 2818–2826.
[79] A. Vaswani, N. Shazeer, et al., Attention is all you need, in: NIPS, 2017, pp. 5998–6008.
[80] L. Wang, H. Lu, et al., Learning to detect salient objects with image-level supervision, in: Proceedings of the CVPR, 2017, pp. 3796–3805.
[81] T. Wang, A. Borji, et al., A stagewise refinement model for detecting salient objects in images, in: CVPR, 2017, pp. 4019–4028.
[82] T. Wang, L. Zhang, et al., Detect globally, refine locally: a novel approach to saliency detection, in: CVPR, 2018, pp. 3127–3135.
[83] W. Wang, Q. Lai, et al., Salient object detection in the deep learning era: an in-depth survey, CoRR abs/1904.09146.
[84] W. Wang, J. Shen, et al., Salient object detection driven by fixation prediction, in: CVPR, 2018, pp. 1711–1720.
[85] W. Wang, J. Shen, et al., Revisiting video saliency: a large-scale benchmark and a new model, in: CVPR, 2018, pp. 4894–4903.
[86] W. Wang, J. Shen, et al, Video salient object detection via fully convolutional networks, IEEE Trans. Image Proc. 27 (1) (2018) 38–49.
[87] X. Wang, R.B. Girshick, et al., Non-Local Neural Networks, in: CVPR, 2018, pp. 7794–7803.
[88] Y. Wei, X. Liang, et al, Stc: A simple to complex framework for weakly-supervised semantic segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 39
(11) (2017) 2314–2320.
[89] Y. Wei, F. Wen, et al., Geodesic saliency using background priors, in: ECCV, 2012, pp. 29–42.
[90] Y. Wei, H. Xiao, et al., Revisiting dilated convolution: a simple approach for weakly-and semi-supervised semantic segmentation, in: CVPR, 2018, pp.
7268–7277.
[91] S. Woo, J. Park, et al., CBAM: Convolutional block attention module, in: ECCV, 2018, pp. 3–19.
[92] T. Xiao, Y. Liu, et al., Unified perceptual parsing for scene understanding, in: ECCV, 2018, pp. 418–434.
[93] S. Xie, Z. Tu, Holistically-nested edge detection, in: ICCV, 2015, pp. 1395–1403.
[94] C. Yang, L. Zhang, et al., Saliency detection via graph-based manifold ranking, in: CVPR, 2013, pp. 3166–3173.
[95] M. Yang, K. Yu, et al., DenseASPP for semantic segmentation in street scenes, in: CVPR, 2018, pp. 3684–3692.
[96] C. Yu, J. Wang, et al., BiSeNet: Bilateral segmentation network for real-time semantic segmentation, in: ECCV, 2018, pp. 334–349.
[97] C. Yu, J. Wang, et al., Learning a discriminative feature network for semantic segmentation, in: CVPR, 2018, pp. 1857–1866.
[98] D. Zhang, J. Han, et al., Supervision by fusion: towards unsupervised learning of deep salient object detector, in: ICCV, 2017, pp. 4068–4076.
[99] H. Zhang, I.J. Goodfellow, et al., Self-attention generative adversarial networks, in: ICML, 2019, pp. 7354–7363.
[100] H. Zhang, I.J. Goodfellow, et al., Self-attention generative adversarial networks, in: ICML, 2019, pp. 7354–7363.
[101] J. Zhang, S. Sclaroff, et al., Minimum barrier salient object detection at 80 FPS, in: ICCV, 2015, pp. 1404–1412.
[102] J. Zhang, T. Zhang, et al., Deep unsupervised saliency detection: a multiple noisy labeling perspective, in: CVPR, 2018, pp. 9029–9038.
[103] L. Zhang, J. Dai, et al., A bi-directional message passing model for salient object detection, in: CVPR, 2018, pp. 1741–1750.
[104] L. Zhang, J. Zhang, et al., CapSal: Leveraging captioning to boost semantics for salient object detection, in: CVPR, 2019, pp. 6024–6033.
[105] P. Zhang, D. Wang, et al., Amulet: aggregating multi-level convolutional features for salient object detection, in: ICCV, 2017, pp. 202–211.
[106] P. Zhang, D. Wang, et al., Learning uncertain convolutional features for accurate saliency detection, in: ICCV, 2017, pp. 212–221.
[107] R. Zhang, S. Tang, et al., Global-residual and local-boundary refinement networks for rectifying scene parsing predictions, in: IJCAI, 2017, pp. 3427–
3433.
[108] X. Zhang, T. Wang, et al., Progressive attention guided recurrent network for salient object detection, in: CVPR, 2018, pp. 714–722.
[109] Y. Zhang, D. Zhou, et al., Single-image crowd counting via multi-column convolutional neural network, in: CVPR, 2016, pp. 589–597.
[110] H. Zhao, J. Shi, et al., Pyramid scene parsing network, in: CVPR, 2017, pp. 2881–2890.
[111] K. Zhao, S. Gao, et al., Optimizing the F-Measure for threshold-free salient object detection, in: ICCV, 2019, pp. 8848–8856.
[112] R. Zhao, W. Ouyang, et al., Saliency detection by multi-context deep learning, in: CVPR, 2015, pp. 1265–1274.
[113] T. Zhao, X. Wu, Pyramid feature attention network for saliency detection, in: CVPR, 2019, pp. 3085–3094.
[114] W. Zhu, S. Liang, et al., Saliency optimization from robust background detection, in: CVPR, 2014, pp. 2814–2821.