Abstract—As an essential problem in computer vision, salient object detection (SOD) has attracted an increasing amount of research
attention over the years. Recent advances in SOD are predominantly led by deep learning-based solutions (named deep SOD). To
enable in-depth understanding of deep SOD, in this paper, we provide a comprehensive survey covering various aspects, ranging from
algorithm taxonomy to unsolved issues. In particular, we first review deep SOD algorithms from different perspectives, including
network architecture, level of supervision, learning paradigm, and object-/instance-level detection. Following that, we summarize and
analyze existing SOD datasets and evaluation metrics. Then, we benchmark a large group of representative SOD models, and provide
detailed analyses of the comparison results. Moreover, we study the performance of SOD algorithms under different attribute settings,
which has not been thoroughly explored previously, by constructing a novel SOD dataset with rich attribute annotations covering various
salient object types, challenging factors, and scene categories. We further analyze, for the first time in the field, the robustness of SOD
models to random input perturbations and adversarial attacks. We also look into the generalization and difficulty of existing SOD
datasets. Finally, we discuss several open issues of SOD and outline future research directions. All the saliency prediction maps, our
constructed dataset with annotations, and codes for evaluation are publicly available at https://github.com/wenguanwang/SODsurvey.
1 INTRODUCTION
Fig. 1. A brief chronology of SOD. The very first SOD models date back to the work of Liu et al. [30] and Achanta et al. [31]. The first incorporation of
deep learning techniques into SOD models was in 2015. Listed methods are milestones, which are typically highly cited. See §1.1 for more details.
TABLE 1
Summary of previous reviews. For each work, the publication information and coverage are provided. See §1.2 for more detailed descriptions.
The history of SOD is relatively short and can be traced back to [30] and [31]. The rise of SOD has been driven by a wide range of object-level computer vision applications. Unlike fixation prediction (FP) models, which only predict sparse eye-fixation locations, SOD models aim to detect the whole entities of visually attractive objects with precise boundaries. Most traditional, non-deep SOD models [36], [53] rely on low-level features and certain heuristics (e.g., color contrast [32], background prior [54]). To obtain uniformly highlighted salient objects and clear object boundaries, an over-segmentation process that generates regions [55], super-pixels [56], [57], or object proposals [58] is often integrated into these models. Please see [44] for a more comprehensive overview.

With the compelling success of deep learning technologies in computer vision, more and more deep SOD methods have sprung up since 2015. Earlier deep SOD models utilized multi-layer perceptron (MLP) classifiers to predict the saliency score of deep features extracted from each image processing unit [27]–[29]. Later, a more effective and efficient form, i.e., the fully convolutional network (FCN)-based model, became the mainstream SOD architecture. Some recent methods [41], [42] also introduced Capsules [59] into SOD to comprehensively address object property modeling. A brief chronology of SOD is shown in Fig. 1.

Scope of the survey. Despite its short history, research in deep SOD has produced hundreds of papers, making it impractical (and fortunately unnecessary) to review all of them. Instead, we comprehensively select influential papers published in prestigious journals and conferences. This survey mainly focuses on the major progress of the last five years, but for completeness and better readability, some early related works are also included. Due to limitations on space and our knowledge, we apologize to those authors whose works are not included in this paper. It is worth noting that we restrict this survey to single-image SOD methods, and leave RGB-D SOD, co-saliency detection, video SOD, etc., as separate topics.

1.2 Related Previous Reviews and Surveys

Table 1 lists existing surveys that are related to ours. Among them, Borji et al. [44] reviewed SOD methods preceding 2015, and thus did not cover recent deep learning-based solutions. Zhang et al. [46] reviewed methods for co-saliency detection, i.e., detecting common salient objects from multiple relevant images. Cong et al. [47] reviewed several extended SOD tasks, including RGB-D SOD, co-saliency detection, and video SOD. Han et al. [48] looked into several sub-directions of object detection, and outlined recent progress in objectness detection, SOD, and category-specific object detection. Borji et al. summarized both heuristic [43] and deep models [49] for FP. Nguyen et al. [45] focused on categorizing the applications of visual saliency (including both SOD and FP) in different areas. Finally, a more recently published survey [50] covers both traditional non-deep SOD methods and deep ones until 2017, and discusses their relation to several other closely-related research areas, such as special-purpose object detection and segmentation.

Different from previous SOD surveys, which focus on earlier non-deep learning SOD methods [44], other related fields [43], [47]–[49], practical applications [45], or a limited number of deep SOD models [50], this work systematically and comprehensively reviews recent advances in the field. It features in-depth analyses and discussions on various aspects, many of which, to the best of our knowledge, have never been explored in this field. In particular, we comprehensively summarize and discuss existing deep SOD methods under several proposed taxonomies (§2); review datasets (§3) and evaluation metrics (§4) with their pros and cons; provide a deeper understanding of SOD models through an attribute-based evaluation (§5.3); discuss the influence of input perturbation (§5.4); analyze the robustness of deep SOD models to adversarial attacks (§5.5); study the generalization and difficulty of existing SOD datasets (§5.6); and offer insight into essential open issues, challenges, and future directions (§6). We expect our survey to provide novel insight and inspiration that will facilitate the understanding of deep SOD, and foster research on the open issues raised.
TABLE 2
Taxonomies and representative publications of deep SOD methods. See §2 for more detailed descriptions.
Category — Publications
Network Architectures (§2.1)
  Multi-layer perceptron (MLP)-based:
    1) Super-pixel/patch-based [29], [60], [27], [61]
    2) Object proposal-based [28], [37], [62]
  Fully convolutional network (FCN)-based:
    1) Single-stream [63], [64], [65], [66], [67], [68], [69]
    2) Multi-stream [70], [71], [72], [73], [74]
    3) Side-fusion [39], [75], [76], [77], [78], [79], [80], [81], [82]
    4) Bottom-up/top-down [38], [83], [84], [85], [86], [87], [40], [88], [89], [90], [91], [92], [93], [94], [95]
    5) Branched [96], [97], [98], [99], [100], [101], [102], [103]
  Hybrid network-based: [104], [105]
  Capsule-based: [41], [42]
Level of Supervision (§2.2)
  Fully-supervised: all others
  Un-/Weakly-supervised:
    1) Category-level [97], [68], [69], [81]
    2) Pseudo pixel-level [83], [98], [67], [99]
Learning Paradigm (§2.3)
  Single-task learning (STL): all others
  Multi-task learning (MTL):
    1) Salient object subitizing [37], [77], [79]
    2) Fixation prediction [96], [87]
    3) Image classification [97], [98]
    4) Semantic segmentation [63], [103]
    5) Contour/edge detection [75], [99], [89], [91], [92], [93], [101], [82], [102]
    6) Image captioning [100]
Object-/Instance-Level (§2.4)
  Object-level: all others
  Instance-level: [37], [70]
1.3 Our Contributions
Our contributions in this paper are summarized as follows:
1) A systematic review of deep SOD models from various perspectives. We categorize and summarize existing deep SOD models according to network architecture, level of supervision, learning paradigm, etc. The proposed taxonomies aim to help researchers gain a deeper understanding of the key features of deep SOD models.
2) An attribute-based performance evaluation of SOD models. We compile a hybrid dataset and provide annotated attributes for object categories, scene categories, and challenging factors. By evaluating several representative SOD models on it, we uncover the strengths and weaknesses of deep and non-deep approaches, opening up promising directions for future efforts.
3) An analysis of the robustness of SOD models against general input perturbations. To study the robustness of SOD models, we investigate the effects of various perturbations on the final performance of deep and non-deep SOD models. Some results are somewhat unexpected.
4) The first known adversarial attack analysis for SOD models. We further examine the robustness of SOD models against intentionally designed perturbations, i.e., adversarial attacks. The specially designed attacks and evaluations can serve as baselines for further studying the robustness and transferability of deep SOD models.
5) Cross-dataset generalization study. To analyze the generalization and difficulty of existing SOD datasets in depth, we conduct a cross-dataset generalization study that quantitatively reveals the dataset bias.
6) Overview of open issues and future directions. We thoroughly look over several essential issues (i.e., model design, dataset collection, etc.), shedding light on potential directions for future research.
These contributions together comprise an exhaustive, up-to-date, and in-depth survey, and significantly differentiate it from previous review papers.

The rest of the paper is organized as follows. §2 explains the proposed taxonomies, each accompanied by one or two of the most representative models. §3 examines the most notable SOD datasets, whereas §4 describes several widely used SOD metrics. §5 benchmarks several deep SOD models and provides in-depth analyses. §6 provides further discussions and presents open issues and future research directions for the field. Finally, §7 concludes the paper.

2 DEEP LEARNING BASED SOD MODELS
Before reviewing recent deep SOD models in detail, we first provide a common formulation of the image-based SOD problem. Given an input image I ∈ R^{W×H×3} of size W×H, an SOD model f maps the input image I to a continuous saliency map S = f(I) ∈ [0, 1]^{W×H}. For learning-based SOD, the model f is learned from a set of training samples. Given a set of static images I = {I_n ∈ R^{W×H×3}}_n and corresponding binary SOD ground-truth masks G = {G_n ∈ {0, 1}^{W×H}}_n, the goal of learning is to find f ∈ F that minimizes the prediction error Σ_n ℓ(S_n, G_n), where ℓ is a certain distance measure (e.g., one of those defined in §4), S_n = f(I_n), and F is the set of potential mapping functions. Deep SOD methods typically model f through modern deep learning techniques, as will be reviewed later in this section. The ground-truths G can be collected by different methodologies, i.e., direct human annotation or eye-fixation-guided labeling, and may have different formats, i.e., pixel-wise or bounding-box annotations, which will be discussed in §3.

In Table 2, we categorize recent deep SOD models according to four taxonomies, considering network architecture (§2.1), level of supervision (§2.2), learning paradigm (§2.3), and whether they work at an object or instance level (§2.4). In the following, each category is elaborated on and exemplified by one or two of the most representative models. Table 3 summarizes the essential characteristics of recent SOD models.
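To make the formulation above concrete, the following is a minimal, hedged sketch of how such a mapping f can be trained with a pixel-wise loss. The tiny network, the layer sizes, and the helper names are illustrative assumptions, not the setup of any specific surveyed model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSODNet(nn.Module):
    """Illustrative encoder-decoder f: image -> saliency map in [0, 1]^(W x H)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Conv2d(64, 1, 1)  # 1-channel saliency logits

    def forward(self, x):
        h, w = x.shape[-2:]
        logits = self.decoder(self.encoder(x))
        # Upsample back to input resolution and squash to [0, 1].
        return torch.sigmoid(F.interpolate(logits, size=(h, w),
                                           mode='bilinear', align_corners=False))

def training_step(model, images, masks, optimizer):
    """One step of minimizing sum_n l(S_n, G_n) with l = binary cross-entropy."""
    preds = model(images)                       # S_n in [0, 1]^(W x H)
    loss = F.binary_cross_entropy(preds, masks) # masks: binary ground-truth G_n
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```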
TABLE 3
Summary of essential characteristics for popular SOD methods. Here, ‘#Training’ is the number of training images, and ‘CRF’ denotes whether the
predictions are post-processed by conditional random field [106]. See §2 for more detailed descriptions.
Model Publ. Architecture Backbone Supervision Learning Level Training data #Training CRF
2015
MCDL [29] CVPR MLP+super-pixel GoogleNet Fully-Sup. STL Object MSRA10K [107] 8,000
LEGS [28] CVPR MLP+segment - Fully-Sup. STL Object MSRA-B [30]+PASCAL-S [108] 3,000+340
MDF [27] CVPR MLP+segment - Fully-Sup. STL Object MSRA-B [30] 2,500
ELD [60] CVPR MLP+super-pixel VGGNet Fully-Sup. STL Object MSRA10K [107] ∼9,000
DHSNet [38] CVPR FCN VGGNet Fully-Sup. STL Object MSRA10K [107]+DUT-OMRON [56] 6,000+3,500
DCL [104] CVPR FCN VGGNet Fully-Sup. STL Object MSRA-B [30] 2,500 X
RACDNN [64] CVPR FCN VGGNet Fully-Sup. STL Object DUT-OMRON [56]+NJU2000 [109]+RGBD [110] 10,565
2016
SU [96] CVPR FCN VGGNet Fully-Sup. MTL Object MSRA10K [107]+SALICON [111] 10,000+15,000 X
MAP [37] CVPR MLP+obj. prop. VGGNet Fully-Sup. MTL Instance SOS [112] ∼5,500
SSD [62] ECCV MLP+obj. prop. AlexNet Fully-Sup. STL Object MSRA-B [30] 2,500
CRPSD [105] ECCV FCN VGGNet Fully-Sup. STL Object MSRA10K [107] 10,000
RFCN [63] ECCV FCN VGGNet Fully-Sup. MTL Object PASCAL VOC 2010 [113]+MSRA10K [107] 10,103+10,000
MSRNet [70] CVPR FCN VGGNet Fully-Sup. STL Instance MSRA-B [30]+HKU-IS [27] (+ILSO [70]) 2,500+2,500 (+500) X
DSS [39] CVPR FCN VGGNet Fully-Sup. STL Object MSRA-B [30]+HKU-IS [27] 2,500 X
WSS [97] CVPR FCN VGGNet Weakly-Sup. MTL Object ImageNet [114] 456k X
DLS [65] CVPR FCN VGGNet Fully-Sup. STL Object MSRA10K [107] 10,000
NLDF [75] CVPR FCN VGGNet Fully-Sup. MTL Object MSRA-B [30] 2,500 X
2017
DSOS [77] ICCV FCN VGGNet Fully-Sup. MTL Object SOS [112] 6,900
Amulet [76] ICCV FCN VGGNet Fully-Sup. STL Object MSRA10K [107] 10,000
FSN [72] ICCV FCN VGGNet Fully-Sup. STL Object MSRA10K [107] 10,000
SBF [83] ICCV FCN VGGNet Un-Sup. STL Object MSRA10K [107] 10,000
SRM [71] ICCV FCN ResNet Fully-Sup. STL Object DUTS [97] 10,553
UCF [66] ICCV FCN VGGNet Fully-Sup. STL Object MSRA10K [107] 10,000
RADF [78] AAAI FCN VGGNet Fully-Sup. STL Object MSRA10K [107] 10,000 X
ASMO [98] AAAI FCN ResNet101 Weakly-Sup. MTL Object MS COCO [115]+MSRA-B [30]+HKU-IS [27] 82,783+2,500+2,500 X
LICNN [68] AAAI FCN VGGNet Weakly-Sup. STL Object ImageNet [114] 456k
BDMP [84] CVPR FCN VGGNet Fully-Sup. STL Object DUTS [97] 10,553
DUS [67] CVPR FCN ResNet101 Un-Sup. MTL Object MSRA-B [30] 2,500
2018
DGRL [85] CVPR FCN ResNet50 Fully-Sup. STL Object DUTS [97] 10,553
PAGR [86] CVPR FCN VGGNet19 Fully-Sup. STL Object DUTS [97] 10,553
RSDNet [79] CVPR FCN ResNet101 Fully-Sup. MTL Object PASCAL-S [108] 425
ASNet [87] CVPR FCN VGGNet Fully-Sup. MTL Object SALICON [111]+MSRA10K [107]+DUT-OMRON [56] 15,000+10,000+5,168
PiCANet [40] CVPR FCN VGGNet/ResNet50 Fully-Sup. STL Object DUTS [97] 10,553 X
C2S-Net [99] ECCV FCN VGGNet Weakly-Sup. MTL Object MSRA10K [107]+Web 10,000+20,000
RAS [88] ECCV FCN VGGNet Fully-Sup. STL Object MSRA-B [30] 2,500
SuperVAE [69] AAAI FCN N/A Un-Sup. STL Object N/A N/A
DEF [74] AAAI FCN ResNet101 Fully-Sup. STL Object DUTS [97] 10,553
AFNet [89] CVPR FCN VGGNet16 Fully-Sup. MTL Object DUTS [97] 10,553
BASNet [90] CVPR FCN ResNet-34 Fully-Sup. STL Object DUTS [97] 10,553
CapSal [100] CVPR FCN ResNet101 Fully-Sup. MTL Object COCO-CapSal [100]/DUTS [97] 5,265/10,553
CPD-R [80] CVPR FCN ResNet50 Fully-Sup. STL Object DUTS [97] 10,553
MLSLNet [91] CVPR FCN VGG16 Fully-Sup. MTL Object DUTS [97] 10,553
† MWS [81] CVPR FCN N/A Weakly-Sup. STL Object ImageNet DET [114]+MS COCO [115] 456k+82,783
2019
2.1 Representative Network Architectures for SOD
Based on the primary network architectures adopted, we classify deep SOD models into four categories, namely MLP-based (§2.1.1), FCN-based (§2.1.2), hybrid network-based (§2.1.3), and Capsule-based (§2.1.4).

2.1.1 Multi-Layer Perceptron (MLP)-Based Methods
MLP-based methods leverage image subunits (i.e., super-pixels/patches [29], [60], [61] and generic object proposals [27], [28], [37], [62]) as processing units. They feed deep features extracted from the subunits into an MLP classifier for saliency score prediction (Fig. 2(a)).
1) Super-pixel/patch-based methods use regular (patch) or nearly-regular (super-pixel) image decomposition. As an example of regular decomposition, MCDL [29] uses two pathways to extract local and global context from two super-pixel-centered windows of different sizes. The global and local feature vectors are fed into an MLP for classifying background and saliency. In contrast, SuperCNN [61] constructs two hand-crafted input feature sequences for each irregular super-pixel, and uses two separate CNN columns to produce saliency scores from the feature sequences, respectively. Regular image decomposition accelerates processing, so most methods in this category are based on regular decomposition.
2) Object proposal-based methods leverage object proposals [27], [28] or bounding-boxes [37], [62] as basic processing units in order to better encode object information. For instance, MAP [37] uses a CNN model to generate a set of scored bounding-boxes, then selects an optimized compact subset of bounding-boxes as the salient objects. Note that this kind of method typically produces coarse SOD results due to the lack of object boundary information.
Though MLP-based SOD methods greatly outperform their non-deep counterparts, they cannot fully leverage essential spatial information and are quite time-consuming, as they need to process all visual subunits one-by-one.
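As a rough illustration of the MLP-based pipeline described above, the sketch below scores pre-extracted super-pixel features with a small MLP. The feature dimension and the feature-extraction helper are illustrative assumptions, not the design of MCDL or SuperCNN.

```python
import torch
import torch.nn as nn

class SuperpixelSaliencyMLP(nn.Module):
    """Classifies each super-pixel's deep feature as salient vs. background."""
    def __init__(self, feat_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 1),  # one saliency logit per super-pixel
        )

    def forward(self, superpixel_feats):
        # superpixel_feats: (num_superpixels, feat_dim), e.g. CNN features
        # cropped around each super-pixel (hypothetical extractor not shown).
        return torch.sigmoid(self.mlp(superpixel_feats)).squeeze(-1)

# Usage sketch: the per-unit scores are pasted back into the image to form a
# coarse saliency map.
# feats = extract_superpixel_features(image, superpixels)   # assumed helper
# scores = SuperpixelSaliencyMLP(feats.shape[1])(feats)      # one score per unit
```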
Fig. 2. Categorization of previous deep SOD models according to the adopted network architecture. (a) MLP-based methods. (b)-(f) FCN-based
methods, mainly using (b) single-stream network, (c) multi-stream network, (d) side-out fusion network, (e) bottom-up/top-down network, and (f)
branch network architectures. (g) Hybrid network-based methods. (h) Capsule-based methods. See §2.1 for more detailed descriptions.
2.1.2 Fully Convolutional Network (FCN)-Based Methods
To address the limitations of MLP-based methods, recent solutions adopt the FCN architecture [117], leading to end-to-end spatial saliency representation learning and fast saliency prediction within a single feed-forward process. FCN-based methods are now dominant in the field. Typical architectures can be further classified as: single-stream, multi-stream, side-fusion, bottom-up/top-down, and branched networks.
1) Single-stream network is the most standard architecture, having a stack of convolutional layers interleaved with pooling and non-linear activation operations (see Fig. 2(b)). It takes a whole image as input, and directly outputs a pixel-wise probabilistic map highlighting salient objects. For example, UCF [66] makes use of an encoder-decoder network architecture for finer-resolution saliency prediction. It incorporates a reformulated dropout in the encoder to learn uncertain features, and a hybrid upsampling scheme in the decoder to avoid checkerboard artifacts.
2) Multi-stream network, as depicted in Fig. 2(c), typically consists of multiple network streams to explicitly learn multi-scale saliency features from multi-resolution inputs. The multi-stream outputs are fused to form a final prediction. DCL [104], as one of the earliest attempts in this direction, contains two streams, which produce pixel- and region-level SOD estimations, respectively.
3) Side-fusion network fuses multi-layer responses of a backbone network together for SOD prediction, making use of the complementary information in the inherent multi-scale representations of the CNN hierarchy (Fig. 2(d)). Side-outputs are typically supervised by the ground-truth, leading to a deep supervision strategy [118]. As a well-known side-fusion network based SOD model, DSS [39] adds short connections from deeper side-outputs to shallower ones. In this way, higher-level features help lower side-outputs to better locate salient regions, and lower-level features enrich deeper side-outputs with finer details.
4) Bottom-up/top-down network refines rough saliency maps in the feed-forward pass by gradually incorporating spatial-detail-rich features from lower layers, and produces the finest saliency map at the top-most layer (Fig. 2(e)), which resembles U-Net [119] for semantic segmentation. This network architecture was first adopted by PiCANet [40], which hierarchically embeds global and local pixel-wise attention modules to selectively attend to informative context.
5) Branched networks typically address multi-task learning for more robust saliency pattern modeling. They have a single-input-multiple-output structure, where bottom layers are shared to process a common input and top layers are specialized for different tasks (Fig. 2(f)). For example, C2S-Net [99] is constructed by adding a pre-trained contour detection model [120] to a main SOD branch. The two branches are then alternately trained for the two tasks, i.e., SOD and contour detection.

2.1.3 Hybrid Network-Based Methods
Some other models combine both MLP- and FCN-based subnets to produce edge-preserving results with multi-scale context (Fig. 2(g)). Combining pixel-level and region-level saliency cues is a promising strategy to yield improved performance, though it introduces extra computational costs. CRPSD [105] consolidates this idea. It combines pixel- and region-level saliency. The former is generated by fusing the last and penultimate side-output features of an FCN, while the latter is obtained by applying an existing SOD model [29] to image regions. Only the FCN and fusion layers are trainable.

2.1.4 Capsule-Based Methods
Recently, Hinton et al. [59] proposed a new family of neural networks, named Capsules. Capsules are made up of groups of neurons that accept and output vectors, as opposed to the scalar values of CNNs, allowing entity properties to be comprehensively modeled. Some researchers have thus been inspired to explore Capsules in SOD [41], [42] (Fig. 2(h)). For instance, TSPOANet [41] emphasizes part-object relations using a two-stream capsule network. The input features of the capsules are extracted from a CNN and transformed into low-level capsules. These are then assigned to high-level capsules, and finally recognized as salient or background.
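The side-fusion and deep-supervision ideas of §2.1.2 can be summarized in a short sketch. The backbone split points, channel counts, and fusion layer below are illustrative assumptions (loosely inspired by DSS-style short connections), not the exact design of any surveyed model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class SideFusionSOD(nn.Module):
    """FCN with side outputs from several backbone stages, fused into a final map."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(pretrained=False).features
        # Split VGG16 into stages; the stage boundaries here are an assumption.
        self.stages = nn.ModuleList([vgg[:10], vgg[10:17], vgg[17:24], vgg[24:31]])
        self.side_heads = nn.ModuleList([nn.Conv2d(c, 1, 1) for c in (128, 256, 512, 512)])
        self.fuse = nn.Conv2d(4, 1, 1)  # learnable fusion of the side outputs

    def forward(self, x):
        h, w = x.shape[-2:]
        sides, feat = [], x
        for stage, head in zip(self.stages, self.side_heads):
            feat = stage(feat)
            sides.append(F.interpolate(head(feat), size=(h, w),
                                       mode='bilinear', align_corners=False))
        fused = self.fuse(torch.cat(sides, dim=1))
        # Deep supervision: every side output (and the fused map) is trained
        # against the ground-truth mask with a pixel-wise loss.
        return [torch.sigmoid(s) for s in sides] + [torch.sigmoid(fused)]
```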
2.2 Level of Supervision
Based on the type of supervision, deep SOD models can be classified as either fully-supervised or weakly-/unsupervised.

2.2.1 Fully-Supervised Methods
Most deep SOD models are trained with large-scale pixel-level human annotations, which are time-consuming and expensive to acquire. Moreover, models trained on fine-labeled datasets tend to overfit and generalize poorly to real-life images [67]. Thus, training SOD with weaker annotations has become an increasingly popular research direction.

2.2.2 Weakly-/Unsupervised Methods
To get rid of laborious manual labeling, several weak supervision forms have been explored in SOD, including image-level category labels [68], [97], object contours [99], image captions [81], and pseudo ground-truth masks generated by non-learning SOD methods [67], [83], [98].
1) Category-level supervision. It has been shown that deep features trained with only image-level labels also provide information on object locations [121], [122], making them promising supervision signals for SOD training. WSS [97], as a typical example, first pre-trains a two-branch network, where one branch is used to predict image labels based on ImageNet [114], and the other estimates SOD maps. The estimated maps are refined by CRF and used to further fine-tune the SOD branch.
2) Pseudo pixel-level supervision. Though informative, image-level labels are weak. Some researchers therefore instead use traditional non-learning SOD methods [67], [83], [98], or contour information [99], to generate noisy yet finer-grained cues for training. For instance, SBF [83] fuses weak saliency maps from a set of prior heuristic SOD models [35], [123], [124] at intra- and inter-image levels to generate supervision signals. C2S-Net [99] trains the SOD branch with pixel-wise salient object masks generated from the outputs of the contour branch [125] using CEDN [120]. The contour and SOD branches alternately update each other and progressively output finer SOD predictions.

2.3 Learning Paradigm
From the perspective of learning paradigms, SOD networks can be divided into single-task learning (STL) and multi-task learning (MTL) methods.

2.3.1 Single-Task Learning (STL) Based Methods
In machine learning, the standard practice is to learn one task at a time [126], i.e., STL. Most deep SOD methods belong to this realm of learning, i.e., they utilize supervision from a single knowledge domain (SOD or another related field such as image classification [68]) for training.

2.3.2 Multi-Task Learning (MTL) Based Methods
Inspired by the human learning process, where knowledge learned from related tasks can assist the learning of a new task, MTL [126] aims to improve the performance of multiple related tasks by learning them simultaneously. Benefiting from the extra knowledge of related tasks, models can gain improved generalizability. An extra advantage lies in the sharing of samples among tasks, which alleviates the lack of data for training heavily parameterized models. These are the core motivations of MTL-based SOD models, and branched architectures (see §2.1.2) are usually adopted.
1) Salient object subitizing. The ability of humans to rapidly enumerate a small number of items is known as subitizing [112], [127]. Inspired by this, some works learn salient object subitizing and detection simultaneously [37], [77], [79]. RSDNet [79] represents the latest advance in this direction. It addresses detection, ranking, and subitizing of salient objects in a unified framework.
2) Fixation prediction aims to predict human eye-fixation locations in visual scenes. Due to its close relation with SOD, learning shared knowledge from these two tasks can improve the performance of both. For example, ASNet [87] derives fixation information from upper network layers as a high-level understanding of the scene. Then, fine-grained object-level saliency is progressively optimized under the guidance of the fixation in a top-down manner.
3) Image classification. Image-level tags are valuable for SOD, as they provide the category information of dominant objects in the images, which are very likely to be the salient regions [97]. Inspired by this, some SOD models learn image classification as an auxiliary task. For example, ASMO [98] leverages class activation maps from a neural classifier and saliency maps from previous non-learning SOD methods to train the SOD network in an iterative manner.
4) Semantic segmentation is per-pixel semantic prediction. Though SOD is class-agnostic, high-level semantics play a crucial role in saliency modeling. Thus, the task of semantic segmentation can also be integrated into SOD learning. A recent SOD model, SSNet [103], is developed upon this idea. It uses a saliency aggregation module to predict a saliency score for each category. Then, a segmentation network is used to produce segmentation masks for all the categories. These masks are finally aggregated (according to the corresponding saliency scores) to produce an SOD map.
5) Contour/edge detection refers to the task of detecting obvious object boundaries in images, which are informative of salient objects. Thus, it is also explored in SOD modeling. For example, PAGE-Net [92] learns an edge detection module and embeds edge cues into the main SOD stream in a top-down manner, leading to better edge-preserving results.
6) Image captioning can provide extra knowledge about the main content of visual scenes, enabling SOD models to better capture high-level semantics. This has been explored in CapSal [100], which incorporates semantic context from a captioning network with local-global visual cues to achieve improved performance for detecting salient objects.

2.4 Object-/Instance-Level SOD
According to whether or not they can identify different salient object instances, current deep SOD models can be categorized into object-level and instance-level methods.

2.4.1 Object-Level Methods
Most deep SOD models are object-level methods, i.e., designed to detect pixels that belong to salient objects without being aware of individual object instances.
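Looking back at the MTL paradigm of §2.3.2, the essence of a branched multi-task SOD model is a shared backbone with task-specific heads trained under a weighted joint loss. The sketch below (SOD plus an auxiliary edge branch, with an assumed trade-off weight) is an illustrative assumption, not the configuration of any specific surveyed model.

```python
import torch.nn.functional as F

def multitask_loss(saliency_pred, saliency_gt, edge_pred, edge_gt, edge_weight=0.5):
    """Joint objective for a branched network: shared encoder, two output heads.

    saliency_pred / edge_pred: sigmoid maps in [0, 1]; *_gt: binary targets.
    edge_weight is an assumed trade-off hyper-parameter.
    """
    l_sal = F.binary_cross_entropy(saliency_pred, saliency_gt)   # main SOD task
    l_edge = F.binary_cross_entropy(edge_pred, edge_gt)          # auxiliary task
    return l_sal + edge_weight * l_edge
```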
TABLE 4
Statistics of popular SOD datasets, including the number of images, number of salient objects per image, area ratio of the salient objects in
images, annotation type, image resolution, and existence of fixation data. See §3 for more detailed descriptions.
Dataset Year Publ. #Img. #Obj. Obj. Area(%) SOD Annotation Resolution Fix.
MSRA-A [30] 2007 CVPR 1,000/20,840 1-2 - bounding-box object-level -
MSRA-B [30] 2007 CVPR 5,000 1-2 20.82±10.29 bounding-box object-level, pixel-wise object-level max(w, h) = 400, min(w, h) = 126
Early
SED1 [128] 2007 CVPR 100 1 26.70±14.26 pixel-wise object-level max(w, h) = 465, min(w, h) = 125
SED2 [128] 2007 CVPR 100 2 21.42±18.41 pixel-wise object-level max(w, h) = 300, min(w, h) = 144
ASD [31] 2009 CVPR 1,000 1-2 19.89±9.53 pixel-wise object-level max(w, h) = 400, min(w, h) = 142
Modern&Popular
SOD [129] 2010 CVPR-W 300 1-4+ 27.99±19.36 pixel-wise object-level max(w, h) = 481, min(w, h) = 321
MSRA10K [107] 2015 TPAMI 10,000 1-2 22.21±10.09 pixel-wise object-level max(w, h) = 400, min(w, h) = 144
ECSSD [55] 2015 TPAMI 1,000 1-4+ 23.51±14.02 pixel-wise object-level max(w, h) = 400, min(w, h) = 139
DUT-OMRON [56] 2013 CVPR 5,168 1-4+ 14.85±12.15 pixel-wise object-level max(w, h) = 401, min(w, h) = 139 X
PASCAL-S [108] 2014 CVPR 850 1-4+ 24.23±16.70 pixel-wise object-level max(w, h) = 500, min(w, h) = 139 X
HKU-IS [27] 2015 CVPR 4,447 1-4+ 19.13±10.90 pixel-wise object-level max(w, h) = 500, min(w, h) = 100
DUTS [97] 2017 CVPR 15,572 1-4+ 23.17±15.52 pixel-wise object-level max(w, h) = 500, min(w, h) = 100
SOS [112] 2015 CVPR 6,900 0-4+ 41.22±25.35 object number, bounding-box (train set) max(w, h) = 6132, min(w, h) = 80
MSO [112] 2015 CVPR 1,224 0-4+ 39.51±24.85 object number, bounding-box instance-level max(w, h) = 3888, min(w, h) = 120
Special
ILSO [70] 2017 CVPR 1,000 1-4+ 24.89±12.59 pixel-wise instance-level max(w, h) = 400, min(w, h) = 142
XPIE [130] 2017 CVPR 10,000 1-4+ 19.42±14.39 pixel-wise object-level, geographic information max(w, h) = 500, min(w, h) = 130 X
SOC [131] 2018 ECCV 6,000 0-4+ 21.36±16.88 pixel-wise instance-level, object category, attribute max(w, h) = 849, min(w, h) = 161
COCO-CapSal [100] 2019 CVPR 6,724 1-4+ 23.74±17.00 pixel-wise object-level, image caption max(w, h) = 640, min(w, h) = 480
HRSOD [73] 2019 ICCV 2,010 1-4+ 21.13±15.14 pixel-wise object-level max(w, h) = 10240, min(w, h) = 600
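The per-dataset statistics in Table 4 (number of salient objects, salient-area ratio) can be reproduced from the ground-truth masks. The sketch below is a hedged assumption about how such numbers are computed (connected components as object count; mean ± standard deviation over all masks), consistent with the column definitions but not taken from the authors' code.

```python
import numpy as np
from scipy import ndimage

def mask_statistics(gt_mask):
    """Statistics for one binary ground-truth mask: object count and area ratio (%)."""
    binary = gt_mask > 0.5
    _, num_objects = ndimage.label(binary)            # connected components
    area_ratio = 100.0 * binary.sum() / binary.size   # percentage of image area
    return num_objects, area_ratio

# Dataset-level entries such as "22.21 ± 10.09" would then be the mean and standard
# deviation of area_ratio over all masks in that dataset.
```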
2.4.2 Instance-Level Methods
Instance-level SOD methods further identify individual object instances in the detected salient regions, which is crucial for practical applications that need finer distinctions, such as semantic segmentation [132] and multi-human parsing [133]. As an early attempt, MSRNet [70] performs salient instance detection by decomposing it into three sub-tasks, i.e., pixel-level saliency prediction, salient object contour detection, and salient instance identification. It jointly performs the first two sub-tasks by integrating deep features from several differently scaled versions of the input image. The last sub-task is solved by multi-scale combinatorial grouping [125], which generates salient object proposals from the detected contours and filters out noisy or overlapping ones.

3 SOD DATASETS
With the rapid development of SOD, numerous datasets have been introduced. Table 4 summarizes 19 SOD datasets, which are highly representative and widely used for training or benchmarking, or collected with specific properties.

3.1 Quick Overview
In an attempt to facilitate understanding of SOD datasets, we present some main take-away points of this section.
• Compared with early datasets [30], [31], [128], recent ones [27], [56], [97], [107] are typically more advanced, with less center bias, improved complexity, and increased scale. They are thus better-suited for training and evaluation, and likely to have longer life-spans.
• Some other recent datasets [70], [73], [100], [112], [130], [131] are enriched with more diverse annotations (e.g., subitizing, captioning), representing new trends in the field. More in-depth discussions regarding the generalizability and difficulty of several famous datasets will be presented in §5.6.

3.2 Early SOD Datasets
Early SOD datasets typically contain simple scenes where 1-2 salient objects stand out from a clear background.
• MSRA-A [30] contains 20,840 images. Each image has only one noticeable and eye-catching object, annotated by a bounding-box. As a subset of MSRA-A, MSRA-B has 5,000 images and less ambiguity w.r.t. the salient object.
• SED [128] comprises a single-object subset and a two-object subset; each has 100 images with mask annotations.
• ASD [31], also a subset of MSRA-A, has 1,000 images with pixel-wise ground-truths.

3.3 Popular Modern SOD Datasets
Recent SOD datasets tend to include more challenging and general scenes with relatively complex backgrounds and multiple salient objects. All have pixel-wise annotations.
• SOD [129] consists of 300 images, constructed from [134]. Many images have more than one salient object that is similar to the background or touches the image boundaries.
• MSRA10K [107], also known as THUS10K, contains 10,000 images selected from MSRA-A and covers all the images in ASD. Due to its large scale, MSRA10K is widely used to train deep SOD models (see Table 3).
• ECSSD [55] is composed of 1,000 images with semantically meaningful but structurally complex natural content.
• DUT-OMRON [56] has 5,168 images with complex backgrounds and diverse content, with pixel-wise annotations.
• PASCAL-S [108] comprises 850 challenging images selected from the PASCAL VOC2010 val set [113]. Along with eye-fixation records, non-binary salient-object mask annotations are provided. Note that the saliency value of a pixel is calculated as the ratio of subjects that select the segment containing this pixel as salient.
• HKU-IS [27] has 4,447 complex scenes that typically contain multiple disconnected objects with diverse spatial distributions and similar fore-/background appearances.

Dataset websites:
1. http://www.wisdom.weizmann.ac.il/∼vision/Seg Evaluation DB
2. https://ivrlwww.epfl.ch/supplementary material/RK CVPR09/
3. http://elderlab.yorku.ca/SOD/
4. https://mmcheng.net/zh/msra10k/
5. http://www.cse.cuhk.edu.hk/leojia/projects/hsaliency
6. http://saliencydetection.net/dut-omron/
7. http://cbi.gatech.edu/salobj/
8. https://i.cs.hku.hk/∼gbli/deep saliency.html
Fig. 3. Annotation distributions of SOD datasets (see §3 for details).

• DUTS [97] is a large-scale dataset, where the 10,553 training images were selected from the ImageNet train/val set [114], and the 5,019 test images are from the ImageNet test set and SUN [135]. Since 2017, SOD models have typically been trained on DUTS (Table 3).

3.4 Other Special SOD Datasets
In addition to the above "standard" SOD datasets, some special ones have also recently been proposed, leading to new research directions.
• SOS [112] is created for SOD subitizing [127]. It contains 6,900 images (training set: 5,520, test set: 1,380). Each image is labeled as containing 0, 1, 2, 3, or 4+ salient objects.
• MSO [112] is a subset of SOS-test [112], covering 1,224 images. It has a more balanced distribution of the number of salient objects. Each object has a bounding-box annotation.
• ILSO [70] contains 1,000 images with precise instance-level annotations and coarse contour labeling.
• XPIE [130] has 10,000 images with pixel-wise labels. It has three subsets: Set-P has 625 images of places-of-interest with geographic information; Set-I has 8,799 images with object tags; and Set-E has 576 images with eye-fixation records.
• SOC [131] consists of 6,000 images covering 80 common categories. Half of the images contain salient objects, while the remaining have none. Each image containing salient objects is annotated with an instance-level ground-truth mask, object category, and challenging factors. The non-salient object subset has 783 texture images and 2,217 real-scene images.
• COCO-CapSal [100] is built from COCO [115] and SALICON [111]. Salient objects were first roughly localized using the mouse-click data in SALICON, then precisely annotated according to the instance masks in COCO. The dataset has 5,265 and 1,459 images for training and testing, respectively.
• HRSOD [73] is the first high-resolution dataset for SOD. It contains 1,610 training and 400 testing images collected from websites. Pixel-wise ground-truths are provided.

Dataset websites:
9. http://saliencydetection.net/duts/
10. http://cs-people.bu.edu/jmzhang/sos.html
11. http://cs-people.bu.edu/jmzhang/sos.html
12. http://www.sysu-hcp.net/instance-level-salient-object-segmentation/
13. http://cvteam.net/projects/CVPR17-ELE/ELE.html
14. http://mmcheng.net/SOCBenchmark/
15. https://github.com/yi94code/HRSOD
16. https://github.com/zhangludl/code-and-dataset-for-CapSal

3.5 Discussion
As shown in Table 4, early SOD datasets [30], [31], [128] are comprised of simple images with 1-2 salient objects per image, and only provide rough bounding-box annotations, which are insufficient for reliable evaluation [31], [136]. Performance on these datasets has become saturated. Modern datasets [27], [55], [56], [97], [107] are typically large-scale and offer precise pixel-wise ground-truths. The scenes are more complex and general, and usually contain multiple salient objects. Some special datasets contain challenging scenes with background only [112], [131], provide more fine-grained, instance-level SOD ground-truths [70], [131], or include other annotations such as image captions [100], inspiring new research directions and applications. Fig. 3 depicts the annotation distributions of 18 SOD datasets. Here are some essential conclusions: 1) Some datasets [30], [31], [97], [107] have significant center bias; 2) Datasets [27], [70], [100] have more balanced location distributions for salient objects; and 3) MSO [112] has less center bias, as only bounding-box annotations are provided. We analyze the generalizability and difficulty of several famous SOD datasets in depth in §5.6.

4 EVALUATION METRICS
This section reviews popular object-level SOD evaluation metrics, i.e., Precision-Recall (PR), F-measure [31], Mean Absolute Error (MAE) [33], weighted Fβ measure (Fbw) [137], Structural measure (S-measure) [138], and Enhanced-alignment measure (E-measure) [139].

4.1 Quick Overview
To better understand the characteristics of different metrics, a quick overview of the main conclusions of this section is provided as follows.
• PR, F-measure, MAE, and Fbw address pixel-wise errors, while S-measure and E-measure consider structure cues.
• Among pixel-level metrics, PR, F-measure, and Fbw fail to consider true negative pixels, while MAE can remedy this.
• Among structured metrics, S-measure is favored over E-measure, as SOD addresses continuous saliency estimates.
• Considering popularity, advantages, and completeness, F-measure, S-measure, and MAE are the most recommended and are thus used for our performance benchmarking in §5.2.

4.2 Metric Details
• PR is calculated based on the binarized salient object mask and the ground-truth:

  Precision = TP / (TP + FP),   Recall = TP / (TP + FN),   (1)

where TP, TN, FP, FN denote true-positives, true-negatives, false-positives, and false-negatives, respectively. A set of thresholds ([0, 255]) is applied to binarize the prediction. Each threshold produces a pair of precision/recall values, which together form a PR curve describing model performance.
• F-measure [31] comprehensively considers both precision and recall by computing the weighted harmonic mean:

  Fβ = ((1 + β²) · Precision × Recall) / (β² · Precision + Recall).   (2)
Empirically, β² is set to 0.3 [31] to put more emphasis on precision. Instead of plotting the whole F-measure curve, some methods only report the maximal Fβ, or binarize the predicted saliency map by an adaptive threshold, i.e., twice the mean value of the saliency prediction, and report the mean F.
• MAE [33] measures the average pixel-wise absolute error between the normalized saliency prediction map S ∈ [0, 1]^{W×H} and the binary ground-truth mask G ∈ {0, 1}^{W×H}:

  MAE = (1 / (W×H)) Σ_{i=1}^{W} Σ_{j=1}^{H} |G(i, j) − S(i, j)|.   (3)

• Fbw [137] intuitively generalizes the F-measure by altering the way precision and recall are calculated. It extends the four basic quantities TP, TN, FP, and FN to real values, and assigns different weights (ω) to errors at different locations, considering the neighborhood information:

  Fβ^ω = ((1 + β²) · Precision^ω × Recall^ω) / (β² · Precision^ω + Recall^ω).   (4)

• S-measure [138] evaluates the structural similarity between the real-valued saliency map and the binary ground-truth. It considers object-aware (S_o) and region-aware (S_r) structure similarities:

  S = α × S_o + (1 − α) × S_r,   (5)

where α is empirically set to 0.5.
• E-measure [139] considers global means of the image and local pixel matching simultaneously:

  Q_S = (1 / (W×H)) Σ_{i=1}^{W} Σ_{j=1}^{H} φ_S(i, j),   (6)

where φ_S is the enhanced alignment matrix, reflecting the correlation between S and G after subtracting their respective global means.

4.3 Discussion
These measures are typically based on pixel-wise errors while ignoring structural similarities, with S-measure and E-measure being the only exceptions. F-measure and E-measure are designed for assessing binarized saliency prediction maps, while PR, MAE, Fbw, and S-measure are for non-binary map evaluation.
Among pixel-level metrics, the PR curve is classic. However, precision and recall cannot fully assess the quality of saliency predictions, since high-precision predictions may only highlight a part of the salient objects, while high-recall predictions are typically meaningless if all pixels are predicted as salient. In general, a high-recall response may come at the expense of reduced precision, and vice versa. F-measure and Fbw are thus used to consider precision and recall simultaneously. However, overlap-based metrics (i.e., PR, F-measure, and Fbw) do not consider true negative saliency assignments, i.e., the pixels correctly marked as non-salient. Thus, these metrics favor methods that successfully assign high saliency to salient pixels but fail to detect non-salient regions [50]. MAE can remedy this, but it performs poorly when salient objects are small. For the structure-/image-level metrics, S-measure is more popular than E-measure, as SOD focuses on continuous predictions. Considering the popularity and characteristics of existing metrics and the completeness of evaluation, F-measure (maximal Fβ), S-measure, and MAE are our top recommendations.

5 BENCHMARKING AND EMPIRICAL ANALYSIS
This section provides empirical analyses to shed light on some key challenges in the field. Specifically, with our large-scale benchmarking (§5.2), we first conduct an attribute-based study to better understand the benefits and limitations of the current art (§5.3). Then, we study the robustness of SOD models against input perturbations, i.e., randomly exerted noise (§5.4) and manually designed adversarial samples (§5.5). Finally, we quantitatively assess the generalizability and difficulty of current mainstream SOD datasets (§5.6).

5.1 Quick Overview
For ease of understanding, we compile important observations and conclusions from the subsequent experiments below.
• Overall benchmarks (§5.2). As shown in Table 5, deep SOD models significantly outperform heuristic ones, and performance on some datasets [27], [55] has become saturated. [82], [93], [101], [102] are the current state of the art.
• Attribute-based analysis (§5.3). Results in Table 7 reveal that deep methods show significant advantages in detecting semantic-rich objects, such as animals. Both deep and non-deep methods face difficulties with small salient objects. Regarding application scenarios, indoor scenes pose great challenges, highlighting potential directions for future efforts.
• Robustness against random perturbations (§5.4). As shown in Table 9, surprisingly, deep methods are more sensitive than heuristic ones to random input perturbations. Both types of methods are relatively robust against rotation, while being fragile towards Gaussian blur and Gaussian noise.
• Adversarial attacks (§5.5). Table 10 suggests that adversarial attacks cause drastic degradation in the performance of deep SOD models, even worse than that caused by random perturbations. However, attacks rarely transfer between different SOD networks.
• Generalizability and difficulty of datasets (§5.6). Table 11 shows that DUTS-train [97] is a good choice for training deep SOD models as it has the best generalizability, while SOC [131], DUT-OMRON [56], and DUTS-test [97] are more suitable for evaluation due to their difficulty.
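Before turning to the benchmark in Table 5, here is a minimal NumPy sketch of two of the recommended metrics from §4.2 (MAE and maximal Fβ, built on the precision/recall pairs of the PR curve); thresholds are taken on a normalized [0, 1] grid and β² = 0.3 as in Eq. (2). S-measure is omitted for brevity, and this is a simplified re-implementation, not the authors' released evaluation tool.

```python
import numpy as np

def mae(sal, gt):
    """Eq. (3): mean absolute error between a saliency map and a binary mask."""
    return np.abs(gt.astype(np.float64) - sal.astype(np.float64)).mean()

def precision_recall_curve(sal, gt, num_thresholds=256):
    """Eq. (1): precision/recall pairs over binarization thresholds in [0, 1]."""
    gt = gt.astype(bool)
    precisions, recalls = [], []
    for t in np.linspace(0.0, 1.0, num_thresholds):
        pred = sal >= t
        tp = np.logical_and(pred, gt).sum()
        precisions.append(tp / max(pred.sum(), 1))
        recalls.append(tp / max(gt.sum(), 1))
    return np.array(precisions), np.array(recalls)

def max_f_measure(sal, gt, beta_sq=0.3):
    """Eq. (2): maximal F_beta over all thresholds, with beta^2 = 0.3."""
    p, r = precision_recall_curve(sal, gt)
    f = (1 + beta_sq) * p * r / np.maximum(beta_sq * p + r, 1e-8)
    return f.max()
```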
TABLE 5
Benchmarking results of 44 state-of-the-art deep SOD models and 3 top-performing classic SOD methods on 6 famous datasets (§5.2). Here
max F, S, and M indicate maximal Fβ , S-measure, and MAE, respectively. The three best scores are marked in red, blue, and green, respectively.
Dataset ECSSD [55] DUT-OMRON [56] PASCAL-S [108] HKU-IS [27] DUTS-test [97] SOD [129]
Metric max F↑ S↑ M↓ max F↑ S↑ M↓ max F↑ S↑ M↓ max F↑ S↑ M↓ max F↑ S↑ M↓ max F↑ S↑ M↓
2013-14
∗ HS [35] .673 .685 .228 .561 .633 .227 .569 .624 .262 .652 .674 .215 .504 .601 .243 .756 .711 .222
∗ DRFI [53] .751 .732 .170 .623 .696 .150 .639 .658 .207 .745 .740 .145 .600 .676 .155 .658 .619 .228
∗ wCtr [36] .684 .714 .165 .541 .653 .171 .599 .656 .196 .695 .729 .138 .522 .639 .176 .615 .638 .213
2015
MCDL [29] .816 .803 .101 .670 .752 .089 .706 .721 .143 .787 .786 .092 .634 .713 .105 .689 .651 .182
LEGS [28] .805 .786 .118 .631 .714 .133 ‡ ‡ ‡ .736 .742 .119 .612 .696 .137 .685 .658 .197
MDF [27] .797 .776 .105 .643 .721 .092 .704 .696 .142 .839 .810 .129 .657 .728 .114 .736 .674 .160
ELD [60] .849 .841 .078 .677 .751 .091 .782 .799 .111 .868 .868 .063 .697 .754 .092 .717 .705 .155
DHSNet [38] .893 .884 .060 ‡ ‡ ‡ .799 .810 .092 .875 .870 .053 .776 .818 .067 .790 .749 .129
2016
DCL [104] .882 .868 .075 .699 .771 .086 .787 .796 .113 .885 .877 .055 .742 .796 .149 .786 .747 .195
MAP [37] .556 .611 .213 .448 .598 .159 .521 .593 .207 .552 .624 .182 .453 .583 .181 .509 .557 .236
CRPSD [105] .915 .895 .048 - - - .864 .852 .064 .906 .885 .043 - - - - - -
RFCN [63] .875 .852 .107 .707 .764 .111 .800 .798 .132 .881 .859 .089 .755 .859 .090 .769 .794 .170
MSRNet [70] .900 .895 .054 .746 .808 .073 .828 .838 .081 ‡ ‡ ‡ .804 .839 .061 .802 .779 .113
DSS [39] .906 .882 .052 .737 .790 .063 .805 .798 .093 ‡ ‡ ‡ .796 .824 .057 .805 .751 .122
† WSS [97] .879 .811 .104 .725 .730 .110 .804 .744 .139 .878 .822 .079 .878 .822 .079 .807 .675 .170
DLS [65] .826 .806 .086 .644 .725 .090 .712 .723 .130 .807 .799 .069 - - - - - -
2017
NLDF [75] .889 .875 .063 .699 .770 .080 .795 .805 .098 .888 .879 .048 .777 .816 .065 .808 .889 .125
Amulet [76] .905 .894 .059 .715 .780 .098 .805 .818 .100 .887 .886 .051 .750 .804 .085 .773 .757 .142
FSN [72] .897 .884 .053 .736 .802 .066 .800 .804 .093 .884 .877 .044 .761 .808 .066 .781 .755 .127
SBF [83] .833 .832 .091 .649 .748 .110 .726 .758 .133 .821 .829 .078 .657 .743 .109 .740 .708 .159
SRM [71] .905 .895 .054 .725 .798 .069 .817 .834 .084 .893 .887 .046 .798 .836 .059 .792 .741 .128
UCF [66] .890 .883 .069 .698 .760 .120 .787 .805 .115 .874 .875 .062 .742 .782 .112 .763 .753 .165
RADF [78] .911 .894 .049 .761 .817 .055 .800 .802 .097 .902 .888 .039 .792 .826 .061 .804 .757 .126
BDMP [84] .917 .911 .045 .734 .809 .064 .830 .845 .074 .910 .907 .039 .827 .862 .049 .806 .786 .108
DGRL [85] .916 .906 .043 .741 .810 .063 .830 .839 .074 .902 .897 .037 .805 .842 .050 .802 .771 .105
PAGR [86] .904 .889 .061 .707 .775 .071 .814 .822 .089 .897 .887 .048 .817 .838 .056 .761 .716 .147
2018
RSDNet [79] .880 .788 .173 .715 .644 .178 ‡ ‡ ‡ .871 .787 .156 .798 .720 .161 .790 .668 .226
ASNet [87] .925 .915 .047 ‡ ‡ ‡ .848 .861 .070 .912 .906 .041 .806 .843 .061 .801 .762 .121
PiCANet [40] .929 .916 .035 .767 .825 .054 .838 .846 .064 .913 .905 .031 .840 .863 .040 .814 .776 .096
† C2S-Net [99] .902 .896 .053 .722 .799 .072 .827 .839 .081 .887 .889 .046 .784 .831 .062 .786 .760 .124
RAS [88] .908 .893 .056 .753 .814 .062 .800 .799 .101 .901 .887 .045 .807 .839 .059 .810 .764 .124
AFNet [89] .924 .913 .042 .759 .826 .057 .844 .849 .070 .910 .905 .036 .838 .867 .046 .809 .774 .111
BASNet [90] .931 .916 .037 .779 .836 .057 .835 .838 .076 .919 .909 .032 .838 .866 .048 .805 .769 .114
CapSal [100] .813 .826 .077 .535 .674 .101 .827 .837 .073 .842 .851 .057 .772 .818 .061 .669 .694 .148
CPD [80] .926 .918 .037 .753 .825 .056 .833 .848 .071 .911 .905 .034 .840 .869 .043 .814 .767 .112
MLSLNet [91] .917 .911 .045 .734 .809 .064 .835 .844 .074 .910 .907 .039 .828 .862 .049 .806 .786 .108
† MWS [81] .859 .827 .099 .676 .756 .108 .753 .768 .134 .835 .818 .086 .720 .759 .092 .772 .700 .170
PAGE-Net [92] .926 .910 .037 .760 .819 .059 .829 .835 .073 .910 .901 .031 .816 .848 .048 .795 .763 .108
2019
PS [94] .930 .918 .041 .789 .837 .061 .837 .850 .071 .913 .907 .038 .835 .865 .048 .824 .800 .103
PoolNet [93] .937 .926 .035 .762 .831 .054 .858 .865 .065 .923 .919 .030 .865 .886 .037 .831 .788 .106
BANet-R [101] .939 .924 .035 .782 .832 .059 .847 .852 .070 .923 .913 .032 .858 .879 .040 .842 .791 .106
EGNet-R [82] .936 .925 .037 .777 .841 .053 .841 .852 .074 .924 .918 .031 .866 .887 .039 .854 .802 .099
HRSOD-DH [73] .911 .888 .052 .692 .762 .065 .810 .817 .079 .890 .877 .042 .800 .824 .050 .735 .705 .139
JDFPR [95] .915 .907 .049 .755 .821 .057 .827 .841 .082 .905 .903 .039 .792 .836 .059 .792 .763 .123
SCRN [102] .937 .927 .037 .772 .836 .056 .856 .869 .063 .921 .916 .034 .864 .885 .040 .826 .787 .107
SSNet [103] .889 .867 .046 .708 .773 .056 .793 .807 .072 .876 .854 .041 .769 .784 .049 .713 .700 .118
TSPOANet [41] .919 .907 .047 .749 .818 .061 .830 .842 .078 .909 .902 .039 .828 .860 .049 .810 .772 .118
∗ Non-deep learning model. † Weakly-supervised model. Bounding-box output. ‡ Training on subset. - Results not available.
5.2 Performance Benchmarking
Table 5 shows the performance of 44 state-of-the-art deep SOD models and three top-performing classic methods (suggested by [44]) on the six most popular modern datasets. Performance is measured by three metrics, i.e., maximal Fβ, S-measure, and MAE, as recommended in §4.3. All the benchmarked models are representative, and have publicly available implementations or saliency prediction results. For performance benchmarking, we either use the saliency maps provided by the authors or run their official codes. It is worth mentioning that, for some methods, our benchmarking results are inconsistent with their reported scores. There are several reasons. First, our community long lacked an open, universally adopted evaluation tool, and many implementation factors influence the evaluation scores, such as input image resolution, threshold step, etc. Second, some methods [66], [69], [74], [76], [85], [100] use mean F-measure instead of maximal F-measure for performance evaluation. Third, for some methods [39], [76], the evaluation scores of the finally released saliency maps are inconsistent with the ones reported in the papers. We hope that our performance benchmarking, publicly released evaluation tools, and SOD maps can help our community build an open and standardized evaluation system and ensure consistency and procedural correctness for results and conclusions produced by different parties.
Not surprisingly, data-driven models greatly outperform conventional heuristic ones, due to their strong ability to learn visually salient patterns. In addition, performance has gradually increased since 2015, demonstrating the advancement of deep learning techniques. However, after 2018, the rate of improvement began decreasing, calling for more effective model designs and new machine learning technologies. We also find that performance tends to saturate on older SOD datasets such as ECSSD [55] and HKU-IS [27]. Hence, among the 44 famous deep SOD models, we would like to nominate PoolNet [93], BANet [101], EGNet [82], and SCRN [102] as the four state-of-the-art methods, which consistently show promising performance over diverse datasets.

5.3 Attribute-Based Study
Although the community has witnessed the great advances made by deep SOD models, it is still unclear under which specific aspects these models perform well. As there are numerous factors affecting the performance of an SOD algorithm, such as object/scene category, occlusion, etc., it is crucial to evaluate performance under different scenarios. This can help reveal the strengths and weaknesses of deep SOD models, identify pending challenges, and highlight future research directions towards more robust algorithms.

5.3.1 Hybrid Benchmark Dataset with Attribute Annotations
To enable a deeper analysis and understanding of the performance of an algorithm, it is essential to identify the [...]
Fig. 4. Sample images from the hybrid benchmark consisting of images randomly selected from 6 SOD datasets. Salient regions are uniformly
highlighted. Corresponding attributes are listed. See §5.3 for more detailed descriptions.
TABLE 7
Attribute-based study w.r.t. salient object categories, challenges and scene categories. (·) indicates the percentage of images with a specific
attribute. ND-avg indicates the average score of three heuristic models: HS [35], DRFI [53] and wCtr [36]. D-avg indicates the average score of
three deep learning models: DGRL [85], PAGR [86] and PiCANet [40]. Best in red, and worst with underline. See §5.3 for more details.
TABLE 8
Attribute statistics of top and bottom 100 images based on F-measure. (·) indicates the percentage of the images with a specific attribute. ND-avg
indicates the average results of three heuristic models: HS [35], DRFI [53] and wCtr [36]. D-avg indicates the average results of three deep
models: DGRL [85], PAGR [86] and PiCANet [40]. Two largest changes in red if positive, and blue if negative. See §5.3 for more details.
Fig. 5. Examples of saliency prediction under various input perturbations. The max F values are denoted in red. See §5.4 for more details.
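The perturbation types illustrated in Fig. 5 can be generated in a few lines. The sketch below (using OpenCV and NumPy) is an assumed re-implementation of the perturbations studied in §5.4, with parameter values taken from the figure labels (σ = 2/4, variance 0.01/0.08, ±15° rotation, grayscale); it is not the authors' exact preprocessing code.

```python
import cv2
import numpy as np

def perturb(image, kind, **kw):
    """Apply one of the random input perturbations studied in Sec. 5.4."""
    img = image.astype(np.float32) / 255.0
    if kind == 'gaussian_blur':                     # sigma = 2 or 4 in Fig. 5
        return cv2.GaussianBlur(img, (0, 0), kw.get('sigma', 2))
    if kind == 'gaussian_noise':                    # variance = 0.01 or 0.08
        noisy = img + np.random.normal(0.0, np.sqrt(kw.get('var', 0.01)), img.shape)
        return np.clip(noisy, 0.0, 1.0)
    if kind == 'rotation':                          # +/- 15 degrees
        h, w = img.shape[:2]
        M = cv2.getRotationMatrix2D((w / 2, h / 2), kw.get('angle', 15), 1.0)
        return cv2.warpAffine(img, M, (w, h))
    if kind == 'gray':                              # grayscale, replicated to 3 channels
        g = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        return cv2.merge([g, g, g])
    raise ValueError(f'unknown perturbation: {kind}')
```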
Deep neural networks are known to be vulnerable to adversarial examples: small, carefully crafted perturbations of the input can lead to completely different predictions [141]. Though intensively studied in classification tasks, adversarial attacks in SOD are rarely explored. Hackers may generate malicious adversarial perturbations to fool SOD modules and then cheat the surveillance systems. Besides, SOD has benefited many commercial projects such as photo editing [20] and image/video compression [145]. The adversarial attacks launched by hackers on the embedded SOD modules would inevitably affect the functioning of commercial products and impact users, causing losses for the developers and companies. Therefore, studying the robustness of SOD models is crucial for defending these applications against malicious attacks. In this section, we study the robustness against adversarial attacks.

Fig. 6. Examples of SOD prediction under adversarial perturbations of different target networks. The perturbations are magnified by 10 for better visualization. Red for max F. See §5.5 for details. (Rows include the image, the perturbation, and the resulting predictions; target networks: SRM, DGRL and PiCANet.)

Fig. 7. Network architecture of the SOD model used in cross-dataset generalization evaluation. See §5.6 for more detailed descriptions. (A 224×224 input is passed through a VGG16 encoder; conv3-out, conv4-out and conv5-out feed a decoder trained with a binary cross-entropy loss.)

TABLE 10
Results of transferring adversarial perturbations across SOD models (see §5.5.2). Each row gives the model on which the perturbation is computed (“None” = clean inputs); each column gives the attacked model.

Attack from      SRM [71]   DGRL [85]   PiCANet [40]
None             .817       .831        .848
SRM [71]         .263       .780        .842
DGRL [85]        .778       .248        .844
PiCANet [40]     .772       .799        .253
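To make the attack-and-transfer protocol behind Fig. 6 and Table 10 concrete, the sketch below crafts a perturbation on a source model with a generic PGD-style gradient attack (a stand-in for, not a reproduction of, the DAG attack used in the experiments) and evaluates it on a different target model.

```python
# Hedged sketch of the attack-and-transfer protocol behind Fig. 6 and Table 10:
# a perturbation is crafted on a source model with a generic PGD-style gradient
# attack (NOT the DAG attack actually used in the experiments) and then applied,
# unmodified, to a different target model. `source` and `target` are assumed to be
# SOD networks mapping a 1x3xHxW image to 1x1xHxW saliency logits; `gt` is the
# binary ground-truth map of the same spatial size.
import torch
import torch.nn.functional as F

def craft_perturbation(source, image, gt, steps=10, alpha=1/255, eps=8/255):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = F.binary_cross_entropy_with_logits(source(image + delta), gt)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascend the loss -> degrade the prediction
            delta.clamp_(-eps, eps)              # keep the perturbation visually small
        delta.grad.zero_()
    return delta.detach()

def transfer_attack(source, target, image, gt):
    delta = craft_perturbation(source, image, gt)
    with torch.no_grad():
        return torch.sigmoid(target(image + delta))  # prediction of the attacked target model
```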
5.5.2 Transferability Across Networks

Previous research has revealed that adversarial perturbations can be transferred across networks, i.e., adversarial examples targeting one model can mislead another without any modification [148]. This transferability is widely used for black-box attacks against real-world systems. To investigate the transferability of perturbations for deep SOD models, we use the adversarial perturbation computed on one SOD model to attack another.

Table 10 shows the experimental results for the three models under investigation (SRM [71], DGRL [85] and PiCANet [40]). While the DAG attack leads to severe performance drops for the targeted model (see the diagonal), it causes much less degradation to other models, i.e., the transferability between models of different network structures is weak for the SOD task, which is similar to the transferability observed for semantic segmentation, as analyzed in [146]. This may be because the gradient directions of different models are orthogonal to each other [149], so the gradient-based attack in the experiment transfers poorly to non-targeted models. However, adversarial images generated from an ensemble of multiple models might generate non-targeted adversarial instances with better transferability [149], which would be a great threat to deep SOD models.

TABLE 11
Results for cross-dataset generalization experiment. Max F↑ for saliency prediction when training on one dataset (rows) and testing on another (columns). “Self” refers to training and testing on the same dataset (same as diagonal). “Mean others” indicates average performance on all except self. See §5.6 for details.

Train on \ Test on   MSRA10K [107]  ECSSD [55]  DUT-OMRON [56]  HKU-IS [27]  DUTS [97]  SOC [131]  Self   Mean others  Percent drop↓
MSRA10K [107]        .875           .818        .660            .849         .671       .617       .875   .723         17%
ECSSD [55]           .844           .831        .630            .833         .646       .616       .831   .714         14%
DUT-OMRON [56]       .795           .752        .673            .779         .623       .567       .673   .703         -5%
HKU-IS [27]          .857           .838        .695            .880         .719       .639       .880   .750         15%
DUTS [97]            .857           .834        .647            .860         .665       .654       .665   .770         -16%
SOC [131]            .700           .670        .517            .666         .514       .593       .593   .613         -3%
Mean others          .821           .791        .637            .811         .640       .614       -      -            -

5.6 Cross-Dataset Generalization Evaluation

Datasets are responsible for much of the recent progress in SOD, not just as sources for training deep models, but also as means for measuring and comparing performance. Datasets are collected with the goal of representing the visual world, and to summarize the algorithm as a single number (i.e., benchmark score). A concern thus arises: it is necessary to evaluate how well a particular dataset represents the real world or, more specifically, to quantitatively measure the dataset’s generalization ability. Unfortunately, previous studies [44] are quite limited – mainly concerning the degrees of center bias in different SOD datasets. Here, we follow [150] to assess how general SOD datasets are. We study the generalization and difficulty of several mainstream SOD datasets by performing a cross-dataset analysis, i.e., training on one dataset, and testing on the others. We expect our experiments to stimulate discussion in the community regarding this essential but largely neglected issue.

We first train a typical SOD model on one dataset, and then explore how well it generalizes to a representative set of other datasets, compared with its performance on the “native” test set. Specifically, we implement the typical SOD model as a bottom-up/top-down structure, which has been the most standard and popular SOD architecture in recent years and is the basis of many current top-performing models [82], [93], [101], [102]. As shown in Fig. 7, the encoder part is borrowed from VGG16 [151], and the decoder consists of three convolutional layers that gradually refine the saliency prediction. We pick six representative datasets [27], [55], [56], [97], [107], [131]. For each dataset, we train the SOD model with 800 randomly selected training images and test it on 200 other validation images. Please note that a total of 1,000 is the maximum possible number of images considering the size of the smallest selected dataset, ECSSD [55].

Table 11 summarizes the results of cross-dataset generalization, measured by max F. Each column corresponds to the performance when training on all the datasets separately and testing on one. Each row indicates training on one dataset and testing on all of them. Since our training/testing protocol is different from the one used in the benchmarks mentioned in previous sections, the actual performance numbers are not meaningful. Rather, it is the relative performance difference that matters. Not surprisingly, we observe that the best results are achieved when training and testing on the same dataset. By looking at the numbers across each column, we can determine how easy a dataset is for models trained on the other datasets. By looking at the numbers across one row, we can determine how good a dataset is at generalizing to the others. We find that SOC [131] is the most difficult dataset (lowest column, Mean others 0.614). MSRA10K [107] appears to be the easiest one (highest column, Mean others 0.811), and generalizes the worst (highest row, Percent drop 17%). DUTS [97] is shown to have the best generalization ability (lowest row, Percent drop −16%).

Based on these analyses, we would make the following recommendations for SOD datasets: 1) For training deep models, DUTS [97] is a good choice because it has the best generalizability. 2) For testing, SOC [131] is good for assessing the worst-case performance, since it is the most challenging dataset. DUT-OMRON [56] and DUTS-test [97] deserve more consideration as they are also very difficult.
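A minimal sketch of the kind of encoder-decoder baseline described above (and in Fig. 7) is given below; the backbone and loss follow the text, while the channel widths and fusion details are illustrative assumptions rather than the exact model.

```python
# A rough sketch, under stated assumptions, of the bottom-up/top-down baseline of
# Fig. 7: a VGG16 [151] encoder whose conv3/conv4/conv5 outputs feed a small decoder
# of three convolutional layers, trained with binary cross-entropy on 224x224 inputs.
# Channel widths and the fusion scheme are illustrative guesses, not the exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class SimpleSODNet(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(pretrained=True).features  # ImageNet-pretrained backbone
        self.enc3, self.enc4, self.enc5 = vgg[:16], vgg[16:23], vgg[23:30]
        self.dec5 = nn.Conv2d(512, 256, 3, padding=1)
        self.dec4 = nn.Conv2d(512 + 256, 128, 3, padding=1)
        self.dec3 = nn.Conv2d(256 + 128, 1, 3, padding=1)          # saliency logits

    def forward(self, x):                                          # x: Bx3x224x224
        f3 = self.enc3(x)                                          # conv3-out: Bx256x56x56
        f4 = self.enc4(f3)                                         # conv4-out: Bx512x28x28
        f5 = self.enc5(f4)                                         # conv5-out: Bx512x14x14
        y = F.relu(self.dec5(f5))
        y = F.relu(self.dec4(torch.cat([F.interpolate(y, scale_factor=2), f4], dim=1)))
        y = self.dec3(torch.cat([F.interpolate(y, scale_factor=2), f3], dim=1))
        return F.interpolate(y, size=x.shape[2:])                  # train with nn.BCEWithLogitsLoss
```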
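The two summary statistics of Table 11 can be reproduced directly from the per-dataset scores, as in the following worked example for the DUTS row (the definitions are inferred from the table but match its reported values).

```python
# Worked example of the two summary statistics in Table 11, computed on the DUTS row
# under the (assumed, but consistent with the table) definitions:
#   Mean others  = average max F on all test sets except the training set itself
#   Percent drop = (Self - Mean others) / Self
duts_row = {'MSRA10K': .857, 'ECSSD': .834, 'DUT-OMRON': .647,
            'HKU-IS': .860, 'DUTS': .665, 'SOC': .654}
self_score = duts_row['DUTS']                                  # train and test on DUTS
others = [v for k, v in duts_row.items() if k != 'DUTS']
mean_others = sum(others) / len(others)
percent_drop = 100 * (self_score - mean_others) / self_score
print(round(mean_others, 3), round(percent_drop))              # prints: 0.77 -16
```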
6 More Discussions
Our previous systematic review and empirical studies char-
acterized the models (§2), datasets (§3), metrics (§4), and
challenges (§5) of deep SOD. Here we further posit active
research directions, and outline several open issues.
most important person in a crowded room. This is also evidenced by the experiments in §5.3, which show that deep models face great difficulties in complex (CS), indoor (Indoor) or multi-object (MO) scenes. In other words, deep SOD models, though good at semantic modeling, require higher-level image understanding. Exploring more powerful network designs that explicitly reason the relative saliency and revisiting classic cognitive theories are both promising directions to overcome this issue.

6.4 Linking SOD to Visual Fixations

The strong correlation between eye movements (implicit saliency) and explicit object saliency has been explored throughout history [44], [108], [161]–[163]. However, despite the deep connections between the problems of FP and SOD, the major computational models of the two tasks remain largely distinct; only a few SOD models consider both tasks simultaneously [72], [87], [96]. This is mainly due to the overemphasis on the specific setting of SOD and the design bias of current SOD datasets, which overlooks the connection to eye fixations during data annotation. As stated in [108], such dataset design bias not only creates a discomforting disconnection between FP and SOD, but also further misleads algorithm design. Exploring classic visual attention theories in SOD is a promising and crucial direction which could make SOD models more consistent with the visual processing of the human visual system and provide better explainability. In addition, the ultimate goal of visual saliency modeling is to understand the underlying rationale of the visual attention mechanism. However, with the current focus on exploring more powerful neural network architectures and beating the latest benchmark numbers on different datasets, have we perhaps lost sight of the original purpose? The solution to these problems requires dense collaborations between the FP and SOD communities.

6.5 Learning SOD in a Weakly-/Unsupervised Manner

Deep SOD methods are typically trained in a fully-supervised manner with a plethora of finely-annotated pixel-level ground-truths. However, it is highly costly and time-consuming to construct a large-scale, well-annotated SOD dataset. Though some efforts have been made to achieve SOD with limited supervision, i.e., by leveraging category-level labels [68], [69], [97] or pseudo pixel-wise annotations [67], [81], [83], [98], [99], there is still a notable gap with the fully-supervised counterparts. In contrast, humans usually learn with little or even no supervision. Since the ultimate goal of visual saliency modeling is to understand the visual attention mechanism, learning SOD in a weakly-/unsupervised manner would be of great value to both the research community and real-world applications. Further, it would also help us understand which factors truly drive our attention mechanism and saliency pattern understanding. Given the massive number of algorithmic breakthroughs over the past few years, we can expect a flurry of innovation towards this promising direction.

6.6 Pre-training with Self-Supervised Visual Features

Current deep SOD methods are typically built on ImageNet-pretrained networks, and fine-tuned on SOD datasets. It is believed that parameters trained on ImageNet can serve as a good starting point to accelerate the convergence of training and prevent overfitting on smaller-scale SOD datasets. Besides pre-training deep SOD models on the de facto dataset, ImageNet, another option is to leverage self-supervised learning techniques [164] to learn effective visual features from a vast amount of unlabeled images/videos. The visual features can be learned through various pretext tasks like image inpainting [165], colorization [166], clustering [167], etc., and can be generalized to other vision tasks. Fine-tuning SOD models from such self-supervised parameters is promising to yield better performance than the ImageNet initialization.

6.7 Efficient SOD for Real-World Application

Current top-leading deep SOD models are designed to be complicated in order to achieve increased learning capacity and improved performance. However, more ingenious and light-weight architectures are required to fulfill the requirements of mobile and embedded applications, such as robotics, autonomous driving, augmented reality, etc. The degradation of accuracy and generalization ability caused by model scale reduction should be minimal. To facilitate the application of SOD in real-world scenarios, it is possible to utilize model compression [168] or knowledge distillation [169], [170] techniques to develop compact and fast SOD models with competitive performance. Such compression techniques have already been shown effective in improving generalization ability and alleviating under-fitting when training efficient object detection models [171].

7 Conclusion

In this paper we present, to the best of our knowledge, the first comprehensive review of SOD focusing on deep learning techniques. We first provide novel taxonomies for categorizing deep SOD models from several distinct perspectives, including network architecture, level of supervision, etc. We then cover the contemporary literature on popular SOD datasets and evaluation criteria, providing a thorough performance benchmarking of major SOD methods and offering recommendations for several datasets and metrics that can be used to consistently assess different models. Next, we consider several previously under-explored issues related to benchmarking and baselines. In particular, we study the strengths and weaknesses of deep and non-deep SOD models by compiling and annotating a new dataset and evaluating several representative models on it, revealing promising directions for future efforts. We also study the robustness of SOD methods by analyzing the effects of various perturbations on the final performance. Moreover, for the first time in the field, we investigate the robustness of deep SOD models to maliciously designed adversarial perturbations and the transferability of these adversarial examples, providing baselines for future research. In addition, we analyze the generalization and difficulty of existing SOD datasets through a cross-dataset generalization study, and quantitatively reveal the dataset bias. We finally introduce several open issues and challenges of SOD in the deep learning era, providing insightful discussions and identifying a number of potentially fruitful directions forward.
In conclusion, SOD has achieved notable progress thanks to the striking development of deep learning techniques. However, there are still under-explored problems in achieving more efficient model designs, training, and inference for both academic research and real-world applications. We expect this survey to provide an effective way to understand the current state of the art and, more importantly, insight for the future exploration of SOD.

References

[1] J.-Y. Zhu, J. Wu, Y. Xu, E. Chang, and Z. Tu, “Unsupervised object class discovery via saliency-guided multiple class learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 4, pp. 862–875, 2015.
[2] F. Zhang, B. Du, and L. Zhang, “Saliency-guided unsupervised feature learning for scene classification,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 4, pp. 2175–2184, 2015.
[3] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. ACM Int. Conf. Mach. Learn., 2015, pp. 2048–2057.
[4] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt et al., “From captions to visual concepts and back,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1473–1482.
[5] A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra, “Human attention in visual question answering: Do humans and deep networks look at the same regions?” Computer Vision and Image Understanding, vol. 163, pp. 90–100, 2017.
[6] Z. Ren, S. Gao, L.-T. Chia, and I. W.-H. Tsang, “Region-based saliency detection and its application in object recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 5, pp. 769–779, 2014.
[7] D. Zhang, D. Meng, L. Zhao, and J. Han, “Bridging saliency detection to weakly supervised object detection based on self-paced curriculum learning,” in International Joint Conferences on Artificial Intelligence, 2016.
[8] W. Wang, J. Shen, R. Yang, and F. Porikli, “Saliency-aware video object segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 1, pp. 20–33, 2018.
[9] H. Song, W. Wang, S. Zhao, J. Shen, and K.-M. Lam, “Pyramid dilated deeper convlstm for video salient object detection,” in Proc. Eur. Conf. Comput. Vis., 2018.
[10] Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan, “Object region mining with adversarial erasing: A simple classification to semantic segmentation approach,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[11] X. Wang, S. You, X. Li, and H. Ma, “Weakly-supervised semantic segmentation by iteratively mining common object features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
[12] G. Sun, W. Wang, J. Dai, and L. Van Gool, “Mining cross-image semantics for weakly supervised semantic segmentation,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 347–365.
[13] R. Zhao, W. Ouyang, and X. Wang, “Unsupervised salience learning for person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3586–3593.
[14] S. Bi, G. Li, and Y. Yu, “Person re-identification using multiple experts with random subspaces,” Journal of Image and Graphics, vol. 2, no. 2, 2014.
[15] Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li, “A user attention model for video summarization,” in Proc. ACM Int. Conf. Multimedia, 2002, pp. 533–542.
[16] D. Simakov, Y. Caspi, E. Shechtman, and M. Irani, “Summarizing visual data using bidirectional similarity,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[17] J. Han, E. J. Pauwels, and P. De Zeeuw, “Fast saliency-aware multi-modality image fusion,” Neurocomputing, vol. 111, pp. 70–80, 2013.
[18] P. L. Rosin and Y.-K. Lai, “Artistic minimal rendering with lines and blocks,” Graphical Models, vol. 75, no. 4, pp. 208–229, 2013.
[19] W. Wang, J. Shen, Y. Yu, and K.-L. Ma, “Stereoscopic thumbnail creation via efficient stereo saliency detection,” IEEE Trans. Visualization and Comput. Graphics, vol. 23, no. 8, pp. 2014–2027, 2016.
[20] W. Wang, J. Shen, and H. Ling, “A deep network solution for attention and aesthetics aware photo cropping,” IEEE Trans. Pattern Anal. Mach. Intell., 2018.
[21] S. Avidan and A. Shamir, “Seam carving for content-aware image resizing,” in ACM Trans. Graph., vol. 26, no. 3, 2007, p. 10.
[22] Y. Sugano, Y. Matsushita, and Y. Sato, “Calibration-free gaze sensing using saliency maps,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 2667–2674.
[23] A. Borji and L. Itti, “Defending yarbus: Eye movements reveal observers’ task,” Journal of Vision, vol. 14, no. 3, pp. 29–29, 2014.
[24] A. Karpathy, S. Miller, and L. Fei-Fei, “Object discovery in 3d scenes via shape analysis,” in Proc. IEEE Conf. Robot. Autom., 2013, pp. 2088–2095.
[25] S. Frintrop, G. M. García, and A. B. Cremers, “A cognitive approach for object discovery,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2329–2334.
[26] A. M. Treisman and G. Gelade, “A feature-integration theory of attention,” Cognitive psychology, vol. 12, no. 1, pp. 97–136, 1980.
[27] G. Li and Y. Yu, “Visual saliency based on multiscale deep features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 5455–5463.
[28] L. Wang, H. Lu, X. Ruan, and M.-H. Yang, “Deep networks for saliency detection via local estimation and global search,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3183–3192.
[29] R. Zhao, W. Ouyang, H. Li, and X. Wang, “Saliency detection by multi-context deep learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1265–1274.
[30] T. Liu, J. Sun, N.-N. Zheng, X. Tang, and H.-Y. Shum, “Learning to detect a salient object,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007, pp. 1–8.
[31] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, “Frequency-tuned salient region detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 1597–1604.
[32] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, “Global contrast based salient region detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011.
[33] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 733–740.
[34] J. Wang, H. Jiang, Z. Yuan, M.-M. Cheng, X. Hu, and N. Zheng, “Salient object detection: A discriminative regional feature integration approach,” Int. J. Comput. Vis., vol. 123, no. 2, pp. 251–268, 2017.
[35] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 1155–1162.
[36] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization from robust background detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2814–2821.
[37] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mech, “Unconstrained salient object detection via proposal subset optimization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 5733–5742.
[38] N. Liu and J. Han, “DHSNet: Deep hierarchical saliency network for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 678–686.
[39] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr, “Deeply supervised salient object detection with short connections,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3203–3212.
[40] N. Liu, J. Han, and M.-H. Yang, “Picanet: Learning pixel-wise contextual attention for saliency detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3089–3098.
[41] Y. Liu, Q. Zhang, D. Zhang, and J. Han, “Employing deep part-object relationships for salient object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 1232–1241.
[42] Q. Qi, S. Zhao, J. Shen, and K.-M. Lam, “Multi-scale capsule attention-based salient object detection with multi-crossed layer connections,” in IEEE International Conference on Multimedia and Expo, 2019, pp. 1762–1767.
[43] A. Borji and L. Itti, “State-of-the-art in visual attention modeling,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 185–207, 2013.
[44] A. Borji, M.-M. Cheng, H. Jiang, and J. Li, “Salient object detection: A benchmark,” IEEE Trans. Image Process., vol. 24, no. 12, pp. 5706–5722, 2015.
[45] T. V. Nguyen, Q. Zhao, and S. Yan, “Attentive systems: A survey,” Int. J. Comput. Vis., vol. 126, no. 1, pp. 86–110, 2018.
[46] D. Zhang, H. Fu, J. Han, A. Borji, and X. Li, “A review of co-saliency detection algorithms: fundamentals, applications, and challenges,” ACM Trans. Intell. Syst. Technol., vol. 9, no. 4, p. 38, 2018.
[47] R. Cong, J. Lei, H. Fu, M.-M. Cheng, W. Lin, and Q. Huang, “Review of visual saliency detection with comprehensive information,” IEEE Trans. Circuits Syst. Video Technol., 2018.
[48] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu, “Advanced deep-learning techniques for salient and category-specific object detection: a survey,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 84–100, 2018.
[49] A. Borji, “Saliency prediction in the deep learning era: Successes and limitations,” IEEE Trans. Pattern Anal. Mach. Intell., 2019.
[50] A. Borji, M.-M. Cheng, Q. Hou, H. Jiang, and J. Li, “Salient object detection: A survey,” Computational Visual Media, pp. 1–34, 2019.
[51] C. Koch and S. Ullman, “Shifts in selective visual attention: Towards the underlying neural circuitry,” Human neurobiology, vol. 4, no. 4, p. 219, 1985.
[52] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, 1998.
[53] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, “Salient object detection: A discriminative regional feature integration approach,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 2083–2090.
[54] Y. Wei, F. Wen, W. Zhu, and J. Sun, “Geodesic saliency using background priors,” in Proc. Eur. Conf. Comput. Vis., 2012, pp. 29–42.
[55] J. Shi, Q. Yan, L. Xu, and J. Jia, “Hierarchical image saliency detection on extended cssd,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 4, pp. 717–729, 2015.
[56] C. Yang, L. Zhang, H. Lu, X. Ruan, and M. H. Yang, “Saliency detection via graph-based manifold ranking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3166–3173.
[57] W. Wang, J. Shen, L. Shao, and F. Porikli, “Correspondence driven saliency transfer,” IEEE Trans. Image Process., vol. 25, no. 11, pp. 5025–5034, 2016.
[58] F. Guo, W. Wang, J. Shen, L. Shao, J. Yang, D. Tao, and Y. Y. Tang, “Video saliency detection using object proposals,” IEEE Trans. Cybernetics, 2017.
[59] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” in Proc. Int. Conf. Artificial Neural Netw., 2011, pp. 44–51.
[60] G. Lee, Y.-W. Tai, and J. Kim, “Deep saliency with encoded low level distance map and high level features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 660–668.
[61] S. He, R. W. Lau, W. Liu, Z. Huang, and Q. Yang, “Supercnn: A superpixelwise convolutional neural network for salient object detection,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 330–344, 2015.
[62] J. Kim and V. Pavlovic, “A shape-based approach for salient object detection using deep learning,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 455–470.
[63] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan, “Saliency detection with recurrent fully convolutional networks,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 825–841.
[64] J. Kuen, Z. Wang, and G. Wang, “Recurrent attentional networks for saliency detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3668–3677.
[65] P. Hu, B. Shuai, J. Liu, and G. Wang, “Deep level sets for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 540–549.
[66] P. Zhang, D. Wang, H. Lu, H. Wang, and B. Yin, “Learning uncertain convolutional features for accurate saliency detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 212–221.
[67] J. Zhang, T. Zhang, Y. Dai, M. Harandi, and R. Hartley, “Deep unsupervised saliency detection: A multiple noisy labeling perspective,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9029–9038.
[68] C. Cao, Y. Huang, Z. Wang, L. Wang, N. Xu, and T. Tan, “Lateral inhibition-inspired convolutional neural network for visual attention and saliency detection,” in AAAI Conference on Artificial Intelligence, 2018.
[69] B. Li, Z. Sun, and Y. Guo, “Supervae: Superpixelwise variational autoencoder for salient object detection,” in AAAI Conference on Artificial Intelligence, 2019, pp. 8569–8576.
[70] G. Li, Y. Xie, L. Lin, and Y. Yu, “Instance-level salient object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 247–256.
[71] T. Wang, A. Borji, L. Zhang, P. Zhang, and H. Lu, “A stagewise refinement model for detecting salient objects in images,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 4039–4048.
[72] X. Chen, A. Zheng, J. Li, and F. Lu, “Look, perceive and segment: Finding the salient objects in images via two-stream fixation-semantic CNNs,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1050–1058.
[73] Y. Zeng, P. Zhang, J. Zhang, Z. Lin, and H. Lu, “Towards high-resolution salient object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 7234–7243.
[74] Y. Zhuge, Y. Zeng, and H. Lu, “Deep embedding features for salient object detection,” in AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 9340–9347.
[75] Z. Luo, A. Mishra, A. Achkar, J. Eichel, S. Li, and P.-M. Jodoin, “Non-local deep features for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6593–6601.
[76] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan, “Amulet: Aggregating multi-level convolutional features for salient object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 202–211.
[77] S. He, J. Jiao, X. Zhang, G. Han, and R. W. Lau, “Delving into salient object subitizing and detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1059–1067.
[78] X. Hu, L. Zhu, J. Qin, C.-W. Fu, and P.-A. Heng, “Recurrently aggregating deep features for salient object detection,” in AAAI Conference on Artificial Intelligence, 2018.
[79] M. Amirul Islam, M. Kalash, and N. D. B. Bruce, “Revisiting salient object detection: Simultaneous detection, ranking, and subitizing of multiple salient objects,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
[80] Z. Wu, L. Su, and Q. Huang, “Cascaded partial decoder for fast and accurate salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3907–3916.
[81] Y. Zeng, Y. Zhuge, H. Lu, L. Zhang, M. Qian, and Y. Yu, “Multi-source weak supervision for saliency detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6074–6083.
[82] J.-X. Zhao, J.-J. Liu, D.-P. Fan, Y. Cao, J. Yang, and M.-M. Cheng, “Egnet: Edge guidance network for salient object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 8779–8788.
[83] D. Zhang, J. Han, and Y. Zhang, “Supervision by fusion: Towards unsupervised learning of deep salient object detector,” in Proc. IEEE Int. Conf. Comput. Vis., 2017.
[84] L. Zhang, J. Dai, H. Lu, Y. He, and G. Wang, “A bi-directional message passing model for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1741–1750.
[85] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, and A. Borji, “Detect globally, refine locally: A novel approach to saliency detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3127–3135.
[86] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang, “Progressive attention guided recurrent network for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 714–722.
[87] W. Wang, J. Shen, X. Dong, and A. Borji, “Salient object detection driven by fixation prediction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1171–1720.
[88] S. Chen, X. Tan, B. Wang, and X. Hu, “Reverse attention for salient object detection,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 236–252.
[89] M. Feng, H. Lu, and E. Ding, “Attentive feedback network for boundary-aware salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1623–1632.
[90] X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand, “Basnet: Boundary-aware salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 7479–7489.
[91] R. Wu, M. Feng, W. Guan, D. Wang, H. Lu, and E. Ding, “A mutual learning method for salient object detection with intertwined multi-supervision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 8150–8159.
[92] W. Wang, S. Zhao, J. Shen, S. C. Hoi, and A. Borji, “Salient object detection with pyramid attention and salient edges,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1448–1457.
[93] J.-J. Liu, Q. Hou, M.-M. Cheng, J. Feng, and J. Jiang, “A simple pooling-based design for real-time salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3917–3926.
[94] W. Wang, J. Shen, M.-M. Cheng, and L. Shao, “An iterative and cooperative top-down and bottom-up inference network for
salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5968–5977.
[95] Y. Xu, D. Xu, X. Hong, W. Ouyang, R. Ji, M. Xu, and G. Zhao, “Structured modeling of joint deep feature and prediction refinement for salient object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 3789–3798.
[96] S. S. Kruthiventi, V. Gudisa, J. H. Dholakiya, and R. Venkatesh Babu, “Saliency unified: A deep architecture for simultaneous eye fixation prediction and salient object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 5781–5790.
[97] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan, “Learning to detect salient objects with image-level supervision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[98] G. Li, Y. Xie, and L. Lin, “Weakly supervised salient object detection using image labels,” in AAAI Conference on Artificial Intelligence, 2018.
[99] X. Li, F. Yang, H. Cheng, W. Liu, and D. Shen, “Contour knowledge transfer for salient object detection,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 370–385.
[100] L. Zhang, J. Zhang, Z. Lin, H. Lu, and Y. He, “Capsal: Leveraging captioning to boost semantics for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6024–6033.
[101] J. Su, J. Li, Y. Zhang, C. Xia, and Y. Tian, “Selectivity or invariance: Boundary-aware salient object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 3799–3808.
[102] Z. Wu, L. Su, and Q. Huang, “Stacked cross refinement network for edge-aware salient object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 7264–7273.
[103] Y. Zeng, Y. Zhuge, H. Lu, and L. Zhang, “Joint learning of saliency detection and weakly supervised semantic segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 7223–7233.
[104] G. Li and Y. Yu, “Deep contrast learning for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 478–487.
[105] Y. Tang and X. Wu, “Saliency detection via combining region-level and pixel-level predictions with cnns,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 809–825.
[106] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in Proc. Advances Neural Inf. Process. Syst., 2011, pp. 109–117.
[107] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.-M. Hu, “Global contrast based salient region detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, 2015.
[108] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, “The secrets of salient object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 280–287.
[109] R. Ju, Y. Liu, T. Ren, L. Ge, and G. Wu, “Depth-aware salient object detection using anisotropic center-surround difference,” Signal Processing: Image Communication, vol. 38, pp. 115–126, 2015.
[110] H. Peng, B. Li, W. Xiong, W. Hu, and R. Ji, “Rgbd salient object detection: a benchmark and algorithms,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 92–109.
[111] M. Jiang, S. Huang, J. Duan, and Q. Zhao, “SALICON: Saliency in context,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1072–1080.
[112] J. Zhang, S. Ma, M. Sameki, S. Sclaroff, M. Betke, Z. Lin, X. Shen, B. Price, and R. Mech, “Salient object subitizing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 4045–4054.
[113] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.
[114] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
[115] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[116] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Advances Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[117] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3431–3440.
[118] S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1395–1403.
[119] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
[120] J. Yang, B. Price, S. Cohen, H. Lee, and M.-H. Yang, “Object contour detection with a fully convolutional encoder-decoder network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 193–202.
[121] J. L. Long, N. Zhang, and T. Darrell, “Do convnets learn correspondence?” in Proc. Advances Neural Inf. Process. Syst., 2014, pp. 1601–1609.
[122] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2921–2929.
[123] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mech, “Minimum barrier salient object detection at 80 fps,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1404–1412.
[124] J. Zhang and S. Sclaroff, “Exploiting surroundedness for saliency detection: a boolean map approach,” IEEE Trans. Pattern Anal. Mach. Intell., no. 5, pp. 889–902, 2016.
[125] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 328–335.
[126] R. Caruana, “Multitask learning,” Machine learning, vol. 28, no. 1, pp. 41–75, 1997.
[127] E. L. Kaufman, M. W. Lord, T. W. Reese, and J. Volkmann, “The discrimination of visual number,” The American Journal of Psychology, vol. 62, no. 4, pp. 498–525, 1949.
[128] S. Alpert, M. Galun, R. Basri, and A. Brandt, “Image segmentation by probabilistic bottom-up aggregation and cue integration,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007, pp. 1–8.
[129] V. Movahedi and J. H. Elder, “Design and perceptual validation of performance measures for salient object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. - Workshops, 2010.
[130] C. Xia, J. Li, X. Chen, A. Zheng, and Y. Zhang, “What is and what is not a salient object? learning salient object detector by ensembling linear exemplar regressors,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4321–4329.
[131] D.-P. Fan, M.-M. Cheng, J.-J. Liu, S.-H. Gao, Q. Hou, and A. Borji, “Salient objects in clutter: Bringing salient object detection to the foreground,” in Proc. Eur. Conf. Comput. Vis., 2018.
[132] R. Fan, Q. Hou, M.-M. Cheng, G. Yu, R. R. Martin, and S.-M. Hu, “Associating inter-image salient instances for weakly supervised semantic segmentation,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 367–383.
[133] J. Zhao, J. Li, H. Liu, S. Yan, and J. Feng, “Fine-grained multi-human parsing,” Int. J. Comput. Vis., pp. 1–19, 2019.
[134] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. IEEE Int. Conf. Comput. Vis., vol. 2, 2001, pp. 416–423.
[135] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 3485–3492.
[136] Z. Wang and B. Li, “A two-stage approach to saliency detection in images,” in Proc. IEEE Conf. Acoust. Speech Signal Process., 2008, pp. 965–968.
[137] R. Margolin, L. Zelnik-Manor, and A. Tal, “How to evaluate foreground maps?” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 248–255.
[138] D.-P. Fan, M.-M. Cheng, Y. Liu, T. Li, and A. Borji, “Structure-measure: A new way to evaluate foreground maps,” in Proc. IEEE Int. Conf. Comput. Vis., 2017.
[139] D.-P. Fan, M.-M. Cheng, J.-J. Liu, S.-H. Gao, Q. Hou, and A. Borji, “Enhanced-alignment measure for binary foreground map evaluation,” in International Joint Conferences on Artificial Intelligence, 2018.
[140] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 724–732.
[141] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in Proc. Int. Conf. Learn. Representations, 2014.
[142] C. Li, R. Cong, J. Hou, S. Zhang, Y. Qian, and S. Kwong, “Nested network with two-stream pyramid for salient object detection in
optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 11, pp. 9156–9166, 2019.
[143] I. Mehmood, M. Sajjad, W. Ejaz, and S. W. Baik, “Saliency-directed prioritization of visual data in wireless surveillance networks,” Information Fusion, vol. 24, pp. 16–30, 2015.
[144] Z. Zhang, S. Fidler, and R. Urtasun, “Instance-level segmentation for autonomous driving with deep densely connected mrfs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 669–677.
[145] C. Guo and L. Zhang, “A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression,” IEEE Trans. Image Process., vol. 19, no. 1, pp. 185–198, 2009.
[146] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille, “Adversarial examples for semantic segmentation and object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1369–1378.
[147] A. Fawzi, S.-M. Moosavi-Dezfooli, and P. Frossard, “Robustness of classifiers: from adversarial to random noise,” in Proc. Advances Neural Inf. Process. Syst., 2016, pp. 1632–1640.
[148] N. Papernot, P. McDaniel, and I. Goodfellow, “Transferability in machine learning: from phenomena to black-box attacks using adversarial samples,” arXiv preprint arXiv:1605.07277, 2016.
[149] Y. Liu, X. Chen, C. Liu, and D. Song, “Delving into transferable adversarial examples and black-box attacks,” in Proc. Int. Conf. Learn. Representations, 2017.
[150] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 1521–1528.
[151] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Representations, 2015.
[152] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” in Proc. Int. Conf. Learn. Representations, 2017.
[153] M. Berman, A. Rannen Triki, and M. B. Blaschko, “The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4413–4421.
[154] T.-W. Ke, J.-J. Hwang, Z. Liu, and S. X. Yu, “Adaptive affinity fields for semantic segmentation,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 587–602.
[155] E. Bengio, P.-L. Bacon, J. Pineau, and D. Precup, “Conditional computation in neural networks for faster models,” in Proc. Int. Conf. Learn. Representations, 2016.
[156] A. Veit and S. Belongie, “Convolutional networks with adaptive inference graphs,” in Proc. Eur. Conf. Comput. Vis., 2018.
[157] A. Zlateski, R. Jaroensri, P. Sharma, and F. Durand, “On the importance of label quality for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
[158] M. Jiang, J. Xu, and Q. Zhao, “Saliency in crowd,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 17–32.
[159] Q. Zheng, J. Jiao, Y. Cao, and R. W. Lau, “Task-driven webpage saliency,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 287–302.
[160] A. Palazzi, F. Solera, S. Calderara, S. Alletto, and R. Cucchiara, “Learning where to attend like a human driver,” in IEEE Intelligent Vehicles Symposium, 2017, pp. 920–925.
[161] A. K. Mishra, Y. Aloimonos, L. F. Cheong, and A. Kassim, “Active visual segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 639–653, 2012.
[162] C. M. Masciocchi, S. Mihalas, D. Parkhurst, and E. Niebur, “Everyone knows what is interesting: Salient locations which should be fixated,” Journal of Vision, vol. 9, no. 11, pp. 25–25, 2009.
[163] A. Borji, “What is a salient object? A dataset and a baseline model for salient object detection,” IEEE Trans. Image Process., vol. 24, no. 2, pp. 742–756, 2015.
[164] L. Jing and Y. Tian, “Self-supervised visual feature learning with deep neural networks: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., 2020.
[165] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2536–2544.
[166] G. Larsson, M. Maire, and G. Shakhnarovich, “Colorization as a proxy task for visual understanding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6874–6883.
[167] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 132–149.
[168] C. Bucilua, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proceedings of SIGKDD international conference on Knowledge discovery and data mining, 2006, pp. 535–541.
[169] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in Proc. Advances Neural Inf. Process. Syst. - workshops, 2014.
[170] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
[171] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning efficient object detection models with knowledge distillation,” in Proc. Advances Neural Inf. Process. Syst., 2017, pp. 742–751.

Wenguan Wang received his Ph.D. degree from Beijing Institute of Technology in 2018. He is currently a postdoc scholar at ETH Zurich, Switzerland. From 2016 to 2018, he was a visiting Ph.D. student in University of California, Los Angeles. From 2018 to 2019, he was a senior scientist at Inception Institute of Artificial Intelligence, UAE. His current research interests include computer vision and deep learning.

Qiuxia Lai received the B.E. and M.S. degrees in the School of Automation from Huazhong University of Science and Technology in 2013 and 2016, respectively. She is currently pursuing the Ph.D. degree in The Chinese University of Hong Kong. Her research interests include image/video processing and deep learning.

Huazhu Fu (SM’18) received the Ph.D. degree from Tianjin University, China, in 2013. He was a Research Fellow with Nanyang Technological University, Singapore for two years. From 2015 to 2018, he was a Research Scientist with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. He is currently a Senior Scientist with Inception Institute of Artificial Intelligence, UAE. His research interests include computer vision and medical image analysis. He is an Associate Editor of IEEE TMI and IEEE Access.

Jianbing Shen (M’11-SM’12) is a Professor with the School of Computer Science, Beijing Institute of Technology. He has published about 100 journal and conference papers such as TPAMI, CVPR, and ICCV. He has obtained many honors including the Fok Ying Tung Education Foundation from Ministry of Education, the Program for Beijing Excellent Youth Talents from Beijing Municipal Education Commission, and the Program for New Century Excellent Talents from Ministry of Education. His research interests include computer vision and deep learning. He is an Associate Editor of IEEE TNNLS, IEEE TIP and Neurocomputing.

Haibin Ling received the PhD degree from University of Maryland in 2006. From 2000 to 2001, he was an assistant researcher at Microsoft Research Asia. From 2006 to 2007, he worked as a postdoc at University of California Los Angeles. After that, he joined Siemens Corporate Research as a research scientist. Since 2008, he has been with Temple University where he is now an Associate Professor. He received the Best Student Paper Award at the ACM UIST in 2003, and the NSF CAREER Award in 2014. He is an Associate Editor of IEEE TPAMI, PR, and CVIU, and served as Area Chairs for CVPR 2014, 2016 and 2019.

Ruigang Yang is currently a full professor of Computer Science at the University of Kentucky. His research interests span over computer vision and computer graphics, in particular in 3D reconstruction and 3D data analysis. He has received a number of awards, including the US National Science Foundation Faculty Early Career Development (CAREER) Program Award in 2004, and the best Demonstration Award at CVPR 2007. He is currently an associate editor of IEEE TPAMI.