Abstract—As an essential problem in computer vision, salient object detection (SOD) has attracted an increasing amount of research
attention over the years. Recent advances in SOD are predominantly led by deep learning-based solutions (named deep SOD). To
enable in-depth understanding of deep SOD, in this paper, we provide a comprehensive survey covering various aspects, ranging from
algorithm taxonomy to unsolved issues. In particular, we first review deep SOD algorithms from different perspectives, including
network architecture, level of supervision, learning paradigm, and object-/instance-level detection. Following that, we summarize and
analyze existing SOD datasets and evaluation metrics. Then, we benchmark a large group of representative SOD models, and provide
detailed analyses of the comparison results. Moreover, we study the performance of SOD algorithms under different attribute settings,
which has not been thoroughly explored previously, by constructing a novel SOD dataset with rich attribute annotations covering various
salient object types, challenging factors, and scene categories. We further analyze, for the first time in the field, the robustness of SOD
models to random input perturbations and adversarial attacks. We also look into the generalization and difficulty of existing SOD
datasets. Finally, we discuss several open issues of SOD and outline future research directions. All the saliency prediction maps, our
constructed dataset with annotations, and codes for evaluation are publicly available at https://github.com/wenguanwang/SODsurvey.
1 INTRODUCTION
Fig. 1. A brief chronology of SOD. The very first SOD models date back to the work of Liu et al. [30] and Achanta et al. [31]. The first incorporation of
deep learning techniques into SOD models was in 2015. Listed methods are milestones, which are typically highly cited. See §1.1 for more details.
TABLE 1
Summary of previous reviews. For each work, the publication information and coverage are provided. See §1.2 for more detailed descriptions.
The history of SOD is relatively short and can be traced back to [30] and [31]. The rise of SOD has been driven by a wide range of object-level computer vision applications. Unlike fixation prediction (FP) models, which only predict sparse eye-fixation locations, SOD models aim to detect the whole entities of visually attractive objects with precise boundaries. Most traditional, non-deep SOD models [36], [53] rely on low-level features and certain heuristics (e.g., color contrast [32], background prior [54]). To obtain uniformly highlighted salient objects and clear object boundaries, an over-segmentation process that generates regions [55], super-pixels [56], [57], or object proposals [58] is often integrated into these models. Please see [44] for a more comprehensive overview.

With the compelling success of deep learning technologies in computer vision, more and more deep SOD methods have sprung up since 2015. Earlier deep SOD models utilized multi-layer perceptron (MLP) classifiers to predict the saliency score of deep features extracted from each image processing unit [27]–[29]. Later, a more effective and efficient form, i.e., the fully convolutional network (FCN)-based model, became the mainstream SOD architecture. Some recent methods [41], [42] also introduced Capsules [59] into SOD to comprehensively address object property modeling. A brief chronology of SOD is shown in Fig. 1.

Scope of the survey. Despite its short history, research in deep SOD has produced hundreds of papers, making it impractical (and fortunately unnecessary) to review all of them. Instead, we comprehensively select influential papers published in prestigious journals and conferences. This survey mainly focuses on the major progress of the last five years, but for completeness and better readability, some early related works are also included. Due to limitations on space and our knowledge, we apologize to those authors whose works are not included in this paper. It is worth noting that we restrict this survey to single-image SOD methods, and leave RGB-D SOD, co-saliency detection, video SOD, etc., as separate topics.

1.2 Related Previous Reviews and Surveys

Table 1 lists existing surveys that are related to ours. Among them, Borji et al. [44] reviewed SOD methods preceding 2015, and thus did not cover recent deep learning-based solutions. Zhang et al. [46] reviewed methods for co-saliency detection, i.e., detecting common salient objects from multiple relevant images. Cong et al. [47] reviewed several extended SOD tasks, including RGB-D SOD, co-saliency detection, and video SOD. Han et al. [48] looked into several sub-directions of object detection, and outlined recent progress in objectness detection, SOD, and category-specific object detection. Borji et al. summarized both heuristic [43] and deep models [49] for FP. Nguyen et al. [45] focused on categorizing the applications of visual saliency (including both SOD and FP) in different areas. Finally, a more recently published survey [50] covers both traditional non-deep SOD methods and deep ones until 2017, and discusses their relation to several other closely-related research areas, such as special-purpose object detection and segmentation.

Different from previous SOD surveys, which focus on earlier non-deep learning SOD methods [44], other related fields [43], [47]–[49], practical applications [45], or a limited number of deep SOD models [50], this work systematically and comprehensively reviews recent advances in the field. It features in-depth analyses and discussions on various aspects, many of which, to the best of our knowledge, have never been explored in this field. In particular, we comprehensively summarize and discuss existing deep SOD methods under several proposed taxonomies (§2); review datasets (§3) and evaluation metrics (§4) with their pros and cons; provide a deeper understanding of SOD models through an attribute-based evaluation (§5.3); discuss the influence of input perturbation (§5.4); analyze the robustness of deep SOD models to adversarial attacks (§5.5); study the generalization and difficulty of existing SOD datasets (§5.6); and offer insight into essential open issues, challenges, and future directions (§6). We expect our survey to provide novel insight and inspiration that will facilitate the understanding of deep SOD, and foster research on the open issues raised.
TABLE 2
Taxonomies and representative publications of deep SOD methods. See §2 for more detailed descriptions.
Category — Publications
Network Architectures (§2.1)
  Multi-layer perceptron (MLP)-based:
    1) Super-pixel/patch-based [29], [60], [27], [61]
    2) Object proposal-based [28], [37], [62]
  Fully convolutional network (FCN)-based:
    1) Single-stream [63], [64], [65], [66], [67], [68], [69]
    2) Multi-stream [70], [71], [72], [73], [74]
    3) Side-fusion [39], [75], [76], [77], [78], [79], [80], [81], [82]
    4) Bottom-up/top-down [38], [83], [84], [85], [86], [87], [40], [88], [89], [90], [91], [92], [93], [94], [95]
    5) Branched [96], [97], [98], [99], [100], [101], [102], [103]
  Hybrid network-based: [104], [105]
  Capsule-based: [41], [42]
Level of Supervision (§2.2)
  Fully-supervised: all others
  Un-/Weakly-supervised:
    1) Category-level [97], [68], [69], [81]
    2) Pseudo pixel-level [83], [98], [67], [99]
Learning Paradigm (§2.3)
  Single-task learning (STL): all others
  Multi-task learning (MTL):
    1) Salient object subitizing [37], [77], [79]
    2) Fixation prediction [96], [87]
    3) Image classification [97], [98]
    4) Semantic segmentation [63], [103]
    5) Contour/edge detection [75], [99], [89], [91], [92], [93], [101], [82], [102]
    6) Image captioning [100]
Object-/Instance-Level (§2.4)
  Object-level: all others
  Instance-level: [37], [70]
1.3 Our Contributions
Our contributions in this paper are summarized as follows:
1) A systematic review of deep SOD models from various perspectives. We categorize and summarize existing deep SOD models according to network architecture, level of supervision, learning paradigm, etc. The proposed taxonomies aim to help researchers gain a deeper understanding of the key features of deep SOD models.
2) An attribute-based performance evaluation of SOD models. We compile a hybrid dataset and provide annotated attributes for object categories, scene categories, and challenging factors. By evaluating several representative SOD models on it, we uncover the strengths and weaknesses of deep and non-deep approaches, opening up promising directions for future efforts.
3) An analysis of the robustness of SOD models against general input perturbations. To study the robustness of SOD models, we investigate the effects of various perturbations on the final performance of deep and non-deep SOD models. Some results are somewhat unexpected.
4) The first known adversarial attack analysis for SOD models. We further examine the robustness of SOD models against intentionally designed perturbations, i.e., adversarial attacks. The specially designed attacks and evaluations can serve as baselines for further studying the robustness and transferability of deep SOD models.
5) Cross-dataset generalization study. To analyze the generalization and difficulty of existing SOD datasets in depth, we conduct a cross-dataset generalization study that quantitatively reveals the dataset bias.
6) Overview of open issues and future directions. We thoroughly look over several essential issues (i.e., model design, dataset collection, etc.), shedding light on potential directions for future research.
These contributions together comprise an exhaustive, up-to-date, and in-depth survey, and significantly differentiate it from previous review papers.

The rest of the paper is organized as follows. §2 explains the proposed taxonomies, each accompanied by one or two of the most representative models. §3 examines the most notable SOD datasets, whereas §4 describes several widely used SOD metrics. §5 benchmarks several deep SOD models and provides in-depth analyses. §6 provides further discussions and presents open issues and future research directions for the field. Finally, §7 concludes the paper.

2 DEEP LEARNING BASED SOD MODELS
Before reviewing recent deep SOD models in detail, we first provide a common formulation of the image-based SOD problem. Given an input image I ∈ R^{W×H×3} of size W×H, an SOD model f maps the input image I to a continuous saliency map S = f(I) ∈ [0, 1]^{W×H}. For learning-based SOD, the model f is learned from a set of training samples. Given a set of static images I = {I_n ∈ R^{W×H×3}}_n and corresponding binary SOD ground-truth masks G = {G_n ∈ {0, 1}^{W×H}}_n, the goal of learning is to find f ∈ F that minimizes the prediction error Σ_n ℓ(S_n, G_n), where ℓ is a certain distance measure (e.g., one of those defined in §4), S_n = f(I_n), and F is the set of potential mapping functions. Deep SOD methods typically model f through modern deep learning techniques, as will be reviewed later in this section. The ground-truths G can be collected by different methodologies, i.e., direct human annotation or eye-fixation-guided labeling, and may have different formats, i.e., pixel-wise or bounding-box annotations, which will be discussed in §3.

In Table 2, we categorize recent deep SOD models according to four taxonomies, considering network architecture (§2.1), level of supervision (§2.2), learning paradigm (§2.3), and whether they work at an object or instance level (§2.4). In the following, each category is elaborated on and exemplified by one or two of the most representative models. Table 3 summarizes the essential characteristics of recent SOD models.
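To make the formulation above concrete, the following is a minimal, hedged sketch of how such a mapping f can be trained with a pixel-wise loss. The tiny network, the layer sizes, and the helper names are illustrative assumptions, not the setup of any specific surveyed model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSODNet(nn.Module):
    """Illustrative encoder-decoder f: image -> saliency map in [0, 1]^(W x H)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Conv2d(64, 1, 1)  # 1-channel saliency logits

    def forward(self, x):
        h, w = x.shape[-2:]
        logits = self.decoder(self.encoder(x))
        # Upsample back to input resolution and squash to [0, 1].
        return torch.sigmoid(F.interpolate(logits, size=(h, w),
                                           mode='bilinear', align_corners=False))

def training_step(model, images, masks, optimizer):
    """One step of minimizing sum_n l(S_n, G_n) with l = binary cross-entropy."""
    preds = model(images)                       # S_n in [0, 1]^(W x H)
    loss = F.binary_cross_entropy(preds, masks) # masks: binary ground-truth G_n
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```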
TABLE 3
Summary of essential characteristics for popular SOD methods. Here, ‘#Training’ is the number of training images, and ‘CRF’ denotes whether the
predictions are post-processed by conditional random field [106]. See §2 for more detailed descriptions.
Model Publ. Architecture Backbone Supervision Learning Level Training data #Training CRF
2015
MCDL [29] CVPR MLP+super-pixel GoogleNet Fully-Sup. STL Object MSRA10K [107] 8,000
LEGS [28] CVPR MLP+segment - Fully-Sup. STL Object MSRA-B [30]+PASCAL-S [108] 3,000+340
MDF [27] CVPR MLP+segment - Fully-Sup. STL Object MSRA-B [30] 2,500
ELD [60] CVPR MLP+super-pixel VGGNet Fully-Sup. STL Object MSRA10K [107] ∼9,000
DHSNet [38] CVPR FCN VGGNet Fully-Sup. STL Object MSRA10K [107]+DUT-OMRON [56] 6,000+3,500
DCL [104] CVPR FCN VGGNet Fully-Sup. STL Object MSRA-B [30] 2,500 X
RACDNN [64] CVPR FCN VGGNet Fully-Sup. STL Object DUT-OMRON [56]+NJU2000 [109]+RGBD [110] 10,565
2016
SU [96] CVPR FCN VGGNet Fully-Sup. MTL Object MSRA10K [107]+SALICON [111] 10,000+15,000 X
MAP [37] CVPR MLP+obj. prop. VGGNet Fully-Sup. MTL Instance SOS [112] ∼5,500
SSD [62] ECCV MLP+obj. prop. AlexNet Fully-Sup. STL Object MSRA-B [30] 2,500
CRPSD [105] ECCV FCN VGGNet Fully-Sup. STL Object MSRA10K [107] 10,000
RFCN [63] ECCV FCN VGGNet Fully-Sup. MTL Object PASCAL VOC 2010 [113]+MSRA10K [107] 10,103+10,000
MSRNet [70] CVPR FCN VGGNet Fully-Sup. STL Instance MSRA-B [30]+HKU-IS [27] (+ILSO [70]) 2,500+2,500 (+500) X
DSS [39] CVPR FCN VGGNet Fully-Sup. STL Object MSRA-B [30]+HKU-IS [27] 2,500 X
WSS [97] CVPR FCN VGGNet Weakly-Sup. MTL Object ImageNet [114] 456k X
DLS [65] CVPR FCN VGGNet Fully-Sup. STL Object MSRA10K [107] 10,000
NLDF [75] CVPR FCN VGGNet Fully-Sup. MTL Object MSRA-B [30] 2,500 X
2017
DSOS [77] ICCV FCN VGGNet Fully-Sup. MTL Object SOS [112] 6,900
Amulet [76] ICCV FCN VGGNet Fully-Sup. STL Object MSRA10K [107] 10,000
FSN [72] ICCV FCN VGGNet Fully-Sup. STL Object MSRA10K [107] 10,000
SBF [83] ICCV FCN VGGNet Un-Sup. STL Object MSRA10K [107] 10,000
SRM [71] ICCV FCN ResNet Fully-Sup. STL Object DUTS [97] 10,553
UCF [66] ICCV FCN VGGNet Fully-Sup. STL Object MSRA10K [107] 10,000
RADF [78] AAAI FCN VGGNet Fully-Sup. STL Object MSRA10K [107] 10,000 X
ASMO [98] AAAI FCN ResNet101 Weakly-Sup. MTL Object MS COCO [115]+MSRA-B [30]+HKU-IS [27] 82,783+2,500+2,500 X
LICNN [68] AAAI FCN VGGNet Weakly-Sup. STL Object ImageNet [114] 456k
BDMP [84] CVPR FCN VGGNet Fully-Sup. STL Object DUTS [97] 10,553
DUS [67] CVPR FCN ResNet101 Un-Sup. MTL Object MSRA-B [30] 2,500
2018
DGRL [85] CVPR FCN ResNet50 Fully-Sup. STL Object DUTS [97] 10,553
PAGR [86] CVPR FCN VGGNet19 Fully-Sup. STL Object DUTS [97] 10,553
RSDNet [79] CVPR FCN ResNet101 Fully-Sup. MTL Object PASCAL-S [108] 425
ASNet [87] CVPR FCN VGGNet Fully-Sup. MTL Object SALICON [111]+MSRA10K [107]+DUT-OMRON [56] 15,000+10,000+5,168
PiCANet [40] CVPR FCN VGGNet/ResNet50 Fully-Sup. STL Object DUTS [97] 10,553 X
C2S-Net [99] ECCV FCN VGGNet Weakly-Sup. MTL Object MSRA10K [107]+Web 10,000+20,000
RAS [88] ECCV FCN VGGNet Fully-Sup. STL Object MSRA-B [30] 2,500
SuperVAE [69] AAAI FCN N/A Un-Sup. STL Object N/A N/A
DEF [74] AAAI FCN ResNet101 Fully-Sup. STL Object DUTS [97] 10,553
AFNet [89] CVPR FCN VGGNet16 Fully-Sup. MTL Object DUTS [97] 10,553
BASNet [90] CVPR FCN ResNet-34 Fully-Sup. STL Object DUTS [97] 10,553
CapSal [100] CVPR FCN ResNet101 Fully-Sup. MTL Object COCO-CapSal [100]/DUTS [97] 5,265/10,553
CPD-R [80] CVPR FCN ResNet50 Fully-Sup. STL Object DUTS [97] 10,553
MLSLNet [91] CVPR FCN VGG16 Fully-Sup. MTL Object DUTS [97] 10,553
† MWS [81] CVPR FCN N/A Weakly-Sup. STL Object ImageNet DET [114]+MS COCO [115] 456k+82,783
2019
2.1 Representative Network Architectures for SOD
Based on the primary network architectures adopted, we classify deep SOD models into four categories, namely MLP-based (§2.1.1), FCN-based (§2.1.2), hybrid network-based (§2.1.3), and Capsule-based (§2.1.4).

2.1.1 Multi-Layer Perceptron (MLP)-Based Methods
MLP-based methods leverage image subunits (i.e., super-pixels/patches [29], [60], [61] and generic object proposals [27], [28], [37], [62]) as processing units. They feed deep features extracted from the subunits into an MLP classifier for saliency score prediction (Fig. 2(a)).
1) Super-pixel/patch-based methods use regular (patch) or nearly-regular (super-pixel) image decomposition. As an example of regular decomposition, MCDL [29] uses two pathways to extract local and global context from two super-pixel-centered windows of different sizes. The global and local feature vectors are fed into an MLP for classifying background and saliency. In contrast, SuperCNN [61] constructs two hand-crafted input feature sequences for each irregular super-pixel, and uses two separate CNN columns to produce saliency scores from the feature sequences, respectively. Regular image decomposition accelerates processing, so most methods in this category are based on regular decomposition.
2) Object proposal-based methods leverage object proposals [27], [28] or bounding-boxes [37], [62] as basic processing units in order to better encode object information. For instance, MAP [37] uses a CNN model to generate a set of scored bounding-boxes, then selects an optimized compact subset of bounding-boxes as the salient objects. Note that this kind of method typically produces coarse SOD results due to the lack of object boundary information.
Though MLP-based SOD methods greatly outperform their non-deep counterparts, they cannot fully leverage essential spatial information and are quite time-consuming, as they need to process all visual subunits one-by-one.
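As a rough illustration of the MLP-based pipeline described above, the sketch below scores pre-extracted super-pixel features with a small MLP. The feature dimension and the feature-extraction helper are illustrative assumptions, not the design of MCDL or SuperCNN.

```python
import torch
import torch.nn as nn

class SuperpixelSaliencyMLP(nn.Module):
    """Classifies each super-pixel's deep feature as salient vs. background."""
    def __init__(self, feat_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 1),  # one saliency logit per super-pixel
        )

    def forward(self, superpixel_feats):
        # superpixel_feats: (num_superpixels, feat_dim), e.g. CNN features
        # cropped around each super-pixel (hypothetical extractor not shown).
        return torch.sigmoid(self.mlp(superpixel_feats)).squeeze(-1)

# Usage sketch: the per-unit scores are pasted back into the image to form a
# coarse saliency map.
# feats = extract_superpixel_features(image, superpixels)   # assumed helper
# scores = SuperpixelSaliencyMLP(feats.shape[1])(feats)      # one score per unit
```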
Fig. 2. Categorization of previous deep SOD models according to the adopted network architecture. (a) MLP-based methods. (b)-(f) FCN-based
methods, mainly using (b) single-stream network, (c) multi-stream network, (d) side-out fusion network, (e) bottom-up/top-down network, and (f)
branch network architectures. (g) Hybrid network-based methods. (h) Capsule-based methods. See §2.1 for more detailed descriptions.
2.1.2 Fully Convolutional Network (FCN)-Based Methods
To address the limitations of MLP-based methods, recent solutions adopt the FCN architecture [117], leading to end-to-end spatial saliency representation learning and fast saliency prediction within a single feed-forward process. FCN-based methods are now dominant in the field. Typical architectures can be further classified as: single-stream, multi-stream, side-fusion, bottom-up/top-down, and branched networks.
1) Single-stream network is the most standard architecture, having a stack of convolutional layers interleaved with pooling and non-linear activation operations (see Fig. 2(b)). It takes a whole image as input, and directly outputs a pixel-wise probabilistic map highlighting salient objects. For example, UCF [66] makes use of an encoder-decoder network architecture for finer-resolution saliency prediction. It incorporates a reformulated dropout in the encoder to learn uncertain features, and a hybrid upsampling scheme in the decoder to avoid checkerboard artifacts.
2) Multi-stream network, as depicted in Fig. 2(c), typically consists of multiple network streams to explicitly learn multi-scale saliency features from multi-resolution inputs. The multi-stream outputs are fused to form a final prediction. DCL [104], as one of the earliest attempts in this direction, contains two streams, which produce pixel- and region-level SOD estimations, respectively.
3) Side-fusion network fuses multi-layer responses of a backbone network together for SOD prediction, making use of the complementary information in the inherent multi-scale representations of the CNN hierarchy (Fig. 2(d)). Side-outputs are typically supervised by the ground-truth, leading to a deep supervision strategy [118]. As a well-known side-fusion network based SOD model, DSS [39] adds short connections from deeper side-outputs to shallower ones. In this way, higher-level features help lower side-outputs to better locate salient regions, and lower-level features enrich deeper side-outputs with finer details.
4) Bottom-up/top-down network refines rough saliency maps in the feed-forward pass by gradually incorporating spatial-detail-rich features from lower layers, and produces the finest saliency map at the top-most layer (Fig. 2(e)), which resembles U-Net [119] for semantic segmentation. This network architecture was first adopted by PiCANet [40], which hierarchically embeds global and local pixel-wise attention modules to selectively attend to informative context.
5) Branched networks typically address multi-task learning for more robust saliency pattern modeling. They have a single-input-multiple-output structure, where bottom layers are shared to process a common input and top layers are specialized for different tasks (Fig. 2(f)). For example, C2S-Net [99] is constructed by adding a pre-trained contour detection model [120] to a main SOD branch. The two branches are then alternately trained for the two tasks, i.e., SOD and contour detection.

2.1.3 Hybrid Network-Based Methods
Some other models combine both MLP- and FCN-based subnets to produce edge-preserving results with multi-scale context (Fig. 2(g)). Combining pixel-level and region-level saliency cues is a promising strategy to yield improved performance, though it introduces extra computational costs. CRPSD [105] consolidates this idea. It combines pixel- and region-level saliency. The former is generated by fusing the last and penultimate side-output features of an FCN, while the latter is obtained by applying an existing SOD model [29] to image regions. Only the FCN and fusion layers are trainable.

2.1.4 Capsule-Based Methods
Recently, Hinton et al. [59] proposed a new family of neural networks, named Capsules. Capsules are made up of groups of neurons that accept and output vectors, as opposed to the scalar values of CNNs, allowing entity properties to be comprehensively modeled. Some researchers have thus been inspired to explore Capsules in SOD [41], [42] (Fig. 2(h)). For instance, TSPOANet [41] emphasizes part-object relations using a two-stream capsule network. The input features of the capsules are extracted from a CNN and transformed into low-level capsules. These are then assigned to high-level capsules, and finally recognized as salient or background.
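The side-fusion and deep-supervision ideas of §2.1.2 can be summarized in a short sketch. The backbone split points, channel counts, and fusion layer below are illustrative assumptions (loosely inspired by DSS-style short connections), not the exact design of any surveyed model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class SideFusionSOD(nn.Module):
    """FCN with side outputs from several backbone stages, fused into a final map."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(pretrained=False).features
        # Split VGG16 into stages; the stage boundaries here are an assumption.
        self.stages = nn.ModuleList([vgg[:10], vgg[10:17], vgg[17:24], vgg[24:31]])
        self.side_heads = nn.ModuleList([nn.Conv2d(c, 1, 1) for c in (128, 256, 512, 512)])
        self.fuse = nn.Conv2d(4, 1, 1)  # learnable fusion of the side outputs

    def forward(self, x):
        h, w = x.shape[-2:]
        sides, feat = [], x
        for stage, head in zip(self.stages, self.side_heads):
            feat = stage(feat)
            sides.append(F.interpolate(head(feat), size=(h, w),
                                       mode='bilinear', align_corners=False))
        fused = self.fuse(torch.cat(sides, dim=1))
        # Deep supervision: every side output (and the fused map) is trained
        # against the ground-truth mask with a pixel-wise loss.
        return [torch.sigmoid(s) for s in sides] + [torch.sigmoid(fused)]
```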
2.2 Level of Supervision
Based on the type of supervision, deep SOD models can be classified as either fully-supervised or weakly-/unsupervised.

2.2.1 Fully-Supervised Methods
Most deep SOD models are trained with large-scale pixel-level human annotations, which are time-consuming and expensive to acquire. Moreover, models trained on fine-labeled datasets tend to overfit and generalize poorly to real-life images [67]. Thus, training SOD with weaker annotations has become an increasingly popular research direction.

2.2.2 Weakly-/Unsupervised Methods
To get rid of laborious manual labeling, several weak supervision forms have been explored in SOD, including image-level category labels [68], [97], object contours [99], image captions [81], and pseudo ground-truth masks generated by non-learning SOD methods [67], [83], [98].
1) Category-level supervision. It has been shown that deep features trained with only image-level labels also provide information on object locations [121], [122], making them promising supervision signals for SOD training. WSS [97], as a typical example, first pre-trains a two-branch network, where one branch is used to predict image labels based on ImageNet [114], and the other estimates SOD maps. The estimated maps are refined by CRF and used to further fine-tune the SOD branch.
2) Pseudo pixel-level supervision. Though informative, image-level labels are weak. Some researchers therefore instead use traditional non-learning SOD methods [67], [83], [98], or contour information [99], to generate noisy yet finer-grained cues for training. For instance, SBF [83] fuses weak saliency maps from a set of prior heuristic SOD models [35], [123], [124] at intra- and inter-image levels to generate supervision signals. C2S-Net [99] trains the SOD branch with pixel-wise salient object masks generated from the outputs of the contour branch [125] using CEDN [120]. The contour and SOD branches alternately update each other and progressively output finer SOD predictions.

2.3 Learning Paradigm
From the perspective of learning paradigms, SOD networks can be divided into single-task learning (STL) and multi-task learning (MTL) methods.

2.3.1 Single-Task Learning (STL) Based Methods
In machine learning, the standard practice is to learn one task at a time [126], i.e., STL. Most deep SOD methods belong to this realm of learning, i.e., they utilize supervision from a single knowledge domain (SOD or another related field such as image classification [68]) for training.

2.3.2 Multi-Task Learning (MTL) Based Methods
Inspired by the human learning process, where knowledge learned from related tasks can assist the learning of a new task, MTL [126] aims to improve the performance of multiple related tasks by learning them simultaneously. Benefiting from the extra knowledge of related tasks, models can gain improved generalizability. An extra advantage lies in the sharing of samples among tasks, which alleviates the lack of data for training heavily parameterized models. These are the core motivations of MTL-based SOD models, and branched architectures (see §2.1.2) are usually adopted.
1) Salient object subitizing. The ability of humans to rapidly enumerate a small number of items is known as subitizing [112], [127]. Inspired by this, some works learn salient object subitizing and detection simultaneously [37], [77], [79]. RSDNet [79] represents the latest advance in this direction. It addresses detection, ranking, and subitizing of salient objects in a unified framework.
2) Fixation prediction aims to predict human eye-fixation locations in visual scenes. Due to its close relation with SOD, learning shared knowledge from these two tasks can improve the performance of both. For example, ASNet [87] derives fixation information from upper network layers as a high-level understanding of the scene. Then, fine-grained object-level saliency is progressively optimized under the guidance of the fixation in a top-down manner.
3) Image classification. Image-level tags are valuable for SOD, as they provide the category information of dominant objects in the images, which are very likely to be the salient regions [97]. Inspired by this, some SOD models learn image classification as an auxiliary task. For example, ASMO [98] leverages class activation maps from a neural classifier and saliency maps from previous non-learning SOD methods to train the SOD network in an iterative manner.
4) Semantic segmentation is per-pixel semantic prediction. Though SOD is class-agnostic, high-level semantics play a crucial role in saliency modeling. Thus, the task of semantic segmentation can also be integrated into SOD learning. A recent SOD model, SSNet [103], is developed upon this idea. It uses a saliency aggregation module to predict a saliency score for each category. Then, a segmentation network is used to produce segmentation masks for all the categories. These masks are finally aggregated (according to the corresponding saliency scores) to produce an SOD map.
5) Contour/edge detection refers to the task of detecting obvious object boundaries in images, which are informative of salient objects. Thus, it is also explored in SOD modeling. For example, PAGE-Net [92] learns an edge detection module and embeds edge cues into the main SOD stream in a top-down manner, leading to better edge-preserving results.
6) Image captioning can provide extra knowledge about the main content of visual scenes, enabling SOD models to better capture high-level semantics. This has been explored in CapSal [100], which incorporates semantic context from a captioning network with local-global visual cues to achieve improved performance for detecting salient objects.

2.4 Object-/Instance-Level SOD
According to whether or not they can identify different salient object instances, current deep SOD models can be categorized into object-level and instance-level methods.

2.4.1 Object-Level Methods
Most deep SOD models are object-level methods, i.e., designed to detect pixels that belong to salient objects without being aware of individual object instances.
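Looking back at the MTL paradigm of §2.3.2, the essence of a branched multi-task SOD model is a shared backbone with task-specific heads trained under a weighted joint loss. The sketch below (SOD plus an auxiliary edge branch, with an assumed trade-off weight) is an illustrative assumption, not the configuration of any specific surveyed model.

```python
import torch.nn.functional as F

def multitask_loss(saliency_pred, saliency_gt, edge_pred, edge_gt, edge_weight=0.5):
    """Joint objective for a branched network: shared encoder, two output heads.

    saliency_pred / edge_pred: sigmoid maps in [0, 1]; *_gt: binary targets.
    edge_weight is an assumed trade-off hyper-parameter.
    """
    l_sal = F.binary_cross_entropy(saliency_pred, saliency_gt)   # main SOD task
    l_edge = F.binary_cross_entropy(edge_pred, edge_gt)          # auxiliary task
    return l_sal + edge_weight * l_edge
```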
TABLE 4
Statistics of popular SOD datasets, including the number of images, number of salient objects per image, area ratio of the salient objects in
images, annotation type, image resolution, and existence of fixation data. See §3 for more detailed descriptions.
Dataset Year Publ. #Img. #Obj. Obj. Area(%) SOD Annotation Resolution Fix.
MSRA-A [30] 2007 CVPR 1,000/20,840 1-2 - bounding-box object-level -
MSRA-B [30] 2007 CVPR 5,000 1-2 20.82±10.29 bounding-box object-level, pixel-wise object-level max(w, h) = 400, min(w, h) = 126
Early
SED1 [128] 2007 CVPR 100 1 26.70±14.26 pixel-wise object-level max(w, h) = 465, min(w, h) = 125
SED2 [128] 2007 CVPR 100 2 21.42±18.41 pixel-wise object-level max(w, h) = 300, min(w, h) = 144
ASD [31] 2009 CVPR 1,000 1-2 19.89±9.53 pixel-wise object-level max(w, h) = 400, min(w, h) = 142
Modern&Popular
SOD [129] 2010 CVPR-W 300 1-4+ 27.99±19.36 pixel-wise object-level max(w, h) = 481, min(w, h) = 321
MSRA10K [107] 2015 TPAMI 10,000 1-2 22.21±10.09 pixel-wise object-level max(w, h) = 400, min(w, h) = 144
ECSSD [55] 2015 TPAMI 1,000 1-4+ 23.51±14.02 pixel-wise object-level max(w, h) = 400, min(w, h) = 139
DUT-OMRON [56] 2013 CVPR 5,168 1-4+ 14.85±12.15 pixel-wise object-level max(w, h) = 401, min(w, h) = 139 X
PASCAL-S [108] 2014 CVPR 850 1-4+ 24.23±16.70 pixel-wise object-level max(w, h) = 500, min(w, h) = 139 X
HKU-IS [27] 2015 CVPR 4,447 1-4+ 19.13±10.90 pixel-wise object-level max(w, h) = 500, min(w, h) = 100
DUTS [97] 2017 CVPR 15,572 1-4+ 23.17±15.52 pixel-wise object-level max(w, h) = 500, min(w, h) = 100
SOS [112] 2015 CVPR 6,900 0-4+ 41.22±25.35 object number, bounding-box (train set) max(w, h) = 6132, min(w, h) = 80
MSO [112] 2015 CVPR 1,224 0-4+ 39.51±24.85 object number, bounding-box instance-level max(w, h) = 3888, min(w, h) = 120
Special
ILSO [70] 2017 CVPR 1,000 1-4+ 24.89±12.59 pixel-wise instance-level max(w, h) = 400, min(w, h) = 142
XPIE [130] 2017 CVPR 10,000 1-4+ 19.42±14.39 pixel-wise object-level, geographic information max(w, h) = 500, min(w, h) = 130 X
SOC [131] 2018 ECCV 6,000 0-4+ 21.36±16.88 pixel-wise instance-level, object category, attribute max(w, h) = 849, min(w, h) = 161
COCO-CapSal [100] 2019 CVPR 6,724 1-4+ 23.74±17.00 pixel-wise object-level, image caption max(w, h) = 640, min(w, h) = 480
HRSOD [73] 2019 ICCV 2,010 1-4+ 21.13±15.14 pixel-wise object-level max(w, h) = 10240, min(w, h) = 600
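The per-dataset statistics in Table 4 (number of salient objects, salient-area ratio) can be reproduced from the ground-truth masks. The sketch below is a hedged assumption about how such numbers are computed (connected components as object count; mean ± standard deviation over all masks), consistent with the column definitions but not taken from the authors' code.

```python
import numpy as np
from scipy import ndimage

def mask_statistics(gt_mask):
    """Statistics for one binary ground-truth mask: object count and area ratio (%)."""
    binary = gt_mask > 0.5
    _, num_objects = ndimage.label(binary)            # connected components
    area_ratio = 100.0 * binary.sum() / binary.size   # percentage of image area
    return num_objects, area_ratio

# Dataset-level entries such as "22.21 ± 10.09" would then be the mean and standard
# deviation of area_ratio over all masks in that dataset.
```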
2.4.2 Instance-Level Methods
Instance-level SOD methods further identify individual object instances in the detected salient regions, which is crucial for practical applications that need finer distinctions, such as semantic segmentation [132] and multi-human parsing [133]. As an early attempt, MSRNet [70] performs salient instance detection by decomposing it into three sub-tasks, i.e., pixel-level saliency prediction, salient object contour detection, and salient instance identification. It jointly performs the first two sub-tasks by integrating deep features from several differently scaled versions of the input image. The last sub-task is solved by multi-scale combinatorial grouping [125], which generates salient object proposals from the detected contours and filters out noisy or overlapping ones.

3 SOD DATASETS
With the rapid development of SOD, numerous datasets have been introduced. Table 4 summarizes 19 SOD datasets, which are highly representative and widely used for training or benchmarking, or collected with specific properties.

3.1 Quick Overview
In an attempt to facilitate understanding of SOD datasets, we present some main take-away points of this section.
• Compared with early datasets [30], [31], [128], recent ones [27], [56], [97], [107] are typically more advanced, with less center bias, improved complexity, and increased scale. They are thus better-suited for training and evaluation, and likely to have longer life-spans.
• Some other recent datasets [70], [73], [100], [112], [130], [131] are enriched with more diverse annotations (e.g., subitizing, captioning), representing new trends in the field. More in-depth discussions regarding the generalizability and difficulty of several famous datasets will be presented in §5.6.

3.2 Early SOD Datasets
Early SOD datasets typically contain simple scenes where 1-2 salient objects stand out from a clear background.
• MSRA-A [30] contains 20,840 images. Each image has only one noticeable and eye-catching object, annotated by a bounding-box. As a subset of MSRA-A, MSRA-B has 5,000 images and less ambiguity w.r.t. the salient object.
• SED [128] comprises a single-object subset and a two-object subset; each has 100 images with mask annotations.
• ASD [31], also a subset of MSRA-A, has 1,000 images with pixel-wise ground-truths.

3.3 Popular Modern SOD Datasets
Recent SOD datasets tend to include more challenging and general scenes with relatively complex backgrounds and multiple salient objects. All have pixel-wise annotations.
• SOD [129] consists of 300 images, constructed from [134]. Many images have more than one salient object that is similar to the background or touches the image boundaries.
• MSRA10K [107], also known as THUS10K, contains 10,000 images selected from MSRA-A and covers all the images in ASD. Due to its large scale, MSRA10K is widely used to train deep SOD models (see Table 3).
• ECSSD [55] is composed of 1,000 images with semantically meaningful but structurally complex natural content.
• DUT-OMRON [56] has 5,168 images with complex backgrounds and diverse content, with pixel-wise annotations.
• PASCAL-S [108] comprises 850 challenging images selected from the PASCAL VOC2010 val set [113]. Along with eye-fixation records, non-binary salient-object mask annotations are provided. Note that the saliency value of a pixel is calculated as the ratio of subjects that select the segment containing this pixel as salient.
• HKU-IS [27] has 4,447 complex scenes that typically contain multiple disconnected objects with diverse spatial distributions and similar fore-/background appearances.

Dataset websites:
1. http://www.wisdom.weizmann.ac.il/∼vision/Seg Evaluation DB
2. https://ivrlwww.epfl.ch/supplementary material/RK CVPR09/
3. http://elderlab.yorku.ca/SOD/
4. https://mmcheng.net/zh/msra10k/
5. http://www.cse.cuhk.edu.hk/leojia/projects/hsaliency
6. http://saliencydetection.net/dut-omron/
7. http://cbi.gatech.edu/salobj/
8. https://i.cs.hku.hk/∼gbli/deep saliency.html
Fig. 3. Annotation distributions of SOD datasets (see §3 for details).

• DUTS [97] is a large-scale dataset, where the 10,553 training images were selected from the ImageNet train/val set [114], and the 5,019 test images are from the ImageNet test set and SUN [135]. Since 2017, SOD models have typically been trained on DUTS (Table 3).

3.4 Other Special SOD Datasets
In addition to the above "standard" SOD datasets, some special ones have also recently been proposed, leading to new research directions.
• SOS [112] is created for SOD subitizing [127]. It contains 6,900 images (training set: 5,520, test set: 1,380). Each image is labeled as containing 0, 1, 2, 3, or 4+ salient objects.
• MSO [112] is a subset of SOS-test [112], covering 1,224 images. It has a more balanced distribution of the number of salient objects. Each object has a bounding-box annotation.
• ILSO [70] contains 1,000 images with precise instance-level annotations and coarse contour labeling.
• XPIE [130] has 10,000 images with pixel-wise labels. It has three subsets: Set-P has 625 images of places-of-interest with geographic information; Set-I has 8,799 images with object tags; and Set-E has 576 images with eye-fixation records.
• SOC [131] consists of 6,000 images covering 80 common categories. Half of the images contain salient objects, while the remaining have none. Each image containing salient objects is annotated with an instance-level ground-truth mask, object category, and challenging factors. The non-salient object subset has 783 texture images and 2,217 real-scene images.
• COCO-CapSal [100] is built from COCO [115] and SALICON [111]. Salient objects were first roughly localized using the mouse-click data in SALICON, then precisely annotated according to the instance masks in COCO. The dataset has 5,265 and 1,459 images for training and testing, respectively.
• HRSOD [73] is the first high-resolution dataset for SOD. It contains 1,610 training and 400 testing images collected from websites. Pixel-wise ground-truths are provided.

Dataset websites:
9. http://saliencydetection.net/duts/
10. http://cs-people.bu.edu/jmzhang/sos.html
11. http://cs-people.bu.edu/jmzhang/sos.html
12. http://www.sysu-hcp.net/instance-level-salient-object-segmentation/
13. http://cvteam.net/projects/CVPR17-ELE/ELE.html
14. http://mmcheng.net/SOCBenchmark/
15. https://github.com/yi94code/HRSOD
16. https://github.com/zhangludl/code-and-dataset-for-CapSal

3.5 Discussion
As shown in Table 4, early SOD datasets [30], [31], [128] are comprised of simple images with 1-2 salient objects per image, and only provide rough bounding-box annotations, which are insufficient for reliable evaluation [31], [136]. Performance on these datasets has become saturated. Modern datasets [27], [55], [56], [97], [107] are typically large-scale and offer precise pixel-wise ground-truths. The scenes are more complex and general, and usually contain multiple salient objects. Some special datasets contain challenging scenes with background only [112], [131], provide more fine-grained, instance-level SOD ground-truths [70], [131], or include other annotations such as image captions [100], inspiring new research directions and applications. Fig. 3 depicts the annotation distributions of 18 SOD datasets. Here are some essential conclusions: 1) Some datasets [30], [31], [97], [107] have significant center bias; 2) Datasets [27], [70], [100] have more balanced location distributions for salient objects; and 3) MSO [112] has less center bias, as only bounding-box annotations are provided. We analyze the generalizability and difficulty of several famous SOD datasets in depth in §5.6.

4 EVALUATION METRICS
This section reviews popular object-level SOD evaluation metrics, i.e., Precision-Recall (PR), F-measure [31], Mean Absolute Error (MAE) [33], weighted Fβ measure (Fbw) [137], Structural measure (S-measure) [138], and Enhanced-alignment measure (E-measure) [139].

4.1 Quick Overview
To better understand the characteristics of different metrics, a quick overview of the main conclusions of this section is provided as follows.
• PR, F-measure, MAE, and Fbw address pixel-wise errors, while S-measure and E-measure consider structure cues.
• Among pixel-level metrics, PR, F-measure, and Fbw fail to consider true negative pixels, while MAE can remedy this.
• Among structured metrics, S-measure is favored over E-measure, as SOD addresses continuous saliency estimates.
• Considering popularity, advantages, and completeness, F-measure, S-measure, and MAE are the most recommended and are thus used for our performance benchmarking in §5.2.

4.2 Metric Details
• PR is calculated based on the binarized salient object mask and the ground-truth:

  Precision = TP / (TP + FP),   Recall = TP / (TP + FN),   (1)

where TP, TN, FP, FN denote true-positives, true-negatives, false-positives, and false-negatives, respectively. A set of thresholds ([0, 255]) is applied to binarize the prediction. Each threshold produces a pair of precision/recall values, which together form a PR curve describing model performance.
• F-measure [31] comprehensively considers both precision and recall by computing the weighted harmonic mean:

  Fβ = ((1 + β²) · Precision × Recall) / (β² · Precision + Recall).   (2)
Empirically, β² is set to 0.3 [31] to put more emphasis on precision. Instead of plotting the whole F-measure curve, some methods only report the maximal Fβ, or binarize the predicted saliency map by an adaptive threshold, i.e., twice the mean value of the saliency prediction, and report the mean F.
• MAE [33] measures the average pixel-wise absolute error between the normalized saliency prediction map S ∈ [0, 1]^{W×H} and the binary ground-truth mask G ∈ {0, 1}^{W×H}:

  MAE = (1 / (W×H)) Σ_{i=1}^{W} Σ_{j=1}^{H} |G(i, j) − S(i, j)|.   (3)

• Fbw [137] intuitively generalizes the F-measure by altering the way precision and recall are calculated. It extends the four basic quantities TP, TN, FP, and FN to real values, and assigns different weights (ω) to errors at different locations, considering the neighborhood information:

  Fβ^ω = ((1 + β²) · Precision^ω × Recall^ω) / (β² · Precision^ω + Recall^ω).   (4)

• S-measure [138] evaluates the structural similarity between the real-valued saliency map and the binary ground-truth. It considers object-aware (S_o) and region-aware (S_r) structure similarities:

  S = α × S_o + (1 − α) × S_r,   (5)

where α is empirically set to 0.5.
• E-measure [139] considers global means of the image and local pixel matching simultaneously:

  Q_S = (1 / (W×H)) Σ_{i=1}^{W} Σ_{j=1}^{H} φ_S(i, j),   (6)

where φ_S is the enhanced alignment matrix, reflecting the correlation between S and G after subtracting their respective global means.

4.3 Discussion
These measures are typically based on pixel-wise errors while ignoring structural similarities, with S-measure and E-measure being the only exceptions. F-measure and E-measure are designed for assessing binarized saliency prediction maps, while PR, MAE, Fbw, and S-measure are for non-binary map evaluation.
Among pixel-level metrics, the PR curve is classic. However, precision and recall cannot fully assess the quality of saliency predictions, since high-precision predictions may only highlight a part of the salient objects, while high-recall predictions are typically meaningless if all pixels are predicted as salient. In general, a high-recall response may come at the expense of reduced precision, and vice versa. F-measure and Fbw are thus used to consider precision and recall simultaneously. However, overlap-based metrics (i.e., PR, F-measure, and Fbw) do not consider true negative saliency assignments, i.e., the pixels correctly marked as non-salient. Thus, these metrics favor methods that successfully assign high saliency to salient pixels but fail to detect non-salient regions [50]. MAE can remedy this, but it performs poorly when salient objects are small. For the structure-/image-level metrics, S-measure is more popular than E-measure, as SOD focuses on continuous predictions. Considering the popularity and characteristics of existing metrics and the completeness of evaluation, F-measure (maximal Fβ), S-measure, and MAE are our top recommendations.

5 BENCHMARKING AND EMPIRICAL ANALYSIS
This section provides empirical analyses to shed light on some key challenges in the field. Specifically, with our large-scale benchmarking (§5.2), we first conduct an attribute-based study to better understand the benefits and limitations of the current art (§5.3). Then, we study the robustness of SOD models against input perturbations, i.e., randomly exerted noise (§5.4) and manually designed adversarial samples (§5.5). Finally, we quantitatively assess the generalizability and difficulty of current mainstream SOD datasets (§5.6).

5.1 Quick Overview
For ease of understanding, we compile important observations and conclusions from the subsequent experiments below.
• Overall benchmarks (§5.2). As shown in Table 5, deep SOD models significantly outperform heuristic ones, and performance on some datasets [27], [55] has become saturated. [82], [93], [101], [102] are the current state of the art.
• Attribute-based analysis (§5.3). Results in Table 7 reveal that deep methods show significant advantages in detecting semantic-rich objects, such as animals. Both deep and non-deep methods face difficulties with small salient objects. Regarding application scenarios, indoor scenes pose great challenges, highlighting potential directions for future efforts.
• Robustness against random perturbations (§5.4). As shown in Table 9, surprisingly, deep methods are more sensitive than heuristic ones to random input perturbations. Both types of methods are relatively robust against rotation, while being fragile towards Gaussian blur and Gaussian noise.
• Adversarial attacks (§5.5). Table 10 suggests that adversarial attacks cause drastic degradation in the performance of deep SOD models, even worse than that caused by random perturbations. However, attacks rarely transfer between different SOD networks.
• Generalizability and difficulty of datasets (§5.6). Table 11 shows that DUTS-train [97] is a good choice for training deep SOD models as it has the best generalizability, while SOC [131], DUT-OMRON [56], and DUTS-test [97] are more suitable for evaluation due to their difficulty.
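Before turning to the benchmark in Table 5, here is a minimal NumPy sketch of two of the recommended metrics from §4.2 (MAE and maximal Fβ, built on the precision/recall pairs of the PR curve); thresholds are taken on a normalized [0, 1] grid and β² = 0.3 as in Eq. (2). S-measure is omitted for brevity, and this is a simplified re-implementation, not the authors' released evaluation tool.

```python
import numpy as np

def mae(sal, gt):
    """Eq. (3): mean absolute error between a saliency map and a binary mask."""
    return np.abs(gt.astype(np.float64) - sal.astype(np.float64)).mean()

def precision_recall_curve(sal, gt, num_thresholds=256):
    """Eq. (1): precision/recall pairs over binarization thresholds in [0, 1]."""
    gt = gt.astype(bool)
    precisions, recalls = [], []
    for t in np.linspace(0.0, 1.0, num_thresholds):
        pred = sal >= t
        tp = np.logical_and(pred, gt).sum()
        precisions.append(tp / max(pred.sum(), 1))
        recalls.append(tp / max(gt.sum(), 1))
    return np.array(precisions), np.array(recalls)

def max_f_measure(sal, gt, beta_sq=0.3):
    """Eq. (2): maximal F_beta over all thresholds, with beta^2 = 0.3."""
    p, r = precision_recall_curve(sal, gt)
    f = (1 + beta_sq) * p * r / np.maximum(beta_sq * p + r, 1e-8)
    return f.max()
```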
TABLE 5
Benchmarking results of 44 state-of-the-art deep SOD models and 3 top-performing classic SOD methods on 6 famous datasets (§5.2). Here
max F, S, and M indicate maximal Fβ , S-measure, and MAE, respectively. The three best scores are marked in red, blue, and green, respectively.
Dataset ECSSD [55] DUT-OMRON [56] PASCAL-S [108] HKU-IS [27] DUTS-test [97] SOD [129]
Metric max F↑ S↑ M↓ max F↑ S↑ M↓ max F↑ S↑ M↓ max F↑ S↑ M↓ max F↑ S↑ M↓ max F↑ S↑ M↓
2013-14
∗ HS [35] .673 .685 .228 .561 .633 .227 .569 .624 .262 .652 .674 .215 .504 .601 .243 .756 .711 .222
∗ DRFI [53] .751 .732 .170 .623 .696 .150 .639 .658 .207 .745 .740 .145 .600 .676 .155 .658 .619 .228
∗ wCtr [36] .684 .714 .165 .541 .653 .171 .599 .656 .196 .695 .729 .138 .522 .639 .176 .615 .638 .213
2015
MCDL [29] .816 .803 .101 .670 .752 .089 .706 .721 .143 .787 .786 .092 .634 .713 .105 .689 .651 .182
LEGS [28] .805 .786 .118 .631 .714 .133 ‡ ‡ ‡ .736 .742 .119 .612 .696 .137 .685 .658 .197
MDF [27] .797 .776 .105 .643 .721 .092 .704 .696 .142 .839 .810 .129 .657 .728 .114 .736 .674 .160
ELD [60] .849 .841 .078 .677 .751 .091 .782 .799 .111 .868 .868 .063 .697 .754 .092 .717 .705 .155
DHSNet [38] .893 .884 .060 ‡ ‡ ‡ .799 .810 .092 .875 .870 .053 .776 .818 .067 .790 .749 .129
2016
DCL [104] .882 .868 .075 .699 .771 .086 .787 .796 .113 .885 .877 .055 .742 .796 .149 .786 .747 .195
MAP [37] .556 .611 .213 .448 .598 .159 .521 .593 .207 .552 .624 .182 .453 .583 .181 .509 .557 .236
CRPSD [105] .915 .895 .048 - - - .864 .852 .064 .906 .885 .043 - - - - - -
RFCN [63] .875 .852 .107 .707 .764 .111 .800 .798 .132 .881 .859 .089 .755 .859 .090 .769 .794 .170
MSRNet [70] .900 .895 .054 .746 .808 .073 .828 .838 .081 ‡ ‡ ‡ .804 .839 .061 .802 .779 .113
DSS [39] .906 .882 .052 .737 .790 .063 .805 .798 .093 ‡ ‡ ‡ .796 .824 .057 .805 .751 .122
† WSS [97] .879 .811 .104 .725 .730 .110 .804 .744 .139 .878 .822 .079 .878 .822 .079 .807 .675 .170
DLS [65] .826 .806 .086 .644 .725 .090 .712 .723 .130 .807 .799 .069 - - - - - -
2017
NLDF [75] .889 .875 .063 .699 .770 .080 .795 .805 .098 .888 .879 .048 .777 .816 .065 .808 .889 .125
Amulet [76] .905 .894 .059 .715 .780 .098 .805 .818 .100 .887 .886 .051 .750 .804 .085 .773 .757 .142
FSN [72] .897 .884 .053 .736 .802 .066 .800 .804 .093 .884 .877 .044 .761 .808 .066 .781 .755 .127
SBF [83] .833 .832 .091 .649 .748 .110 .726 .758 .133 .821 .829 .078 .657 .743 .109 .740 .708 .159
SRM [71] .905 .895 .054 .725 .798 .069 .817 .834 .084 .893 .887 .046 .798 .836 .059 .792 .741 .128
UCF [66] .890 .883 .069 .698 .760 .120 .787 .805 .115 .874 .875 .062 .742 .782 .112 .763 .753 .165
RADF [78] .911 .894 .049 .761 .817 .055 .800 .802 .097 .902 .888 .039 .792 .826 .061 .804 .757 .126
BDMP [84] .917 .911 .045 .734 .809 .064 .830 .845 .074 .910 .907 .039 .827 .862 .049 .806 .786 .108
DGRL [85] .916 .906 .043 .741 .810 .063 .830 .839 .074 .902 .897 .037 .805 .842 .050 .802 .771 .105
PAGR [86] .904 .889 .061 .707 .775 .071 .814 .822 .089 .897 .887 .048 .817 .838 .056 .761 .716 .147
2018
RSDNet [79] .880 .788 .173 .715 .644 .178 ‡ ‡ ‡ .871 .787 .156 .798 .720 .161 .790 .668 .226
ASNet [87] .925 .915 .047 ‡ ‡ ‡ .848 .861 .070 .912 .906 .041 .806 .843 .061 .801 .762 .121
PiCANet [40] .929 .916 .035 .767 .825 .054 .838 .846 .064 .913 .905 .031 .840 .863 .040 .814 .776 .096
† C2S-Net [99] .902 .896 .053 .722 .799 .072 .827 .839 .081 .887 .889 .046 .784 .831 .062 .786 .760 .124
RAS [88] .908 .893 .056 .753 .814 .062 .800 .799 .101 .901 .887 .045 .807 .839 .059 .810 .764 .124
AFNet [89] .924 .913 .042 .759 .826 .057 .844 .849 .070 .910 .905 .036 .838 .867 .046 .809 .774 .111
BASNet [90] .931 .916 .037 .779 .836 .057 .835 .838 .076 .919 .909 .032 .838 .866 .048 .805 .769 .114
CapSal [100] .813 .826 .077 .535 .674 .101 .827 .837 .073 .842 .851 .057 .772 .818 .061 .669 .694 .148
CPD [80] .926 .918 .037 .753 .825 .056 .833 .848 .071 .911 .905 .034 .840 .869 .043 .814 .767 .112
MLSLNet [91] .917 .911 .045 .734 .809 .064 .835 .844 .074 .910 .907 .039 .828 .862 .049 .806 .786 .108
† MWS [81] .859 .827 .099 .676 .756 .108 .753 .768 .134 .835 .818 .086 .720 .759 .092 .772 .700 .170
PAGE-Net [92] .926 .910 .037 .760 .819 .059 .829 .835 .073 .910 .901 .031 .816 .848 .048 .795 .763 .108
2019
PS [94] .930 .918 .041 .789 .837 .061 .837 .850 .071 .913 .907 .038 .835 .865 .048 .824 .800 .103
PoolNet [93] .937 .926 .035 .762 .831 .054 .858 .865 .065 .923 .919 .030 .865 .886 .037 .831 .788 .106
BANet-R [101] .939 .924 .035 .782 .832 .059 .847 .852 .070 .923 .913 .032 .858 .879 .040 .842 .791 .106
EGNet-R [82] .936 .925 .037 .777 .841 .053 .841 .852 .074 .924 .918 .031 .866 .887 .039 .854 .802 .099
HRSOD-DH [73] .911 .888 .052 .692 .762 .065 .810 .817 .079 .890 .877 .042 .800 .824 .050 .735 .705 .139
JDFPR [95] .915 .907 .049 .755 .821 .057 .827 .841 .082 .905 .903 .039 .792 .836 .059 .792 .763 .123
SCRN [102] .937 .927 .037 .772 .836 .056 .856 .869 .063 .921 .916 .034 .864 .885 .040 .826 .787 .107
SSNet [103] .889 .867 .046 .708 .773 .056 .793 .807 .072 .876 .854 .041 .769 .784 .049 .713 .700 .118
TSPOANet [41] .919 .907 .047 .749 .818 .061 .830 .842 .078 .909 .902 .039 .828 .860 .049 .810 .772 .118
∗ Non-deep learning model. † Weakly-supervised model. Bounding-box output. ‡ Training on subset. - Results not available.
5.2 Performance Benchmarking
Table 5 shows the performance of 44 state-of-the-art deep SOD models and three top-performing classic methods (suggested by [44]) on the six most popular modern datasets. Performance is measured by three metrics, i.e., maximal Fβ, S-measure, and MAE, as recommended in §4.3. All the benchmarked models are representative, and have publicly available implementations or saliency prediction results. For performance benchmarking, we either use the saliency maps provided by the authors or run their official codes. It is worth mentioning that, for some methods, our benchmarking results are inconsistent with their reported scores. There are several reasons. First, our community long lacked an open, universally adopted evaluation tool, and many implementation factors influence the evaluation scores, such as input image resolution, threshold step, etc. Second, some methods [66], [69], [74], [76], [85], [100] use mean F-measure instead of maximal F-measure for performance evaluation. Third, for some methods [39], [76], the evaluation scores of the finally released saliency maps are inconsistent with the ones reported in the papers. We hope that our performance benchmarking, publicly released evaluation tools, and SOD maps can help our community build an open and standardized evaluation system and ensure consistency and procedural correctness for results and conclusions produced by different parties.
Not surprisingly, data-driven models greatly outperform conventional heuristic ones, due to their strong ability to learn visually salient patterns. In addition, performance has gradually increased since 2015, demonstrating the advancement of deep learning techniques. However, after 2018, the rate of improvement began decreasing, calling for more effective model designs and new machine learning technologies. We also find that performance tends to saturate on older SOD datasets such as ECSSD [55] and HKU-IS [27]. Hence, among the 44 famous deep SOD models, we would like to nominate PoolNet [93], BANet [101], EGNet [82], and SCRN [102] as the four state-of-the-art methods, which consistently show promising performance over diverse datasets.

5.3 Attribute-Based Study
Although the community has witnessed the great advances made by deep SOD models, it is still unclear under which specific aspects these models perform well. As there are numerous factors affecting the performance of an SOD algorithm, such as object/scene category, occlusion, etc., it is crucial to evaluate performance under different scenarios. This can help reveal the strengths and weaknesses of deep SOD models, identify pending challenges, and highlight future research directions towards more robust algorithms.

5.3.1 Hybrid Benchmark Dataset with Attribute Annotations
To enable a deeper analysis and understanding of the performance of an algorithm, it is essential to identify the [...]
Fig. 4. Sample images from the hybrid benchmark consisting of images randomly selected from 6 SOD datasets. Salient regions are uniformly
highlighted. Corresponding attributes are listed. See §5.3 for more detailed descriptions.
TABLE 7
Attribute-based study w.r.t. salient object categories, challenges and scene categories. (·) indicates the percentage of images with a specific
attribute. ND-avg indicates the average score of three heuristic models: HS [35], DRFI [53] and wCtr [36]. D-avg indicates the average score of
three deep learning models: DGRL [85], PAGR [86] and PiCANet [40]. Best in red, and worst with underline. See §5.3 for more details.
TABLE 8
Attribute statistics of top and bottom 100 images based on F-measure. (·) indicates the percentage of the images with a specific attribute. ND-avg
indicates the average results of three heuristic models: HS [35], DRFI [53] and wCtr [36]. D-avg indicates the average results of three deep
models: DGRL [85], PAGR [86] and PiCANet [40]. Two largest changes in red if positive, and blue if negative. See §5.3 for more details.
Fig. 5. Examples of saliency prediction under various input perturbations. The max F values are denoted in red. See §5.4 for more details.
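The perturbation types illustrated in Fig. 5 can be generated in a few lines. The sketch below (using OpenCV and NumPy) is an assumed re-implementation of the perturbations studied in §5.4, with parameter values taken from the figure labels (σ = 2/4, variance 0.01/0.08, ±15° rotation, grayscale); it is not the authors' exact preprocessing code.

```python
import cv2
import numpy as np

def perturb(image, kind, **kw):
    """Apply one of the random input perturbations studied in Sec. 5.4."""
    img = image.astype(np.float32) / 255.0
    if kind == 'gaussian_blur':                     # sigma = 2 or 4 in Fig. 5
        return cv2.GaussianBlur(img, (0, 0), kw.get('sigma', 2))
    if kind == 'gaussian_noise':                    # variance = 0.01 or 0.08
        noisy = img + np.random.normal(0.0, np.sqrt(kw.get('var', 0.01)), img.shape)
        return np.clip(noisy, 0.0, 1.0)
    if kind == 'rotation':                          # +/- 15 degrees
        h, w = img.shape[:2]
        M = cv2.getRotationMatrix2D((w / 2, h / 2), kw.get('angle', 15), 1.0)
        return cv2.warpAffine(img, M, (w, h))
    if kind == 'gray':                              # grayscale, replicated to 3 channels
        g = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        return cv2.merge([g, g, g])
    raise ValueError(f'unknown perturbation: {kind}')
```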
Deep neural networks are known to be vulnerable to adversarial examples: small, carefully crafted perturbations of the input can lead to completely different predictions [141]. Though intensively studied in classification tasks, adversarial attacks in SOD are rarely explored. Hackers may generate malicious adversarial perturbations to fool SOD modules and then cheat the surveillance systems. Besides, SOD has benefited many commercial projects such as photo editing [20] and image/video compression [145]. The adversarial attacks launched by hackers on the embedded SOD modules would inevitably affect the functioning of commercial products and impact users, causing losses for the developers and companies. Therefore, studying the robustness of SOD models is crucial for defending these applications against malicious attacks. In this section, we study the robustness against adversarial attacks.

Fig. 6. Examples of SOD prediction under adversarial perturbations of different target networks. The perturbations are magnified by 10 for better visualization. Red for max F. See §5.5 for details. (Rows include the image, the perturbation, and the resulting predictions; target networks: SRM, DGRL and PiCANet.)

Fig. 7. Network architecture of the SOD model used in cross-dataset generalization evaluation. See §5.6 for more detailed descriptions. (A 224×224 input is passed through a VGG16 encoder; conv3-out, conv4-out and conv5-out feed a decoder trained with a binary cross-entropy loss.)

TABLE 10
Results of transferring adversarial perturbations across SOD models (see §5.5.2). Each row gives the model on which the perturbation is computed (“None” = clean inputs); each column gives the attacked model.

Attack from      SRM [71]   DGRL [85]   PiCANet [40]
None             .817       .831        .848
SRM [71]         .263       .780        .842
DGRL [85]        .778       .248        .844
PiCANet [40]     .772       .799        .253
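To make the attack-and-transfer protocol behind Fig. 6 and Table 10 concrete, the sketch below crafts a perturbation on a source model with a generic PGD-style gradient attack (a stand-in for, not a reproduction of, the DAG attack used in the experiments) and evaluates it on a different target model.

```python
# Hedged sketch of the attack-and-transfer protocol behind Fig. 6 and Table 10:
# a perturbation is crafted on a source model with a generic PGD-style gradient
# attack (NOT the DAG attack actually used in the experiments) and then applied,
# unmodified, to a different target model. `source` and `target` are assumed to be
# SOD networks mapping a 1x3xHxW image to 1x1xHxW saliency logits; `gt` is the
# binary ground-truth map of the same spatial size.
import torch
import torch.nn.functional as F

def craft_perturbation(source, image, gt, steps=10, alpha=1/255, eps=8/255):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = F.binary_cross_entropy_with_logits(source(image + delta), gt)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascend the loss -> degrade the prediction
            delta.clamp_(-eps, eps)              # keep the perturbation visually small
        delta.grad.zero_()
    return delta.detach()

def transfer_attack(source, target, image, gt):
    delta = craft_perturbation(source, image, gt)
    with torch.no_grad():
        return torch.sigmoid(target(image + delta))  # prediction of the attacked target model
```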
5.5.2 Transferability Across Networks

Previous research has revealed that adversarial perturbations can be transferred across networks, i.e., adversarial examples targeting one model can mislead another without any modification [148]. This transferability is widely used for black-box attacks against real-world systems. To investigate the transferability of perturbations for deep SOD models, we use the adversarial perturbation computed on one SOD model to attack another.

Table 10 shows the experimental results for the three models under investigation (SRM [71], DGRL [85] and PiCANet [40]). While the DAG attack leads to severe performance drops for the targeted model (see the diagonal), it causes much less degradation to other models, i.e., the transferability between models of different network structures is weak for the SOD task, which is similar to the transferability observed for semantic segmentation, as analyzed in [146]. This may be because the gradient directions of different models are orthogonal to each other [149], so the gradient-based attack in the experiment transfers poorly to non-targeted models. However, adversarial images generated from an ensemble of multiple models might generate non-targeted adversarial instances with better transferability [149], which would be a great threat to deep SOD models.

TABLE 11
Results for cross-dataset generalization experiment. Max F↑ for saliency prediction when training on one dataset (rows) and testing on another (columns). “Self” refers to training and testing on the same dataset (same as diagonal). “Mean others” indicates average performance on all except self. See §5.6 for details.

Train on \ Test on   MSRA10K [107]  ECSSD [55]  DUT-OMRON [56]  HKU-IS [27]  DUTS [97]  SOC [131]  Self   Mean others  Percent drop↓
MSRA10K [107]        .875           .818        .660            .849         .671       .617       .875   .723         17%
ECSSD [55]           .844           .831        .630            .833         .646       .616       .831   .714         14%
DUT-OMRON [56]       .795           .752        .673            .779         .623       .567       .673   .703         -5%
HKU-IS [27]          .857           .838        .695            .880         .719       .639       .880   .750         15%
DUTS [97]            .857           .834        .647            .860         .665       .654       .665   .770         -16%
SOC [131]            .700           .670        .517            .666         .514       .593       .593   .613         -3%
Mean others          .821           .791        .637            .811         .640       .614       -      -            -

5.6 Cross-Dataset Generalization Evaluation

Datasets are responsible for much of the recent progress in SOD, not just as sources for training deep models, but also as means for measuring and comparing performance. Datasets are collected with the goal of representing the visual world, and to summarize the algorithm as a single number (i.e., benchmark score). A concern thus arises: it is necessary to evaluate how well a particular dataset represents the real world or, more specifically, to quantitatively measure the dataset’s generalization ability. Unfortunately, previous studies [44] are quite limited – mainly concerning the degrees of center bias in different SOD datasets. Here, we follow [150] to assess how general SOD datasets are. We study the generalization and difficulty of several mainstream SOD datasets by performing a cross-dataset analysis, i.e., training on one dataset, and testing on the others. We expect our experiments to stimulate discussion in the community regarding this essential but largely neglected issue.

We first train a typical SOD model on one dataset, and then explore how well it generalizes to a representative set of other datasets, compared with its performance on the “native” test set. Specifically, we implement the typical SOD model as a bottom-up/top-down structure, which has been the most standard and popular SOD architecture in recent years and is the basis of many current top-performing models [82], [93], [101], [102]. As shown in Fig. 7, the encoder part is borrowed from VGG16 [151], and the decoder consists of three convolutional layers that gradually refine the saliency prediction. We pick six representative datasets [27], [55], [56], [97], [107], [131]. For each dataset, we train the SOD model with 800 randomly selected training images and test it on 200 other validation images. Please note that a total of 1,000 is the maximum possible number of images considering the size of the smallest selected dataset, ECSSD [55].

Table 11 summarizes the results of cross-dataset generalization, measured by max F. Each column corresponds to the performance when training on all the datasets separately and testing on one. Each row indicates training on one dataset and testing on all of them. Since our training/testing protocol is different from the one used in the benchmarks mentioned in previous sections, the actual performance numbers are not meaningful. Rather, it is the relative performance difference that matters. Not surprisingly, we observe that the best results are achieved when training and testing on the same dataset. By looking at the numbers across each column, we can determine how easy a dataset is for models trained on the other datasets. By looking at the numbers across one row, we can determine how good a dataset is at generalizing to the others. We find that SOC [131] is the most difficult dataset (lowest column, Mean others 0.614). MSRA10K [107] appears to be the easiest one (highest column, Mean others 0.811), and generalizes the worst (highest row, Percent drop 17%). DUTS [97] is shown to have the best generalization ability (lowest row, Percent drop −16%).

Based on these analyses, we would make the following recommendations for SOD datasets: 1) For training deep models, DUTS [97] is a good choice because it has the best generalizability. 2) For testing, SOC [131] is good for assessing the worst-case performance, since it is the most challenging dataset. DUT-OMRON [56] and DUTS-test [97] deserve more consideration as they are also very difficult.
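A minimal sketch of the kind of encoder-decoder baseline described above (and in Fig. 7) is given below; the backbone and loss follow the text, while the channel widths and fusion details are illustrative assumptions rather than the exact model.

```python
# A rough sketch, under stated assumptions, of the bottom-up/top-down baseline of
# Fig. 7: a VGG16 [151] encoder whose conv3/conv4/conv5 outputs feed a small decoder
# of three convolutional layers, trained with binary cross-entropy on 224x224 inputs.
# Channel widths and the fusion scheme are illustrative guesses, not the exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class SimpleSODNet(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(pretrained=True).features  # ImageNet-pretrained backbone
        self.enc3, self.enc4, self.enc5 = vgg[:16], vgg[16:23], vgg[23:30]
        self.dec5 = nn.Conv2d(512, 256, 3, padding=1)
        self.dec4 = nn.Conv2d(512 + 256, 128, 3, padding=1)
        self.dec3 = nn.Conv2d(256 + 128, 1, 3, padding=1)          # saliency logits

    def forward(self, x):                                          # x: Bx3x224x224
        f3 = self.enc3(x)                                          # conv3-out: Bx256x56x56
        f4 = self.enc4(f3)                                         # conv4-out: Bx512x28x28
        f5 = self.enc5(f4)                                         # conv5-out: Bx512x14x14
        y = F.relu(self.dec5(f5))
        y = F.relu(self.dec4(torch.cat([F.interpolate(y, scale_factor=2), f4], dim=1)))
        y = self.dec3(torch.cat([F.interpolate(y, scale_factor=2), f3], dim=1))
        return F.interpolate(y, size=x.shape[2:])                  # train with nn.BCEWithLogitsLoss
```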
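The two summary statistics of Table 11 can be reproduced directly from the per-dataset scores, as in the following worked example for the DUTS row (the definitions are inferred from the table but match its reported values).

```python
# Worked example of the two summary statistics in Table 11, computed on the DUTS row
# under the (assumed, but consistent with the table) definitions:
#   Mean others  = average max F on all test sets except the training set itself
#   Percent drop = (Self - Mean others) / Self
duts_row = {'MSRA10K': .857, 'ECSSD': .834, 'DUT-OMRON': .647,
            'HKU-IS': .860, 'DUTS': .665, 'SOC': .654}
self_score = duts_row['DUTS']                                  # train and test on DUTS
others = [v for k, v in duts_row.items() if k != 'DUTS']
mean_others = sum(others) / len(others)
percent_drop = 100 * (self_score - mean_others) / self_score
print(round(mean_others, 3), round(percent_drop))              # prints: 0.77 -16
```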
6 More Discussions
Our previous systematic review and empirical studies char-
acterized the models (§2), datasets (§3), metrics (§4), and
challenges (§5) of deep SOD. Here we further posit active
research directions, and outline several open issues.
most important person in a crowded room. This is also evidenced by the experiments in §5.3, which show that deep models face great difficulties in complex (CS), indoor (Indoor) or multi-object (MO) scenes. In other words, deep SOD models, though good at semantic modeling, require higher-level image understanding. Exploring more powerful network designs that explicitly reason the relative saliency and revisiting classic cognitive theories are both promising directions to overcome this issue.

6.4 Linking SOD to Visual Fixations

The strong correlation between eye movements (implicit saliency) and explicit object saliency has been explored throughout history [44], [108], [161]–[163]. However, despite the deep connections between the problems of FP and SOD, the major computational models of the two tasks remain largely distinct; only a few SOD models consider both tasks simultaneously [72], [87], [96]. This is mainly due to the overemphasis on the specific setting of SOD and the design bias of current SOD datasets, which overlooks the connection to eye fixations during data annotation. As stated in [108], such dataset design bias not only creates a discomforting disconnection between FP and SOD, but also further misleads algorithm design. Exploring classic visual attention theories in SOD is a promising and crucial direction which could make SOD models more consistent with the visual processing of the human visual system and provide better explainability. In addition, the ultimate goal of visual saliency modeling is to understand the underlying rationale of the visual attention mechanism. However, with the current focus on exploring more powerful neural network architectures and beating the latest benchmark numbers on different datasets, have we perhaps lost sight of the original purpose? The solution to these problems requires dense collaborations between the FP and SOD communities.

6.5 Learning SOD in a Weakly-/Unsupervised Manner

Deep SOD methods are typically trained in a fully-supervised manner with a plethora of finely-annotated pixel-level ground-truths. However, it is highly costly and time-consuming to construct a large-scale, well-annotated SOD dataset. Though some efforts have been made to achieve SOD with limited supervision, i.e., by leveraging category-level labels [68], [69], [97] or pseudo pixel-wise annotations [67], [81], [83], [98], [99], there is still a notable gap with the fully-supervised counterparts. In contrast, humans usually learn with little or even no supervision. Since the ultimate goal of visual saliency modeling is to understand the visual attention mechanism, learning SOD in a weakly-/unsupervised manner would be of great value to both the research community and real-world applications. Further, it would also help us understand which factors truly drive our attention mechanism and saliency pattern understanding. Given the massive number of algorithmic breakthroughs over the past few years, we can expect a flurry of innovation towards this promising direction.

6.6 Pre-training with Self-Supervised Visual Features

Current deep SOD methods are typically built on ImageNet-pretrained networks, and fine-tuned on SOD datasets. It is believed that parameters trained on ImageNet can serve as a good starting point to accelerate the convergence of training and prevent overfitting on smaller-scale SOD datasets. Besides pre-training deep SOD models on the de facto dataset, ImageNet, another option is to leverage self-supervised learning techniques [164] to learn effective visual features from a vast amount of unlabeled images/videos. The visual features can be learned through various pretext tasks like image inpainting [165], colorization [166], clustering [167], etc., and can be generalized to other vision tasks. Fine-tuning SOD models from such self-supervised parameters is promising to yield better performance than the ImageNet initialization.

6.7 Efficient SOD for Real-World Application

Current top-leading deep SOD models are designed to be complicated in order to achieve increased learning capacity and improved performance. However, more ingenious and light-weight architectures are required to fulfill the requirements of mobile and embedded applications, such as robotics, autonomous driving, augmented reality, etc. The degradation of accuracy and generalization ability caused by model scale reduction should be minimal. To facilitate the application of SOD in real-world scenarios, it is possible to utilize model compression [168] or knowledge distillation [169], [170] techniques to develop compact and fast SOD models with competitive performance. Such compression techniques have already been shown effective in improving generalization ability and alleviating under-fitting when training efficient object detection models [171].

7 Conclusion

In this paper we present, to the best of our knowledge, the first comprehensive review of SOD focusing on deep learning techniques. We first provide novel taxonomies for categorizing deep SOD models from several distinct perspectives, including network architecture, level of supervision, etc. We then cover the contemporary literature on popular SOD datasets and evaluation criteria, providing a thorough performance benchmarking of major SOD methods and offering recommendations for several datasets and metrics that can be used to consistently assess different models. Next, we consider several previously under-explored issues related to benchmarking and baselines. In particular, we study the strengths and weaknesses of deep and non-deep SOD models by compiling and annotating a new dataset and evaluating several representative models on it, revealing promising directions for future efforts. We also study the robustness of SOD methods by analyzing the effects of various perturbations on the final performance. Moreover, for the first time in the field, we investigate the robustness of deep SOD models to maliciously designed adversarial perturbations and the transferability of these adversarial examples, providing baselines for future research. In addition, we analyze the generalization and difficulty of existing SOD datasets through a cross-dataset generalization study, and quantitatively reveal the dataset bias. We finally introduce several open issues and challenges of SOD in the deep learning era, providing insightful discussions and identifying a number of potentially fruitful directions forward.
In conclusion, SOD has achieved notable progress thanks to the striking development of deep learning techniques. However, there are still under-explored problems in achieving more efficient model designs, training, and inference for both academic research and real-world applications. We expect this survey to provide an effective way to understand the current state of the art and, more importantly, insight for the future exploration of SOD.

References

[1] J.-Y. Zhu, J. Wu, Y. Xu, E. Chang, and Z. Tu, “Unsupervised object class discovery via saliency-guided multiple class learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 4, pp. 862–875, 2015.
[2] F. Zhang, B. Du, and L. Zhang, “Saliency-guided unsupervised feature learning for scene classification,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 4, pp. 2175–2184, 2015.
[3] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. ACM Int. Conf. Mach. Learn., 2015, pp. 2048–2057.
[4] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt et al., “From captions to visual concepts and back,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1473–1482.
[5] A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra, “Human attention in visual question answering: Do humans and deep networks look at the same regions?” Computer Vision and Image Understanding, vol. 163, pp. 90–100, 2017.
[6] Z. Ren, S. Gao, L.-T. Chia, and I. W.-H. Tsang, “Region-based saliency detection and its application in object recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 5, pp. 769–779, 2014.
[7] D. Zhang, D. Meng, L. Zhao, and J. Han, “Bridging saliency detection to weakly supervised object detection based on self-paced curriculum learning,” in International Joint Conferences on Artificial Intelligence, 2016.
[8] W. Wang, J. Shen, R. Yang, and F. Porikli, “Saliency-aware video object segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 1, pp. 20–33, 2018.
[9] H. Song, W. Wang, S. Zhao, J. Shen, and K.-M. Lam, “Pyramid dilated deeper convlstm for video salient object detection,” in Proc. Eur. Conf. Comput. Vis., 2018.
[10] Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan, “Object region mining with adversarial erasing: A simple classification to semantic segmentation approach,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[11] X. Wang, S. You, X. Li, and H. Ma, “Weakly-supervised semantic segmentation by iteratively mining common object features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
[12] G. Sun, W. Wang, J. Dai, and L. Van Gool, “Mining cross-image semantics for weakly supervised semantic segmentation,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 347–365.
[13] R. Zhao, W. Ouyang, and X. Wang, “Unsupervised salience learning for person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3586–3593.
[14] S. Bi, G. Li, and Y. Yu, “Person re-identification using multiple experts with random subspaces,” Journal of Image and Graphics, vol. 2, no. 2, 2014.
[15] Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li, “A user attention model for video summarization,” in Proc. ACM Int. Conf. Multimedia, 2002, pp. 533–542.
[16] D. Simakov, Y. Caspi, E. Shechtman, and M. Irani, “Summarizing visual data using bidirectional similarity,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[17] J. Han, E. J. Pauwels, and P. De Zeeuw, “Fast saliency-aware multi-modality image fusion,” Neurocomputing, vol. 111, pp. 70–80, 2013.
[18] P. L. Rosin and Y.-K. Lai, “Artistic minimal rendering with lines and blocks,” Graphical Models, vol. 75, no. 4, pp. 208–229, 2013.
[19] W. Wang, J. Shen, Y. Yu, and K.-L. Ma, “Stereoscopic thumbnail creation via efficient stereo saliency detection,” IEEE Trans. Visualization and Comput. Graphics, vol. 23, no. 8, pp. 2014–2027, 2016.
[20] W. Wang, J. Shen, and H. Ling, “A deep network solution for attention and aesthetics aware photo cropping,” IEEE Trans. Pattern Anal. Mach. Intell., 2018.
[21] S. Avidan and A. Shamir, “Seam carving for content-aware image resizing,” in ACM Trans. Graph., vol. 26, no. 3, 2007, p. 10.
[22] Y. Sugano, Y. Matsushita, and Y. Sato, “Calibration-free gaze sensing using saliency maps,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 2667–2674.
[23] A. Borji and L. Itti, “Defending yarbus: Eye movements reveal observers’ task,” Journal of Vision, vol. 14, no. 3, pp. 29–29, 2014.
[24] A. Karpathy, S. Miller, and L. Fei-Fei, “Object discovery in 3d scenes via shape analysis,” in Proc. IEEE Conf. Robot. Autom., 2013, pp. 2088–2095.
[25] S. Frintrop, G. M. García, and A. B. Cremers, “A cognitive approach for object discovery,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2329–2334.
[26] A. M. Treisman and G. Gelade, “A feature-integration theory of attention,” Cognitive psychology, vol. 12, no. 1, pp. 97–136, 1980.
[27] G. Li and Y. Yu, “Visual saliency based on multiscale deep features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 5455–5463.
[28] L. Wang, H. Lu, X. Ruan, and M.-H. Yang, “Deep networks for saliency detection via local estimation and global search,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3183–3192.
[29] R. Zhao, W. Ouyang, H. Li, and X. Wang, “Saliency detection by multi-context deep learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1265–1274.
[30] T. Liu, J. Sun, N.-N. Zheng, X. Tang, and H.-Y. Shum, “Learning to detect a salient object,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007, pp. 1–8.
[31] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, “Frequency-tuned salient region detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 1597–1604.
[32] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, “Global contrast based salient region detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011.
[33] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 733–740.
[34] J. Wang, H. Jiang, Z. Yuan, M.-M. Cheng, X. Hu, and N. Zheng, “Salient object detection: A discriminative regional feature integration approach,” Int. J. Comput. Vis., vol. 123, no. 2, pp. 251–268, 2017.
[35] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 1155–1162.
[36] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization from robust background detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2814–2821.
[37] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mech, “Unconstrained salient object detection via proposal subset optimization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 5733–5742.
[38] N. Liu and J. Han, “DHSNet: Deep hierarchical saliency network for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 678–686.
[39] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr, “Deeply supervised salient object detection with short connections,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3203–3212.
[40] N. Liu, J. Han, and M.-H. Yang, “Picanet: Learning pixel-wise contextual attention for saliency detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3089–3098.
[41] Y. Liu, Q. Zhang, D. Zhang, and J. Han, “Employing deep part-object relationships for salient object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 1232–1241.
[42] Q. Qi, S. Zhao, J. Shen, and K.-M. Lam, “Multi-scale capsule attention-based salient object detection with multi-crossed layer connections,” in IEEE International Conference on Multimedia and Expo, 2019, pp. 1762–1767.
[43] A. Borji and L. Itti, “State-of-the-art in visual attention modeling,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 185–207, 2013.
[44] A. Borji, M.-M. Cheng, H. Jiang, and J. Li, “Salient object detection: A benchmark,” IEEE Trans. Image Process., vol. 24, no. 12, pp. 5706–5722, 2015.
[45] T. V. Nguyen, Q. Zhao, and S. Yan, “Attentive systems: A survey,” Int. J. Comput. Vis., vol. 126, no. 1, pp. 86–110, 2018.
[46] D. Zhang, H. Fu, J. Han, A. Borji, and X. Li, “A review of co-saliency detection algorithms: fundamentals, applications, and challenges,” ACM Trans. Intell. Syst. Technol., vol. 9, no. 4, p. 38, 2018.
[47] R. Cong, J. Lei, H. Fu, M.-M. Cheng, W. Lin, and Q. Huang, “Review of visual saliency detection with comprehensive information,” IEEE Trans. Circuits Syst. Video Technol., 2018.
[48] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu, “Advanced deep-learning techniques for salient and category-specific object detection: a survey,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 84–100, 2018.
[49] A. Borji, “Saliency prediction in the deep learning era: Successes and limitations,” IEEE Trans. Pattern Anal. Mach. Intell., 2019.
[50] A. Borji, M.-M. Cheng, Q. Hou, H. Jiang, and J. Li, “Salient object detection: A survey,” Computational Visual Media, pp. 1–34, 2019.
[51] C. Koch and S. Ullman, “Shifts in selective visual attention: Towards the underlying neural circuitry,” Human neurobiology, vol. 4, no. 4, p. 219, 1985.
[52] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, 1998.
[53] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, “Salient object detection: A discriminative regional feature integration approach,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 2083–2090.
[54] Y. Wei, F. Wen, W. Zhu, and J. Sun, “Geodesic saliency using background priors,” in Proc. Eur. Conf. Comput. Vis., 2012, pp. 29–42.
[55] J. Shi, Q. Yan, L. Xu, and J. Jia, “Hierarchical image saliency detection on extended cssd,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 4, pp. 717–729, 2015.
[56] C. Yang, L. Zhang, H. Lu, X. Ruan, and M. H. Yang, “Saliency detection via graph-based manifold ranking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3166–3173.
[57] W. Wang, J. Shen, L. Shao, and F. Porikli, “Correspondence driven saliency transfer,” IEEE Trans. Image Process., vol. 25, no. 11, pp. 5025–5034, 2016.
[58] F. Guo, W. Wang, J. Shen, L. Shao, J. Yang, D. Tao, and Y. Y. Tang, “Video saliency detection using object proposals,” IEEE Trans. Cybernetics, 2017.
[59] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” in Proc. Int. Conf. Artificial Neural Netw., 2011, pp. 44–51.
[60] G. Lee, Y.-W. Tai, and J. Kim, “Deep saliency with encoded low level distance map and high level features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 660–668.
[61] S. He, R. W. Lau, W. Liu, Z. Huang, and Q. Yang, “Supercnn: A superpixelwise convolutional neural network for salient object detection,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 330–344, 2015.
[62] J. Kim and V. Pavlovic, “A shape-based approach for salient object detection using deep learning,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 455–470.
[63] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan, “Saliency detection with recurrent fully convolutional networks,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 825–841.
[64] J. Kuen, Z. Wang, and G. Wang, “Recurrent attentional networks for saliency detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3668–3677.
[65] P. Hu, B. Shuai, J. Liu, and G. Wang, “Deep level sets for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 540–549.
[66] P. Zhang, D. Wang, H. Lu, H. Wang, and B. Yin, “Learning uncertain convolutional features for accurate saliency detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 212–221.
[67] J. Zhang, T. Zhang, Y. Dai, M. Harandi, and R. Hartley, “Deep unsupervised saliency detection: A multiple noisy labeling perspective,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9029–9038.
[68] C. Cao, Y. Huang, Z. Wang, L. Wang, N. Xu, and T. Tan, “Lateral inhibition-inspired convolutional neural network for visual attention and saliency detection,” in AAAI Conference on Artificial Intelligence, 2018.
[69] B. Li, Z. Sun, and Y. Guo, “Supervae: Superpixelwise variational autoencoder for salient object detection,” in AAAI Conference on Artificial Intelligence, 2019, pp. 8569–8576.
[70] G. Li, Y. Xie, L. Lin, and Y. Yu, “Instance-level salient object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 247–256.
[71] T. Wang, A. Borji, L. Zhang, P. Zhang, and H. Lu, “A stagewise refinement model for detecting salient objects in images,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 4039–4048.
[72] X. Chen, A. Zheng, J. Li, and F. Lu, “Look, perceive and segment: Finding the salient objects in images via two-stream fixation-semantic CNNs,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1050–1058.
[73] Y. Zeng, P. Zhang, J. Zhang, Z. Lin, and H. Lu, “Towards high-resolution salient object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 7234–7243.
[74] Y. Zhuge, Y. Zeng, and H. Lu, “Deep embedding features for salient object detection,” in AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 9340–9347.
[75] Z. Luo, A. Mishra, A. Achkar, J. Eichel, S. Li, and P.-M. Jodoin, “Non-local deep features for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6593–6601.
[76] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan, “Amulet: Aggregating multi-level convolutional features for salient object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 202–211.
[77] S. He, J. Jiao, X. Zhang, G. Han, and R. W. Lau, “Delving into salient object subitizing and detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1059–1067.
[78] X. Hu, L. Zhu, J. Qin, C.-W. Fu, and P.-A. Heng, “Recurrently aggregating deep features for salient object detection,” in AAAI Conference on Artificial Intelligence, 2018.
[79] M. Amirul Islam, M. Kalash, and N. D. B. Bruce, “Revisiting salient object detection: Simultaneous detection, ranking, and subitizing of multiple salient objects,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
[80] Z. Wu, L. Su, and Q. Huang, “Cascaded partial decoder for fast and accurate salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3907–3916.
[81] Y. Zeng, Y. Zhuge, H. Lu, L. Zhang, M. Qian, and Y. Yu, “Multi-source weak supervision for saliency detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6074–6083.
[82] J.-X. Zhao, J.-J. Liu, D.-P. Fan, Y. Cao, J. Yang, and M.-M. Cheng, “Egnet: Edge guidance network for salient object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 8779–8788.
[83] D. Zhang, J. Han, and Y. Zhang, “Supervision by fusion: Towards unsupervised learning of deep salient object detector,” in Proc. IEEE Int. Conf. Comput. Vis., 2017.
[84] L. Zhang, J. Dai, H. Lu, Y. He, and G. Wang, “A bi-directional message passing model for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1741–1750.
[85] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, and A. Borji, “Detect globally, refine locally: A novel approach to saliency detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3127–3135.
[86] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang, “Progressive attention guided recurrent network for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 714–722.
[87] W. Wang, J. Shen, X. Dong, and A. Borji, “Salient object detection driven by fixation prediction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1171–1720.
[88] S. Chen, X. Tan, B. Wang, and X. Hu, “Reverse attention for salient object detection,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 236–252.
[89] M. Feng, H. Lu, and E. Ding, “Attentive feedback network for boundary-aware salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1623–1632.
[90] X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand, “Basnet: Boundary-aware salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 7479–7489.
[91] R. Wu, M. Feng, W. Guan, D. Wang, H. Lu, and E. Ding, “A mutual learning method for salient object detection with intertwined multi-supervision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 8150–8159.
[92] W. Wang, S. Zhao, J. Shen, S. C. Hoi, and A. Borji, “Salient object detection with pyramid attention and salient edges,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1448–1457.
[93] J.-J. Liu, Q. Hou, M.-M. Cheng, J. Feng, and J. Jiang, “A simple pooling-based design for real-time salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3917–3926.
[94] W. Wang, J. Shen, M.-M. Cheng, and L. Shao, “An iterative and cooperative top-down and bottom-up inference network for
salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5968–5977.
[95] Y. Xu, D. Xu, X. Hong, W. Ouyang, R. Ji, M. Xu, and G. Zhao, “Structured modeling of joint deep feature and prediction refinement for salient object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 3789–3798.
[96] S. S. Kruthiventi, V. Gudisa, J. H. Dholakiya, and R. Venkatesh Babu, “Saliency unified: A deep architecture for simultaneous eye fixation prediction and salient object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 5781–5790.
[97] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan, “Learning to detect salient objects with image-level supervision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[98] G. Li, Y. Xie, and L. Lin, “Weakly supervised salient object detection using image labels,” in AAAI Conference on Artificial Intelligence, 2018.
[99] X. Li, F. Yang, H. Cheng, W. Liu, and D. Shen, “Contour knowledge transfer for salient object detection,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 370–385.
[100] L. Zhang, J. Zhang, Z. Lin, H. Lu, and Y. He, “Capsal: Leveraging captioning to boost semantics for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6024–6033.
[101] J. Su, J. Li, Y. Zhang, C. Xia, and Y. Tian, “Selectivity or invariance: Boundary-aware salient object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 3799–3808.
[102] Z. Wu, L. Su, and Q. Huang, “Stacked cross refinement network for edge-aware salient object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 7264–7273.
[103] Y. Zeng, Y. Zhuge, H. Lu, and L. Zhang, “Joint learning of saliency detection and weakly supervised semantic segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 7223–7233.
[104] G. Li and Y. Yu, “Deep contrast learning for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 478–487.
[105] Y. Tang and X. Wu, “Saliency detection via combining region-level and pixel-level predictions with cnns,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 809–825.
[106] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in Proc. Advances Neural Inf. Process. Syst., 2011, pp. 109–117.
[107] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.-M. Hu, “Global contrast based salient region detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, 2015.
[108] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, “The secrets of salient object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 280–287.
[109] R. Ju, Y. Liu, T. Ren, L. Ge, and G. Wu, “Depth-aware salient object detection using anisotropic center-surround difference,” Signal Processing: Image Communication, vol. 38, pp. 115–126, 2015.
[110] H. Peng, B. Li, W. Xiong, W. Hu, and R. Ji, “Rgbd salient object detection: a benchmark and algorithms,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 92–109.
[111] M. Jiang, S. Huang, J. Duan, and Q. Zhao, “SALICON: Saliency in context,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1072–1080.
[112] J. Zhang, S. Ma, M. Sameki, S. Sclaroff, M. Betke, Z. Lin, X. Shen, B. Price, and R. Mech, “Salient object subitizing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 4045–4054.
[113] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.
[114] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
[115] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[116] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Advances Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[117] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3431–3440.
[118] S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1395–1403.
[119] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
[120] J. Yang, B. Price, S. Cohen, H. Lee, and M.-H. Yang, “Object contour detection with a fully convolutional encoder-decoder network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 193–202.
[121] J. L. Long, N. Zhang, and T. Darrell, “Do convnets learn correspondence?” in Proc. Advances Neural Inf. Process. Syst., 2014, pp. 1601–1609.
[122] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2921–2929.
[123] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mech, “Minimum barrier salient object detection at 80 fps,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1404–1412.
[124] J. Zhang and S. Sclaroff, “Exploiting surroundedness for saliency detection: a boolean map approach,” IEEE Trans. Pattern Anal. Mach. Intell., no. 5, pp. 889–902, 2016.
[125] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 328–335.
[126] R. Caruana, “Multitask learning,” Machine learning, vol. 28, no. 1, pp. 41–75, 1997.
[127] E. L. Kaufman, M. W. Lord, T. W. Reese, and J. Volkmann, “The discrimination of visual number,” The American Journal of Psychology, vol. 62, no. 4, pp. 498–525, 1949.
[128] S. Alpert, M. Galun, R. Basri, and A. Brandt, “Image segmentation by probabilistic bottom-up aggregation and cue integration,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007, pp. 1–8.
[129] V. Movahedi and J. H. Elder, “Design and perceptual validation of performance measures for salient object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. - Workshops, 2010.
[130] C. Xia, J. Li, X. Chen, A. Zheng, and Y. Zhang, “What is and what is not a salient object? learning salient object detector by ensembling linear exemplar regressors,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4321–4329.
[131] D.-P. Fan, M.-M. Cheng, J.-J. Liu, S.-H. Gao, Q. Hou, and A. Borji, “Salient objects in clutter: Bringing salient object detection to the foreground,” in Proc. Eur. Conf. Comput. Vis., 2018.
[132] R. Fan, Q. Hou, M.-M. Cheng, G. Yu, R. R. Martin, and S.-M. Hu, “Associating inter-image salient instances for weakly supervised semantic segmentation,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 367–383.
[133] J. Zhao, J. Li, H. Liu, S. Yan, and J. Feng, “Fine-grained multi-human parsing,” Int. J. Comput. Vis., pp. 1–19, 2019.
[134] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. IEEE Int. Conf. Comput. Vis., vol. 2, 2001, pp. 416–423.
[135] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 3485–3492.
[136] Z. Wang and B. Li, “A two-stage approach to saliency detection in images,” in Proc. IEEE Conf. Acoust. Speech Signal Process., 2008, pp. 965–968.
[137] R. Margolin, L. Zelnik-Manor, and A. Tal, “How to evaluate foreground maps?” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 248–255.
[138] D.-P. Fan, M.-M. Cheng, Y. Liu, T. Li, and A. Borji, “Structure-measure: A new way to evaluate foreground maps,” in Proc. IEEE Int. Conf. Comput. Vis., 2017.
[139] D.-P. Fan, M.-M. Cheng, J.-J. Liu, S.-H. Gao, Q. Hou, and A. Borji, “Enhanced-alignment measure for binary foreground map evaluation,” in International Joint Conferences on Artificial Intelligence, 2018.
[140] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 724–732.
[141] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in Proc. Int. Conf. Learn. Representations, 2014.
[142] C. Li, R. Cong, J. Hou, S. Zhang, Y. Qian, and S. Kwong, “Nested network with two-stream pyramid for salient object detection in
optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 11, pp. 9156–9166, 2019.
[143] I. Mehmood, M. Sajjad, W. Ejaz, and S. W. Baik, “Saliency-directed prioritization of visual data in wireless surveillance networks,” Information Fusion, vol. 24, pp. 16–30, 2015.
[144] Z. Zhang, S. Fidler, and R. Urtasun, “Instance-level segmentation for autonomous driving with deep densely connected mrfs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 669–677.
[145] C. Guo and L. Zhang, “A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression,” IEEE Trans. Image Process., vol. 19, no. 1, pp. 185–198, 2009.
[146] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille, “Adversarial examples for semantic segmentation and object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1369–1378.
[147] A. Fawzi, S.-M. Moosavi-Dezfooli, and P. Frossard, “Robustness of classifiers: from adversarial to random noise,” in Proc. Advances Neural Inf. Process. Syst., 2016, pp. 1632–1640.
[148] N. Papernot, P. McDaniel, and I. Goodfellow, “Transferability in machine learning: from phenomena to black-box attacks using adversarial samples,” arXiv preprint arXiv:1605.07277, 2016.
[149] Y. Liu, X. Chen, C. Liu, and D. Song, “Delving into transferable adversarial examples and black-box attacks,” in Proc. Int. Conf. Learn. Representations, 2017.
[150] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 1521–1528.
[151] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Representations, 2015.
[152] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” in Proc. Int. Conf. Learn. Representations, 2017.
[153] M. Berman, A. Rannen Triki, and M. B. Blaschko, “The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4413–4421.
[154] T.-W. Ke, J.-J. Hwang, Z. Liu, and S. X. Yu, “Adaptive affinity fields for semantic segmentation,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 587–602.
[155] E. Bengio, P.-L. Bacon, J. Pineau, and D. Precup, “Conditional computation in neural networks for faster models,” in Proc. Int. Conf. Learn. Representations, 2016.
[156] A. Veit and S. Belongie, “Convolutional networks with adaptive inference graphs,” in Proc. Eur. Conf. Comput. Vis., 2018.
[157] A. Zlateski, R. Jaroensri, P. Sharma, and F. Durand, “On the importance of label quality for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
[158] M. Jiang, J. Xu, and Q. Zhao, “Saliency in crowd,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 17–32.
[159] Q. Zheng, J. Jiao, Y. Cao, and R. W. Lau, “Task-driven webpage saliency,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 287–302.
[160] A. Palazzi, F. Solera, S. Calderara, S. Alletto, and R. Cucchiara, “Learning where to attend like a human driver,” in IEEE Intelligent Vehicles Symposium, 2017, pp. 920–925.
[161] A. K. Mishra, Y. Aloimonos, L. F. Cheong, and A. Kassim, “Active visual segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 639–653, 2012.
[162] C. M. Masciocchi, S. Mihalas, D. Parkhurst, and E. Niebur, “Everyone knows what is interesting: Salient locations which should be fixated,” Journal of Vision, vol. 9, no. 11, pp. 25–25, 2009.
[163] A. Borji, “What is a salient object? A dataset and a baseline model for salient object detection,” IEEE Trans. Image Process., vol. 24, no. 2, pp. 742–756, 2015.
[164] L. Jing and Y. Tian, “Self-supervised visual feature learning with deep neural networks: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., 2020.
[165] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2536–2544.
[166] G. Larsson, M. Maire, and G. Shakhnarovich, “Colorization as a proxy task for visual understanding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6874–6883.
[167] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 132–149.
[168] C. Bucilua, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proceedings of SIGKDD international conference on Knowledge discovery and data mining, 2006, pp. 535–541.
[169] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in Proc. Advances Neural Inf. Process. Syst. - workshops, 2014.
[170] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
[171] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning efficient object detection models with knowledge distillation,” in Proc. Advances Neural Inf. Process. Syst., 2017, pp. 742–751.

Wenguan Wang received his Ph.D. degree from Beijing Institute of Technology in 2018. He is currently a postdoc scholar at ETH Zurich, Switzerland. From 2016 to 2018, he was a visiting Ph.D. student in University of California, Los Angeles. From 2018 to 2019, he was a senior scientist at Inception Institute of Artificial Intelligence, UAE. His current research interests include computer vision and deep learning.

Qiuxia Lai received the B.E. and M.S. degrees in the School of Automation from Huazhong University of Science and Technology in 2013 and 2016, respectively. She is currently pursuing the Ph.D. degree in The Chinese University of Hong Kong. Her research interests include image/video processing and deep learning.

Huazhu Fu (SM’18) received the Ph.D. degree from Tianjin University, China, in 2013. He was a Research Fellow with Nanyang Technological University, Singapore for two years. From 2015 to 2018, he was a Research Scientist with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. He is currently a Senior Scientist with Inception Institute of Artificial Intelligence, UAE. His research interests include computer vision and medical image analysis. He is an Associate Editor of IEEE TMI and IEEE Access.

Jianbing Shen (M’11-SM’12) is a Professor with the School of Computer Science, Beijing Institute of Technology. He has published about 100 journal and conference papers such as TPAMI, CVPR, and ICCV. He has obtained many honors including the Fok Ying Tung Education Foundation from Ministry of Education, the Program for Beijing Excellent Youth Talents from Beijing Municipal Education Commission, and the Program for New Century Excellent Talents from Ministry of Education. His research interests include computer vision and deep learning. He is an Associate Editor of IEEE TNNLS, IEEE TIP and Neurocomputing.

Haibin Ling received the PhD degree from University of Maryland in 2006. From 2000 to 2001, he was an assistant researcher at Microsoft Research Asia. From 2006 to 2007, he worked as a postdoc at University of California Los Angeles. After that, he joined Siemens Corporate Research as a research scientist. Since 2008, he has been with Temple University where he is now an Associate Professor. He received the Best Student Paper Award at the ACM UIST in 2003, and the NSF CAREER Award in 2014. He is an Associate Editor of IEEE TPAMI, PR, and CVIU, and served as Area Chairs for CVPR 2014, 2016 and 2019.

Ruigang Yang is currently a full professor of Computer Science at the University of Kentucky. His research interests span over computer vision and computer graphics, in particular in 3D reconstruction and 3D data analysis. He has received a number of awards, including the US National Science Foundation Faculty Early Career Development (CAREER) Program Award in 2004, and the best Demonstration Award at CVPR 2007. He is currently an associate editor of IEEE TPAMI.