
Received July 8, 2020, accepted July 23, 2020, date of publication August 7, 2020, date of current version August 20, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.3014886

Salient Object Detection With Importance Degree


YO UMEKI1, ISANA FUNAHASHI2, TAICHI YOSHIDA2, (Member, IEEE), AND MASAHIRO IWAHASHI1, (Senior Member, IEEE)
1Department of Electrical, Electronics, and Information Engineering, Nagaoka University of Technology, Nagaoka 940-2137, Japan
2Department of Communication Engineering and Informatics, The University of Electro-Communications, Chofu-shi 182-8585, Japan
Corresponding author: Yo Umeki ([email protected])
This work was supported by JSPS KAKENHI under Grant 16K18104.

ABSTRACT In this article, we introduce salient object detection with importance degree (SOD-ID), which is
a generalized technique for salient object detection (SOD), and propose an SOD-ID method. We define SOD-
ID as a technique that detects salient objects and estimates their importance degree values. Hence, it is more
effective for some image applications than SOD, which is shown via examples. The definition, evaluation
procedure, and data collection for SOD-ID are introduced and discussed, and we propose its evaluation
metric and data preparation, whose validity is discussed with the simulation results. Moreover, we propose
an SOD-ID method, which consists of three technical blocks: instance segmentation, saliency detection, and
importance degree estimation. The saliency detection block is proposed based on a convolutional neural
network using the results of the instance segmentation block. The importance degree estimation block
is achieved using the results of the other blocks. The proposed method accurately suppresses inaccurate
saliencies and estimates the importance degree for multi-object images. In the simulations, the proposed method outperformed state-of-the-art methods with respect to the F-measure for SOD, and with respect to Spearman's and Kendall rank correlation coefficients and the proposed metric for SOD-ID.

INDEX TERMS Saliency detection, salient object detection, instance segmentation, convolutional neural
network (CNN), rank correlation metric.

I. INTRODUCTION
Saliency detection (SD) is an image processing technique that estimates salient local regions in images [1]–[7]. Salient regions are generally defined as areas that attract human attention with respect to characteristics such as high contrast, unique orientation, and distinctive color. Detecting these regions is important for image applications, such as human eye fixation estimation and context-aware image coding. Recently, several methods have been proposed for salient object detection (SOD), which is similar to SD [8]–[27]. Instead of estimating local regions, SOD identifies characteristic objects, such as a tall man, a red car, or signs. Some image processing applications require not only salient information but also important object locations [28]–[31]. For example, image retargeting uses the object locations and resizes images while retaining their shapes. Thus, SOD has been shown to be more useful than SD for some applications.

Moreover, Islam et al. proposed an expansion of SOD [25], which is called RSOD in this article, and studies have shown that it has high potential for image applications. SOD classifies estimated objects as salient or non-salient, whereas RSOD estimates salient object contours and their importance scores. Importance scores are useful for several applications, which we show in Fig. 1, where (a) is an input image, (b) and (c) are its ideal saliency maps in SOD and RSOD, and (d) and (e) are the retargeting results for (a) using (b) and (c) according to [30], respectively. In (b), the white and black areas represent salient and non-salient regions, respectively, whereas in (c), the white, gray, and black areas represent first salient, second salient, and non-salient regions, respectively. In (d), a part of the dog, which seems to be the most important object in (a), is cropped because the chair and dog are given the same importance value by SOD shown in (b). By contrast, because of the different scores in (c), the dog is completely preserved in (e). Fig. 1 shows one advantage of the expansion, and we experimentally understand that it has high potential not only for image retargeting but also, for example, for content-aware image coding and image representation.

The associate editor coordinating the review of this manuscript and approving it for publication was Ikramullah Lali.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 8, 2020 147059
Y. Umeki et al.: SOD-ID

However, the discussion of RSOD was not sufficient in [25] to introduce it as a new theme of computer vision. The authors presented the expansion with little detail as a supplement to the main topic. The details of its definition were omitted, and its significance was not discussed. As the evaluation metric, Spearman's rank correlation coefficient [32] was used simply, but, unfortunately, its validity was not discussed. This inadequate discussion is a problem for tackling the new theme.

In this article, we call the technique, which is denoted by RSOD, SOD with importance degree (SOD-ID); refine it to introduce a new theme via discussing its definition, significance, and assessment; and propose an SOD-ID method that outperforms state-of-the-art methods. First, we discuss and construct the definition and significance of SOD-ID using several pieces of evidence and application examples. We define the importance score as represented in N degrees, and refer to this as the importance degree. Based on this discussion, we present the evaluation procedure and the dataset preparation of SOD-ID. We also propose an assessment index based on the squared error and the Kendall rank correlation coefficient [33], and create a dataset based on those of SD and instance segmentation. Finally, we propose an SOD-ID method based on deep learning and instance segmentation. Our contributions are summarized as follows:
• We introduce and define a new theme, SOD-ID, via a discussion and examples.
• We introduce the importance degree using N, which is the generalized importance degree of SOD, and show its efficacy and advantages.
• We introduce valid dataset preparation and an effectual evaluation procedure for SOD-ID to evaluate methods without actually applying them to image applications, which contributes to the development of this theme.
• We propose an SOD-ID method via combining instance segmentation and SD based on deep learning as a separable system, which is also useful for SOD.

In simulations, the proposed method perceptually demonstrated N-degree salient objects, and objectively outperformed RSOD [25]. The proposed method accurately detected salient objects and estimated their values of importance degree. The proposed method was objectively compared with RSOD for Spearman's rank correlation coefficient [32], the Kendall rank correlation coefficient [33], and the proposed metric, and obtained better scores. Moreover, in the evaluation procedure of SOD, the results of the proposed method were objectively comparable with state-of-the-art SOD methods; therefore, we demonstrated that it is as effective as SOD.

The remainder of this article is organized as follows: In Section II, we provide an overview of existing methods of SD, SOD, fixation estimation, semantic segmentation, and instance segmentation. In Section III, we briefly present the fundamentals of SOD and RSOD. In Section IV, we discuss the definition, significance, evaluation metric, and dataset of SOD-ID. In Section V, we explain the proposed index, dataset, and method. Finally, we present experimental comparisons in Section VI, and conclude this article in Section VII.

FIGURE 1. Retargeting simulation based on SOD and SOD-ID.

II. RELATED WORKS
In this section, we explore existing methods of SD, SOD, semantic segmentation, and instance segmentation. First, we describe SD and human eye fixation. Second, we explain the two types of SOD and RSOD. Finally, we discuss the difference between semantic segmentation and instance segmentation, and review their recent methods.

SD is similar to human eye fixation; that is, they estimate regions of interest that correspond to human attention [1]–[7], [34], [35]. The traditional SD method uses characteristic features, such as high contrast, unique orientation, and distinctive color [1]. Harel et al. proposed a method that uses a graph-based algorithm, calculates activation maps based on several features, and combines them to generate one saliency map [2]. Recently, methods based on a convolutional neural network (CNN) have been proposed, and effectively extract global and complex features as a result of training using a large number of images and their corresponding gaze information [6], [7]. Although they accurately estimate human interest, they cannot estimate object contours.

SOD simultaneously estimates object regions and whether they are salient [8]–[24], [26], [27]. Traditional SOD methods use the propagation algorithm [11], [22], [36]. They iteratively propagate salient and background information based on color similarities between neighboring pixels and the Markov absorption probability. However, they often produce inaccurate results along object boundaries. Recent methods


based on fully convolutional network (FCN) architectures have successfully reduced inaccurate detection. Liu and Han proposed a deep hierarchical saliency network that realizes coarse-to-detailed estimation for salient objects [21]. Another method adopts a recurrent network to consider the connection of salient pixels [16].

Although a major SOD dataset contains the importance degree for objects [10], existing methods produce binary results; that is, they classify detected objects into salient or non-salient. The PASCAL-S dataset provides integer saliency values in [0, 255] with object contours. However, SOD methods disregard the priority of each object, and instead focus on estimating the contours of salient objects. Because detecting salient objects and their correct contours is a challenging task, researchers generally propose the estimation of the priority of each detected object as future work.

Semantic segmentation is a technique that identifies categories to which pixels belong, such as human, tree, and car [37]–[39]. Traditional semantic segmentation uses contour detection and the histogram of oriented gradients feature [37]. Recently, the FCN, which is a breakthrough approach for semantic segmentation, has been used to successfully detect image regions [38]. However, semantic segmentation methods cannot separate objects that belong to the same category.

Instance segmentation is derived from semantic segmentation and can identify not only object classes but also their instances [40], [41]. A basic instance segmentation method uses an FCN to detect small windows that each include one object [41]. Another method uses the recurrent architecture to iteratively detect object regions based on previous detection results [40]. Although instance segmentation and SOD similarly detect object contours, instance segmentation disregards their importance; therefore, the purposes of the approaches have been shown to be different.

III. FUNDAMENTALS OF SOD
A. PASCAL-S DATASET
The PASCAL-S dataset contains images, their fixation data, and their SOD maps with multiple values that can be used as ground truth (GT) for SOD-ID [10]. It contains 850 natural images whose full segmentation masks are provided in [42]. The fixation data were obtained by applying an eye-tracker to eight subjects that were instructed to perform a free-viewing task for images. In the SOD experiment, 12 subjects were given images and asked to highlight salient objects by clicking on them. The pixels of the SOD maps have integer values in [0, 12], and they are linearly normalized in [0, 255] for the png format. Therefore, we believe that PASCAL-S is an SOD-ID dataset with 13 degrees.

B. FCN METHODS
The FCN was introduced for image classification based on Visual Geometry Group (VGG) networks in [43], and then several FCN architectures were proposed for several applications [6], [34], [35], [38]. The VGG architecture consists of five blocks that each have two or three convolutional layers and a pooling layer. The FCN architecture is constructed by replacing the last layer of the VGG architecture with a one-channel convolutional layer. Some methods that apply merge and convolution layers to the FCN obtain superior results to past methods because the layers realize both shallow and deep convolutions; thereby, they can capture both global and local features [6], [38].

C. LOCATION-BIASED DETECTION
In SD and SOD, the location assumption is generally used as prior information [7], [11], [13], [15], [24], [36]. Photographers generally center interesting objects in images, and thus natural images often present salient areas at their center. To exploit this tendency, some SOD methods apply higher weights to salient pixels closer to the center of images [13], [24]. Following this strategy, in an SD method, a location-biased convolution layer was introduced in the FCN, which obtained superior results [7].

D. RSOD
The CNN model detects the contours of salient objects and estimates their multiple saliency values because of its architecture [25]. The architecture recursively calculates saliency maps from coarse to fine levels, and finally fuses the resultant saliency maps. The calculation units are learned using the multi-stage GT of the saliency maps that is generated from PASCAL-S by thresholding its saliency maps at various values. Therefore, the fused maps have various pixel values that reflect saliency levels from coarse to fine.

As an additional process, the method estimates the importance score for each salient object from the output saliency map [25]. In basic terms, the score value is calculated by averaging the saliency values of pixels within the object as

Rank(S(X)) = ( Σ_{i∈𝒳} χ_i ) / N_X,  (1)

where S, X, 𝒳, χ_i, and N_X denote a predicted saliency map, a candidate salient object, the set of indices of pixels that belong to X, the saliency value of the i-th pixel, and the total number of pixels in X, respectively. It is unknown whether the calculated values are normalized because this is not clearly described in [25]. Note that the authors used the GT segmentation masks in PASCAL-S in this process.

In experiments, the method simply uses conventional methods for evaluation. Spearman's rank correlation coefficient [32] is used as the evaluation metric, and the resultant scores are linearly normalized in [0, 1]. PASCAL-S without images used in training is directly used for testing the method.

IV. DISCUSSION ON SOD-ID
A. DEFINITION OF SOD-ID
We define SOD-ID as a technique that detects the contours of salient objects and estimates their importance degree. Its methods produce a saliency map whose pixel values represent the importance degree scores of objects to which they belong.
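To make this definition concrete, an N-degree map can be built from per-object masks and degrees as follows. This is a minimal sketch with hypothetical objects, not the authors' implementation; the [0, 255] normalization mirrors the png convention described for PASCAL-S above.

```python
def sodid_map(shape, objects, n_degrees):
    """Build an N-degree SOD-ID map.

    objects: list of (pixel_index_set, degree) pairs, where degree is an
    integer in [0, n_degrees - 1] and 0 means non-salient.
    Pixels outside every object keep degree 0 (non-salient background).
    """
    h, w = shape
    degree_map = [[0] * w for _ in range(h)]
    for pixels, degree in objects:
        assert 0 <= degree < n_degrees
        for (r, c) in pixels:
            degree_map[r][c] = degree
    return degree_map

def to_png_levels(degree_map, n_degrees):
    """Linearly normalize integer degrees in [0, N - 1] to [0, 255]."""
    scale = 255 / (n_degrees - 1)
    return [[round(v * scale) for v in row] for row in degree_map]

# Two hypothetical objects in a 4x4 image with N = 7 degrees.
dog = {(0, 0), (0, 1), (1, 0)}   # most important object, degree 6
chair = {(2, 2), (2, 3)}         # less important object, degree 3
m = sodid_map((4, 4), [(dog, 6), (chair, 3)], n_degrees=7)
png = to_png_levels(m, 7)
```

With N = 2 this reduces to an ordinary binary SOD map, which matches the statement that SOD is SOD-ID with N = 2.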


SOD-ID is mostly similar to SOD, but in contrast to the binary maps of SOD, its GT saliency maps have several values for N-degree objects, as shown in Fig. 2. N-degree means that the maps have integer values in [0, N − 1], where, clearly, zero indicates that the pixel of the map belongs to a non-salient object, and the N-degree map is linearly normalized according to the coding format. As mentioned in Section III-A, PASCAL-S seems to be an SOD-ID dataset with N = 13 according to experiments. Moreover, note that SOD-ID is a generalized version of SOD; that is, SOD is SOD-ID with N = 2.

FIGURE 2. (a) Input image and (b) GT map of SOD-ID.

In this article, N = 7 is empirically used based on the characteristics of natural images. Table 1 shows the distribution of natural images in PASCAL-S [10] with respect to the number of salient objects within them, where the first, second, and last rows denote the number of salient objects, the number of images that include salient objects of the corresponding number in the first row, and the distribution, respectively, and ''7+'' in the eighth column indicates seven or more salient objects. From Table 1, natural images typically contain six or fewer salient objects. They rarely contain seven or more salient objects, but in most cases, some objects in one image have the same saliency levels. Therefore, because 7 degrees adequately realizes SOD-ID for natural images, N = 7 is generally valid. Clearly, the value of N can be fixed flexibly for various image applications.

TABLE 1. Number and percentage of images in the PASCAL-S dataset [10] with respect to the number of salient objects.

B. SIGNIFICANCE OF SOD-ID
SOD-ID is a generalization of SOD and more suitable for image applications than SOD. People ordinarily rank objects in an image with respect to their interests. Similarly, it has been observed in experiments that subjects sometimes recognize salient objects as non-salient because of the objects' locations. SOD-ID estimates general results of this ranking, and therefore saliency information produced by SOD-ID is more related to human behavior than SOD. Moreover, by thresholding with various parameters as post-processing, SOD-ID produces various saliency maps of SOD. SOD-ID, which is used as pre-processing, results in a variety of saliency information useful to image applications, such as retargeting, content-aware coding, and summarizing.

For instance, SOD-ID is clearly more suitable than SOD for image retargeting from our experiments. Similar to Fig. 1, Fig. 3 shows retargeting results according to [30] for a multi-object image. The input image in Fig. 3 (a) represents ''dogs pull a sled and a human rides'' and therefore its important words are ''dog,'' ''pull,'' ''sled,'' ''human,'' and ''ride.'' Image retargeting should retain the important words and sentences of input images in its results. In that sense, Fig. 3 (d) shows a failure because the dog is not clearly visible and hence, unfortunately, it represents the wrong sentence, ''something pulls a sled and a human rides.'' By contrast, Fig. 3 (e), the retargeting result for SOD-ID, accurately represents the original sentence, ''dogs pull a sled and a human rides.'' For other images and retargeting methods, the results sometimes demonstrate the superiority of SOD-ID for image retargeting, as shown in Figs. 1 and 3.

FIGURE 3. Retargeting simulation for multi-object images based on SOD and SOD-ID.

C. SUPERVISED EVALUATION METRIC
The supervised evaluation metric of SOD-ID should measure the degree of similarity with respect to segmentation and the importance degree. Because SOD-ID methods aim to detect the contours of salient objects, they should be evaluated in the same manner as segmentation. Additionally, they should be evaluated by calculating the correlation and similarity of the importance degree scores. An object that has higher scores than another object in the GT should have higher scores in the results of SOD-ID methods, and smaller is better in terms of the difference between the score values of the GT and the results of SOD-ID methods. Unfortunately, because conventional rank correlation coefficients evaluate the correlation but ignore the similarity of scores, Spearman's rank correlation coefficient, which is used in [25], is unsuitable for calculating the importance degree.
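The shortcoming of rank-only evaluation can be verified numerically. The following sketch is our illustration, with hypothetical score vectors: it implements Spearman's coefficient for tie-free vectors and shows that the coefficient returns a perfect score even when the predicted values are far from the GT, as long as the ranking matches.

```python
def rank(v):
    # Rank positions (1 = smallest); assumes no ties for this illustration.
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def spearman(a, b):
    # Spearman's rho via the classic no-ties formula.
    n = len(a)
    ra, rb = rank(a), rank(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

gt = [1.0, 0.66, 0.33]       # hypothetical GT importance scores
close = [0.95, 0.70, 0.30]   # close values, same ranking
far = [0.40, 0.35, 0.30]     # very different values, same ranking

print(spearman(gt, close))   # 1.0
print(spearman(gt, far))     # 1.0 -- the value error is invisible to rho
```

Both pairs score a perfect 1.0, which is exactly the insensitivity to score similarity discussed here.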


In this article, we propose an evaluation metric for the importance degree of SOD-ID. As the evaluation metric for segmentation, conventional methods, for example, the F-measure, can be used. An evaluation metric for SOD-ID is defined as a linear combination of the F-measure and the proposed metric, or the parallel use of them. The proposed metric F is defined by simply combining metrics for the correlation and score similarity as

F(v_p, v_t) = αR(v_p, v_t) + (1 − α)I(v_p, v_t),  (2)

where R, I, α, v_p, and v_t denote the correlation and similarity metrics, a balancing free parameter, and vectors for which each element is the score value of each object, respectively. We use the Kendall rank correlation coefficient as R [33] because it straightforwardly evaluates the correlation and therefore is more suitable than Spearman's rank correlation coefficient. For I, we use the squared error and define it as

I(v_p, v_t) = (1/N) Σ_{i=1}^{N} exp( −(v_pi − v_ti)² / (2σ²) ),  (3)

where N, v_pi, and v_ti denote the number of objects, and the i-th elements of v_p and v_t, respectively, and σ is a free parameter that controls the variance of the Gaussian distribution. R, which outputs real values in [−1, 1], is linearly normalized in [0, 1]; I has real values in [0, 1] because of (3); and α is restricted in [0, 1]; hence, F outputs real values in [0, 1]. The metric proposition requires much experimental evidence, but because of the limited space in this article, the validity of F is briefly shown in Section VI and a detailed discussion on this topic remains as future work.

D. DATASET PREPARATION
To create SOD-ID datasets, the procedure of PASCAL-S mentioned in Section III-A is suitable. The segmentation masks are simply obtained manually, and the importance degree is determined as follows: By the strict rules, the subjects of experiments are asked to collect and rank interesting objects in one image. The strict procedure requires several subjects, but unfortunately, it is a difficult task for them. By contrast, the procedure of PASCAL-S only asks subjects to collect interesting objects. For an object, the number of subjects that recognize it as salient is directly determined as its value of the importance degree, and to create a GT map of SOD-ID, pixels within each salient object have their scores uniformly based on the segmentation mask. If M subjects are applied, the resultant map has M degrees. This is simple and useful, but a large number of subjects are required to create general datasets.

To avoid experiments using subjects, we introduce a preparation procedure for the SOD-ID dataset based on existing SD data. As mentioned above, subjective experiments have the troublesome characteristic of requiring many people and large costs. To avoid this, we use existing SD data to produce the SOD-ID maps. The proposed procedure calculates the sum of pixel values within objects in the GT maps of SD, and the resultant values are considered as their scores of the importance degree, which is defined in one image as

Deg_i = ( Σ_{j∈Ω_i} s_j ) / max_i { Σ_{j∈Ω_i} s_j },  (4)

where Deg_i, s_j, and Ω_i denote the score of the i-th object, the j-th pixel value of the SD map, and the set of indices of pixels within the i-th object, respectively. To produce the SOD-ID map, pixel values within the i-th object are uniformly set as Deg_i, and the resultant map is linearly quantized using N. Because the GT maps of SD represent the degree of saliency for each pixel, the summation values within an object are approximately recognized as the degree of interest for the object. Similarly, a pixel value within an object in the GT maps of SD is approximately considered as the number of subjects that recognize the object and categorize it as salient, and therefore, in the case of a large number of subjects, the summation procedure is recognized as the same as that of PASCAL-S for SOD mentioned in Section III-A. Based on the above assumptions, we believe that the proposed procedure is valid for creating SOD-ID datasets.

We experimentally show that the proposed procedure mentioned above has high validity compared with the RSOD procedure mentioned in Section III-D [25]. Using these procedures, SOD-ID maps are produced using the full segmentation masks and fixation data of PASCAL-S. Table 2 shows this comparison, where ''Sum.'' and ''Ave.'' denote the results of the proposed and RSOD procedures; that is, they show values of the evaluation metrics between the SOD maps of PASCAL-S and their resultant maps, respectively. For simplicity, we use Spearman's and Kendall rank correlation coefficients as the metrics [32], [33]. From Table 2, the proposed procedure is clearly better than the RSOD procedure, and thus our opinions mentioned above have been shown to be valid.

TABLE 2. Scores for the estimation methods of the importance degree for the PASCAL-S dataset [10].

V. PROPOSED SOD-ID METHOD
A. OVERVIEW
The proposed SOD-ID method is briefly shown in Fig. 4. The system consists of three technical blocks: instance segmentation, SD, and importance degree estimation. First, instance segmentation is applied to an input image to detect object contours; an arbitrary method can be used here, such as that in [40], [41], [47], [48]. Second, the salient regions of the input image are detected by the proposed CNN method using the object contours detected in the first block. Finally, using the results of the first and second blocks, the proposed method outputs an SOD-ID map with N degrees through the estimation block of the importance degree. The technical blocks can be independently developed, and therefore the system provides suitable expandability and serves as a fundamental design of SOD-ID methods.
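The three-block flow can be sketched as follows. Here `segment_instances` and `detect_saliency` are hypothetical placeholders for any instance segmentation method and for the proposed CNN; the third block follows the summation-and-quantization rule of (4) in Section IV-D, and the rounding and the reserved background level are our simplifications.

```python
def estimate_importance(masks, saliency, n_degrees):
    """Third block: per-object importance degrees from a saliency map.

    masks: list of pixel-index sets, one per detected object.
    saliency: mapping pixel -> saliency value in [0, 1].
    Returns one integer degree in [1, n_degrees - 1] per object
    (0 is reserved for the non-salient background).
    """
    sums = [sum(saliency[p] for p in m) for m in masks]
    top = max(sums)
    scores = [s / top for s in sums]          # Deg_i in (4)
    # Linear quantization of (0, 1] scores onto the N - 1 salient levels.
    return [max(1, round(s * (n_degrees - 1))) for s in scores]

def sodid_pipeline(image, segment_instances, detect_saliency, n_degrees=7):
    """Three-block system: instance segmentation -> SD -> importance degrees."""
    masks = segment_instances(image)          # block 1 (any method)
    saliency = detect_saliency(image, masks)  # block 2 (the CNN in the paper)
    degrees = estimate_importance(masks, saliency, n_degrees)
    return list(zip(masks, degrees))
```

Because the blocks communicate only through masks and a saliency map, each one can be swapped independently, which is the expandability claimed for this design.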


FIGURE 4. Overview of the proposed method.

FIGURE 5. Architecture of the proposed CNN method.

B. PROPOSED CNN METHOD FOR SD
In this section, we explain the proposed CNN method for SD in the second block, which uses the detected contours of the first block. The architecture uses the contours as a part of the input and extracts their multi-resolution features to estimate the saliency values. The loss function imposes different weights for object and background regions based on the contours. Note that the proposed CNN method considers location bias similar to conventional SD and SOD methods.

1) ARCHITECTURE
Fig. 5 and Table 3 show the architecture of the proposed CNN method and its parameters, respectively. Figs. 5 (a)–(c) correspond to Tables 3 (a)–(c), respectively. In Table 3, ''Conv.'', ''Pool.'', and ''p∗'' indicate convolution layers, max pooling layers, and the pyramid pooling module, respectively. The rectified linear unit [49] is used as the activation function in the convolution layers. A VGG-based method is used to extract image features in Fig. 5 (a). The results of the first block and the features after Pool.3, Pool.4, and Conv.5-3 are merged along the channel direction, and the merged signals are input into Conv. 6-1. The signals after Conv. 6-2 are transformed using the pyramid pooling module proposed in [39], and the resultant signals are resized to the same size as the signals after Conv. 6-2. Finally, the resized signals and those after Conv. 6-2 are merged along the channel direction, and processed through Conv. 7-1, 2, and 3.

2) LOSS FUNCTION
The loss function of the proposed CNN method assigns high and medium weights to salient and object regions, respectively, and by contrast, low weights to background regions because they are generally uninteresting.
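The channel-direction merging and the pyramid-pooling resize described in the architecture above can be sketched as shape bookkeeping. This is a simplified illustration only: the channel widths, bin sizes, and nearest-neighbor resizing are our assumptions, not the layer parameters of Table 3.

```python
import numpy as np

def merge_channels(*features):
    """Channel-direction merge: concatenate (C, H, W) feature maps along C."""
    return np.concatenate(features, axis=0)

def pyramid_pool(x, bins=(1, 2, 4)):
    """Toy pyramid pooling: average-pool a (C, H, W) map onto several grids,
    then resize each grid back to H x W by nearest-neighbor repetition."""
    c, h, w = x.shape
    outs = []
    for b in bins:
        pooled = x.reshape(c, b, h // b, b, w // b).mean(axis=(2, 4))
        outs.append(pooled.repeat(h // b, axis=1).repeat(w // b, axis=2))
    return np.concatenate(outs, axis=0)

contours = np.zeros((1, 8, 8))   # block-1 result fed in as an extra channel
feat = np.ones((4, 8, 8))        # hypothetical convolutional features
merged = merge_channels(contours, feat)   # shape (5, 8, 8)
pyr = pyramid_pool(feat)                  # shape (12, 8, 8)
```

The merged tensor simply stacks the contour channel on top of the feature channels, which is all that ''merged along the channel direction'' requires of the downstream convolutions.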

TABLE 3. Construction details of the proposed CNN architecture.

The loss function L is formulated as

L(w) = (1/N) Σ_{i=1}^{N} ( φ(x_i)/max φ(x_i) − y_i )² / ( β − O_i + y_i ),  (5)

where y_i, x_i, O_i, φ(·), and β denote true saliencies, estimated saliencies, object region masks, a normalization function, and a free parameter, respectively. The masks are produced by binarizing signals of the instance segmentation results. φ(·) normalizes the estimated saliency values in [0, 1]. We generally set β to 2 or a value that is the maximum of O_i + y_i. If the i-th pixel is in a salient object, β − O_i + y_i is a low value, and hence this pixel is assigned a high weight.

3) TRAINING
For training, the loss function in Section V-B2 and the training datasets of COCO and SALICON were used [45], [46]. COCO contains natural images and their segmentation masks, and SALICON has saliency maps that correspond to them. The maps were binarized using a threshold value of τ = 0.15, and their elements corresponding to background pixels, which were detected by the masks, were set to zero. Stochastic gradient descent was used for optimizing, where the Nesterov momentum, the weight decay, and the learning rate were set to 0.9, 0.5, and 10⁻³, respectively [50]. β in the loss function was set to 2.3, which was experimentally determined from the ratios of salient, object, and background regions.

C. ESTIMATION OF THE IMPORTANCE DEGREE
In the proposed method, the estimation block process is defined similarly to the proposed procedure in Section IV-D. Object contours are already detected in the first block and their saliency values are estimated in the second block. In the third block, the values within one object contour are summed and the result is its score of the importance degree as given in (4). Similar to the proposed procedure, SOD-ID maps are created based on the resultant scores and linearly quantized with N.

VI. SIMULATION
In this section, we compare the performance of the proposed method and state-of-the-art methods for SOD and SOD-ID. We present the comparisons in Sections VI-B and VI-C, respectively, and before that, we discuss the validity of the proposed metric in Section VI-A by presenting some examples. For this simulation, we used the instance segmentation method proposed in [40] in the first block of the proposed method because it is not recent but has high accuracy. Based on Section IV-D, we introduced a dataset from the test sets of COCO and SALICON, which contain images with segmentation masks and their SD maps, respectively; the proposed dataset is called the SALICON-based dataset in this section. Note that the proposed method is also represented by Prop. in this section.

A. VALIDITY OF THE PROPOSED METRIC
As mentioned in Section IV-C, the validity of the proposed metric is briefly shown in this section. Table 4 shows scores of pairs of arbitrary vectors in Spearman's and Kendall rank correlation coefficients, and the proposed metric. In Table 4, the pairs from the top to the bottom, respectively, indicate various scenarios as follows: same rank and slightly different value, slightly different rank and value, same rank and quite different value, and quite different rank and slightly different value. As mentioned in Section IV-C, SOD-ID metrics have to simultaneously evaluate the rank correlation and the value similarity. In that sense, from the first and third pairs, only the proposed metric satisfies the above property. We observed from the second and fourth pairs that the Kendall coefficient is too sensitive to the rank difference to be used as the SOD-ID metric. The fourth pair shows that the rank correlation is quite different, but the values are almost the same and hence the importance of the objects is also considered to be comparable. However, the score obtained using Spearman's coefficient is rather bad, and its weighting of the rank correlation and the value similarity has been shown to be unbalanced. The proposed metric is clearly more suitable to be used as the SOD-ID metric than the two coefficients.

TABLE 4. Correlation scores of pairs of arbitrary vectors.


Y. Umeki et al.: SOD-ID

TABLE 5. F-measure scores of the SOD methods for the DUTS dataset [44].

TABLE 6. F-measure scores of the SOD methods for the PASCAL-S dataset [10].

TABLE 7. F-measure scores of the SOD methods for the SALICON-based dataset [45], [46].

TABLE 8. Scores for the estimation of the importance degree for the TABLE 9. Scores for the estimation of the importance degree for the
PASCAL-S dataset [10]. SALICON-based dataset [45], [46].

metric is clearly more suitable to be used as the SOD-ID


metric than the two coefficients. the above, the results of HDCT, RFCN, DHS and DSSOD
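The behavior of the two rank-correlation baselines discussed above can be illustrated in a few lines of code. The sketch below is illustrative only: the example vectors are hypothetical stand-ins for the Table 4 pairs, and the simplified implementations of Spearman's ρ [32] and Kendall's τ [33] assume tie-free inputs.

```python
# Illustrative sketch of the two rank-correlation baselines discussed above.
# The example vectors are hypothetical stand-ins for the Table 4 pairs;
# these simplified implementations assume tie-free inputs.

def ranks(x):
    # Rank positions 1..n of each element (no tie handling).
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    # Spearman's rho from squared rank differences.
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def kendall(x, y):
    # Kendall's tau: concordant minus discordant pairs, normalized.
    n = len(x)
    s = sum(1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1
            for i in range(n) for j in range(i + 1, n))
    return 2.0 * s / (n * (n - 1))

# Same ranks, slightly different values: the coefficient saturates at 1,
# ignoring how close the values themselves are.
print(spearman([0.9, 0.6, 0.3], [0.8, 0.5, 0.2]))  # 1.0
# Quite different ranks, almost identical values: the score drops sharply.
print(spearman([0.50, 0.51, 0.52], [0.52, 0.50, 0.51]))  # -0.5
```

Both coefficients report perfect correlation whenever the ranks agree, regardless of how far apart the values are, and collapse when the ranks disagree even if the values are nearly identical — which is why a value-similarity term is needed in an SOD-ID metric.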
B. COMPARISON OF THE PROPOSED METHOD WITH SOD METHODS
1) SETTINGS
HDCT [15], RFCN [16], DHS [21], DSSOD [24], and RSOD [25] were used as SOD methods for comparison. The methods were applied to the test sets of the DUTS, PASCAL-S, and SALICON-based datasets [10], [44]–[46], and the results were evaluated using the F-measure [51]. To calculate the F-measure, the saliency maps of the PASCAL-S and SALICON-based datasets and the results of the methods must be binarized. Because we set N = 7 in this article and the maps of PASCAL-S have integer values in [0, 255] with N = 13, objects whose importance degree scores were one or more were recognized as salient for the SALICON-based dataset, and the maps of PASCAL-S were therefore binarized with a threshold value of 36. According to the above, the results of HDCT, RFCN, DHS, and DSSOD were binarized with a threshold value of 0.14, and for RSOD and Prop., the value was 1.

2) EVALUATION
Tables 5, 6, and 7 show the F-measure scores of the methods for each dataset, and Figs. 6 and 7 show the images, their GT maps, and their results before thresholding, where ''Average'' denotes the average values over all the images in each dataset. The images in the PASCAL-S and SALICON-based datasets generally contain multiple objects; by contrast, those in the DUTS dataset generally contain one large object or two objects. From Table 5, unfortunately, Prop. had worse scores for DUTS. However, Prop. outperformed the other methods in Tables 6 and 7, and we observed in Figs. 6 and 7 that Prop. accurately estimated object contours. Particularly, Prop. suppressed the inaccurate estimation in ''Parking'' and ''Party'' in Fig. 7. Unfortunately, the instance segmentation method often detected nothing for DUTS because of the above characteristic, as shown in the upper half of Table 5. However, the results of Prop., except in that case, were equivalent to those of the other methods. Prop. can solve this problem by using an efficient instance segmentation method that accurately detects objects.

FIGURE 6. Resultant saliency maps for the PASCAL-S dataset [10].

FIGURE 7. Resultant saliency maps for the SALICON-based dataset [45], [46].
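The binarize-then-score protocol described above can be sketched as follows. This is a simplified illustration, not the paper's evaluation code: the saliency values are toy data, and beta2 = 1.0 (the standard F1 score) is an assumption — many SOD evaluations weight precision more heavily (e.g., beta2 = 0.3).

```python
# Simplified sketch of the binarize-then-F-measure evaluation described
# above. The toy values are hypothetical; beta2 = 1.0 gives the standard
# F1 score, while SOD papers often weight precision more (e.g., 0.3).

def f_measure(pred, gt, threshold, beta2=1.0):
    # Binarize the predicted saliency map at `threshold`, then compare
    # with the binary ground-truth map `gt`.
    pred_bin = [1 if p >= threshold else 0 for p in pred]
    tp = sum(1 for p, g in zip(pred_bin, gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred_bin, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred_bin, gt) if p == 0 and g == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

# A continuous saliency map binarized at 0.14, as for the CNN-based methods.
score = f_measure([0.90, 0.05, 0.80, 0.10], [1, 0, 1, 1], threshold=0.14)
print(score)  # 0.8
```

The same routine covers all compared methods by swapping the threshold (0.14 for the continuous-valued outputs, 1 for RSOD and Prop., 36 for the PASCAL-S ground-truth maps).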

C. COMPARISON OF THE PROPOSED METHOD WITH SOD-ID METHOD
1) SETTINGS
In SOD-ID, Prop. was compared with RSOD, which is the only existing SOD-ID method. The methods were applied to the PASCAL-S and SALICON-based datasets, and the results were evaluated using Spearman's and Kendall rank correlation coefficients and the proposed metric (2), where α and σ were experimentally set to 0.5 and 2.0, respectively. Clearly, the GT and resultant maps were uniformly normalized with N = 7.
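The uniform normalization with N = 7 mentioned above can be sketched as below; the exact quantization rule used in the paper is an assumption here.

```python
# Sketch of uniformly normalizing a map in [0, 1] onto N discrete levels
# (N = 7 above). The exact rounding rule used in the paper is an assumption.

def normalize_levels(values, n=7):
    # Map each value in [0, 1] to an integer level in 0..n-1.
    return [min(int(v * n), n - 1) for v in values]

print(normalize_levels([0.0, 0.5, 1.0]))  # [0, 3, 6]
```

Applying the same quantization to both the GT and the resultant maps keeps the rank-correlation and value-similarity comparisons on a common scale.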


2) EVALUATION
Tables 8 and 9 show the scores of RSOD and Prop. in the metrics for each dataset, and Figs. 6 and 7 show the images and GT maps in the datasets, and their resultant maps, where high values of pixels in the maps indicate high scores of the importance degree. Note that the rows in Tables 8 and 9 correspond to those in Figs. 6 and 7, respectively. From Tables 8 and 9, Prop. clearly outperformed RSOD in terms of the metrics. From Figs. 6 and 7, Prop. accurately estimated the importance degree of objects. Particularly, in ''Party,'' ''Woman,'' and ''Man,'' Prop. estimated the importance degree of small objects that had low saliency scores and were located in highly salient objects.

VII. CONCLUSION
In this article, we introduced SOD-ID via discussing its definition, significance, dataset condition, and evaluation metric property, and proposed its dataset, metric, and method. The proposed metric consists of the Kendall rank correlation coefficient and the mean squared error, and simultaneously evaluates the rank correlation and the value similarity for SOD-ID. The proposed dataset is generated using the proposed procedure based on the COCO and SALICON datasets. The proposed method of SOD-ID consists of three processing blocks: instance segmentation, SD, and importance degree estimation. We proposed a CNN-based SD method for the second block that uses the results of the first block. With this strategy, the proposed method objectively outperformed state-of-the-art methods with respect to SOD and achieved an accurate SOD-ID.

ACKNOWLEDGMENT
We thank Irina Entin, M.Eng., and Maxine Garcia, Ph.D., from Edanz Group (www.edanzediting.com/ac) for editing a draft of this manuscript.

REFERENCES
[1] L. Itti, C. Koch, and E. Niebur, ''A model of saliency-based visual attention for rapid scene analysis,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, 1998.
[2] J. Harel, C. Koch, and P. Perona, ''Graph-based visual saliency,'' in Proc. Neural Inf. Process. Syst., 2006, pp. 545–552.
[3] X. Hou and L. Zhang, ''Saliency detection: A spectral residual approach,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2007, pp. 1–8.
[4] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, ''Frequency-tuned salient region detection,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1597–1604.
[5] J. Zhang and S. Sclaroff, ''Exploiting surroundedness for saliency detection: A Boolean map approach,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 5, pp. 889–902, May 2016.
[6] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, ''A deep multi-level network for saliency prediction,'' in Proc. 23rd Int. Conf. Pattern Recognit. (ICPR), Dec. 2016, pp. 3488–3493.
[7] S. S. S. Kruthiventi, K. Ayush, and R. V. Babu, ''DeepFix: A fully convolutional neural network for predicting human eye fixations,'' IEEE Trans. Image Process., vol. 26, no. 9, pp. 4446–4456, Sep. 2017.
[8] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, ''Saliency filters: Contrast based filtering for salient region detection,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 733–740.
[9] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, ''Salient object detection: A discriminative regional feature integration approach,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2083–2090.
[10] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, ''The secrets of salient object segmentation,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 280–287.
[11] J. Sun, H. Lu, and X. Liu, ''Saliency region detection based on Markov absorption probabilities,'' IEEE Trans. Image Process., vol. 24, no. 5, pp. 1639–1649, May 2015.
[12] R. Zhao, W. Ouyang, H. Li, and X. Wang, ''Saliency detection by multi-context deep learning,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1265–1274.
[13] N. Tong, H. Lu, X. Ruan, and M.-H. Yang, ''Salient object detection via bootstrap learning,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1884–1892.
[14] Y. Qin, H. Lu, Y. Xu, and H. Wang, ''Saliency detection via cellular automata,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 110–119.
[15] J. Kim, D. Han, Y.-W. Tai, and J. Kim, ''Salient region detection via high-dimensional color transform and local spatial support,'' IEEE Trans. Image Process., vol. 25, no. 1, pp. 9–23, Jan. 2016.
[16] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan, ''Saliency detection with recurrent fully convolutional networks,'' in Proc. Eur. Conf. Comput. Vis., Springer, 2016, pp. 825–841.
[17] T. Wang, L. Zhang, H. Lu, C. Sun, and J. Qi, ''Kernelized subspace ranking for saliency detection,'' in Proc. Eur. Conf. Comput. Vis., 2016, pp. 450–466.
[18] L. Zhang, C. Yang, H. Lu, R. Xiang, and M.-H. Yang, ''Ranking saliency,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 9, pp. 1892–1904, Sep. 2017.
[19] C. Sheth and R. V. Babu, ''Object saliency using a background prior,'' in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2016, pp. 1931–1935.
[20] J. Yang and M.-H. Yang, ''Top-down visual saliency via joint CRF and dictionary learning,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 3, pp. 576–588, Mar. 2017.
[21] N. Liu and J. Han, ''DHSNet: Deep hierarchical saliency network for salient object detection,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 678–686.
[22] L. Zhang, J. Ai, B. Jiang, H. Lu, and X. Li, ''Saliency detection via absorbing Markov chain with learnt transition probability,'' IEEE Trans. Image Process., vol. 27, no. 2, pp. 987–998, Feb. 2018.
[23] G. Li, Y. Xie, L. Lin, and Y. Yu, ''Instance-level salient object segmentation,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2386–2395.
[24] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr, ''Deeply supervised salient object detection with short connections,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3203–3212.
[25] M. A. Islam, M. Kalash, and N. D. B. Bruce, ''Revisiting salient object detection: Simultaneous detection, ranking, and subitizing of multiple salient objects,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7142–7150.
[26] C. Aytekin, A. Iosifidis, and M. Gabbouj, ''Probabilistic saliency estimation,'' Pattern Recognit., vol. 74, pp. 359–372, Feb. 2018.
[27] R. Fan, M.-M. Cheng, Q. Hou, T.-J. Mu, J. Wang, and S.-M. Hu, ''S4Net: Single stage salient-instance segmentation,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 6103–6112.
[28] L. Marchesotti, C. Cifarelli, and G. Csurka, ''A framework for visual saliency detection with applications to image thumbnailing,'' in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep. 2009, pp. 2232–2239.
[29] M. Rubinstein, D. Gutierrez, O. Sorkine, and A. Shamir, ''A comparative study of image retargeting,'' ACM Trans. Graph., vol. 29, no. 6, pp. 160–169, 2010.
[30] A. Mansfield, P. Gehler, L. V. Gool, and C. Rother, ''Scene carving: Scene consistent image retargeting,'' in Proc. Eur. Conf. Comput. Vis., Springer, 2010, pp. 143–156.
[31] A. Jose and I. Heisterklaus, ''Bag of Fisher vectors representation of images by saliency-based spatial partitioning,'' in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 1762–1766.
[32] C. Spearman, ''The proof and measurement of association between two things,'' Tech. Rep., 1961.
[33] M. G. Kendall, ''A new measure of rank correlation,'' Biometrika, vol. 30, nos. 1–2, pp. 81–93, Jun. 1938.
[34] N. Imamoglu, C. Zhang, W. Shmoda, Y. Fang, and B. Shi, ''Saliency detection by forward and backward cues in deep-CNN,'' in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 430–434.
[35] R. Monroy, S. Lutz, T. Chalasani, and A. Smolic, ''SalNet360: Saliency maps for omni-directional images with CNN,'' Signal Process., Image Commun., vol. 69, pp. 26–34, Nov. 2018.
[36] H. Li, H. Lu, Z. Lin, X. Shen, and B. Price, ''Inner and inter label propagation: Salient object detection in the wild,'' IEEE Trans. Image Process., vol. 24, no. 10, pp. 3176–3186, Oct. 2015.


[37] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik, ''Semantic contours from inverse detectors,'' in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 991–998.
[38] J. Long, E. Shelhamer, and T. Darrell, ''Fully convolutional networks for semantic segmentation,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[39] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, ''Pyramid scene parsing network,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2881–2890.
[40] B. Romera-Paredes and P. H. S. Torr, ''Recurrent instance segmentation,'' in Proc. Eur. Conf. Comput. Vis., Springer, 2016, pp. 312–329.
[41] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, ''Fully convolutional instance-aware semantic segmentation,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2359–2367.
[42] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, ''The role of context for object detection and semantic segmentation in the wild,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 891–898.
[43] K. Simonyan and A. Zisserman, ''Very deep convolutional networks for large-scale image recognition,'' in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–14.
[44] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan, ''Learning to detect salient objects with image-level supervision,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 136–145.
[45] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, ''Microsoft COCO: Common objects in context,'' in Proc. Eur. Conf. Comput. Vis., Springer, 2014, pp. 740–755.
[46] M. Jiang, S. Huang, J. Duan, and Q. Zhao, ''SALICON: Saliency in context,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1072–1080.
[47] K. Li, B. Hariharan, and J. Malik, ''Iterative instance segmentation,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 3659–3667.
[48] A. Arnab and P. H. S. Torr, ''Pixelwise instance segmentation with a dynamically instantiated network,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 441–450.
[49] V. Nair and G. E. Hinton, ''Rectified linear units improve restricted Boltzmann machines,'' in Proc. Int. Conf. Mach. Learn., 2010, pp. 807–814.
[50] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[51] R. Szeliski, Computer Vision: Algorithms and Applications. London, U.K.: Springer, 2010.

YO UMEKI received the B.Eng. and M.Eng. degrees from the Nagaoka University of Technology, Nagaoka, Japan, in 2015 and 2019, respectively. He is currently pursuing the Ph.D. degree with the Department of Information Science and Control Engineering. His main research interest includes saliency detection.

ISANA FUNAHASHI received the B.Eng. and M.Eng. degrees from the Nagaoka University of Technology, Nagaoka, Japan, in 2017 and 2019, respectively. He is currently pursuing the Ph.D. degree with the Department of Computer and Network Engineering, The University of Electro-Communications, Tokyo, Japan. His research interests include image processing and computer vision.

TAICHI YOSHIDA (Member, IEEE) received the B.Eng., M.Eng., and Ph.D. degrees in engineering from Keio University, Yokohama, Japan, in 2006, 2008, and 2013, respectively. In 2014, he joined the Nagaoka University of Technology. In 2018, he joined the University of Electro-Communications, where he is currently an Assistant Professor with the Department of Communication Engineering and Informatics. His research interests include filter bank design and image coding applications.

MASAHIRO IWAHASHI (Senior Member, IEEE) received the B.Eng., M.Eng., and D.Eng. degrees in electrical engineering from Tokyo Metropolitan University, in 1988, 1990, and 1996, respectively. In 1990, he joined Nippon Steel Company Ltd. From 1991 to 1992, he was seconded to Graphics Communication Technology Company Ltd. In 1993, he joined the Nagaoka University of Technology, where he is currently a Professor with the Department of Electrical Engineering, Faculty of Technology. From 1995 to 2001, he was also a Lecturer with the Nagaoka Technical College. From 1998 to 2001, he relocated to Thammasat University, Thailand, and the Electronic Engineering Polytechnic Institute of Surabaya, Indonesia, as a JICA Expert. His research interests include digital signal processing, multi-rate systems, and image compression. He served as an Editorial Committee Member of the IEICE Transactions on Fundamentals of Electronics, Communications, and Computer Sciences, from 2007 to 2011.
