


Lightweight Salient Object Detection via Hierarchical Visual Perception Learning

Yun Liu, Yu-Chao Gu, Xin-Yu Zhang, Weiwei Wang, Ming-Ming Cheng

This work was supported in part by the Major Project for New Generation of AI under Grant 2018AAA0100400; in part by NSFC under Grant 61922046; in part by the National Youth Talent Support Program; and in part by the Tianjin Natural Science Foundation under Grant 17JCJQJC43700. Yun Liu, Yu-Chao Gu, and Xin-Yu Zhang contributed equally to this work. Corresponding author: Ming-Ming Cheng. Yun Liu, Yu-Chao Gu, and Ming-Ming Cheng are with the College of Computer Science, Nankai University, Tianjin 300350, China (e-mail: [email protected]). Xin-Yu Zhang and Weiwei Wang are with the School of Mathematical Science, Nankai University, Tianjin 300350, China.

Abstract—Recently, salient object detection (SOD) has witnessed vast progress with the rapid development of convolutional neural networks (CNNs). However, the improvement in SOD accuracy comes with an increase in network depth and width, resulting in large network size and heavy computational overhead. This prevents state-of-the-art SOD methods from being deployed on practical platforms, especially mobile devices. To promote the deployment of real-world SOD applications, we aim at developing a lightweight SOD model in this paper. Our observation is that the primate visual system processes visual signals hierarchically, with different receptive fields and eccentricities in different visual cortex areas. Inspired by this, we propose a Hierarchical Visual Perception (HVP) module to imitate the primate visual cortex for hierarchical perception learning. With the HVP module incorporated, we design a lightweight SOD network, namely HVPNet. Extensive experiments on popular benchmarks demonstrate that HVPNet achieves highly competitive accuracy compared with state-of-the-art SOD methods while running at a 4.3fps CPU speed and a 333.2fps GPU speed with only 1.23M parameters.

Index Terms—Lightweight salient object detection, lightweight saliency detection, hierarchical visual perception.

I. INTRODUCTION

The human vision system can detect the most arresting objects or regions in natural images rapidly and automatically. Salient object detection (SOD) aims at imitating such a human instinct to capture the most eye-catching area in an image. The progress in SOD has benefited a broad range of computer vision applications, including object detection [1], image retrieval [2], visual tracking [3], image thumbnailing [4], etc. Conventional SOD methods [5], [6] mainly rely on hand-crafted low-level features. In spite of their efficiency, the lack of representation capacity for high-level semantics makes it difficult for these methods to model complicated natural scenes. Due to their powerful capacity for representation learning, convolutional neural networks (CNNs), especially fully convolutional networks (FCNs), have dominated this field. Numerous CNN- and FCN-based SOD approaches [7]–[22] have pushed the state of the art forward.

However, the accuracy improvement is not free. Traditional SOD requires a strong backbone (i.e., encoder) to capture both low-level fine-grained details and high-level semantic features, and a carefully calibrated decoder to recover the spatial resolution without losing spatial information, both of which may bring tremendous parameters and computational overhead [7]–[18], [22]–[32]. In contrast, despite the recently increased interest in mobile applications, platforms such as cell phones, where the computational capacity, memory space, and energy supply are limited, cannot deploy these large SOD models. The cost of deploying these large models on servers is also high. This inspires us to consider efficiency and the number of network parameters as important as accuracy in the evaluation of SOD methods.

With the aforementioned consideration, we aim at designing a lightweight SOD model to promote practical SOD systems. Although lightweight network architectures have been studied for other vision tasks, such as image classification [33]–[36], directly applying lightweight backbone networks to SOD leads to suboptimal performance. This is because SOD has special requirements in multi-scale learning as described above, while lightweight backbone networks, such as MobileNets [33], [34] and ShuffleNets [35], [36], focus on capturing high-level semantics and are less powerful in multi-scale learning than traditional large networks that are deeper, wider, and have more convolution filters. Therefore, lightweight SOD is still a challenging problem, and the key is how to effectively learn multi-scale contexts in a lightweight setting.

We get inspiration from the primate visual system to tackle this problem because modeling human visual perception for scene interpretation is a strong trend in computer vision [37]. About 55 percent of the neocortex of the primate brain is associated with vision [38], and the processing pipeline has a hierarchical structure [39]–[41]. Multi-scale visual signals are hierarchically processed in different cortex areas that have different population receptive fields (pRFs) [42]. Wandell et al. [42] found that the pRF size increases with eccentricity in retinotopic maps. A recent study [43] attempts to simulate the size and eccentricity of pRF using the kernel size and dilation rate of the convolution layer, respectively, so that the kernel size and dilation rate have a similar positive functional relation to that of the size and eccentricity of pRF. A simple way to simulate the primate visual system is the parallel organization of various pRFs. However, this ignores the visual hierarchy in the visual cortex, which had been studied in conventional computer vision, i.e., before deep learning [44], [45]. In this paper, we propose the Hierarchical Visual Perception (HVP) module to simulate the structure of the primate visual cortex. The HVP module uses a densely-connected structure to imitate the visual hierarchies and dilated convolution to imitate the pRF. Experimental results suggest that using the kernel sizes and dilation rates in descending order performs best, which is consistent with Hochstein and Ahissar's Reverse Hierarchy Theory (RHT) [40], which claims that visual perception begins at the higher levels and travels to the lower areas. With the HVP module and attention mechanism incorporated, we design a lightweight SOD network, namely HVPNet. Extensive evaluation on popular benchmarks demonstrates that HVPNet with only 1.23M parameters achieves highly competitive accuracy compared with state-of-the-art methods while running at a CPU speed of 4.3fps and a GPU speed of 333.2fps for 336 × 336 images.
We summarize our contributions as follows:
• We propose a novel Hierarchical Visual Perception (HVP) module to imitate the primate visual hierarchies for better multi-scale learning.
• With the HVP module and attention mechanism incorporated, we design HVPNet, which is, to the best of our knowledge, the first lightweight SOD network.
• We conduct extensive experiments to investigate and evaluate the proposed HVPNet, which thus serves as a strong baseline for future lightweight SOD research.

II. RELATED WORK

In this section, we first summarize recent progress in SOD and then review the literature on lightweight deep learning.

a) Salient Object Detection: Traditional SOD methods [5], [6], [46]–[48] mainly rely on hand-crafted features and heuristic priors. Due to their restricted representation capacity, hand-crafted features have been gradually replaced by deep learning. Thanks to the powerful feature representation learning capacity of CNNs and FCNs, this field has witnessed a burst of CNN- and FCN-based methods [7]–[12], [14], [15], [17]–[32], [49]–[53] in the last five years. Most efforts of these deep models focus on how to effectively fuse the multi-scale information of various side-outputs [54], [55]. Some methods [56]–[60] directly concatenate or sum the side-output features. Some methods [61], [62] perform saliency prediction using side-output features and then fuse all side-output predictions to obtain the final saliency map. Most methods [7]–[20], [22]–[31] use the encoder-decoder structure, in which the encoder is usually a backbone network for image classification [63], and the decoder is responsible for side-output feature fusion. Some clever designs have attracted much attention in this field. For example, PiCANet [12] proposes to apply bidirectional LSTM to compute global contexts. RAS [61] presents a reverse attention mechanism to manage side-outputs in a top-down way.

While the accuracy has been improved with the increase in network depth and width, the substantial computational overhead and large network size have hindered state-of-the-art SOD methods from being deployed into practical systems, especially on mobile devices. For example, the recent state-of-the-art method EGNet [64] has 108M parameters, which exceeds the tolerance of most mobile devices. Instead of continuing in this direction, in this paper, we pave a new path for SOD, i.e., lightweight SOD, which has great potential to promote SOD into more practical applications. Our proposed HVPNet performs comparably with state-of-the-art methods while maintaining high efficiency and small network size.

b) Lightweight Neural Networks: In many real-world applications, visual recognition tasks must be carried out in a timely, power-saving, and memory-friendly fashion under computational resource constraints. Although it has not been brought into SOD, many other vision tasks have built lightweight models to satisfy real-world requirements using weight quantization [65], [66], network compression [67], [68], computationally efficient architecture design [33]–[36], etc. Notably, for some vision tasks, such as image classification [33]–[36], lightweight networks have shown their superiority by reducing the model size and floating-point operations (FLOPs) with only a small performance drop. MobileNets [33], [34] adopt depth-wise separable convolutions to approximate the representation ability of regular convolutions with significantly reduced parameters. Based on the depth-wise convolution, ShuffleNets [35], [36] utilize a channel shuffle operation to further reduce the redundancy of point-wise convolutions. We share the same spirit as these prior arts [33]–[36] in building our model with depth-wise separable convolutions, while our main technical contribution comes from the observation about the hierarchical primate visual system. We propose the HVP module to imitate the primate visual hierarchies and pRF. We also explore the attention mechanism for further performance improvement. With these components incorporated, the proposed HVPNet achieves performance comparable to state-of-the-art methods in an extremely lightweight setting.

III. METHODOLOGY

In this section, we elaborate on our lightweight SOD network architecture. Concretely, we introduce our motivation from the primate visual system in Section III-A. Then, we present the primary building block, namely, the Hierarchical Visual Perception (HVP) module, in Section III-B. Other network components and the overall architecture are summarized in Section III-C and Section III-D, respectively.

A. Motivation and Principles

Lots of neurophysiological evidence suggests that a sequence of different levels of signal processing (8 to 10 levels) constitutes the hierarchical signal processing in the primate visual system [39]–[41]. The hierarchical processing exhibits straightforward superiority over the so-called flat processing that processes signals in a parallel way [45]. In fact, there is a large amount of neurophysiological evidence that cognition is associated with the concept of the deep hierarchy [69]. This is intuitive because our eye does not perceive all contents of a natural scene at first glance but recognizes objects with the highest contrast to their surroundings first, which is a simple understanding of the visual hierarchy. The capacity of the primate visual system to process information at hierarchical levels has inspired computer vision research. Please refer to [70] for a summary.
On the other hand, neurons in different cortex areas have different population receptive field (pRF) sizes, and the pRF size increases with eccentricity in each retinotopic map [42]. The impact of the eccentricity of pRF in the visual system can be simulated by dilated convolutions [43]. Specifically, we can imitate the pRF size with the kernel size and the eccentricity with the dilation rate, so that the kernel size and dilation rate have the same positive functional relation as that of the size and eccentricity of pRF. However, this simple flat processing [45] for feature learning from different pRFs, i.e., with parallel connections, is suboptimal, because it ignores the basic concept of the deep hierarchy in the primate visual system. In the experiments, we will demonstrate that the parallel-connection design is suboptimal for lightweight SOD.

In this paper, we propose a more realistic approach to mimic the primate visual system. We still use dilated convolution to imitate the pRF. In order to imitate the positive correlation between the size and eccentricity of pRF, a large dilation rate corresponds to a large kernel size. Instead of using simple parallel connections, we adopt serial connections for different pRFs. Since the actual connections of the primate visual system are intricate and do not follow a single exact order, we propose to impose dense connections for different pRFs so that the output feature of one pRF serves as the input for all of the following pRFs. Moreover, Hochstein and Ahissar's reverse hierarchy theory (RHT) [40] claims that the visual system first generates perception at higher levels and then travels to lower levels, which means that visual attention works in a coarse-to-fine way. Hence, the proposed HVP module puts large kernel sizes and dilation rates at the beginning to capture the high-level information (with large pRFs). Intuitively, the primate visual system in pre-attentive vision interprets the scene at a glance, i.e., only its large details. Experimental results show that arranging kernel sizes and dilation rates in descending order outperforms other orders, which supports our hypothesis about HVP and RHT. Therefore, the proposed HVP module is not only theoretically but also experimentally reasonable.

Fig. 1. Illustration of the proposed HVP module.

B. Hierarchical Visual Perception Module

With the principles described above, we continue by elaborating on the proposed HVP module. As shown in Fig. 1, we adopt dilated convolutions to imitate different visual cortex areas that have different pRFs, whose sizes and eccentricities have a similar relation to the kernel size and dilation rate of convolution. Here, we use depth-wise separable convolution (DSConv) [33] and point-wise convolution (i.e., the vanilla 1 × 1 convolution) as the atomic operations to reduce parameters and computational load. Let F^{k×k} be the vanilla convolution with a kernel size of k × k. For example, F^{1×1} is the vanilla 1 × 1 convolution. Suppose that F̂_d^{k×k} denotes a DSConv with a kernel size of k × k and a dilation rate of d, and we omit the subscript for d = 1, i.e., F̂_1^{k×k} = F̂^{k×k}.

Each simulation unit for a pRF is composed of a DSConv with kernel size r and a DSConv with dilation rate r, which can be formulated as

  R_r(X) = F^{1×1}(X),                                  if r = 1;
  R_r(X) = F^{1×1}(F̂_r^{3×3}(F^{1×1}(F̂^{r×r}(X)))),    if r > 1,    (1)

where standard batch normalization [71] and PReLU [72] layers are connected after each convolution layer. Here, we imitate the pRF size with the kernel size of F̂^{r×r} and the eccentricity with the dilation rate of F̂_r^{3×3}, so that the pRF size and the pRF eccentricity have the same positive functional relation. Note that we use two convolutions, F̂^{r×r} and F̂_r^{3×3}, rather than a single convolution F̂_r^{r×r}, because F̂_r^{r×r} would have large sparse convolution kernels (e.g., for r > 3) that are suboptimal for network training [73] and inefficient for network inference. Applying Eq. (1) with different values of r, we can imitate different areas of the primate visual cortex, e.g., the occipital areas V1 to hV4, whose pRF sizes and eccentricities gradually increase. In the concept of deep learning, we can learn multi-scale information with various receptive fields in this way.

As discussed in Section III-A, the processing of different visual cortex areas is organized in a hierarchical manner. Instead of using "flat" processing as in existing computer vision systems [74], [75], we propose to use hierarchical processing. Specifically, we connect the simulation units for different pRFs in a serial way. Besides, connections in the visual cortex are very complicated, and one area is not only connected to one other area. Hence, we apply dense connections [76] to the pRF simulation units to mimic the complex connections in the visual cortex, so that the output of one pRF is viewed as an input signal for all of the following pRFs. Formally, the output responses of all preceding pRF units are concatenated to serve as the input for the next unit, i.e.,

  X_i = R_{r_i}(Concat(X_0, X_1, ..., X_{i−1})),  1 ≤ i ≤ N,    (2)

where N is the number of pRF units, and X_0 (X_i with i = 0) denotes the input of an HVP module. Concat(·) represents the concatenation operation. The number of output channels and the number of convolution groups of the first DSConv F̂^{r_i×r_i}(·) in R_{r_i} are equal to the number of channels of X_0. The other convolutions in R_{r_i} have the same number of channels as X_0. From the perspective of deep learning, dense connections bring increased depth and a more powerful representation space, which leads to better performance.
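To make Eqs. (1) and (2) concrete, the following PyTorch-style sketch shows one possible implementation of the pRF simulation unit and the densely connected HVP module. It is a minimal illustration written from the equations only, not the authors' released code: the padding choices, the helper names, and the decision to return the last unit's response as the module output are assumptions; the default descending sequence r = 7, 5, 3, 1 follows the paper.

```python
import torch
import torch.nn as nn


def conv_bn_prelu(in_ch, out_ch, k=1, dilation=1, groups=1):
    """Convolution followed by BN and PReLU, as stated after Eq. (1)."""
    pad = dilation * (k - 1) // 2  # keep spatial resolution (assumed padding rule)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation,
                  groups=groups, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.PReLU(out_ch),
    )


class PRFUnit(nn.Module):
    """R_r in Eq. (1): an r x r DSConv (pRF size) and a 3x3 DSConv with dilation r
    (eccentricity), each followed by a point-wise convolution."""
    def __init__(self, in_ch, out_ch, r):
        super().__init__()
        if r == 1:
            self.body = conv_bn_prelu(in_ch, out_ch, k=1)  # R_1(X) = F^{1x1}(X)
        else:
            self.body = nn.Sequential(
                # first DSConv: output channels and groups equal the channels of X0
                conv_bn_prelu(in_ch, out_ch, k=r, groups=out_ch),
                conv_bn_prelu(out_ch, out_ch, k=1),
                # 3x3 DSConv with dilation r imitates the pRF eccentricity
                conv_bn_prelu(out_ch, out_ch, k=3, dilation=r, groups=out_ch),
                conv_bn_prelu(out_ch, out_ch, k=1),
            )

    def forward(self, x):
        return self.body(x)


class HVPModule(nn.Module):
    """Eq. (2): pRF units with descending r, densely connected by concatenation.
    Returning the last unit's output is an assumption about the module head."""
    def __init__(self, channels, rs=(7, 5, 3, 1)):
        super().__init__()
        self.units = nn.ModuleList(
            PRFUnit(channels * (i + 1), channels, r) for i, r in enumerate(rs)
        )

    def forward(self, x0):
        feats = [x0]
        for unit in self.units:
            feats.append(unit(torch.cat(feats, dim=1)))  # dense input to each unit
        return feats[-1]


if __name__ == "__main__":
    m = HVPModule(32)
    print(m(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 32, 56, 56])
```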
The last problem is how to decide the order of the pRFs. We follow the reverse hierarchy theory [40] to first generate visual perception at large pRFs and then flow the perception into small pRFs. Therefore, we use a descending order of kernel sizes/eccentricities for the convolutions imitating pRFs. For example, we adopt a sequence of r values of 7, 5, 3, 1 for each HVP module. Intuitively, human eyes usually see large objects (large pRFs) at first glance and then gradually focalize on fine-grained image details. This is also similar to the principle of hierarchical representations used for image retrieval and scalable indexes [77]. Our experimental results also demonstrate that the descending order performs better than any other order.

C. Attention and Dropout Mechanism

The attention mechanism has been demonstrated to be effective [10], [12], [17], [60], [61], [78]. Instead of only using spatial attention to adaptively highlight or suppress certain locations as in these methods, we further include channel attention to explore inter-channel dependencies and re-calibrate the channel activation. On the other hand, we insert the attention mechanism into the encoder network, while the above previous methods adopt attention in the decoder because their encoder is usually fixed to existing backbone networks.

a) Channel Attention: The channel attention mechanism was first introduced in [79]. Let X ∈ R^{C×H×W} be the input activation, in which C, H, and W are the number of channels, the height, and the width, respectively. We first apply global average pooling (GAP) to extract the channel-wise representations, i.e.,

  d_c = GAP(X) = (1/(HW)) Σ_{i=1}^{H} Σ_{j=1}^{W} X_{c,i,j},    (3)

in which d_c is the c-th value of the feature vector d ∈ R^C, and X_{c,i,j} is the value of X at coordinates (c, i, j). Then, we employ a simple soft-gating mechanism to calculate the per-channel importance, namely,

  d̂ = σ(F^{1×1}(ψ(F^{1×1}(d)))),    (4)

where the inner and outer 1 × 1 convolutions have C/r and C output channels, respectively. Here, r denotes the rate of channel reduction. Hence, we have d̂ ∈ R^C. ψ refers to the standard nonlinear activation function [80], and σ is the Sigmoid soft-gating function. Afterwards, the channel activation is re-calibrated in a multiplicative manner, i.e.,

  X̃ = d̂ ⊗ X,    (5)

where d̂ is duplicated to the size of C × H × W, and ⊗ indicates element-wise multiplication.

b) Spatial Attention: Given the re-calibrated features X̃, we extract the pixel-wise importance based on local responses. Our operation is computationally efficient, satisfying the requirement of lightweight SOD. Concretely, we adopt a simple k × k convolution with a single output channel, and again use a soft-gating mechanism (i.e., Sigmoid) to compute the spatial multipliers. Mathematically, we have

  v = σ(F^{k×k}(X̃)),    (6)

where v ∈ R^{H×W}. Similarly, the spatially re-calibrated activation is formulated as

  X̂ = v ⊗ X̃,    (7)

where v is duplicated to the size of C × H × W before multiplication.

c) Residual Attention: As we employ the attention mechanism sequentially and iteratively, multiplying by factors within the range of (0, 1) would gradually weaken the activation, leading to vanishing gradients. To this end, we employ residual learning [63] to facilitate gradient propagation. Finally, the output activation becomes

  Y = X̂ + X.    (8)
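As a concrete illustration of Eqs. (3)–(8), the sketch below chains the channel attention, the spatial attention, and the residual connection into one block. It is an interpretation of the formulas only; the reduction rate r and the spatial kernel size k are hyper-parameters whose defaults are not given in this text, so the values used here (reduction = 4, spatial kernel = 3) are placeholders.

```python
import torch
import torch.nn as nn


class ResidualAttention(nn.Module):
    """Channel attention (Eqs. 3-5), spatial attention (Eqs. 6-7), and the
    residual connection of Eq. (8): Y = spatial(channel(X)) + X."""

    def __init__(self, channels, reduction=4, spatial_kernel=3):
        super().__init__()
        # Eqs. (3)-(4): GAP, then two 1x1 convolutions with C/r and C output
        # channels, a ReLU (psi) in between, and a Sigmoid (sigma) gate.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Eq. (6): a single-channel k x k convolution followed by Sigmoid.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, spatial_kernel, padding=spatial_kernel // 2),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x_tilde = x * self.channel_gate(x)            # Eq. (5): per-channel re-calibration
        x_hat = x_tilde * self.spatial_gate(x_tilde)  # Eq. (7): per-pixel re-calibration
        return x_hat + x                              # Eq. (8): residual connection


if __name__ == "__main__":
    att = ResidualAttention(32)
    print(att(torch.randn(2, 32, 56, 56)).shape)  # torch.Size([2, 32, 56, 56])
```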
d) Dropout: Overfitting is always a pesky problem in deep learning. The strategy of dropping CNN activations has been shown to be useful for increasing the generalization capability and avoiding overfitting [81], [82]. In this paper, we connect a standard dropout layer [81] with a dropout rate of 0.1 before each HVP module in training. Unlike the recent attention-based dropout strategy [82] that drops activations according to the computed saliency map, the attention and dropout in our method are independent, for a fair comparison with previous SOD methods. In other words, the exploration of a new dropout strategy is out of the scope of this paper, so we follow previous literature and use the standard dropout layer [81]. In testing, the dropout layers are simply removed.

D. Network Architecture

With the aforementioned components, we build an encoder-decoder network with lateral connections, namely HVPNet. For the encoder, we stack the proposed HVP modules for fast deep feature extraction in a bottom-up manner. For the decoder, we use a simple method to integrate the high-level semantic features and the low-level fine-grained details in a top-down way. The details of our design are introduced as follows.

a) Encoder Network: Our encoder consists of 4 stages, and the default configurations for each stage are summarized in Table I. At the s-th stage, the input activation F_{s−1} is first downsampled by a (depth-wise separable or vanilla) convolution with stride 2, which is formulated as

  F_s = Concat(H_s(F_{s−1}), MaxPool_2(F_{s−1})),    (9)

where MaxPool_2 denotes the max pooling operator with a stride of 2. Standard batch normalization and PReLU follow Eq. (9). H_s is defined as

  H_s(F) = F^{3×3}(F),             if s = 1;
  H_s(F) = F̂^{5×5}(F^{1×1}(F)),    if s > 1.    (10)

For the first stage, the input is the color image that only has three channels, so we directly use a vanilla strided convolution.

TABLE I. ENCODER CONFIGURATIONS OF THE PROPOSED LIGHTWEIGHT SOD MODEL. "Module", "#M", "#F", "K", AND "S" REPRESENT THE MODULE TYPE, THE NUMBER OF MODULES, THE NUMBER OF CONVOLUTION FILTERS, THE KERNEL SIZE, AND THE STRIDE, RESPECTIVELY. "ResAtt" REFERS TO THE RESIDUAL ATTENTION IN SECTION III-C.

  Stage  Resolution   Module          #M  #F   K        S
  1      224 × 224    Conv            1   16   3        2
         112 × 112    Conv & ResAtt   1   16   3        1
  2      112 × 112    DSConv          1   32   5        2
         56 × 56      HVP & ResAtt    1   32   7-5-3-1  1
  3      56 × 56      DSConv          1   64   5        2
         28 × 28      HVP & ResAtt    3   64   7-5-3-1  1
  4      28 × 28      DSConv          1   128  5        2
         14 × 14      HVP & ResAtt    5   128  7-5-3-1  1
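A minimal sketch of the downsampling step of Eqs. (9)–(10) for stages s > 1 is shown below. The concatenation of a strided H_s branch with a stride-2 max-pooling branch follows Eq. (9); the assumption that H_s keeps its input channel count, so that the concatenation doubles it (16 → 32 → 64 → 128, matching the "#F" column of Table I), is an inference from the table rather than an explicit statement in the text.

```python
import torch
import torch.nn as nn


class Downsample(nn.Module):
    """Eq. (9) for stages s > 1: F_s = Concat(H_s(F_{s-1}), MaxPool_2(F_{s-1})),
    with H_s(F) = DSConv^{5x5}(Conv^{1x1}(F)) as in Eq. (10). Stage 1 instead
    applies a vanilla strided 3x3 convolution with 16 filters to the RGB image."""

    def __init__(self, in_ch):
        super().__init__()
        self.h = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 1, bias=False),                                   # point-wise conv
            nn.Conv2d(in_ch, in_ch, 5, stride=2, padding=2, groups=in_ch, bias=False),  # strided depth-wise 5x5
        )
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # BN and PReLU follow Eq. (9); the concatenation has 2 * in_ch channels.
        self.post = nn.Sequential(nn.BatchNorm2d(2 * in_ch), nn.PReLU(2 * in_ch))

    def forward(self, x):
        return self.post(torch.cat([self.h(x), self.pool(x)], dim=1))


if __name__ == "__main__":
    down = Downsample(16)
    print(down(torch.randn(1, 16, 112, 112)).shape)  # torch.Size([1, 32, 56, 56])
```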
b) Decoder Network: Given the four output features {F_s : s = 1, 2, 3, 4} of the encoder network, the goal of the decoder is to integrate features of different scales step by step and gradually recover the spatial resolution. For the fusion of the top features F_3 and F_4, we first upsample F_4 by a factor of 2 to match the resolution of F_3, and then a 1 × 1 convolution is applied to adjust the number of channels. Afterwards, the feature map is equally split into two feature maps along the channel dimension, with each split refined at a different scale. Formally, we have

  U_4 = Upsample(F_4),    (11)
  Q_3 = Concat(F^{3×3}(Ũ_4^1), F^{3×3}(Ũ_4^2)),    (12)

where Ũ_4^1 and Ũ_4^2 denote the two channel splits of F^{1×1}(U_4). Finally, the features are integrated by a simple element-wise summation, i.e.,

  D_3 = ψ(BN(Q_3) + BN(F^{1×1}(F_3))),    (13)

where BN and ψ are standard batch normalization [71] and the nonlinear activation function [72], respectively. D_3 is passed to the bottom stages. For the feature fusion of the bottom stages, Eq. (11) is simply adapted to U_s = Upsample(D_s), and all of the remaining operations are preserved. In the end, we obtain the integrated features {D_s : s = 1, 2, 3, 4} at different scales (D_4 = F_4), which will be re-visited when calculating the final training loss.
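The fusion step of Eqs. (11)–(13) can be sketched as one decoder block: upsample the higher-level feature, adjust and split its channels, refine each split with a 3 × 3 convolution, and add the result to the laterally connected encoder feature. The text only states that the 1 × 1 convolution "adjusts" the channels, so making its output match the lateral feature's channel count is an assumption of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderBlock(nn.Module):
    """Eqs. (11)-(13): D_s = psi(BN(Q_s) + BN(Conv1x1(F_s)))."""

    def __init__(self, top_ch, lateral_ch):
        super().__init__()
        self.reduce = nn.Conv2d(top_ch, lateral_ch, 1, bias=False)  # adjust channels of U
        half = lateral_ch // 2
        self.refine1 = nn.Conv2d(half, half, 3, padding=1, bias=False)  # one 3x3 conv per split
        self.refine2 = nn.Conv2d(half, half, 3, padding=1, bias=False)
        self.bn_q = nn.BatchNorm2d(lateral_ch)
        self.lateral = nn.Conv2d(lateral_ch, lateral_ch, 1, bias=False)  # F^{1x1}(F_s)
        self.bn_l = nn.BatchNorm2d(lateral_ch)
        self.act = nn.PReLU(lateral_ch)

    def forward(self, top, lateral):
        u = F.interpolate(top, scale_factor=2, mode='bilinear', align_corners=False)  # Eq. (11)
        u = self.reduce(u)
        u1, u2 = torch.chunk(u, 2, dim=1)                         # equal channel split
        q = torch.cat([self.refine1(u1), self.refine2(u2)], 1)    # Eq. (12)
        return self.act(self.bn_q(q) + self.bn_l(self.lateral(lateral)))  # Eq. (13)


if __name__ == "__main__":
    block = DecoderBlock(top_ch=128, lateral_ch=64)
    d3 = block(torch.randn(1, 128, 14, 14), torch.randn(1, 64, 28, 28))
    print(d3.shape)  # torch.Size([1, 64, 28, 28])
```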
c) Deep Supervision & Loss Function: We employ deep supervision [83] to ease the optimization of the latent units. For each fused feature D_s, s = 1, 2, 3, 4, we project it to a single-channel feature map via a point-wise convolution. A Sigmoid activation function is utilized to get the saliency predictions P_s. We use the binary cross entropy (BCE) loss for supervision, which is formulated as

  L = BCE(P_1, G) + λ Σ_{s=2}^{4} BCE(P_s, G),    (14)

where G refers to the ground-truth, and λ is a hyper-parameter that is empirically set to 0.4 as in [75]. Note that P_1 is the output saliency map of the proposed HVPNet.
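Eq. (14) corresponds to the following loss computation. The sketch uses the numerically stable with-logits form of BCE (equivalent to Sigmoid followed by BCE); resizing the side outputs to the ground-truth resolution is an assumed implementation detail not stated in the text.

```python
import torch.nn.functional as F


def deep_supervision_loss(side_logits, gt, lam=0.4):
    """Eq. (14): BCE on the main prediction P1 plus lambda-weighted BCE on P2..P4.

    side_logits: list of four single-channel logit maps [P1, P2, P3, P4]
    (before Sigmoid); gt: ground-truth map of shape (N, 1, H, W) in [0, 1]."""
    total = 0.0
    for s, logits in enumerate(side_logits):
        if logits.shape[-2:] != gt.shape[-2:]:
            logits = F.interpolate(logits, size=gt.shape[-2:],
                                   mode='bilinear', align_corners=False)
        bce = F.binary_cross_entropy_with_logits(logits, gt)
        total = total + (bce if s == 0 else lam * bce)
    return total
```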
IV. EXPERIMENTS

A. Experimental Configurations

a) Implementation Details: We implement the proposed HVPNet using the popular PyTorch library. By default, we train our model using the Adam optimizer with a weight decay of 10^{−4} and a batch size of 20. Our model and its variants are trained from scratch for 50 epochs for the ablation studies. When comparing with the state-of-the-art methods, we pretrain HVPNet on ImageNet, as is commonly done in state-of-the-art methods. The learning rate decays with a poly scheduler, i.e.,

  curr_lr = init_lr × (1 − curr_iter / max_iter)^power,    (15)

where init_lr = 5 × 10^{−4} and power = 0.9 are used.
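The poly schedule of Eq. (15) is a simple per-iteration decay; a minimal sketch (the manual update of the optimizer's parameter groups is one possible way to apply it):

```python
def poly_lr(init_lr, curr_iter, max_iter, power=0.9):
    """Eq. (15): curr_lr = init_lr * (1 - curr_iter / max_iter) ** power."""
    return init_lr * (1.0 - curr_iter / max_iter) ** power


# Example usage with a PyTorch optimizer (init_lr = 5e-4, power = 0.9 as in the paper):
# for it in range(max_iter):
#     lr = poly_lr(5e-4, it, max_iter)
#     for group in optimizer.param_groups:
#         group['lr'] = lr
#     ...  # forward, backward, optimizer.step()
```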
b) Datasets: We conduct experiments on six popular datasets, namely ECSSD [85], DUT-O (i.e., DUT-OMRON) [5], DUTS [86], HKU-IS [49], SOD [87], and THUR15K [88]. These six datasets contain 1000, 5168, 15572, 4447, 300, and 6232 pairs of natural images and saliency maps, respectively. The DUTS [86] dataset is divided into 10553 training images and 5019 test images. Following recent works [7], [12], [57], [59], [64], we train models on the DUTS training set and evaluate models on the DUTS test set (DUTS-TE) as well as the other five datasets.

c) Evaluation Metrics: For evaluation, we adopt two widely-used metrics, i.e., F-measure and mean absolute error (MAE). The F-measure, denoted by F_β, is based on the precision and recall of the prediction:

  F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall),    (16)

where β² is set to 0.3 to highlight the precision. MAE is a pixel-wise average of the absolute prediction error, which can be formulated as

  MAE(P, G) = (1/(HW)) Σ_{i=1}^{H} Σ_{j=1}^{W} |P_{ij} − G_{ij}|,    (17)

where P is the predicted saliency map, G is the corresponding ground-truth, H is the image height, and W is the image width.
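Eqs. (16)–(17) can be computed as below. The binarization threshold used to obtain precision and recall is not specified in this excerpt (common choices are an adaptive threshold or a sweep over thresholds), so the sketch exposes it as a parameter.

```python
import numpy as np


def f_measure(pred, gt, threshold=0.5, beta2=0.3):
    """Eq. (16) with beta^2 = 0.3; pred and gt are HxW arrays in [0, 1]."""
    p = (pred >= threshold).astype(np.float64)
    g = (gt >= 0.5).astype(np.float64)
    tp = (p * g).sum()
    precision = tp / (p.sum() + 1e-8)
    recall = tp / (g.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)


def mae(pred, gt):
    """Eq. (17): pixel-wise mean absolute error between prediction and ground-truth."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()
```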
d) Lightweight Measures: The lightweight setting is the core consideration of this paper. Here, we elaborate on the lightweight measures. If a model has specified its input dimensions, we use its default settings for testing. Otherwise, we adopt 336 × 336 as its input size to test its speed and compute its number of FLOPs. The CPU speed in this paper is tested on an Intel i7-8700K CPU, and the GPU speed is tested using an NVIDIA TITAN Xp GPU.
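The parameter counts and speeds reported in the tables can be measured with standard PyTorch utilities, for example as sketched below (the 336 × 336 input follows the protocol above; the warm-up and repetition counts are arbitrary choices, and FLOPs counting would additionally require a profiling tool, which is omitted here).

```python
import time
import torch


def count_parameters(model):
    """Number of learnable parameters (reported in millions in Table II)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


@torch.no_grad()
def measure_fps(model, device='cuda', size=336, warmup=10, runs=100):
    """Average frames per second for a 1 x 3 x size x size input."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, size, size, device=device)
    for _ in range(warmup):
        model(x)
    if device == 'cuda':
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    if device == 'cuda':
        torch.cuda.synchronize()
    return runs / (time.time() - start)
```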
B. Performance Comparison

a) Comparison with Former SOD Methods: We compare the proposed HVPNet with 15 state-of-the-art SOD methods. Table II shows the quantitative results. Our method achieves comparable performance with the previous state-of-the-art BASNet [28] and EGNet [64], but significantly reduces the parameters and FLOPs. For example, our method achieves a 92.5% F-measure on ECSSD, slightly lower than the 93.8% F-measure of EGNet, but we only need 1.1% of the parameters of EGNet. We also achieve the fastest speed and the fewest FLOPs among the compared methods. Specifically, we can reach 333.2fps, while previous methods can only reach a best speed of 68fps. The number of FLOPs of HVPNet is only 1.1G. Since the number of FLOPs is related to energy consumption, the small number of FLOPs of HVPNet makes it friendly to mobile applications.

TABLE II. COMPARISON WITH EXISTING SOD METHODS. THE NUMBER OF FLOPS IS COMPUTED USING A 336 × 336 INPUT EXCEPT WHEN A METHOD HAS SPECIFIED ITS OWN INPUT DIMENSIONS. WE LABEL THE BEST PERFORMANCE IN EACH COLUMN IN BOLD. HERE, THE MAIN ADVANTAGE OF OUR APPROACH LIES IN THE TRADE-OFF BETWEEN ACCURACY AND EFFICIENCY.

  Columns: Methods | #Param (M) | FLOPs (G) | Speed (FPS) | ECSSD (Fβ↑, MAE↓) | DUT-O (Fβ↑, MAE↓) | DUTS-TE (Fβ↑, MAE↓) | HKU-IS (Fβ↑, MAE↓) | SOD (Fβ↑, MAE↓) | THUR15K (Fβ↑, MAE↓)
  DRFI [6]        -      -     0.1    0.777 0.161  0.652 0.138  0.649 0.154  0.774 0.146  0.704 0.217  0.670 0.150
  DCL [56]        66.24  224.9 1.4    0.895 0.080  0.733 0.095  0.785 0.082  0.892 0.063  0.831 0.131  0.747 0.096
  DHSNet [15]     94.04  15.8  10.0   0.903 0.062  -     -      0.807 0.066  0.889 0.053  0.822 0.128  0.752 0.082
  RFCN [23]       134.69 102.8 0.4    0.896 0.097  0.738 0.095  0.782 0.089  0.892 0.080  0.802 0.161  0.754 0.100
  NLDF [25]       35.49  263.9 18.5   0.902 0.066  0.753 0.080  0.806 0.065  0.902 0.048  0.837 0.123  0.762 0.080
  DSS [62]        62.23  114.6 7.0    0.915 0.056  0.774 0.066  0.827 0.056  0.913 0.041  0.842 0.122  0.770 0.074
  Amulet [14]     33.15  45.3  9.7    0.913 0.061  0.743 0.098  0.778 0.085  0.897 0.051  0.795 0.144  0.755 0.094
  UCF [24]        23.98  61.4  12.0   0.901 0.071  0.730 0.120  0.772 0.112  0.888 0.062  0.805 0.148  0.758 0.112
  SRM [59]        43.74  20.3  12.3   0.914 0.056  0.769 0.069  0.826 0.059  0.906 0.046  0.840 0.126  0.778 0.077
  PiCANet [12]    32.85  37.1  5.6    0.923 0.049  0.766 0.068  0.837 0.054  0.916 0.042  0.836 0.102  0.783 0.083
  BRN [7]         126.35 24.1  3.6    0.919 0.043  0.774 0.062  0.827 0.050  0.910 0.036  0.843 0.103  0.769 0.076
  C2S [9]         137.03 20.5  16.7   0.907 0.057  0.759 0.072  0.811 0.062  0.898 0.046  0.819 0.122  0.775 0.083
  RAS [61]        20.13  35.6  20.4   0.916 0.058  0.785 0.063  0.831 0.059  0.913 0.045  0.847 0.123  0.772 0.075
  CPD [84]        29.23  59.5  68.0   0.930 0.044  0.794 0.057  0.861 0.043  0.924 0.033  0.848 0.113  0.795 0.068
  BASNet [28]     87.06  127.3 36.2   0.938 0.040  0.805 0.056  0.859 0.048  0.928 0.032  0.849 0.112  0.783 0.073
  EGNet [64]      108.07 270.8 12.7   0.938 0.044  0.794 0.056  0.870 0.044  0.928 0.034  0.859 0.110  0.800 0.070
  HVPNet (OURS)   1.23   1.1   333.2  0.925 0.055  0.799 0.064  0.839 0.058  0.915 0.045  0.826 0.122  0.787 0.076

TABLE III. COMPARISON BETWEEN THE PROPOSED HVPNET AND EXISTING LIGHTWEIGHT BACKBONE NETWORKS. WE REFORM THESE LIGHTWEIGHT BACKBONE NETWORKS FOR SOD BY VIEWING THEM AS THE ENCODER AND ADDING THE SAME DECODER AS HVPNET TO THEM. THE BEST PERFORMANCE IS HIGHLIGHTED IN BOLD.

  Columns: Backbone | ECSSD (Fβ↑, MAE↓) | DUT-O (Fβ↑, MAE↓) | DUTS-TE (Fβ↑, MAE↓) | HKU-IS (Fβ↑, MAE↓) | SOD (Fβ↑, MAE↓) | THUR15K (Fβ↑, MAE↓)
  MobileNet [33]      0.892 0.127  0.743 0.091  0.792 0.090  0.885 0.098  0.765 0.208  0.759 0.090
  MobileNetV2 [34]    0.903 0.066  0.760 0.072  0.804 0.067  0.896 0.053  0.807 0.137  0.768 0.082
  ShuffleNet [36]     0.913 0.061  0.764 0.069  0.813 0.063  0.901 0.051  0.815 0.130  0.771 0.080
  ShuffleNetV2 [35]   0.898 0.070  0.751 0.076  0.789 0.072  0.885 0.059  0.785 0.147  0.756 0.087
  HVPNet (OURS)       0.925 0.055  0.799 0.064  0.839 0.058  0.915 0.045  0.826 0.122  0.787 0.076

Fig. 2. Illustration of the trade-off among F-measure, the number of parameters, FLOPs, and speed. Here, the F-measure is the average over all six test datasets. Note that the horizontal axis is logarithmic.

To better illustrate the trade-off between accuracy and efficiency, we plot three figures in Fig. 2, showing the F-measure against the number of parameters, the number of FLOPs, and the speed, respectively. Here, we adopt the average F-measure over all six datasets for clarity. In the figures of F-measure vs. parameters and F-measure vs. FLOPs, HVPNet lies at the top left, which demonstrates its extremely lightweight setting and good accuracy. In the figure of F-measure vs. speed, HVPNet lies at the top right, which demonstrates its good trade-off between accuracy and speed. Therefore, we can come to the conclusion that HVPNet achieves a good trade-off among accuracy, the number of parameters, the number of FLOPs, and the speed.

In Fig. 3, we display some qualitative comparisons with state-of-the-art SOD methods. To clearly show the difference between the predicted saliency maps of various methods, we calculate the similarity of each predicted saliency map to the corresponding ground-truth saliency map. Here, we adopt three similarity metrics, including Pearson's Correlation Coefficient (PCC or CC), Similarity (or histogram intersection, denoted as SIM), and SSIM [89]. Please refer to the survey paper [90] on saliency metrics for more details about PCC/CC and SIM, while SSIM [89] is a well-known metric for structural similarity measurement. From Fig. 3, we can observe that, in spite of the extremely lightweight setting, HVPNet outperforms previous methods on strange objects (lines 1-2), confusing scenarios (lines 3-4), thin objects (line 5), complex backgrounds (lines 6-7), indistinguishable boundaries (line 8), and large objects (lines 9-10). This further demonstrates the superiority of HVPNet.
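The per-image PCC and SIM scores used in Fig. 3 can be computed as below (SSIM is additionally available, e.g., as structural_similarity in scikit-image). Normalizing the maps as done here follows the common definitions surveyed in [90]; the exact preprocessing used for the figure is not specified in this excerpt and is therefore an assumption.

```python
import numpy as np


def pcc(pred, gt):
    """Pearson's correlation coefficient between a predicted and a ground-truth map."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    g = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float((p * g).mean())


def sim(pred, gt):
    """Similarity / histogram intersection: sum of element-wise minima after
    each map is normalized to sum to 1."""
    p = pred / (pred.sum() + 1e-8)
    g = gt / (gt.sum() + 1e-8)
    return float(np.minimum(p, g).sum())
```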
TABLE IV. ABLATION STUDIES. "PA", "SE", "DC", "CA", "SA", "DP", AND "IP" REFER TO PARALLEL CONNECTION, SERIES CONNECTION, DENSE CONNECTION, CHANNEL-WISE ATTENTION, SPATIAL ATTENTION, DROPOUT, AND IMAGENET PRETRAINING, RESPECTIVELY.

  Columns: No. | Components | ECSSD (Fβ↑, MAE↓) | DUT-O (Fβ↑, MAE↓) | DUTS-TE (Fβ↑, MAE↓) | HKU-IS (Fβ↑, MAE↓) | SOD (Fβ↑, MAE↓) | THUR15K (Fβ↑, MAE↓)
  1  PA                   0.906 0.065  0.769 0.073  0.799 0.071  0.892 0.056  0.798 0.138  0.759 0.087
  2  SE                   0.893 0.075  0.750 0.080  0.777 0.080  0.879 0.064  0.792 0.152  0.749 0.091
  3  SE+DC                0.909 0.062  0.772 0.072  0.807 0.068  0.899 0.053  0.798 0.138  0.766 0.085
  4  SE+DC+CA             0.911 0.062  0.776 0.071  0.807 0.069  0.898 0.053  0.811 0.135  0.765 0.084
  5  SE+DC+SA             0.908 0.062  0.768 0.073  0.804 0.069  0.895 0.054  0.809 0.136  0.765 0.085
  6  SE+DC+CA+SA          0.912 0.062  0.772 0.071  0.808 0.068  0.898 0.053  0.801 0.138  0.766 0.084
  7  SE+DC+CA+SA+DP       0.910 0.065  0.781 0.070  0.814 0.068  0.900 0.054  0.809 0.141  0.769 0.083
  8  SE+DC+CA+SA+DP+IP    0.925 0.055  0.799 0.064  0.839 0.058  0.915 0.045  0.826 0.122  0.787 0.076

TABLE V. ABLATION STUDIES. "KERNEL ORDER" MEANS THE ORDER OF r IN EQ. (1) IN EACH HVP MODULE. WE USE "#MODULES" TO DENOTE THE NUMBERS OF HVP MODULES FOR THE FOUR STAGES OF THE ENCODER. THE DEFAULT CONFIGURATION USES A SEQUENCE OF r VALUES OF 7, 5, 3, 1 AND MODULE NUMBERS OF 1, 1, 3, 5.

  Columns: Config. | ECSSD (Fβ↑, MAE↓) | DUT-O (Fβ↑, MAE↓) | DUTS-TE (Fβ↑, MAE↓) | HKU-IS (Fβ↑, MAE↓) | SOD (Fβ↑, MAE↓) | THUR15K (Fβ↑, MAE↓)
  Default Config.               0.910 0.065  0.781 0.070  0.814 0.068  0.900 0.054  0.809 0.141  0.769 0.083
  kernel order: 1, 3, 5, 7      0.901 0.070  0.769 0.078  0.796 0.075  0.891 0.059  0.808 0.138  0.761 0.089
  kernel order: 1, 5, 3, 7      0.900 0.070  0.768 0.077  0.790 0.077  0.889 0.060  0.795 0.145  0.762 0.089
  kernel order: 3, 7, 1, 5      0.902 0.069  0.765 0.079  0.794 0.076  0.890 0.059  0.795 0.146  0.760 0.089
  kernel order: 5, 1, 7, 3      0.905 0.068  0.774 0.076  0.801 0.073  0.892 0.057  0.803 0.137  0.762 0.088
  dilation rates: 9, 7, 5, 3, 1 0.911 0.066  0.778 0.074  0.807 0.071  0.899 0.055  0.814 0.141  0.766 0.087
  dilation rates: 5, 3, 1       0.910 0.065  0.772 0.075  0.800 0.073  0.894 0.056  0.809 0.141  0.764 0.088
  #modules: 1, 1, 2, 2          0.908 0.065  0.773 0.074  0.800 0.072  0.897 0.055  0.798 0.141  0.765 0.087
  #modules: 1, 1, 3, 8          0.906 0.064  0.772 0.076  0.796 0.073  0.893 0.057  0.801 0.141  0.765 0.089
  #filters: ×0.75               0.906 0.067  0.768 0.077  0.796 0.076  0.894 0.058  0.797 0.143  0.763 0.089
  #filters: ×1.25               0.915 0.062  0.786 0.072  0.816 0.068  0.902 0.053  0.822 0.137  0.773 0.083

b) Comparison with Lightweight Backbones: Although so far there is no lightweight SOD model, there exist lightweight backbone networks designed for efficient image classification. Here, we add our lightweight decoder to four lightweight backbones for SOD, including MobileNet [33], MobileNetV2 [34], ShuffleNet [36], and ShuffleNetV2 [35]. We adopt the same training settings as HVPNet to train these baselines. Table III shows the evaluation results. We can observe that HVPNet achieves better results than directly applying lightweight backbones to SOD. This demonstrates that lightweight SOD is a worth-studying and promising research field. This also proves that the proposed method is nontrivial.

C. Ablation Study

a) Effectiveness of Each Module Component: Table IV verifies the effectiveness of each component of the proposed HVPNet. Our efforts start with the design of a parallel version of the HVP module. We find that the densely-connected series version of the HVP module outperforms the parallel connection, which verifies that processing visual perception in a hierarchical manner is more effective. Then, we incorporate the spatial and channel attention mechanisms into our encoder. The results verify that including spatial and channel attention simultaneously is beneficial to hierarchical visual perception learning. We also investigate the impact of different training strategies, i.e., dropout and ImageNet [91] pretraining. We find that both strategies improve the generalization ability of our model in a majority of the experimental settings.

b) Configurations of HVPNet: Table V shows the ablation results of various network configurations. Firstly, setting the kernel sizes in descending order for each HVP module consistently outperforms the other variants, implying the correctness of the reverse hierarchy theory [40]. Secondly, enhancing the model capacity, i.e., incorporating another pRF or increasing the number of filters, can slightly improve performance, but leads to inefficiency, which is opposite to the purpose of our design. Considering the trade-off between accuracy and efficiency, we adopt the default settings as in Table I.

D. Evaluation for Eye Fixation Prediction

Another task that is highly related to SOD is eye fixation prediction. Unlike SOD, which requires segmenting the whole salient objects from an image, eye fixation prediction only aims at finding eye fixation points without the requirement of object segmentation. In some studies, eye fixation prediction is also called saliency prediction. Here, we call it eye fixation prediction to distinguish it from SOD. To demonstrate the superiority of the proposed HVPNet, we also evaluate it for eye fixation prediction on the well-known SALICON 2017 benchmark [97]. SALICON 2017 contains 10000 training images and 5000 validation images with ground-truth annotations. The test set with 5000 images is released without ground-truth because it is an online competition. All images have the same resolution of 480×640. We train HVPNet for 10 epochs with a batch size of 8 and the standard loss function in [94], [95]. Other training settings are kept the same as for SOD. For the evaluation metrics, we adopt four standard metrics in eye fixation prediction, including NSS, CC, AUC, and sAUC, using the code provided by the SALICON 2017 benchmark [97]. Please refer to the survey paper [90] for more details about these metrics.
Fig. 3. Qualitative comparison with state-of-the-art SOD methods (columns: Image, RFCN, DSS, SRM, Amulet, UCF, BRN, PiCANet, C2S, RAS, CPD, EGNet, Ours, GT). The red, green, and pink numbers denote the PCC, SIM, and SSIM values between each predicted saliency map and the corresponding ground-truth (GT), respectively.

TABLE VI. EVALUATION FOR EYE FIXATION PREDICTION ON THE SALICON 2017 BENCHMARK. THE NUMBER OF FLOPS IS COMPUTED USING A 480 × 640 INPUT IMAGE. WE HIGHLIGHT THE BEST PERFORMANCE IN EACH COLUMN IN BOLD. THE PROPOSED HVPNET ACHIEVES RESULTS SIMILAR TO THE BEST PERFORMANCE WITH AN EXTREMELY LIGHTWEIGHT SETTING.

  Columns: Method | Backbone | Year | #Param (M) | FLOPs (G) | Validation set (NSS↑, CC↑, AUC↑, sAUC↑) | Test set (NSS↑, CC↑, AUC↑, sAUC↑)
  MLNet [92]      ResNet50  2016  15.42  123.2  1.422 0.584 0.769 0.697  1.453 0.583 0.764 0.687
  SalGAN [93]     ResNet50  2017  31.78  94.1   1.635 0.796 0.846 0.716  1.662 0.798 0.847 0.700
  SAM [94]        ResNet50  2018  70.09  343.6  1.966 0.900 0.866 0.758  1.990 0.899 0.865 0.741
  EML-NET [95]    ResNet50  2018  23.54  25.3   2.002 0.879 0.861 0.757  2.018 0.874 0.858 0.740
  DINet [96]      ResNet50  2019  27.03  156.7  1.957 0.907 0.864 0.759  1.972 0.907 0.863 0.741
  HVPNet (OURS)   -         -     1.23   3.0    1.981 0.873 0.865 0.757  2.003 0.869 0.863 0.740

The evaluation results are summarized in Table VI. We compare the proposed HVPNet with recent state-of-the-art eye fixation prediction methods, including MLNet [92], SalGAN [93], SAM [94], EML-NET [95], and DINet [96]. We find that HVPNet achieves results similar to the best performance in terms of all evaluation metrics but with an extremely lightweight setting, i.e., significantly fewer parameters and FLOPs. Therefore, we can come to the conclusion that HVPNet achieves a good trade-off between effectiveness and efficiency for both SOD and eye fixation prediction, making it possible to apply it in practical applications.

V. CONCLUSION

Instead of only focusing on model accuracy, in this paper, we explore a new direction for SOD, i.e., lightweight SOD, which aims at achieving a good trade-off among accuracy, efficiency, the number of parameters, and the number of FLOPs. Along this path, we present a novel HVP module to imitate the primate visual cortex for hierarchical visual perception learning. Building on the proposed HVP module, the proposed HVPNet achieves accuracy comparable to state-of-the-art SOD models while running much faster with far fewer parameters and FLOPs. To the best of our knowledge, this is the first attempt in SOD towards an accuracy-efficiency trade-off and lightweight models. We demonstrate that directly applying lightweight backbones [33]–[36] to SOD leads to suboptimal performance, which suggests that lightweight SOD is worth studying and should be set up as a new research direction. Through this study, we expect to arouse research interest in lightweight SOD, which has the potential to promote more practical SOD systems.

REFERENCES

[1] U. Rutishauser, D. Walther, C. Koch, and P. Perona, "Is bottom-up attention useful for object recognition?" in IEEE Conf. Comput. Vis. Pattern Recog., vol. 2, 2004, pp. 37–44.
[2] W. Zhou, H. Li, and Q. Tian, "Recent advance in content-based image retrieval: A literature survey," arXiv preprint arXiv:1706.06064, 2017.
[3] P. Li, D. Wang, L. Wang, and H. Lu, "Deep visual tracking: Review and experimental comparison," Pattern Recognition, vol. 76, pp. 323–338, 2018.
[4] L. Marchesotti, C. Cifarelli, and G. Csurka, "A framework for visual saliency detection with applications to image thumbnailing," in Int. Conf. Comput. Vis., 2009, pp. 2232–2239.
[5] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 3166–3173.
[6] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, "Salient object detection: A discriminative regional feature integration approach," in IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 2083–2090.
[7] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, and A. Borji, "Detect globally, refine locally: A novel approach to saliency detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 3127–3135.
[8] L. Zhang, J. Dai, H. Lu, Y. He, and G. Wang, "A bi-directional message passing model for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 1741–1750.
[9] X. Li, F. Yang, H. Cheng, W. Liu, and D. Shen, "Contour knowledge transfer for salient object detection," in Eur. Conf. Comput. Vis., 2018, pp. 355–370.
[10] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang, "Progressive attention guided recurrent network for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 714–722.
[11] W. Wang, J. Shen, X. Dong, and A. Borji, "Salient object detection driven by fixation prediction," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 1711–1720.
[12] N. Liu, J. Han, and M.-H. Yang, "PiCANet: Learning pixel-wise contextual attention for saliency detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 3089–3098.
[13] M. A. Islam, M. Kalash, and N. D. Bruce, "Revisiting salient object detection: Simultaneous detection, ranking, and subitizing of multiple salient objects," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 7142–7150.
[14] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan, "Amulet: Aggregating multi-level convolutional features for salient object detection," in Int. Conf. Comput. Vis., 2017, pp. 202–211.
[15] N. Liu and J. Han, "DHSNet: Deep hierarchical saliency network for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 678–686.
[16] S. He, J. Jiao, X. Zhang, G. Han, and R. W. Lau, "Delving into salient object subitizing and detection," in Int. Conf. Comput. Vis., 2017, pp. 1059–1067.
[17] W. Wang, S. Zhao, J. Shen, S. C. Hoi, and A. Borji, "Salient object detection with pyramid attention and salient edges," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 1448–1457.
[18] R. Wu, M. Feng, W. Guan, D. Wang, H. Lu, and E. Ding, "A mutual learning method for salient object detection with intertwined multi-supervision," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 8150–8159.
[19] H. Li, G. Li, B. Yang, G. Chen, L. Lin, and Y. Yu, "Depthwise nonlocal module for fast salient object detection using a single thread," IEEE Transactions on Cybernetics, 2020.
[20] S. Chen, B. Wang, X. Tan, and X. Hu, "Embedding attention and residual network for accurate salient object detection," IEEE Transactions on Cybernetics, vol. 50, no. 5, pp. 2050–2062, 2020.
[21] F. Guo, W. Wang, J. Shen, L. Shao, J. Yang, D. Tao, and Y. Y. Tang, "Video saliency detection using object proposals," IEEE Transactions on Cybernetics, vol. 48, no. 11, pp. 3159–3170, 2017.
[22] Y. Zhou, S. Huo, W. Xiang, C. Hou, and S.-Y. Kung, "Semi-supervised salient object detection using a linear feedback control system model," IEEE Transactions on Cybernetics, vol. 49, no. 4, pp. 1173–1185, 2018.
[23] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan, "Saliency detection with recurrent fully convolutional networks," in Eur. Conf. Comput. Vis., 2016, pp. 825–841.
[24] P. Zhang, D. Wang, H. Lu, H. Wang, and B. Yin, "Learning uncertain convolutional features for accurate saliency detection," in Int. Conf. Comput. Vis., 2017, pp. 212–221.
[25] Z. Luo, A. K. Mishra, A. Achkar, J. A. Eichel, S. Li, and P.-M. Jodoin, "Non-local deep features for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 6609–6617.
[26] P. Hu, B. Shuai, J. Liu, and G. Wang, "Deep level sets for salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 2300–2309.
[27] N. D. Bruce, C. Catton, and S. Janjic, "A deeper look at saliency: Feature contrast, semantics, and beyond," in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 516–524.
[28] X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand, "BASNet: Boundary-aware salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 7479–7489.
[29] M. Feng, H. Lu, and E. Ding, "Attentive feedback network for boundary-aware salient object detection," in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 1623–1632.
[30] S. Wang, S. Yang, M. Wang, and L. Jiao, "New contour cue-based hybrid sparse learning for salient object detection," IEEE Transactions on Cybernetics, 2019.
[31] K. Yan, X. Wang, J. Kim, and D. Feng, "A new aggregation of DNN sparse and dense labeling for saliency detection," IEEE Transactions on Cybernetics, 2020.
[32] H. Li, G. Li, and Y. Yu, "ROSA: Robust salient object detection against adversarial attacks," IEEE Transactions on Cybernetics, 2019.
[33] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[34] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 4510–4520.
[35] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design," in Eur. Conf. Comput. Vis., 2018, pp. 116–131.
[36] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 6848–6856.
[37] I. González-Díaz, V. Buso, and J. Benois-Pineau, "Perceptual modeling in the problem of active object recognition in visual scenes," Pattern Recogn., vol. 56, pp. 129–141, 2016.
[38] D. J. Felleman and D. E. Van, "Distributed hierarchical processing in the primate cerebral cortex," Cerebral Cortex, vol. 1, no. 1, pp. 1–47, 1991.
[39] T. Serre, "Hierarchical models of the visual system," Encyclopedia of Computational Neuroscience, pp. 1309–1318, 2015.
[40] S. Hochstein and M. Ahissar, "View from the top: Hierarchies and reverse hierarchies in the visual system," Neuron, vol. 36, no. 5, pp. 791–804, 2002.
[41] J. J. DiCarlo, D. Zoccolan, and N. C. Rust, "How does the brain solve visual object recognition?" Neuron, vol. 73, no. 3, pp. 415–434, 2012.
[42] B. A. Wandell and J. Winawer, "Computational neuroimaging and population receptive fields," Trends in Cognitive Sciences, vol. 19, no. 6, pp. 349–357, 2015.
[43] S. Liu, D. Huang, and Y. Wang, "Receptive field block net for accurate and fast object detection," in Eur. Conf. Comput. Vis., 2018, pp. 385–400.
[44] L. Wolf, S. Bileschi, and E. Meyers, "Perception strategies in hierarchical vision systems," in IEEE Conf. Comput. Vis. Pattern Recog., 2006, pp. 2153–2160.
[45] N. Kruger, P. Janssen, S. Kalkan, M. Lappe, A. Leonardis, J. Piater, A. J. Rodriguez-Sanchez, and L. Wiskott, "Deep hierarchies in the primate visual cortex: What can we learn for computer vision?" IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1847–1871, 2012.
[46] Q. Wang, Y. Yuan, P. Yan, and X. Li, "Saliency detection by multiple-instance learning," IEEE Transactions on Cybernetics, vol. 43, no. 2, pp. 660–672, 2013.
[47] Q. Wang, Y. Yuan, and P. Yan, "Visual saliency by selective contrast," IEEE Trans. Circ. Syst. Video Technol., vol. 23, no. 7, pp. 1150–1155, 2012.
[48] G. Zhu, Q. Wang, Y. Yuan, and P. Yan, "Learning saliency by MRF and differential threshold," IEEE Transactions on Cybernetics, vol. 43, no. 6, pp. 2032–2043, 2013.
[49] G. Li and Y. Yu, "Visual saliency based on multiscale deep features," in IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 5455–5463.
[50] Y. Liu, M.-M. Cheng, X. Zhang, G.-Y. Nie, and M. Wang, "DNA: Deeply-supervised nonlinear aggregation for salient object detection," arXiv preprint arXiv:1903.12476, 2019.
[51] Y. Qiu, Y. Liu, and J. Xu, "MiniSeg: An extremely minimum network for efficient COVID-19 segmentation," arXiv preprint arXiv:2004.09750, 2020.
[52] Y. Qiu, Y. Liu, H. Yang, and J. Xu, "A simple saliency detection approach via automatic top-down feature fusion," Neurocomputing, vol. 388, pp. 124–134, 2020.
[53] Y. Qiu, Y. Liu, X. Ma, L. Liu, H. Gao, and J. Xu, “Revisiting multi-level feature fusion: A simple yet effective network for salient object detection,” in IEEE Int. Conf. Image Process., 2019, pp. 4010–4014.
[54] Y. Liu, M.-M. Cheng, X. Hu, J.-W. Bian, L. Zhang, X. Bai, and J. Tang, “Richer convolutional features for edge detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 8, pp. 1939–1946, 2019.
[55] Y. Liu, M.-M. Cheng, D.-P. Fan, L. Zhang, J. Bian, and D. Tao, “Semantic edge detection with diverse deep supervision,” arXiv preprint arXiv:1804.02864, 2018.
[56] G. Li and Y. Yu, “Deep contrast learning for salient object detection,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 478–487.
[57] Y. Zeng, H. Lu, L. Zhang, M. Feng, and A. Borji, “Learning to promote saliency detectors,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 1644–1653.
[58] X. Chen, A. Zheng, J. Li, and F. Lu, “Look, perceive and segment: Finding the salient objects in images via two-stream fixation-semantic CNNs,” in Int. Conf. Comput. Vis., 2017, pp. 1050–1058.
[59] T. Wang, A. Borji, L. Zhang, P. Zhang, and H. Lu, “A stagewise refinement model for detecting salient objects in images,” in Int. Conf. Comput. Vis., 2017, pp. 4019–4028.
[60] T. Zhao and X. Wu, “Pyramid feature attention network for saliency detection,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 3085–3094.
[61] S. Chen, X. Tan, B. Wang, and X. Hu, “Reverse attention for salient object detection,” in Eur. Conf. Comput. Vis., 2018, pp. 234–250.
[62] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. Torr, “Deeply supervised salient object detection with short connections,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 4, pp. 815–828, 2019.
[63] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 770–778.
[64] J.-X. Zhao, J. Liu, D.-P. Fan, Y. Cao, J. Yang, and M.-M. Cheng, “EGNet: Edge guidance network for salient object detection,” in Int. Conf. Comput. Vis., 2019, pp. 8779–8788.
[65] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, “Quantized convolutional neural networks for mobile devices,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 4820–4828.
[66] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in Eur. Conf. Comput. Vis., 2016, pp. 525–542.
[67] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in Int. Conf. Comput. Vis., 2017, pp. 2736–2744.
[68] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz, “Importance estimation for neural network pruning,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 11264–11272.
[69] J. B. Tenenbaum, C. Kemp, T. L. Griffiths, and N. D. Goodman, “How to grow a mind: Statistics, structure, and abstraction,” Science, vol. 331, no. 6022, pp. 1279–1285, 2011.
[70] J. Benois-Pineau and P. Le Callet, Visual Content Indexing and Retrieval with Psycho-Visual Models. Springer, 2017.
[71] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Int. Conf. Mach. Learn., 2015, pp. 448–456.
[72] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Int. Conf. Comput. Vis., 2015, pp. 1026–1034.
[73] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
[74] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, 2018.
[75] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 2881–2890.
[76] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 4700–4708.
[77] C. Morand, J. Benois-Pineau, J.-P. Domenger, J. Zepeda, E. Kijak, and C. Guillemot, “Scalable object-based video retrieval in HD video databases,” Signal Processing: Image Communication, vol. 25, no. 6, pp. 450–465, 2010.
[78] Y. Yuan, Z. Xiong, and Q. Wang, “VSSA-NET: Vertical spatial sequence attention network for traffic sign detection,” IEEE Trans. Image Process., vol. 28, no. 7, pp. 3423–3434, 2019.
[79] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 7132–7141.
[80] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Int. Conf. Mach. Learn., 2010, pp. 807–814.
[81] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[82] A. M. Obeso, J. Benois-Pineau, M. S. G. Vázquez, and A. A. R. Acosta, “Dropping activations in convolutional neural networks with visual attention maps,” in Int. Conf. Content-Based Multimedia Indexing. IEEE, 2019, pp. 1–4.
[83] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in AISTATS, 2015, pp. 562–570.
[84] Z. Wu, L. Su, and Q. Huang, “Cascaded partial decoder for fast and accurate salient object detection,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 3907–3916.
[85] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in IEEE Conf. Comput. Vis. Pattern Recog., 2013, pp. 1155–1162.
[86] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan, “Learning to detect salient objects with image-level supervision,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 136–145.
[87] V. Movahedi and J. H. Elder, “Design and perceptual validation of performance measures for salient object segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2010, pp. 49–56.
[88] M.-M. Cheng, N. J. Mitra, X. Huang, and S.-M. Hu, “SalientShape: Group saliency in image collections,” The Visual Computer, vol. 30, no. 4, pp. 443–453, 2014.
[89] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004.
[90] Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand, “What do different evaluation metrics tell us about saliency models?” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 3, pp. 740–757, 2018.
[91] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 248–255.
[92] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, “A deep multi-level network for saliency prediction,” in Int. Conf. Pattern Recog. IEEE, 2016, pp. 3488–3493.
[93] J. Pan, E. Sayrol, X. G.-i. Nieto, C. C. Ferrer, J. Torres, K. McGuinness, and N. E. O'Connor, “SalGAN: Visual saliency prediction with adversarial networks,” in CVPR Scene Understanding Workshop (SUNw), 2017.
[94] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, “Predicting human eye fixations via an LSTM-based saliency attentive model,” IEEE Trans. Image Process., vol. 27, no. 10, pp. 5142–5154, 2018.
[95] S. Jia and N. D. Bruce, “EML-NET: An expandable multi-layer network for saliency prediction,” Image and Vision Computing, p. 103887, 2020.
[96] S. Yang, G. Lin, Q. Jiang, and W. Lin, “A dilated inception network for visual saliency prediction,” IEEE Trans. Multimedia, vol. 22, no. 8, pp. 2163–2176, 2019.
[97] M. Jiang, S. Huang, J. Duan, and Q. Zhao, “SALICON: Saliency in context,” in IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 1072–1080.