IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 72, 2023 5006915

An Improved SSD-Like Deep Network-Based Object Detection Method for Indoor Scenes
Jianjun Ni , Senior Member, IEEE, Kang Shen , Yan Chen , and Simon X. Yang , Senior Member, IEEE

Abstract— Indoor scene object detection is of important research significance and is one of the popular research topics in the field of scene understanding for indoor robots. In recent years, solutions based on deep learning have achieved good results in object detection. However, there are still some problems to be further studied in indoor object detection methods, such as the lighting and occlusion problems caused by the complexity of the indoor environment. Aiming at these problems, an improved object detection method based on deep neural networks is proposed in this article, which uses a framework similar to the single-shot multibox detector (SSD). In the proposed method, an improved ResNet50 network is used to enhance the transmission of information and improve the feature expression capability of the feature extraction network. At the same time, a multiscale contextual information extraction (MCIE) module is used to extract the contextual information of the indoor scene, so as to improve the indoor object detection effect. In addition, an improved dual-threshold non-maximum suppression (DT-NMS) algorithm is used to alleviate the occlusion problem in indoor scenes. Finally, the public dataset SUN2012 is further screened for the special application of indoor scene object detection, and the proposed method is tested on this dataset. The experimental results show that the mean average precision (mAP) of the proposed method can reach 54.10%, which is higher than those of the state-of-the-art methods.

Index Terms— Deep network, indoor scene, object detection, ResNet50 network, single-shot multibox detector (SSD) algorithm.

Manuscript received 18 October 2022; revised 19 January 2023; accepted 30 January 2023. Date of publication 14 February 2023; date of current version 24 February 2023. This work was supported in part by the National Natural Science Foundation of China under Grant 61873086 and in part by the Science and Technology Support Program of Changzhou under Grant CE20215022. The Associate Editor coordinating the review process was Dr. Jing Lei. (Corresponding author: Jianjun Ni.)

Jianjun Ni, Kang Shen, and Yan Chen are with the College of Internet of Things Engineering, Hohai University, Changzhou, Jiangsu 213022, China (e-mail: [email protected]; [email protected]; [email protected]).

Simon X. Yang is with the Advanced Robotics and Intelligent Systems (ARIS) Laboratory, School of Engineering, University of Guelph, Guelph, ON N1G 2W1, Canada (e-mail: [email protected]).

Digital Object Identifier 10.1109/TIM.2023.3244819

I. INTRODUCTION

INDOOR robots have developed rapidly recently and are used for some specific scenarios, such as home cleaning and elderly care services [1], [2], [3], [4]. The further expansion of robot application scenarios requires that the functions of robots are no longer limited to mechanized or programmed operation, narrow human-computer interaction, and so on. It is expected that indoor robots can realize intelligent operation by perceiving the surrounding environment and can realize the interaction among robot, human, and environment. Thus, how to make robots interact with the environment and people through the understanding of images, as people do, has become a hot research topic. This is also the main research content of computer vision [5], [6].

The aim of computer vision for robots is to enable robots to observe and understand the world through vision and to adapt to the environment independently. This is one of the key research contents of scene understanding for indoor robots [7]. Scene understanding is the premise of autonomous decision-making and behavior optimization of intelligent robots in a complex indoor environment and is also the key to determining the function and performance of indoor robots [8], [9].

The task of scene understanding is closely related to the information of important objects in the scene, so research on object detection has very important practical significance. Indoor scenes are closely related to people's lives and are one of the hot research topics in the field of computer vision. The key of indoor scene object detection technology is to determine the location, size, and category of objects, which has important research significance in the indoor robot field, such as accurate positioning and navigation of indoor mobile robots [10], [11].

Before the emergence of deep learning technology, traditional object detection methods were mainly based on hand-designed features, such as the scale-invariant feature transform (SIFT) [12], Haar-like features (HAAR) [13], histograms of oriented gradients (HOG) [14], and so on. Due to factors of the object itself and the imaging environment, methods based on manually designed features have problems of poor robustness, poor generalization, and low detection accuracy [15]. The performance of these traditional methods is poor in complex indoor scenes, and they can hardly meet people's demand for high-performance object detection.

With the development of deep learning technology, object detection algorithms based on deep learning have become the major method. Object detection algorithms that learn features using convolutional networks can automatically discover the features required for detecting and classifying objects and transform the original input information into more abstract and higher dimensional features. The high-dimensional features have good ability in feature expression and generalization, so the performance of deep learning-based object detection algorithms in complex scenes is better.

Thus, the main aim of this study is to improve the performance of the indoor object detection algorithm based on deep learning technology.
Researchers have obtained good results in studies on indoor object detection based on deep learning, but there are still many difficulties and deficiencies, such as the lighting problem, the occlusion problem, and the multiobject detection problem. To deal with these problems, an improved deep neural network-based method is proposed for object detection in indoor scenes.
The main contributions of this article are summarized as follows. First, an improved model based on deep networks for indoor scene object detection is proposed, which uses a single-shot multibox detector (SSD)-like framework. The proposed model is an integration of several improved deep neural networks. Second, in the feature extraction module, an improved ResNet50 network is used to enhance the transmission of information and improve the ability of feature expression. Third, in the proposed deep network, an improved multiscale contextual information extraction (MCIE) module is presented to extract the contextual information of indoor objects, which can improve the accuracy of indoor object detection. Fourth, an improved dual-threshold non-maximum suppression (DT-NMS) algorithm is used in the proposed object detection model to alleviate the occlusion problem in complex indoor scenes.

This article is organized as follows. Section II gives an overview of the related works. The proposed method is presented in Section III. In Section IV, the proposed method is compared with classical and advanced object detection methods on the public dataset. Section V gives the ablation study on the improvements in the proposed model. Section VI discusses the settings of important parameters in the proposed model. Finally, the conclusion is given in Section VII.

II. RELATED WORKS

In this section, we briefly review the research tasks of indoor scene perception and the work related to indoor object detection based on deep learning methods.

A. Research of Indoor Scene Perception

With the rapid development of robotic technology and the rise of indoor tasks, more and more researchers turn their attention to indoor scene understanding, which is closely related to human life. By understanding the nature of indoor scenes and evaluating their spatial structure, it can provide important geometric details for various activities, such as indoor navigation, scene recognition, virtual reality, and so on [16], [17], [18].

Compared with outdoor scenes, there are many limitations in indoor scenes because of the complex and messy interior decoration. At the same time, the different lighting conditions and occlusion problems in indoor scenes must also be paid attention to. Some examples of complex indoor scenes are shown in Fig. 1.

Fig. 1. Examples of complex indoor scenes. Lighting problems in (a) and (b): uneven indoor lighting in (a) and outdoor lighting in (b) cause shadows in the indoor scene. Occlusion problems in (c) and (d): the window in (c) is blocked by curtains, and the chair in (d) is blocked by blankets.

Recently, a number of indoor scene classification tasks have been extensively studied [19], [20]. For example, Jiang et al. [21] presented a semantic rule reasoning framework for indoor scene classification. This framework highlights the importance of visual attributes and improves the classification performance of indoor scenes. Ji et al. [22] proposed an indoor scene classification strategy based on a multiple descriptors fusion model. Multimodal fusion has been shown to help improve the performance of scene classification tasks. For example, Mosella-Montoro and Ruiz-Hidalgo [23] proposed a method based on a 2-D–3-D fusion network for indoor scene classification, where the geometric information of a 3-D point cloud and the 2-D texture information are fused. Glavan and Talavera [24] presented an indoor scene recognition model based on the fusion of visual features and transcribed speech to text.

In addition to indoor scene classification tasks, a large number of tasks on indoor semantic segmentation [25], layout estimation [26], and scene reconstruction [27] have also been widely studied. For example, Sun et al. [28] presented a patch graph convolution network framework that uses the surface patches of indoor point clouds as the data representation to perform efficient semantic segmentation. Zhang et al. [29] proposed an end-to-end layout estimation framework, using transfer learning to learn the mapping from edge maps to keypoint coordinates, which has achieved good layout estimation performance on benchmark datasets. Huan et al. [30] proposed a semantic 3-D reconstruction method based on a multitask neural network for RGB+depth map (RGB-D) indoor scenes, where a geometric extractor is built to learn geometry-enhanced feature representations.


As introduced above, there are lots of related works on indoor scene perception tasks, and deep learning-based methods are the hot research spots in this field. In these related works, object detection is a foundation for scene classification, semantic segmentation, and so on. Thus, the main objective of this article is to improve the object detection accuracy for indoor scenes.

B. Indoor Object Detection Methods Based on Deep Learning

The indoor scene is a key research topic in the field of object detection and has great practical value for indoor robot applications. At the same time, with the development of deep learning, deep learning-based algorithms have become the main research direction of object detection. We give a brief review of deep learning-based object detection methods for indoor scenes.

Object detection methods based on deep learning technology can be roughly divided into two categories according to the characteristics of their process: single-stage object detection methods and two-stage object detection methods. The main representatives of the two-stage object detection methods are the regions with convolutional neural network (R-CNN) features series, including Fast R-CNN and Faster R-CNN [31], [32]. Such methods are based on the region proposal network (RPN) and the theory of neural network classification. There are two stages in the detection process: first, the RPN is used to generate the preselection boxes; then, the classification and regression of the preselection boxes are realized through the detection network. This kind of detection method has high detection accuracy, but the detection speed is slow. Single-stage object detection methods are represented by SSD [33], [34] and the you only look once (YOLO) series [35], [36]. SSD and YOLO series models have been widely studied and used [37] and will not be introduced in this article. Two other classic and common single-stage methods, namely, RetinaNet [38] and CenterNet [39], are introduced briefly here. RetinaNet consists of a backbone network and two prediction branch networks, where a ResNet architecture is used for the backbone network, and a pyramid structure is used for feature fusion. CenterNet is an anchor-free object detection network, which mainly includes ResNet50 to extract image features and a deconvolution (Deconv) module to upsample the feature map. Single-stage object detection methods remove the RPN stage and can directly obtain the detection results, so the detection speed is fast, which is suitable for the requirements of indoor robots. In this article, we focus on the single-stage object detection method for indoor scenes and present an SSD-like deep network-based method.

Recently, there have been lots of improvements of deep learning-based methods for indoor object detection. For example, Afif et al. [40] proposed a deep convolutional neural network for a fully labeled indoor object dataset. Chen et al. [41] added a "gradient density" parameter to the loss function of YOLOv3 to improve the imbalance of sample categories and the accuracy of indoor object detection. Ding et al. [42] focused on the region-based CNN detector and proposed a geometry-based Faster R-CNN for indoor object detection. Similar to the geometric property-based approach, Samani et al. [43] proposed to use topological persistence features that rely on target object shape information to improve the robustness of indoor object detection. The methods above have achieved some good results. However, there are still some limitations in these methods. For example, the method in [41] only considered the imbalance of sample categories in indoor environment detection, and its indoor scenes were too simple to reflect the inherent attributes of indoor scenes (i.e., there are a lot of semantic categories and occlusion problems among the objects, the visual features lack strong identification ability, and the indoor light is not controllable).

There has been a lot of research on the contextual information around objects in recent years. For example, Alamri and Pugeault [44] put forward a contextual model of rescoring and remarking, as a postprocessing step, to improve the detection performance and help correct false detections based on the contextual information. Ma et al. [45] constructed a fusion subnetwork for the local contextual information and object-object relationship contextual information, to handle the diversity and complexity of geospatial object appearance. Zhang et al. [46] proposed a global context-aware attention network, which utilizes both a global feature pyramid and attention strategies. These studies have proven that contextual information can eliminate the influence of external factors and improve detection accuracy. However, most studies on contextual information extract only the contextual information of a certain layer or the last layer of the deep network, so the extracted features are scarce, which affects the detection accuracy of small or occluded objects. To solve this kind of problem, this article proposes an improved MCIE module to extract multiple layers of contextual information at different scales, so as to obtain richer features.

There are also many studies that focus on specific problems of indoor object detection. For example, Wang and Tan [47] studied and validated the low confidence of robotic visual object detection in dark environments, and a cycle generative adversarial networks (CycleGANs)-based method of brightness migration was presented, which can improve the detection performance in dark environments. Huang et al. [48] used a multiscale feature fusion method to improve the detection effect for various types of objects, especially objects with smaller scales. Zhang et al. [49] presented a set of object detection algorithms based on point cloud templates to deal with the problems in complex indoor environments. These methods obtained some better results on a specific problem. However, the comprehensive performance of these methods should be further improved. For example, the multiscale fusion method in [48] slows down the detection speed, and the dataset images used did not consider the occlusion problem of indoor scenes. The categories of objects detected in [49] were far fewer than those of indoor scene objects, and the occlusion rate set in that article was also much lower than the occlusion situation in real indoor scenes.

As introduced above, there are lots of excellent research results of deep learning-based algorithms for indoor object detection. However, there are still many problems that need to be further studied, which is the main purpose of this article.


Fig. 2. Whole structure of the proposed indoor object detection model based on deep network.

III. PROPOSED INDOOR OBJECT DETECTION METHOD

In this article, an improved deep network model for indoor scene object detection based on multiscale contextual information is proposed. The main idea of the proposed method is that the placement of objects in indoor scenes is affected by human living habits, so the relative positions between objects are subject to certain rules. The use of these rules will help to detect indoor objects. To realize this idea, an improved indoor object detection model based on a deep network is proposed in this article, which is shown in Fig. 2.

As shown in Fig. 2, the proposed model adopts an SSD-like framework, and there are three main parts in it. The first part is the feature extraction module, which is composed of an improved ResNet50 [50] and five additional layers to extract object features and obtain feature maps at different scales. The second part is the MCIE module, which uses an improved multiscale module based on the atrous spatial pyramid pooling (ASPP) structure [51] to further extract the contextual information under different scale conditions. The third part is the object detection module. In the object detection module, a multiscale feature detection (MSFD) network is used to send the feature maps extracted from the six MCIE modules into the detection module for positioning and classification. The detection module obtains the confidence and coordinates of the objects in the six feature maps. Then, an improved DT-NMS algorithm is used to filter out the repeatedly detected objects while considering the highly overlapping objects. The network structure proposed above is described in detail as follows.

Remark 1: The proposed model has a similar framework to the SSD network. However, there are many more improvements in the proposed model, which mainly further improve the accuracy of indoor scene object detection by enriching the extracted features and improving the postprocessing method. In terms of enriching the extracted features, a shortcut-connected ResNet50 network is selected as the network for feature extraction, while an improved MCIE module is proposed to extract multilayer contextual information at different scales. In addition, the postprocessing mode is improved by introducing the distance-intersection over union (DIoU) instead of the traditional intersection over union (IoU) while limiting the detection boxes through two thresholds.
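Read as code, this three-part pipeline is simply a composition of modules. The sketch below is a minimal PyTorch skeleton matching the description above; all module names and constructor arguments are placeholders for illustration, not the authors' implementation.

```python
import torch.nn as nn

class SSDLikeDetector(nn.Module):
    """Skeleton of the three-part model: backbone -> per-scale MCIE -> MSFD."""
    def __init__(self, backbone, mcie_modules, heads, postprocess):
        super().__init__()
        self.backbone = backbone                  # improved ResNet50 + extra layers
        self.mcie = nn.ModuleList(mcie_modules)   # one MCIE module per feature map
        self.heads = nn.ModuleList(heads)         # MSFD location/class branches
        self.postprocess = postprocess            # improved DT-NMS

    def forward(self, images):
        feats = self.backbone(images)             # six feature maps at six scales
        feats = [m(f) for m, f in zip(self.mcie, feats)]
        preds = [h(f) for h, f in zip(self.heads, feats)]
        return self.postprocess(preds)            # drop duplicates, keep occluded hits
```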


A. Feature Extraction Module Based on Improved ResNet50

In this article, an improved ResNet50 network is used as the backbone for feature extraction. The layers of ResNet50 corresponding to conv5_x, the average pooling layer, and the fully connected layer are removed; only conv1, conv2_x to conv4_x, and their corresponding series of layer structures are used. The stride of the first residual structure in the series of residual structures corresponding to the conv4_x layer is modified to 1, and the output of the conv4_x layer is taken as the first feature map layer. The network structure of the proposed ResNet50 network in the feature extraction module is shown in Fig. 3.

Fig. 3. Network structure of the proposed ResNet50 network in the feature extraction module.

The receptive field is the size of the region mapped on the input image by the pixels of the feature map output from each layer of the convolutional neural network. Because feature maps of different sizes have different receptive fields, objects of different sizes can be detected. As we know, a larger feature map has a small receptive field, which mainly contains local detail features and is suitable for detecting relatively small objects. A smaller feature map has a large receptive field, which is suitable for detecting relatively large objects. A 1 × 1 convolution layer is sequentially added to each of the residual structures of ResNet50, using a shortcut connection to transfer the input of the previous layer directly to the later connected layer. Here, the shortcut connection is used to strengthen information transmission and feature reuse, so that the extracted features contain more detailed features. For ease of presentation, we refer to the improved layer structures as Rconv2_x, Rconv3_x, and Rconv4_x, respectively (see Fig. 3).

The most important contribution of the ResNet network is the introduction of residual blocks, which help to deal with the problems of gradient explosion and gradient disappearance. They make it possible not only to train deeper networks but also to ensure the accuracy of the model. The residual blocks are implemented through shortcut connections. The two types of shortcut connection used in this article are shown in Fig. 4, where Fig. 4(a) and (b) shows the structures of the convolution blocks and the identity blocks in Fig. 3, respectively.

Fig. 4. (a) Convolution block and (b) identity block of the improved ResNet50 network.

In Fig. 4(a), a 1 × 1 convolution layer is used to adjust the number of channels when the numbers of input and output channels are different. The calculation method is as follows:

y = F(x) + Wx (1)

where x and y are the input and output vectors of the considered layer; F(·) refers to the residual function, which is obtained through model learning; and W represents the 1 × 1 convolution operation, which is used to adjust the channel dimension of the input.

In Fig. 4(b), the numbers of input and output channels are the same, and the specific calculation method is as follows:

y = F(x) + x. (2)

Remark 2: Through the shortcut connection, the input and output of the considered layer perform an elementwise addition operation. This simple addition operation does not increase the number of parameters or the computational complexity of the network, but it can greatly improve the training speed and effect of the model. Moreover, this structure can deal with the degradation problem well while extracting richer features when the number of layers of the network is increased.
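To make the two shortcut variants of (1) and (2) concrete, the following is a minimal PyTorch sketch of the two block types; the kernel sizes and layer widths are illustrative assumptions, not the exact configuration of the modified ResNet50.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Residual unit of (1): a 1 x 1 projection W aligns channels on the shortcut."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.residual = nn.Sequential(  # F(x), the learned residual function
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out),
        )
        self.project = nn.Conv2d(c_in, c_out, 1)  # W: 1 x 1 conv on the shortcut

    def forward(self, x):
        return torch.relu(self.residual(x) + self.project(x))  # y = F(x) + Wx

class IdentityBlock(nn.Module):
    """Residual unit of (2): channel counts match, so the shortcut is the identity."""
    def __init__(self, c):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c),
        )

    def forward(self, x):
        return torch.relu(self.residual(x) + x)  # y = F(x) + x
```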
After the proposed ResNet network, a series of additional layers is inserted, where the internal composition of each additional layer structure is the same (see Fig. 2). In this study, the batch normalization (BN) layer is used to accelerate the convergence speed and improve the detection accuracy of the model [52]. The whole feature extraction network outputs six feature maps of different sizes, the same as the original SSD. The sizes of these six-scale feature maps are 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3, and 1 × 1, respectively. To describe the architecture of the feature extraction module in more detail, the structure parameters of the proposed feature extraction module are shown in Fig. 5.

Fig. 5. Structure parameters of the proposed feature extraction module.

B. MCIE Module

Contextual information has been demonstrated by many studies to improve the accuracy of object detection [53], [54], [55]. In this article, an improved MCIE module based on the ASPP structure is used to fully extract the contextual features from the six feature maps generated by the feature extraction module. The main reason for using the ASPP structure is that it introduces the atrous convolution, which samples the given input in parallel with atrous convolution layers of different rates. In this way, it is equivalent to capturing the contextual information of the image at multiple scales. A schematic of the atrous convolution is shown in Fig. 6.

Fig. 6. Schematic of the atrous convolution.

Compared with ordinary convolution, the atrous convolution has another parameter, the atrous rate, besides the size of the convolution kernel, which is mainly used to express the size of the sampling interval of the input matrix.
The difference between the atrous convolution and ordinary convolution is that the atrous convolution has a larger receptive field, which means that more contextual information can be captured by the atrous convolution. The receptive field of the atrous convolution is calculated as follows:

\mathrm{RF}_n = \mathrm{RF}_{n-1} + \bigl( ((f_K - 1) \cdot D + 1) - 1 \bigr) \prod_{i=0}^{n-1} S_i \quad (3)

where RF_n represents the receptive field of layer n (the receptive field of layer 0 is 1), f_K represents the size of the convolution kernel, D is the atrous rate, and S_i represents the stride of layer i.
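As a quick check of (3), the recursion can be evaluated directly; the layer list in the example below is a made-up configuration for illustration, not the actual network.

```python
def receptive_field(layers):
    """Evaluate (3). `layers` lists (f_K, D, S) per layer: kernel size,
    atrous rate, and stride. RF_0 = 1, and the stride product runs over
    the layers preceding the current one."""
    rf, stride_prod = 1, 1
    for f_k, d, s in layers:
        rf += (((f_k - 1) * d + 1) - 1) * stride_prod
        stride_prod *= s
    return rf

# A single 3 x 3 convolution with atrous rate 5 covers an 11 x 11 region:
print(receptive_field([(3, 5, 1)]))  # -> 11
```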
In the traditional ASPP structure, four parallel atrous convolution layers with rates of 6, 12, 18, and 24 are used. Compared with ordinary convolution, the atrous convolution can obtain a larger receptive field with the same number of parameters. However, using convolution kernels with the large atrous rates of 6, 12, 18, and 24 samples the input information too sparsely, which achieves better results for the semantic information acquisition of large targets but loses local information for small targets. To deal with this problem, a novel MCIE module is presented in this study, which is shown in Fig. 7. The proposed MCIE module is introduced as follows.

Fig. 7. Structure of the proposed MCIE module.

1) In the MCIE module, a traditional convolution layer is added before each atrous convolution layer. The main reason for using this structure is that a traditional convolution layer combined with an atrous convolution layer of a corresponding atrous rate can be used to simulate the human visual perception mode, to enhance the robustness and discriminability of the features [56]. In this study, the kernel size of each traditional convolution is equal to the atrous rate of the next atrous convolution layer, and atrous convolutions with rate = 3, 5, 9, and 17 are used, where rate = 3 and rate = 5 are used to learn the features of small-scale objects, and rate = 9 and rate = 17 are used to learn the features of large-scale objects. In order to further reduce the network parameters and save computing cost, the larger traditional convolutions added to the MCIE module are factorized into smaller convolutions. Namely, two consecutive layers of 3 × 3 convolution are used to replace a 5 × 5 convolution, four layers of 3 × 3 convolution are used to replace a 9 × 9 convolution, and so on. In this way, it not only reduces the number of parameters but also deepens the network [57].

2) An additional 1 × 1 convolution is used before each traditional convolution, which avoids sampling directly on the feature map with atrous convolutions and improves the density of sampling.

3) Then, the shallow feature map information is directly transferred to the deeper output through a shortcut connection, the features in the input feature map are reused, and the possible degradation problem of the deep network is solved at the same time.

4) Using the adaptive average pooling layer, the feature map of each channel is compressed to 1 × 1, so as to extract the features of each channel and then obtain the global features. With the adaptive average pooling layer, the kernel size and stride do not need to be specified; only the final output size needs to be designed. Then, a 1 × 1 convolution layer is used for further extraction of the features obtained in the previous step and for dimensionality reduction. At last, the feature map is upsampled from 1 × 1 back to the original size.

5) Finally, the features of the above scales are fused, and the number of channels is reduced by a 1 × 1 convolution to the expected value to obtain the output result.

Remark 3: The proposed MCIE module can provide multiscale receptive fields by using the parallel atrous convolution layers with different atrous rates, avoiding to a certain extent the disadvantages caused by too large or too small receptive fields. In addition, the number of parameters in the proposed MCIE module is reduced by about 65.6% compared with the MCIE structure with four traditional convolution layers that are not factorized (see the proposed network structure in Fig. 7).
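The following is a minimal sketch of this branch layout under the stated design (a 1 × 1 convolution, a plain convolution factorized into stacked 3 × 3 layers in front of each atrous convolution, a global adaptive-average-pooling branch, a shortcut, and a 1 × 1 fusion). The channel widths are illustrative assumptions, and the rate set is taken from the rate = 3, 5, 9, 17 example above rather than from the exact per-feature-map configuration of Table II.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCIEBranch(nn.Module):
    """One branch: 1 x 1 conv, factorized plain convs, then an atrous conv."""
    def __init__(self, c_in, c_mid, rate):
        super().__init__()
        n_stack = (rate - 1) // 2  # two 3 x 3 convs ~ one 5 x 5, four ~ one 9 x 9
        layers = [nn.Conv2d(c_in, c_mid, 1)]
        layers += [nn.Conv2d(c_mid, c_mid, 3, padding=1) for _ in range(n_stack)]
        layers += [nn.Conv2d(c_mid, c_mid, 3, padding=rate, dilation=rate)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

class MCIE(nn.Module):
    def __init__(self, c_in, c_out, rates=(3, 5, 9, 17), c_mid=64):
        super().__init__()
        self.branches = nn.ModuleList([MCIEBranch(c_in, c_mid, r) for r in rates])
        self.gap_conv = nn.Conv2d(c_in, c_mid, 1)  # global-context branch
        self.fuse = nn.Conv2d(len(rates) * c_mid + c_mid + c_in, c_out, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        g = F.adaptive_avg_pool2d(x, 1)                    # compress each channel to 1 x 1
        g = F.interpolate(self.gap_conv(g), size=(h, w))   # upsample back to input size
        feats += [g, x]                                    # shortcut reuses the input map
        return self.fuse(torch.cat(feats, dim=1))          # 1 x 1 conv fuses all scales
```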
C. MSFD Module

1) Structure of the MSFD Module: The last part of the proposed model is the MSFD module, which extracts location information and classification information based on the six feature maps generated by the feature extraction module and the MCIE module. Its structure is shown in Fig. 8.

Fig. 8. Structure of the MSFD module.

As shown in Fig. 8, this module is mainly composed of three branches. The first branch is the default box generation layer, which generates default boxes for each position in the feature map and calculates the normalized top-left and bottom-right coordinates of each default box relative to the input image, as well as the coordinate variance value. The second branch passes through a 3 × 3 convolution to generate a feature for the bounding box regression. The size of this feature is as follows:

SizeOutput2 = [1, 4 × n, h, w] (4)

where n is the number of default boxes, and h and w are the height and width of the feature map, respectively. The third branch of the MSFD module also passes through a 3 × 3 convolution to generate a feature for softmax to classify target and nontarget, where the size of the feature is as follows:

SizeOutput3 = [1, c × n, h, w] (5)
where c is the number of target categories. In this article, c = 28 (including 27 object categories and one background). Thus, the detection layer mainly has three tasks: generating default boxes, giving the confidence score of the target category, and calculating the coordinate position information of the bounding boxes. The boxes are then further filtered by the object confidence in the bounding boxes and processed by an improved DT-NMS algorithm. Finally, the box with the highest score is retained to obtain the object detection result.
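A minimal sketch of the two convolutional prediction branches and the output shapes of (4) and (5); the default-box count per location and the feature-map channel width are assumptions for illustration.

```python
import torch
import torch.nn as nn

n, c = 6, 28                          # default boxes per location; 27 classes + background
feat = torch.randn(1, 256, 19, 19)    # one of the six feature maps (channels assumed)

loc_head = nn.Conv2d(256, 4 * n, kernel_size=3, padding=1)   # box regression branch
conf_head = nn.Conv2d(256, c * n, kernel_size=3, padding=1)  # classification branch

print(loc_head(feat).shape)   # torch.Size([1, 24, 19, 19])  -> [1, 4*n, h, w] as in (4)
print(conf_head(feat).shape)  # torch.Size([1, 168, 19, 19]) -> [1, c*n, h, w] as in (5)
```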
2) Improved DT-NMS Algorithm: The final step of the MSFD module is to filter out the repeatedly predicted target objects while considering the highly overlapping objects. In most object detection methods, the traditional NMS is often used to process the detection boxes, which is described as follows:

s_f = \begin{cases} s_i, & \mathrm{IoU}(M, B_i) < N_t \\ 0, & \mathrm{IoU}(M, B_i) \ge N_t \end{cases} \quad (6)

where s_i is the original score of the ith detection box; s_f is the final score of the detection box; M is the detection box with the highest score; B_i is the ith box to be detected; IoU(M, B_i) is the IoU ratio between M and box B_i; and N_t is a preset threshold, which is used to retain the original scores of the detection boxes whose IoU does not exceed this threshold.

Therefore, the traditional NMS algorithm performs a bisection operation according to a threshold. However, the distribution of indoor scene objects is complex, and object occlusion or stacking problems often occur. Such a simple threshold segmentation in the traditional NMS algorithm causes missed detection of objects with occlusion problems. When two objects appear very close to each other in the image, the box with the lower score is likely to be directly suppressed, because its IoU ratio with the highest-scoring box is greater than the preset threshold. An example of object detection based on the traditional NMS algorithm is shown in Fig. 9(a).

Fig. 9. Examples of the object detection. (a) Based on the traditional NMS algorithm. (b) Based on the soft-NMS algorithm.

As shown in Fig. 9(a), when the distance between the cushions is very close, the NMS algorithm only retains the detection box with the highest score, resulting in missed detection of the target object. To deal with this problem, some improvements to the NMS algorithm have been proposed, such as the soft-NMS algorithm, DIoU-NMS, and so on [58], [59]. However, there are still some problems in these methods. For example, the soft-NMS algorithm still has the problem of missed detection, and its rate of repeated detection and misdetection is relatively high [see Fig. 9(b)]. To balance the problems of missed detection, repeated detection, and misdetection of objects in complex indoor scenes, an improved DT-NMS algorithm is presented in this study, which is described by the following:
s_f = \begin{cases} s_i, & \mathrm{DIoU}(M, B_i) < N_t \\ s_i \times \delta, & N_t \le \mathrm{DIoU}(M, B_i) \le N_i \\ 0, & \mathrm{DIoU}(M, B_i) > N_i \end{cases} \quad (7)

\delta = 1 - \mathrm{DIoU}(M, B_i) \quad (8)

where N_i is a new preset threshold, which gives the undetermined detection boxes an upper limit value of DIoU; DIoU(M, B_i) is the distance-IoU, which can be formally defined as follows [60]:

\mathrm{DIoU}(M, B_i) = \mathrm{IoU}(M, B_i) - R_{\mathrm{DIoU}}(m, b_i) \quad (9)

where R_DIoU(m, b_i) is the penalty term for m and b_i, which can be calculated by the following:

R_{\mathrm{DIoU}}(m, b_i) = \frac{\rho^2(m, b_i)}{c^2} \quad (10)

where m and b_i are the centers of M and B_i, respectively; ρ(·) is the Euclidean distance function; and c refers to the diagonal length of the smallest enclosing box covering the two boxes. The main reason for using DIoU in this article is that both the overlap area and the center distance between two boxes are considered in DIoU, which has been proven to be a better suppression criterion [60], [61].

In this study, N_i is a DIoU upper limit for the pending detection boxes. A detection box with too high a DIoU can be regarded as a repeated detection of the same object: its confidence score is directly set to zero, and such detection boxes are suppressed. Therefore, the setting of the upper limit threshold of DIoU can solve the repeated detection problem to some extent. For detection boxes with a DIoU less than N_i, the improved DT-NMS algorithm adds a penalized decay function to the traditional NMS algorithm. It is specified that the larger the DIoU with the highest-scoring detection box, the smaller the confidence value; then, the boxes below the confidence threshold are judged as redundant boxes, and those above the confidence threshold are retained. Most of the retained boxes are correct detections of occluded objects, which can solve a part of the missed detection problem.

Remark 4: Based on the proposed DT-NMS algorithm, the condition of multiple thresholds restricts the number of dense redundant detection boxes and also ensures that the detection boxes of occluded objects are preserved. Meanwhile, DIoU is introduced to consider the center distance between two boxes and the overlap area simultaneously, which is beneficial to object detection in cases with occlusion. Thus, the problems of repeated, missed, and misdetection can be solved effectively.

The pseudocode of the improved DT-NMS algorithm is shown in Fig. 10.

Fig. 10. Pseudocode of the improved DT-NMS algorithm.
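Fig. 10 itself is an image in the original. The following is a hedged re-expression of the dual-threshold rule of (7)-(10) in plain Python, written from the equations rather than from the figure, so details such as the final confidence cutoff are assumptions.

```python
def diou(a, b):
    """Distance-IoU of (9) and (10); boxes are (x1, y1, x2, y2) tuples."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    iou = inter / (union + 1e-9)
    # rho^2: squared distance between the box centers, numerator of (10)
    rho2 = ((a[0] + a[2] - b[0] - b[2]) ** 2
            + (a[1] + a[3] - b[1] - b[3]) ** 2) / 4.0
    # c^2: squared diagonal of the smallest enclosing box, denominator of (10)
    c2 = ((max(a[2], b[2]) - min(a[0], b[0])) ** 2
          + (max(a[3], b[3]) - min(a[1], b[1])) ** 2 + 1e-9)
    return iou - rho2 / c2

def dt_nms(boxes, scores, n_t=0.4, n_i=0.95, score_thr=0.01):
    """Dual-threshold rule of (7): keep below N_t, decay by delta = 1 - DIoU
    of (8) between the thresholds, suppress above N_i. score_thr is assumed."""
    scores = list(scores)
    order = sorted(range(len(boxes)), key=lambda j: scores[j], reverse=True)
    keep = []
    while order:
        m = order.pop(0)          # current highest-scoring box M
        keep.append(m)
        survivors = []
        for j in order:
            d = diou(boxes[m], boxes[j])
            if d > n_i:           # near-duplicate of M: suppress outright
                continue
            if d >= n_t:          # overlapping but distinct: decay the score
                scores[j] *= 1.0 - d
            if scores[j] >= score_thr:
                survivors.append(j)
        order = sorted(survivors, key=lambda j: scores[j], reverse=True)
    return keep
```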
The workflow of the whole deep network-based indoor object detection method proposed in this study is introduced as follows.

Step 1: The indoor scene images for object detection are fed into the deep network proposed in this study.

Step 2: The feature extraction module, consisting of the improved ResNet50 module as well as the additional layers, is used to extract the general features of the target objects and obtain the feature maps at six different scales.

Step 3: The feature maps at six different scales obtained in the previous step are input into the MCIE module to extract the contextual information with different scales, and six new feature maps containing multiscale contextual information are obtained.

Step 4: Then, through the MSFD module, the location information and classification information of the feature maps are extracted.

Step 5: Finally, the results of indoor object detection are output by using the improved DT-NMS algorithm.

IV. EXPERIMENTS

A. Dataset and Metrics
There are lots of public datasets that are used for indoor scene training and testing, such as the indoorCVPR_09 and NYU depth dataset v2 (NYUv2) datasets [62], [63]. However, indoorCVPR_09 mainly aims at 3-D reconstruction, and NYUv2 focuses on semantic segmentation of indoor scenes. In this article, SUN2012 is used as the experimental dataset, which is an object detection dataset in the extensive Scene UNderstanding (SUN) database, containing 16 873 images covering many scenes (including indoor and outdoor scenes) [64]. After removing the pictures of outdoor scenes in SUN2012, 7108 pictures of indoor scenes are retained. To ensure the universality of the detected objects, the object categories in indoorCVPR_09, NYUv2, and SUN2012 are compared, and 27 common indoor object categories are used for training and evaluation in this study. The distribution of the 27 class objects in the experimental dataset is shown in Fig. 11.

Fig. 11. Distribution of the 27 class objects in the experimental dataset.

One of the reasons for choosing these 27 categories of indoor objects is that we hope to study as many indoor scenes as possible, including more indoor objects. These 27 indoor objects are common and typical representative objects in different indoor scenes and frequently appear in the bedroom, living room, bathroom, dining room, kitchen, office, and other indoor scenes. For example, "chair," "table," and "sofa" often appear in the living room; "lamp," "bed," and "pillow" are the representative objects of the bedroom; and "wall," "floor," and "ceiling" are the basic objects of indoor scenes. Another reason is to verify the superiority of the proposed network, so some challenging objects are selected, including many objects that are difficult to detect. For example, "bottle" and "faucet" tend to be small objects in pictures; "mirror" is easily affected by light reflection; and objects such as "lamp" and "plant" have intraclass differences, e.g., the same "lamp" category may have different colors and shapes.

With regard to evaluation metrics, mean average precision (mAP) is used as the main evaluation criterion, which is often used in multiobject detection tasks. mAP can be calculated by the following:

\mathrm{mAP} = \frac{\sum_{i=1}^{k} \mathrm{AP}_i}{k} \quad (11)

where k refers to the number of detected object categories (k = 27 in this article), and AP_i is the area under the precision-recall curve for the ith category (see [45] and [65] for details).

In this study, we also evaluate the detection speed of the method by the detection time per image (defined as D_t), which is used as the auxiliary evaluation metric. D_t is defined as follows:

D_t = \frac{T}{\mathrm{Num}} \quad (12)

where T refers to the total time that the network takes to run on the validation set after training is completed, and Num is the number of images in the validation set (Num = 2133 in this study).
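Both metrics reduce to one-liners once the per-class APs and the total validation runtime are available; a sketch:

```python
def mean_average_precision(per_class_ap):
    """mAP of (11): arithmetic mean of the k per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)

def detection_time_per_image(total_seconds, num_images=2133):
    """D_t of (12): total validation runtime T divided by Num."""
    return total_seconds / num_images
```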


B. Experimental Settings and Implementation Details

The experiments are conducted on a computer with the Windows 10 operating system. The proposed deep network is implemented on the framework PyTorch 1.7 with Python 3.7. The parameters of the improved indoor object detection deep network and the computer configurations for the experiments are listed in Table I. In the object detection experiments, considering the sizes of the six feature maps generated by the feature extraction module, some atrous convolutions with larger atrous rates are unnecessary, because their receptive fields would far exceed the size of the feature map. So, the unnecessary atrous convolutions in the MCIE module are removed, to preserve the effective contextual information and improve the quality of the network. The parameters of the MCIE module used during the experiments are listed in Table II.

TABLE I
PARAMETERS OF THE IMPROVED INDOOR OBJECT DETECTION DEEP NETWORK AND THE COMPUTER CONFIGURATIONS FOR THE EXPERIMENTS

TABLE II
PARAMETER SETTING OF THE MCIE MODULE IN THE PROPOSED DEEP NETWORK

The ratio of the training set to the validation set in the whole dataset is 7:3. The image size of the input for the deep network is 300 × 300. The initial learning rate of the proposed model is set as 0.0005, and the learning rate is adjusted every five epochs, with a multiple of 0.3. The loss function curve and the change curve of the learning rate of the proposed model after 50 epochs on the training set are shown in Fig. 12. The results in Fig. 12 show that the training loss stabilized after the 20th epoch. In terms of training time, the proposed deep network takes about 4 min to train one epoch in this study.

Fig. 12. Loss function curve and the change curve of the learning rate of the improved deep network after 50 epochs on the training set.
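This schedule (initial rate 0.0005, multiplied by 0.3 every five epochs) maps directly onto a PyTorch StepLR; the optimizer choice in the sketch below is an assumption, since it is not stated here.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)  # stand-in for the detection network
optimizer = torch.optim.SGD(model.parameters(), lr=5e-4, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.3)

for epoch in range(50):
    # ... one training epoch over the 7:3 train/validation split ...
    scheduler.step()               # multiply the rate by 0.3 every five epochs
```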
are easily occluded (see Fig. 13). At the same time, our method
has achieved the best detection results for objects, such as
C. Comparison Experiments “mirror,” that are vulnerable to light (see Table III). For small
To demonstrate the effectiveness of the improved method, objects, such as “bottle” and “faucet,” our method can obtain
some comparative experiments are conducted. In this study, the second best results, which is near to the best results of the
some other deep network-based methods are compared YOLOv5.


Fig. 13. Some detection results of different methods on the SUN2012 dataset.

The results in Fig. 13 show that objects such as occluded chairs, pillows, and cushions can be detected using our method. In addition, in the last three columns of the sixth row in Fig. 13, our method achieves better detection results for small objects, such as faucets, bottles, and distant microwaves. The results in these comparison experiments show that the improved method proposed in this article can get the best comprehensive results for object detection in indoor scenes.

V. ABLATION STUDY

The comprehensive performance of the improved method proposed in this article has been tested and validated through the comparative experiments in Section IV. To evaluate the performance of the key improvements in the proposed deep network-based method, some ablation experiments are conducted in this section. These key improvements include the feature extraction network, the MCIE module, and the DT-NMS algorithm.

A. On the Feature Extraction Module

The feature extraction module of the proposed network based on the SSD framework consists of a backbone and five additional layers. As we know, the backbone of the traditional SSD network is the visual geometry group network-16 (VGG16) [37], [66]. Its structure is a straight tube type, which has five sets of convolutional layers and three fully connected layers, using a maximum pooling layer in each group to reduce the spatial dimension. Recently, the ResNet network was presented to give consideration to both detection accuracy and speed [67], [68], and the most commonly used networks in object detection are ResNet50 and ResNet101. In order to verify the superiority of the improved backbone network for feature extraction, the backbone of the SSD framework in the proposed deep network-based method is changed into VGG16, ResNet50, ResNet101, and the improved ResNet101, respectively. Here, the improved ResNet101 means that the same improvements made to ResNet50 in this article are also applied to ResNet101 (see Section III-A for details). The total network structure of these methods is the same as that of the proposed method (see Fig. 2). The ablation experimental results are shown in Table IV.

The results in Table IV show that, in complex indoor scenes, the detection network using VGG16 as the backbone in the feature extraction network has a poor feature extraction ability for objects and cannot show a good object detection performance. The difference between ResNet101 and ResNet50 lies only in the number of residual units: ResNet50 has 16 residual units, and ResNet101 has 33 residual units, which means that the ResNet101 network extracts higher dimensional features and is more likely to lose small-area features and object location information [69].
TABLE IV
ABLATION EXPERIMENTAL RESULTS ON THE FEATURE EXTRACTION MODULE USING DIFFERENT BACKBONE NETWORKS

Therefore, the input of the following five additional layers is more abstract when ResNet101 is used as the backbone network (see Fig. 2), which leads to the result that the feature information of some objects (especially small objects) cannot be obtained directly, making the final detection accuracy of these objects very low. Specifically, for this study, the mAPs of the method using ResNet101 as the backbone for small objects (such as "bottle" and "faucet"), medium objects (such as "chair" and "table"), and large objects (such as "wall" and "floor") are 0%, 5.48%, and 57.40%, respectively. Thus, the total mAP of the method using ResNet101 is very low in the object detection task of this study. The experimental results in Table IV show that the total mAP of the method using the improved ResNet101 as the backbone network is still very low (16.16%). However, from the results, we can see that the mAP of the improved ResNet101 increases by 71.55% (relative value) compared with the general ResNet101, which verifies the effectiveness of our improvements on the backbone network.

The results of Table IV show that the effect of object detection using ResNet50 in the feature extraction module is better than that of VGG16 and ResNet101. In addition, the object detection method using ResNet50 has the fastest detection speed. This is the main reason why the proposed method is based on ResNet50 and improves it for feature extraction. The improved ResNet50 increases the mAP by 25.23% (relative value) compared with the general ResNet50, although it increases the time to detect each image by 0.021 s.

B. On the MCIE Module

The design of the MCIE module takes into account the matching relationship between the receptive field sizes of different atrous convolutions and the size of the feature map to be extracted. The contextual information is extracted by parallel atrous convolutions with different branch numbers, and redundant information is suppressed to the greatest extent. In this section, the superiority of the MCIE module proposed in this article is verified by an ablation experiment. In this ablation experiment, all the models have the same structure as the proposed model, except for the contextual information extraction module. The model without the MCIE module (called No-MCIE) and the model with the general ASPP structure for contextual information extraction (called G-ASPP, see [51] for details) are compared with the proposed model. The results of the ablation experiment on the MCIE module are shown in Table V.

TABLE V
ABLATION EXPERIMENTAL RESULTS ON THE MCIE MODULE

The results in Table V show that when the MCIE module is not used, the mAP of the model is influenced seriously, although the detection speed is a little faster than the models using a contextual information extraction module. Compared with the model using the general ASPP structure, both the mAP and the speed of the proposed model increase, which indicates that our proposed MCIE module has excellent performance in the task of object detection for indoor scenes.

C. On the DT-NMS Algorithm

Another important improvement of the proposed method in this study is the improved DT-NMS algorithm, which is used to filter out repeatedly predicted objects while considering highly overlapping targets. To discuss the effects of this improved DT-NMS algorithm, an ablation experiment is conducted, where the improved DT-NMS algorithm in the proposed method is replaced by the NMS and soft-NMS algorithms, respectively, as the postprocessing method. The methods with the NMS and soft-NMS algorithms are defined as M-NMS and M-soft-NMS, respectively. The ablation experimental results are listed in Table VI.

TABLE VI
ABLATION EXPERIMENTAL RESULTS ON THE DT-NMS ALGORITHM

From the results in Table VI, it can be seen that the mAP of the method with the NMS algorithm is low, but its detection time is the shortest. On the contrary, the mAP of the method with the soft-NMS algorithm is high, but its speed is slow. This is due to the simple bisection operation of the NMS algorithm, which easily causes missed detection of the target objects, so the mAP of M-NMS is low. However, using the soft-NMS algorithm, the number of detection boxes rises sharply, which also easily causes repeated detection of objects, so the detection speed is slow. The results show that the method with the improved DT-NMS algorithm gets the best object detection results, and its detection time is 39.20% (relative value) faster than that of M-soft-NMS. This shows that the improved DT-NMS algorithm can well balance the problems of repeated detection and missed detection of objects.

VI. DISCUSSIONS

In this section, some important parameter settings of the proposed model are discussed, including the settings of the MCIE module and the thresholds in the improved DT-NMS.


VI. DISCUSSIONS

In this section, some important parameter settings of the proposed model are discussed, including the settings of the MCIE module and the thresholds in the improved DT-NMS.

TABLE VII: Comparison of experimental results using different parameter settings in the MCIE module.

A. About the Settings of the MCIE Module

To discuss the settings of the MCIE module in the proposed method, five comparative experiments are conducted. The purpose of these experiments is to analyze the performance of the proposed MCIE module, while verifying the validity of the atrous rates of the atrous convolutions used in the MCIE module by adding or removing atrous convolution branches. The comparison results of the different settings of the MCIE module are listed in Table VII.
It can be seen from the experimental results in Table VII that with setting1, using the same parallel atrous convolution branches for all six feature maps extracts a large amount of useless redundant information, which does not improve the detection effect, and the time to detect each image is also the longest. As the number of atrous convolution branches in the MCIE module decreases, the detection speed increases steadily and the mAP varies. With setting3 of the atrous rates for the atrous convolutions in the MCIE module, the deep network has the best effect on indoor object detection. However, when the atrous convolution branches in the MCIE module are further reduced, the contextual information is not adequately extracted, and the mAP of the detection drops sharply. Therefore, to balance the accuracy and the speed of indoor object detection, the atrous rates of the atrous convolutions in the MCIE module are set as setting3 in this study (see Table II).
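To make the comparison concrete, the settings can be viewed as per-feature-map tuples of dilation rates, each tuple parameterizing one parallel-branch block such as the ParallelAtrousBlock sketched earlier. The numbers below are invented placeholders, since Table II is not reproduced in this extraction.

# Illustrative only: hypothetical per-feature-map dilation-rate tuples.
SETTINGS = {
    # same branches for all six maps (setting1-like: redundant, slowest)
    "setting1": [(1, 2, 4, 8)] * 6,
    # fewer branches on coarser maps (setting3-like: accuracy/speed balance)
    "setting3": [(1, 2, 4), (1, 2, 4), (1, 2), (1, 2), (1,), (1,)],
    # too few branches (setting5-like: context under-extracted, mAP drops)
    "setting5": [(1,)] * 6,
}

def branches_per_map(setting):
    """Return the dilation-rate tuples, one per SSD feature map."""
    return SETTINGS[setting]

print(branches_per_map("setting3"))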
B. About the Settings of the Thresholds in the Improved DT-NMS

There are two thresholds used in the improved DT-NMS, namely, Ni and Nt [see (7)]. In order to choose a suitable value for each threshold, the joint parameter adjustment method is used to determine the final thresholds in this study, and some experiments are conducted under the same conditions introduced in Section IV-B. In these experiments, the structure of the proposed model remains unchanged, except for the choice of the threshold values. In this study, the value of Nt is increased in steps of 0.05 from 0.3 to 0.5, following the related reference [58]. Because the threshold Ni in the improved DT-NMS algorithm is used to directly suppress the repeated detection boxes with DIoU greater than Ni, its value is set between 0.8 and 0.99 in the joint parameter adjustment method. The experimental results are shown in Fig. 14. Here, only the mAP metric is used as the judgment criterion. The results in Fig. 14 show that better detection results are obtained with Nt = 0.4 than with other values of Nt. Based on this experiment, we set Nt = 0.4 and Ni = 0.95, at which the maximum mAP is reached.

Fig. 14. Experimental results of the joint parameter adjustment method for Ni and Nt.
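A minimal sketch of this joint adjustment is a simple grid search. The Ni step size and the evaluate_map callback (which would run the detector with a given threshold pair, e.g., through a dt_nms-style postprocessor, and return the validation mAP) are assumptions, since the article only states the search ranges.

# A sketch of the joint parameter adjustment as a grid search.
import itertools
import numpy as np

def joint_threshold_search(evaluate_map):
    """Grid-search Nt in [0.3, 0.5] (step 0.05, following [58]) and Ni in
    [0.8, 0.99]; the Ni step here is an assumption. evaluate_map(Ni, Nt)
    is a hypothetical callback returning the validation mAP."""
    nts = np.arange(0.30, 0.51, 0.05)
    nis = np.append(np.arange(0.80, 0.96, 0.05), 0.99)
    results = {(ni, nt): evaluate_map(ni, nt)
               for ni, nt in itertools.product(nis, nts)}
    best_ni, best_nt = max(results, key=results.get)
    return best_ni, best_nt, results[(best_ni, best_nt)]

# Usage: joint_threshold_search(lambda ni, nt: run_validation(ni, nt)),
# where run_validation is the user's own evaluation routine; the paper
# reports Ni = 0.95 and Nt = 0.4 as the best pair.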
VII. CONCLUSION

In this article, the object detection problem in indoor scenes based on deep learning networks is studied, and an improved SSD-like deep network-based method is proposed. In the proposed method, more enriched features are extracted using an improved ResNet50 residual network; then, the MCIE module is adopted to extract multiscale contextual features, and six feature maps of different scales containing multiscale contextual information are used to detect the objects in the indoor scene. In addition, to improve the accuracy and efficiency of deep learning-based methods in indoor object detection, an improved DT-NMS algorithm is used as the postprocessing method. Finally, various comparative experiments are performed, and the experimental results show that the proposed indoor object detection method outperforms the state-of-the-art deep learning-based methods in object detection tasks in indoor scenes.

In future work, further research on small objects in indoor scenes should be conducted, and the object detection problem under some difficult situations, such as low-light and reflective conditions, should be specifically studied. On the other hand, the real-time performance of indoor object detection should also be further considered. In addition, the public datasets for 2-D object detection in indoor scenes should be enriched to obtain more effective detection results.

REFERENCES

[1] G. Wang, Y. Hu, X. Wu, and H. Wang, "Residual 3-D scene flow learning with context-aware feature extraction," IEEE Trans. Instrum. Meas., vol. 71, pp. 1–9, 2022.
[2] S. Liu, G. Tian, Y. Zhang, M. Zhang, and S. Liu, "Service planning oriented efficient object search: A knowledge-based framework for home service robot," Expert Syst. Appl., vol. 187, Jan. 2022, Art. no. 115853.


[3] K. Wang, X. Li, J. Yang, J. Wu, and R. Li, "Temporal action detection based on two-stream you only look once network for elderly care service robot," Int. J. Adv. Robotic Syst., vol. 18, no. 4, Jul. 2021, Art. no. 172988142110383.
[4] P. Nazemzadeh, F. Moro, D. Fontanelli, D. Macii, and L. Palopoli, "Indoor positioning of a robotic walking assistant for large public environments," IEEE Trans. Instrum. Meas., vol. 64, no. 11, pp. 2965–2976, Nov. 2015.
[5] R. Ma, Z. Zhang, and E. Chen, "Human motion gesture recognition based on computer vision," Complexity, vol. 2021, Feb. 2021, Art. no. 6679746.
[6] S. Cui, R. Wang, J. Hu, C. Zhang, L. Chen, and S. Wang, "Self-supervised contact geometry learning by GelStereo visuotactile sensing," IEEE Trans. Instrum. Meas., vol. 71, pp. 1–9, 2022.
[7] Z. Huang, C. Lv, Y. Xing, and J. Wu, "Multi-modal sensor fusion-based deep neural network for end-to-end autonomous driving with scene understanding," IEEE Sensors J., vol. 21, no. 10, pp. 11781–11790, May 2021.
[8] Y. Cheng, Y. Yang, H.-B. Chen, N. Wong, and H. Yu, "S3-Net: A fast scene understanding network by single-shot segmentation for autonomous driving," ACM Trans. Intell. Syst. Technol., vol. 12, no. 5, pp. 1–19, Oct. 2021.
[9] J. Fan, P. Zheng, and S. Li, "Vision-based holistic scene understanding towards proactive human–robot collaboration," Robot. Comput.-Integr. Manuf., vol. 75, Jun. 2022, Art. no. 102304.
[10] X. Zhao, T. Zuo, and X. Hu, "OFM-SLAM: A visual semantic SLAM for dynamic indoor environments," Math. Problems Eng., vol. 2021, Apr. 2021, Art. no. 5538840.
[11] M. Y. Moemen, H. Elghamrawy, S. N. Givigi, and A. Noureldin, "3-D reconstruction and measurement system based on multimobile robot machine vision," IEEE Trans. Instrum. Meas., vol. 70, pp. 1–9, 2021.
[12] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[13] D. D'Souza and R. V. Yampolskiy, "Baseline avatar face detection using an extended set of Haar-like features," in Proc. 23rd Midwest Artif. Intell. Cogn. Sci. Conf., Cincinnati, OH, USA, vol. 841, Apr. 2012, pp. 72–77.
[14] R. Matsumura and A. Hanazawa, "Human detection using color contrast-based histograms of oriented gradients," Int. J. Innov. Comput., Inf. Control, vol. 15, no. 4, pp. 1211–1222, 2019.
[15] L. Zheng, T. Zhou, R. Jiang, and Y. Peng, "Survey of video object detection algorithms based on deep learning," in Proc. 4th Int. Conf. Algorithms, Comput. Artif. Intell., Sanya, China, Dec. 2021, pp. 1–6.
[16] J. Akilandeswari, G. Jothi, A. Naveenkumar, R. S. Sabeenian, P. Iyyanar, and M. E. Paramasivam, "Design and development of an indoor navigation system using denoising autoencoder based convolutional neural network for visually impaired people," Multimedia Tools Appl., vol. 81, no. 3, pp. 3483–3514, Jan. 2022.
[17] B. Anbarasu and G. Anitha, "Indoor scene recognition for micro aerial vehicles navigation using enhanced-GIST descriptors," Defence Sci. J., vol. 68, no. 2, pp. 129–137, 2018.
[18] F. Vittori, I. Pigliautile, and A. L. Pisello, "Subjective thermal response driving indoor comfort perception: A novel experimental analysis coupling building information modelling and virtual reality," J. Building Eng., vol. 41, Sep. 2021, Art. no. 102368.
[19] Y. Li, Z. Zhang, Y. Cheng, L. Wang, and T. Tan, "MAPNet: Multimodal attentive pooling network for RGB-D indoor scene classification," Pattern Recognit., vol. 90, pp. 436–449, Jun. 2019.
[20] T. Ran, L. Yuan, and J. B. Zhang, "Scene perception based visual navigation of mobile robot in indoor environment," ISA Trans., vol. 109, pp. 389–400, Mar. 2021.
[21] J. Jiang, P. Liu, Z. Ye, W. Zhao, and X. Tang, "A hierarchical inferential method for indoor scene classification," Int. J. Appl. Math. Comput. Sci., vol. 27, no. 4, pp. 839–852, Dec. 2017.
[22] P. Ji, D. Qin, P. Feng, T. Lan, G. Sun, and M. J. Khan, "Research on indoor scene classification mechanism based on multiple descriptors fusion," Mobile Inf. Syst., vol. 2020, Mar. 2020, Art. no. 4835198.
[23] A. Mosella-Montoro and J. Ruiz-Hidalgo, "2D–3D geometric fusion network using multi-neighbourhood graph convolution for RGB-D indoor scene classification," Inf. Fusion, vol. 76, pp. 46–54, Dec. 2021.
[24] A. Glavan and E. Talavera, "InstaIndoor and multi-modal deep learning for indoor scene recognition," Neural Comput. Appl., vol. 34, no. 9, pp. 6861–6877, May 2022.
[25] B. Yang, J. Li, and H. Zhang, "Resilient indoor localization system based on UWB and visual–inertial sensors for complex environments," IEEE Trans. Instrum. Meas., vol. 70, 2021, Art. no. 8504014.
[26] Q. Fu, H. Fu, Z. Deng, and X. Li, "Indoor layout programming via virtual navigation detectors," Sci. China Inf. Sci., vol. 65, no. 8, Aug. 2022.
[27] J. Li, W. Gao, Y. Wu, Y. Liu, and Y. Shen, "High-quality indoor scene 3D reconstruction with RGB-D cameras: A brief review," Comput. Vis. Media, vol. 8, no. 3, pp. 369–393, Sep. 2022.
[28] Y. Sun, Y. Miao, J. Chen, and R. Pajarola, "PGCNet: Patch graph convolutional network for point cloud segmentation of indoor scenes," Vis. Comput., vol. 36, nos. 10–12, pp. 2407–2418, Oct. 2020.
[29] W. Zhang, Q. Zhang, W. Zhang, J. Gu, and Y. Li, "From edge to keypoint: An end-to-end framework for indoor layout estimation," IEEE Trans. Multimedia, vol. 23, pp. 4483–4490, 2021.
[30] L. Huan, X. Zheng, and J. Gong, "GeoRec: Geometry-enhanced semantic 3D reconstruction of RGB-D indoor scenes," ISPRS J. Photogramm. Remote Sens., vol. 186, pp. 301–314, Apr. 2022.
[31] J. Ni, K. Shen, Y. Chen, W. Cao, and S. X. Yang, "An improved deep network-based scene classification method for self-driving cars," IEEE Trans. Instrum. Meas., vol. 71, pp. 1–14, 2022.
[32] C. Dai et al., "Video scene segmentation using tensor-train faster-RCNN for multimedia IoT systems," IEEE Internet Things J., vol. 8, no. 12, pp. 9697–9705, Jun. 2021.
[33] M. Sogabe, N. Ito, T. Miyazaki, T. Kawase, T. Kanno, and K. Kawashima, "Detection of instruments inserted into eye in cataract surgery using single-shot multibox detector," Sensors Mater., vol. 34, no. 1, pp. 47–54, 2022.
[34] X. Lu, J. Ji, Z. Xing, and Q. Miao, "Attention and feature fusion SSD for remote sensing object detection," IEEE Trans. Instrum. Meas., vol. 70, pp. 1–9, 2021.
[35] Z. Li et al., "Toward efficient safety helmet detection based on YOLOV5 with hierarchical positive sample selection and box density filtering," IEEE Trans. Instrum. Meas., vol. 71, 2022, Art. no. 2508314.
[36] D. Zheng et al., "A defect detection method for rail surface and fasteners based on deep convolutional neural network," Comput. Intell. Neurosci., vol. 2021, Aug. 2021, Art. no. 2565500.
[37] J. Ni, Y. Chen, Y. Chen, J. Zhu, D. Ali, and W. Cao, "A survey on theories and applications for self-driving cars based on deep learning methods," Appl. Sci., vol. 10, no. 8, p. 2749, Apr. 2020.
[38] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Venice, Italy, Oct. 2017, pp. 2999–3007.
[39] Y. Shi, Y. Guo, Z. Mi, and X. Li, "Stereo CenterNet-based 3D object detection for autonomous driving," Neurocomputing, vol. 471, pp. 219–229, Jan. 2022.
[40] M. Afif, R. Ayachi, E. Pissaloux, Y. Said, and M. Atri, "Indoor objects detection and recognition for an ICT mobility assistance of visually impaired people," Multimedia Tools Appl., vol. 79, nos. 41–42, pp. 31645–31662, Nov. 2020.
[41] M. Chen, X. Ren, and Z. Yan, "Real-time indoor object detection based on deep learning and gradient harmonizing mechanism," in Proc. IEEE 9th Data Driven Control Learn. Syst. Conf. (DDCLS), Liuzhou, China, Nov. 2020, pp. 772–777.
[42] X. Ding, B. Li, and J. Wang, "Geometric property-based convolutional neural network for indoor object detection," Int. J. Adv. Robotic Syst., vol. 18, no. 1, Jan. 2021, Art. no. 172988142199332.
[43] E. U. Samani, X. Yang, and A. G. Banerjee, "Visual object recognition in indoor environments using topologically persistent features," IEEE Robot. Autom. Lett., vol. 6, no. 4, pp. 7509–7516, Oct. 2021.
[44] F. Alamri and N. Pugeault, "Contextual relabelling of detected objects," in Proc. Joint IEEE 9th Int. Conf. Develop. Learn. Epigenetic Robot. (ICDL-EpiRob), Oslo, Norway, Aug. 2019, pp. 313–319.
[45] W. Ma, Q. Guo, Y. Wu, W. Zhao, X. Zhang, and L. Jiao, "A novel multi-model decision fusion network for object detection in remote sensing images," Remote Sens., vol. 11, p. 737, Apr. 2019.
[46] W. Zhang, C. Fu, H. Xie, M. Zhu, M. Tie, and J. Chen, "Global context aware RCNN for object detection," Neural Comput. Appl., vol. 33, no. 18, pp. 11627–11639, Sep. 2021.
[47] F. Wang and J. T. C. Tan, "The comparison of light compensation method and CycleGAN method for deep learning based object detection of mobile robot vision under inconsistent illumination conditions in virtual environment," in Proc. IEEE 9th Annu. Int. Conf. Cyber Technol. Autom., Control, Intell. Syst. (CYBER), Suzhou, China, Jul. 2019, pp. 271–276.


[48] L. Huang et al., "Multi-scale feature fusion convolutional neural network for indoor small target detection," Frontiers Neurorobot., vol. 16, May 2022, Art. no. 881021.
[49] D. Zhang et al., "Indoor target detection and pose estimation based on point cloud," in Proc. 3rd Int. Conf. Comput., Commun. Mechatronics Eng., Dec. 2021, vol. 2218, no. 1, doi: 10.1088/1742-6596/2218/1/012037.
[50] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 770–778.
[51] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Apr. 2017.
[52] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. 32nd Int. Conf. Mach. Learn. (ICML), Lille, France, vol. 1, Jul. 2015, pp. 448–456.
[53] G. Chen et al., "A survey of the four pillars for small object detection: Multiscale representation, contextual information, super-resolution, and region proposal," IEEE Trans. Syst., Man, Cybern., Syst., vol. 52, no. 2, pp. 936–953, Feb. 2022.
[54] S. Wang, J. Cheng, H. Liu, F. Wang, and H. Zhou, "Pedestrian detection via body part semantic and contextual information with DNN," IEEE Trans. Multimedia, vol. 20, no. 11, pp. 3148–3159, Nov. 2018.
[55] W.-S. Zheng, S. Gong, and T. Xiang, "Quantifying and transferring contextual information in object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 762–777, Apr. 2012.
[56] S. Liu, D. Huang, and Y. Wang, "Receptive field block net for accurate and fast object detection," in Proc. 15th Eur. Conf. Comput. Vis. (Lecture Notes in Computer Science), vol. 11215. Munich, Germany: Springer, Sep. 2018, pp. 404–419.
[57] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 2818–2826.
[58] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, "Soft-NMS–improving object detection with one line of code," in Proc. 16th IEEE Int. Conf. Comput. Vis., Venice, Italy, Oct. 2017, pp. 5562–5570.
[59] M. Gong, D. Wang, X. Zhao, H. Guo, D. Luo, and M. Song, "A review of non-maximum suppression algorithms for deep learning target detection," Proc. SPIE, vol. 11763, Mar. 2021, Art. no. 1176332.
[60] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, "Distance-IoU loss: Faster and better learning for bounding box regression," in Proc. 34th AAAI Conf. Artif. Intell. (AAAI), New York, NY, USA, Feb. 2020, pp. 12993–13000.
[61] D. Yuan, X. Shu, N. Fan, X. Chang, Q. Liu, and Z. He, "Accurate bounding-box regression with distance-IoU loss for visual tracking," J. Vis. Commun. Image Represent., vol. 83, Feb. 2022, Art. no. 103428.
[62] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Miami, FL, USA, Jun. 2009, pp. 413–420.
[63] Z. Lu and Y. Chen, "Pyramid frequency network with spatial attention residual refinement module for monocular depth estimation," J. Electron. Imag., vol. 31, no. 2, Mar. 2022, Art. no. 023005.
[64] L. Benrais and N. Baha, "High level visual scene classification using background knowledge of objects," Multimedia Tools Appl., vol. 81, no. 3, pp. 3663–3692, Jan. 2022.
[65] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Sep. 2009.
[66] J. Yang, W. Y. He, T. L. Zhang, C. L. Zhang, L. Zeng, and B. F. Nan, "Research on subway pedestrian detection algorithms based on SSD model," IET Intell. Transp. Syst., vol. 14, no. 11, pp. 1491–1496, Nov. 2020.
[67] Y. Yang, H. Wang, D. Jiang, and Z. Hu, "Surface detection of solid wood defects based on SSD improved with ResNet," Forests, vol. 12, no. 10, p. 1419, Oct. 2021.
[68] X. Gao, J. Xu, C. Luo, J. Zhou, P. Huang, and J. Deng, "Detection of lower body for AGV based on SSD algorithm with ResNet," Sensors, vol. 22, no. 5, p. 2008, Mar. 2022.
[69] Y. Gong, X. Yu, Y. Ding, X. Peng, J. Zhao, and Z. Han, "Effective fusion factor in FPN for tiny object detection," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Waikoloa, HI, USA, Jan. 2021, pp. 1159–1167.

Jianjun Ni (Senior Member, IEEE) received the Ph.D. degree from the School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, China, in 2005. He was a Visiting Professor with the Advanced Robotics and Intelligent Systems (ARIS) Laboratory, University of Guelph, Guelph, ON, Canada, from 2009 to 2010. He is currently a Professor with the College of Internet of Things Engineering, Hohai University, Changzhou, China. He has authored or coauthored over 100 papers in related international conferences and journals. His research interests include control systems, neural networks, robotics, machine intelligence, and multiagent systems. Dr. Ni serves as an associate editor and a reviewer for a number of international journals.

Kang Shen received the B.S. degree from Hohai University, Changzhou, China, in 2020, where she is currently pursuing the M.S. degree with the Department of Detection Technology and Automatic Equipment, College of Internet of Things Engineering. Her research interests include self-driving, robot control, and machine learning.

Yan Chen received the B.S. degree from Hohai University, Changzhou, China, in 2017, where he is currently pursuing the Ph.D. degree in Internet of Things (IoT) technology and application with the College of Internet of Things Engineering. His research interests include simultaneous localization and mapping, robot control, and machine learning.

Simon X. Yang (Senior Member, IEEE) received the B.Sc. degree in engineering physics from Beijing University, Beijing, China, in 1987, the first M.Sc. degree in biophysics from the Chinese Academy of Sciences, Beijing, in 1990, the second M.Sc. degree in electrical engineering from the University of Houston, Houston, TX, USA, in 1996, and the Ph.D. degree in electrical and computer engineering from the University of Alberta, Edmonton, AB, Canada, in 1999. He is currently a Professor and the Head of the Advanced Robotics and Intelligent Systems Laboratory, University of Guelph, Guelph, ON, Canada. His research interests include robotics, intelligent systems, sensors and multisensor fusion, wireless sensor networks, control systems, transportation, and computational neuroscience. Dr. Yang has been very active in professional activities. He was the General Chair of the 2011 IEEE International Conference on Logistics and Automation and the Program Chair of the 2015 IEEE International Conference on Information and Automation. He serves as an Editor-in-Chief for the International Journal of Robotics and Automation and an Associate Editor for the IEEE TRANSACTIONS ON CYBERNETICS and several other journals.
