An Improved SSD-like Deep Network-Based Object Detection Method for Indoor Scenes
Jianjun Ni, Kang Shen, Yan Chen, and Simon X. Yang
Abstract— Indoor scene object detection is of important research significance and is one of the popular research topics in the field of scene understanding for indoor robots. In recent years, solutions based on deep learning have achieved good results in object detection. However, there are still some problems to be further studied in indoor object detection methods, such as the lighting and occlusion problems caused by the complexity of the indoor environment. Aiming at these problems, an improved object detection method based on deep neural networks is proposed in this article, which uses a framework similar to the single-shot multibox detector (SSD). In the proposed method, an improved ResNet50 network is used to enhance the transmission of information, and the feature expression capability of the feature extraction network is improved. At the same time, a multiscale contextual information extraction (MCIE) module is used to extract the contextual information of the indoor scene, so as to improve the indoor object detection effect. In addition, an improved dual-threshold non-maximum suppression (DT-NMS) algorithm is used to alleviate the occlusion problem in indoor scenes. Finally, the public dataset SUN2012 is further screened for the special application of indoor scene object detection, and the proposed method is tested on this dataset. The experimental results show that the mean average precision (mAP) of the proposed method can reach 54.10%, which is higher than those of the state-of-the-art methods.

Index Terms— Deep network, indoor scene, object detection, ResNet50 network, single-shot multibox detector (SSD) algorithm.

I. INTRODUCTION

… by perceiving the surrounding environment, and to realize the interaction among robot, human, and environment. Thus, how to make robots interact with the environment and people through the understanding of images, as people do, has become a hot research topic. This is also the main research content of computer vision [5], [6].

The aim of computer vision for robots is to enable robots to observe and understand the world through vision, and to have the ability to adapt to the environment independently. This is one of the key research contents of scene understanding for indoor robots [7]. Scene understanding is the premise of autonomous decision-making and behavior optimization of intelligent robots in a complex indoor environment and is also the key to determining the function and performance of indoor robots [8], [9].

The task of scene understanding is closely related to the information of important objects in the scene, so research on object detection has very important practical significance. Indoor scenes are closely related to people's lives and are one of the hot research topics in the field of computer vision. The key of indoor scene object detection technology is to determine the location, size, and category of objects, which has important research significance in the indoor robot field, such as accurate positioning and navigation of indoor mobile robots [10], [11].

Before the emergence of deep learning technology, traditional object detection methods were mainly based on hand-designed features, such as the scale-invariant feature transform (SIFT) [12] …
… of this article is to improve the object detection accuracy for indoor scenes.

B. Indoor Object Detection Methods Based on Deep Learning

The indoor scene is a key research topic in the field of object detection, which has great practical value for the applications of indoor robots. At the same time, with the development of deep learning, deep learning-based algorithms have become the main research direction of object detection. We give a brief review of the deep learning-based object detection methods for indoor scenes.

The object detection methods based on deep learning technology can be roughly divided into two categories according to the characteristics of their process: single-stage object detection methods and two-stage object detection methods. The main representative of the two-stage object detection methods is the regions with convolutional neural network (R-CNN) features series, including Fast R-CNN and Faster R-CNN [31], [32]. Such methods are based on the region proposal network (RPN) and the theory of neural network classification. There are two stages in the detection process: first, the RPN is used to generate the preselection boxes, and then, the classification and regression of the preselection boxes are realized through the detection network. This kind of detection method has high detection accuracy, but the detection speed is slow. Single-stage object detection methods are represented by SSD [33], [34] and the you only look once (YOLO) series [35], [36]. SSD and YOLO series models have been widely studied and used [37] and will not be introduced in this article. Two other classic and common single-stage methods, namely, RetinaNet [38] and CenterNet [39], are introduced here briefly. RetinaNet consists of a backbone network and two prediction branch networks, where a ResNet architecture is used for the backbone network, and a pyramid structure is used for feature fusion. CenterNet is an anchor-free object detection network, which mainly includes a ResNet50 backbone to extract image features and a deconvolution (Deconv) module to upsample the feature map. Single-stage object detection methods remove the RPN stage and can directly obtain the detection results, so the detection speed is fast, which is suitable for the requirements of indoor robots. In this article, we focus on the single-stage object detection method for indoor scenes and present an SSD-like deep network-based method.

Recently, there have been many improvements of the deep learning-based methods for indoor object detection. For example, Afif et al. [40] proposed a deep convolutional neural network for a fully labeled indoor object dataset. Chen et al. [41] added the "gradient density" parameter to the loss function of YOLOv3 to improve the imbalance of sample categories and the accuracy of indoor object detection. Ding et al. [42] focused on the region-based CNN detector and proposed a geometry-based Faster R-CNN for indoor object detection. Similar to the geometric property-based approach, Samani et al. [43] proposed to use topological persistence features that rely on target object shape information to improve the robustness of indoor object detection. These methods have achieved some good results. However, there are still some limitations in these methods. For example, the method in [41] only considered the category imbalance of the indoor detection samples, and the indoor scenes it used were too simple to reflect the inherent attributes of indoor scenes (i.e., there are many semantic categories and occlusions among the objects, the visual features lack strong identification ability, and the indoor lighting is not controllable).

There has been a lot of research on the contextual information around objects in recent years. For example, Alamri and Pugeault [44] put forward a contextual model of rescoring and remarking, as a postprocessing step, to improve the detection performance and help correct false detections based on the contextual information. Ma et al. [45] constructed a fusion subnetwork for the local contextual information and the object–object relationship contextual information, to deal with the diversity and complexity of geospatial object appearance. Zhang et al. [46] proposed a global context-aware attention network, which utilized both the global feature pyramid and attention strategies. These studies have proven that contextual information can eliminate the influence of external factors and improve the detection accuracy. However, in most studies on contextual information, only the contextual information of a certain layer or the last layer of the deep network is extracted, so that the extracted features are scarce, which affects the detection accuracy of small or occluded objects. To solve this kind of problem, this article proposes an improved MCIE module to extract multiple layers of contextual information with different scales, so as to obtain richer features.

There are also many studies that focus on the specific problems of indoor object detection. For example, Wang and Tan [47] studied and validated the low confidence in object detection of robotic vision in dark environments, and a cycle generative adversarial networks (CycleGANs)-based method of brightness migration was presented, which can improve the detection performance in the dark environment. Huang et al. [48] used the multiscale feature fusion method to improve the detection effect of various types of objects, especially for objects with smaller scales. Zhang et al. [49] presented a set of object detection algorithms based on the point cloud template to deal with the problems in the complex indoor environment. These methods obtained good results on specific problems. However, the comprehensive performance of these methods should be further improved. For example, the multiscale fusion method in [48] slows down the detection speed, and the dataset images used did not consider the occlusion problem of indoor scenes. The categories of objects detected in [49] were far fewer than those of indoor scene objects, and the occlusion rate set in that article was also much lower than the occlusion situation in real indoor scenes.

As introduced above, there are many excellent research results of deep learning-based algorithms for indoor object detection. However, there are still many problems that need to be further studied, which is the main purpose of this article.
Fig. 2. Whole structure of the proposed indoor object detection model based on deep network.
Fig. 4. (a) Convolution block and (b) identity block of the improved ResNet50 network.
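Fig. 4 itself does not survive this extraction. As a rough illustration only, the following sketch shows the standard ResNet50 bottleneck designs that the convolution block and identity block of Fig. 4 build on [50], [52]; the specific improvements this article makes to these blocks (Section III-A) are not recoverable from this excerpt, so `ConvBlock` and `IdentityBlock` here follow the baseline design.

```python
import torch
import torch.nn as nn

def conv_bn(in_ch, out_ch, k, stride=1):
    """Convolution followed by batch normalization [52]."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
    )

class IdentityBlock(nn.Module):
    """Baseline bottleneck with an identity shortcut (shapes unchanged)."""
    def __init__(self, channels, mid):
        super().__init__()
        self.branch = nn.Sequential(
            conv_bn(channels, mid, 1), nn.ReLU(inplace=True),
            conv_bn(mid, mid, 3), nn.ReLU(inplace=True),
            conv_bn(mid, channels, 1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.branch(x))  # residual addition

class ConvBlock(nn.Module):
    """Baseline bottleneck whose shortcut is a strided 1x1 convolution,
    so the block can change both the spatial size and the channel count."""
    def __init__(self, in_ch, mid, out_ch, stride=2):
        super().__init__()
        self.branch = nn.Sequential(
            conv_bn(in_ch, mid, 1, stride), nn.ReLU(inplace=True),
            conv_bn(mid, mid, 3), nn.ReLU(inplace=True),
            conv_bn(mid, out_ch, 1),
        )
        self.shortcut = conv_bn(in_ch, out_ch, 1, stride)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.shortcut(x) + self.branch(x))
```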
Fig. 8. Structure of the MSFD module.

Fig. 9. Examples of the object detection. (a) Based on the traditional NMS algorithm. (b) Based on the soft-NMS algorithm.
where c is the number of target categories. In this article, c = 28 (including 27 object categories and one background). Thus, the detection layer mainly has three tasks: generating default boxes, giving the confidence score of the target category, and calculating the coordinate position information of the bounding boxes. The boxes are then further filtered with the object confidence in the bounding boxes and processed by an improved DT-NMS algorithm. Finally, the box with the highest score is selected to obtain the object detection result.

2) Improved DT-NMS Algorithm: The final step of the MSFD module is to filter out the repeatedly predicted target objects while considering the highly overlapping objects. In most object detection methods, the traditional NMS is often used to process the detection boxes, which is described as follows:

$$s_f = \begin{cases} s_i, & \mathrm{IoU}(M, B_i) < N_t \\ 0, & \mathrm{IoU}(M, B_i) \ge N_t \end{cases} \quad (6)$$

where $s_i$ is the original score of the $i$th detection box; $s_f$ is the final score of the detection box; $M$ is the detection box with the highest score; $B_i$ is the $i$th box to be detected; $\mathrm{IoU}(M, B_i)$ is the IoU ratio between $M$ and box $B_i$; and $N_t$ is a preset threshold, which is used to retain the original scores of the detection boxes whose IoU does not exceed this threshold.

Therefore, the traditional NMS algorithm performs a bisection operation according to a threshold. However, the distribution of indoor scene objects is complex, and object occlusion or stacking problems often occur. Such a simple threshold segmentation in the traditional NMS algorithm will cause missed detection of objects with occlusion problems. When two objects appear very close to each other in the image, the box with the lower score is likely to be directly suppressed, because the IoU ratio between this box and the highest-scoring box is greater than the preset threshold. An example of object detection based on the traditional NMS algorithm is shown in Fig. 9(a).

As shown in Fig. 9(a), when the cushions are very close to each other, the NMS algorithm will only retain the detection box with the highest score, resulting in the missed detection of the target object. To deal with the problem of missed object detection, some improvements to the NMS algorithm have been proposed, such as the soft-NMS and DIoU-NMS algorithms [58], [59]. However, there are still some problems in these methods. For example, the soft-NMS algorithm still has the problem of missed detection, and its rate of repeated detection and misdetection is relatively high [see Fig. 9(b)]. To balance the problems of missed detection, repeated detection, and misdetection of objects in complex indoor scenes, an improved DT-NMS algorithm is presented …
$$\delta = 1 - \mathrm{DIoU}(M, B_i) \quad (8)$$

where $N_i$ is a new preset threshold, which gives the undetermined detection boxes an upper limit value of DIoU; $\mathrm{DIoU}(M, B_i)$ is the distance-IoU, which can be formally defined as follows [60]:

$$\mathrm{DIoU}(M, B_i) = \mathrm{IoU}(M, B_i) - R_{\mathrm{DIoU}}(m, b_i) \quad (9)$$

where $R_{\mathrm{DIoU}}(m, b_i)$ is the penalty term for $m$ and $b_i$, which can be calculated by the following:

$$R_{\mathrm{DIoU}}(m, b_i) = \frac{\rho^2(m, b_i)}{c^2} \quad (10)$$

where $m$ and $b_i$ are the centers of $M$ and $B_i$, respectively; $\rho(\cdot)$ is the Euclidean distance function; and $c$ refers to the diagonal length of the smallest enclosing box covering the two boxes. The main reason for using DIoU in this article is that both the overlap area and the center distance between two boxes are considered in DIoU, which has been proven to be a better suppression criterion [60], [61].
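As a quick numerical check of (9) and (10), take $M = (0, 0, 10, 10)$ and $B_i = (2, 0, 12, 10)$ in $(x_1, y_1, x_2, y_2)$ form (a worked example added here for illustration):

$$\mathrm{IoU}(M, B_i) = \frac{80}{100 + 100 - 80} = \frac{2}{3}, \qquad \rho^2(m, b_i) = (7 - 5)^2 + (5 - 5)^2 = 4$$
$$c^2 = 12^2 + 10^2 = 244, \qquad \mathrm{DIoU}(M, B_i) = \frac{2}{3} - \frac{4}{244} \approx 0.650$$

The center-offset penalty thus lowers the suppression score slightly below the plain IoU, and it grows as the candidate's center drifts away from that of $M$.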
In this study, $N_i$ is a DIoU upper limit for the pending detection boxes. A detection box with too high a DIoU can be regarded as a repeated detection of the same object: its confidence score is directly set to zero, and the box is suppressed. Therefore, the setting of the upper limit threshold of DIoU can solve the repeated detection problem to some extent. For detection boxes with a DIoU less than $N_i$, the improved DT-NMS algorithm adds a penalized decay function to the traditional NMS algorithm. It is specified that the larger the DIoU with the highest-scoring detection box, the smaller the confidence value; then, the boxes below the confidence threshold are judged as redundant boxes, and those above the confidence threshold are retained. Most of the retained boxes are correct detections of occluded objects, which solves a part of the missed detection problem.

Remark 4: Based on the proposed DT-NMS algorithm, the condition of multiple thresholds restricts the number of dense redundant detection boxes and also ensures that the detection boxes of occluded objects are preserved. Meanwhile, DIoU is introduced to consider the center distance between two boxes and the overlap area simultaneously, which is beneficial to object detection in cases with occlusion. Thus, the problems of repeated detection, missed detection, and misdetection can be alleviated effectively.

The pseudocode of the improved DT-NMS algorithm is shown in Fig. 10.

Fig. 10. Pseudocode of the improved DT-NMS algorithm.
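The Fig. 10 pseudocode is not reproduced in this excerpt. The following is a minimal NumPy sketch of the dual-threshold behavior described above; since (7) is missing here, the exact decay function is an assumption (a linear decay via $\delta = 1 - \mathrm{DIoU}$ from (8) is used for illustration), and the helper names `diou` and `dt_nms` are ours, not the authors'.

```python
import numpy as np

def diou(box_m, boxes, eps=1e-9):
    """Distance-IoU between the current top box and candidate boxes.
    Boxes are (x1, y1, x2, y2); DIoU = IoU - rho^2/c^2 [see (9), (10)]."""
    x1 = np.maximum(box_m[0], boxes[:, 0]); y1 = np.maximum(box_m[1], boxes[:, 1])
    x2 = np.minimum(box_m[2], boxes[:, 2]); y2 = np.minimum(box_m[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_m = (box_m[2] - box_m[0]) * (box_m[3] - box_m[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    iou = inter / (area_m + areas - inter + eps)
    # Squared center distance between the top box and each candidate.
    cm = np.array([(box_m[0] + box_m[2]) / 2, (box_m[1] + box_m[3]) / 2])
    cb = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                   (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    rho2 = ((cb - cm) ** 2).sum(axis=1)
    # Squared diagonal of the smallest enclosing box.
    ex1 = np.minimum(box_m[0], boxes[:, 0]); ey1 = np.minimum(box_m[1], boxes[:, 1])
    ex2 = np.maximum(box_m[2], boxes[:, 2]); ey2 = np.maximum(box_m[3], boxes[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    return iou - rho2 / (c2 + eps)

def dt_nms(boxes, scores, N_t=0.4, N_i=0.95, score_thresh=0.01):
    """Dual-threshold NMS sketch: DIoU >= N_i -> suppress as a repeat;
    DIoU below N_i -> decay the score (larger DIoU, smaller score) and
    keep only survivors above the confidence threshold."""
    scores = scores.copy()
    keep = []
    idx = np.argsort(scores)[::-1]
    while idx.size > 0:
        m = idx[0]
        keep.append(int(m))
        rest = idx[1:]
        if rest.size == 0:
            break
        d = diou(boxes[m], boxes[rest])
        scores[rest[d >= N_i]] = 0.0                # repeated detections
        decay = np.where(d > N_t, 1.0 - d, 1.0)     # assumed form of (7)
        scores[rest] = scores[rest] * decay
        rest = rest[scores[rest] > score_thresh]
        idx = rest[np.argsort(scores[rest])[::-1]]
    return keep
```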
The workflow of the whole deep network-based indoor object detection method proposed in this study is introduced as follows (a code sketch after these steps summarizes the data flow).

Step 1: The indoor scene images for object detection are fed into the deep network proposed in this study.

Step 2: The feature extraction module, consisting of the improved ResNet50 module as well as the additional layers, is used to extract the general features of the target objects and obtain the feature maps at six different scales.

Step 3: The feature maps at six different scales obtained in the previous step are input into the MCIE module to extract the contextual information with different scales, and six new feature maps containing multiscale contextual information are obtained.

Step 4: Then, through the MSFD module, the location information and classification information of the feature maps are extracted.

Step 5: Finally, the results of indoor object detection are output by using the improved DT-NMS algorithm.
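As a compact summary of Steps 1–5, the skeleton below strings the stages together. The constructor arguments stand in for the components described in this article (the improved ResNet50 plus additional layers, the MCIE blocks, and the MSFD heads); their internals are not restated here, and `dt_nms` refers to the post-processing sketch given earlier.

```python
import torch.nn as nn

class IndoorDetector(nn.Module):
    """SSD-like pipeline skeleton for Steps 1-5 (module internals injected)."""
    def __init__(self, feature_extractor, mcie_blocks, msfd_heads):
        super().__init__()
        self.backbone = feature_extractor        # improved ResNet50 + extra layers
        self.mcie = nn.ModuleList(mcie_blocks)   # one context block per scale
        self.heads = nn.ModuleList(msfd_heads)   # loc/cls heads per scale

    def forward(self, images):                   # Step 1: 300x300 input images
        feats = self.backbone(images)            # Step 2: six feature maps
        feats = [m(f) for m, f in zip(self.mcie, feats)]              # Step 3
        locs, confs = zip(*[h(f) for h, f in zip(self.heads, feats)]) # Step 4
        return locs, confs   # Step 5: decode boxes, then apply dt_nms outside
```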
IV. EXPERIMENTS

A. Dataset and Metrics

There are lots of public datasets that are used for indoor scene training and testing, such as the indoorCVPR_09 and NYU depth dataset v2 (NYUv2) datasets [62], [63]. However, indoorCVPR_09 mainly aims at 3-D reconstruction, and NYUv2 focuses on semantic segmentation of indoor scenes. In this article, SUN2012 is used as the experimental dataset, which is an object detection dataset in the extensive Scene UNderstanding (SUN) database, containing 16 873 images covering many scenes (including indoor and outdoor scenes) [64]. After removing the pictures of outdoor scenes in SUN2012, 7108 pictures of indoor scenes are retained. To ensure the universality of the detected objects, the object categories in indoorCVPR_09, NYUv2, and SUN2012 are compared, and 27 common indoor object categories are used for training and evaluation in this study. The distribution of the 27 object classes in the experimental dataset is shown in Fig. 11.

One of the reasons for choosing these 27 categories of indoor objects is that we hope to study as many indoor scenes as possible, including more indoor objects. These 27 indoor objects are common and typical representative objects in different indoor scenes and frequently appear in the bedroom, living room, bathroom, dining room, kitchen, office, and other indoor scenes. For example, "chair," "table," and "sofa" often appear in the living room; "lamp," "bed," and "pillow" are the representative objects of the bedroom; and "wall," "floor," and "ceiling" are the basic objects of indoor scenes. Another reason is to demonstrate the superiority of the proposed network. So, some challenging objects are selected, including many objects that are difficult to detect. For example, "bottle" and "faucet" tend to be small objects in pictures; "mirror" is easily affected by light reflection; and objects such as "lamp" and "plant" have intraclass differences, e.g., the same "lamp" category may have different colors and shapes.

With regard to evaluation metrics, mean average precision (mAP) is used as the main evaluation criterion, which is often used in multiobject detection tasks. mAP can be calculated by the following:

$$\mathrm{mAP} = \frac{\sum_{i=1}^{k} \mathrm{AP}_i}{k} \quad (11)$$

where $k$ refers to the number of detected object categories ($k = 27$ in this article); $\mathrm{AP}_i$ is the area under the precision–recall curve for the $i$th category (see [45] and [65] for details).

In this study, we also evaluate the detection speed of the method by the detection time per image (defined as $D_t$), which is used as the auxiliary evaluation metric. $D_t$ is defined as follows:

$$D_t = \frac{T}{\mathrm{Num}} \quad (12)$$

where $T$ refers to the total time taken to run the network on the validation set after the training is completed, and $\mathrm{Num}$ is the number of images in the validation set ($\mathrm{Num} = 2133$ in this study).
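For reference, a small sketch of how (11) and (12) can be computed. The all-point interpolation of the precision–recall curve follows the Pascal VOC convention [65]; this is an assumption, since the excerpt does not state the exact AP protocol used.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve (all-point interpolation, as in
    the Pascal VOC toolkit [65]); inputs are per-detection arrays sorted by
    descending confidence."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    for i in range(p.size - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas wherever recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(per_class_ap):
    """Eq. (11): mAP over the k = 27 evaluated categories."""
    return sum(per_class_ap) / len(per_class_ap)

def detection_time_per_image(total_time_s, num_images=2133):
    """Eq. (12): D_t = T / Num over the validation set."""
    return total_time_s / num_images
```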
B. Experimental Settings and Implementation Details

The experiments are conducted on a computer with the Windows 10 operating system. The proposed deep network is implemented on the PyTorch 1.7 framework with Python 3.7. The parameters of the improved indoor object detection deep network and the computer configurations for the experiments are listed in Table I. In the object detection experiments, considering the size of the six feature maps generated by the feature extraction module, some atrous convolutions with larger atrous rates are unnecessary, because their receptive fields would far exceed the size of the feature map. So, the unnecessary atrous convolutions in the MCIE module are removed, to preserve the effective contextual information and improve the quality of the network. During the experiment, the parameters of the MCIE module are listed in Table II. The ratio of the training set to the validation set in the whole dataset is 7:3. The image size of the input for the deep network is 300 × 300. The initial learning rate of the proposed …
TABLE II
PARAMETER SETTING OF THE MCIE MODULE IN THE PROPOSED DEEP NETWORK
Fig. 13. Some detection results of different methods on the SUN2012 dataset.
The results in Fig. 13 show that objects such as occluded chairs, pillows, and cushions can be detected using our method. In addition, in the last three columns of the sixth row in Fig. 13, our method achieves better detection results for small objects, such as faucets, bottles, and distant microwaves. The results in these comparison experiments show that the improved method proposed in this article can get the best comprehensive results for object detection in indoor scenes.

V. ABLATION STUDY

The comprehensive performance of the improved method proposed in this article has been tested and validated through the comparative experiments in Section IV. To evaluate the performance of the key improvements in the proposed deep network-based method, some ablation experiments are conducted in this section. These key improvements include the feature extraction network, the MCIE module, and the DT-NMS algorithm.

A. On the Feature Extraction Module

The feature extraction module of the proposed network based on the SSD framework consists of a backbone and five additional layers. As we know, the backbone of the traditional SSD network is the visual geometry group network-16 (VGG16) [37], [66]. Its structure is a straight tube type, which has five sets of convolutional layers and three fully connected layers, using a maximum pooling layer in each group to reduce the spatial dimension. Recently, the ResNet network was presented to balance detection accuracy and speed [67], [68], and the most commonly used networks in object detection are ResNet50 and ResNet101. In order to verify the superiority of the improved backbone network for feature extraction, the backbone of the SSD framework in the proposed deep network-based method is changed into VGG16, ResNet50, ResNet101, and the improved ResNet101, respectively. Here, the improved ResNet101 means that the same improvements made to ResNet50 in this article are also applied to ResNet101 (see Section III-A for details). The total network structure of these methods is the same as that of the proposed method (see Fig. 2). The ablation experimental results are shown in Table IV.

The results in Table IV show that, in complex indoor scenes, the detection network using VGG16 as the backbone of the feature extraction network has a poor feature extraction ability for objects, and it cannot show a good object detection performance. The difference between ResNet101 and ResNet50 lies only in the number of residual units. ResNet50 has 16 residual units, and ResNet101 has 33 residual units, which means that the ResNet101 network extracts high-dimensional features and …
TABLE IV
ABLATION EXPERIMENTAL RESULTS ON THE FEATURE EXTRACTION MODULE USING DIFFERENT BACKBONE NETWORKS

TABLE V
ABLATION EXPERIMENTAL RESULTS ON THE MCIE MODULE

TABLE VI
ABLATION EXPERIMENTAL RESULTS ON THE DT-NMS ALGORITHM
TABLE VII
COMPARISON EXPERIMENTAL RESULTS USING DIFFERENT PARAMETER SETTINGS IN THE MCIE MODULE
Fig. 14. Experimental results of the joint parameter adjustment method for Ni and Nt.
A. About the Settings of the MCIE Module

To discuss the setting of the MCIE module in the proposed method, five comparative experiments are conducted. The purpose of these experiments is to analyze the performance of the proposed MCIE module, while verifying the validity of the atrous rates of the atrous convolutions used in the MCIE module by adding or removing atrous convolution branches. The comparison results of different settings of the MCIE module are listed in Table VII.
It can be seen from the experimental results in Table VII that, when using setting1, a large amount of useless redundant information is extracted by using the same parallel atrous convolution branch for all six feature maps, which does not improve the detection effect, and the time to detect each image is also the longest. When the atrous convolution branches in the MCIE module decrease, the detection speed accelerates continuously, and the mAP changes. With setting3 of the atrous rates for the atrous convolutions in the MCIE module, the deep network has the best effect on indoor object detection. However, when the atrous convolution branches in the MCIE module are further reduced, the contextual information is not adequately extracted, and the mAP of the detection is sharply reduced. Therefore, to comprehensively consider the effect and speed of the indoor object detection, the appropriate atrous rates of the atrous convolutions in the MCIE module are set as setting3 in this study (see Table II).
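Table II's concrete atrous rates for setting3 do not survive this extraction, so the rates in the sketch below are placeholders. It only illustrates the kind of parallel atrous-convolution context block discussed above [51]: one branch per rate over the same feature map, with the outputs fused back to the input width.

```python
import torch
import torch.nn as nn

class AtrousContextBlock(nn.Module):
    """Illustrative multiscale context block: parallel 3x3 atrous
    convolutions with different rates, concatenated and fused by a 1x1
    convolution. The actual MCIE rates per scale (Table II, setting3)
    are not reproduced here; `rates` is a placeholder argument."""
    def __init__(self, channels, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r,
                          bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        # 1x1 fusion back to the input channel count.
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```

Dropping the larger rates for the smaller feature maps, as described in Section IV-B, amounts to shortening `rates` for those scales so that no branch's receptive field exceeds the map size.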
B. About the Settings of the Thresholds in the Improved DT-NMS

There are two thresholds used in the improved DT-NMS, namely, $N_i$ and $N_t$ [see (7)]. In order to choose a suitable value for each threshold, the joint parameter adjustment method is used to determine the final thresholds in this study, and some experiments are conducted under the same conditions introduced in Section IV-B. In these experiments, the structure of the proposed model is kept unchanged, except for choosing different threshold values. In this study, the value of $N_t$ is increased by 0.05 from 0.3 to 0.5 according to the related reference [58]. Because the threshold $N_i$ in the improved DT-NMS algorithm is used to directly suppress the repeated detection boxes with DIoU greater than $N_i$, its value is set between 0.8 and 0.99 in the joint parameter adjustment method. The experimental results are shown in Fig. 14. Here, only the metric mAP is used as the judgment condition. The results in Fig. 14 show that better detection results can be obtained when $N_t = 0.4$ than with other values of $N_t$. Based on this experiment, we set $N_t = 0.4$ and $N_i = 0.95$, at which the maximum mAP is reached.
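The joint parameter adjustment described above amounts to a two-dimensional grid search; a minimal sketch, assuming a hypothetical `evaluate_map(nt, ni)` function that runs the fixed, trained detector over the validation set with the given thresholds:

```python
import numpy as np

def joint_threshold_search(evaluate_map):
    """Grid-search N_t in {0.30, 0.35, ..., 0.50} and N_i in [0.80, 0.99],
    keeping the pair with the highest validation mAP. The inner step of
    0.01 is an assumption; the text only gives the range for N_i."""
    best = (None, None, -1.0)
    for nt in np.arange(0.30, 0.501, 0.05):
        for ni in np.arange(0.80, 0.991, 0.01):
            m = evaluate_map(round(float(nt), 2), round(float(ni), 2))
            if m > best[2]:
                best = (round(float(nt), 2), round(float(ni), 2), m)
    return best  # the article reports N_t = 0.4, N_i = 0.95 as the optimum
```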
VII. CONCLUSION

In this article, the object detection problem in indoor scenes based on deep learning networks is studied, and an improved SSD-like deep network-based method is proposed. In the proposed method, more enriched features are extracted using an improved ResNet50 residual network; then, the MCIE module is adopted to extract multiscale contextual features, and six different scale feature maps containing multiscale contextual information are used to detect the objects in the indoor scene. In addition, to improve the accuracy and efficiency of deep learning-based methods in indoor object detection, an improved DT-NMS algorithm is used as the postprocessing method. Finally, various comparative experiments are performed, and the experimental results show that the proposed indoor object detection method outperforms the state-of-the-art deep learning-based methods in object detection tasks in indoor scenes.

In future work, further research on small objects in indoor scenes should be conducted, and the object detection problem under some difficult situations, such as low-light and reflective conditions, should be specifically studied. On the other hand, the real-time performance of indoor object detection should also be further considered. In addition, the public datasets for 2-D object detection in indoor scenes should be enriched to get more effective detection results.

REFERENCES

[1] G. Wang, Y. Hu, X. Wu, and H. Wang, "Residual 3-D scene flow learning with context-aware feature extraction," IEEE Trans. Instrum. Meas., vol. 71, pp. 1–9, 2022.
[2] S. Liu, G. Tian, Y. Zhang, M. Zhang, and S. Liu, "Service planning oriented efficient object search: A knowledge-based framework for home service robot," Expert Syst. Appl., vol. 187, Jan. 2022, Art. no. 115853.
[3] K. Wang, X. Li, J. Yang, J. Wu, and R. Li, "Temporal action detection based on two-stream you only look once network for elderly care service robot," Int. J. Adv. Robotic Syst., vol. 18, no. 4, Jul. 2021, Art. no. 172988142110383.
[4] P. Nazemzadeh, F. Moro, D. Fontanelli, D. Macii, and L. Palopoli, "Indoor positioning of a robotic walking assistant for large public environments," IEEE Trans. Instrum. Meas., vol. 64, no. 11, pp. 2965–2976, Nov. 2015.
[5] R. Ma, Z. Zhang, and E. Chen, "Human motion gesture recognition based on computer vision," Complexity, vol. 2021, Feb. 2021, Art. no. 6679746.
[6] S. Cui, R. Wang, J. Hu, C. Zhang, L. Chen, and S. Wang, "Self-supervised contact geometry learning by GelStereo visuotactile sensing," IEEE Trans. Instrum. Meas., vol. 71, pp. 1–9, 2022.
[7] Z. Huang, C. Lv, Y. Xing, and J. Wu, "Multi-modal sensor fusion-based deep neural network for end-to-end autonomous driving with scene understanding," IEEE Sensors J., vol. 21, no. 10, pp. 11781–11790, May 2021.
[8] Y. Cheng, Y. Yang, H.-B. Chen, N. Wong, and H. Yu, "S3-Net: A fast scene understanding network by single-shot segmentation for autonomous driving," ACM Trans. Intell. Syst. Technol., vol. 12, no. 5, pp. 1–19, Oct. 2021.
[9] J. Fan, P. Zheng, and S. Li, "Vision-based holistic scene understanding towards proactive human–robot collaboration," Robot. Comput.-Integr. Manuf., vol. 75, Jun. 2022, Art. no. 102304.
[10] X. Zhao, T. Zuo, and X. Hu, "OFM-SLAM: A visual semantic SLAM for dynamic indoor environments," Math. Problems Eng., vol. 2021, Apr. 2021, Art. no. 5538840.
[11] M. Y. Moemen, H. Elghamrawy, S. N. Givigi, and A. Noureldin, "3-D reconstruction and measurement system based on multimobile robot machine vision," IEEE Trans. Instrum. Meas., vol. 70, pp. 1–9, 2021.
[12] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[13] D. D'Souza and R. V. Yampolskiy, "Baseline avatar face detection using an extended set of Haar-like features," in Proc. 23rd Midwest Artif. Intell. Cogn. Sci. Conf., Cincinnati, OH, USA, vol. 841, Apr. 2012, pp. 72–77.
[14] R. Matsumura and A. Hanazawa, "Human detection using color contrast-based histograms of oriented gradients," Int. J. Innov. Comput., Inf. Control, vol. 15, no. 4, pp. 1211–1222, 2019.
[15] L. Zheng, T. Zhou, R. Jiang, and Y. Peng, "Survey of video object detection algorithms based on deep learning," in Proc. 4th Int. Conf. Algorithms, Comput. Artif. Intell., Sanya, China, Dec. 2021, pp. 1–6.
[16] J. Akilandeswari, G. Jothi, A. Naveenkumar, R. S. Sabeenian, P. Iyyanar, and M. E. Paramasivam, "Design and development of an indoor navigation system using denoising autoencoder based convolutional neural network for visually impaired people," Multimedia Tools Appl., vol. 81, no. 3, pp. 3483–3514, Jan. 2022.
[17] B. Anbarasu and G. Anitha, "Indoor scene recognition for micro aerial vehicles navigation using enhanced-GIST descriptors," Defence Sci. J., vol. 68, no. 2, pp. 129–137, 2018.
[18] F. Vittori, I. Pigliautile, and A. L. Pisello, "Subjective thermal response driving indoor comfort perception: A novel experimental analysis coupling building information modelling and virtual reality," J. Building Eng., vol. 41, Sep. 2021, Art. no. 102368.
[19] Y. Li, Z. Zhang, Y. Cheng, L. Wang, and T. Tan, "MAPNet: Multimodal attentive pooling network for RGB-D indoor scene classification," Pattern Recognit., vol. 90, pp. 436–449, Jun. 2019.
[20] T. Ran, L. Yuan, and J. B. Zhang, "Scene perception based visual navigation of mobile robot in indoor environment," ISA Trans., vol. 109, pp. 389–400, Mar. 2021.
[21] J. Jiang, P. Liu, Z. Ye, W. Zhao, and X. Tang, "A hierarchical inferential method for indoor scene classification," Int. J. Appl. Math. Comput. Sci., vol. 27, no. 4, pp. 839–852, Dec. 2017.
[22] P. Ji, D. Qin, P. Feng, T. Lan, G. Sun, and M. J. Khan, "Research on indoor scene classification mechanism based on multiple descriptors fusion," Mobile Inf. Syst., vol. 2020, Mar. 2020, Art. no. 4835198.
[23] A. Mosella-Montoro and J. Ruiz-Hidalgo, "2D–3D geometric fusion network using multi-neighbourhood graph convolution for RGB-D indoor scene classification," Inf. Fusion, vol. 76, pp. 46–54, Dec. 2021.
[24] A. Glavan and E. Talavera, "InstaIndoor and multi-modal deep learning for indoor scene recognition," Neural Comput. Appl., vol. 34, no. 9, pp. 6861–6877, May 2022.
[25] B. Yang, J. Li, and H. Zhang, "Resilient indoor localization system based on UWB and visual–inertial sensors for complex environments," IEEE Trans. Instrum. Meas., vol. 70, 2021, Art. no. 8504014.
[26] Q. Fu, H. Fu, Z. Deng, and X. Li, "Indoor layout programming via virtual navigation detectors," Sci. China Inf. Sci., vol. 65, no. 8, Aug. 2022.
[27] J. Li, W. Gao, Y. Wu, Y. Liu, and Y. Shen, "High-quality indoor scene 3D reconstruction with RGB-D cameras: A brief review," Comput. Vis. Media, vol. 8, no. 3, pp. 369–393, Sep. 2022.
[28] Y. Sun, Y. Miao, J. Chen, and R. Pajarola, "PGCNet: Patch graph convolutional network for point cloud segmentation of indoor scenes," Vis. Comput., vol. 36, nos. 10–12, pp. 2407–2418, Oct. 2020.
[29] W. Zhang, Q. Zhang, W. Zhang, J. Gu, and Y. Li, "From edge to keypoint: An end-to-end framework for indoor layout estimation," IEEE Trans. Multimedia, vol. 23, pp. 4483–4490, 2021.
[30] L. Huan, X. Zheng, and J. Gong, "GeoRec: Geometry-enhanced semantic 3D reconstruction of RGB-D indoor scenes," ISPRS J. Photogramm. Remote Sens., vol. 186, pp. 301–314, Apr. 2022.
[31] J. Ni, K. Shen, Y. Chen, W. Cao, and S. X. Yang, "An improved deep network-based scene classification method for self-driving cars," IEEE Trans. Instrum. Meas., vol. 71, pp. 1–14, 2022.
[32] C. Dai et al., "Video scene segmentation using tensor-train faster-RCNN for multimedia IoT systems," IEEE Internet Things J., vol. 8, no. 12, pp. 9697–9705, Jun. 2021.
[33] M. Sogabe, N. Ito, T. Miyazaki, T. Kawase, T. Kanno, and K. Kawashima, "Detection of instruments inserted into eye in cataract surgery using single-shot multibox detector," Sensors Mater., vol. 34, no. 1, pp. 47–54, 2022.
[34] X. Lu, J. Ji, Z. Xing, and Q. Miao, "Attention and feature fusion SSD for remote sensing object detection," IEEE Trans. Instrum. Meas., vol. 70, pp. 1–9, 2021.
[35] Z. Li et al., "Toward efficient safety helmet detection based on YOLOv5 with hierarchical positive sample selection and box density filtering," IEEE Trans. Instrum. Meas., vol. 71, 2022, Art. no. 2508314.
[36] D. Zheng et al., "A defect detection method for rail surface and fasteners based on deep convolutional neural network," Comput. Intell. Neurosci., vol. 2021, Aug. 2021, Art. no. 2565500.
[37] J. Ni, Y. Chen, Y. Chen, J. Zhu, D. Ali, and W. Cao, "A survey on theories and applications for self-driving cars based on deep learning methods," Appl. Sci., vol. 10, no. 8, p. 2749, Apr. 2020.
[38] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Venice, Italy, Oct. 2017, pp. 2999–3007.
[39] Y. Shi, Y. Guo, Z. Mi, and X. Li, "Stereo CenterNet-based 3D object detection for autonomous driving," Neurocomputing, vol. 471, pp. 219–229, Jan. 2022.
[40] M. Afif, R. Ayachi, E. Pissaloux, Y. Said, and M. Atri, "Indoor objects detection and recognition for an ICT mobility assistance of visually impaired people," Multimedia Tools Appl., vol. 79, nos. 41–42, pp. 31645–31662, Nov. 2020.
[41] M. Chen, X. Ren, and Z. Yan, "Real-time indoor object detection based on deep learning and gradient harmonizing mechanism," in Proc. IEEE 9th Data Driven Control Learn. Syst. Conf. (DDCLS), Liuzhou, China, Nov. 2020, pp. 772–777.
[42] X. Ding, B. Li, and J. Wang, "Geometric property-based convolutional neural network for indoor object detection," Int. J. Adv. Robotic Syst., vol. 18, no. 1, Jan. 2021, Art. no. 172988142199332.
[43] E. U. Samani, X. Yang, and A. G. Banerjee, "Visual object recognition in indoor environments using topologically persistent features," IEEE Robot. Autom. Lett., vol. 6, no. 4, pp. 7509–7516, Oct. 2021.
[44] F. Alamri and N. Pugeault, "Contextual relabelling of detected objects," in Proc. Joint IEEE 9th Int. Conf. Develop. Learn. Epigenetic Robot. (ICDL-EpiRob), Oslo, Norway, Aug. 2019, pp. 313–319.
[45] W. Ma, Q. Guo, Y. Wu, W. Zhao, X. Zhang, and L. Jiao, "A novel multi-model decision fusion network for object detection in remote sensing images," Remote Sens., vol. 11, p. 737, Apr. 2019.
[46] W. Zhang, C. Fu, H. Xie, M. Zhu, M. Tie, and J. Chen, "Global context aware RCNN for object detection," Neural Comput. Appl., vol. 33, no. 18, pp. 11627–11639, Sep. 2021.
[47] F. Wang and J. T. C. Tan, "The comparison of light compensation method and CycleGAN method for deep learning based object detection of mobile robot vision under inconsistent illumination conditions in virtual environment," in Proc. IEEE 9th Annu. Int. Conf. Cyber Technol. Autom., Control, Intell. Syst. (CYBER), Suzhou, China, Jul. 2019, pp. 271–276.
[48] L. Huang et al., "Multi-scale feature fusion convolutional neural network for indoor small target detection," Frontiers Neurorobot., vol. 16, May 2022, Art. no. 881021.
[49] D. Zhang et al., "Indoor target detection and pose estimation based on point cloud," in Proc. 3rd Int. Conf. Comput., Commun. Mechatronics Eng., Dec. 2021, vol. 2218, no. 1, doi: 10.1088/1742-6596/2218/1/012037.
[50] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 770–778.
[51] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Apr. 2017.
[52] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. 32nd Int. Conf. Mach. Learn. (ICML), Lille, France, vol. 1, Jul. 2015, pp. 448–456.
[53] G. Chen et al., "A survey of the four pillars for small object detection: Multiscale representation, contextual information, super-resolution, and region proposal," IEEE Trans. Syst., Man, Cybern., Syst., vol. 52, no. 2, pp. 936–953, Feb. 2022.
[54] S. Wang, J. Cheng, H. Liu, F. Wang, and H. Zhou, "Pedestrian detection via body part semantic and contextual information with DNN," IEEE Trans. Multimedia, vol. 20, no. 11, pp. 3148–3159, Nov. 2018.
[55] W.-S. Zheng, S. Gong, and T. Xiang, "Quantifying and transferring contextual information in object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 762–777, Apr. 2012.
[56] S. Liu, D. Huang, and Y. Wang, "Receptive field block net for accurate and fast object detection," in Proc. 15th Eur. Conf. Comput. Vis. (Lecture Notes in Computer Science), vol. 11215. Munich, Germany: Springer, Sep. 2018, pp. 404–419.
[57] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 2818–2826.
[58] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, "Soft-NMS—Improving object detection with one line of code," in Proc. 16th IEEE Int. Conf. Comput. Vis., Venice, Italy, Oct. 2017, pp. 5562–5570.
[59] M. Gong, D. Wang, X. Zhao, H. Guo, D. Luo, and M. Song, "A review of non-maximum suppression algorithms for deep learning target detection," Proc. SPIE, vol. 11763, Mar. 2021, Art. no. 1176332.
[60] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, "Distance-IoU loss: Faster and better learning for bounding box regression," in Proc. 34th AAAI Conf. Artif. Intell. (AAAI), New York, NY, USA, Feb. 2020, pp. 12993–13000.
[61] D. Yuan, X. Shu, N. Fan, X. Chang, Q. Liu, and Z. He, "Accurate bounding-box regression with distance-IoU loss for visual tracking," J. Vis. Commun. Image Represent., vol. 83, Feb. 2022, Art. no. 103428.
[62] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Miami, FL, USA, Jun. 2009, pp. 413–420.
[63] Z. Lu and Y. Chen, "Pyramid frequency network with spatial attention residual refinement module for monocular depth estimation," J. Electron. Imag., vol. 31, no. 2, Mar. 2022, Art. no. 023005.
[64] L. Benrais and N. Baha, "High level visual scene classification using background knowledge of objects," Multimedia Tools Appl., vol. 81, no. 3, pp. 3663–3692, Jan. 2022.
[65] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Sep. 2009.
[66] J. Yang, W. Y. He, T. L. Zhang, C. L. Zhang, L. Zeng, and B. F. Nan, "Research on subway pedestrian detection algorithms based on SSD model," IET Intell. Transp. Syst., vol. 14, no. 11, pp. 1491–1496, Nov. 2020.
[67] Y. Yang, H. Wang, D. Jiang, and Z. Hu, "Surface detection of solid wood defects based on SSD improved with ResNet," Forests, vol. 12, no. 10, p. 1419, Oct. 2021.
[68] X. Gao, J. Xu, C. Luo, J. Zhou, P. Huang, and J. Deng, "Detection of lower body for AGV based on SSD algorithm with ResNet," Sensors, vol. 22, no. 5, p. 2008, Mar. 2022.
[69] Y. Gong, X. Yu, Y. Ding, X. Peng, J. Zhao, and Z. Han, "Effective fusion factor in FPN for tiny object detection," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Waikoloa, HI, USA, Jan. 2021, pp. 1159–1167.

Jianjun Ni (Senior Member, IEEE) received the Ph.D. degree from the School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, China, in 2005.
He was a Visiting Professor with the Advanced Robotics and Intelligent Systems (ARIS) Laboratory, University of Guelph, Guelph, ON, Canada, from 2009 to 2010. He is currently a Professor with the College of Internet of Things Engineering, Hohai University, Changzhou, China. He has authored or coauthored over 100 papers in related international conferences and journals. His research interests include control systems, neural networks, robotics, machine intelligence, and multiagent systems.
Dr. Ni serves as an associate editor and a reviewer for a number of international journals.

Kang Shen received the B.S. degree from Hohai University, Changzhou, China, in 2020, where she is currently pursuing the M.S. degree with the Department of Detection Technology and Automatic Equipment, College of Internet of Things Engineering.
Her research interests include self-driving, robot control, and machine learning.

Yan Chen received the B.S. degree from Hohai University, Changzhou, China, in 2017, where he is currently pursuing the Ph.D. degree in Internet of Things (IoT) technology and application with the College of Internet of Things Engineering.
His research interests include simultaneous localization and mapping, robot control, and machine learning.

Simon X. Yang (Senior Member, IEEE) received the B.Sc. degree in engineering physics from Beijing University, Beijing, China, in 1987, the first M.Sc. degree in biophysics from the Chinese Academy of Sciences, Beijing, in 1990, the second M.Sc. degree in electrical engineering from the University of Houston, Houston, TX, USA, in 1996, and the Ph.D. degree in electrical and computer engineering from the University of Alberta, Edmonton, AB, Canada, in 1999.
He is currently a Professor and the Head of the Advanced Robotics and Intelligent Systems Laboratory, University of Guelph, Guelph, ON, Canada. His research interests include robotics, intelligent systems, sensors and multisensor fusion, wireless sensor networks, control systems, transportation, and computational neuroscience.
Dr. Yang has been very active in professional activities. He was the General Chair of the 2011 IEEE International Conference on Logistics and Automation and the Program Chair of the 2015 IEEE International Conference on Information and Automation. He serves as an Editor-in-Chief for the International Journal of Robotics and Automation and an Associate Editor for the IEEE TRANSACTIONS ON CYBERNETICS and several other journals.