
Hindawi
Wireless Communications and Mobile Computing
Volume 2021, Article ID 3279059, 9 pages
https://doi.org/10.1155/2021/3279059

Research Article
Image-Based Indoor Localization Using Smartphone Camera

Shuang Li,1,2 Baoguo Yu,1 Yi Jin,3 Lu Huang,1,2 Heng Zhang,1,2 and Xiaohu Liang1,2

1 State Key Laboratory of Satellite Navigation System and Equipment Technology, China
2 Southeast University, China
3 Beijing Jiaotong University, China

Correspondence should be addressed to Yi Jin; [email protected]

Received 17 April 2021; Revised 30 May 2021; Accepted 20 June 2021; Published 5 July 2021

Academic Editor: Mohammad R. Khosravi

Copyright © 2021 Shuang Li et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

With the increasing demand for location-based services in environments such as railway stations, airports, and shopping malls, indoor positioning
technology has become one of the most attractive research areas. Due to the effects of multipath propagation, wireless indoor
localization methods such as WiFi, Bluetooth, and pseudolite have difficulty achieving high-precision positioning. In this work, we
present an image-based localization approach which can obtain the position simply by taking a picture of the surrounding
environment. This paper proposes a novel approach which classifies different scenes based on deep belief networks and solves
the camera position from several spatial reference points extracted from depth images using the perspective-n-point algorithm. To
evaluate the performance, experiments are conducted on public data and real scenes; the results demonstrate that our approach
can achieve submeter positioning accuracy. Compared with other methods, image-based indoor localization does not
require infrastructure and has a wide range of applications, including self-driving, robot navigation, and augmented reality.

1. Introduction

According to statistics, more than 80 percent of people's time is spent in indoor environments such as shopping malls, airports, libraries, campuses, and hospitals. The purpose of an indoor localization system is to provide accurate positions inside large buildings. It is vital to applications such as the evacuation of trapped people at fire scenes, the tracking of valuable assets, and indoor service robots. For these applications to be widely accepted, indoor localization requires an accurate and reliable position estimation scheme [1].

In order to provide a stable indoor location service, a large number of technologies have been researched, including pseudolite, Bluetooth, ultrasonic, WiFi, ultra-wideband, and LED [2, 3]. It is almost impossible for a radio-based approach to obtain very accurate results from arrival-time and arrival-angle measurements because of multipath interference. The time-varying indoor environment and the movement of pedestrians also have adverse effects on the stability of fingerprint information [4–6]. In addition, the high cost of hardware equipment, construction, and installation, as well as maintenance and updates, is another important factor limiting the development of indoor positioning technology. Besides, these kinds of methods can only output the position (X, Y, and Z coordinates) but not the view angle (pitch, yaw, and roll angles).

The vision-based positioning method is a kind of passive positioning technology which can achieve high positioning accuracy and does not need extra infrastructure. Moreover, it can output not only the position but also the view angle at the same time. Therefore, it has gradually become a hotspot of indoor positioning technology [7, 8]. Such methods typically involve four steps: first, establishing an indoor image dataset collected by depth cameras with exact positional information; second, comparing the image collected by the user's camera to the images in the database established in the first step; third, retrieving the most similar pictures, then extracting features and matching points; and finally, solving the perspective-n-point problem [9–12]. However, the application of scene recognition to mobile localization implies several challenges [13–15]. The complex three-dimensional shape of the environment results in occlusions, overlaps, shadows, and reflections, which require a robust description of the scene [16]. To address these issues,

we propose a particularly efficient approach based on a deep belief network with local binary pattern feature descriptors. It enables us to find the most similar pictures quickly. In addition, we restrict the search space according to adaptive visibility constraints, which allows us to cope with extensive maps.

2. Related Work

Before presenting the proposed approach, we review previous work on image-based localization methods and roughly divide these methods into three categories.

Manual mark-based localization methods. Approaches that rely completely on the natural features of the image lack robustness, especially under conditions of varying illumination. In order to improve the robustness and accuracy of the reference points, special coded marks are used to meet the higher positioning requirements of the system. There are three benefits: they simplify the automatic detection of corresponding points, introduce scale into the system, and distinguish and identify targets by using a unique code for each mark. Common types of marks include concentric rings, QR codes, or patterns composed of colored dots. The advantage is a higher recognition rate and an effective reduction of the complexity of the positioning method. The disadvantages are that the installation and maintenance costs are high, some targets are easily obstructed, and the scope of application is limited [17, 18].

Natural mark-based localization methods usually detect objects in the image and match them with an existing building database. The database contains the location information of the natural marks in the building. The advantage of this method is that it does not require additional local infrastructure. In other words, the reference object is actually a series of digital reference points (control points in photogrammetry) in the database. Therefore, this type of system is suitable for large-scale coverage without adding too much cost. The disadvantages are that the recognition algorithm is complex and easily affected by the environment, the characteristics change easily, and the dataset needs to be updated [19–22].

Learning-based localization methods have emerged in the past few years. These are end-to-end methods that directly obtain the 6-DOF pose and have been proposed to solve loop-closure detection and pose estimation [23]. This kind of method does not require feature extraction, feature matching, or complex geometric calculations and is intuitive and concise. It is robust to weak textures, repeated textures, motion blur, and lighting changes. However, the computational scale in the training phase is very large, GPU servers are usually required, and such models cannot run smoothly on mobile platforms [20]. In many scenarios, learning-based features are not as effective as traditional features such as SIFT, and their interpretability is poor [24–27].

3. Framework and Method

In this section, we first introduce an overview of the framework. Then, the key modules are explained in more detail in the subsequent sections.

3.1. Framework Overview. The whole pipeline of the visual localization system is shown in Figure 1. In the following, we briefly provide an overview of our system.

In the offline stage, RGB-D cameras are used to collect enough RGB images and depth images of the indoor environment. At the same time, the pose of the camera and the 3D point cloud are constructed. The RGB images are used as a learning dataset to train the network model, and the network model parameters are saved once the loss function value no longer decreases. In the online stage, a user entering the room downloads the trained network model parameters to the mobile phone and takes a picture with the phone; the most similar image is then identified by the deep learning network. Unmatched points are eliminated, and the pixel coordinates of the matched points and the depths of the corresponding points are extracted. According to the pinhole imaging model, the perspective-n-point solver can be used to calculate the pose of the mobile phone in the world coordinate system. Finally, the pose is converted into a real position and displayed on the map.

3.2. Camera Calibration and Image Correction. Due to the processing and installation errors of the camera lens, the image exhibits radial distortion and tangential distortion. Therefore, we must calibrate the camera and correct the images in the preprocessing stage. The checkerboard contains a set of calibration reference points, and the coordinates of each point are disturbed by the same noise. We establish the objective function γ:

γ = \sum_{i=1}^{n} \sum_{j=1}^{m} \left\| p_{ij} - \hat{p}\left(A, R_i, t_i, P_j\right) \right\|^2,   (1)

where p_{ij} is the coordinate of the projection point on image i of reference point j in three-dimensional space, A is the camera intrinsic parameter matrix, R_i and t_i are the rotation and translation vectors of image i, P_j is the three-dimensional coordinate of reference point j in the world coordinate system, and \hat{p}(A, R_i, t_i, P_j) is the corresponding predicted two-dimensional coordinate in the image coordinate system.

3.3. Scene Recognition. In this section, we use the deep belief network (DBN) to categorize the different indoor scenes. The framework includes image preprocessing, LBP feature extraction, DBN training, and scene classification.

3.3.1. Local Binary Pattern. The improved LBP feature is insensitive to rotation and illumination changes. The LBP operator can be described as follows: the gray value of the window's center pixel is taken as the threshold, and the gray values of the surrounding 8 pixels are compared with this threshold in a clockwise direction; if the gray value is greater than or equal to the threshold, the pixel is marked as 1, otherwise as 0. The comparisons yield an 8-bit binary number, which after conversion to decimal gives the LBP value of the center pixel in this window. The value reflects the texture information at this position. The calculation process is shown in Figure 2.
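To make the thresholding procedure above concrete, here is a minimal Python/NumPy sketch (not from the paper; the window values follow the example in Figure 2, and the bit ordering is one possible convention) that computes the basic 3 × 3 LBP code of a pixel.

```python
import numpy as np

def lbp_code(img, r, c):
    """Basic 3x3 LBP code of pixel (r, c) in a grayscale image.

    The 8 neighbours are compared clockwise against the centre value;
    each comparison contributes one bit, and the resulting 8-bit binary
    number is returned as a decimal LBP value.
    """
    center = img[r, c]
    # Clockwise neighbour offsets, starting from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr, c + dc] >= center:   # s(x) = 1 if x >= 0, else 0
            code |= 1 << bit
    return code

# Toy 3x3 window of the kind shown in Figure 2 (centre value 5 is the threshold).
window = np.array([[1, 2, 2],
                   [9, 5, 6],
                   [5, 3, 1]], dtype=np.uint8)
print(lbp_code(window, 1, 1))
```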
Figure 1: The framework of the visual localization system, covering the offline stage (image capture, establishment of the indoor image library, model training) and the online stage (scene recognition, feature point extraction and matching, camera pose solving).

Figure 2: Local binary pattern calculation process (a 3 × 3 window, e.g. [1 2 2; 9 5 6; 5 3 1], is thresholded at its center value).

The formula of the local binary pattern is

LBP(x_c, y_c) = \sum_{n=0}^{N-1} 2^n \, s(i_n - i_c),
s(x) = \begin{cases} 1, & \text{if } x \ge 0, \\ 0, & \text{else,} \end{cases}   (2)

where (x_c, y_c) are the horizontal and vertical coordinates of the center pixel, N is the number of neighborhood pixels (8 here), i_c and i_n are the gray values of the center pixel and the n-th neighborhood pixel, respectively, and s(·) is the two-valued symbol function.

The earliest LBP operator can only cover a small image range, so optimization and improvement methods for the LBP operator have been continually proposed. We adopt the method which addresses the limited window size of the original LBP operator by replacing the traditional square neighborhood with a circular neighborhood and expanding the window size, as shown in Figure 3.

In order to give the LBP operator rotation invariance, the circular neighborhood is rotated clockwise to obtain a series of binary strings, the minimum binary value is taken, and this value is then converted to decimal, which is the LBP value of the point. The process of obtaining the rotation-invariant LBP operator is shown in Figure 4.

3.3.2. Deep Belief Network. The deep belief network consists of multiple restricted Boltzmann machines (RBMs) and a backpropagation (BP) neural network. The Boltzmann machine is a neural network based on learning rules. It consists of a visible layer and a hidden layer; the neurons in the same layer and the neurons in different layers are all connected to each other. There are two types of neuron output states, active and inactive, represented by 1 and 0. The advantage of the Boltzmann machine is its powerful unsupervised learning ability, which can learn complex rules from a large amount of data; the disadvantages are the huge amount of computation and the long training time. The restricted Boltzmann machine removes the connections between neurons in the same layer, so that the hidden units are mutually independent given the visible units and vice versa. Roux and Bengio theoretically proved that as long as the number of hidden neurons and training samples is sufficient, an arbitrary discrete distribution can be fitted. The structure of the BM and the RBM is shown in Figure 5.

The joint configuration energy of the visible and hidden layers is defined as

E(v, h \mid \theta) = -\sum_{i=1}^{m} b_i v_i - \sum_{j=1}^{n} c_j h_j - \sum_{i=1}^{m} \sum_{j=1}^{n} v_i w_{ij} h_j,   (3)

where θ = {w_ij, b_i, c_j} are the parameters of the RBM, b_i is the bias of visible unit i, c_j is the bias of hidden unit j, and w_ij is the connection weight. The output of hidden unit j is

h_j = c_j + \sum_{i=1}^{m} v_i w_{ij}.   (4)

When the parameters are known, based on the above energy function, the joint probability distribution of (v, h) is

P(v, h \mid \theta) = \frac{e^{-E(v, h \mid \theta)}}{Z(\theta)}, \qquad Z(\theta) = \sum_{v, h} e^{-E(v, h \mid \theta)},   (5)

where Z(θ) is the normalization factor. The distribution of v, P(v | θ), is obtained by marginalizing the joint distribution P(v, h | θ):

P(v \mid \theta) = \sum_{h} P(v, h \mid \theta) = \frac{1}{Z(\theta)} \sum_{h} e^{-E(v, h \mid \theta)}.   (6)

Since the activation states of the hidden units are conditionally independent given the visible units (and vice versa), the activation probabilities of hidden unit j and visible unit i are

P(h_j = 1 \mid v, \theta) = \sigma\!\left(c_j + \sum_{i=1}^{m} v_i w_{ij}\right),
P(v_i = 1 \mid h, \theta) = \sigma\!\left(b_i + \sum_{j=1}^{n} h_j w_{ij}\right),   (7)

where σ(x) = 1/(1 + e^{-x}) is the sigmoid activation function.
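As an illustration of equations (3)–(7), the following NumPy sketch (dimensions, names, and random initialization are ours, not the paper's) evaluates the two conditional activation probabilities of an RBM and performs one alternating Gibbs step, the basic operation used when training the RBM layers of a DBN.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative RBM with m visible and n hidden units (equation (3) notation):
# b = visible biases, c = hidden biases, W = m x n weight matrix.
m, n = 6, 4
W = rng.normal(scale=0.1, size=(m, n))
b = np.zeros(m)
c = np.zeros(n)

def hidden_probs(v):
    """P(h_j = 1 | v) = sigmoid(c_j + sum_i v_i w_ij), cf. equation (7)."""
    return sigmoid(c + v @ W)

def visible_probs(h):
    """P(v_i = 1 | h) = sigmoid(b_i + sum_j h_j w_ij), cf. equation (7)."""
    return sigmoid(b + W @ h)

def gibbs_step(v):
    """One alternating Gibbs sampling step v -> h -> v'."""
    h = (rng.random(n) < hidden_probs(v)).astype(float)
    v_new = (rng.random(m) < visible_probs(h)).astype(float)
    return h, v_new

v0 = rng.integers(0, 2, size=m).astype(float)   # a binary visible vector, e.g. LBP bits
h0, v1 = gibbs_step(v0)
print(hidden_probs(v0), v1)
```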
Figure 3: Three types of LBP.

Figure 4: Rotation-invariant LBP schematic.

3.4. Feature Point Detection and Matching. In this paper, we propose a multifeature point fusion algorithm. The combination of an edge detection algorithm and the ORB detection algorithm enables the detector to extract edge information, thereby increasing the number of matching points on objects with little texture. The feature points on edges are obtained by the Canny algorithm to ensure that objects with less texture still have feature points. ORB has scale and rotation invariance, and its speed is faster than SIFT. The BRIEF description algorithm is used to construct the feature point descriptors [28–31].

The brute-force algorithm is adopted as the feature matching strategy. It calculates the Hamming distance between each feature point of the template image and each feature point of the sample image. The minimum Hamming distance is then compared with a threshold; if the distance is less than the threshold, the two points are regarded as a matching pair; otherwise, they are not matched. The framework of feature extraction and matching is shown in Figure 6.
Figure 5: Boltzmann machine (BM) and restricted Boltzmann machine (RBM). v is the visible layer, m indicates the number of input data, h is the hidden layer, and w is the connection weight between the two layers; ∀i, j, v_i ∈ {0, 1}, h_j ∈ {0, 1}.

Figure 6: The process of multifeature fusion extraction and matching (image → Canny edge feature detection and FAST point feature detection → BRIEF point feature description → feature matching strategy).
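The pipeline in Figure 6 can be approximated with standard OpenCV components. The sketch below is our own illustrative code (the synthetic test images, edge-sampling stride, and match threshold are placeholders): it combines Canny edge points with ORB (FAST detection plus a rotated-BRIEF binary descriptor) keypoints and matches them by brute-force Hamming distance.

```python
import cv2
import numpy as np

def extract_features(gray):
    """Detect ORB keypoints plus keypoints sampled from Canny edges,
    then describe all of them with ORB's binary (BRIEF-style) descriptor."""
    orb = cv2.ORB_create(nfeatures=1000)
    kps = list(orb.detect(gray, None))

    # Add keypoints on Canny edges so low-texture objects still get features.
    edges = cv2.Canny(gray, 100, 200)
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs[::50], ys[::50]):           # sparse sampling of edge pixels
        kps.append(cv2.KeyPoint(float(x), float(y), 31))

    kps, desc = orb.compute(gray, kps)
    return kps, desc

def match(desc_query, desc_train, max_hamming=50):
    """Brute-force Hamming matching with a simple distance threshold."""
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = bf.match(desc_query, desc_train)
    return [m for m in matches if m.distance < max_hamming]

# Synthetic test images stand in for a query photo and a database image.
rng = np.random.default_rng(0)
base = (rng.random((480, 640)) * 255).astype(np.uint8)
img1 = cv2.GaussianBlur(base, (5, 5), 0)
img2 = np.roll(img1, 15, axis=1)                   # shifted copy as the "database" image

kps1, d1 = extract_features(img1)
kps2, d2 = extract_features(img2)
print(len(match(d1, d2)), "matches")
```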

3.5. Pose Estimation. The core idea is to select four noncoplanar virtual control points; all the spatial reference points are then represented by these four virtual control points, the coordinates of the virtual control points are solved from the correspondence between the spatial reference points and their projection points, and thereby the coordinates of all the spatial reference points are obtained. Finally, the rotation matrix and the translation vector are solved. The specific algorithm is described as follows.

Given n reference points, the world coordinates are \tilde{P}_{wi} = (x_i, y_i, z_i)^T, i = 1, 2, ⋯, n. The coordinates of the corresponding projection points in the image coordinate system are \tilde{u}_i = (u_i, v_i)^T, and the corresponding homogeneous coordinates are P_{wi} = (x_i, y_i, z_i, 1)^T and u_i = (u_i, v_i, 1)^T. The correspondence between the reference point P_{wi} and the projection point u_i is

\lambda_i u_i = K \left[ R \;\; t \right] P_{wi},   (8)

where λ_i is the depth of the reference point and K is the internal parameter matrix of the camera,

K = \begin{bmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{bmatrix},   (9)

where f = f_u = f_v is the focal length of the camera and (u_0, v_0) = (0, 0) is the optical center coordinate.

First, select four noncoplanar virtual control points in the world coordinate system. The relationship between the virtual control points and their projection points is shown in Figure 7.

Figure 7: Virtual control point and its projection point correspondence.

In Figure 7, C_{w1} = [0, 0, 0, 1]^T, C_{w2} = [1, 0, 0, 1]^T, C_{w3} = [0, 1, 0, 1]^T, and C_{w4} = [0, 0, 1, 1]^T. {C_{cj}, j = 1, 2, 3, 4} are the homogeneous coordinates of the virtual control points in the camera coordinate system, and {\tilde{C}_{cj}, j = 1, 2, 3, 4} are the corresponding nonhomogeneous coordinates; {c_j, j = 1, 2, 3, 4} are the homogeneous coordinates of the corresponding projection points in the image coordinate system, and {\tilde{c}_j, j = 1, 2, 3, 4} are the corresponding nonhomogeneous coordinates. {P_{ci}, i = 1, 2, ⋯, n} are the homogeneous coordinates of the reference points in the camera coordinate system, and {\tilde{P}_{ci}, i = 1, 2, ⋯, n} are the corresponding nonhomogeneous coordinates. The relationship between the spatial reference points and the control points in the world coordinate system is as follows:

P_{wi} = \sum_{j=1}^{4} \alpha_{ij} C_{wj}, \quad i = 1, 2, \cdots, n,   (10)

where the vector [α_{i1}, α_{i2}, α_{i3}, α_{i4}]^T gives the coordinates of reference point i with respect to the control points. From the invariance of this linear relationship under Euclidean transformation,

P_{ci} = \sum_{j=1}^{4} \alpha_{ij} C_{cj}, \quad i = 1, 2, \cdots, n,
\lambda_i u_i = K \tilde{P}_{ci} = K \sum_{j=1}^{4} \alpha_{ij} \tilde{C}_{cj}, \quad i = 1, 2, \cdots, n.   (11)

Assume \tilde{C}_{cj} = [x_{cj}, y_{cj}, z_{cj}]^T; then

\lambda_i = \sum_{j=1}^{4} \alpha_{ij} z_{cj}.   (12)

Then, we obtain the equations

\sum_{j=1}^{4} \left( \alpha_{ij} f x_{cj} - \alpha_{ij} u_i z_{cj} \right) = 0,
\sum_{j=1}^{4} \left( \alpha_{ij} f y_{cj} - \alpha_{ij} v_i z_{cj} \right) = 0.   (13)

Assume Z = [Z_1^{cT}, Z_2^{cT}, Z_3^{cT}, Z_4^{cT}]^T with Z_j^c = [f x_{cj}, f y_{cj}, z_{cj}]^T, j = 1, 2, 3, 4; then the equations obtained from the correspondences between the spatial points and the image points can be written as

M Z = 0.   (14)

The solution Z lies in the kernel (null space) of the matrix M:

Z = \sum_{i=1}^{N} \beta_i W_i,   (15)

where W_i is an eigenvector of M^T M, N is the dimension of the kernel, and β_i is an undetermined coefficient. For a perspective projection model, the value of N is 1, resulting in

Z = \beta W,   (16)

where W = [w_1^T, w_2^T, w_3^T, w_4^T]^T and w_j = [w_{j1}, w_{j2}, w_{j3}]^T; then the image coordinates of the four virtual control points are

c_j = \left\{ \frac{w_{j1}}{w_{j3}}, \frac{w_{j2}}{w_{j3}}, 1 \right\}, \quad j = 1, 2, 3, 4.   (17)

The image coordinates of the four virtual control points obtained from this solution, together with the camera focal length obtained during calibration, are then fed into the absolute positioning algorithm to obtain the rotation matrix and the translation vector.
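The virtual-control-point derivation above is essentially an EPnP-style solver. As a hedged sketch, OpenCV's generic solvePnP interface can play the same role (this is not the authors' implementation; the reference points, simulated pose, focal length, and principal point below are placeholders, and unlike equation (9) the principal point is not placed at the origin):

```python
import cv2
import numpy as np

# n >= 4 spatial reference points in the world frame (metres) -- placeholder values.
object_points = np.array([[0.0, 0.0, 0.0],
                          [1.2, 0.0, 0.0],
                          [0.0, 0.9, 0.0],
                          [1.2, 0.9, 0.7],
                          [0.5, 0.4, 1.1]], dtype=np.float64)

f = 525.0                                    # focal length from calibration (placeholder)
K = np.array([[f, 0.0, 320.0],
              [0.0, f, 240.0],
              [0.0, 0.0, 1.0]])              # intrinsic matrix, cf. equation (9)
dist = np.zeros(5)                           # distortion corrected during preprocessing

# Simulate a "true" pose to obtain pixel observations consistent with the 3D points.
rvec_true = np.array([0.1, -0.2, 0.05])
tvec_true = np.array([0.3, -0.1, 4.0])
image_points, _ = cv2.projectPoints(object_points, rvec_true, tvec_true, K, dist)

# EPnP-style solution of the perspective-n-point problem.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)                   # rotation matrix
camera_position = -R.T @ tvec.reshape(3, 1)  # camera centre in world coordinates
print(ok, camera_position.ravel())
```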

4. Experiments
We conducted two experiments to evaluate the proposed
system. In the first experiment, we compare the proposed
algorithm with other state-of-the-art algorithms on public
datasets and then perform numerical analysis to show the
accuracy of our system. The second experiment evaluates the positioning accuracy in a real-world scene.

4.1. Experiment Setup. The experimental devices include an Android mobile phone (Lenovo Phab 2 Pro) and a depth camera (Intel RealSense D435), as shown in Figure 8. The user interface of the proposed visual positioning system on a smart mobile phone running in an indoor environment is shown in Figure 9.

Figure 8: Intel RealSense D435 and Lenovo mobile phone.

Figure 9: The user interface of the proposed visual positioning system on a smart mobile phone running in an indoor environment.

4.2. Experiment on Public Dataset. In this experiment, we adopted the ICL-NUIM dataset, which consists of RGB-D images from camera trajectories in two indoor scenes. The ICL-NUIM dataset is aimed at benchmarking RGB-D, visual odometry, and SLAM algorithms [32–34]. Two different scenes (the living room and the office room) are provided with ground truth. The living room has 3D surface ground truth together with depth maps and camera poses, and as a result it is suited not only for benchmarking camera trajectories but also for reconstruction. The office room scene comes with only trajectory data and has no explicit 3D model. The images were captured at 640 × 480 resolution.

Table 1 shows localization results for our approach compared with state-of-the-art methods. The proposed localization method is implemented on an Intel Core i5-4460 CPU at 3.20 GHz. The total procedure from scene recognition to pose estimation takes about 0.17 s to output a location for a single image.

Table 1: Comparison of mean error on the ICL-NUIM dataset.

Method       Living room      Office room
PoseNet      0.60 m, 3.64°    0.46 m, 2.97°
4D PoseNet   0.58 m, 3.40°    0.44 m, 2.81°
CNN+LSTM     0.54 m, 3.21°    0.41 m, 2.66°
Ours         0.48 m, 3.07°    0.33 m, 2.40°

4.3. Experiment on Real Scenes. The images are acquired by a handheld depth camera at a series of locations. The image size is 640 × 480 pixels, and the focal length of the camera is known. Several images of the laboratory are shown in Figure 10.

Using the RTAB-Map algorithm, we obtain the 3D point cloud of the laboratory, shown in Figure 11. The blue points are the positions of the camera, and the blue line is the trajectory.

The 2D map of our laboratory is shown in Figure 12. The length and width of the laboratory are 9.7 m and 7.8 m, respectively. First, a point in the laboratory is selected as the origin of the coordinate system to establish a world coordinate system. Then, holding the mobile phone, we walk along different routes and take photos, as indicated by the arrows.

In the offline stage, we collected a total of 144 images. Because some images captured at different scenes are similar, we divide them into 18 categories. In the online stage, we captured 45 images at different locations on route 1 and 27 images on route 2. The classification accuracy formula is

P = \frac{N_i}{N},   (18)

where N_i is the number of correctly classified scene images and N is the total number of scene images. The classification accuracy of our method is 0.925.

Most mismatched scenes are concentrated at corners, mainly due to the lack of significant features or to mismatches. Several mismatched scenes are shown in Figure 13.

After removing the wrongly matched results, the error cumulative distribution function graph is shown in Figure 14. The trajectory of the camera is compared with the predefined route.

Figure 10: Images captured from different scenes.

Figure 11: 3D point cloud of laboratory.

Figure 12: Environmental map and walking route.

Figure 13: Mismatched scene.

After calculating the Euclidean distance between the positions obtained by our method and the true positions, we obtain the error cumulative distribution function graph (Figure 14). It can be seen that the average positioning error is 0.61 m. Approximately 58% of the point positioning errors are less than 0.5 m, about 77% are less than 1 m, about 95% are less than 2 m, and the maximum error is 2.55 m.

Since the original depth images in our experiment are based on RTAB-Map, their accuracy is limited. For example, in an indoor environment, intense illumination and strong shadows may lead to inconspicuous local features, and it is also difficult to construct a good point cloud model.
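The statistics above are read from the empirical cumulative distribution of the per-image Euclidean errors (Figure 14). A minimal sketch of how such a curve and the quoted percentages can be computed (the error values here are illustrative, not the experimental data):

```python
import numpy as np

# Euclidean positioning errors (metres) of the test images -- illustrative values only.
errors = np.array([0.21, 0.34, 0.48, 0.55, 0.61, 0.72, 0.95, 1.30, 2.55])

# Empirical CDF: (x, F) are the points of a curve like the one in Figure 14.
x = np.sort(errors)
F = np.arange(1, len(x) + 1) / len(x)        # F(x): fraction of errors <= x

print("mean error: %.2f m" % errors.mean())
for threshold in (0.5, 1.0, 2.0):
    print("errors below %.1f m: %.0f%%" % (threshold, 100 * (errors <= threshold).mean()))
```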
Figure 14: Error cumulative distribution function graph (empirical CDF; F(x) versus positioning error x in meters).

In the future, we plan to use laser equipment to construct the point cloud.

5. Conclusions and Future Work

In this article, we have presented an indoor positioning system based only on cameras. The main work is to use deep learning to identify the category of the scene and to use 2D-3D feature point matching to calculate the location. We implemented the proposed approach on a mobile phone and achieved a positioning accuracy at the decimeter level. A preliminary indoor positioning experiment is reported in this paper, but the experimental site is a small-scale place. The following work needs to be done in the future: with the rapid development of deep learning, high-level semantics can be generated to overcome the limitations caused by hand-crafted features; a more robust, lightweight image retrieval algorithm should be adopted; and tests under different lighting and dynamic environments, system tests in large-scale scenarios, and long-term performance tests should be carried out.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was partially supported by the Key Research Development Program of Hebei (Project No. 19210906D).

References

[1] J. Wu, S. Guo, H. Huang, W. Liu, and Y. Xiang, "Information and communications technologies for sustainable development goals: state-of-the-art, needs and perspectives," IEEE Communications Surveys & Tutorials, vol. 20, no. 3, pp. 2389–2406, 2018.
[2] P. Lazik, N. Rajagopal, O. Shih, B. Sinopoli, and A. Rowe, "ALPS: a Bluetooth and ultrasound platform for mapping and localization," in Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, pp. 73–84, ACM, New York, NY, USA, 2015.
[3] S. He and S. Chan, "Wi-Fi fingerprint-based indoor positioning: recent advances and comparisons," IEEE Communications Surveys & Tutorials, vol. 18, no. 1, pp. 466–490, 2017.
[4] C. L. Wu, L. C. Fu, and F. L. Lian, "WLAN location determination in e-home via support vector classification," in Proceedings of the IEEE International Conference on Networking, Sensing and Control, pp. 1026–1031, Taipei, Taiwan, 2004.
[5] G. Ding, Z. Tan, J. Wu, and J. Zhang, "Efficient indoor fingerprinting localization technique using regional propagation model," IEICE Transactions on Communications, vol. 8, pp. 1728–1741, 2014.
[6] G. Ding, Z. Tan, J. Wu, J. Zeng, and L. Zhang, "Indoor fingerprinting localization and tracking system using particle swarm optimization and Kalman filter," IEICE Transactions on Communications, vol. 3, pp. 502–514, 2015.
[7] C. Toft, W. Maddern, A. Torii et al., "Long-term visual localization revisited," IEEE Transactions on Pattern Analysis and Machine Intelligence, p. 1, 2020.
[8] A. Xiao, R. Chen, D. Li, Y. Chen, and D. Wu, "An indoor positioning system based on static objects in large indoor scenes by using smartphone cameras," Sensors, vol. 18, no. 7, pp. 2229–2246, 2018.
[9] E. Deretey, M. T. Ahmed, J. A. Marshall, and M. Greenspan, "Visual indoor positioning with a single camera using PnP," in Proceedings of the 2015 International Conference on Indoor Positioning and Indoor Navigation (IPIN), pp. 1–9, Banff, AB, Canada, October 2015.
[10] L. Kneip, D. Scaramuzza, and R. Siegwart, "A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2969–2976, Colorado Springs, CO, USA, 2011.
[11] T. Sattler, B. Leibe, and L. Kobbelt, "Fast image-based localization using direct 2D-to-3D matching," in 2011 IEEE International Conference on Computer Vision, IEEE, pp. 667–674, Barcelona, Spain, 2011.
[12] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua, "Worldwide pose estimation using 3D point clouds," in European Conference on Computer Vision (ECCV), Berlin, Heidelberg, 2012.
[13] M. Larsson, E. Stenborg, C. Toft, L. Hammarstrand, T. Sattler, and F. Kahl, "Fine-grained segmentation networks: self-supervised segmentation for improved long-term visual localization," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 31–41, Seoul, Korea, 2019.
[14] A. Anoosheh, T. Sattler, R. Timofte, M. Pollefeys, and L. Van Gool, "Night-to-day image translation for retrieval-based localization," in 2019 International Conference on Robotics and Automation (ICRA), pp. 5958–5964, Montreal, QC, Canada, 2019.
[15] J. X. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "SUN database: large-scale scene recognition from abbey to zoo," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3485–3492, San Francisco, CA, USA, 2010.
[16] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[17] P.-E. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk, "From coarse to fine: robust hierarchical localization at large scale," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12716–12725, California, 2019.
[18] Q. Niu, M. Li, S. He, C. Gao, S.-H. Gary Chan, and X. Luo, "Resource efficient and automated image-based indoor localization," ACM Transactions on Sensor Networks, vol. 15, no. 2, pp. 1–31, 2019.
[19] Y. Chen, R. Chen, M. Liu, A. Xiao, D. Wu, and S. Zhao, "Indoor visual positioning aided by CNN-based image retrieval: training-free, 3D modeling-free," Sensors, vol. 18, no. 8, pp. 2692–2698, 2018.
[20] A. Kendall and R. Cipolla, "Modelling uncertainty in deep learning for camera relocalization," in IEEE International Conference on Robotics & Automation, pp. 4762–4769, Stockholm, Sweden, 2016.
[21] T. Sattler, B. Leibe, and L. Kobbelt, "Efficient & effective prioritized matching for large-scale image-based localization," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 39, no. 9, pp. 1744–1756, 2016.
[22] L. Svärm, O. Enqvist, F. Kahl, and M. Oskarsson, "City-scale localization for cameras with known vertical direction," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 39, no. 7, pp. 1455–1461, 2016.
[23] B. Zeisl, T. Sattler, and M. Pollefeys, "Camera pose voting for large-scale image-based localization," in IEEE International Conference on Computer Vision (ICCV), pp. 2704–2712, Santiago, Chile, 2015.
[24] A. Kendall, M. Grimes, and R. Cipolla, "PoseNet: a convolutional network for real-time 6-DOF camera relocalization," in IEEE International Conference on Computer Vision (ICCV), pp. 2938–2946, Santiago, Chile, 2015.
[25] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, Salt Lake City, Utah, 2018.
[26] Z. Chen, A. Jacobson, N. Sunderhauf et al., "Deep learning features at scale for visual place recognition," in 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 2017.
[27] S. Lynen, B. Zeisl, D. Aiger et al., "Large-scale, real-time visual-inertial localization revisited," The International Journal of Robotics Research, vol. 39, no. 9, pp. 1–24, 2020.
[28] M. Dusmanu, I. Rocco, T. Pajdla et al., "D2-Net: a trainable CNN for joint description and detection of local features," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8092–8101, California, 2019.
[29] R. B. Rusu, N. Blodow, and M. Beetz, "Fast point feature histograms (FPFH) for 3D registration," in IEEE International Conference on Robotics and Automation, pp. 1848–1853, Kobe, Japan, 2009.
[30] A. Xu and G. Namit, "SURF: speeded-up robust features," Computer Vision & Image Understanding, vol. 110, no. 3, pp. 404–417, 2008.
[31] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: an efficient alternative to SIFT or SURF," in IEEE International Conference on Computer Vision, pp. 2564–2571, Barcelona, Spain, 2011.
[32] A. Handa, T. Whelan, J. McDonald, and A. J. Davison, "A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM," in 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 1524–1531, Hong Kong, China, 2014.
[33] M. Labbe and F. Michaud, "RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation," Journal of Field Robotics, vol. 36, no. 2, pp. 416–446, 2019.
[34] Z. Gao, Y. Li, and S. Wan, "Exploring deep learning for view-based 3D model retrieval," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 16, no. 1, pp. 1–21, 2020.
