Research Article
Image-Based Indoor Localization Using Smartphone Camera
Shuang Li,1,2 Baoguo Yu,1 Yi Jin,3 Lu Huang,1,2 Heng Zhang,1,2 and Xiaohu Liang1,2
1 State Key Laboratory of Satellite Navigation System and Equipment Technology, China
2 Southeast University, China
3 Beijing Jiaotong University, China
Received 17 April 2021; Revised 30 May 2021; Accepted 20 June 2021; Published 5 July 2021
Copyright © 2021 Shuang Li et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
With the increasing demand for location-based services in places such as railway stations, airports, and shopping malls, indoor positioning technology has become one of the most attractive research areas. Due to the effects of multipath propagation, wireless indoor localization methods such as WiFi, Bluetooth, and pseudolite have difficulty achieving high-precision positioning. In this work, we present an image-based localization approach which obtains the position simply by taking a picture of the surrounding environment. This paper proposes a novel approach which classifies different scenes with a deep belief network and solves the camera position from several spatial reference points extracted from depth images by the perspective-n-point algorithm. To evaluate the performance, experiments are conducted on a public dataset and in real scenes; the results demonstrate that our approach can achieve submeter positioning accuracy. Compared with other methods, image-based indoor localization does not require infrastructure and has a wide range of applications, including self-driving, robot navigation, and augmented reality.
we propose a particularly efficient approach based on a deep belief network with local binary pattern feature descriptors. It enables us to find the most similar pictures quickly. In addition, we restrict the search space according to adaptive visibility constraints, which allows us to cope with extensive maps.

2. Related Work

Before presenting the proposed approach, we review previous work on image-based localization and divide the existing methods roughly into three categories.

Manual mark-based localization methods address the fact that the natural features of an image lack robustness, especially under varying illumination. To improve the robustness and accuracy of the reference points, special coded marks are used to meet higher positioning requirements. There are three benefits: they simplify the automatic detection of corresponding points, introduce system dimensions, and distinguish and identify targets by using a unique code for each mark. Common types of marks include concentric rings, QR codes, or patterns composed of colored dots. The advantage is a higher recognition rate and an effective reduction of the complexity of the positioning method. The disadvantage is that installation and maintenance costs are high, some targets are easily obstructed, and the scope of application is limited [17, 18].

Natural mark-based localization methods usually detect objects in the image and match them against an existing building database that contains the location information of the natural marks in the building. The advantage of this method is that it does not require additional local infrastructure. In other words, the reference object is actually a series of digital reference points (control points in photogrammetry) in the database. Therefore, this type of system is suitable for large-scale coverage without adding too much cost. The disadvantage is that the recognition algorithm is complex and easily affected by the environment, the features are prone to change, and the dataset needs to be updated [19–22].

Learning-based localization methods have emerged in the past few years. They are end-to-end methods that directly obtain the 6-DoF pose and have been proposed to solve loop-closure detection and pose estimation [23]. These methods do not require feature extraction, feature matching, or complex geometric calculations and are intuitive and concise. They are robust to weak textures, repeated textures, motion blur, and lighting changes. However, the computational scale of the training phase is very large and usually requires GPU servers, so such models cannot run smoothly on mobile platforms [20]. In many scenarios, learned features are not as effective as traditional features such as SIFT, and their interpretability is poor [24–27].

3. Framework and Method

In this section, we first give an overview of the framework. The key modules are then explained in more detail in the subsequent sections.

3.1. Framework Overview. The whole pipeline of the visual localization system is shown in Figure 1. In the following, we briefly provide an overview of our system.

In the offline stage, RGB-D cameras are used to collect enough RGB images and depth images of the indoor environment. At the same time, the poses of the camera and the 3D point cloud are constructed. The RGB images are used as the training dataset for the network model, and the network model parameters are saved once the loss function value no longer decreases. In the online stage, a user enters the room, downloads the trained network model parameters to the mobile phone, and takes a picture with the phone; the most similar image is then identified by the deep learning network. Unmatched points are eliminated, and the pixel coordinates of the matched points and the depths of the corresponding points are extracted. According to the pinhole imaging model, the perspective-n-point method can then be used to calculate the pose of the mobile phone in the world coordinate system. Finally, the pose is converted into a real position and displayed on the map.

3.2. Camera Calibration and Image Correction. Due to the processing and installation errors of the camera lens, the image exhibits radial and tangential distortion. Therefore, we must calibrate the camera and correct the images in the preprocessing stage. The checkerboard contains a set of calibration reference points, and the coordinates of each point are disturbed by the same noise. We establish the objective function

\gamma = \sum_{i=1}^{n} \sum_{j=1}^{m} \left\| p_{ij} - \hat{p}(A, R_i, t_i, P_j) \right\|^2 ,    (1)

where p_{ij} is the coordinate of the projection point of reference point j on image i, R_i and t_i are the rotation and translation vectors of image i, P_j is the three-dimensional coordinate of reference point j in the world coordinate system, and \hat{p}(A, R_i, t_i, P_j) is the corresponding two-dimensional coordinate in the image coordinate system.
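As a concrete illustration of this calibration step, the following Python sketch uses OpenCV's standard checkerboard calibration, which minimizes a reprojection error of the same form as Equation (1). It is a minimal sketch, not the authors' implementation; the board dimensions, square size, and file paths are illustrative assumptions.

import glob
import cv2
import numpy as np

# Checkerboard with 9x6 inner corners and 25 mm squares (illustrative values only).
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * 25.0

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):          # calibration images (placeholder path)
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# calibrateCamera minimizes the reprojection error over K (mtx), distortion, R_i, t_i,
# which is the same objective as Equation (1).
rms, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

# Undistort a query image before feature extraction.
img = cv2.imread("query.png")                  # placeholder file name
undistorted = cv2.undistort(img, mtx, dist)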
3.3. Scene Recognition. In this section, we use a deep belief network (DBN) to categorize the different indoor scenes. The framework includes image preprocessing, LBP feature extraction, DBN training, and scene classification.

3.3.1. Local Binary Pattern. The improved LBP feature is insensitive to rotation and illumination changes. The LBP operator can be described as follows: the gray value of the center pixel of the window is taken as the threshold, and the gray values of the surrounding 8 pixels are compared with this threshold in a clockwise direction; if a neighbour's gray value is larger than the threshold, it is marked as 1, otherwise as 0. The comparison yields an 8-bit binary number, which is converted to decimal to obtain the LBP value of the center pixel of the window. This value reflects the texture information at that position. The calculation process is shown in Figure 2.
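The following Python sketch implements the basic 8-neighbour LBP operator described above. It is a plain reference version, not the authors' improved rotation-invariant variant; the clockwise read is assumed to start at the top-left neighbour, which the paper does not specify, and the sample 3x3 window mirrors the values shown in the paper's LBP illustration.

import numpy as np

def lbp_8(gray):
    """Basic 8-neighbour LBP: threshold the 3x3 neighbourhood by the center pixel,
    read the bits clockwise, and convert the 8-bit pattern to a decimal code."""
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.uint8)
    # Clockwise offsets, starting at the top-left neighbour (assumed ordering).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            center = gray[y, x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                if gray[y + dy, x + dx] > center:   # neighbour larger than threshold -> 1
                    code |= 1 << (7 - bit)
            out[y, x] = code
    return out

# Example 3x3 window with center (threshold) value 5, as in the paper's illustration.
window = np.array([[1, 2, 2],
                   [9, 5, 6],
                   [15, 1, 2]], dtype=np.uint8)
print(lbp_8(window)[1, 1])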
Figure 4: Rotation-invariant LBP schematic.

The DBN used for scene classification is trained from restricted Boltzmann machines (Figure 5); the energy function of an RBM with parameters θ is

E(v, h \mid \theta) = -\sum_{i=1}^{m} b_i v_i - \sum_{j=1}^{n} c_j h_j - \sum_{i=1}^{m} \sum_{j=1}^{n} v_i w_{ij} h_j .    (3)

Figure 5: Boltzmann machine and restricted Boltzmann machine. v is the visible layer, m indicates the number of input data, h is the hidden layer, and w is the connection weight between the two layers; ∀i, j, v_i ∈ {0, 1}, h_j ∈ {0, 1}.
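For reference, Equation (3) can be evaluated directly. The Python sketch below is a literal transcription of the energy function; the layer sizes, weights, and the interpretation of b and c as visible and hidden biases are standard RBM conventions chosen here for illustration, not the authors' DBN training code.

import numpy as np

def rbm_energy(v, h, W, b, c):
    """Energy of a visible/hidden configuration, Equation (3):
    E(v, h | theta) = -sum_i b_i v_i - sum_j c_j h_j - sum_ij v_i W_ij h_j."""
    return -b @ v - c @ h - v @ W @ h

# Illustrative sizes only: m visible units (e.g. LBP feature length), n hidden units.
rng = np.random.default_rng(0)
m, n = 256, 64
v = rng.integers(0, 2, m)          # binary visible vector
h = rng.integers(0, 2, n)          # binary hidden vector
W = rng.normal(0, 0.01, (m, n))    # connection weights
b = np.zeros(m)                    # visible biases (assumption: b are visible-side biases)
c = np.zeros(n)                    # hidden biases (assumption: c are hidden-side biases)
print(rbm_energy(v, h, W, b, c))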
3.4. Feature Point Detection and Matching. In this paper, we propose a multifeature point fusion algorithm. The combination of an edge detection algorithm and the ORB detector enables the method to extract edge information, thereby increasing the number of matching points on objects with few textures. The feature points of the edges are obtained by the Canny algorithm to ensure that objects with little texture still have feature points. ORB has scale and rotation invariance, and it is faster than SIFT. The BRIEF description algorithm is used to construct the feature point descriptors [28–31].

The brute-force algorithm is adopted as the feature matching strategy. It calculates the Hamming distance between each feature point of the template image and each feature point of the sample image. The minimum Hamming distance is then compared with a threshold; if the distance is less than the threshold, the two points are regarded as a matching pair; otherwise, they are not matched. The framework of feature extraction and matching is shown in Figure 6.
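A minimal Python sketch of this detection-and-matching step is shown below, using OpenCV's Canny edge detector, ORB, and a brute-force Hamming matcher. The way edge pixels are promoted to keypoints, the Hamming threshold, and the file names are illustrative assumptions rather than the authors' exact parameters.

import cv2
import numpy as np

def detect_and_describe(gray, max_edge_kps=500):
    # ORB keypoints (scale/rotation invariant, BRIEF-style binary descriptors).
    orb = cv2.ORB_create(nfeatures=1000)
    kps = list(orb.detect(gray, None))
    # Add keypoints on Canny edges so low-texture objects still receive features.
    edges = cv2.Canny(gray, 100, 200)
    ys, xs = np.nonzero(edges)
    step = max(1, len(xs) // max_edge_kps)           # subsample edge pixels
    kps += [cv2.KeyPoint(float(x), float(y), 31) for x, y in zip(xs[::step], ys[::step])]
    # Describe all keypoints with ORB's binary descriptor.
    kps, desc = orb.compute(gray, kps)
    return kps, desc

def match(desc1, desc2, max_hamming=50):
    # Brute-force matching with Hamming distance; keep matches below a threshold.
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = bf.match(desc1, desc2)
    return [m for m in matches if m.distance < max_hamming]

query = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)      # placeholder file names
ref = cv2.imread("retrieved.png", cv2.IMREAD_GRAYSCALE)
kps1, d1 = detect_and_describe(query)
kps2, d2 = detect_and_describe(ref)
good = match(d1, d2)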
3.5. Pose Estimation. The core idea is to select four noncoplanar virtual control points; all the spatial reference points are then represented by these four virtual control points, and the coordinates of the virtual control points are solved from the correspondences between the spatial reference points and their projection points, thereby obtaining the coordinates of all the spatial reference points. Finally, the rotation matrix and the translation vector are solved. The specific algorithm is described as follows.

Given n reference points, the world coordinates are \tilde{P}^w_i = (x_i, y_i, z_i)^T, i = 1, 2, \cdots, n. The coordinates of the corresponding projection points u_i satisfy

\lambda_i u_i = K [R \; t] P^w_i ,    (8)

where λ_i is the depth of the reference point and K is the internal parameter matrix of the camera:

K = \begin{bmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{bmatrix} ,    (9)

where f = f_u = f_v is the focal length of the camera and (u_0, v_0) = (0, 0) is the optical center coordinate.

First, select four noncoplanar virtual control points in the world coordinate system. The relationship between the virtual control points and their projection points is shown in Figure 7.

In Figure 7, C^w_1 = [0, 0, 0, 1]^T, C^w_2 = [1, 0, 0, 1]^T, C^w_3 = [0, 1, 0, 1]^T, and C^w_4 = [0, 0, 1, 1]^T. {C^c_j, j = 1, 2, 3, 4} are the homogeneous coordinates of the virtual control points in the camera coordinate system, and {\tilde{C}^c_j, j = 1, 2, 3, 4} are the corresponding nonhomogeneous coordinates. {C_j, j = 1, 2, 3, 4} are the homogeneous coordinates of the corresponding projection points in the image coordinate system, and {\tilde{C}_j, j = 1, 2, 3, 4} are the corresponding nonhomogeneous coordinates. {P^c_i, i = 1, 2, \cdots, n} are the homogeneous coordinates of the reference points in the camera coordinate system, and {\tilde{P}^c_i, i = 1, 2, \cdots, n} are the corresponding nonhomogeneous coordinates. The relationship between the spatial reference points and the control points in the world coordinate system is as follows:

P^w_i = \sum_{j=1}^{4} \alpha_{ij} C^w_j , \quad i = 1, 2, \cdots, n ,    (10)

where the vector [\alpha_{i1}, \alpha_{i2}, \alpha_{i3}, \alpha_{i4}]^T is the coordinate of the Euclidean space based on the control points. From the invariance of this linear relationship under a Euclidean transformation,

P^c_i = \sum_{j=1}^{4} \alpha_{ij} C^c_j , \quad \lambda_i u_i = K \tilde{P}^c_i = K \sum_{j=1}^{4} \alpha_{ij} \tilde{C}^c_j , \quad i = 1, 2, \cdots, n .    (11)

Assume \tilde{C}^c_j = [x^c_j, y^c_j, z^c_j]^T; then

\lambda_i = \sum_{j=1}^{4} \alpha_{ij} z^c_j .    (12)

Combining (11) and (12) for all n reference points yields a homogeneous linear system in the unknown control-point coordinates, collected in a vector Z:

M Z = 0 .    (14)

The solution Z lies in the kernel space of the matrix M:

Z = \sum_{i=1}^{N} \beta_i W_i ,    (15)

where W_i is an eigenvector of M^T M, N is the dimension of the kernel, and β_i is an undetermined coefficient. For a perspective projection model, the value of N is 1, resulting in

Z = \beta W ,    (16)

where W = [w_1^T, w_2^T, w_3^T, w_4^T]^T and w_j = [w_{j1}, w_{j2}, w_{j3}]^T; then the image coordinates of the four virtual control points are

c_j = \left( \frac{w_{j1}}{w_{j3}}, \frac{w_{j2}}{w_{j3}}, 1 \right) , \quad j = 1, 2, 3, 4 .    (17)
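The derivation above follows the control-point idea of the EPnP algorithm. As a practical illustration, not the authors' implementation, the following Python sketch recovers the phone pose from 2D-3D correspondences with OpenCV's EPnP solver inside RANSAC; the intrinsics and the input files are placeholders, and the principal point is set to the image center rather than normalized to the origin as in Equation (9).

import cv2
import numpy as np

# pts_3d: Nx3 world coordinates of the matched reference points (from the depth images
# and the offline point cloud); pts_2d: Nx2 pixel coordinates in the query image.
pts_3d = np.load("matched_world_points.npy").astype(np.float32)   # placeholder files
pts_2d = np.load("matched_pixel_points.npy").astype(np.float32)

f, u0, v0 = 525.0, 320.0, 240.0        # illustrative intrinsics for a 640x480 image
K = np.array([[f, 0, u0],
              [0, f, v0],
              [0, 0, 1]], dtype=np.float64)
dist = np.zeros(5)                      # images are assumed to be undistorted already

# Solve lambda_i * u_i = K [R | t] P_i^w (Equation (8)); RANSAC rejects remaining mismatches.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts_3d, pts_2d, K, dist, flags=cv2.SOLVEPNP_EPNP)

R, _ = cv2.Rodrigues(rvec)              # rotation matrix from the rotation vector
camera_position = (-R.T @ tvec).ravel() # camera centre in world coordinates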
4. Experiments

We conducted two experiments to evaluate the proposed system. In the first experiment, we compare the proposed algorithm with other state-of-the-art algorithms on a public dataset and then perform a numerical analysis to show the accuracy of our system. The second experiment evaluates the accuracy in the real world.

4.1. Experiment Setup. The experimental devices include an Android mobile phone (Lenovo Phab 2 Pro) and a depth camera (Intel RealSense D435), as shown in Figure 8. The user interface of the proposed visual positioning system on a smart mobile phone running in an indoor environment is shown in Figure 9.

Figure 8: Intel RealSense D435 and Lenovo mobile phone.

Figure 9: The user interface of the proposed visual positioning system on a smart mobile phone running in an indoor environment.

4.2. Experiment on Public Dataset. In this experiment, we adopted the ICL-NUIM dataset, which consists of RGB-D images from camera trajectories in two indoor scenes. The ICL-NUIM dataset is aimed at benchmarking RGB-D, visual odometry, and SLAM algorithms [32–34]. Two different scenes (the living room and the office room) are provided with ground truth. The living room has 3D surface ground truth together with depth maps and camera poses, so it is perfectly suited not only for benchmarking camera trajectories but also for reconstruction. The office room scene comes with only trajectory data and does not have any explicit 3D model. The images were captured at 640 × 480 resolution.

Table 1 shows the localization results of our approach compared with state-of-the-art methods. The proposed localization method is implemented on an Intel Core i5-4460 CPU (3.20 GHz). The total procedure from scene recognition to pose estimation takes about 0.17 s to output a location for a single image.

Table 1: Comparison of mean error on the ICL-NUIM dataset.
Method        Living room       Office room
PoseNet       0.60 m, 3.64°     0.46 m, 2.97°
4D PoseNet    0.58 m, 3.40°     0.44 m, 2.81°
CNN+LSTM      0.54 m, 3.21°     0.41 m, 2.66°
Ours          0.48 m, 3.07°     0.33 m, 2.40°

4.3. Experiment on Real Scenes. The images are acquired by a handheld depth camera at a series of locations. The image size is 640 × 480 pixels, and the focal length of the camera is known. Several images of the laboratory are shown in Figure 10.

Using the RTAB-Map algorithm, we obtain the 3D point cloud of the laboratory, shown in Figure 11. The blue points are the positions of the camera, and the blue line is the trajectory.

The 2D map of our laboratory is shown in Figure 12. The length and width of the laboratory are 9.7 m and 7.8 m, respectively. First, a point in the laboratory is selected as the origin to establish a world coordinate system. Then, we hold the mobile phone, walk along different routes, and take photos, as indicated by the arrows.

In the offline stage, we collect a total of 144 images. Because some images captured at different scenes are similar, we divide them into 18 categories. In the online stage, we captured 45 images at different locations on route 1 and 27 images on route 2. The classification accuracy formula is

P = \frac{N_i}{N} ,    (18)

where N_i is the number of correctly classified scene images and N is the total number of scene images. The classification accuracy of our method is 0.925.

Most mismatched scenes are concentrated in the corners, mainly due to the lack of significant features or to mismatches. Several mismatched scenes are shown in Figure 13.
Figure 12: 2D map of the laboratory (origin O, world axes X_w and Z_w; 9.7 m × 7.8 m).
After removing the wrongly matched results, the trajectory of the camera is compared with the predefined route. After calculating the Euclidean distance between the positions estimated by our method and the true positions, we obtain the error cumulative distribution function graph shown in Figure 14. It can be seen that the average positioning error is 0.61 m. Approximately 58% of the points have a positioning error of less than 0.5 m, about 77% have an error of less than 1 m, about 95% have an error of less than 2 m, and the maximum error is 2.55 m.

Since the original depth images in our experiment are based on RTAB-Map, their accuracy is limited. For example, in an indoor environment, intense illumination and strong shadows may lead to inconspicuous local features, and it is also difficult to construct a good point cloud model.
Figure 14: Empirical CDF of the positioning error, F(x) versus error x (m).
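The error statistics above can be reproduced from the estimated and ground-truth positions with a few lines of Python. This is a generic evaluation sketch, not the authors' script; the input files are placeholders.

import numpy as np

est = np.load("estimated_positions.npy")   # Nx2 or Nx3 estimated positions (placeholder)
gt = np.load("true_positions.npy")         # matching ground-truth positions (placeholder)

errors = np.linalg.norm(est - gt, axis=1)  # Euclidean positioning error per test image
errors_sorted = np.sort(errors)
cdf = np.arange(1, len(errors_sorted) + 1) / len(errors_sorted)   # empirical CDF F(x), paired with errors_sorted

print(f"mean error: {errors.mean():.2f} m, max error: {errors.max():.2f} m")
for threshold in (0.5, 1.0, 2.0):
    frac = np.mean(errors < threshold)
    print(f"errors below {threshold} m: {100 * frac:.0f}%")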
[10] L. Kneip, D. Scaramuzza, and R. Siegwart, “A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2969–2976, Colorado Springs, CO, USA, 2011.
[11] T. Sattler, B. Leibe, and L. Kobbelt, “Fast image-based localization using direct 2d-to-3d matching,” in 2011 IEEE International Conference on Computer Vision, pp. 667–674, Barcelona, Spain, 2011.
[12] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua, “Worldwide pose estimation using 3d point clouds,” in European Conference on Computer Vision (ECCV), Berlin, Heidelberg, 2012.
[13] M. Larsson, E. Stenborg, C. Toft, L. Hammarstrand, T. Sattler, and F. Kahl, “Fine-grained segmentation networks: self-supervised segmentation for improved long-term visual localization,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 31–41, Seoul, Korea, 2019.
[14] A. Anoosheh, T. Sattler, R. Timofte, M. Pollefeys, and L. Van Gool, “Night-to-day image translation for retrieval-based localization,” in 2019 International Conference on Robotics and Automation (ICRA), pp. 5958–5964, Montreal, QC, Canada, 2019.
[15] J. X. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: large-scale scene recognition from abbey to zoo,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3485–3492, San Francisco, CA, USA, 2010.
[16] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[17] P.-E. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk, “From coarse to fine: robust hierarchical localization at large scale,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12716–12725, California, 2019.
[18] Q. Niu, M. Li, S. He, C. Gao, S.-H. Gary Chan, and X. Luo, “Resource efficient and automated image-based indoor localization,” ACM Transactions on Sensor Networks, vol. 15, no. 2, pp. 1–31, 2019.
[19] Y. Chen, R. Chen, M. Liu, A. Xiao, D. Wu, and S. Zhao, “Indoor visual positioning aided by CNN-based image retrieval: training-free, 3D modeling-free,” Sensors, vol. 18, no. 8, pp. 2692–2698, 2018.
[20] A. Kendall and R. Cipolla, “Modelling uncertainty in deep learning for camera relocalization,” in IEEE International Conference on Robotics & Automation, pp. 4762–4769, Stockholm, Sweden, 2016.
[21] T. Sattler, B. Leibe, and L. Kobbelt, “Efficient & effective prioritized matching for large-scale image-based localization,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 39, no. 9, pp. 1744–1756, 2016.
[22] L. Svärm, O. Enqvist, F. Kahl, and M. Oskarsson, “City-scale localization for cameras with known vertical direction,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 39, no. 7, pp. 1455–1461, 2016.
[23] B. Zeisl, T. Sattler, and M. Pollefeys, “Camera pose voting for large-scale image-based localization,” in IEEE International Conference on Computer Vision (ICCV), pp. 2704–2712, Santiago, Chile, 2015.
[24] A. Kendall, M. Grimes, and R. Cipolla, “Posenet: a convolutional network for real-time 6-dof camera relocalization,” in IEEE International Conference on Computer Vision (ICCV), pp. 2938–2946, Santiago, Chile, 2015.
[25] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, Salt Lake City, Utah, 2018.
[26] Z. Chen, A. Jacobson, N. Sunderhauf et al., “Deep learning features at scale for visual place recognition,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 2017.
[27] S. Lynen, B. Zeisl, D. Aiger et al., “Large-scale, real-time visual–inertial localization revisited,” The International Journal of Robotics Research, vol. 39, no. 9, pp. 1–24, 2020.
[28] M. Dusmanu, I. Rocco, T. Pajdla et al., “D2-net: a trainable cnn for joint description and detection of local features,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8092–8101, California, 2019.
[29] R. B. Rusu, N. Blodow, and M. Beetz, “Fast point feature histograms (FPFH) for 3D registration,” in IEEE International Conference on Robotics and Automation, pp. 1848–1853, Kobe, Japan, 2009.
[30] A. Xu and G. Namit, “SURF: speeded-up robust features,” Computer Vision & Image Understanding, vol. 110, no. 3, pp. 404–417, 2008.
[31] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: an efficient alternative to SIFT or SURF,” in IEEE International Conference on Computer Vision, pp. 2564–2571, Barcelona, Spain, 2012.
[32] A. Handa, T. Whelan, J. Mcdonald, and A. J. Davison, “A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM,” in 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 1524–1531, Hong Kong, China, 2014.
[33] M. Labbe and F. Michaud, “RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation,” Journal of Field Robotics, vol. 36, no. 2, pp. 416–446, 2019.
[34] Z. Gao, Y. Li, and S. Wan, “Exploring deep learning for view-based 3D model retrieval,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 16, no. 1, pp. 1–21, 2020.