State of the Art in Vision-Based Localization Techniques for Autonomous Navigation Systems
ABSTRACT Vision-based localization systems, namely visual odometry (VO) and visual inertial odometry (VIO), have attracted great attention recently and are regarded as critical modules for building fully autonomous systems. The simplicity of visual and inertial state estimators, along with their applicability to resource-constrained platforms, has motivated the robotics community to research and develop novel approaches that maximize their robustness and reliability. In this paper, we survey state-of-the-art VO and VIO approaches; studies related to localization in visually degraded environments are also reviewed. The reviewed VO techniques and related studies are analyzed in terms of key design aspects, namely appearance-, feature-, and learning-based approaches. Research studies related to VIO are categorized by the degree and type of the fusion process into loosely-coupled, semi-tightly coupled, or tightly-coupled approaches and into filtering- or optimization-based paradigms. This paper provides an overview of the main components of visual localization and the key design aspects of each approach, highlighting their pros and cons, and compares the latest research works in this field. Finally, a detailed discussion of the challenges associated with the reviewed approaches and of future research considerations is formulated.
INDEX TERMS Ego-motion Estimation, GNSS-denied, Self-localization, VIO, Visual Inertial Odometry,
Visual Odometry, VO.
Simultaneous Localization and Mapping (SLAM) refers to the process of localizing a robot while simultaneously estimating its trajectory and building a map of the environment; VO is therefore a subset of SLAM [9]. Like VO, the SLAM process is achieved by utilizing the information gained from one or more onboard sensors. The performance of VO is significantly affected by environmental conditions, such as illumination, and by the quality of the images obtained by the sensor. Inertial-based odometry, in contrast, is not affected by the surrounding conditions, but its performance deteriorates over time. Fusing the data obtained from visual sensors with the inertial measurements yields a visual-inertial odometry (VIO) system that overcomes the limitations of both individual state estimators. Therefore, the use of an IMU as a complementary sensor to vision-based localization enables a more robust and accurate pose estimation.

GNSS-denied and low-visibility environments are among the main challenges in autonomous systems research, since they affect the sensor input information and critically degrade the robot's actions. An example of a low-visibility environment is a low-light condition, which can be addressed by using onboard illumination [14], [15] or single- to multi-sensor modalities such as LiDAR (light detection and ranging) and thermal imagers [16]. Other low-visibility conditions, including smoke- or fog-filled scenes, remain very challenging. Standard cameras, radars, or LiDARs used for VO or SLAM in such harsh conditions deliver ill-conditioned data; consequently, they are not able to estimate a reliable robot pose and therefore fail to construct a map of the environment.

This paper presents a survey on vision-based navigation paradigms, namely visual odometry and visual inertial odometry. Our review discusses each approach of these paradigms in detail in terms of the key design aspects of the main components and the advantages and disadvantages of each category, where applicable. Localization techniques for low-visibility conditions are also presented. Towards the end, the challenges associated with state-of-the-art techniques for self-localization are formulated.

The rest of the paper is structured as follows. Section II provides a brief review of self-localization schemes for navigation in GNSS-denied environments. Section III discusses the evolution of VO schemes under two broad paradigms, i.e., geometric and non-geometric approaches, and evaluates different state-of-the-art implementation choices. Section IV presents a review of recent works pertaining to VIO, their design choices, and their system performance. Section V covers the state-of-the-art studies related to localization techniques in visually degraded environments. Section VI presents an overview, a discussion of the main aspects, and future research perspectives of visual localization. In Section VII, the outcomes of our review are highlighted and future research considerations in the area are identified.

II. GENERAL OVERVIEW OF LOCALIZATION TECHNIQUES
A common challenge in autonomous navigation, path planning, object tracking, and obstacle avoidance platforms is the ability to continuously estimate the robot's ego-motion (position and orientation) over time. The Global Positioning System (GPS), one type of Global Navigation Satellite System (GNSS), is a conventional localization technique that has been used in various fields of autonomous systems. GPS provides any user equipped with a GPS receiver with positioning information at meter-level accuracy [17] and has been employed as a self-localization source, for example in drone security applications [18]. On the other hand, GPS suffers from several limitations that make it a less reliable sensor for self-localization modules, among them satellite signal blockage, noisy data, multipath effects, low bandwidth, jamming, and inaccuracy [10], [19]. Rapid developments in GPS technology, i.e., RTK (real-time kinematic) and PPP (precise point positioning), are capable of providing positions with decimeter- or centimeter-level accuracy [20]. However, the strength of GPS satellite signals depends largely on the environmental conditions: GPS is effective in clear-sky areas but is not suitable for indoor navigation, where the signals are blocked by walls and objects. It is therefore not a good candidate for the precise localization required by autonomous navigation modules.

In the last decade, many studies have investigated odometry techniques for SLAM applications [21]. In such systems, the robot's position and orientation are calculated from the information of the onboard sensor(s). In contrast to GNSS, self-contained odometry methods do not rely on external sources (i.e., radio signals from satellites in the case of GPS); instead, they rely on local sensory information to determine the robot's relative position and orientation with respect to its starting point. The main components of any SLAM technique are map/trajectory initialization, data association, and loop closure [22]. An odometry algorithm is employed in a SLAM system to localize the moving robot within the environment; its output is then fed into the optimization algorithm of the global map to reduce the drift accumulated from previously estimated poses. SLAM techniques are thus able to reduce the accumulated pose error when the robot returns to a previously observed scene by using the history of robot poses in the global map. In addition, odometry algorithms implement local map optimization methods, such as windowed bundle adjustment, which optimize the local map only over the last poses, leading to local map consistency [22], [23]. SLAM aims at maintaining global map consistency, and the odometry method is used only in the first stage of the SLAM process, which is followed by further steps [24], i.e., local or global map optimization.

Odometry techniques are highly dependent on sensor information, which may come from vision, other observations, or inertial measurements. Fusing multiple types of sensing data helps increase the system's reliability, robustness, and resilience to failures, however at the cost of additional computational complexity.
Hence, the overall platform cost would also increase.

Odometry techniques have been surveyed by several researchers in the field, who addressed existing solutions and open research problems [9], [10], [12], [19], [25], [26]. Figure 1 summarizes the general self-localization/odometry techniques proposed in the literature [10]. Mohamed et al. [10] recently reviewed odometry methods for navigation and categorized them into two main approaches, i.e., GNSS-available and GNSS-denied approaches. They further classified GNSS-denied navigation techniques into single- and hybrid-based frameworks. The five main categories of single-based approaches are wheel odometry, inertial odometry, radar odometry, visual odometry (VO), and laser-based odometry. Similarly, hybrid approaches can be categorized into visual-laser odometry, visual-radar odometry, visual-inertial odometry (VIO), and radar-inertial odometry techniques. A broad summary of each category was presented along with its advantages and weaknesses, and the different odometry techniques were compared in terms of performance, response time, energy efficiency, accuracy, and robustness. For more detailed information about odometry techniques, interested readers can refer to [10].

For VO, basic concepts and algorithms were described and state-of-the-art techniques were compared by Scaramuzza et al., 2011 [12] and by Aqel et al., 2016 [19]. Poddar et al. [9] recently reviewed the evolution of VO schemes over the previous few decades and discussed them under two main categories, geometric and non-geometric approaches. They provided a general theoretical background on camera models, feature detection and matching, outlier rejection, and pose estimation frameworks, as well as a list of publicly available datasets for VO. In 2015, VIO techniques were reviewed in terms of filtering and optimization techniques [25]. Furthermore, for vision-based odometry, [26] briefly surveyed camera-based odometry for micro-aerial vehicle (MAV) applications in 2016; their review focused on state-of-the-art studies and the evaluation of monocular, RGB-D, and stereo-based odometry approaches.

A considerable body of research addressing the visual localization problem can be found in the literature. Based on the aforementioned surveys, an updated review reflecting the recent advances in VO and VIO is highly needed by the robotics research community. In this survey:
1) We provide a comprehensive review of the most recent works related to VO and VIO techniques, focusing on achievements made in the past five years (2016-2021).
2) We present our understanding of the most important studies and successful works related to VO and VIO.
3) We provide an overview of recently adopted approaches for localization in low-visibility environments. To the authors' knowledge, no review has addressed localization techniques in low-visibility environments while reflecting the recent advances in the field.
4) We present a detailed discussion of vision-based self-localization systems, as shown in Fig. 13.
This article serves as a building block for researchers and developers to understand the basic concepts, to compare and categorize existing paradigms, and to highlight open research problems for improving recent self-localization techniques. In addition, it provides key systematic points on how to select an appropriate localization method for navigation based on the environmental conditions and application needs.

III. VISUAL ODOMETRY
VO is defined as the process of estimating the pose of a robot, human, or vehicle by evaluating a set of cues (variations) in a sequence of images of the environment obtained from one or more cameras [9]. In short, VO localizes the camera or sensor within the environment. VO is utilized in many applications, such as navigation and control of robots (i.e., aerial, underwater, and space robots), automobiles, wearable computing, and industrial manufacturing [23], [27].

The concept of VO is similar to wheel odometry, which incrementally estimates a vehicle's pose and motion by integrating the number of wheel turns over time. Equally, VO incrementally estimates the pose by evaluating the variations of motion induced on a set of images captured by the onboard camera(s). VO can be considered a special case of the structure-from-motion (SfM) technique, which reconstructs a 3D scene of the environment and the camera poses from a consecutive sequence of frames [12]. A 3D view is reconstructed by calculating the optical flow of key indicators, which are extracted from two consecutive frames using image feature detectors (e.g., Moravec [28]) and corner detectors (e.g., Harris [29]). Then, the constructed 3D structure is refined/optimized using the bundle adjustment method [30] or another offline refinement technique. There are several ways to perform SfM depending on factors such as the number of onboard cameras, the number and order of images, and the camera calibration status. The last step in SfM, the refinement and global optimization of the structure and camera poses, requires a high computational load and is therefore performed offline. In contrast, VO is conducted in real time (online) to estimate the camera pose [31]. VO works effectively when the environment offers a sufficient illumination level and a static scene with textures rich enough to observe and extract the apparent motion, and when there is enough scene overlap between consecutive frames.

A. MOTION ESTIMATION
The main pipeline of a VO system is shown in Fig. 2. There are three standard VO motion estimation methods, categorized into 2D-to-2D, 3D-to-2D, and 3D-to-3D motion estimation techniques. These methods compute the transformation matrix between two consecutive images (the current and previous image). They depend on the captured
features and their correspondences, whether specified in 2D or 3D [1]. Appending these single motion estimates over time yields the full robot trajectory. Lastly, a bundle adjustment process is performed to iteratively refine the poses estimated over the last number of frames [12].

Figure 3 illustrates the VO scheme. At first, the relative pose, T_{i,i+1}, between cameras is determined by matching the locations of the corresponding feature points of two consecutive 2D images. Using one of the mentioned VO motion estimation methods, the 3D point pose is computed. Then the global camera poses, C_i, are computed by concatenating the relative transformations with respect to an initial reference frame.

1) 3D to 3D algorithm
In this algorithm, the camera motion relative to an initial state is computed in the following steps. First, a set of 3D points extracted from a pair of successive images is matched. Second, the matched 3D features are triangulated between frames. The relative camera motion is then given by the transformation between two consecutive frames that minimizes the Euclidean distance between corresponding 3D features [12].

2) 3D to 2D algorithm
In this algorithm, the camera pose is estimated by minimizing the reprojection error between 3D features and their 2D image correspondences. The cost function is depicted by Eq. 1:

T_t^k = \arg\min_{T_t^k} \sum_i \| p_t^i - \hat{p}_{t-1}^i \|^2    (1)

where T_t^k is the transformation matrix that minimizes the reprojection error between two consecutive frames t-1 and t, p_t^i is the 2D image feature point in the current frame, and \hat{p}_{t-1}^i is the 2D point reprojected from a 3D feature point into the previous image frame. This approach is also called the perspective-n-point (PnP) algorithm, as it estimates the camera pose from a group of n 3D-to-2D point correspondences. The minimum number of points required is determined by the number of constraints in the system. For instance, the minimal solution, called perspective-3-point (P3P) [32], utilizes a set of three 3D-to-2D correspondences to estimate the camera pose.

3) 2D to 2D algorithm
In this algorithm, three main steps are used to estimate the motion. First, the essential matrix (E), which encodes the geometric relation between two successive frames, is determined by matching 2D feature correspondences using the epipolar constraint, as shown in Fig. 4. The essential matrix (E) and the translation matrix (\hat{t}_k) are defined in Eq. 2 and Eq. 3, respectively.
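To make the motion estimation methods above concrete, the following sketch (not taken from any of the surveyed works) estimates the relative camera motion on synthetic correspondences with OpenCV: the 2D-to-2D path recovers the pose from the essential matrix, and the 3D-to-2D path solves the PnP problem behind Eq. 1. The intrinsics, landmark layout, and ground-truth motion are invented for illustration.

```python
"""Illustrative sketch: 2D-to-2D and 3D-to-2D relative pose estimation on synthetic data."""
import numpy as np
import cv2

rng = np.random.default_rng(0)

# Pinhole intrinsics (assumed/hypothetical values).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Synthetic 3D landmarks in front of the first camera.
pts3d = np.column_stack([rng.uniform(-2, 2, 50),
                         rng.uniform(-2, 2, 50),
                         rng.uniform(4, 8, 50)])

# Ground-truth relative motion between frame t-1 and frame t.
R_gt = cv2.Rodrigues(np.array([0.0, np.deg2rad(5.0), 0.0]))[0]
t_gt = np.array([0.2, 0.0, 0.05])

def project(P, R, t):
    """Project 3D points into a camera with world-to-camera pose (R, t)."""
    cam = (R @ P.T).T + t
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3]

pts_prev = project(pts3d, np.eye(3), np.zeros(3))   # frame t-1
pts_curr = project(pts3d, R_gt, t_gt)               # frame t

# --- 2D-to-2D: essential matrix from the epipolar constraint ---
E, _ = cv2.findEssentialMat(pts_prev, pts_curr, K, method=cv2.RANSAC,
                            prob=0.999, threshold=1.0)
_, R_2d2d, t_unit, _ = cv2.recoverPose(E, pts_prev, pts_curr, K)
# Note: the translation from 2D-to-2D is recovered only up to scale.

# --- 3D-to-2D: PnP, minimizing the reprojection error of Eq. 1 ---
ok, rvec, tvec = cv2.solvePnP(pts3d.astype(np.float64),
                              pts_curr.astype(np.float64), K, None)
R_pnp = cv2.Rodrigues(rvec)[0]

print("rotation error (2D-2D):", np.linalg.norm(R_2d2d - R_gt))
print("rotation error (PnP):  ", np.linalg.norm(R_pnp - R_gt))
```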
trajectories, as in the case of monocular VO, is not needed [39]. The mentioned design choices of VO are elaborated in detail in the following subsections.

1) Conventional – Appearance-based VO
Appearance-based VO estimates the camera pose by analyzing the intensities of the captured image pixels and minimizing the photometric error. Unlike feature-based VO, this method uses all geometrical information of the captured camera frames, reducing aliasing issues related to scenes with similar patterns and enhancing the pose estimate's accuracy and the system robustness, especially in low-textured and low-visibility environments [40]. Figure 6 illustrates the main pipeline of appearance-based VO paradigms. Appearance-based VO can be classified into region/template matching-based and optical flow-based methods.

In the region-based method, the motion is estimated by concatenating camera poses obtained from an alignment process between two consecutive images. Implementations of this technique have been extended by measuring the invariant similarities of local areas and using global constraints. Vatani et al. [41] proposed a simple localized approach relying on the constrained motion of a large vehicle. It used a modified correlation-based VO method in which the size and location of the correlation mask vary with the vehicle movement, and a suggested prediction area is fed to the mask beforehand for matching; its ability to reduce the computational time makes it more reliable for practical implementation. An extension of this work was proposed by Yu et al. [42], who utilized a rotating template instead of a static template to find the translation and rotation between two consecutive images. Furthermore, an adaptive template matching method was proposed by [43], utilizing a smaller mask size and varying the template location with respect to the vehicle acceleration. Several studies have incorporated a visual compass into the template-matching method: for estimating the pixel displacement between images [40], for image rotation to make the system more robust to camera calibration errors accumulated over time [37], and for image rotation and translation employing different cameras [36].

Studies on robust region-based matching methods developed for other purposes, but applicable to VO problems, are discussed next. Comport et al. [44] proposed a scheme that utilizes a pair of stereo images and matches their dense correspondences to estimate the 6-DoF pose. The process relies on the quadrifocal relation between the image pixel intensities, which makes the system more robust under various conditions of occlusion, pixel-wise displacements, and illumination variations. In addition, Comport et al. [45] expanded this work by adding a cost function that minimizes the intensity errors of the whole image. Moreover, Lovegrove et al. [46] assessed vehicle motion using image alignment techniques aided by features on road surfaces. Other studies have also been performed on region-based matching techniques that analyze the motion parallax to compose the 3-D translation and transformation of two successive images. Motion is estimated in Large-Scale Direct SLAM [47] by an image alignment method that relies on the depth map; the proposed framework combines stereo and monocular cues and is able to compensate for brightness variations between image frames for a more accurate pose prediction. Furthermore, Engel et al. [48] examined a direct sparse VO method based on optimizing the photometric error, similar to the sparse bundle adjustment scheme, achieving a robust motion estimate by utilizing all image points, unlike feature-based VO, which utilizes key geometrical points only.

In the optical flow-based method, raw visual pixel data are fed into an optical flow (OF) algorithm, wherein the pixel intensity change between two consecutive frames from the camera(s) is analyzed to estimate the motion [49]. As the illumination of a pixel varies, the camera motion is determined by computing the 2D displacement vectors of points projected onto the two frames. The works of Brox et al. [50] and [51] are examples of widely used OF methods that employ motion constraint equations. Optical flow-based VO techniques are also called direct methods, since they utilize the whole image information, and they are used for both 2D and 3D motion estimation paradigms. Kim et al. [52] proposed a method to handle the problems of motion cease and illumination changes by integrating the methods of Black and Anandan [53] and Gennert and Negahdaripour [54], respectively, to estimate the camera motion.

Campbell et al. [55] employed the optical flow method to assess the robot ego-motion parameters: rotation and translation are estimated from the far and nearby image features, respectively. For navigation in an unexplored environment, Hyslop and Humbert [56] utilized an optical flow approach drawing on a wide range of raw visual measurements to estimate 6-DoF motion. Grabe et al. [57] estimated the continuous motion of a UAV by employing the optical flow method in a closed-loop operation instead of estimating the motion incrementally frame-to-frame; they also extended the work of [58] to improve velocity estimation by combining features in the optical flow technique. In addition, optical flow algorithms have been implemented to aid UAV navigation for other purposes, such as obstacle avoidance [59].

Some limitations of optical flow-based schemes are related to the strength of the environment texture as well as to computational constraints. To overcome and minimize the computational energy consumed, RGB-D cameras have been utilized for VO, estimating the motion by minimizing the photometric error over the dense map, as in Kerl et al. [60]. Furthermore, the method proposed by Dryanovski et al. [61] aligns 3D points to the global map with an iterative closest point (ICP) algorithm. In addition, a fast, low-computation VO method was developed by Li and Lee [62], in which the intensity values of selected key points are analyzed by ICP.
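As a minimal illustration of the optical flow front end described above (a sketch only, not an implementation from any cited work), the following example tracks sparse Shi-Tomasi corners with the pyramidal Lucas-Kanade tracker on a pair of synthetic frames; the resulting 2D displacement vectors are the raw input that direct and optical flow-based VO schemes convert into a motion estimate. The image content and the shift applied between frames are invented for illustration.

```python
"""Illustrative sketch: sparse KLT optical flow between two synthetic frames."""
import numpy as np
import cv2

rng = np.random.default_rng(1)

# Synthetic textured frame and a second frame shifted by a known 2D translation.
frame_prev = rng.uniform(0, 255, (240, 320)).astype(np.uint8)
frame_prev = cv2.GaussianBlur(frame_prev, (5, 5), 0)     # smooth so gradients are usable
true_shift = (4, 2)                                       # (dx, dy) in pixels
frame_curr = np.roll(frame_prev, shift=(true_shift[1], true_shift[0]), axis=(0, 1))

# Detect good features to track in the previous frame.
p0 = cv2.goodFeaturesToTrack(frame_prev, maxCorners=300, qualityLevel=0.01, minDistance=7)

# Pyramidal Lucas-Kanade: 2D displacement of each point between the two frames.
p1, status, err = cv2.calcOpticalFlowPyrLK(frame_prev, frame_curr, p0, None)

good_prev = p0[status.flatten() == 1].reshape(-1, 2)
good_curr = p1[status.flatten() == 1].reshape(-1, 2)
flow = good_curr - good_prev

print("median flow (dx, dy):", np.median(flow, axis=0), "expected:", true_shift)
# In a full VO pipeline these correspondences would feed the 2D-to-2D or
# 3D-to-2D motion estimation step described in Section III-A.
```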
model. Later, Cvisic and Petrovic [81] also classified the features of each bucket, however into four different groups, to enable the selection of good features for better pose estimation results.

Maeztu et al. [82] assessed a complete feature-based VO framework utilizing the bucketing method for tracking and matching with feature descriptors in corresponding grids. This approach helped to improve the estimated motion by adding an external block whose purpose was to perform parallel computation (as in a multi-core framework) and reduce the outliers. Several studies have improved the results obtained from a feature-based VO system without changing the feature detection or tracking methods. For example, Badino et al. [83] improved the accuracy of the estimated motion by averaging each key feature location over all its previous occurrences. Furthermore, Kreso and Segvic [84] initially calibrated and corrected the camera parameters by comparing and matching the corresponding points between frames using ground truth motion. Cvisic and Petrovic [81] utilized a five-point algorithm to estimate the camera rotation and translation, relying on minimizing the reprojection inconsistency for a combined stereo and monocular VO setup: the camera rotation was estimated in the monocular case to overcome the error of an imperfect calibration, whereas the camera translation was estimated in the stereo case to improve the accuracy of the results.

The design of the neuromorphic vision sensor, the event-based camera, makes it an ideal alternative and indispensable for platforms that require accurate motion estimation and good tolerance to challenging illumination conditions. An event-based visual odometry (EVO) approach was proposed by [85] to compute the camera pose estimate with high precision and obtain a semi-dense 3D map of the environment. Due to the event-based camera characteristics, the proposed pose estimation method was very efficient and feasible to perform in real time on a standard CPU.

3) Conventional – Hybrid-based VO
For low-textured scenarios, feature-based VO schemes are not considered robust, since only a few features can be detected and tracked. On the other hand, appearance-based VO schemes exploit all image information in the detection and matching process between frames, leading to a more efficient outcome at the cost of considerable computational power. Thus, hybrid methods have been introduced to combine the advantages of the two above-mentioned schemes. Scaramuzza and Siegwart [37] utilized a hybrid VO framework wherein the translation of a ground vehicle was estimated by the feature-based method and the rotation was obtained by the appearance-based method; in such a scheme, the vehicle pose is estimated at a lower computational cost compared to purely feature-based approaches.

Furthermore, a semi-direct VO framework was proposed by Forster et al. [86] in which the camera pose was estimated in two main phases: the camera pose relative to the prior frame (feature correspondences) was estimated by minimizing the photometric error (appearance-based scheme), whereas the camera pose relative to the structure was assessed by minimizing the reprojection error (feature-based scheme). Such a hybrid approach improves the estimation accuracy and eliminates the cost of feature extraction per frame. Silva et al. [87] utilized a dense appearance-based VO to estimate the vehicle-scaled rotation and translation, incorporated with a feature-based method to recover the scaling factor accurately. Moreover, Feng et al. [88] presented a localization system, dependent upon the environmental conditions, that consists of parallel direct (appearance-based) and indirect (feature-based) modules: camera poses are estimated by the direct method in low-texture conditions, and the system shifts to the indirect method if enough features are detected within the frame. Alismail et al. [89] proposed a hybrid framework wherein binary feature correspondences were aligned using direct-based VO to increase system robustness, especially in low-light scenarios.

4) Non-conventional – Machine Learning-based VO
With the development of machine-learning tools, recent VO schemes have shifted towards learning-based approaches for more accurate motion estimation as well as faster data processing. In addition, one of the advantages of learning-based VO frameworks is that results can be obtained without prior knowledge of the camera parameters. Once a suitable training dataset is available, the developed regression or classification model aids and improves ego-motion estimation; for example, it can be utilized for scale correction by estimating the translation, and it is robust to the noise and outliers on which it is trained. Figure 8 illustrates the methodology of the learning-based VO paradigm: the network is trained using sequences of successive frames as the input information to predict depth information, motion parameters, or pose estimates as the ground truth output data.
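The sketch below is a minimal, hypothetical example of the learning-based pipeline illustrated in Fig. 8 (it does not reproduce any surveyed model): a small convolutional network takes a pair of stacked consecutive frames and regresses a 6-DoF relative pose (three translation and three rotation parameters). The layer sizes and the use of PyTorch are illustrative assumptions; a real system would be trained on sequences with ground-truth poses from public VO datasets.

```python
"""Illustrative sketch: a tiny pose-regression network for learning-based VO."""
import torch
import torch.nn as nn

class TinyVONet(nn.Module):
    """Regresses a 6-DoF relative pose from two stacked grayscale frames."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 6)   # [tx, ty, tz, rx, ry, rz]

    def forward(self, frame_pair):
        feat = self.encoder(frame_pair).flatten(1)
        return self.head(feat)

# One (untrained) forward/backward pass on random tensors, standing in for a
# batch of consecutive frame pairs and their ground-truth relative poses.
model = TinyVONet()
frames = torch.randn(4, 2, 128, 416)      # batch of 4 stacked frame pairs
gt_pose = torch.randn(4, 6)               # hypothetical ground-truth 6-DoF labels
pred = model(frames)
loss = nn.functional.mse_loss(pred, gt_pose)
loss.backward()
print("predicted pose shape:", tuple(pred.shape), "loss:", float(loss))
```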
As an example of the earliest work on learning-based VO, Roberts et al. [90] divided each image into blocks and developed a k-Nearest Neighbor (KNN) regression model trained to compute the optical flow for each block; the motion was then estimated by voting among the distinct blocks. Moreover, Roberts et al. [91] proposed another learning-based method that estimates the optical flow in a linear subspace when there are considerable depth regularities relative to the robot motion in the environment; the expectation-maximization (EM) algorithm was utilized to enhance the learning of the subspace properties. Similarly, Guizilini and Ramos [92], [93] developed Coupled Gaussian Processes (CGP) as a regression model to obtain optical flow feature parameters. This work was later extended in [94], where they introduced a CGP for the VO problem. The CGP enhanced the multitask capability of the VO system to exploit the correlation between the permitted multitasks through the coupled covariance functions. Furthermore, to enhance the system performance, they modified
TABLE 2. Summary of state-of-the-art VIO frameworks. Columns indicate, for each entry (Author(s), Year, Ref), the camera type (Monocular, Stereo, RGBD, Fisheye, Thermal, Event-based), the visual front end (Appearance-based or Feature-based), the fusion degree (Loosely-coupled, Semi-tightly coupled, or Tightly-coupled), and the fusion type (Filtering-based or Optimization-based), together with comments.
HaoChih & Francois, 2017 [153]:
• An indirect error-state estimation using an ESKF was employed to avoid information loss due to the high dynamic rate of the IMU data.
• A keyframe concept was adopted, which reduces IMU drift and increases system stability and performance.

Ling et al., 2018 [154]:
• The vision pose estimator was based on edge alignment, and data fusion was based on a sliding-window optimization scheme in the back-end block.
• An efficient IMU preintegration and two-way marginalization scheme was proposed for smooth and accurate pose estimation, appropriate for resource-constrained platforms.
• The framework can operate in real-time state estimation for aggressive quadrotor motions.

He et al., 2018 [155]:
• A sliding-window optimization framework was proposed in which the state is optimized by minimizing a cost function that uses the pre-integrated IMU error term along with the point and line re-projection errors.
• Experimental validation was based on the EuRoC MAV [147] and PennCOSYVIO [156] datasets.
• Results achieved comparable performance to ROVIO [151], OKVIS [145], and VINS-Mono [157].

Qin et al., 2018 [157]:
• A novel and robust monocular tightly coupled VIO framework was proposed that incorporates IMU preintegration, estimator initialization, online extrinsic calibration, relocalization, and efficient global optimization.
• Experimental results demonstrated real-time operation using a single camera and an IMU to estimate the vehicle pose.
• Experimental validation was based on EuRoC MAV [147] and showed superior performance over OKVIS [145].

Mur-Artal et al., 2017 [158]:
• IMU initialization is completed within a few seconds.
• Experimental validation was based on EuRoC MAV [147].
• Remarkable results were achieved by the proposed framework compared to [146].

Song et al., 2018 [159]:
• Results achieved high-precision estimation of the pose and velocity of a UAV when compared to an OptiTrack motion capture system.
• Online calibration of the IMU bias and the extrinsic parameters was performed during the robot motion.

Von Stumberg et al., 2018 [160]:
• A dynamic marginalization technique was proposed to adaptively employ marginalization strategies even in cases where certain variables undergo drastic changes.
• Experimental validation was based on EuRoC MAV [147] and showed superior performance over ROVIO [144] and DSO [48].

Khattak et al., 2019 [161]:
• A framework that fuses a thermal camera with inertial measurements was proposed to extend robotic capabilities to navigation in GNSS-denied and visually degraded environments.
• Results achieved comparable performance to ROVIO [151], OKVIS [145], and DSO [48].

Ma et al., 2019 [162]:
• The FAST feature detector and the KLT sparse optical flow algorithm were used for feature tracking, which reduces the computational cost.
• Experimental validation was based on EuRoC MAV [147].
• Results showed superior performance over OKVIS [145], VINS-Mono [157], and S-MSCKF [163].

Yang et al., 2019 [164]:
• Experimental validation was based on EuRoC MAV [147].
• Results showed comparable performance to OKVIS [145] and VINS-Mono [157].

Chen et al., 2019 [165]:
• Results achieved lower relative pose estimation error compared to ORB-SLAM2 [138] and OKVIS [145].
• Results outperformed ORB-SLAM2 [138] and OKVIS [145] in terms of root mean square error (RMSE), mean error, and standard deviation (STD).
Jiang et al., 2020 [166]:
• Experimental validation was based on EuRoC MAV [147].
• Results achieved comparable performance to VINS-Mono [157] and ROVIO [151] in terms of both accuracy and robustness.

Zhang, 2020 [167]:
• Experimental validation was based on EuRoC MAV [147] and the TUM-VI dataset [168].
• Results achieved comparable and good performance compared to S-MSCKF [163].

Zhong & Chirarattananon, 2020 [169]:
• Results achieved comparable and good performance compared to VINS-Mono [157] and ROVIO [151].
• Owing to the single-plane assumption in the proposed estimator, the execution time was 15-30 times faster than that of the two benchmark models, VINS-Mono [157] and ROVIO [151].

Sun et al., 2021 [170]:
• A novel state-estimation framework that integrates IMU measurements, a range sensor, and a vision sensor (a standard camera or an event camera).
• Experimental results showed that the use of an event camera in low-light environments provides an advantage over the standard camera, as the sensor does not suffer from motion blur.
observations [174].

In the second category, the VO is used to estimate the model states, while the IMU data are integrated as the observations to update the KF. This approach is able to provide long-term attitude estimates that are accurate, robust, stable, and drift-free. However, such an approach, in contrast to the first category, is mostly not based on IMUs for pose prediction.

FIGURE 10. General Pipeline of Loosely-coupled VIO.
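As a minimal illustration of the general loosely-coupled pipeline in Fig. 10 (a sketch under simplifying assumptions, not an implementation of any surveyed framework), the example below runs a planar Kalman filter in which IMU accelerations drive the prediction of position and velocity, and the pose reported by a separate VO front end is treated as the measurement that corrects the state. The noise levels, trajectory, and 2D state layout are invented for illustration.

```python
"""Illustrative sketch: loosely-coupled fusion of IMU prediction and VO pose updates."""
import numpy as np

rng = np.random.default_rng(2)
dt = 0.01                                  # IMU rate: 100 Hz
vo_every = 10                              # VO pose arrives at 10 Hz

# State x = [px, py, vx, vy]; kinematic model driven by IMU acceleration.
F = np.eye(4)
F[0, 2] = F[1, 3] = dt
B = np.array([[0.5 * dt**2, 0], [0, 0.5 * dt**2], [dt, 0], [0, dt]])
H = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0]])    # VO measures position only
Q = 1e-4 * np.eye(4)                               # process noise (assumed)
R = 1e-2 * np.eye(2)                               # VO measurement noise (assumed)

x = np.zeros(4)
P = np.eye(4)
true_pos = np.zeros(2)
true_vel = np.array([1.0, 0.5])

for k in range(500):
    accel_true = np.array([0.2 * np.sin(0.02 * k), 0.0])
    true_vel += accel_true * dt
    true_pos += true_vel * dt

    # --- Prediction: propagate the state with the (noisy) IMU acceleration ---
    accel_meas = accel_true + rng.normal(0, 0.05, 2)
    x = F @ x + B @ accel_meas
    P = F @ P @ F.T + Q

    # --- Update: correct with the VO position estimate when it is available ---
    if k % vo_every == 0:
        z = true_pos + rng.normal(0, 0.05, 2)      # pose from the VO front end
        y = z - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.solve(S, np.eye(2))
        x = x + K @ y
        P = (np.eye(4) - K @ H) @ P

print("final estimate:", x[:2], "true position:", true_pos)
```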
2) Semi-tightly coupled VIO
A semi-tightly coupled VIO approach processes the visual pose estimate together with the IMU sensory data while maintaining a balance between robustness and computational complexity. This approach aids real-time robotic navigation: it is able to cope with large latency between the visual image data and the IMU measurements and performs with limited computational resources. An example of a semi-tightly coupled VIO approach is presented in [178] for a Micro Aerial Vehicle (MAV) platform equipped with a single camera and an IMU. Data fusion was based on an EKF and the visual pose estimation on the eight-point algorithm. The framework demonstrated its capability to estimate the 6-DoF vehicle pose in real-time operation.

Moreover, another semi-tightly coupled VIO framework was developed by [154] to tackle real-time state estimation for aggressive quadrotor motions, as presented in Fig. 11. The vision pose estimator was based on edge alignment, and data fusion was based on a sliding-window optimization scheme in the back-end block. For smooth and accurate pose estimation, they utilized an efficient IMU preintegration and two-way marginalization scheme, which is appropriate for resource-constrained platforms.

3) Tightly-coupled VIO
A tightly coupled VIO system processes the key visual information and the IMU measurements together with the motion and observation models for vehicle state estimation. With the advancements in computer and software technologies, most VIO studies focus on employing a tightly coupled framework, as shown in Table 2. In tightly coupled approaches, as opposed to loosely coupled methods, all sensor measurements are jointly optimized, thereby producing higher-accuracy state estimation. The general pipeline of tightly coupled VIO is illustrated in Fig. 12.

Tightly coupled approaches can be categorized into two classes, i.e., filtering-based and optimization-based VIO methods, which are discussed in the following subsection IV-B. The classical tightly coupled EKF-based approach, well known in the VIO research area, is the multi-state constraint Kalman filter (MSCKF) developed in [143]. In this work, multiple geometric constraints were derived in the measurement model from multiple consecutive camera poses, arising when the same feature was observed across the motion scenes. The computational load of this framework was linear in the number of detected features in the frames. The experimental results showed that this approach was able to provide highly accurate pose estimation using a monocular camera and an IMU when performed in real time and in large-scale environments.

In addition, ROVIO [144] is another tightly coupled approach based on the EKF using a monocular camera and an IMU. In this work, the pixel intensity errors of image patches were used to formulate the observation equation of the EKF. This approach did not require any initialization stage, since it utilized inverse-distance landmark positions, which quickly constructed points in the map and started predicting the vehicle pose accurately. This work was later extended in [151] by inherently dealing with the tracked landmarks using an iterated EKF algorithm; this tight fusion of visual and IMU data, together with full-state refinement per landmark, elevated the accuracy and robustness of the model's pose prediction.

Another tightly coupled VIO approach was proposed in [145]. An IMU error term and the landmark reprojection error were integrated in a single nonlinear cost function, with previous states marginalized to reduce the computational load; the number of states in the sliding-window optimization stage is bounded to ensure real-time feasibility. Experimental results demonstrated real-time operation using a stereo camera and an IMU to estimate the vehicle pose, with results more accurate and robust than both vision-only and loosely coupled visual-inertial approaches. Later, the same framework was adopted by [157], albeit using a single-camera setup. More tightly coupled VIO studies are provided in Table 2.
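To make the tightly coupled formulation concrete, the sketch below (a toy example with invented data and noise values, not the cost used by any specific framework above) jointly optimizes a single relative pose by minimizing a stacked residual that combines landmark reprojection errors with an IMU-derived relative-motion term, which is the essence of the single nonlinear cost function described for [145].

```python
"""Illustrative sketch: joint minimization of reprojection and IMU residuals for one pose."""
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(3)
K = np.array([[500.0, 0, 320.0], [0, 500.0, 240.0], [0, 0, 1.0]])

# Synthetic landmarks and ground-truth relative pose between two keyframes.
landmarks = np.column_stack([rng.uniform(-2, 2, 30),
                             rng.uniform(-1, 1, 30),
                             rng.uniform(4, 8, 30)])
rvec_gt = np.array([0.02, -0.05, 0.01])
t_gt = np.array([0.3, 0.05, 0.1])

def project(points, rvec, t):
    cam = Rotation.from_rotvec(rvec).apply(points) + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

obs = project(landmarks, rvec_gt, t_gt) + rng.normal(0, 0.5, (30, 2))  # noisy observations

# Hypothetical preintegrated-IMU estimate of the same relative motion (with some error).
rvec_imu = rvec_gt + rng.normal(0, 0.01, 3)
t_imu = t_gt + rng.normal(0, 0.02, 3)

def residuals(x):
    rvec, t = x[:3], x[3:]
    # Visual term: landmark reprojection error (pixels).
    r_vis = (project(landmarks, rvec, t) - obs).ravel()
    # Inertial term: deviation from the IMU-predicted relative motion,
    # weighted by the assumed IMU confidence.
    r_imu = np.concatenate([(rvec - rvec_imu) / 0.01, (t - t_imu) / 0.02])
    return np.concatenate([r_vis, r_imu])

sol = least_squares(residuals, np.zeros(6))        # start from identity motion
print("estimated rotation vector:", sol.x[:3], "vs ground truth:", rvec_gt)
print("estimated translation:    ", sol.x[3:], "vs ground truth:", t_gt)
```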
B. TYPE OF DATA FUSION
Existing VIO studies, especially tightly coupled approaches, can generally be categorized by the type of data fusion into filtering-based and optimization-based paradigms. This section provides a detailed description of each approach and of existing solutions based on it.

1) Filtering-based VIO
Filtering-based VIO processes data in two stages: it integrates the IMU data to propagate the state estimate and then updates the state estimate with the vision-based estimator. Filtering-based VIO approaches can be formulated as a maximum a posteriori probability (MAP) estimator [25], where the IMU measurements from proprioceptive sensors are used to construct the prior distribution of the platform pose as the internal state of the MAP, while the visuals from exteroceptive sensors are used to compute the likelihood distribution of the platform pose as the external state of the MAP. In other words, the IMU linear accelerations and angular velocities drive the vehicle dynamic model to estimate the vehicle pose, and this model is later used to update the vehicle state with the key information obtained from the visual data for ego-motion estimation.

To date, the majority of the proposed filter-based solutions can be divided into four frameworks, i.e., algorithms based on the Extended Kalman Filter (EKF), the Unscented Kalman Filter (UKF), the Multi-State Constraint Kalman Filter (MSCKF), and the particle filter (PF). Existing solutions based on these frameworks are provided in the following subsections.

• Extended and Unscented Kalman Filters
Autonomous vehicles and robots are examples of nonlinear models. Data associated with nonlinear, dynamic models can be fused using a nonlinear filter such as the EKF or the UKF.
FIGURE 11. The Framework of Semi-tightly Coupled VIO based on Edge Alignments Developed by [154].
approach used all image information, which enabled VIO to be performed under challenging environmental conditions such as low-textured and low-gradient areas. A higher-order covariance (depth) propagation and a forward-backward pixel (intensity) propagation were employed, which enabled more precise motion estimation under different lighting conditions.

An alternative to and extension of the EKF is the UKF, a Bayesian filter that updates the system states via a group of sigma points. The prior distribution is used to derive the weighted sigma points, and the mean and covariance contours are then computed from the weighted sigma points propagated through the nonlinear model. In [182], the authors proposed a UKF-based VIO system designed directly on the 3D Special Euclidean group, SE(3). A matrix Lie group G is a group equipped with a smooth manifold structure such that the group multiplication and inversion operate smoothly. Processing rotation in the kinematic model is considered the main contributor to nonlinearity; typically, Euler angles [183] and quaternions [184] are used to represent the model orientation. In this framework, the kinematics of rotation are modeled on the SE(3) space, and by processing the visual and inertial information in the filter, a unique and global 6-DoF pose is estimated: the inertial measurements serve as control inputs, while the visual data are processed to update the state. A detailed analysis of the UKF on Lie groups is provided in [185]. Furthermore, to improve the pose estimation performance of the UKF in the presence of dynamic model errors, many adaptive UKF filtering methods have been addressed in the literature [186]–[188]; once the dynamic model errors are identified, the UKF estimate is corrected.

• Multi-State Constraint Kalman Filter
One of the main drawbacks of EKF approaches is their high computational load, which may not be suitable for resource-constrained platforms (e.g., UAVs). On the other hand, structure-less approaches such as the MSCKF framework are considered better in terms of accuracy and consistency because they do not rely on strict probabilistic assumptions or delayed linearization [189]. In addition, the MSCKF [143] framework has a complexity that is linear in the number of landmarks due to the marginalization of the 3D feature points.

Recently, a novel IMU initialization approach was proposed in [167], which can estimate the model's main parameters within a few seconds. This approach was combined with the stereo-based MSCKF framework [163] to deal with the system's inherent nonlinearities and measurement/observation noise. This noise-adaptive state estimator enhances the pose prediction accuracy and the overall model robustness, and the reported results outperform those of the state-of-the-art VIO method of [163].

2) Optimization-based VIO
Optimization-based VIO performs state estimation by solving a nonlinear least-squares problem over the IMU measurements and the visual data for an optimal prediction. Therefore, optimization-based VIO enables linearization of the state vector at various points, yielding more precise state estimates than those provided by filtering-based methods [190]. In such approaches, the IMU measurement constraints are calculated by integrating the inertial data between two frames, whereas in the conventional IMU integration technique the IMU body state initialization is computed at the initially captured images. Lupton and Sukkarieh [191] proposed an IMU preintegration technique to avoid such duplicated integrations, and the IMU preintegration module has been widely adopted in optimization-based VIO studies such as [145], [158], [192].

The IMU preintegration process was reformulated by Forster et al. [190] using the rotation group, computing on the manifold rather than with Euler angles. Furthermore, a continuous preintegration technique was adopted in the optimization-based VIO framework of Liu et al. [193]. Precise localization is achieved by optimization-based approaches, however at the cost of extra computational load, due to the higher number of landmarks required in the optimization module. Therefore, optimization-based VIO approaches might not be applicable to resource-constrained platforms. To address this issue, solutions have been proposed in the literature that aim at achieving a constant processing time, such as algorithms that marginalize partial past states and measurements to maintain a bounded-size optimization window [145], [158], [192], [194].
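The following sketch illustrates the IMU preintegration idea discussed above in its simplest discrete form (a didactic example only, with invented sample values and without bias, noise, or gravity modeling): gyroscope and accelerometer samples between two camera frames are folded into a single relative rotation, velocity, and position increment that can later enter the optimization as one constraint, instead of re-integrating the raw IMU stream every time the states are relinearized.

```python
"""Illustrative sketch: naive IMU preintegration of gyro/accel samples between two frames."""
import numpy as np
from scipy.spatial.transform import Rotation

dt = 0.005                                   # IMU period (200 Hz), assumed
n_samples = 40                               # samples between two camera frames

# Hypothetical body-frame measurements over the interval (bias/noise ignored).
gyro = np.tile(np.array([0.0, 0.0, 0.3]), (n_samples, 1))     # rad/s, slow yaw
accel = np.tile(np.array([0.5, 0.0, 0.0]), (n_samples, 1))    # m/s^2, forward push

# Preintegrated increments expressed in the frame of the first keyframe.
delta_R = np.eye(3)          # relative rotation
delta_v = np.zeros(3)        # relative velocity change
delta_p = np.zeros(3)        # relative position change

for w, a in zip(gyro, accel):
    # Accumulate position and velocity with the current rotation estimate,
    # then advance the rotation by the gyro increment (right multiplication).
    delta_p += delta_v * dt + 0.5 * (delta_R @ a) * dt**2
    delta_v += (delta_R @ a) * dt
    delta_R = delta_R @ Rotation.from_rotvec(w * dt).as_matrix()

print("preintegrated rotation (deg):",
      Rotation.from_matrix(delta_R).as_euler("xyz", degrees=True))
print("preintegrated velocity:", delta_v)
print("preintegrated position:", delta_p)
# In an optimization-based VIO back end, (delta_R, delta_v, delta_p) define a single
# residual between consecutive keyframe states, independent of their absolute values.
```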
In OKVIS [145], a group of nonsequential old camera poses and new sequential inertial states and measurements evolve in the nonlinear optimization module for a refined and precise pose estimate. In addition, Qin et al. [157] proposed an optimization-based VIO approach using a monocular camera that incorporates loop closure modules running concurrently in a multithreaded mode to ensure reliability and guarantee real-time operation. Another VIO approach was proposed in [195]; however, it efficiently utilized loop closures that ran in a single thread and thus had linear computational complexity.

Furthermore, Rebecq et al. [196] proposed an event-based VIO algorithm using nonlinear optimization for pose estimation. The generated asynchronous events, which have microsecond resolution, are accumulated into a frame per spatiotemporal window. Features are then detected and tracked using the FAST corner detector and the Lucas-Kanade tracker, respectively, and the matched features are triangulated between frames in order to estimate the relative camera motion. The estimated camera poses and 3D landmark positions are periodically refined by minimizing the reprojection error and the inertial measurement error for an effective fusion of the visual and IMU measurements. The performance of the model was evaluated on a large-scale and extremely high-speed dataset, and this evaluation demonstrated the accuracy and robustness of the model.

The work of Mueggler et al. [197] proposed a continuous-
The work of Mueggler et al. [197] proposed a continuous-time framework that uses an event camera to perform VIO. Their framework directly integrates the asynchronous events, at microsecond resolution, with the high-rate IMU measurements. Cubic splines are used to approximate the trajectory of the event camera as a smooth curve in the space of rigid-body motions. Their model was evaluated in real time on extensive scenes against ground truth obtained from a motion-capture system, with remarkable accuracy (position and orientation errors below 1%).
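To illustrate the continuous-time idea, the following sketch represents orientation and translation as smooth splines over a few control poses and queries the pose at an arbitrary (e.g., event or IMU) timestamp. It uses independent rotation and translation splines from SciPy as a simplified stand-in, not the cumulative SE(3) B-spline formulation of [197]; the control-pose values are made up for illustration.

# Simplified continuous-time trajectory: query the pose at any timestamp.
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.spatial.transform import Rotation, RotationSpline

t_knots = np.array([0.0, 0.1, 0.2, 0.3, 0.4])            # knot times (seconds)
rotations = Rotation.from_euler("xyz", [[0, 0, 0], [5, 0, 2], [10, 1, 4],
                                        [14, 3, 5], [20, 4, 8]], degrees=True)
positions = np.array([[0, 0, 0], [0.05, 0.01, 0], [0.11, 0.03, 0.01],
                      [0.18, 0.04, 0.01], [0.26, 0.05, 0.02]])

rot_spline = RotationSpline(t_knots, rotations)   # smooth orientation spline
pos_spline = CubicSpline(t_knots, positions)      # cubic spline per translation axis

def pose_at(t):
    """Query the continuous-time pose at an arbitrary timestamp t."""
    return rot_spline(t).as_matrix(), pos_spline(t)

# Example: pose at an asynchronous event timestamp, and angular rate at IMU rate.
R_t, p_t = pose_at(0.137)
omega_t = rot_spline(0.137, 1)   # first derivative: angular velocity (rad/s)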
than the individual image (i.e., visible or thermal image),
V. LOCALIZATION TECHNIQUES IN LOW-VISIBILITY easy and clear to recognize, and robust under visual degraded
ENVIRONMENTS environments. For example, during low light conditions,
For navigation through visually degraded environments, new color remapping could improve target detection ability by
vision-based localization techniques have to be explored enhancing image contrast and the use of highlighting color
by expanding the model work capability even beyond the [203] which lead to a faster scene recognition [204]. Various
visible band. As opposed to standard visible cameras, in- image fusion methods have been investigated in the literature,
frared cameras are more robust against illumination changes. such as integrating visual and near-infrared context infor-
With the recent advancements in thermal sensors in terms mation [205]–[209] and enhance image contrast [210]. Re-
of size, weight, resolution, load, and cost, thermal-inertial cently, AI-based approaches have been investigated for color
odometry (TIO) is now considered a promising technique mapping of gray-scale thermal image [211]–[213] to enhance
for autonomous UAV and UGV systems that works in low image intensity and retain high level scene information, thus
visibility conditions without relying on GNSS data or any leading for a better visualization.
other costly sensor such as LiDARs. The working principle Recently, two localization techniques in low visibility
of thermal cameras is capturing the temperature profile in the environments were proposed by Mandischer et al. [214], a
scene; thus, it can be used in low visibility environments (i.e., novel radar-based SLAM and another radar-based localiza-
low light, night) without the need of any additional source of tion strategy employing laser maps. These approaches are
light. evaluated in indoor environments with heavy dust formu-
However, the disadvantages of thermal sensors include lation to emulate vision scenarios of the grinding process.
providing low textured, featured and image resolution as well In the first approach, scan-to-map technique was developed
as having quite low signal-to-noise ratios [198]. Therefore, based on probabilistic iterative correspondence (pIC) SLAM.
in this case, several computer vision algorithms would not While in the second approach, they utilized environmental
be effective and hence, would need further developments. In information prior to the grinding process. This data set is
literature, very limited studies have been proposed related used to generate a laser map that aids the localization process
to TIO, such as [14], [199]–[201], and could be further of radar-based SLAM. They provided a strategy to improve
investigated to overcome the limitations of thermal imagers. localization using laser maps with line fitting on Radar-based
Moreover, [202] provides an approach employing LiDAR for SLAM.
a better visualization under different degraded environments. The performance of VO is negatively affected in chal-
For SLAM applications, Shim and Kim [16] proposed a lenging illumination conditions and high dynamic range
direct thermal-infrared SLAM platform which is optimized (HDR) environments due to brightness inconsistency. There-
and tightly coupled with LiDAR measurements. This multi- fore, Gomez-Ojeda et al. [215] have proposed a learning-
modality system was selected to overcome the photometric based method to enhance the image representation of the
consistency problem of thermal images due to accumulated sequences for VO. They have adopted long short-term mem-
sensor noise over time. The first step was to rescale 14-bit ory (LSTM) layers to maintain temporal consistency of the
raw radiometric data into grey-scale for feature extraction, image sequences, thanks to LSTM internal memory cell. The
and the photometric consistency of thermal images was then trained network has been implemented in two state-of-the-
resolved by tracking the depth information of LiDAR mea- art algorithms of VO methods (ORB-SLAM [134] and DSO
surements. [48]) and tested in challenging environments. Pose estima-
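A minimal sketch of the 14-bit-to-8-bit rescaling step mentioned above is shown below. The percentile clipping used to stretch the contrast is our assumption for illustration, not necessarily the scheme used in [16].

# Rescale 14-bit raw radiometric data to an 8-bit grey-scale image for features.
import numpy as np
import cv2

def thermal14_to_gray8(raw14, low_pct=1.0, high_pct=99.0):
    """Clip the 14-bit radiometric range to robust percentiles, then scale to 0-255."""
    lo, hi = np.percentile(raw14, [low_pct, high_pct])
    clipped = np.clip(raw14.astype(np.float32), lo, hi)
    gray = (clipped - lo) / max(hi - lo, 1e-6) * 255.0
    return gray.astype(np.uint8)

# Toy usage: synthetic 14-bit frame, then ORB feature detection on the rescaled image.
raw = np.random.default_rng(2).integers(0, 2**14, size=(240, 320), dtype=np.uint16)
gray = thermal14_to_gray8(raw)
keypoints = cv2.ORB_create().detect(gray, None)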
For night-time visual systems, researchers have investigated advanced night imaging systems that use onboard low-light-level cameras with the aid of computer vision algorithms and artificial intelligence (AI) based frameworks. These cameras map the thermal energy of different objects in the scene into a visible image over a wide range of spectra: visible (0.4–0.7 µm) to near-infrared (0.7–1.0 µm) light, or long-wave infrared (8–14 µm). Such systems are therefore illumination dependent and unable to fulfill scene understanding in conditions where visibility is insufficient, such as during low-light or night missions.
The wide range of spectral-band images, together with the development of rendered visible/thermal fused imagery and enhancement algorithms, can aid platform observation tasks such as operating in and recognizing several aspects of a scene and detecting and localizing targets. The objective of providing such fused imagery is to present content that is more informative than either individual image (i.e., the visible or the thermal image), easy and clear to recognize, and robust under visually degraded environments. For example, during low-light conditions, color remapping can improve target detection by enhancing image contrast and using highlighting colors [203], which leads to faster scene recognition [204]. Various image fusion methods have been investigated in the literature, such as integrating visual and near-infrared context information [205]–[209] and enhancing image contrast [210]. Recently, AI-based approaches have been investigated for color mapping of grey-scale thermal images [211]–[213] to enhance image intensity and retain high-level scene information, thus leading to better visualization.
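The sketch below shows the kind of visible/thermal fusion and color remapping described above: a registered visible/thermal pair is blended, and the thermal channel is color-mapped to highlight warm targets. The blend weight and colormap are arbitrary choices, not those of the cited works.

# Weighted visible/thermal fusion plus a color-remapped thermal overlay.
import numpy as np
import cv2

def fuse_visible_thermal(visible_gray, thermal_gray, alpha=0.6):
    """Blend registered 8-bit visible and thermal images; also build a color overlay."""
    fused_gray = cv2.addWeighted(visible_gray, alpha, thermal_gray, 1.0 - alpha, 0)
    thermal_color = cv2.applyColorMap(thermal_gray, cv2.COLORMAP_JET)
    overlay = cv2.addWeighted(cv2.cvtColor(visible_gray, cv2.COLOR_GRAY2BGR), 0.5,
                              thermal_color, 0.5, 0)
    return fused_gray, overlay

# Toy usage with synthetic images (a real system would first register the two views).
rng = np.random.default_rng(3)
vis = rng.integers(0, 255, (240, 320), dtype=np.uint8)
thr = rng.integers(0, 255, (240, 320), dtype=np.uint8)
fused, colored = fuse_visible_thermal(vis, thr)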
Recently, two localization techniques for low-visibility environments were proposed by Mandischer et al. [214]: a novel radar-based SLAM and a radar-based localization strategy employing laser maps. These approaches were evaluated in indoor environments with heavy dust formation to emulate the visual conditions of a grinding process. In the first approach, a scan-to-map technique was developed based on probabilistic iterative correspondence (pIC) SLAM, while in the second approach environmental information gathered prior to the grinding process was utilized. This data set is used to generate a laser map that aids the localization process of the radar-based SLAM; the authors thus provide a strategy to improve localization using laser maps with line fitting on top of radar-based SLAM.
The performance of VO is negatively affected in challenging illumination conditions and high-dynamic-range (HDR) environments owing to brightness inconsistency. Therefore, Gomez-Ojeda et al. [215] proposed a learning-based method to enhance the image representation of the sequences used for VO. They adopted long short-term memory (LSTM) layers to maintain the temporal consistency of the image sequences, thanks to the LSTM internal memory cell. The trained network was integrated into two state-of-the-art VO methods (ORB-SLAM [134] and DSO [48]) and tested in challenging environments. Pose estimation results using the enhanced image representation were compared with VO results on the original image sequences, demonstrating that the network enhances localization, especially in challenging conditions.
Alismail et al. [216] proposed the use of binary feature descriptors within a direct VO framework, which enhances visual state estimation and increases system robustness under illumination variation; the approach is invariant to monotonic changes of the image intensity.
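The following tiny sketch shows a census-style binary descriptor of the kind referenced above: each pixel is described by the sign of its differences to its eight neighbours, so any strictly monotonic remapping of the intensities leaves the descriptor unchanged. This is our illustration of the principle, not the descriptor used in [216].

# Census-style binary descriptor: invariant to monotonic intensity changes.
import numpy as np

def census_transform(img):
    """Return an 8-bit code per interior pixel: bit k is 1 if neighbour k > centre."""
    h, w = img.shape
    centre = img[1:h-1, 1:w-1]
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= ((neighbour > centre).astype(np.uint8) << bit)
    return codes

# The code is unchanged under a strictly monotonic intensity change (gain + offset).
rng = np.random.default_rng(4)
frame = rng.integers(0, 256, (120, 160)).astype(np.uint8)
remapped = frame.astype(np.float32) * 1.5 + 20.0
assert np.array_equal(census_transform(frame), census_transform(remapped))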
In addition, Park et al. [217] performed a systematic evaluation of various direct image alignment methods in terms of accuracy and robustness under significant illumination changes. Kim et al. [218] proposed a stereo VO algorithm that employs an affine illumination model in each image patch to cope with abrupt illumination variation in a direct state estimation model; the proposed approach demonstrated a real-time capability to accurately localize an aerial robot maneuvering under significant illumination changes. In addition, a multi-sensor fusion pose estimation technique based on a factor graph framework was proposed in [219] for navigation in visually degraded environments. Four different sensors were used, namely an IMU, a stereo camera with two LED lights, an active infrared (IR) camera, and a 2D LiDAR; these were mounted on a UGV and tested in totally dark environments.
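A small sketch of the per-patch affine illumination idea attributed above to Kim et al. [218] is given below: each reference patch is related to the target patch by I' ≈ a·I + b, and (a, b) is estimated by least squares before computing the photometric residual. Only the compensation step is shown; the full direct stereo VO pipeline is not reproduced here.

# Per-patch affine illumination model: fit (a, b) and compensate before the residual.
import numpy as np

def fit_affine_illumination(patch_ref, patch_tgt):
    """Least-squares fit of gain a and bias b so that a*patch_ref + b ~= patch_tgt."""
    x = patch_ref.ravel().astype(np.float64)
    y = patch_tgt.ravel().astype(np.float64)
    A = np.column_stack([x, np.ones_like(x)])
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b

def photometric_residual(patch_ref, patch_tgt):
    """Residual after compensating the reference patch with the fitted (a, b)."""
    a, b = fit_affine_illumination(patch_ref, patch_tgt)
    return patch_tgt.astype(np.float64) - (a * patch_ref.astype(np.float64) + b)

# Toy check: a patch seen under a gain/offset change yields a near-zero residual.
rng = np.random.default_rng(5)
ref = rng.uniform(0, 255, (8, 8))
tgt = 0.7 * ref + 30 + rng.normal(0, 0.5, (8, 8))   # abrupt brightness change + noise
print(float(np.abs(photometric_residual(ref, tgt)).mean()))   # small despite the change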
The high-dynamic-range property of the dynamic vision sensor can enable visual localization models to operate in challenging illumination conditions, such as a low-light room. Hence, Vidal et al. [220] proposed a hybrid framework that fuses event data, visual data from standard frames, and IMU measurements for more accurate and robust pose estimation. The model was integrated on a resource-constrained platform (a quadrotor UAV) and evaluated extensively in different flight scenarios, such as hovering, flying in fast circles, and varying lighting conditions. Their model outperformed the pose estimation obtained from standard frame-based VIO by 85%.

VI. DISCUSSION AND FUTURE RESEARCH DIRECTIONS
Recent research and technologies have proven the capability of autonomous vehicles to navigate in GNSS-denied and low-visibility environments. Such platforms can feed the end user with useful information derived from vision-based systems, and an appropriate UAV or UGV navigation system is adopted based on the application's needs. Vision-based localization is one of the promising research directions related to computer vision and deep learning (DL); it aims to estimate the robot's ego-motion within the environment using a set of subsequent measurements. Researchers have investigated novel approaches to enhance the accuracy, robustness, reliability, and adaptability of vehicle self-localization (position and orientation) while maneuvering.
This survey provides a comprehensive overview of most of the state-of-the-art visual-based localization solutions. These techniques employ visual sensory data, possibly combined with other sensors, to localize the robot in GNSS-denied environments and in low-visibility conditions. Two main vision-based navigation paradigms have been reviewed, visual odometry and visual-inertial odometry, and discussed in terms of the key design aspects, advantages, and limitations of each paradigm, where applicable.
Key design choices of VO schemes can be classified, based on the visual sensor(s) used and the selected processing modules, into geometric and nongeometric approaches. In the former, camera geometric relations are used to estimate the ego-motion from, for example, the intensity values of image pixels (appearance-based VO [36], [37], [43], [44], [48]–[50], [128]) or the image texture (feature-based VO [64], [65], [104]–[106], [108], [109], [111], [126], [137]). Such methods can provide precise state estimation only if enough features within the environment are observed under good lighting conditions. On the other hand, the nongeometric approach, learning-based VO [102], [112], [117], [118], [127], [130], [132], does not require an initialization step for the camera parameters or a scale-correction process for the estimated trajectory, as is the case for monocular VO. The VO scheme can be a good candidate for precise localization in GNSS-denied, textured environments under good illumination conditions. Table 1 provides a summary of the recent literature in the VO field, highlighting key design choices and evaluation criteria.
Inertial-based odometry approaches use the high-rate IMU data (linear acceleration and angular velocity) to estimate the vehicle pose. This approach is unreliable for long-term state estimation because the IMU data are corrupted by noise over time. Hence, VIO solutions are proposed to overcome the limitations of both visual odometry and inertial odometry. VIO techniques are classified, based on the processing stage at which sensor fusion (visual data + IMU) occurs, into loosely coupled [153], [176], [177] and tightly coupled models [158]–[162], [166], [167], [169], [170], [199]. Loosely coupled VIO processes the visual and inertial information independently, and each module estimates a camera pose; at a later stage, the poses estimated by the IMU and VO state estimators are fused to produce a refined pose. Such an approach is simple and easy to integrate with other sensor modalities. However, in terms of pose estimation accuracy and robustness, it is inferior to tightly coupled VIO techniques, in which all sensor measurements (visual + IMU) are jointly processed and optimized for pose estimation.
VIO can be further classified, based on the type of data fusion, into filtering-based [153], [159], [162], [167], [181] and optimization-based [145], [158], [190]–[193] solutions. In general, state estimation in filtering-based VIO proceeds in two stages: (i) predict the vehicle pose using the IMU linear accelerations and angular velocities that drive the vehicle dynamic model, and (ii) update the vehicle pose using the key information extracted from the visual data. Existing filtering-based VIO solutions use nonlinear filter frameworks (Kalman filters) in which the errors are linearized to produce accurate pose estimates. These solutions can be categorized, based on the filtering framework, into EKF [144], [151], [179], [181], UKF [182], and MSCKF [143], [163], [167].
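The two-stage loop just described can be sketched as follows: an IMU-driven prediction of position and velocity, followed by a Kalman update from a vision-derived position fix. This is a planar toy with fixed noise levels, intended only to show the predict/update split, not any of the cited filters.

# Two-stage filtering-based VIO toy: IMU propagation (predict) + visual fix (update).
import numpy as np

dt = 0.01                                  # IMU period (100 Hz)
F = np.array([[1, 0, dt, 0],               # state: [px, py, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
B = np.array([[0.5 * dt**2, 0],
              [0, 0.5 * dt**2],
              [dt, 0],
              [0, dt]], dtype=float)
H = np.array([[1, 0, 0, 0],                # vision measures position only
              [0, 1, 0, 0]], dtype=float)
Q = 1e-4 * np.eye(4)                       # process noise (IMU noise/bias proxy)
R = 1e-2 * np.eye(2)                       # visual measurement noise

def predict(x, P, accel):
    """Stage (i): drive the dynamic model with the IMU linear acceleration."""
    x = F @ x + B @ accel
    P = F @ P @ F.T + Q
    return x, P

def update(x, P, z_vis):
    """Stage (ii): correct the predicted pose with the vision-derived position."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z_vis - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

x, P = np.zeros(4), np.eye(4)
rng = np.random.default_rng(6)
for k in range(100):
    x, P = predict(x, P, accel=np.array([0.1, 0.0]) + rng.normal(0, 0.01, 2))
    if k % 10 == 0:                        # vision runs at a lower rate than the IMU
        x, P = update(x, P, z_vis=x[:2] + rng.normal(0, 0.05, 2))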
The EKF approach can be used to linearize the nonlinear, dynamic model about the current estimate, thereby providing good pose estimates. On the other hand, this comes at the cost of a computational load that grows quadratically with the number of features tracked per frame.
Moreover, to deal with highly nonlinear models, an approach based on the UKF has been proposed; it is an extension of the EKF framework and achieves higher accuracy at the cost of additional computational load [182]. To address the computational load constraints, the MSCKF framework was proposed; it provides better accuracy and consistency, and its computational load is linearly proportional to the number of detected landmarks.
In the optimization-based VIO framework, pose estimation is performed by solving a nonlinear least-squares problem over the IMU and visual information. Such approaches outperform the pose predictions obtained from filtering-based VIO because of their ability to re-linearize the state at various points, producing more precise predictions at the cost of extra computational load. To tackle this issue and make the approach suitable for deployment on resource-constrained platforms (e.g., drones), solutions have been proposed that keep the processing time bounded through a fixed-size optimization window and the marginalization of past states [145], [158], [192], [194]. In other words, only a few states are updated by the nonlinear optimization solver, which reduces the computational load and makes real-time operation more feasible. Table 2 highlights the design choices of the latest studies in the VIO field.
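The fixed-window idea can be illustrated with a compact linear-algebra sketch: when the oldest state leaves the window, its block is removed from the Gauss-Newton system via a Schur complement, so the window size, and hence the per-iteration cost, stays bounded. This is our illustration of the principle, not the marginalization machinery of [145], [158].

# Marginalize the oldest state block of a Gauss-Newton system H dx = b.
import numpy as np

def marginalize_oldest(H, b, m):
    """Marginalize the first m state variables; return the prior (H', b') acting on
    the remaining variables via the Schur complement."""
    H_mm, H_mr = H[:m, :m], H[:m, m:]
    H_rm, H_rr = H[m:, :m], H[m:, m:]
    b_m, b_r = b[:m], b[m:]
    H_mm_inv = np.linalg.inv(H_mm)
    H_prior = H_rr - H_rm @ H_mm_inv @ H_mr      # Schur complement
    b_prior = b_r - H_rm @ H_mm_inv @ b_m
    return H_prior, b_prior

# Toy usage: a window of 4 poses x 6 dof; drop the oldest pose (first 6 variables).
rng = np.random.default_rng(7)
A = rng.normal(size=(200, 24))
H = A.T @ A + 1e-3 * np.eye(24)                  # well-conditioned information matrix
b = A.T @ rng.normal(size=200)
H_new, b_new = marginalize_oldest(H, b, m=6)     # 18x18 system for the 3 kept poses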
State-of-the-art solutions for self-localization in low-visibility environments can be categorized into single-modality and multi-modality frameworks. In a single-modality framework, the state estimation is performed using data obtained from a single sensor, e.g., stereo-based VO [218], thermal-based VO [106], or event-based VO [85]. Thermal sensors provide low-resolution images with little texture and few features. To address this issue, many techniques that fuse the thermal image with the visible image have been proposed, based on computer vision algorithms [205]–[209] or machine learning tools [211]–[213]; the aim is to retain high-level scene information for better visualization and scene understanding.
To enhance robustness to difficult illumination conditions and high-dynamic-range (HDR) environments, enhanced VO frameworks have been proposed in the literature, for instance using binary descriptors [216], an affine illumination model [218], and learning-based methods that enhance the image representation [215]. Moreover, multi-modality frameworks have been proposed to cope with the difficulty of perceiving the environment around the vehicle at low visibility. In such frameworks, the robot's pose is estimated using a multi-sensory data fusion technique, e.g., a thermal imager with an IMU [14], [201], an event-based camera with an IMU [196], [197], [221], a thermal imager with LiDAR measurements [16], radar with LiDAR [214], or more than two sensory data sources [199], [200], [219].
Based on the reviewed research studies, various components have been considered when developing visual-based localization approaches, such as: the sensor modality, the type of environment, the type of platform (ground or aerial vehicle) and the available computational resources, and the dimension of the pose estimate (2D or 3D). Visual localization can be processed in three main modules: preprocessing, state estimation, and postprocessing. These choices affect the performance, the prediction accuracy, and the power and energy efficiency of the system. Therefore, based on the application's needs, a suitable localization approach should be investigated or researched for optimal performance. Environmental texture properties and lighting conditions affect the main source of perception used to self-localize the robot within its workplace (scene understanding). The main evaluation metrics considered in the literature are performance, accuracy, power and energy efficiency, and system robustness. The different parts of performing visual localization are summarized in Fig. 13.
According to the literature reviewed, the following challenges hinder the progression of effective self-localization systems.
1) Robustness: In the presence of illumination variations caused, for example, by lighting or weather conditions, VO and VIO approaches based on standard cameras perform localization poorly owing to the lack of features detected within the environment. Post-processing techniques are available to reduce the effect of outliers and enhance the performance, however at the cost of additional computational load. To that end, post-processing vision-based state estimators with deep learning (DL) approaches may significantly improve the pose results and hence lead to a more robust visual localization system, as in [222]. DL approaches have the ability to adapt to inherent system nonlinearities as well as to variations in the environment. In addition, very limited studies have employed a thermal camera for odometry estimation while overcoming its low feature resolution [14], [199]–[201].
2) Applicability: Some platforms are limited in power and computational capabilities. Therefore, approaches that operate on sensory data in real time should utilize fully learning-based or hybrid learning/conventional paradigms. Such systems are considered application-dependent models; however, with the advances in machine learning tools, their capabilities can be extended over time via fine-tuning or transfer learning techniques.
3) Reliability: Real-time operation requires the system to meet the above-mentioned criteria of robustness and applicability. The reviewed visual-based localization approaches are application-dependent models. Online state estimators require the robot to be self-aware of the surrounding environment and, depending on the situation, to operate the most suitable odometry technique. Having such an intelligent decision (ID) platform, in which the best-suited approach is operated for the robot's ego-motion estimation based on the current conditions, would improve state estimation performance and increase system adaptability, reliability, and robustness.
4) Adaptability: Persistent pose estimation must adapt to changes in environmental texture properties, features, and illumination conditions over time. Enhancing the image representation in the visual localization preprocessing module using deep learning approaches would enable better perception and extraction of meaningful information, as in the work proposed by [215].
Compared with visible cameras, the dynamic vision sensor (DVS) can provide sufficient data under low-light conditions at low latency, with a high dynamic range and no motion blur [223]. Event-based observation and navigation approaches have been widely proposed in the literature [223]; however, their key potential for real-time applications under low-light conditions has not yet been largely investigated.
For low-visibility conditions, given the above capability of the event-based camera, fusing thermal and event-based sensory data with an IMU might lead to a more robust self-localization scheme. Such schemes could be investigated together with DL tools that learn and capture the inherent nonlinear error patterns associated with the overall visual odometry estimation. This would enhance the end-to-end framework or the preprocessing of the raw data, and hence increase system efficiency and reliability when navigating in reduced-visibility conditions.
The event data generated by event-based sensors are noisy, depending on the illumination conditions, and very sensitive to the camera parameters. In low-light conditions, the features or edges of moving objects are highly scattered and very noisy, even when the camera parameters are tuned to their optimal values. Therefore, an approach that can reject this noise and sharpen the real event data is essential for a better extraction of meaningful information under normal, low-light, and/or varying lighting conditions (a minimal filtering sketch follows this list). Yet, event denoising methods based on conventional spatiotemporal correlation or on learning approaches remain largely unexplored [224]–[230].
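The filtering sketch referred to in item 4 is given below: a spatiotemporal-correlation ("background activity") style filter that keeps an event only if a neighbouring pixel fired within a short time window. The neighbourhood size and the time window are illustrative choices, not taken from [224]–[230].

# Simple spatiotemporal-correlation event denoiser (background-activity style).
import numpy as np

def denoise_events(events, sensor_shape=(180, 240), dt=5e-3):
    """events: array of rows (t, x, y, polarity), sorted by timestamp t."""
    last_ts = np.full(sensor_shape, -np.inf)      # last event time per pixel
    keep = np.zeros(len(events), dtype=bool)
    h, w = sensor_shape
    for i, (t, x, y, _) in enumerate(events):
        x, y = int(x), int(y)
        y0, y1 = max(0, y - 1), min(h, y + 2)
        x0, x1 = max(0, x - 1), min(w, x + 2)
        support = (t - last_ts[y0:y1, x0:x1]) < dt
        support[y - y0, x - x0] = False           # ignore the event's own pixel
        keep[i] = support.any()                   # keep only spatiotemporally supported events
        last_ts[y, x] = t
    return events[keep]

# Toy usage: correlated events survive, an isolated noise event is rejected.
ev = np.array([[0.000, 50, 40, 1],
               [0.001, 51, 40, 1],    # neighbour of the first event -> kept
               [0.002, 120, 90, 0]])  # isolated -> rejected
print(denoise_events(ev))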
VII. CONCLUSION
In this article, we have surveyed most of the state-of-the-art studies related to visual-based localization solutions, namely VO and VIO, to aid autonomous navigation in GNSS-denied environments. In addition, we have conducted a comprehensive review of self-localization techniques for autonomous navigation in visually degraded environments. The main components of performing visual localization were identified and discussed.
Studies related to VO have been classified, based on key design choices, into conventional approaches (appearance-, feature-, and hybrid-based methods) and non-conventional approaches (learning-based methods). An overview of the key design aspects of each category was provided, and the challenges associated with each approach were highlighted, where applicable. In addition, VIO-related studies have been categorized based on the type of sensory data that are fused and the stage at which this fusion takes place. VIO techniques can be categorized into filtering- or optimization-based paradigms, which include loosely, semi-tightly, and tightly coupled approaches. Key design characteristics, strengths, and weaknesses of each type were discussed. To address lighting-condition challenges, pose estimation is performed by frameworks that enhance the image representation and feature extraction modules. Furthermore, the fusion of data from multiple sensors, such as thermal imagers, event-based cameras, IMUs, LiDAR, and radar, is also examined to cope with difficulties in perceiving the environment. Advances in computer vision algorithms, machine learning tools, and both software and hardware technologies should be directed towards developing efficient self-localization systems. Such systems should have an environment-awareness capability, be resilient to outliers, adapt to environmental challenges, and provide reliable, robust, and accurate estimates in real time. Based on the surveyed papers, the main future self-localization directions include pose estimation in GNSS-denied, complex, and visually degraded environments, and the main future research trends in this topic are robustness, applicability, reliability, and adaptability.

REFERENCES
[1] G. Balamurugan, J. Valarmathi, and V. P. Naidu, “Survey on UAV Navigation in GPS Denied Environments,” International Conference on Signal Processing, Communication, Power and Embedded System, SCOPES 2016 - Proceedings, pp. 198–204, 2017.
[2] A. Bircher et al., “Structural Inspection Path Planning Via Iterative Viewpoint Resampling With Application to Aerial Robotics,” in 2015 IEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 6423–6430.
[3] C. Papachristos, S. Khattak, and K. Alexis, “Uncertainty-aware Receding Horizon Exploration and Mapping Using Aerial Robots,” Proceedings - IEEE International Conference on Robotics and Automation, pp. 4568–4575, 2017.
[4] D. Zermas et al., “Automation Solutions for the Evaluation of Plant Health in Corn Fields,” IEEE International Conference on Intelligent Robots and Systems, vol. 2015-Decem, pp. 6521–6527, 2015.
[5] D. Zermas et al., “Estimating the Leaf Area Index of Crops Through the Evaluation of 3D Models,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 6155–6162.
[6] J. G. Mooney and E. N. Johnson, “Integrated Data Management for a Fleet of Search-and-rescue Robots,” Journal of Field Robotics, vol. 33, no. 1, pp. 1–17, 2014. [Online]. Available: http://onlinelibrary.wiley.com/doi/10.1002/rob.21514/abstract
[7] C. Papachristos, D. Tzoumanikas, and A. Tzes, “Aerial Robotic Tracking of a Generalized Mobile Target Employing Visual and Spatio-temporal Dynamic Subject Perception,” IEEE International Conference on Intelligent Robots and Systems, vol. 2015-Decem, pp. 4319–4324, 2015.
[8] W. He, Z. Li, and C. L. P. Chen, “A Survey of Human-centered Intelligent Robots: Issues and Challenges,” IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 4, pp. 602–609, 2017.
[9] S. Poddar, R. Kottath, and V. Karar, “Motion Estimation Made Easy: Evolution and Trends in Visual Odometry,” in Recent Advances in Computer Vision. Springer, 2019, pp. 305–331.
[10] S. A. Mohamed et al., “A Survey on Odometry for Autonomous Navigation Systems,” IEEE Access, vol. 7, pp. 97466–97486, 2019.
[11] Y. D. V. Yasuda, L. E. G. Martins, and F. A. M. Cappabianco, “Autonomous Visual Navigation for Mobile Robots: A Systematic Literature Review,” ACM Comput. Surv., vol. 53, no. 1, Feb. 2020. [Online]. Available: https://doi-org.libconnect.ku.ac.ae/10.1145/3368961
[12] D. Scaramuzza and F. Fraundorfer, “Visual Odometry Part I: The First 30 Years and Fundamentals,” IEEE Robotics and Automation Magazine, vol. 18, no. 4, pp. 80–92, 2011.
[13] W. Rone and P. Ben-Tzvi, “Mapping, Localization and Motion Planning in Mobile Multi-robotic Systems,” Robotica, vol. 31, no. 1, pp. 1–23, 2013.
[14] C. Papachristos, S. Khattak, and K. Alexis, “Autonomous Exploration of Visually-degraded Environments Using Aerial Robots,” 2017 International Conference on Unmanned Aircraft Systems, ICUAS 2017, pp. 775–780, 2017.
[15] A. Djuricic and B. Jutzi, “Supporting UAVs in Low Visibility Conditions by Multiple-Pulse Laser Scanning Devices,” ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. XL-1/W1, no. May, pp. 93–98, 2013.
[16] Y. Shin and A. Kim, “Sparse Depth Enhanced Direct Thermal-Infrared SLAM Beyond the Visible Spectrum,” IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 2918–2925, 2019.
[17] A. Saha, A. Kumar, and A. K. Sahu, “FPV Drone with GPS Used for Surveillance in Remote Areas,” Proceedings - 2017 3rd IEEE International Conference on Research in Computational Intelligence and Communication Networks, ICRCICN 2017, vol. 2017-Decem, pp. 62–67, 2017.
[18] H. N. Viet et al., “Implementation of GPS Signal Simulation for Drone Security Using Matlab/Simulink,” in 2017 IEEE XXIV International Conference on Electronics, Electrical Engineering and Computing (INTERCON), 2017, pp. 1–4.
[19] M. O. Aqel et al., “Review of Visual Odometry: Types, Approaches, Challenges, and Applications,” SpringerPlus, vol. 5, no. 1, 2016.
[20] D. Liu et al., “A Low-Cost Method of Improving the GNSS/SINS Integrated Navigation System Using Multiple Receivers,” Electronics, vol. 9, no. 7, 2020. [Online]. Available: https://www.mdpi.com/2079-9292/9/7/1079
[21] K. Yousif, A. Bab-Hadiashar, and R. Hoseinnezhad, “An Overview to Visual Odometry and Visual SLAM: Applications to Mobile Robotics,” Intelligent Industrial Systems, vol. 1, no. 4, pp. 289–311, 2015.
[22] R. Azzam et al., “Feature-based Visual Simultaneous Localization and Mapping: A Survey,” SN Applied Sciences, vol. 2, no. 2, pp. 1–24, 2020. [Online]. Available: https://doi.org/10.1007/s42452-020-2001-3
[23] F. Fraundorfer and D. Scaramuzza, “Visual Odometry: Part II: Matching, Robustness, Optimization, and Applications,” IEEE Robotics Automation Magazine, vol. 19, no. 2, pp. 78–90, 2012.
[24] G. Huang, “Visual-Inertial Navigation: A Concise Review,” in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 9572–9582.
[25] J. Gui et al., “A Review of Visual Inertial Odometry from Filtering and Optimisation Perspectives,” Advanced Robotics, vol. 29, no. 20, pp. 1289–1301, 2015.
[26] M. Shan et al., “A Brief Survey of Visual Odometry for Micro Aerial Vehicles,” IECON Proceedings (Industrial Electronics Conference), pp. 6049–6054, 2016.
[27] D. Scaramuzza and F. Fraundorfer, “Visual Odometry [Tutorial],” IEEE Robotics & Automation Magazine, vol. 18, no. 4, pp. 80–92, 2011.
[28] H. P. Moravec, “Towards Automatic Visual Obstacle Avoidance,” in Proceedings of the 5th International Joint Conference on Artificial Intelligence - Volume 2, ser. IJCAI’77. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1977, p. 584.
[29] C. G. Harris and J. Pike, “3D Positional Integration from Image Sequences,” Image and Vision Computing, vol. 6, no. 2, pp. 87–90, 1987.
[30] J. F. Hoelscher et al., “Bundle Adjustment —A Modern Synthesis Bill,” Conference Record of the IEEE Photovoltaic Specialists Conference, vol. 34099, pp. 943–946, 2000.
[31] R. Munguia and A. Grau, “Monocular SLAM for Visual Odometry,” Parallax, 2007.
[32] L. Kneip, D. Scaramuzza, and R. Siegwart, “A Novel Parametrization of the Perspective-Three-Point Problem for a Direct Computation of Absolute Camera Position and Orientation,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2969–2976, 2011.
[33] H. Longuet-Higgins, “A Computer Algorithm for Reconstructing a Scene from Two Projections,” pp. 61–62, 1981.
[34] D. Nistér, “An Efficient Solution to the Five-point Relative Pose Problem,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 756–770, 2004.
[35] D. Nistér, O. Naroditsky, and J. Bergen, “Visual Odometry for Ground Vehicle Applications,” Journal of Field Robotics, vol. 23, no. 1, pp. 3–20, 2006.
[36] R. Gonzalez et al., “Combined Visual Odometry and Visual Compass for Off-road Mobile Robots Localization,” Robotica, vol. 30, no. 6, pp. 865–878, 2012.
[37] D. Scaramuzza and R. Siegwart, “Appearance-guided Monocular Omnidirectional Visual Odometry for Outdoor Ground Vehicles,” IEEE Transactions on Robotics, vol. 24, no. 5, pp. 1015–1026, 2008.
[38] T. A. Ciarfuglia et al., “Evaluation of Non-geometric Methods for Visual Odometry,” Robotics and Autonomous Systems, vol. 62, no. 12, pp. 1717–1730, 2014. [Online]. Available: http://dx.doi.org/10.1016/j.robot.2014.08.001
[39] V. Guizilini and F. Ramos, “Visual Odometry Learning for Unmanned Aerial Vehicles,” Proceedings - IEEE International Conference on Robotics and Automation, no. November 2016, pp. 6213–6220, 2011.
[40] L. Frédéric, “The Visual Compass: Performance and Limitations of an Appearance-Based Method,” Journal of Field Robotics, vol. 33, no. 1, pp. 1–17, 2006. [Online]. Available: http://onlinelibrary.wiley.com/doi/10.1002/rob.21514/abstract
[41] N. Nourani-Vatani, J. Roberts, and M. V. Srinivasan, “Practical Visual Odometry for Car-like Vehicles,” Proceedings - IEEE International Conference on Robotics and Automation, pp. 3551–3557, 2009.
[42] Y. Yu, C. Pradalier, and G. Zong, “Appearance-based Monocular Visual Odometry for Ground Vehicles,” IEEE/ASME International Conference on Advanced Intelligent Mechatronics, AIM, pp. 862–867, 2011.
[43] M. O. Aqel et al., “Adaptive-search Template Matching Technique Based on Vehicle Acceleration for Monocular Visual Odometry System,” IEEJ Transactions on Electrical and Electronic Engineering, vol. 11, no. 6, pp. 739–752, 2016.
[44] A. I. Comport, E. Malis, and P. Rives, “Accurate Quadrifocal Tracking for Robust 3D Visual Odometry,” Proceedings - IEEE International Conference on Robotics and Automation, no. April, pp. 40–45, 2007.
[45] A. I. Comport, E. Malis, and P. Rives, “Real-time Quadrifocal Visual Odometry,” The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 245–266, 2010.
[46] S. Lovegrove, A. J. Davison, and J. Ibañez-Guzmán, “Accurate Visual Odometry from a Rear Parking Camera,” IEEE Intelligent Vehicles Symposium, Proceedings, pp. 788–793, 2011.
[47] D. Caruso, J. Engel, and D. Cremers, “Large-scale Direct SLAM for Omnidirectional Cameras,” IEEE International Conference on Intelligent Robots and Systems, vol. 2015-Decem, pp. 141–148, 2015.
[48] J. Engel, V. Koltun, and D. Cremers, “Direct Sparse Odometry,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 611–625, 2017.
[49] D. Valiente García et al., “Visual Odometry through Appearance- and Feature-Based Method with Omnidirectional Images,” Journal of Robotics, vol. 2012, pp. 1–13, 2012.
[50] T. Brox et al., “High Accuracy Optical Flow Estimation based on a Theory for Warping,” in European Conference on Computer Vision. Springer, 2004, pp. 25–36.
[51] A. Bruhn and J. Weickert, “Towards Ultimate Motion Estimation: Combining Highest Accuracy with Real-time Performance,” Proceedings of the IEEE International Conference on Computer Vision, vol. I, pp. 749–755, 2005.
[52] Y. H. Kim, A. M. Martínez, and A. C. Kak, “Robust Motion Estimation Under Varying Illumination,” Image and Vision Computing, vol. 23, no. 4, pp. 365–375, 2005.
[53] M. J. Black and P. Anandan, “The Robust Estimation of Multiple Motions: Parametric and Piecewise-smooth Flow Fields,” Computer Vision and Image Understanding, vol. 63, no. 1, pp. 75–104, 1996.
[54] E. J. Corey and W.-g. Su, “Relaxing the Brightness Constancy Assumption in Computing Optical Flow,” Tetrahedron Letters, vol. 28, no. 44, pp. 5241–5244, 1987.
[55] J. Campbell et al., “A Robust Visual Odometry and Precipice Detection System using Consumer-grade Monocular Vision,” Proceedings - IEEE International Conference on Robotics and Automation, vol. 2005, no. April, pp. 3421–3427, 2005.
[56] A. M. Hyslop and J. S. Humbert, “Autonomous Navigation in Three-dimensional Urban Environments using Wide-field Integration of Optic Flow,” Journal of Guidance, Control, and Dynamics, vol. 33, no. 1, pp. 147–159, 2010.
[57] V. Grabe, H. H. Bülthoff, and P. R. Giordano, “On-board Velocity Estimation and Closed-loop Control of a Quadrotor UAV Based on Optical Flow,” Proceedings - IEEE International Conference on Robotics and Automation, pp. 491–497, 2012.
[58] V. Grabe, H. H. Bulthoff, and P. Robuffo Giordano, “Robust Optical-flow Based Self-motion Estimation for a Quadrotor UAV,” IEEE International Conference on Intelligent Robots and Systems, pp. 2153–2159, 2012.
[59] T. Low and G. Wyeth, “Obstacle Detection Using Optical Flow,” Proceedings of the 2005 Australasian Conference on Robotics and Automation, ACRA 2005, 2005.
[60] C. Kerl, J. Sturm, and D. Cremers, “Robust Odometry Estimation for RGB-D Cameras,” Revista Gestão, Inovação e Tecnologias, vol. 3, no. 5, pp. 427–436, 2014.
[61] I. Dryanovski, R. G. Valenti, and Jizhong Xiao, “Fast Visual Odometry and Mapping from RGB-D Data,” in 2013 IEEE International Conference on Robotics and Automation, 2013, pp. 2305–2310.
[62] S. Li and D. Lee, “Fast Visual Odometry Using Intensity-Assisted Iterative Closest Point,” IEEE Robotics and Automation Letters, vol. 1, no. 2, pp. 992–999, 2016.
[63] A. De La Escalera et al., “Stereo Visual Odometry in Urban Environments based on Detecting Ground Features,” Robotics and Autonomous Systems, vol. 80, pp. 1–10, 2016.
[64] O. Saurer et al., “Homography Based Egomotion Estimation with a Common Direction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 2, pp. 327–341, 2017.
[65] B. Guan et al., “Visual Odometry Using a Homography Formulation with Decoupled Rotation and Translation Estimation Using Minimal Solutions,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 2320–2327.
[66] J. Shi and C. Tomasi, “Good Features to Track,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 1994, pp. 593–600.
[67] J. Matas et al., “Robust Wide-baseline Stereo from Maximally Stable Extremal Regions,” Image and Vision Computing, vol. 22, no. 10 SPEC. ISS., pp. 761–767, 2004.
[68] T. Lindeberg, “Feature Detection with Automatic Scale Selection,” International Journal of Computer Vision, vol. 30, no. 2, pp. 79–116, 1998.
[69] D. G. Lowe, “Distinctive Image Features from Scale-invariant Keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[70] E. Mair et al., “Adaptive and Generic Corner Detection Based on the Accelerated Segment Test,” in European Conference on Computer Vision. Springer, 2010, pp. 183–196.
[71] M. Calonder et al., “BRIEF: Computing a Local Binary Descriptor Very Fast,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1281–1298, 2012.
[72] H. Bay et al., “Speeded-Up Robust Features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
[73] D. G. Lowe, “Object Recognition from Local Scale-invariant Features,” in Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, 1999, pp. 1150–1157.
[74] E. Rublee et al., “ORB: An Efficient Alternative to SIFT or SURF,” Proceedings of the IEEE International Conference on Computer Vision, pp. 2564–2571, 2011.
[75] S. Leutenegger, M. Chli, and R. Y. Siegwart, “BRISK: Binary Robust Invariant Scalable Keypoints,” in 2011 International Conference on Computer Vision (ICCV), 2011, pp. 2548–2555.
[76] E. Salahat and M. Qasaimeh, “Recent Advances in Features Extraction and Description Algorithms: A Comprehensive Survey,” in 2017 IEEE International Conference on Industrial Technology (ICIT), 2017, pp. 1059–1063.
[77] L. P. Morency and R. Gupta, “Robust Real-time Egomotion from Stereo Images,” IEEE International Conference on Image Processing, vol. 2, pp. 719–722, 2003.
[78] D. Scaramuzza, A. Martinelli, and R. Siegwart, “A Flexible Technique for Accurate Omnidirectional Camera Calibration and Structure from Motion,” Fourth IEEE International Conference on Computer Vision Systems (ICVS’06), pp. 45–45, 2006.
[79] B. Kitt, F. Moosmann, and C. Stiller, “Moving on to Dynamic Environments: Visual Odometry Using Feature Classification,” IEEE/RSJ 2010 International Conference on Intelligent Robots and Systems, IROS 2010 - Conference Proceedings, pp. 5551–5556, 2010.
[80] B. Kitt, A. Geiger, and H. Lategahn, “Visual Odometry Based on Stereo Image Sequences with RANSAC-based Outlier Rejection Scheme,” IEEE Intelligent Vehicles Symposium, Proceedings, pp. 486–492, 2010.
[81] I. Cvišić and I. Petrović, “Stereo Odometry Based on Careful Feature Selection and Tracking,” 2015 European Conference on Mobile Robots, ECMR 2015 - Proceedings, pp. 0–5, 2015.
[82] L. De-Maeztu et al., “A Temporally Consistent Grid-based Visual Odometry Framework for Multi-core Architectures,” Journal of Real-Time Image Processing, vol. 10, no. 4, pp. 759–769, 2015.
[83] H. Badino, A. Yamamoto, and T. Kanade, “Visual Odometry by Multi-frame Feature Integration,” Proceedings of the IEEE International Conference on Computer Vision, pp. 222–229, 2013.
[84] I. Krešo and S. Šegvić, “Improving the Egomotion Estimation by Correcting the Calibration Bias,” VISAPP 2015 - 10th International Conference on Computer Vision Theory and Applications; VISIGRAPP, Proceedings, vol. 3, pp. 347–356, 2015.
[85] H. Rebecq et al., “EVO: A Geometric Approach to Event-Based 6-DOF Parallel Tracking and Mapping in Real Time,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 593–600, 2017.
[86] C. Forster, M. Pizzoli, and D. Scaramuzza, “SVO: Fast Semi-direct Monocular Visual Odometry,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 15–22.
[87] H. Silva, A. Bernardino, and E. Silva, “Probabilistic Egomotion for Stereo Visual Odometry,” Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 77, no. 2, pp. 265–280, 2014.
[88] J. Feng et al., “A Fusion Algorithm of Visual Odometry Based on Feature-based Method and Direct Method,” in 2017 Chinese Automation Congress (CAC). IEEE, 2017, pp. 1854–1859.
[89] H. Alismail et al., “Direct Visual Odometry in Low Light Using Binary Descriptors,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 444–451, 2017.
[90] R. Roberts et al., “Memory-based Learning for Visual Odometry,” Proceedings - IEEE International Conference on Robotics and Automation, pp. 47–52, 2008.
[91] R. Roberts, C. Potthast, and F. Dellaert, “Learning General Optical Flow Subspaces for Egomotion Estimation and Detection of Motion Anomalies,” 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009, pp. 57–64, 2009.
[92] G. Vitor and R. Fabio, “Learning Visual Odometry for Unmanned Aerial Vehicles,” IEEE International Conference on Robotics and Automation, vol. 2011-Janua, pp. 316–320, 2011.
[93] V. Guizilini and F. Ramos, “Semi-parametric Models for Visual Odometry,” Proceedings - IEEE International Conference on Robotics and Automation, pp. 3482–3489, 2012.
[94] V. Guizilini and F. Ramos, “Semi-parametric Learning for Visual Odometry,” The International Journal of Robotics Research, vol. 32, no. 5, pp. 526–546, 2013. [Online]. Available: https://doi.org/10.1177/0278364912472245
[95] P. Gemeiner, P. Einramhof, and M. Vincze, “Simultaneous Motion and Structure Estimation by Fusion of Inertial and Vision Data,” International Journal of Robotics Research, vol. 26, no. 6, pp. 591–605, 2007.
[96] L. Porzi et al., “Visual-inertial Tracking on Android for Augmented Reality Applications,” 2012 IEEE Workshop on Environmental, Energy, and Structural Monitoring Systems, EESMS 2012 - Proceedings, pp. 35–41, 2012.
[97] K. Konda and R. Memisevic, “Unsupervised Learning of Depth and Motion,” arXiv preprint arXiv:1312.3429, 2013.
[98] V. Peretroukhin, L. Clement, and J. Kelly, “Inferring Sun Direction to Improve Visual Odometry: A Deep Learning Approach,” The International Journal of Robotics Research, vol. 37, no. 9, pp. 996–1016, 2018.
[99] V. Mohanty et al., “DeepVO: A Deep Learning Approach for Monocular Visual Odometry,” arXiv preprint arXiv:1611.06069, 2016.
[100] L. Clement and J. Kelly, “How to Train a CAT: Learning Canonical Appearance Transformations for Direct Visual Localization under Illumination Change,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 2447–2454, 2018.
[101] J. Jiao et al., “MagicVO: An End-to-End Hybrid CNN and Bi-LSTM Method for Monocular Visual Odometry,” IEEE Access, vol. 7, pp. 94118–94127, 2019.
[102] Q. Liu et al., “Using Unsupervised Deep Learning Technique for Monocular Visual Odometry,” IEEE Access, vol. 7, pp. 18076–18088, 2019.
[103] H. Wang et al., “Monocular VO Based on Deep Siamese Convolutional Neural Network,” Complexity, vol. 2020, 2020.
[104] A. de la Escalera et al., “Stereo Visual Odometry in Urban Environments Based on Detecting Ground Features,” Robotics and Autonomous Systems, vol. 80, pp. 1–10, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0921889015303183
[105] W. Zhou, H. Fu, and X. An, “A Classification-Based Visual Odometry Approach,” in 2016 8th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), vol. 02, 2016, pp. 85–89.
[106] P. V. K. Borges and S. Vidas, “Practical Infrared Visual Odometry,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 8, pp. 2205–2213, 2016.
[107] I. Vanhamel, I. Pratikakis, and H. Sahli, “Multiscale Gradient Watersheds of Color Images,” IEEE Transactions on Image Processing, vol. 12, no. 6, pp. 617–626, 2003.
[108] B. Kueng et al., “Low-latency Visual Odometry Using Event-based Feature Tracks,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 16–23.
[109] P. Liu et al., “Direct Visual Odometry for a Fisheye-stereo Camera,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 1746–1752.
[110] L. Heng and B. Choi, “Semi-direct Visual Odometry for a Fisheye-stereo Camera,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 4077–4084.
[111] R. Kottath et al., “Inertia Constrained Visual Odometry for Navigational Applications,” in 2017 Fourth International Conference on Image Information Processing (ICIIP), 2017, pp. 1–4.
[112] Y. Almalioglu et al., “GANVO: Unsupervised Deep Monocular Visual Odometry and Depth Estimation with Generative Adversarial Networks,” in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 5474–5480.
[113] M. Menze and A. Geiger, “Object Scene Flow for Autonomous Vehicles,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3061–3070.
[114] T. Zhou et al., “Unsupervised Learning of Depth and Ego-motion from Video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1851–1858.
[115] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised Monocular Depth Estimation with Left-Right Consistency,” in CVPR, 2017.
[116] Z. Yin and J. Shi, “GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 1983–1992.
[117] R. Mahjourian, M. Wicke, and A. Angelova, “Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints,” in CVPR, 2018.
[118] A. Valada, N. Radwan, and W. Burgard, “Deep Auxiliary Learning for Visual Localization and Odometry,” in International Conference on Robotics and Automation (ICRA 2018). IEEE, 2018.
[119] J. Shotton et al., “Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2930–2937.
[120] A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization,” in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2938–2946.
[121] S. Wang et al., “DeepVO: Towards End-to-end Visual Odometry with Deep Recurrent Convolutional Neural Networks,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 2043–2050.
[122] I. Melekhov et al., “Relative Camera Pose Estimation Using Convolutional Neural Networks,” 2017.
[123] A. Nicolai et al., “Deep Learning for Laser Based Odometry Estimation,” in RSS Workshop Limits and Potentials of Deep Learning in Robotics, vol. 184, 2016.
[124] A. Geiger, P. Lenz, and R. Urtasun, “Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3354–3361.
[125] A. Geiger, J. Ziegler, and C. Stiller, “StereoScan: Dense 3D Reconstruction in Real-time,” in 2011 IEEE Intelligent Vehicles Symposium (IV), 2011, pp. 963–968.
[126] C. Jaramillo et al., “Visual Odometry with a Single-camera Stereo Omnidirectional System,” Machine Vision and Applications, vol. 30, pp. 1145–1155, 2019.
[127] L. Wang et al., “Estimating Pose of Omnidirectional Camera by Convolutional Neural Network,” in 2019 IEEE 8th Global Conference on Consumer Electronics (GCCE), 2019, pp. 201–202.
[128] W. Dai et al., “Multi-Spectral Visual Odometry without Explicit Stereo Matching,” 2019 International Conference on 3D Vision (3DV), pp. 443–452, 2019.
[129] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras,” IEEE Transactions on Robotics, vol. 33, pp. 1255–1262, 2017.
[130] S. Li et al., “Self-Supervised Deep Visual Odometry With Online Adaptation,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6338–6347, 2020.
[131] H. Zhan et al., “Visual Odometry Revisited: What Should Be Learnt?” 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 4203–4210, 2020.
[132] G. Zhai et al., “PoseConvGRU: A Monocular Approach for Visual Ego-Motion Estimation by Learning,” Pattern Recognition, vol. 102, no. C, Jun. 2020. [Online]. Available: https://doi.org/10.1016/j.patcog.2019.107187
[133] J.-L. Blanco-Claraco, F.-Á. Moreno-Dueñas, and J. González-Jiménez, “The Málaga Urban Dataset: High-rate Stereo and LiDAR in a Realistic Urban Scenario,” The International Journal of Robotics Research, vol. 33, no. 2, pp. 207–214, 2014.
[134] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: A Versatile and Accurate Monocular SLAM System,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[135] A. Geiger, J. Ziegler, and C. Stiller, “StereoScan: Dense 3D Reconstruction in Real-time,” in 2011 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2011, pp. 963–968.
[136] M. R. U. Saputra et al., “Learning Monocular Visual Odometry through Geometry-Aware Curriculum Learning,” 2019 International Conference on Robotics and Automation (ICRA), pp. 3549–3555, 2019.
[137] J. Huang et al., “ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2165–2174.
[138] K. M. Judd and J. D. Gammell, “The Oxford Multimotion Dataset: Multiple SE(3) Motions With Ground Truth,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 800–807, 2019.
[139] I. A. Barsan et al., “Robust Dense Mapping for Large-Scale Dynamic Environments,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 7510–7517.
[140] P. Li, T. Qin et al., “Stereo Vision-based Semantic 3D Object and Ego-motion Tracking for Autonomous Driving,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 646–661.
[141] J. Huang et al., “ClusterSLAM: A SLAM Backend for Simultaneous Rigid Body Clustering and Motion Estimation,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 5874–5883.
[142] J.-C. Piao and S.-D. Kim, “Adaptive Monocular Visual–Inertial SLAM for Real-Time Augmented Reality Applications in Mobile Devices,” Sensors, vol. 17, no. 11, p. 2567, 2017.
[143] A. I. Mourikis and S. I. Roumeliotis, “A Multi-State Constraint Kalman Filter for Vision-aided Inertial Navigation,” in Proceedings 2007 IEEE International Conference on Robotics and Automation, 2007, pp. 3565–3572.
[144] M. Bloesch et al., “Robust Visual Inertial Odometry Using a Direct EKF-based Approach,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015, pp. 298–304.
[145] S. Leutenegger et al., “Keyframe-based Visual–inertial Odometry Using Nonlinear Optimization,” The International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015. [Online]. Available: https://doi.org/10.1177/0278364914554813
[146] V. Usenko et al., “Direct Visual-inertial Odometry with Stereo Cameras,” in 2016 IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 1885–1892.
[147] M. Burri et al., “The EuRoC Micro Aerial Vehicle Datasets,” The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.
[148] C. Forster et al., “On-Manifold Preintegration for Real-Time Visual-Inertial Odometry,” IEEE Transactions on Robotics, vol. 33, no. 1, pp. 1–21, 2017.
[149] M. Schwaab et al., “Tightly Coupled Fusion of Direct Stereo Visual Odometry and Inertial Sensor Measurements Using an Iterated Information Filter,” in 2017 DGON Inertial Sensors and Systems (ISS), 2017, pp. 1–20.
YUSRA ALKENDI received the M.Sc. degree in mechanical engineering from Khalifa University, Abu Dhabi, United Arab Emirates, in 2019, where she is currently pursuing the Ph.D. degree in aerospace engineering with a focus on robotics with the Khalifa University Center for Autonomous Robotics Systems (KUCARS). Her current research is focused on the application of artificial intelligence (AI) in the fields of dynamic vision for perception and navigation.

LAKMAL SENEVIRATNE received B.Sc.(Eng.) and Ph.D. degrees in Mechanical Engineering from King’s College London (KCL), London, U.K. He is currently a Professor in Mechanical Engineering and the Director of the Robotic Institute at Khalifa University. He is also an Emeritus Professor at King’s College London. His research interests are focused on robotics and autonomous systems. He has published over 300 refereed research papers related to these topics.