A Survey On Non-Filter-Based Monocular Visual SLAM
Georges Younes · Daniel Asmar · Elie Shammas
Abstract Extensive research in the field of Visual SLAM for the past fifteen years has yielded workable systems that found their way into various applications, such as robotics and augmented reality. Although filter-based (e.g., Kalman Filter, Particle Filter) Visual SLAM systems were common at some time, non-filter based systems (i.e., akin to SfM solutions), which are more efficient, are becoming the de facto methodology for building a Visual SLAM system. This paper presents a survey that covers the various non-filter based Visual SLAM systems in the literature, detailing the various components of their implementation, while critically assessing the specific strategies made by each proposed system in implementing its components.

Keywords Visual SLAM · monocular · non-filter based

G. Younes · D. Asmar · E. Shammas
Mechanical Engineering Department, American University of Beirut, Beirut, Lebanon
E-mail: [email protected], [email protected], [email protected]

1 Introduction

Localization solutions using a single camera have been gaining considerable popularity in the past fifteen years. Cameras are ubiquitously found in hand-held devices such as phones and tablets and, with the recent increase in augmented reality applications, the camera is the natural sensor of choice to localize the user while projecting virtual scenes to him/her from the correct viewpoint. With their low cost and small size, cameras are frequently used in localization applications where weight and power consumption are deciding factors, such as for Unmanned Aerial Vehicles (UAVs). Even though there are still many challenges facing camera-based localization, it is expected that such solutions will eventually offer significant advantages over other types of localization techniques.

Putting aside localization solutions relying on tracking of markers or objects, camera-based localization can be broadly categorized into two approaches. In what is known as Image-Based Localization (IBL), the scene is processed beforehand to yield its 3D structure, scene images, and corresponding camera viewpoints. The localization problem then reduces to that of matching new query images to those in the database and choosing the camera position that corresponds to the best-matched image. In the second technique, no prior information of the scene is given; rather, map building and localization are done concurrently. Here we can incrementally estimate the camera pose, a technique known as Visual Odometry (VO) (Scaramuzza and Fraundorfer, 2011); or, to reduce the considerable drift that is common in VO, we maintain a map and pose estimate of the camera throughout the journey. This is commonly referred to as Visual Simultaneous Localization and Mapping (Visual SLAM). Although all of the above camera-based techniques are equally important, the subject of this paper is related to the topic of Visual SLAM.

Although a number of surveys for the general SLAM problem exist in the literature, only a few exclusively handle Visual SLAM. In 2012, Fuentes-Pacheco et al (2012) published a general survey on Visual SLAM but did not delve into the details of the solutions put forward by different people in the community. Also, subsequent to the date their paper was published, almost thirteen new systems have been proposed, with many of them introducing significant contributions to Visual SLAM.
Fig. 1 Data links between the components of (a) filter-based and (b) non-filter-based Visual SLAM systems, across camera poses T0, T1, T2, T3, ..., Tn.
In 2015, Yousif et al (2015) also published a general survey on Visual SLAM that includes filter-based, non-filter based, and RGB-D systems. While filter-based Visual SLAM solutions were common before 2010, most solutions thereafter designed their systems around a non-filter-based architecture. The survey of Yousif et al. describes a generic Visual SLAM but lacks focus on the details and problems of monocular non-filter based systems. With the above motivations in mind, the purpose of this paper is to survey the state-of-the-art in non-filter-based monocular Visual SLAM systems. The description of the open-source systems will go beyond the information provided in their papers, and also rely on the understanding and experience we gained modifying and applying their code in real settings. Unfortunately, for the closed-source systems we will have to suffice with the information provided in their papers, as well as the insight acquired running their executables (for those who provide them).

A survey such as the one proposed in this paper is a valuable tool for any user or researcher in camera-based localization. With the many new proposed systems coming out every day, the information is daunting to the novice, and one is often perplexed as to which algorithm he/she should use. Furthermore, this paper should help researchers quickly pinpoint the shortcomings of each of the proposed techniques and accordingly help them focus their effort on alleviating these weaknesses.

The remainder of the paper is structured as follows. Section 2 reviews the historical evolution of Visual SLAM systems, from the time of MonoSLAM (Davison, 2003) to this date. Section 3 describes the fundamental building blocks of a Visual SLAM system and critically evaluates the differences in the proposed open-source solutions; namely in the initialization, measurement and data association, pose estimation, map generation, map maintenance, failure recovery, and loop closure. Section 4 summarizes closed-source non-filter based Visual SLAM systems, and finally Section 5 concludes the paper.

2 Overview of contributions

Visual SLAM solutions are either filter-based (e.g., Kalman filter, Particle filter) or non-filter-based (i.e., posing it as an optimization problem). Figure 1a shows the data links between different components of filter-type systems; the camera pose Tn and the entire state of all landmarks in the map are tightly joined and need to be updated at every processed frame. In contrast, in non-filter-based systems (shown in Fig. 1b), the data connections between different components allow the pose estimate of the camera at Tn to be estimated using a subset of the entire map, without the need to update the map's data at every processed frame. As a consequence of these differences, Strasdat et al. in 2010 proved that non-filter based methods outperform filter-based ones. It is therefore not surprising that since then, most new releases of Visual SLAM systems are non-filter-based (see Table 1). In this paper we will focus on analyzing only non-filter-based techniques; for filter-based ones we will suffice with listing them.

In 2007, Parallel Tracking and Mapping (PTAM) (Klein and Murray, 2007) was released, and since then many variations and modifications of it have been proposed, such as in Castle et al (2008), Weiss et al (2013), and Klein and Murray (2008). PTAM was the first algorithm to successfully separate tracking and mapping into two parallel computation threads that run simultaneously and share information whenever necessary. This separation made the adaptation of off-line Structure from Motion (SfM) methods possible within PTAM in real time. Its ideas were revolutionary in the monocular visual SLAM community, and the notion of separation between tracking and mapping became the standard backbone of almost all visual SLAM algorithms thereafter.
Table 1 List of different visual SLAM systems. Non-filter-based approaches are highlighted in a gray color.
In 2014, SVO (Forster et al, 2014) was published as an open-source implementation of a hybrid system that employs both direct and indirect methods in its proposed solution for solving the Visual SLAM task. Unlike PTAM, SVO requires a high frame rate camera. SVO was designed with the concern of operating on high-end platforms as well as computationally-limited ones, such as the on-board hardware of a generic Micro Aerial Vehicle (MAV). To achieve such resilience, SVO offers two default configurations, one optimized for speed and the other for accuracy.

Also in 2014, Large Scale Direct monocular SLAM (LSD SLAM) (Engel et al, 2014) was released as an open-source adaptation of the visual odometry method proposed in Engel et al (2013). LSD SLAM employs an efficient probabilistic direct approach to estimate semi-dense maps, to be used with an image alignment scheme to solve the SLAM task. In contrast to other methods that use bundle adjustment, LSD SLAM employs a pose graph optimization over Sim(3) as in Kummerle et al (2011), which explicitly represents the scale in the system, allowing for scale drift correction and loop closure detection in real-time. A modified version of LSD SLAM was later showcased running on a mobile platform, and another adaptation of the system was presented in Engel et al (2015) for a stereo camera setup. LSD SLAM employs three parallel threads after initialization takes place: tracking, depth map estimation, and map optimization.

In late 2014, DT SLAM, short for Deferred Triangulation SLAM (Herrera et al, 2014), was released as an indirect method. Similar to other algorithms, it divides the Visual SLAM task into three parallel threads: tracking, mapping, and bundle adjustment. One of the main contributions of DT SLAM is its ability to estimate the camera pose from 2D and 3D features in a unified framework, together with a suggested bundle adjustment that incorporates both types of features. This gives DT SLAM robustness against pure rotational movements. Another characteristic of the system is its ability to handle multiple maps with undefined scales and merge them together once a sufficient number of 3D matches are established. In DT SLAM, no explicit initialization procedure is required since it is embedded in the tracking thread; furthermore, it is capable of performing multiple initializations whenever tracking is lost. Since initialization is done automatically whenever the system is lost, data can still be collected and camera tracking functions normally, albeit at a different scale. This ability to re-initialize local sub-maps reduces the need for re-localization procedures. Once a sufficient number of correspondences between keyframes residing in separate sub-maps are found, the sub-maps are fused into a single map with a uniform scale throughout.
In 2015, ORB SLAM (Mur-Artal et al, 2015) was released as an indirect Visual SLAM system. It divides the Visual SLAM problem into three parallel threads, one for tracking, one for mapping, and a third for map optimization. The main contributions of ORB SLAM are the usage of ORB features (Rublee et al, 2011) in real-time, a model-based initialization as suggested by Torr et al (1999), re-localization with invariance to viewpoint changes (Mur-Artal and Tardós, 2014), a place recognition module using bags of words to detect loops, and covisibility and Essential graph optimization.

In late 2015, DPPTAM, short for Dense Piecewise Parallel Tracking and Mapping (Concha and Civera, 2015), was released as a semi-dense direct method similar to LSD SLAM. A key contribution of DPPTAM's adaptation of LSD SLAM is an added third parallel thread that performs dense reconstructions using segmented super-pixels from indoor planar scenes.

3 Design of Visual SLAM systems

In an effort to better understand Visual SLAM state-of-the-art implementations, this section provides a look under the hood of the most successful and recent open-source non-filter-based Visual SLAM systems. More specifically, our discussion will be based on information extracted from PTAM, SVO, DT SLAM, LSD SLAM, ORB SLAM, and DPPTAM. A generic non-filter-based Visual SLAM system is concerned with eight main components (Fig. 2); namely (1) input data type, (2) data association, (3) initialization, (4) pose estimation, (5) map generation, (6) map maintenance, (7) failure recovery, and (8) loop closure.

Fig. 2 Eight components of a non-filter-based Visual SLAM system.

In the following sections, we will detail each of these components and critically assess how each Visual SLAM implementation addressed them. It is noteworthy to first mention that all discussed systems implicitly assume that the intrinsic parameters are known, based on an off-line calibration step.
3.1 Input data type
Vision SLAM methods are categorized as being direct, indirect, or a hybrid of both. Direct methods, also known as dense or semi-dense methods, exploit the information available at every pixel in the image (brightness values) to estimate the parameters that fully describe the camera pose. On the other hand, indirect methods were introduced to reduce the computational complexity of processing each pixel; this is achieved by using only salient image locations (called features) in the pose estimation calculations (see Fig. 3).

3.1.1 Direct methods

The basic underlying principle for all direct methods is known as the brightness consistency constraint and is best described as:

J(x, y) = I(x + u(x, y), y + v(x, y)),   (1)

where x and y are pixel coordinates, and u and v denote displacement functions of the pixel (x, y) between two images I and J of the same scene. Every pixel in the image provides one brightness constraint; however, it adds two unknowns (u and v), and hence the system becomes under-determined with n equations and 2n unknowns (where n is the number of pixels in the image). To render (1) solvable, Lucas & Kanade (Lucas and Kanade, 1981) suggested in 1981, in what they referred to as Forward Additive Image Alignment (FAIA), to replace all the individual pixel displacements u and v by a single general motion model, in which the number of parameters depends on the implied type of motion. FAIA iteratively minimizes the squared pixel-intensity difference between a template and an input image by changing the transformation parameters. Since that time, and to reduce computational complexity, other variants of FAIA were suggested, such as FCIA (Forward Compositional Image Alignment), ICIA (Inverse Compositional Image Alignment) and IAIA (Inverse Additive Image Alignment) (Baker and Matthews, 2004).
Direct methods exploit all information available in the image and are therefore more robust than indirect methods in regions with poor texture. Nevertheless, direct methods are susceptible to failure when scene illumination changes occur, as the minimization of the photometric error between two frames relies on the underlying assumption of the brightness consistency constraint (1). A second disadvantage is that the calculation of the photometric error at every pixel is computationally intensive; therefore, real-time Visual SLAM applications of direct methods, until recently, were not considered feasible. With the recent advancements in parallelized processing, adaptations of direct methods were integrated within a Visual SLAM context (Concha and Civera, 2015; Engel et al, 2015; Forster et al, 2014).
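As a concrete illustration of the brightness consistency constraint (1) and the FAIA formulation, the following Python sketch performs one forward-additive Lucas-Kanade iteration under a pure 2D-translation motion model. It is our own minimal example (the function name and the nearest-neighbor warp are ours), not code from any surveyed system:

import numpy as np

def faia_translation_step(I, J, p):
    # One forward-additive Lucas-Kanade (FAIA) iteration for a pure
    # 2D-translation motion model p = (tx, ty): linearize Eq. (1) and
    # take a single Gauss-Newton step on the photometric error.
    I = I.astype(np.float64)
    J = J.astype(np.float64)
    gy, gx = np.gradient(I)                        # image gradients
    H, W = I.shape
    xs = np.clip(np.arange(W) + int(round(p[0])), 0, W - 1)
    ys = np.clip(np.arange(H) + int(round(p[1])), 0, H - 1)
    I_w, gx_w, gy_w = (a[np.ix_(ys, xs)] for a in (I, gx, gy))
    r = (J - I_w).ravel()                          # per-pixel residuals
    A = np.stack([gx_w.ravel(), gy_w.ravel()], 1)  # Jacobian, n x 2
    dp, *_ = np.linalg.lstsq(A, r, rcond=None)     # Gauss-Newton step
    return np.asarray(p, dtype=float) + dp

Real systems wrap such a step in a coarse-to-fine pyramid and iterate until the update dp becomes negligible.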
3.1.2 Indirect methods

Indirect methods rely on features for matching. On one hand, features are expected to be distinctive and invariant to viewpoint and illumination changes, as well as resilient to blur and noise. On the other hand, it is desirable for feature extractors to be computationally efficient and fast. Unfortunately, such objectives are hard to achieve at the same time, and a trade-off between computational speed and feature quality is required.

The computer vision community has developed over decades of research many different feature extractors and descriptors, each exhibiting varying performance in terms of rotation and scale invariance, as well as speed. The selection of an appropriate feature detector depends on the platform's computational power, the environment in which the Visual SLAM algorithm is due to operate, as well as its expected frame rate. Feature detector examples include the Hessian corner detector (Beaudet, 1978), Harris detector (Harris and Stephens, 1988), Shi-Tomasi corners (Shi and Tomasi, 1994), Laplacian of Gaussian detector (Lindeberg, 1998), MSER (Matas et al, 2002), Difference of Gaussian (Lowe, 2004) and the accelerated segment test family of detectors (FAST, AGAST, OAST) (Mair et al, 2010).

To minimize computational requirements, most indirect systems use FAST (Rosten and Drummond, 2006) as a feature extractor, coupled with a feature descriptor to be able to perform data association. Feature descriptors include, but are not limited to, BRIEF (Calonder et al, 2012), BRISK (Leutenegger et al, 2011), SURF (Bay et al, 2008), SIFT (Lowe, 1999), HoG (Dalal and Triggs, 2005), FREAK (Alahi et al, 2012), ORB (Rublee et al, 2011) and a low-level local patch of pixels. Further information regarding feature extractors and descriptors is outside the scope of this work, but the reader can refer to Hartmann et al (2013), Moreels and Perona (2007), Rey-Otero et al (2014), or Hietanen et al (2016) for the most recent comparisons.

3.1.3 Hybrid methods

Different from the direct and indirect methods, systems such as SVO are considered hybrids, which use a combination of direct methods to establish feature correspondences and indirect methods to refine the camera pose estimates.

Fig. 3 Data types used by a Visual SLAM system: (a) direct methods use all the information of the triangle to match to a query image; (b) indirect methods use the features of the triangle to match to the features of a query image.
Table 2 summarizes the data types used by the selected Visual SLAM systems. From the list of open-source indirect methods surveyed in this paper, PTAM, SVO and DT SLAM use FAST features (Rosten and Drummond, 2006), while ORB SLAM uses ORB features (Rublee et al, 2011).

Table 2 Method used by different Visual SLAM systems. Abbreviations used: indirect (i), direct (d), and hybrid (h)

        PTAM   SVO   DT SLAM   LSD SLAM   ORB SLAM   DPPTAM
Method  i      h     i         d          i          d

3.2 Data association

The matching score to use typically depends on the type of descriptors used: for the local patch of pixels descriptor, it is typical to use the sum of squared differences (SSD), or, to increase robustness against illumination changes, a Zero-Mean SSD score (ZMSSD) (Jérôme Martin, 1995). For higher-order feature descriptors such as ORB, SIFT, and SURF, the L1-norm, L2-norm, or Hamming distances may be used; however, establishing matches using these measures is computationally intensive and may degrade real-time operation if not carefully applied. For such a purpose, special implementations that sort and perform feature matching in KD trees or bags of words are usually employed. Examples include the works of Muja and Lowe (2009), and Galvez-López and Tardos (2012).
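For illustration, the two patch scores mentioned above can be written as follows; a minimal sketch of SSD and ZMSSD on equally sized patches (our own example, not code from any surveyed system):

import numpy as np

def ssd(a, b):
    # Sum of squared differences between two equally sized patches.
    d = a.astype(np.float64) - b.astype(np.float64)
    return float(np.sum(d * d))

def zmssd(a, b):
    # Zero-Mean SSD: subtracting each patch's mean intensity makes
    # the score invariant to additive brightness changes between views.
    a = a.astype(np.float64) - a.mean()
    b = b.astype(np.float64) - b.mean()
    return float(np.sum((a - b) ** 2))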
Each pyramid level has a different threshold for Shi-Tomasi score selection and non-maximum suppression, thereby giving control over the strength and the number of features to be tracked across the pyramid levels. 3D landmarks are then projected onto the new frame using a pose estimate prior and, in a similar manner to the 2D-2D methods, feature correspondences are established within a search window surrounding the projected landmark location. The descriptors used for feature matching of the 3D landmarks are usually extracted from the 2D image from which the 3D landmark was first observed; however, some systems propose to update this descriptor as the camera viewpoint observing it changes significantly or, in the case of a local patch of pixels, warp the patch to virtually account for viewpoint changes.

DT SLAM. In a similar scheme to PTAM, DT SLAM employs the same mechanism to establish 2D-3D feature matches.

3.3 Initialization

Monocular Visual SLAM systems require an initialization phase, during which both a map of 3D landmarks and the starting camera poses are generated. To do so, the same scene must be observed through at least two viewpoints separated by a baseline. Figure 5 represents the initialization that is required in any Visual SLAM system, where only associated data between two images is known, and both the initial camera pose and the scene's structure are unknown.

Fig. 5 Initialization required by any Visual SLAM system: given two cameras c1 and c2 observing the same scene, the relative pose (T, R) and the scene structure are initially unknown.
Fig. 6 Flowchart of a generic model-based initialization: Homography/Fundamental estimation and decomposition, followed by camera pose recovery and triangulation of initial 3D landmarks.
Different solutions to this problem were proposed by different people.

In early Visual SLAM systems such as MonoSLAM (Davison et al, 2007), system initialization required the camera to be placed at a known distance from a planar scene composed of four corners of a two-dimensional square, and SLAM was initialized with the distance separating the camera from the square keyed in by the operator.

PTAM. Figure 6 shows the flowchart of a generic model-based initialization procedure, such as the one employed in PTAM, SVO and ORB SLAM. To eliminate the obligation of a user's manual input of depth, PTAM's (Klein and Murray, 2007) initial release suggested the usage of the five-point algorithm (Nistér, 2004) to estimate and decompose a Fundamental matrix into an assumed non-planar initial scene. PTAM's initialization was later changed to the usage of a Homography (Faugeras and Lustman, 1988), where the scene is assumed to be composed of 2D planes. PTAM's initialization requires the user's input twice to capture the first two keyframes in the map; furthermore, it requires the user to perform, in between the first and the second keyframe, a slow, smooth and relatively significant translational motion parallel to the observed scene.

FAST features extracted from the first keyframe are tracked in a 2D-2D data association scheme in each incoming frame, until the user flags the insertion of the second keyframe. As the matching procedure takes place through the ZMSSD without warping the features, establishing correct matches is susceptible to both motion blur and significant appearance changes of the features caused by camera rotations; hence the strict requirements on the user's motion during the initialization.

To ensure a minimum of false matches, the features are searched for twice; once from the current frame to the previous frame, and a second time in the opposite direction. If the matches in both directions are not coherent, the feature is discarded. Since PTAM's initialization employs a Homography estimation, the observed scene during the initialization is assumed to be planar. Once the second keyframe is successfully incorporated into the map, a MLESAC (Torr and Zisserman, 2000) loop uses the established matches to generate a Homography relating both keyframes and uses inliers to refine it, before decomposing it (as described in Faugeras and Lustman (1988)) into eight possible solutions. The correct pair of camera poses is chosen such that all triangulated 3D points do not generate unreal configurations (negative depths in both frames).

The generated initial map is scaled such that the estimated translation between the first two keyframes corresponds to 0.1 units, before a structure-only BA (optimizing only the 3D poses of the landmarks) step takes place. The mean of the 3D landmarks is selected to serve as the world coordinate frame, while the positive z-direction is chosen such that the camera poses reside along its positive side.

PTAM's initialization procedure is brittle and remains tricky to perform, especially for inexperienced users. Furthermore, it is subject to degeneracies when the planarity of the initial scene's assumption is violated or when the user's motion is inappropriate, crashing the system without means of detecting such degeneracies.

SVO. Similarly, Forster et al (2014) adopted in SVO a Homography for initialization; however, SVO requires no user input, and the algorithm uses at startup the first acquired keyframe; it extracts FAST features and tracks them with an implementation of KLT (Tomasi and Kanade, 1991) (a variant of direct methods) across incoming frames. To avoid the need for a second input by the user, SVO monitors the median of the baseline of the features tracked between the first keyframe and the current frame; whenever this value reaches a certain threshold, the algorithm assumes enough parallax has been achieved and signals the Homography estimation to start. The Homography is then decomposed; the correct camera poses are then selected, and the landmarks corresponding to inlier matches are triangulated and used to estimate an initial scene depth. Bundle Adjustment takes place for the two frames and all their associated landmarks, before the second frame is used as a second keyframe and passed to the map management thread.

As is the case in PTAM, the initialization of SVO requires the same type of motion and is prone to sudden movements as well as to non-planar scenes; furthermore, monitoring ...
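The homography-based initialization pipeline described above (estimate with a robust RANSAC/MLESAC loop, decompose, keep the cheirality-consistent solution) can be sketched with OpenCV as follows. This is an illustrative sketch, not PTAM's or SVO's code; the threshold is arbitrary, and OpenCV's decomposition returns up to four candidate solutions rather than the eight of Faugeras and Lustman (1988):

import cv2
import numpy as np

def triangulate(p1, p2, K, R, t):
    # Linear triangulation with the first camera at the origin.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t.reshape(3, 1)])
    Xh = cv2.triangulatePoints(P1, P2, p1.T.astype(np.float64),
                               p2.T.astype(np.float64))
    return (Xh[:3] / Xh[3]).T                      # N x 3 points

def init_from_homography(p_ref, p_cur, K):
    # Robustly estimate H from 2D-2D matches, then decompose it into
    # candidate (R, t) poses and keep the one with the most points
    # having positive depth in both views (cheirality check).
    H, mask = cv2.findHomography(p_ref, p_cur, cv2.RANSAC, 3.0)
    n_sol, Rs, ts, normals = cv2.decomposeHomographyMat(H, K)
    best, best_good = None, -1
    for R, t in zip(Rs, ts):
        X = triangulate(p_ref, p_cur, K, R, t)
        depth2 = (X @ R.T + t.ravel())[:, 2]       # depth in 2nd view
        n_good = int(np.sum((X[:, 2] > 0) & (depth2 > 0)))
        if n_good > best_good:
            best, best_good = (R, t), n_good
    return best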
Table 4 Initialization. Abbreviations used: homography decomposition (h.d.), Essential decomposition (e.d.), random depth initialization (r.d.), planar (p), non-planar (n.p.), no assumption (n.a.)

                          PTAM   SVO    DT SLAM   LSD SLAM   ORB SLAM    DPPTAM
Initialization            h.d.   h.d.   e.d.      r.d.       h.d.+e.d.   r.d.
Initial scene assumption  p      p      n.p.      n.a.       n.a.        n.a.
[Figure: flowchart of a generic tracking thread — a new frame is processed over its pyramid levels, features, and regions of interest; failure recovery is invoked when needed; map maintenance data is updated; and the frame is sent to the mapping thread if it qualifies as a keyframe.]

3.4 Pose estimation
Direct and indirect methods estimate the camera pose by minimizing a measure of error between frames; direct methods measure the photometric error, while indirect methods estimate the camera pose by minimizing the re-projection error of landmarks from the map over the frame's prior pose. The re-projection error is formulated as the distance in pixels between a 3D landmark projected onto the frame using the prior pose and its found 2D position in the image.

Note in Fig. 9 how camera pose estimation takes place. The motion model is used to seed the new frame's pose at Cm, and a list of potentially visible 3D landmarks from the map is projected onto the new frame. Data association takes place in a search window Sw surrounding the locations of the projected landmarks. The system then proceeds by minimizing the re-projection error d over the parameters of the rigid body transformation. To gain robustness against outliers (wrongly associated features), the minimization takes place over an objective function that penalizes features with large re-projection errors.

Fig. 9 Generic pose estimation procedure. Cm is the new frame's pose estimated by the motion model and C2 is the actual camera pose.
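A minimal sketch of such a robust objective is given below: pixel re-projection residuals under a simple pinhole model, weighted by the Tukey biweight function. The helper names and the cutoff constant c are our own illustrative choices, not values from any surveyed system:

import numpy as np

def tukey_weight(r, c=4.685):
    # Tukey biweight: weight (1 - (r/c)^2)^2 inside the cutoff c and
    # zero outside, so gross outliers stop influencing the pose.
    r = np.asarray(r, dtype=np.float64)
    w = np.zeros_like(r)
    inside = np.abs(r) < c
    w[inside] = (1.0 - (r[inside] / c) ** 2) ** 2
    return w

def reprojection_residuals(X_world, uv_obs, R, t, K):
    # Pixel residuals of 3D landmarks X_world (N x 3) projected with
    # pose (R, t) and intrinsics K against observations uv_obs (N x 2).
    Xc = X_world @ R.T + t                 # world -> camera frame
    uv = Xc @ K.T                          # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]
    return np.linalg.norm(uv - uv_obs, axis=1)

The weights returned by tukey_weight(reprojection_residuals(...)) would then scale each feature's contribution inside an iterative pose optimization.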
PTAM. PTAM represents the camera pose as an SE(3) transformation (Hall, 2015) that can be minimally represented by six parameters. The mapping from the full SE(3) transform to its minimal representation se(3), and vice versa, can be done through logarithmic and exponential mapping in Lie algebra. The minimal se(3) representation is of great importance, as it reduces the number of parameters to optimize from twelve to six, leading to significant speedups in the optimization process.

In PTAM, the pose estimation procedure first starts by estimating a prior to the frame's pose using the constant velocity motion model. The prior is then refined, using a Small Blurry Image (SBI) representation of the frame, by employing an Efficient Second Order minimization (Benhimane and Malis, 2007). The velocity of the prior is defined as the change between the current estimate of the pose and the previous camera pose. If the velocity is high, PTAM anticipates that a fast motion is taking place, and hence the presence of motion blur; to counter failure from motion blur, PTAM restricts tracking to take place only at the highest pyramid levels (most resilient to motion blur), in what is known as a coarse tracking stage only; otherwise, the coarse tracking stage is followed by a fine tracking stage. However, when the camera is stationary, the coarse stage may lead to jittering of the camera's pose, hence it is turned off.

The minimally represented initial camera pose prior is then refined by minimizing the Tukey biweight (Moranna et al, 2006) objective function of the re-projection error, which down-weights observations with large error. If fine tracking is to take place, features from the lowest pyramid levels are selected and a similar procedure to the above is repeated.

To determine the tracking quality, the pose estimation thread in PTAM monitors the ratio of successfully matched features in the frame against the total number of attempted feature matches.
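As an aside, the exponential map from a six-parameter twist ξ in se(3) to a 4 × 4 SE(3) matrix is a standard construction (Rodrigues' formula plus the left Jacobian); the sketch below is our own, not PTAM's implementation:

import numpy as np

def skew(w):
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def se3_exp(xi):
    # Exponential map se(3) -> SE(3). xi = (v, w): translational part v
    # and rotational part w, six parameters in total.
    v, w = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    W = skew(w)
    if theta < 1e-10:
        R, V = np.eye(3), np.eye(3)
    else:
        A = np.sin(theta) / theta
        B = (1 - np.cos(theta)) / theta ** 2
        C = (1 - A) / theta ** 2
        R = np.eye(3) + A * W + B * (W @ W)    # Rodrigues' formula
        V = np.eye(3) + B * W + C * (W @ W)    # left Jacobian
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T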
Table 5 Pose estimation. Abbreviations are as follows: constant velocity motion model (c.v.m.m.), same as previous pose (s.a.p.p.), similarity transform with previous frame (s.t.p.f.), optimization through minimization of features (o.m.f.), optimization through minimization of photometric error (o.m.p.e.), Essential matrix decomposition (e.m.d.), pure rotation estimation from 2 points (p.r.e.), significant pose change (s.p.c.), significant scene appearance change (s.s.a.c.)

                        PTAM           SVO       DT SLAM      LSD SLAM  ORB SLAM       DPPTAM
Motion prior            c.v.m.m.+ESM   s.a.p.p.  s.t.p.f.     s.a.p.p.  c.v.m.m. or    c.v.m.m. or
                                                                        place recogn.  s.a.p.p.
Tracking                o.m.f.         o.m.p.e.  3 modes:     o.m.p.e.  o.m.f.         o.m.p.e.
                                                 1. e.m.d.;
                                                 2. o.m.f.;
                                                 3. p.r.e.
Keyframe add criterion  s.p.c.         s.p.c.    s.s.a.c.     s.p.c.    s.s.a.c.       s.p.c.
If the tracking quality is questionable, the tracking thread operates normally, but no keyframes are accepted by the system. If the tracker's performance is deemed bad for three consecutive frames, then the tracker is considered lost and failure recovery is initiated.

Table 5 summarizes the pose estimation methods used by different Visual SLAM systems.
SVO. SVO uses a sparse model-based image alignment in a pyramidal scheme to obtain an initial camera pose estimate. It starts by assuming the camera pose at time t to be the same as at t − 1 and aims to minimize the photometric error of 2D image locations of known depth in the current frame with respect to their location at t − 1, by varying the camera transformation relating both frames. The minimization takes place through thirty Gauss-Newton iterations of the inverse compositional image alignment method. This however introduces many limitations to SVO, since the ICIA requires small displacements between frames (on the order of 1 pixel). This limits the operation of SVO to high frame rate cameras (typically > 70 fps) so that the displacement limitation is not exceeded. Furthermore, the ICIA is based on the brightness consistency constraint, rendering it vulnerable to any variations in lighting conditions.

SVO does not employ explicit feature matching for every incoming frame; rather, it is achieved implicitly as a byproduct of the image alignment step. Once image alignment takes place, landmarks that are estimated to be visible in the current frame are projected onto the image. The 2D locations of the projected landmarks are fine-tuned by minimizing the photometric error between a patch, extracted from the initial projected location in the current frame, and a warp of the landmark generated from the nearest keyframe observing it. To decrease the computational complexity and to maintain only the strongest features, the frame is divided into a grid and only one projected landmark (the strongest) per grid cell is used. However, this minimization violates the epipolar constraint for the entire frame, and further processing in the tracking module is required. Motion-only Bundle Adjustment then takes place, followed by a structure-only Bundle Adjustment that refines the 3D locations of the landmarks based on the refined camera pose of the previous step.

Finally, a joint (pose and structure) local bundle adjustment fine-tunes the reported camera pose estimate. During this pose estimation module, the tracking quality is continuously monitored; if the number of observations in a frame falls below a certain threshold, or if the number of features between consecutive frames drops drastically, tracking quality is deemed insufficient and failure recovery methods are initiated.

DT SLAM. DT SLAM maintains a camera pose based on three tracking modes: full pose estimation, Essential matrix estimation, and pure rotation estimation. When a sufficient number of 3D matches exist, a full pose can be estimated; otherwise, if a sufficient number of 2D matches are established that exhibit small translations, an Essential matrix is estimated; and finally, if a pure rotation is exhibited, 2 points are used to estimate the absolute orientation of the matches (Kneip et al, 2012). The pose estimation module finally aims, in an iterative manner, to minimize the error vector of both 3D-2D re-projections and 2D-2D matches. When tracking failure occurs, the system initializes a new map and continues to collect data for tracking in a different map; however, the map making thread continues to look for possible matches between the keyframes of the new map and the old one, and once a match is established, both maps are fused together, thereby allowing the system to handle multiple sub-maps, each at a different scale.

LSD SLAM. The tracking thread in LSD SLAM is responsible for estimating the current frame pose with respect to the currently active keyframe in the map, using the previous frame pose as a prior. The required pose is represented by an SE(3) transformation and is found by an iteratively re-weighted Gauss-Newton optimization that minimizes the variance-normalized photometric residual error, as described in Engel et al (2013), between the current frame and the active keyframe in the map. A keyframe is considered active if it is the most recent keyframe accommodated in the map. To minimize outlier effects, measurements with large residuals are down-weighted from one iteration to the other.
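The iteratively re-weighted scheme can be sketched generically as follows, here with a Huber weight on a linear least squares problem; this is our own toy example and not LSD SLAM's implementation (which operates on photometric residuals over SE(3)):

import numpy as np

def huber_weights(r, k=1.345):
    # Huber weights: full influence for small residuals, down-weighted
    # (linear influence) beyond the threshold k.
    a = np.abs(r)
    w = np.ones_like(a)
    large = a > k
    w[large] = k / a[large]
    return w

def irls(A, b, iters=10):
    # Iteratively Re-weighted Least Squares for A x ~= b: re-solve a
    # weighted problem, updating weights from the residuals each time.
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(iters):
        sw = np.sqrt(huber_weights(b - A @ x))
        x = np.linalg.lstsq(A * sw[:, None], sw * b, rcond=None)[0]
    return x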
ORB SLAM. Pose estimation in ORB SLAM is established through a constant velocity motion model prior, followed by a pose refinement using optimization. As the motion model is expected to be easily violated through abrupt motions, ORB SLAM detects such failures by tracking the number of matched features; if it falls below a certain threshold, map points are projected onto the current frame and a wide-range feature search takes place around the projected locations. If tracking fails, ORB SLAM invokes its failure recovery method to establish an initial frame pose via global re-localization.

In an effort to make ORB SLAM operate in large environments, a subset of the global map, known as the local map, is defined by all landmarks corresponding to the set of all keyframes that share edges with the current frame, as well as all neighbors of this set of keyframes from the pose graph (more on that in the following section). The selected landmarks are filtered to keep only the features that are most likely to be matched in the current frame. Furthermore, if the distance from the camera's center to the landmark is beyond the range of the valid feature scales, the landmark is also discarded. The remaining set of landmarks is then searched for and matched in the current frame, before a final camera pose refinement step takes place.

DPPTAM. Similar to LSD SLAM, DPPTAM optimizes the photometric error of high-gradient pixel locations between two images using the ICIA formulation over the SE(3) transform relating them. The minimization is started using a constant velocity motion model, unless the photometric error increases after applying it. If the latter is true, the motion model is disregarded and the pose of the last tracked frame is used. Similar to PTAM, the optimization takes place in the tangent space se(3) that minimally parameterizes the rigid body transform by six parameters.

3.5 Map generation

A topological representation describes the map by forfeiting geometric information (scale, distance and direction) in favor of connectivity information. In the context of Visual SLAM, a topological map is an undirected graph of nodes that typically represents keyframes linked together by edges, when shared data associations between the nodes exist.

While topological maps scale well with large scenes, in order to maintain camera pose estimates, metric information is also required; the conversion from a topological to a metric map is not always a trivial task, and therefore recent Visual SLAM systems such as (Engel et al, 2014; Lim et al, 2014, 2011; Mur-Artal et al, 2015) employ hybrid maps that are locally metric and globally topological. The implementation of a hybrid map representation permits the system to (1) reason about the world on a high level, which allows for efficient solutions to loop closures and failure recovery using topological information, and (2) increase the efficiency of the metric pose estimate by limiting the scope of the map to a local region surrounding the camera (Fernández-Moral et al, 2015). A hybrid map allows for local optimization of the metric map while maintaining scalability of the optimization over the global topological map (Konolige, 2010).

In a metric map, the map making process handles the initialization of new landmarks into the map, as well as outlier detection and handling. The 3D structure of the observed scene is sought from a known transformation between two frames, along with the corresponding data associations. Due to noise in data association and pose estimates of the tracked images, projecting rays from two associated features will most probably not intersect in 3D space. Triangulation by optimization (shown in Fig. 11) aims to estimate a landmark pose corresponding to the associated features, by minimizing its re-projection errors e1 and e2 onto both frames. To gain resilience against outliers and to obtain better accuracy, some systems employ a similar optimization over features associated across more than two views.
Fig. 11 Triangulation by optimization: a landmark X is estimated by minimizing its re-projection errors e1 and e2 onto both observing frames.
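Triangulation by optimization can be sketched as a small nonlinear least squares problem; the following example (ours, using SciPy, with P1 and P2 denoting the two 3 × 4 projection matrices) refines a linear-triangulation guess X0 by minimizing the stacked re-projection errors e1 and e2:

import numpy as np
from scipy.optimize import least_squares

def project(P, X):
    # Project a homogeneous 3D point X (length 4) with a 3 x 4 matrix P.
    x = P @ X
    return x[:2] / x[2]

def triangulate_by_optimization(P1, P2, uv1, uv2, X0):
    # Refine a 3D point by minimizing its re-projection errors onto
    # both frames, as in Fig. 11.
    def residuals(X):
        Xh = np.append(X, 1.0)
        return np.concatenate([project(P1, Xh) - uv1,
                               project(P2, Xh) - uv2])
    return least_squares(residuals, X0).x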
Filter-based landmark estimation (Fig. 12) represents a newly observed landmark with a particle filter with a uniform distribution (D1) of landmark position estimates, which are then updated as the landmark is observed across multiple views. This continues until the filter converges from a uniform distribution to a Gaussian with a small variance (D3). In this type of landmark estimation, outliers are easily flagged as landmarks whose distributions remain approximately uniform after significant observations. Filter-based methods result in a time delay before an observed landmark can be used for pose tracking, in contrast to triangulation-by-optimization methods, whose landmarks can be used as soon as they are triangulated from two views.

A major limitation in all these methods is that they require a baseline between the images observing the feature, and hence are all prone to failure when the camera's motion is constrained to pure rotations. To counter such a mode of failure, DT SLAM introduced into the map 2D landmarks that can be used for rotation estimation before they are triangulated into 3D landmarks.

Fig. 12 Landmark estimation using filter based methods.
Table 6 summarizes the map generation methods employed by different Visual SLAM systems, which can be divided into two main categories: triangulation by optimization (PTAM and ORB SLAM) and filter-based landmark estimation (SVO, LSD SLAM and DPPTAM).

PTAM. When a new keyframe is added in PTAM, all bundle adjustment operations are halted, and the new keyframe inherits the pose from the coarse tracking stage. The potentially visible set of landmarks, estimated by the tracker, is then re-projected onto the new keyframe, and feature matches are established. Correctly matched landmarks are marked as seen again; this is done to keep track of the quality of the landmarks and to allow for the map refinement step to remove corrupt data.

New landmarks are generated by establishing and triangulating feature matches between the newly added keyframe and its nearest keyframe (in terms of position) from the map. Already existing landmarks from the map are projected onto both keyframes, and feature matches from the current keyframe are searched for along their corresponding epipolar line in the other keyframe, at regions that do not contain projected landmarks. The average depth of the projected landmarks is used to constrain the epipolar search from a line to a segment; this limits the computation cost of the search and avoids adding landmarks in regions where nearby landmarks exist. However, this also limits the newly created landmarks to be within the epipolar segment, and hence very large variations in the scene's depth may lead to the negligence of possible landmarks.

SVO. The map generation thread in SVO runs parallel to the tracking thread and is responsible for creating and updating the map. SVO parametrizes 3D landmarks using an inverse depth parameterization model (Civera et al, 2008). Upon insertion of a new keyframe, features possessing the highest Shi-Tomasi scores are chosen to initialize a number of depth filters. These features are labeled as seeds and are initialized to be along a line propagating from the camera center to the 2D location of the seed in the originating keyframe. The only parameter that remains to be solved for is then the depth of the landmark, which is initialized to the mean of the scene's depth, as observed from the keyframe of origin.

During the times when no new keyframe is being processed, the map management thread monitors and updates map seeds using subsequent observations in newly acquired frames. The seed is searched for in new frames along an epipolar search line, which is limited by the uncertainty of the seed and the mean depth distribution observed in the current frame. As the filter converges, its uncertainty decreases and the epipolar search range decreases. If seeds fail to match frequently, if they diverge to infinity, or if a long time has passed since their initialization, they are considered bad seeds and removed from the map.

The filter converges when the distribution of the depth estimate of a seed transitions from the initially assumed uniform distribution into a Gaussian one. The seed is then added into the map, with the mean of the Gaussian distribution as its depth.
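A grossly simplified version of such a depth filter is sketched below: the seed's inverse depth is tracked as a Gaussian that is fused with each new measurement, and the seed is accepted once its variance is small. This is our own illustration; the actual filter used in SVO additionally models the probability that a measurement is an outlier:

import numpy as np

class DepthSeed:
    # Simplified SVO-style seed: a Gaussian over inverse depth.
    def __init__(self, mean_inv_depth, sigma2):
        self.mu = mean_inv_depth     # mean inverse depth
        self.sigma2 = sigma2         # variance of inverse depth

    def update(self, z, tau2):
        # Fuse a measurement z (inverse depth) with variance tau2:
        # product of two Gaussians.
        s2 = 1.0 / (1.0 / self.sigma2 + 1.0 / tau2)
        self.mu = s2 * (self.mu / self.sigma2 + z / tau2)
        self.sigma2 = s2

    def converged(self, thresh=1e-4):
        # Accept the seed once its uncertainty is small enough.
        return self.sigma2 < thresh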
Table 6 Map generation. Abbreviations: 2 view triangulation (2.v.t.), particle filter with inverse depth parametrization (p.f.), 2D landmarks triangulated to 3D landmarks (2D.l.t.), depth map propagation from previous frame (p.f.p.f.), depth map refined through small baseline observations (s.b.o.), multiple hypotheses photometric error minimization (m.h.p.m.)

                PTAM     SVO     DT SLAM   LSD SLAM       ORB SLAM   DPPTAM
Map generation  2.v.t.   p.f.    2D.l.t.   p.f.p.f. or    2.v.t.     m.h.p.m.
                                           s.b.o.
Map type        metric   metric  metric    hybrid         hybrid     metric
This process however limits SVO to operate in environments of relatively uniform depth distributions. Since the initialization of landmarks in SVO relies on many observations in order for the features to be triangulated, the map contains few if any outliers, and hence no outlier deletion method is required. However, this comes at the expense of a delay before the features are initialized as landmarks and added to the map.

DT SLAM. DT SLAM aims to add keyframes when enough visual change has occurred; the three criteria for keyframe addition are (1) for the frame to contain a sufficient number of new 2D features that can be created from areas not covered by the map, or (2) a minimum number of 2D features can be triangulated into 3D landmarks, or (3) a given number of already existing 3D landmarks have been observed from a significantly different angle. The map contains both 2D features and 3D landmarks, where the triangulation of 2D features into 3D landmarks is done through two-view triangulation and is deferred until enough parallax between keyframes is observed, hence the name of the algorithm.

LSD SLAM. LSD SLAM's map generation module functions can be divided into two main categories, depending on whether the current frame is a keyframe or not; if it is, depth map creation takes place by keyframe accommodation; if not, depth map refinement is done on regular frames. To maintain tracking quality, LSD SLAM requires frequent addition of keyframes into the map, as well as relatively high frame rate cameras.

If a frame is labeled a keyframe, the estimated depth map from the previous keyframe is projected onto it and serves as its initial depth map. Spatial regularization then takes place by replacing each projected depth value by the average of its surrounding values, and the variance is chosen as the minimal variance value of the neighboring measurements.

In LSD SLAM, outliers are detected by monitoring the probability of the projected depth hypothesis at each pixel to be an outlier or not. To make the outlier detection step possible, LSD SLAM keeps records of all successfully matched pixels during the tracking thread, and accordingly increases or decreases the probability of them being outliers.

The Sim(3) of a newly added keyframe is then estimated and refined in a direct, scale-drift aware image alignment scheme, which is similar to the one done in the tracking thread, but with respect to other keyframes in the map and over the 7 d.o.f. Sim(3) transform.

Due to the non-convexity of the direct image alignment method on Sim(3), an accurate initialization to the minimization procedure is required; for such purpose, ESM (Efficient Second Order minimization) (Benhimane and Malis, 2007) and a coarse-to-fine pyramidal scheme with very low resolutions proved to increase the convergence radius of the task.

If the map generation module deems the current frame as not being a keyframe, depth map refinement takes place by establishing stereo matches for each pixel in a suitable reference frame. The reference frame for each pixel is determined by the oldest frame the pixel was observed in, where the disparity search range and the observation angle do not exceed a certain threshold. A 1-D search along the epipolar line for each pixel is performed with an SSD metric.

To minimize computational cost and reduce the effect of outliers on the map, not all established stereo matches are used to update the depth map; instead, a subset of pixels is selected for which the accuracy of a disparity search is sufficiently large. The accuracy is determined by three criteria: the photometric disparity error, the geometric disparity error, and the pixel to inverse depth ratio. Further details regarding these criteria are outside the scope of this work; the interested reader is referred to Engel et al (2013). Finally, depth map regularization and outlier handling, similar to the keyframe processing step, take place.

ORB SLAM. ORB SLAM's local mapping thread is responsible for keyframe insertion, map point triangulation, map point culling, keyframe culling and local bundle adjustment. The keyframe insertion step is responsible for updating the co-visibility and essential graphs with the appropriate edges, as well as computing the bag of words representing the newly added keyframe in the map. The co-visibility graph is a pose graph that represents all keyframes in the system by nodes, in contrast to the essential graph that allows every node to have two or fewer edges, by keeping only the strongest two edges for every node. The map point creation module spawns new landmarks by triangulating ORB features that appear in two or more views from connected keyframes in the co-visibility graph. Triangulated landmarks are tested for positive depth, re-projection error, and scale consistency in all keyframes they are observed in, in order to accommodate them into the map.
Fig. 13 Map maintenance: the map is optimized through LBA and/or GBA and/or PGO, loop closures are searched for, and a dense reconstruction is performed if required.
DPPTAM. Landmark triangulation in DPPTAM takes place over several overlapping observations of the scene using inverse depth parametrization; the map maker aims to minimize the photometric error between a high-gradient pixel patch in the last added keyframe and the corresponding patch of pixels, found by projecting the feature from the keyframe onto the current frame. The minimization is repeated ten times for all high-gradient pixels when the frame exhibits enough translation; the threshold for translation is increased from one iteration to another to ensure enough baseline between the frames. The end result is ten hypotheses for the depth of each high-gradient pixel. To deduce the final depth estimate from the hypotheses, three consecutive tests are performed, including a gradient direction test, temporal consistency, and spatial consistency.

3.6 Map maintenance

Map maintenance takes care of optimizing the map through either bundle adjustment or pose graph optimization (Kummerle et al, 2011). Figure 13 presents the steps required for map maintenance of a generic Visual SLAM. During a map exploration phase, new 3D landmarks are triangulated based on the camera pose estimates. After some time, system drift manifests itself in wrong camera pose measurements, due to accumulated errors in previous camera poses that were used to expand the map. Figure 14 describes the map maintenance effect, where the scene's map is refined through outlier removal and error minimization, to yield a more accurate scene representation.

Bundle adjustment (BA) is inherited from SfM and consists of a nonlinear optimization process for refining a visual reconstruction, to jointly produce an optimal structure and coherent camera pose estimates. Bundle adjustment is computationally involved and intractable if performed on all frames and all poses. The breakthrough that enabled its application in PTAM is the notion of keyframes, where only select frames labeled as keyframes are used in the map creation and passed to the bundle adjustment process, in contrast to SfM methods that use all available frames. Different algorithms apply different criteria for keyframe labeling, as well as different strategies for BA; some jointly use a local (over a local number of keyframes) LBA and a global (over the entire map) GBA, while others argue that a local BA only is sufficient to maintain a good quality map. To reduce the computational expenses of bundle adjustment, Strasdat et al (2011) proposed to represent the visual SLAM map by both a Euclidean map for LBA, along with a topological map for pose graph optimization that explicitly distributes the accumulated drift along the entire map.

Pose Graph Optimization (PGO) returns inferior results to those produced by GBA. The reason is that while PGO optimizes only for the keyframe poses, and accordingly adjusts the 3D structure of landmarks, GBA jointly optimizes for both keyframe poses and 3D structure. The stated advantage of GBA comes at the cost of computational time, with PGO exhibiting a significant speed-up compared to GBA. However, pose graph optimization requires efficient loop closure detection and may not yield an optimal result, as the errors are distributed along the entire map, leading to locally induced inaccuracies in regions that were not originally wrong.
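The drift-distributing behavior of pose graph optimization can be seen in a toy example: the following sketch (ours; one-dimensional poses, linear least squares) spreads a loop closure discrepancy along the whole chain of keyframe poses:

import numpy as np

def pgo_1d(n, odom, loops):
    # Toy 1D pose graph: poses x_0..x_{n-1} and relative constraints
    # (i, j, z) meaning x_j - x_i = z. Solves the least squares problem
    # with x_0 anchored at 0, distributing drift along the graph.
    rows, rhs = [], []
    for i, j, z in odom + loops:
        r = np.zeros(n)
        r[j], r[i] = 1.0, -1.0
        rows.append(r)
        rhs.append(z)
    anchor = np.zeros(n)
    anchor[0] = 1.0                       # gauge constraint: x_0 = 0
    A = np.vstack(rows + [anchor])
    b = np.array(rhs + [0.0])
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

# Five poses: odometry accumulates 4.0 units of travel, but the loop
# closure measures only 3.5; the 0.5 discrepancy is spread over all edges.
x = pgo_1d(5, odom=[(i, i + 1, 1.0) for i in range(4)], loops=[(0, 4, 3.5)])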
Map maintenance is also responsible for detecting and removing outliers in the map due to noisy and faulty matched features. While the underlying assumption of most Visual SLAM algorithms is that the environment is static, some algorithms, such as RD SLAM, exploit map maintenance methods to accommodate slowly varying scenes (lighting and structural changes).

PTAM. The map making thread in PTAM runs parallel to the tracking thread and does not operate on a frame by frame basis; instead, it only processes keyframes. When the map making thread is not processing new keyframes, it performs various optimizations and maintenance to the map.
Table 7 Map maintenance. Abbreviations used: Local Bundle Adjustment (LBA), Global Bundle Adjustment (GBA), Pose Graph Optimization (PGO)

              PTAM       SVO            DT SLAM    LSD SLAM        ORB SLAM        DPPTAM
Optimization  LBA & GBA  LBA            LBA & GBA  PGO             PGO & LBA       Dense mapping
Scene type    static &   uniform depth  static &   static & small  static & small  static & indoor
              small                     small      or large        or large        planar
then the re-localizer considers itself converged and contin- 3.8 Loop closure
ues tracking regularly; otherwise, it attempts to re-localize
using new incoming frames. Such a re-localizer is sensitive Since Visual SLAM is an optimization problem, it is prone
to any change in the lighting conditions of the scene, and the to drifts in camera pose estimates. Returning to a certain
lost frame location should be close enough to the queried pose after an exploration phase may not yield the same cam-
keyframe for successful re-localization to take place. era pose measurement as it was at the start of the run (See
LSD SLAM. LSD SLAM’s recovery procedure first Fig. 15). Such camera pose drift can also manifest itself in a
chooses randomly a keyframe from the map that has more map scale drift that will eventually lead the system to erro-
than two neighboring keyframes connected to it in the pose neous measurements and fatal failure. To address this issue,
graph. It then attempts to align the currently lost frame to some algorithms detect loop closures in an on-line Visual
it. If the outlier-to-inlier ratio is large, the keyframe is dis- SLAM session and optimize the loops track, in an effort to
carded and replaced by another keyframe at random; other- correct the drift and the error in the camera pose and in all
wise, all neighboring keyframes connected to it in the pose relevant map data that were created during the loop. The
graph are then tested. If the number of neighbors with a large loop closure thread attempts to establish loops upon the in-
inlier-to-outlier ratio is larger than the number of neighbors sertion of a new keyframe in order to correct and minimize
with a large outlier-to-inlier ratio, or if there are more than any accumulated drift by the system over time.
five neighbors with a large inlier-to-outlier ratio, the neigh-
boring keyframe with the largest ratio is set as the active
keyframe and regular tracking is accordingly resumed.
ORB SLAM. Triggered by tracking failure, ORB SLAM invokes its global place recognition module. Upon running, the re-localizer transforms the current frame into a bag of words and queries the database of keyframes for all possible keyframes that might be used to re-localize from. The place recognition module implemented in ORB SLAM, which is used for both loop detection and failure recovery, relies on bags of words, as frames observing the same scene share a large number of common visual vocabulary words. In contrast to other bag of words methods that return the best queried hypothesis from the database of keyframes, the place recognition module of ORB SLAM returns all hypotheses whose probability of being a match is larger than seventy-five percent of the best match. The combined added value of the ORB features, along with the bag of words implementation of the place recognition module, manifests itself in real-time operation, high recall, and a relatively high tolerance to viewpoint changes during re-localization and loop detection. All hypotheses are then tested through a RANSAC implementation of the PnP algorithm (Lepetit et al, 2009) that determines the camera pose from a set of 3D-to-2D correspondences. The camera pose with the most inliers is then used to establish more matches to features associated with the candidate keyframe, before an optimization over the camera's pose using the established matches takes place.
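The recovery pipeline above can be sketched in a few lines. The fragment below is an illustration rather than ORB SLAM's implementation: it assumes the bag of words query has already produced a list of candidate keyframes with their 3D-to-2D correspondences, and it substitutes OpenCV's `solvePnPRansac` for the paper's RANSAC loop over EPnP.

```python
import numpy as np
import cv2

def relocalize(candidates, K):
    """Test every place-recognition hypothesis with RANSAC over a PnP
    solver and keep the pose with the most inliers (illustrative sketch,
    not ORB SLAM's code). `candidates` is assumed to be a list of
    (pts3d, pts2d) correspondence arrays, one per candidate keyframe;
    `K` is the 3x3 camera intrinsics matrix."""
    best = None
    for pts3d, pts2d in candidates:
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            pts3d.astype(np.float32), pts2d.astype(np.float32), K, None)
        if ok and inliers is not None and (best is None or len(inliers) > best[0]):
            best = (len(inliers), rvec, tvec)
    # The winning pose would then be used to establish more feature
    # matches and refined by a final pose-only optimization.
    return best
```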
Table 8 summarizes the failure recovery mechanisms used by the different Visual SLAM systems.

Table 8 Failure recovery. Abbreviations used: photometric error minimization of SBIs (p.e.m.), image alignment with last correctly tracked keyframe (i.a.l.), image alignment with random keyframe (i.a.r.), bag of words place recognition (b.w.)

                   PTAM     SVO      DT SLAM   LSD SLAM   ORB SLAM   DPPTAM
Failure recovery   p.e.m.   i.a.l.   none      i.a.r.     b.w.       i.a.l.
Visual SLAM systems are susceptible to drifts in camera pose estimates: returning to a certain pose after an exploration phase may not yield the same camera pose measurement as at the start of the run (see Fig. 15). Such camera pose drift can also manifest itself as map scale drift, which will eventually lead the system to erroneous measurements and fatal failure. To address this issue, some algorithms detect loop closures in an on-line Visual SLAM session and optimize over the loop's track, in an effort to correct the drift and the error in the camera pose, as well as in all relevant map data created during the loop. The loop closure thread attempts to establish loops upon the insertion of a new keyframe, in order to correct and minimize any drift accumulated by the system over time.

Fig. 15 Drift suffered by the Visual SLAM pose estimate after returning to its starting point (plot of the actual path versus the estimated path).

LSD SLAM. Whenever a keyframe is processed by LSD SLAM, loop closures are searched for within its ten nearest keyframes, as well as through the appearance-based model of FABMAP (Glover et al, 2012), to establish both ends of a loop. Once a loop edge is detected, a pose graph optimization minimizes the similarity error established at the loop's edge by distributing the error over the loop's keyframe poses.

ORB SLAM. Loop detection in ORB SLAM takes place via its global place recognition module, which returns all hypotheses of keyframes from the database that might correspond to the opposing loop end. To ensure that enough distance change has taken place, the authors compute what they refer to as the similarity transform between the current keyframe and all keyframes connected to it in the thresholded co-visibility graph. If the similarity score is less than a threshold, the loop hypothesis is removed. If enough inliers support the refined similarity transform, the queried keyframe is considered to be the other end of the loop, and loop fusion takes place. The loop fusion first merges duplicate map points in both keyframes and inserts a new edge in the co-visibility graph that closes the loop by correcting the Sim(3) pose of the current keyframe using the similarity transform. Using the corrected pose, all landmarks associated with the queried keyframe and its neighbors are projected to, and searched for, in all keyframes associated with the current keyframe in the co-visibility graph. The initial set of inliers, as well as the found matches, is used to update the co-visibility and Essential graphs, establishing many edges between the two ends of the loop. Finally, a pose graph optimization over the Essential graph takes place, similar to that of LSD SLAM, minimizing and distributing the loop closing error along the loop nodes.
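As a much-simplified illustration of what "distributing the error over the loop" means, the toy sketch below spreads a translational closure error linearly along a chain of keyframe positions. Actual systems instead run a full pose graph optimization over the Sim(3) keyframe poses, typically with a solver such as g2o (Kummerle et al, 2011).

```python
import numpy as np

def distribute_loop_error(positions, closure_error):
    """Toy illustration of loop-closure error distribution (not the pose
    graph optimization used by LSD SLAM or ORB SLAM). `positions` is an
    (N, 3) array of keyframe positions along the loop, and `closure_error`
    is the 3-vector gap measured at the loop edge. Each keyframe absorbs
    a share of the error proportional to its position along the loop."""
    positions = np.asarray(positions, dtype=float)
    weights = np.linspace(0.0, 1.0, len(positions))[:, None]  # 0 at start, 1 at end
    return positions - weights * np.asarray(closure_error, dtype=float)
```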
Table 9 summarizes the loop closure mechanisms used by the different Visual SLAM systems.

Table 9 Loop closure. Abbreviations used: bag of words place recognition (B.W.p.r.), sim(3) optimization (s.o.)

               PTAM   SVO    DT SLAM   LSD SLAM        ORB SLAM          DPPTAM
Loop closure   none   none   none      FabMap + s.o.   B.W.p.r. + s.o.   none
4 Closed source systems

We have discussed so far methods presented in open-source Visual SLAM systems; however, plenty of closed source methods exist in the literature. This section aims to provide a quick overview of these systems, which include many interesting ideas for the reader. Table 10 lists each of these systems in chronological order. To avoid repetition, we will not outline the complete details of each system; rather, we will focus on what we feel has additive value for the reader beyond the information provided in Section 3.

In 2006, Mouragnon et al (2006) were the first to introduce the concept of keyframes in Visual SLAM, employing a local Bundle Adjustment in real-time over a subset of keyframes in the map. To ensure a sufficient baseline, the system is initialized by the automatic insertion of three keyframes. However, the system does not use the three views to perform the initialization; instead, it solves for the initialization using the 5-point algorithm of Nistér (2004) between the first and third keyframes only, and it therefore remains susceptible to planar scenes.

Silveira et al (2008) proposed a real-time direct solution that assumes relatively large patches of pixels surrounding regions of high intensity gradients to be planar, and that performs image alignment by minimizing the photometric error of these patches across incoming frames, in a single optimization step that incorporates cheirality, geometric, and photometric constraints. To gain resilience against lighting changes and outliers, the system employs a photogeometric generative model and monitors the errors in the minimization process to flag outliers.

In 2010, Strasdat et al (2010a) introduced similarity transforms into Visual SLAM, allowing for scale drift estimation and correction once the system detects a loop closure. Feature tracking is performed by a mixture of top-down and bottom-up approaches, using a dense variational optical flow and a search over a window surrounding the projected landmarks. Landmarks are triangulated by updating information filters, and loop detection is performed using a bag of words discretization of SURF features (Bay et al, 2008). The loop is finally closed by applying a pose graph optimization over the similarity transforms relating the keyframes.

Also in 2010, Newcombe and Davison (2010) suggested a hybrid Visual SLAM system that relies on feature-based SLAM (PTAM) to fit a dense surface estimate of the environment, which is then refined using direct methods. A surface-based model is computed and polygonized to best fit the triangulated landmarks from the feature-based front end. A parallel process chooses a batch of frames that have a potentially overlapping surface visibility, in order to estimate a dense refinement over the base mesh using a GPU-accelerated implementation of variational optical flow.

In an update to this work, Newcombe released in 2011 Dense Tracking and Mapping (DTAM) (Newcombe et al, 2011), which removed the need for PTAM as a front-end to the system and generalized the dense reconstruction to fully solve the Visual SLAM pipeline, performing an on-line dense reconstruction given camera pose estimates found through whole-image alignment.
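For readers unfamiliar with direct methods, the sketch below spells out the photometric cost that whole-image alignment minimizes. It is illustrative only: `warp` is a hypothetical callable mapping reference pixels into the current image under a candidate camera pose, and a real system minimizes this cost over the pose parameters in a coarse-to-fine scheme rather than evaluating it naively pixel by pixel.

```python
import numpy as np

def photometric_error(ref_img, cur_img, warp):
    """Sum of squared intensity residuals between a reference image and a
    current image warped by a candidate pose (illustrative sketch). Both
    images are 2-D grayscale arrays; `warp(u, v)` returns the hypothetical
    corresponding pixel coordinates in the current image."""
    h, w = ref_img.shape
    err = 0.0
    for v in range(h):
        for u in range(w):
            u2, v2 = warp(u, v)
            if 0 <= int(v2) < h and 0 <= int(u2) < w:   # skip out-of-view pixels
                r = float(ref_img[v, u]) - float(cur_img[int(v2), int(u2)])
                err += r * r
    return err
```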
Table 10 Closed source Visual SLAM systems, in chronological order.

2006  Real Time Localization and 3D Reconstruction, Mouragnon et al (2006)
      – Introduced keyframes and local Bundle Adjustment (LBA) for real-time Visual SLAM
      – Utilizes 3 keyframes for initialization

2010  Live dense reconstruction with a single moving camera, Newcombe and Davison (2010)
      – Fits a base mesh to a sparse set of landmarks triangulated using PTAM
      – GPU parallelization of variational optical flow to refine the base mesh

2011  Dense Tracking and Mapping (DTAM), Newcombe et al (2011)
      – Removed the need for PTAM as a feature-based front-end

2013  Handling pure camera rotation in keyframe-based SLAM, Pirchheim et al (2013)
      – Panoramic submaps (2D landmarks) to handle pure camera rotation
      – phonySIFT feature extraction and matching using hierarchical k-means

2014  Real-Time 6-DOF Monocular Visual SLAM in a large scale environment, Lim et al (2014)
      – Hybrid topological and metric map
      – Tracking, mapping, and loop closure all using the same binary descriptor

2015  Robust Large Scale monocular Visual SLAM, Bourmaud and Megret (2015)
      – Off-line method
      – Divides the map into submaps stored in a graph
      – Suggested a loop closure outlier detection mechanism in submaps
      – Employed a loopy belief propagation algorithm (LS-RSA)
Similar to the work of Newcombe et al (2011), Pretto et al (2011) modeled the environment as a 3D piecewise smooth surface and used a sparse feature-based front-end as a base for a Delaunay triangulation, fitting a mesh that is used to interpolate a dense reconstruction of the environment.

Pirker et al (2011) released CD SLAM in 2011 with the objectives of handling short- and long-term environmental changes and of handling mixed indoor/outdoor environments. To limit the map size and gain robustness against significant rotational changes, CD SLAM suggests the use of a modified Histogram of Oriented Cameras (HOC) descriptor (Pirker, 2010), with a GPU-accelerated descriptor update and a probabilistic weighting scheme to handle outliers. Furthermore, it suggests the use of large-scale nested loop closures with scale drift correction, and it provides a geometric adaptation to update the feature descriptors after loop closure. Keyframes are organized in an undirected, unweighted pose graph. Re-localization is performed using a non-linear least squares minimization initialized with the pose of the best matching candidate keyframe from the map, found through FABMAP, whereas loop closure takes place using pose graph optimization.
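The re-localization step just described, a non-linear least squares minimization seeded with the pose of a place-recognition candidate, can be illustrated with a generic reprojection-error refinement. The sketch below uses SciPy and is an assumption-laden stand-in, not CD SLAM's implementation.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def refine_pose(pose0, pts3d, pts2d, K):
    """Refine a 6-vector camera pose (rotation vector followed by
    translation) by minimizing reprojection error, starting from the pose
    of the best matching candidate keyframe (illustrative sketch)."""
    def residuals(p):
        R = Rotation.from_rotvec(p[:3]).as_matrix()
        cam = (R @ pts3d.T).T + p[3:]        # 3D points in the camera frame
        proj = (K @ cam.T).T
        proj = proj[:, :2] / proj[:, 2:3]    # pinhole projection
        return (proj - pts2d).ravel()        # stacked reprojection residuals
    return least_squares(residuals, pose0).x
```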
In 2013, RD SLAM (Tan et al, 2013) was released with the aim of handling occlusions and slowly varying, dynamic scenes. RD SLAM employs heavily parallelized, GPU-accelerated SIFT features and stores them in a KD-Tree (Bentley, 1975), which further accelerates feature matching based on the nearest neighbor of the queried feature in the tree. To cope with dynamic (moving) objects and slowly varying scenes, RD SLAM suggests a prior-based adaptive RANSAC scheme that samples the features in the current frame from which to estimate the camera pose based on the outlier ratio of features in previous frames, along with a landmark and keyframe culling mechanism that uses histograms of colors to detect and update changed image locations while sparing temporarily occluded landmarks.
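The benefit of such a prior on the outlier ratio is easiest to see through the standard RANSAC iteration-count relation (Hartley and Zisserman, 2003), sketched below; this is the textbook formula, not RD SLAM's exact scheme.

```python
import math

def ransac_iterations(outlier_ratio, sample_size=4, confidence=0.99):
    """Number of random samples needed to draw at least one all-inlier
    minimal sample with the requested confidence. Seeding `outlier_ratio`
    from previous frames, as described above, keeps this count low in
    benign scenes instead of budgeting for the worst case."""
    w = 1.0 - outlier_ratio      # probability that a single draw is an inlier
    if w >= 1.0:
        return 1                 # no outliers expected: one sample suffices
    if w <= 0.0:
        return float('inf')      # all outliers: no number of samples helps
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - w ** sample_size))
```

For instance, a 50% outlier ratio with a 4-point sample already requires 72 iterations at 99% confidence, while a 20% ratio requires only 9; adapting the budget to a per-frame prior therefore saves substantial computation.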
Pirchheim et al (2013) dealt with the problem of pure rotations in the camera motion by building local panorama maps whenever the system explores a new scene with pure rotational motion. The system extracts phonySIFT descriptors, as described in Wagner et al (2010), and establishes feature correspondences using an accelerated matching method based on hierarchical k-means. When insufficient 3D landmarks are observed during pose estimation, the system transitions into a rotation-only estimation mode and starts building a panorama map, until the camera observes part of the finite map again.
In the work of Lim et al (2014), the sought-after objective is to handle tracking, mapping, and loop closure, all using the same binary feature, through a hybrid (topological and metric) map representation. Whenever a loop is detected, the map is converted to its metric form, where a local Bundle Adjustment takes place before the map is returned to its topological form.
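The control flow of this hybrid representation can be caricatured with a small sketch; the class below is a toy built on an assumed `local_bundle_adjust` callback, not the authors' data structure.

```python
class HybridMap:
    """Toy sketch of a hybrid topological/metric map in the spirit of the
    approach described above (not the authors' implementation). The map
    normally lives in topological form (a graph of keyframes) and is
    temporarily converted to metric form when a loop is detected, so that
    a local bundle adjustment can run before converting back."""

    def __init__(self, keyframe_graph):
        self.graph = keyframe_graph   # keyframes plus connectivity edges
        self.metric = False           # topological form by default

    def on_loop_detected(self, local_bundle_adjust):
        self.metric = True            # instantiate metric poses and landmarks
        local_bundle_adjust(self.graph)
        self.metric = False           # fold back into topological form
```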
In 2015, Bourmaud and Megret (2015) released an off-line Visual SLAM system (requiring around two and a half hours for a dataset of 10,000 images). The system employs a divide-and-conquer strategy by segmenting the map into submaps. A similarity transform is estimated between each submap and its ten nearest neighbors. A global similarity transform, relating every submap to a single global reference frame, is computed by a pose graph optimization, where the reference frames are stored in a graph of submaps. The above procedure is susceptible to outliers in the loop detection module, hence the need for an efficient outlier handling mechanism. For this purpose, and to prevent outliers, temporally consecutive similarity measurements are always considered as inliers. The outlier rejection module then proceeds by integrating the similarities over the shortest loop it can find, and it monitors the closure error to accept the loop. To cope with a very large number of submaps, a loopy belief propagation algorithm cuts the main graph into subgraphs before a non-linear optimization takes place.

5 Conclusions

During the course of this work, we have outlined the building blocks of a generic Visual SLAM system, including data type, initialization, data association, pose estimation, map generation, map maintenance, failure recovery, and loop closure. We also discussed the details of the latest open-source state-of-the-art systems in Visual SLAM, including PTAM, SVO, DT SLAM, LSD SLAM, ORB SLAM, and DPPTAM. Finally, we compiled and summarized the added information that closed-source non-filter-based monocular Visual SLAM systems have to offer.

Although extensive research has been dedicated to this field, it is our opinion that each of the building blocks discussed above could benefit from improvements: robust data association against illumination changes, dynamic scenes, and occluded environments; an initialization method that can operate without an initial scene assumption or a large number of processed frames; an accurate camera pose estimate that is not affected by sudden movements, blur, noise, large depth variations, or moving objects; a map-making module capable of generating an efficient dense scene representation in regions of little texture; a map maintenance method that improves the map with resilience against dynamic, changing small- and large-scale environments; and a failure recovery procedure capable of reviving the system from significantly large changes in camera viewpoint. These are all desired properties that, unfortunately, most state-of-the-art systems lack, and they remain challenging topics in the field.

We are currently working on creating a set of experiments that cater to the requirements of each of the above open-source systems in terms of initialization, camera frame rate, and depth homogeneity. Upon completion of these experiments, we will benchmark all the open-source Visual SLAM systems to better identify the advantages and disadvantages of each system module. The long-term goal would then be to leverage all of the advantages to create a superior non-filter-based Visual SLAM system.

Acknowledgements This work was funded by the ENPI (European Neighborhood Partnership Instrument) grant # I-A/1.2/113, as well as by the Lebanese National Council for Scientific Research (LNCSR).

References

Alahi A, Ortiz R, Vandergheynst P (2012) Freak: Fast retina keypoint. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on
Fuentes-Pacheco J, Ruiz-Ascencio J, Rendón-Mancha JM (2012) Visual simultaneous localization and mapping: a survey. Artificial Intelligence Review 43(1):55–81, DOI 10.1007/s10462-012-9365-8, URL http://dx.doi.org/10.1007/s10462-012-9365-8

Galvez-López D, Tardos JD (2012) Bags of Binary Words for Fast Place Recognition in Image Sequences. Robotics, IEEE Transactions on 28(5):1188–1197, DOI 10.1109/TRO.2012.2197158

Glover A, Maddern W, Warren M, Reid S, Milford M, Wyeth G (2012) OpenFABMAP: An open source toolbox for appearance-based loop closure detection. In: 2012 IEEE International Conference on Robotics and Automation (ICRA), IEEE, pp 4730–4735

Grasa O, Bernal E, Casado S, Gil I, Montiel J (2014) Visual slam for handheld monocular endoscope. Medical Imaging, IEEE Transactions on 33(1):135–146, DOI 10.1109/TMI.2013.2282997

Hall BC (2015) Lie Groups, Lie Algebras, and Representations, Graduate Texts in Mathematics, vol 222, 2nd edn. Springer-Verlag

Harris C, Stephens M (1988) A combined corner and edge detector. In: Proc. of Fourth Alvey Vision Conference, pp 147–151

Hartley R, Zisserman A (2003) Multiple View Geometry in Computer Vision. Cambridge University Press

Hartmann J, Klussendorff JH, Maehle E (2013) A comparison of feature descriptors for visual SLAM. In: Mobile Robots (ECMR), 2013 European Conference on, pp 56–61, DOI 10.1109/ECMR.2013.6698820

Herrera D, Kannala J, Pulli K, Heikkila J (2014) DT-SLAM: Deferred Triangulation for Robust SLAM. In: 3D Vision, 2nd International Conference on, IEEE, vol 1, pp 609–616

Hietanen A, Lankinen J, Kämäräinen JK, Buch AG, Krüger N (2016) A comparison of feature detectors and descriptors for object class matching. Neurocomputing

Hochdorfer S, Schlegel C (2009) Towards a robust visual slam approach: Addressing the challenge of life-long operation. In: Advanced Robotics, 2009. ICAR 2009. International Conference on, pp 1–6

Holmes SA, Klein G, Murray DW (2008) A square root unscented kalman filter for visual monoslam. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(7):1251–1263, DOI 10.1109/TPAMI.2008.189

Horn BKP (1987) Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A 4(4):629–642

Jeong W, Lee KM (2005) Cv-slam: a new ceiling vision-based slam technique. In: Intelligent Robots and Systems, 2005. (IROS 2005). 2005 IEEE/RSJ International Conference on, pp 3195–3200, DOI 10.1109/IROS.2005.1545443

Klein G, Murray D (2007) Parallel Tracking and Mapping for Small AR Workspaces. In: 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pp 1–10

Klein G, Murray D (2008) Improving the Agility of Keyframe-Based SLAM. In: Proc. 10th European Conference on Computer Vision (ECCV), Marseille, pp 802–815

Kneip L, Siegwart R, Pollefeys M (2012) Finding the Exact Rotation between Two Images Independently of the Translation. In: Computer Vision – ECCV 2012, Springer Berlin Heidelberg, pp 696–709, DOI 10.1007/978-3-642-33783-3_50, URL http://dx.doi.org/10.1007/978-3-642-33783-3_50

Konolige K (2010) Sparse sparse bundle adjustment. In: Proceedings of the British Machine Vision Conference, BMVA Press, pp 102.1–102.11, DOI 10.5244/C.24.102

Kummerle R, Grisetti G, Strasdat H, Konolige K, Burgard W (2011) G2o: A general framework for graph optimization. In: Robotics and Automation (ICRA), IEEE International Conference on, IEEE, pp 3607–3613

Kwon J, Lee KM (2010) Monocular slam with locally planar landmarks via geometric rao-blackwellized particle filtering on lie groups. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp 1522–1529, DOI 10.1109/CVPR.2010.5539789

Lee SH (2014) Real-time camera tracking using a particle filter combined with unscented kalman filters. Journal of Electronic Imaging 23(1):013029, DOI 10.1117/1.JEI.23.1.013029, URL http://dx.doi.org/10.1117/1.JEI.23.1.013029

Lemaire T, Lacroix S (2007) Monocular-vision based slam using line segments. In: Robotics and Automation, 2007 IEEE International Conference on, pp 2791–2796, DOI 10.1109/ROBOT.2007.363894

Lepetit V, Moreno-Noguer F, Fua P (2009) EPnP: An Accurate O(n) Solution to the PnP Problem. International Journal of Computer Vision 81(2):155–166

Leutenegger S, Chli M, Siegwart RY (2011) Brisk: Binary robust invariant scalable keypoints. In: Computer Vision (ICCV), 2011 IEEE International Conference on, pp 2548–2555, DOI 10.1109/ICCV.2011.6126542

Lim H, Lim J, Kim HJ (2014) Real-time 6-DOF monocular visual SLAM in a large-scale environment. In: Robotics and Automation (ICRA), IEEE International Conference on, pp 1532–1539

Lim J, Frahm JM, Pollefeys M (2011) Online environment mapping. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp 3489–3496, DOI 10.1109/CVPR.2011.5995511

Lindeberg T (1998) Feature detection with automatic scale selection. Int J Comput Vision 30(2):79–116, DOI 10.1023/A:1008045108935, URL http://dx.doi.org/10.1023/A:1008045108935

Martin J, Crowley JL (1995) Experimental Comparison of Correlation Techniques. In: IAS-4, International Conference on Intelligent Autonomous Systems
how to correct it. CoRR abs/1409.2465, URL http://arxiv.org/abs/1409.2465

Rosten E, Drummond T (2006) Machine Learning for High-speed Corner Detection. In: Proceedings of the 9th European Conference on Computer Vision (ECCV'06), Volume Part I, Springer-Verlag, Berlin, Heidelberg, pp 430–443

Rublee E, Rabaud V, Konolige K, Bradski G (2011) ORB: An efficient alternative to SIFT or SURF. In: International Conference on Computer Vision (ICCV), pp 2564–2571

Scaramuzza D, Fraundorfer F (2011) Visual odometry [tutorial]. IEEE Robotics Automation Magazine 18(4):80–92, DOI 10.1109/MRA.2011.943233

Shi J, Tomasi C (1994) Good features to track. In: Computer Vision and Pattern Recognition, 1994. Proceedings CVPR '94., 1994 IEEE Computer Society Conference on, pp 593–600

Silveira G, Malis E, Rives P (2008) An efficient direct approach to visual slam. Robotics, IEEE Transactions on 24(5):969–979, DOI 10.1109/TRO.2008.2004829

Smith P, Reid I, Davison A (2006) Real-time monocular slam with straight lines. pp 17–26, URL http://hdl.handle.net/10044/1/5648

Strasdat H, Montiel J, Davison A (2010a) Scale drift-aware large scale monocular slam. In: Robotics: Science and Systems (RSS), The MIT Press, URL http://www.roboticsproceedings.org/rss06/

Strasdat H, Montiel JMM, Davison AJ (2010b) Real-time monocular SLAM: Why filter? In: Robotics and Automation (ICRA), IEEE International Conference on, pp 2657–2664

Strasdat H, Davison AJ, Montiel JMM, Konolige K (2011) Double Window Optimisation for Constant Time Visual SLAM. In: Proceedings of the International Conference on Computer Vision, IEEE Computer Society, Washington, DC, USA, ICCV '11, pp 2352–2359

Tan W, Liu H, Dong Z, Zhang G, Bao H (2013) Robust monocular SLAM in dynamic environments. In: 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp 209–218

Tomasi C, Kanade T (1991) Detection and Tracking of Point Features. Tech. rep., Carnegie Mellon University

Torr P, Fitzgibbon A, Zisserman A (1999) The Problem of Degeneracy in Structure and Motion Recovery from Uncalibrated Image Sequences. International Journal of Computer Vision 32(1):27–44

Torr PHS, Zisserman A (2000) MLESAC: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding 78(1):138–156

Wagner D, Reitmayr G, Mulloni A, Drummond T, Schmalstieg D (2010) Real-time detection and tracking for augmented reality on mobile phones. Visualization and Computer Graphics, IEEE Transactions on 16(3):355–368, DOI 10.1109/TVCG.2009.99

Weiss S, Achtelik MW, Lynen S, Achtelik MC, Kneip L, Chli M, Siegwart R (2013) Monocular vision for long-term micro aerial vehicle state estimation: A compendium. Journal of Field Robotics 30(5):803–831, DOI 10.1002/rob.21466, URL http://dx.doi.org/10.1002/rob.21466

Williams B, Reid I (2010) On combining visual slam and visual odometry. In: Proc. International Conference on Robotics and Automation

Yousif K, Bab-Hadiashar A, Hoseinnezhad R (2015) An overview to visual odometry and visual slam: Applications to mobile robotics. Intelligent Industrial Systems 1(4):289–311

Zhou H, Zou D, Pei L, Ying R, Liu P, Yu W (2015) Struct-slam: Visual slam with building structure lines. Vehicular Technology, IEEE Transactions on 64(4):1364–1375, DOI 10.1109/TVT.2015.2388780