0% found this document useful (0 votes)

40 views25 pages

A Survey On Non-Filter-Based Monocular Visual SLAM

A_survey_on_non-filter-based_monocular_Visual_SLAM A_survey_on_non-filter-based_monocular_Visual_SLAM A_survey_on_non-filter-based_monocular_Visual_SLAM

Uploaded by

zsong953

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

40 views25 pages

A Survey On Non-Filter-Based Monocular Visual SLAM

A_survey_on_non-filter-based_monocular_Visual_SLAM A_survey_on_non-filter-based_monocular_Visual_SLAM A_survey_on_non-filter-based_monocular_Visual_SLAM

Uploaded by

zsong953

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 25

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/304787224

A survey on non-ﬁlter-based monocular Visual SLAM systems

Article · July 2016

CITATIONS READS

35 3,071

3 authors:

Georges Younes Daniel Asmar

American University of Beirut American University of Beirut
18 PUBLICATIONS 140 CITATIONS 137 PUBLICATIONS 980 CITATIONS

SEE PROFILE SEE PROFILE

Elie Shammas
American University of Beirut
85 PUBLICATIONS 1,104 CITATIONS

SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Measurement of Soil Water Content View project

motion planning View project

All content following this page was uploaded by Daniel Asmar on 15 September 2016.

The user has requested enhancement of the downloaded file.

Noname manuscript No.
(will be inserted by the editor)

A survey on non-filter-based monocular Visual SLAM systems

Georges Younes · Daniel Asmar · Elie Shammas
arXiv:1607.00470v1 [cs.CV] 2 Jul 2016

Received: date / Accepted: date

Abstract Extensive research in the field of Visual SLAM cost and small size, cameras are frequently used in local-
for the past fifteen years has yielded workable systems that ization applications where weight and power consumption
found their way into various applications, such as robotics are deciding factors, such as for Unmanned Aerial Vehicles
and augmented reality. Although filter-based (e.g., Kalman (UAVs). Even though there are still many challenges facing
Filter, Particle Filter) Visual SLAM systems were com- camera-based localization, it is expected that such solutions
mon at some time, non-filter based (i.e., akin to SfM so- will eventually offer significant advantages over other types
lutions), which are more efficient, are becoming the de facto of localization techniques.
methodology for building a Visual SLAM system. This pa- Putting aside localization solutions relying on tracking
per presents a survey that covers the various non-filter based of markers or objects, camera-based localization can be
Visual SLAM systems in the literature, detailing the various broadly categorized into two approaches. In what is known
components of their implementation, while critically assess- as Image-Based Localization (IBL), the scene is processed
ing the specific strategies made by each proposed system in beforehand to yield its 3D structure, scene images and corre-
implementing its components. sponding camera viewpoints. The localization problem then
Keywords Visual SLAM · monocular · non-filter based. reduces to that of matching new query images to those in the
database and choosing the camera position that corresponds
to the best-matched image. In the second technique, no prior
1 Introduction information of the scene is given; rather, map building and
localization are concurrently done. Here we can incremen-
Localization solutions using a single camera have been gain- tally estimate camera pose—a technique known as Visual
ing considerable popularity in the past fifteen years. Cam- Odometry (VO) (Scaramuzza and Fraundorfer, 2011); or to
eras are ubiquitously found in hand-held devices such as reduce the considerable drift that is common in VO, we
phones and tablets and with the recent increase in aug- maintain a map and pose estimate of the camera throughout
mented reality applications, the camera is the natural sen- the journey. This is commonly referred to as Visual Simulta-
sor of choice to localize the user while projecting virtual neous Localization and Mapping (Visual SLAM). Although
scenes to him/her from the correct viewpoint. With their low all of the above camera-based techniques are equally impor-
tant, the subject of this paper is related to the topic of Visual
G. Younes
SLAM.
E-mail: [email protected]
Although a number of surveys for the general SLAM
D. Asmar
E-mail: [email protected]
problem exist in the literature, only a few exclusively han-
dle Visual SLAM. In 2012, Fuentes-Pacheco et al (2012)
E. Shammas
E-mail: [email protected]
published a general survey on Visual SLAM and but did not
delve into the details of the solutions put forward by dif-
Mechanical Engineering department
ferent people in the community. Also, subsequent to date
American University of Beirut their paper was published, almost thirteen new systems have
Beirut, Lebanon been proposed, with many of them introducing significant
contributions to Visual SLAM. In 2015, Yousif et al (2015)
2 Georges Younes et al.

Frame Poses Frame Poses

T0 T1 T2 T3 Tn T0 T1 T2 T3 Tn

Map Landmarks Map Landmarks

(a) Filter based systems (b) Non-filter based systems

Fig. 1 Data links in filter versus non-filter based Visual SLAM systems. Figure inspired by Strasdat et al (2010b)

also published a general survey on Visual SLAM that in- filter based Visual SLAM systems and finally Section 5 con-
cludes filter-based, non-filter based, and RGB-D systems. cludes the paper.
While filter-based Visual SLAM solutions were common
before 2010, most solutions thereafter designed their sys-
tems around a non-filter-based architecture. The survey of
Yousif et al. describes a generic Visual SLAM but lacks 2 Overview of contributions
focus on the details and problems of monocular non-filter
based systems. With the above motivations in mind, the pur- Visual SLAM solutions are either filter-based (e.g., Kalman
pose of this paper is to survey the state-of-the-art in non- filter, Particle filter) or non-filter-based (i.e., posing it as an
filter-based monocular based Visual SLAM systems. The optimization problem). Figure 1a shows the data links be-
description of the open-source systems will go beyond the tween different components of filter-type systems; the cam-
information provided in their papers, but also rely on the un- era pose Tn with the entire state of all landmarks in the map
derstanding and experience we gained modifying and apply- are tightly joined and need to be updated at every processed
ing their code in real settings. Unfortunately, for the closed frame. In contrast to non-filter-based systems (shown in Fig.
source systems we will have to suffice with the information 1b), where the data connections between different compo-
provided in their papers, as well as the insight acquired run- nents allow the pose estimate of the camera at Tn to be esti-
ning their executables (for those who provide them). mated using a subset of the entire map, without the need to
update the map’s data at every processed frame. As a con-
A survey as the one proposed in this paper is a valu- sequence of these differences, Strasdat et al. in 2010 proved
able tool for any user or researcher in camera-based local- that non-filter based methods outperform filter-based ones. It
ization. With the many new proposed systems coming out is therefore not surprising that since then, most new releases
every day, the information is daunting to the novice and one of Visual SLAM systems are non-filter-based (see Table 1).
is often perplexed as to which algorithm he/she should use. In this paper we will focus on analyzing only non-filter-
Furthermore, this paper should help researchers quickly pin- based techniques and for filter-based ones we will suffice
point the shortcomings of each of the proposed techniques on listing them.
and accordingly help them focus their effort on alleviating In 2007, Parallel Tracking and Mapping (Klein and Mur-
these weaknesses. ray, 2007) was released, and since then many variations
and modifications of it have been proposed, such as in Cas-
The remainder of the paper is structured as follows. Sec- tle et al (2008), Weiss et al (2013), and Klein and Murray
tion 2 reviews the historical evolution of Visual SLAM sys- (2008). PTAM was the first algorithm to successfully sep-
tems, from the time of Mono SLAM (Davison, 2003) to this arate tracking and mapping into two parallel computation
date. Section 3 describes the fundamental building blocks of threads that run simultaneously and share information when-
a Visual SLAM system and critically evaluates the differ- ever necessary. This separation made the adaptation of off-
ences in the proposed open-source solutions; namely in the line Structure from Motion (SfM) methods possible within
initialization, measurement and data association, pose esti- PTAM in a real-time performance. Its ideas were revolution-
mation, map generation, map maintenance, failure recovery, ary in the monocular visual SLAM community, and the no-
and loop closure. Section 4 summarizes closed source non- tion of separation between tracking and mapping became the
A survey on non-filter-based monocular Visual SLAM systems 3

Table 1 List of different visual SLAM system. Non-filter-based approaches are highlighted in a gray color.

Year Name Method Type Reference

2003 Real-time simultaneous localization and mapping with a single camera filter indirect Davison (2003)
2004 Simultaneous localization and mapping using multiple view feature descrip- filter indirect Meltzer et al (2004)
tors
2004 Real-Time 3D SLAM with Wide-Angle Vision filter indirect Davison et al (2004)
2005 Real-Time Camera Tracking Using a Particle Filter filter indirect Pupilli and Calway (2005)
2005 CV-SLAM filter indirect Jeong and Lee (2005)
2006 Real-time Localization and 3D Reconstruction non-filter indirect Mouragnon et al (2006)
2006 Scalable Monocular SLAM filter indirect Eade and Drummond (2006)
2006 Real-Time Monocular SLAM with Straight Lines filter indirect Smith et al (2006)
2007 Monocular-vision based SLAM using Line Segments filter indirect Lemaire and Lacroix (2007)
2007 MonoSLAM filter indirect Davison et al (2007)
2007 Parallel Tracking and Mapping (PTAM) non-filter indirect Klein and Murray (2007)
2007 Monocular SLAM as a Graph of Coalesced Observations filter indirect Eade and Drummond (2007)
2007 Mapping Large Loops with a Single Hand-Held Camera filter indirect Clemente et al (2007)
2007 Dimensionless Monocular SLAM filter indirect Civera et al (2007)
2008 A Square Root UKF for visual monoSLAM filter indirect Holmes et al (2008)
2008 An Efficient Direct Approach to Visual SLAM non-filter direct Silveira et al (2008)
2008 Efficient View-Based SLAM Using Visual Loop Closures filter indirect Mahon et al (2008)
2008 Large-Scale SLAM Building Conditionally Independent Local Maps: Ap- filter indirect Pinies and Tardos (2008)
plication to Monocular Vision
2009 Towards a robust visual SLAM approach: Addressing the challenge of life- filter indirect Hochdorfer and Schlegel (2009)
long operation
2009 Use a Single Camera for Simultaneous Localization And Mapping with Mo- filter indirect Migliore et al (2009)
bile Object Tracking in dynamic environments
2010 On Combining Visual SLAM and Visual Odometry filter indirect Williams and Reid (2010)
2010 Scale Drift-Aware Large Scale Monocular SLAM non-filter indirect Strasdat et al (2010a)
2010 Live dense reconstruction with a single moving camera non-filter hybrid Newcombe and Davison (2010)
2010 Monocular SLAM with locally planar landmarks via geometric rao- filter indirect Kwon and Lee (2010)
blackwellized particle filtering on Lie groups
2011 Dense Tracking and Mapping (DTAM) non-filter direct Newcombe et al (2011)
2011 Omnidirectional dense large-scale mapping and navigation based on mean- non-filter direct Pretto et al (2011)
ingful triangulation
2011 Continuous localization and mapping in a dynamic world (CD SLAM) non-filter indirect Pirker et al (2011)
2011 Online environment mapping non-filter indirect Lim et al (2011)
2011 Homography-based planar mapping and tracking for mobile phones non-filter indirect Pirchheim and Reitmayr (2011)
2013 Robust monocular SLAM in Dynamic environments (RD SLAM) non-filter indirect Tan et al (2013)
2013 Handling pure camera rotation in keyframe-based SLAM (Hybrid SLAM) non-filter indirect Pirchheim et al (2013)
2013 Monocular Vision SLAM for Indoor Aerial Vehicles filter indirect Celik and Somani (2013)
2014 Visual SLAM for Handheld Monocular Endoscope filter indirect Grasa et al (2014)
2014 Real-time camera tracking using a particle filter combined with unscented filter indirect Lee (2014)
Kalman filters
2014 Semi-direct Visual Odometry (SVO) non-filter hybrid Forster et al (2014)
2014 Large Scale Direct monocular SLAM (LSD SLAM) non-filter direct Engel et al (2014)
2014 Deferred Triangulation SLAM (DT SLAM) non-filter indirect Herrera et al (2014)
2014 Real-Time 6-DOF Monocular Visual SLAM in a Large Scale Environment non-filter indirect Lim et al (2014)
2015 StructSLAM: Visual SLAM With Building Structure Lines filter indirect Zhou et al (2015)
2015 Robust large scale monocular Visual SLAM non-filter indirect Bourmaud and Megret (2015)
2015 ORB SLAM non-filter indirect Mur-Artal et al (2015)
2015 Dense Piecewise Parallel Tracking and Mapping (DPPTAM) non-filter direct Concha and Civera (2015)
4 Georges Younes et al.

standard backbone of almost all visual SLAM algorithms In 2015, ORB SLAM (Mur-Artal et al, 2015) was re-
thereafter. leased as an indirect Visual SLAM system. It divides the
Visual SLAM problem into three parallel threads, one for
In 2014, SVO (Forster et al, 2014) was published as an
tracking, one for mapping, and a third for map optimiza-
open-source implementation of a hybrid system that em-
tion. The main contributions of ORB SLAM are the us-
ploys both direct and indirect methods in its proposed solu-
age of ORB features (Rublee et al, 2011) in real-time, a
tion for solving the Visual SLAM task. Unlike PTAM, SVO
model based initialization as suggested by Torr et al (1999),
requires a high frame rate camera. SVO was designed with
re-localization with invariance to viewpoint changes (Mur-
the concern of operating on high-end platforms as well as
Artal and Tardós, 2014), a place recognition module using
computationally-limited ones such as on-board hardware of
bags of words to detect loops, and covisibility and Essential
a generic Micro Aerial Vehicle (MAV). To achieve such re-
graph optimization.
silience, SVO offers two default configurations, one opti-
In late 2015, short for Dense Piecewise Parallel track-
mized for speed and the other for accuracy.
ing and Mapping (DPPTAM) (Concha and Civera, 2015),
Also in 2014, Large Scale Direct monocular SLAM was released as a semi-dense direct method similar to LSD
(LSD SLAM) (Engel et al, 2014) was released as an open- SLAM, to solve for the Visual SLAM task. A key contribu-
source adaptation of the visual odometry method proposed tion of DPPTAM’s adaptation of LSD SLAM, is the added
in Engel et al (2013). LSD SLAM employs an efficient prob- third parallel thread that performs dense reconstructions us-
abilistic direct approach to estimate semi-dense maps to be ing segmented super-pixels from indoor planar scenes.
used with an image alignment scheme to solve the SLAM
task. In contrast to other methods that use bundle adjust-
ment, LSD SLAM employs a pose graph optimization over 3 Design of Visual SLAM systems
Sim(3) as in Kummerle et al (2011), which explicitly rep-
In an effort to better understand Visual SLAM state-of-
resents the scale in the system, allowing for scale drift cor-
the-art implementations, this section provides a look un-
rection and loop closure detection in real-time. A modified
der the hood of the most successful and recent open-source
version of LSD SLAM was later showcased running on a
non-filter-based Visual SLAM systems. More specifically,
mobile platform and another adaptation of the system was
our discussion will be based on information extracted from
presented in Engel et al (2015) for a stereo camera setup.
PTAM, SVO, DT SLAM, LSD SLAM, ORB SLAM, and
LSD SLAM employs three parallel threads after initializa-
DPPTAM. A generic non-filter Visual SLAM system is con-
tion takes place: tracking, depth map estimation, and map
cerned with eight main components (Fig. 2); namely (1) in-
optimization.
put data type, (2) data association, (3) initialization, (4) pose
In late 2014, short for Deferred triangulation SLAM, estimation, (5) map generation, (6) map maintenance, (7)
DT SLAM (Herrera et al, 2014) was released as an indirect failure recovery, and (8) loop closure.
method. Similar to other algorithms, it divides the Visual
SLAM task into three parallel threads: tracking, mapping,
and bundle adjustment. One of the main contributions in DT
SLAM is its ability to estimate the camera pose from 2D Data type Initialisation
and 3D features in a unified framework and suggests a bun-
dle adjustment that incorporates both types of features. This
gives DT SLAM robustness against pure rotational move- Data Pose
ments. Another characteristic of the system is its ability to association estimation
handle multiple maps with undefined scales and merge them
together once a sufficient number of 3D matches are estab- Map Map
lished. In DT SLAM, no explicit initialization procedure is maintenance generation
required since it is embedded in the tracking thread; fur-
thermore, it is capable of performing multiple initializations Failure Loop
whenever tracking is lost. Since initialization is done auto- recovery closure
matically whenever the system is lost, data can still be col-
lected and camera tracking functions normally, albeit at a
Fig. 2 Eight components of a non-filter-based Visual SLAM system.
different scale. This ability to re-initialize local sub-maps re-
duces the need for re-localization procedures. Once a suffi-
cient number of correspondences between keyframes resid- In the following sections, we will detail each of these
ing in separate sub-maps are found, the sub-maps are fused components and critically assess how each Visual SLAM
into a single map with a uniform scale throughout. implementation addressed them. It is noteworthy to first
A survey on non-filter-based monocular Visual SLAM systems 5

mention that all discussed systems implicitly assume the in- Visual SLAM applications of direct methods, until recently,
trinsic parameters are known based on an off-line calibration were not considered feasible. With the recent advancements
step. in parallelized processing, adaptations of direct methods
were integrated within a Visual SLAM context (Concha and
Civera, 2015; Engel et al, 2015; Forster et al, 2014).
3.1 Input data type
3.1.2 Indirect methods
Vision SLAM methods are categorized as being direct, in-
direct, or a hybrid of both. Direct methods—also known Indirect methods rely on features for matching. On one
as dense or semi-dense methods—exploit the information hand, features are expected to be distinctive and invariant
available at every pixel in the image (brightness values) to to viewpoint and illumination changes, as well as resilient
estimate the parameters that fully describe the camera pose. to blur and noise. On the other hand, it is desirable for fea-
On the other hand, indirect methods were introduced to re- ture extractors to be computationally efficient and fast. Un-
duce the computational complexity of processing each pixel; fortunately, such objectives are hard to achieve at the same
this is achieved by using only salient image locations (called time and a trade-off between computational speed and fea-
features) in the pose estimation calculations (see Fig. 3). ture quality is required.
The computer vision community had developed over
3.1.1 Direct methods decades of research many different feature extractors and
descriptors, each exhibiting varying performances in terms
The basic underlying principle for all direct methods is of rotation and scale invariance, as well as speed. The selec-
known as the brightness consistency constraint and is best tion of an appropriate feature detector depends on the plat-
described as: form’s computational power, the environment in which the
Visual SLAM algorithm is due to operate in, as well as its
J(x, y) = I(x + u(x, y) + v(x, y)), (1)
expected frame rate. Feature detector examples include Hes-
where x and y are pixel coordinates; u and v denotes dis- sian corner detector (Beaudet, 1978), Harris detector (Harris
placement functions of the pixel (x, y) between two images and Stephens, 1988), Shi-Tomasi corners (Shi and Tomasi,
I and J of the same scene. Every pixel in the image provides 1994), Laplacian of Gaussian detector (Lindeberg, 1998),
one brightness constraint; however, it adds two unknowns MSER (Matas et al, 2002), Difference of Gaussian (Lowe,
(u and v) and hence the system becomes under-determined 2004) and the accelerated segment test family of detectors
with n equations and 2n unknowns (where n is the num- (FAST, AGAST, OAST) (Mair et al, 2010).
ber of pixels in the image). To render (1) solvable, Lucas To minimize computational requirements, most indirect
& Kanade (Lucas and Kanade, 1981) suggested in 1981, in systems use FAST (Rosten and Drummond, 2006) as a
what they referred to as Forward Additive Image Alignment feature extractor, coupled with a feature descriptor to be
(FAIA), to replace all the individual pixel displacements u able to perform data association. Feature descriptors in-
and v by a single general motion model, in which the num- clude and are not limited to BRIEF (Calonder et al, 2012),
ber of parameters is dependent on the implied type of mo- BRISK (Leutenegger et al, 2011), SURF (Bay et al, 2008),
tion. FAIA iteratively minimize the squared pixel-intensity SIFT (Lowe, 1999), HoG (Dalal and Triggs, 2005), FREAK
difference between a template and an input image by chang- (Alahi et al, 2012), ORB (Rublee et al, 2011) and a low level
ing the transformation parameters. Since that time and to re- local patch of pixels. Further information regarding feature
duce computational complexity, other variants of the FAIA extractors and descriptors are outside the scope of this work,
were suggested such as FCIA (Forward Compositional Im- but the reader can refer to Hartmann et al (2013), Moreels
age Alignment), ICIA (Inverse Compositional Image Align- and Perona (2007), Rey-Otero et al (2014), or Hietanen et al
ment) and IAIA (Inverse Additive Image Alignment) (Baker (2016) for the most recent comparisons.
and Matthews, 2004).
Direct methods exploit all information available in the 3.1.3 Hybrid methods
image and are therefore more robust than indirect methods
in regions with poor texture. Nevertheless, direct methods Different from the direct and indirect methods, systems such
are susceptible to failure when scene illumination changes as SVO are considered hybrids, which use a combination
occur as the minimization of the photometric error be- of direct methods to establish feature correspondences and
tween two frames relies on the underlying assumption of indirect methods to refine the camera pose estimates.
the brightness consistency constraint (1). A second disad- Table 2 summarizes the data types used by the selected
vantage is that the calculation of the photometric error at ev- Visual SLAM systems. From the list of open-source indi-
ery pixel is computationally intensive; therefore, real-time rect methods surveyed in this paper, PTAM, SVO and DT
6 Georges Younes et al.

T,R T,R

(a) direct methods using all information of (b) indirect methods use the features of the
the triangle to match to a query image. triangle to match to the features of a query
image.
Fig. 3 Data types used by a Visual SLAM system.

SLAM use FAST features (Rosten and Drummond, 2006), of descriptors used: for the local patch of pixels descriptor,
while ORB SLAM uses ORB features (Rublee et al, 2011). it is typical to use the sum of squared difference (SSD),
or to increase robustness against illumination changes, a
Zero-Mean SSD score (ZMSSD) is used (Jérôme Martin,
1995). For higher order feature descriptors such as ORB,
Table 2 Method used by different Visual SLAM systems. Abbrevia-
tions used: indirect (i), direct (d), and hybrid (h) SIFT, and SURF, the L1-norm, L2-norm, or Hamming dis-
tances may be used; however, establishing matches using
DT LSD ORB these measures is computationally intensive and may de-
PTAM SVO DPPTAM
SLAM SLAM SLAM
grade real-time operations if not carefully applied. For such
Method I H I D I D
a purpose, special implementations that sort and perform
feature matching in KD trees or bags of words, are usually
employed. Examples include the works of Muja and Lowe
(2009), and Galvez-López and Tardos (2012).

3.2 Data association

3.2.2 3D-2D
Data association is defined as the process of establishing Figure 4 represents the 3D-2D data association problem,
measurement correspondences across different images to where the previous camera pose estimate and the 3D struc-
serve as inputs to other Visual SLAM modules. This step ture are known and one seeks to estimate correspondences
is inherent in systems that employ direct methods and hence between 2D features and the projection of 3D landmarks
the discussion on data association for direct methods will be onto a newly acquired frame without the knowledge of the
included in the camera pose section below. exact motion between the frames. This type of data asso-
To establish feature correspondences in indirect meth- ciation is typically used during the regular pose estimation
ods, three types of data association can be distinguished: phase of Visual SLAM.
2D-2D, 3D-2D and 3D-3D.
3.2.3 3D-3D
3.2.1 2D-2D
To estimate and correct accumulated drift along loops, 3D-
When a map is not available, and neither the camera trans- 3D data association is required, and for such a purpose, de-
formation between the two frames nor the scene structure scriptors of 3D landmarks that are observed in both frames
is available, 2D-2D data association is used. To reduce the are used to establish matches among the landmarks that are
computation time and avoid the possibility of erroneous data then exploited–as explained in (Horn, 1987)–to yield a sim-
association, the feature’s 2D location in the first image is ilarity transform between both frames. Table 3 summarizes
used to define a search window in the second image. Each the feature types and descriptors employed by various open-
feature has associated with it a descriptor, which can be used source Visual SLAM systems.
to provide a quantitative measure of similarity to other fea- PTAM. After a successful initialization of PTAM, a 4
tures. The descriptor distance function varies with the type level pyramid representation of every incoming frame is
A survey on non-filter-based monocular Visual SLAM systems 7

Table 3 Feature extractors and descriptors. Abbreviations:local patch of pixels (L.P.P.)

PTAM SVO DT SLAM LSD SLAM ORB SLAM DPPTAM

Feature type FAST FAST FAST None FAST None
Feature descriptor L.P.P. L.P.P. L.P.P. L.P.P. ORB L.P.P.

SVO. SVO generates a five level pyramid representation

X of the incoming frame: data association is first established
through an iterative direct image alignment scheme start-
ing from the highest pyramid level up till the third. Prelim-
I1 inary data association from this step is used as a prior to a
I2
x1 x2? FAST feature matching procedure, similar to PTAM’s warp-
ing technique, with a Zero-Mean SSD score.
ORB SLAM. ORB SLAM extracts FAST corners
c2
throughout eight pyramid levels. To ensure a homogeneous
c1
n.a. distribution along the entire image, each pyramid level is di-
vided into cells and the parameters of the FAST detector are
T,R tuned on-line to ensure a minimum of five corners are ex-
tracted per cell. A 256-bit ORB descriptor is then computed
Fig. 4 3D-2D Data association problem (n.a. stands for not available for each extracted feature. The higher order feature descrip-
during the task). tor ORB is used to establish correspondences between fea-
tures. ORB SLAM discretizes and stores the descriptors into
bags of words, known as visual vocabulary (Galvez-López
generated (e.g., level 1: 640x480, level 2: 320x240). The and Tardos, 2012), which are used to speed up image and
pyramid levels are used to make features more robust against feature matching by constraining those features that belong
scale changes and to decrease the convergence radius of the to the same node in the vocabulary tree.
pose estimation module, as shall be described later. FAST
features are extracted at each level and a Shi-Tomasi score
(Shi and Tomasi, 1994) for each feature is estimated; fea- X?
tures having a Shi-Tomasi score below a threshold are re-
moved before non-maximum suppression takes place. This
is done to ensure high saliency of the extracted features I1
I2
and limit their numbers–in order to remain computationally x1 x2

tractable.
Each pyramid levels has a different threshold for Shi- c2
c1
Tomasi score selection and non-maximum suppression; ?
thereby giving control over the strength and the number of T,R
features to be tracked across the pyramid levels. 3D land-
marks are then projected onto the new frame using a pose Fig. 5 Initialization required by any Visual SLAM system.
estimate prior and in a similar manner to the 2D-2D meth-
ods, feature correspondences are established within a search
window surrounding the projected landmark location. The
descriptors used for feature matching of the 3D landmarks 3.3 Initialization
are usually extracted from the 2D image from which the 3D
landmark was first observed; however some systems pro- Monocular Visual SLAM systems require an initialization
pose to update this descriptor as the camera view point ob- phase, during which both a map of 3D landmarks and the
serving it changes significantly or, in the case of local patch starting camera poses are generated. To do so, the same
of pixels are warped to virtually account for view point scene must be observed through at least two viewpoints sep-
changes. arated by a baseline. Figure 5 represents the initialization
DT SLAM. In a similar scheme to PTAM, DT SLAM that is required in any Visual SLAM system, where only as-
employs the same mechanism to establish 2D-3D feature sociated data between two images is known and both the
matches. initial camera pose and the scene’s structure are unknown.
8 Georges Younes et al.

Process Keyframe: For every Extract and Is 2nd

1st Keyframe
Extract features new frame match Features Keyframe?

T
Triangulation of Homography /
Camera pose
initial 3D Fundamental estimation
recovery
Landmarks and decomposition

Fig. 6 Generic initialization pipeline.

Different solutions to this problem were proposed by differ- erate a Homography relating both keyframes and uses inliers
ent people. to refine it before decomposing it (as described in (Faugeras
In early Visual SLAM systems such as in MonoSLAM and Lustman, 1988)) into eight possible solutions. The cor-
(Davison et al, 2007), system initialization required the cam- rect pair of camera poses is chosen such that all triangulated
era to be placed at a known distance from a planar scene 3D points do not generate unreal configurations (negative
composed of four corners of a two dimensional square, and depths in both frames).
SLAM was initialized with the distance separating the cam- The generated initial map is scaled such as the estimated
era from the square keyed in by the operator. translation between the first two keyframes corresponds to
PTAM. Figure 6 shows the flowchart of a generic 0.1 units, before a structure-only BA (optimize only the 3D
model-based initialization procedure, such as the one em- poses of the landmarks) step takes place. The mean of the
ployed in PTAM, SVO and ORB SLAM. To eliminate the 3D landmarks is selected to serve as the world coordinate
obligation of a user’s manual input of depth, PTAM’s (Klein frame, while the positive z-direction is chosen such as the
and Murray, 2007) initial release suggested the usage of the camera poses reside along its positive side.
five-point algorithm (Nistér, 2004) to estimate and decom- PTAM’s initialization procedure is brittle and remains
pose a Fundamental matrix into an assumed non-planar ini- tricky to perform, especially for inexperienced users. Fur-
tial scene. PTAM’s initialization was later changed to the us- thermore, it is subject to degeneracies when the planarity of
age of a Homography (Faugeras and Lustman, 1988), where the initial scene’s assumption is violated or when the user’s
the scene is assumed to be composed of 2D planes. PTAM’s motion is inappropriate, crashing the system, without means
initialization requires the user’s input twice to capture the of detecting such degeneracies.
first two keyframes in the map; furthermore, it requires SVO. Similarly, Forster et al (2014) adopted in SVO, a
the user to perform, in between the first and the second Homography for initialization, however, SVO requires no
keyframe, a slow, smooth and relatively significant transla- user input and the algorithm uses at startup the first ac-
tional motion parallel to the observed scene. quired acquired keyframe; it extracts FAST features and
FAST Features extracted from the first keyframe are tracks them with an implementation of KLT (Tomasi and
tracked in a 2D-2D data association scheme in each incom- Kanade, 1991) (variant of direct methods) across incoming
ing frame until the user flags the insertion of the second frames. To avoid the need for a second input by the user,
keyframe. As the matching procedure takes place through SVO monitors the median of the baseline of the features
the ZMSSD without warping the features, establishing cor- tracked between the first keyframe and the current frame;
rect matches is susceptible to both motion blur and signif- whenever this value reaches a certain threshold, the algo-
icant appearance changes of the features caused by camera rithm assumes enough parallax has been achieved and sig-
rotations; hence the strict requirements on the user’s motion nals the Homography estimation to start. The Homography
during the initialization. is then decomposed; the correct camera poses are then se-
To ensure minimum false matches, the features are lected and the landmarks corresponding to inlier matches
searched for twice; once from the current frame to the pre- are triangulated and used to estimate an initial scene depth.
vious frame and a second time in the opposite direction. If Bundle Adjustment takes place for the two frames and all
the matches in both directions are not coherent, the feature is their associated landmarks, before the second frame is used
discarded. Since PTAM’s initialization employs a Homogra- as a second keyframe and passed to the map management
phy estimation, the observed scene during the initialization thread.
is assumed to be planar. Once the second keyframe is suc- As is the case in PTAM, the initialization of SVO re-
cessfully incorporated into the map, a MLESAC (Torr and quires the same type of motion and is prone to sudden move-
Zisserman, 2000) loop uses the established matches to gen- ments as well as to non-planar scenes; furthermore, monitor-
A survey on non-filter-based monocular Visual SLAM systems 9

ing the median of the baseline between features is not a good

X
approach to automate the initial keyframe pair selection as it
is prone to failure against degenerate cases, with no means
of detecting them.
DT SLAM. DT SLAM does not have an explicit ini- x1 x2
tialization phase; rather, it is integrated within its tracking
module as an Essential matrix estimation method.
Table 4 summarizes the initialization methods employed c1 c2
?
by different Visual SLAM systems along with the assump-
tion they make about the configuration of the scene at startup T,R
of the system. With the exception of MonoSLAM, all the
suggested methods described above suffer from degenera- Fig. 7 Pose estimation required by any Visual SLAM system
cies when subjected to certain scenes; namely under low-
parallax movements of the camera or when the scene’s struc-
quickly discarded by the system and it restarts with a dif-
ture assumption for the corresponding method–Fundamental
ferent pair of frames. It is noteworthy to mention that the
matrix’s assumption for general non-planar scenes or the
relationship between image coordinates and corresponding
Homography’s assumption of planar scenes–is violated.
3D point coordinates in all the listed initialization methods,
LSD SLAM. To address this issue, Engel et al (2014) aside that of monoSLAM, can only be determined up to an
suggested in LSD SLAM, a randomly initialized scene’s unknown scale λ .
depth from the first viewpoint, that is later refined through
measurements across subsequent frames. LSD SLAM uses
an initialization method that does not require two view ge- 3.4 Pose estimation
ometry. Instead of tracking features across two frames, as
the other systems do, LSD SLAM initialization procedure Because data association is computationally heavy, most Vi-
takes place on a single frame; pixels of interest (i.e., image sual SLAM systems assume, for the pose of each new frame,
locations that have high intensity gradients) are initialized a prior, which guides and limits the amount of work required
into the system with a random depth distribution and a large for data association. Estimating this prior is generally the
variance. Tracking starts directly as image alignment takes first task in pose estimation. Figure 7 depicts the pose es-
place between the first initialized keyframe and preceeding timation problem; a map and data association between two
frames. Using the incoming frames and until convergence, frames are known and one seeks to estimate the pose of the
the depth measurements of the initialized features are re- second frame, given the pose of the first. PTAM, DT SLAM,
fined using a filter-based scheme. This method does not suf- ORB SLAM, and DPPTAM employ a constant velocity mo-
fer from the degeneracies of two view geometry methods; tion model that assumes a smooth camera motion and uses
however, depth estimation requires a relatively large number the pose changes across the two previously tracked frames
of processed frames before convergence takes place, result- to estimate the prior for the current frame. Unfortunately,
ing in an intermediate tracking phase where the generated such a model is prone to failure when sudden change in
map is not reliable. direction of the camera’s motion occurs. LSD SLAM and
DPPTAM. DPPTAM, Concha and Civera (2015) bor- SVO assume no significant change in the camera pose be-
rows from LSD SLAM its initialization procedure, and tween consecutive frames (such as the case in high frame
therefore it also suffers from the random depth initializa- rate cameras), and hence they assign the prior for the pose
tion symptoms, where several keyframes must be added to of the current frame to be the same as the previously tracked
the system before it reaches a stable configuration. one.
To deal with the limitations arising from all the above Figure 8 presents a generic pipeline for pose estimation.
methods, Mur-Artal et al (2015) suggested to compute in The pose of the prior frame is used to guide the data as-
parallel, both a Fundamental matrix and a Homography (in sociation procedure in several ways. It helps determine a
a RANSAC scheme), while penalizing each model accord- potentially visible set of features from the map in the cur-
ing to its symmetric transfer error (Hartley and Zisserman, rent frame, thereby reducing the computational expense of
2003), in order to select the appropriate model. Once this blindly projecting the entire map. Furthermore, it helps es-
is done, appropriate decomposition takes place and both the tablish an estimated feature location in the current frame,
scene structure and the camera poses are recovered, before such that feature matching takes place in small search re-
a bundle adjustment step optimizes the map. If the cho- gions, instead of across the entire image. Finally, it serves
sen model yields poor tracking quality and few feature cor- as a starting point for the minimization procedure, which re-
respondences in the upcoming frame, the initialization is fines the camera pose.
10 Georges Younes et al.

Table 4 Initialization. Abbreviations used: homography decomposition (h.d.), Essential decomposition (e.d.), random depth initialization (r.d.),
planar (p), non-planar (n.p.), no assumption (n.a.)

DT LSD ORB
PTAM SVO DPPTAM
SLAM SLAM SLAM
Initialization h.d. h.d. e.d. r.d. h.d.+e.d. r.d.
Initial scene assumption p p n.p. n.a. n.a. n.a.

Pyramid levels\ T
Process New Update Map Send to
features\ Regions Failure Is
MAP Frame maintenance Mapping
of interest Recovery Keyframe?
data thread

Previous Data association

Sample Monitor Data Is Data T
frame’s between Frame Iterative pose
landmarks association association
pose & Map/previous optimization
from Map quality quality good?
estimate frame

Fig. 8 Generic pose estimation pipeline.

Direct and indirect methods estimate the camera pose by PTAM. PTAM represents the camera pose as an SE(3)
minimizing a measure of error between frames; direct meth- transformation (Hall, 2015) that can be minimally repre-
ods measure the photometric error and indirect methods es- sented by six parameters. The mapping from the full SE(3)
timate the camera pose by minimizing the re-projection er- transform to its minimal representation Sξ (3) and vice versa
ror of landmarks from the map over the frame’s prior pose. can be done through logarithmic and exponential mapping
The re-projection error is formulated as the distance in pix- in Lie algebra. The minimally represented Sξ (3) transform
els between a projected 3D landmark onto the frame using is of great importance as it reduces the number of param-
the prior pose and its found 2-D position in the image. eters to optimize from twelve to six, leading to significant
Note in Fig. 9 how camera pose estimation takes place. speedups in the optimization process.
The motion model is used to seed the new frame’s pose In PTAM, the pose estimation procedure first starts by
at Cm , and a list of potentially visible 3D landmarks from estimating a prior to the frame’s pose using the constant
the map are projected onto the new frame. Data association velocity motion model. The prior is then refined, using a
takes place in a search window Sw surrounding the loca- Small Blurry Image (SBI) representation of the frame, by
tion of the projected landmarks. The system then proceeds employing an Efficient Second Order minimization (Benhi-
by minimizing the re-projection error d over the parame- mane and Malis, 2007). The velocity of the prior is defined
ters of the rigid body transformation. To gain robustness as the change between the current estimate of the pose and
against outliers (wrongly associated features), the minimiza- the previous camera pose. If the velocity is high, PTAM an-
tion takes place over an objective function that penalizes fea- ticipates a fast motion is taking place and hence the presence
tures with large re-projection errors. of motion blur; to counter failure from motion blur, PTAM
restricts tracking to take place only at the highest pyramid
levels (most resilient to motion blur) in what is known as
X
a coarse tracking stage only; otherwise the coarse tracking
stage is followed by a fine tracking stage. However, when
the camera is stationary, the coarse stage may lead to jitter-
x1 ing of the camera’s pose–hence it is turned off.
𝑥"
The minimally represented initial camera pose prior is
then refined by minimizing the tukey-biweight (Moranna
et al, 2006) objective function of the re-projection error that
c1
c2 cm down-weights observations with large error. If fine tracking
R,T is to take place, features from the lowest pyramid levels are
selected and a similar procedure to the above is repeated.
Rm ,Tm To determine the tracking quality, the pose estimation
Fig. 9 Generic pose estimation procedure. Cm is the new frame’s pose thread in PTAM monitors the ratio of successfully matched
estimated by the motion model and C2 is the actual camera pose. features in the frame against the total number of attempted
A survey on non-filter-based monocular Visual SLAM systems 11

Table 5 Pose estimation. Abbreviations are as follows: constant velocity motion model (c.v.m.m), same as previous pose (s.a.p.p.), similarity
transform with previous frame (s.t.p.f.), optimization through minimization of features (o.m.f.), optimization through minimization of photometric
error (o.m.p.e.), Essential matrix decomposition (E.m.d.), pure rotation estimation from 2 points (p.r.e.), significant pose change (s.p.c.), significant
scene appearance change (s.s.a.c)

LSD ORB
PTAM SVO DT SLAM DPPTAM
SLAM SLAM
Motion prior c.v.m.m. s.a.p.p. s.t.p.t. s.a.p.p. c.v.m.m. c.v.m.m.
+ESM or place or s.a.p.p.
recogn.
Tracking o.m.f. o.m.p.e. 3 modes: o.m.p.e. o.m.f. o.m.p.e.
1-e.m.d.;
2-o.m.f.;
3-p.r.e.
keyframe add criterion s.p.c. s.p.c. s.s.a.c. s.p.c. s.s.a.c. s.p.c.

feature matches. If the tracking quality is questionable, the Bundle Adjustment then takes place, followed by a structure
tracking thread operates normally but no keyframes are ac- only Bundle Adjustment that refines the 3D location of the
cepted by the system. If the tracker’s performance is deemed landmarks based on the refined camera pose of the previous
bad for 3 consecutive frames, then the tracker is considered step.
lost and failure recovery is initiated. Finally, a joint (pose and structure) local bundle adjust-
Table. 5 summarizes the pose estimation methods used ment fine tunes the reported camera pose estimate. During
by different Visual SLAM systems. this pose estimation module, the tracking quality is continu-
SVO. SVO uses a sparse model-based image alignment ously monitored; if the number of observations in a frame is
in a pyramidal scheme in order to estimate an initial cam- a below a certain threshold or if the number of features be-
era pose estimate. It starts by assuming the camera pose at tween consecutive frames drops drastically, tracking quality
time t to be the same as at t − 1 and aims to minimize the is deemed insufficient and failure recovery methods are ini-
photometric error of 2D image locations of known depth tiated.
in the current frame with respect to their location at t − 1, DT SLAM. DT SLAM maintains a camera pose based
by varying the camera transformation relating both frames. on three tracking modes: full pose estimation, Essential ma-
The minimization takes places through thirty Gauss New- trix estimation, and pure rotation estimation. When a suffi-
ton iterations of the inverse compositional image alignment cient number of 3D matches exist, a full pose can be esti-
method. This however introduces many limitations to SVO mated; otherwise, if a sufficient number of 2D matches are
since the ICIA requires small displacements between frames established that exhibit small translations, an Essential ma-
(1pixel). This limits the operation of SVO to high frame rate trix is estimated; and finally, if a pure rotation is exhibited,
cameras (typically > 70 f ps) so that the displacement limi- 2 points are used to estimate the absolute orientation of the
tation is not exceeded. Furthermore, the ICIA is based on matches (Kneip et al, 2012). The pose estimation module
the brightness consistency constraint rendering it vulnerable finally aims, in an iterative manner, to minimize the error
to any variations in lighting conditions. vector of both 3D-2D re-projections and 2D-2D matches.
SVO does not employ explicit feature matching for ev- When tracking failure occurs, the system initializes a new
ery incoming frame; rather, it is achieved implicitly as a map and continues to collect data for tracking in a differ-
byproduct of the image alignment step. Once image align- ent map; however, the map making thread continues to look
ment takes place, landmarks that are estimated to be visi- for possible matches between the keyframes of the new map
ble in the current frame, are projected onto the image. The and the old one, and once a match is established, both maps
2D location of the projected landmarks are fine-tuned by are fused together, thereby allowing the system to handle
minimizing the photometric error between a patch, extracted multiple sub-maps, each at a different scale.
from the initial projected location in the current frame, and LSD SLAM. The tracking thread in LSD SLAM is re-
a warp of the landmark generated from the nearest keyframe sponsible for estimating the current frame pose with respect
observing it. To decrease the computational complexity and to the currently active keyframe in the map using the previ-
to maintain only the strongest features, the frame is divided ous frame pose as a prior. The required pose is represented
into a grid and only one projected landmark (the strongest) by an SE(3) transformation and is found by an iteratively
per grid cell is used. However, This minimization violates re-weighted Gauss-Newton optimization that minimizes the
the epipolar constraint for the entire frame and further pro- variance normalized photometric residual error, as described
cessing in the tracking module is required. Motion-only in (Engel et al, 2013), between the current frame and the ac-
12 Georges Younes et al.

tive keyframe in the map. A keyframe is considered active if the map by forfeiting geometric information (scale, distance
it is the most recent keyframe accommodated in the map. To and direction) in favor for connectivity information. In the
minimize outlier effects, measurements with large residuals context of Visual SLAM, a topological map is an undirected
are down-weighted from one iteration to the other. graph of nodes that typically represents keyframes linked to-
ORB SLAM. Pose estimation in ORB SLAM is estab- gether by edges, when shared data associations between the
lished through a constant velocity motion model prior, fol- nodes exists.
lowed by a pose refinement using optimization. As the mo- While topological maps scale well with large scenes, in
tion model is expected to be easily violated through abrupt order to maintain camera pose estimates, metric informa-
motions, ORB SLAM detects such failures by tracking the tion is also required; the conversion from a topological to a
number of matched features; if it falls below a certain thresh- metric map is not always a trivial task and therefore recent
old, map points are projected onto the current frame and a Visual SLAM systems such as (Engel et al, 2014; Lim et al,
wide range feature search takes place around the projected 2014, 2011; Mur-Artal et al, 2015) employ hybrid maps that
locations. If tracking fails, ORB SLAM invokes its failure are locally metric and globally topological. The implemen-
recovery method to establish an initial frame pose via global tation of a hybrid map representation permits the system to
re-localization. (1) reason about the world on a high level, which allows for
In an effort to make ORB SLAM operate in large en- efficient solutions to loop closures and failure recovery using
vironments, a subset of the global map, known as the local topological information, and (2) to increase efficiency of the
map, is defined by all landmarks corresponding to the set metric pose estimate by limiting the scope of the map to a
of all keyframes that share edges with the current frame, as local region surrounding the camera (Fernández-Moral et al,
well as all neighbors of this set of keyframes from the pose 2015). A hybrid map allows for local optimization of the
graph ( more on that in the following section). The selected metric map while maintaining scalability of the optimiza-
landmarks are filtered out to keep only the features that are tion over the global topological map (Konolige, 2010).
most likely to be matched in the current frame. Furthermore, In a metric map the map making process handles the ini-
if the distance from the camera’s center to the landmark is tialization of new landmarks into the map as well as outlier
beyond the range of the valid features scales, the landmark detection and handling. The 3D structure of the observed
is also discarded. The remaining set of landmarks is then scene is sought from a known transformation between two
searched for and matched in the current frame before a final frames, along with the corresponding data associations. Due
camera pose refinement step takes place. to noise in data association and pose estimates of the tracked
DPPTAM. Similar to LSD SLAM, DPPTAM optimizes images, projecting rays from two associated features will
the photometric error of high gradient pixel locations be- most probably not intersect in 3D space. Triangulation by
tween two images using the ICIA formulation over the optimization as (shown in Fig. 11) aims to estimate a land-
SE(3) transform relating them. The minimization is started mark pose corresponding to the associated features, by min-
using a constant velocity motion model unless the photomet- imizing its re-projection error e1 and e2 onto both frames. To
ric error increases after applying it. If the latter is true, the gain resilience against outliers and to obtain better accuracy,
motion model is disregarded and the pose of the last tracked some systems employ a similar optimization over features
frame is used. Similar to PTAM the optimization takes place associated across more than two views.
in the tangent space Sξ (3) that minimally parameterizes the
rigid body transform by six parameters.
X

3.5 Map generation

The map generation module represents the world as a dense

(for direct) or sparse (for indirect) cloud of points. Figure
10 presents the flowchart of a map generation module. The
system proceeds by triangulating the 2D points of interest 𝑥"$
x1 𝑥"#
into 3D landmarks and keeps track of their 3D coordinates x2
and then localizes the camera within what is referred to as
a metric map. However, as the camera explores large envi-
Fig. 11 Landmark triangulation by optimization.
ronments, metric maps suffer from the unbounded growth of
their size, thereby leading to system failure.
Topological maps were introduced to alleviate this short- As shown in Fig. 12, filter-based landmark estimation
coming, in an effort to minimize the metric information in techniques recover the position of a landmark by populating
A survey on non-filter-based monocular Visual SLAM systems 13

MAP F

Data association Triangulate new

Frame data Does Map T
between new KF and landmarks / Update Pose Graph Map
from Tracking have a Pose
neighboring KFs from update depth links Maintenance
thread Graph?
the map filters

Fig. 10 Generic map generation pipeline.

a particle filter with a uniform distribution (D1 ) of landmark quality of the landmarks and to allow for the map refinement
position estimates, which are then updated as the landmark step to remove corrupt data.
is observed across multiple views. This continues until the New landmarks are generated by establishing and trian-
filter converges from a uniform distribution to a Gaussian gulating feature matches between the newly added keyframe
with a small variance (D3 ). In this type of landmark estima- and its nearest keyframe (in terms of position) from the
tion, outliers are easily flagged as landmarks whose distribu- map. Already existant landmarks from the map are pro-
tion remain approximately uniform after significant observa- jected onto both keyframes and feature matches from the
tions. Filter-based methods result in a time delay before an current keyframe are searched for along their correspond-
observed landmark can be used for pose tracking, in contrast ing epipolar line in the other keyframe at regions that do
to triangulation by optimization methods that can be used as not contain projected landmarks. The average depth of the
soon as the landmark is triangulated from two views. projected landmarks is used to constrain the epipolar search
A major limitation in all these methods is that they re- from a line to a segment; this limits the computation cost
quire a baseline between the images observing the feature, of the search and avoids adding landmarks in regions where
and hence are all prone to failure when the camera’s motion nearby landmarks exist. However, this also limits the newly
is constrained to pure rotations. To counter such a mode of created landmarks to be within the epipolar segment, and
failure, DT SLAM introduced into the map 2D landmarks hence very large variations in the scene’s depth may lead to
that can be used for rotation estimation before they are tri- the negligence of possible landmarks.
angulated into 3D landmarks. SVO. The map generation thread in SVO runs parallel
to the tracking thread and is responsible for creating and up-
dating the map. SVO parametrizes 3D landmarks using an
inverse depth parameterization model (Civera et al, 2008).
Upon insertion of a new keyframe, features possessing the
highest Shi-Tomasi scores are chosen to initialize a num-
ber of depth filters. These features are labeled as seeds and
are initialized to be along a line propagating from the cam-
era center to the 2D location of the seed in the originating
keyframe. The only parameter that remains to be solved for
x1 is then the depth of the landmark, which is initialized to the
mean of the scene’s depth, as observed from the keyframe
of origin.
Fig. 12 Landmark estimation using filter based methods.
During the times when no new keyframe is being pro-
cessed, the map management thread monitors and updates
Table 6 summarizes map generation methods employed map seeds by subsequent observations in newly acquired
by different Visual SLAM systems which can be divided frames. The seed is searched for in new frames along an
into two main categories: triangulation by optimization epipolar search line, which is limited by the uncertainty of
(PTAM and ORB SLAM) and filter based landmark esti- the seed and the mean depth distribution observed in the
mation (SVO, LSD SLAM and DPPTAM). current frame. As the filter converges, its uncertainty de-
PTAM. When a new keyframe is added in PTAM, creases and the epipolar search range decreases. If seeds fail
all bundle adjustment operations are halted, and the new to match frequently, if they diverge to infinity, or if a long
keyframe inherits the pose from the coarse tracking stage. time has passed since their initialization, they are considered
The potentially visible set of landmarks, estimated by the bad seeds and removed from the map.
tracker, is then re-projected onto the new keyframe, and fea- The filter converges when the distribution of the depth
ture matches are established. Correctly matched landmarks estimate of a seed transitions from the initially assumed
are marked as seen again; this is done to keep track of the uniform distribution into a Gaussian one. The seed in then
14 Georges Younes et al.

Table 6 Map generation. Abbreviations: 2 view triangulation (2.v.t.), particle filter with inverse depth parametrization (p.f.), 2D landmarks trian-
gulated to 3D landmarks (2D.l.t.), depth map propagation from previous frame (p.f.p.f.), depth map refined through small baseline observations
(s.b.o.), multiple hypotheses photometric error minimization (m.h.p.m.)

LSD ORB
PTAM SVO DT SLAM DPPTAM
SLAM SLAM
p.f.p.f or
Map generation 2.v.t. p.f. 2D.l.t. 2.v.t. m.h.p.m
s.b.o.
Map type metric metric metric hybrid hybrid metric

added into the map, with the mean of the Gaussian distri- The Sim(3) of a newly added keyframe is then estimated
bution as its depth. This process however limits SVO to op- and refined in a direct, scale-drift aware image alignment
erate in environments of relatively uniform depth distribu- scheme, which is similar to the one done in the tracking
tions. Since the initialization of landmarks in SVO relies on thread, but with respect to other keyframes in the map and
many observations in order for the features to be triangu- over the 7d.o. f . Sim(3) transform.
lated, the map contains few if any outliers and hence no out- Due to the non-convexity of the direct image alignment
lier deletion method is required. However, this comes at the method on Sim(3), an accurate initialization to the mini-
expense of a delayed time before the features are initialized mization procedure is required; for such purpose, ESM (Ef-
as landmarks and added to the map. ficient Second Order minimization) (Benhimane and Malis,
DT SLAM. DT SLAM aims to add keyframes when 2007) and a coarse to fine pyramidal scheme with very low
enough visual change has occurred; the three criteria for resolutions proved to increase the convergence radius of the
keyframe addition are (1) for the frame to contain a suffi- task.
cient number of new 2D features that can be created from If the map generation module deems the current frame
areas not covered by the map, or (2) a minimum number of as not being a keyframe, depth map refinement takes place
2D features can be triangulated into 3D landmarks, or (3) a by establishing stereo matches for each pixel in a suitable
given number of already existing 3D landmarks have been reference frame. The reference frame for each pixel is deter-
observed from a significantly different angle. The map con- mined by the oldest frame the pixel was observed in, where
tains both 2D features and 3D landmarks, where the trian- the disparity search range and the observation angle do not
gulation of 2D features into 3D landmarks is done through exceed a certain threshold. A 1-D search along the epipolar
two view triangulation and is deferred until enough paral- line for each pixel is performed with an SSD metric.
lax between keyframes is observed–hence the name of the To minimize computational cost and reduce the effect of
algorithm. outliers on the map, not all established stereo matches are
LSD SLAM. LSD SLAM’s map generation module used to update the depth map; instead, a subset of pixels
functions can be divided into two main categories, depend- is selected for which the accuracy of a disparity search is
ing on whether the current frame is a keyframe or not; if it sufficiently large. The accuracy is determined by three crite-
is, depth map creation takes place by keyframe accommoda- ria: the photometric disparity error, the geometric disparity
tion; if not, depth map refinement is done on regular frames. error, and the pixel to inverse depth ratio. Further details re-
To maintain tracking quality, LSD SLAM requires frequent garding these criteria are outside the scope of this work, the
addition of keyframes into the map as well as relatively high interested reader is referred to Engel et al (2013). Finally,
frame rate cameras. depth map regularization and outlier handling, similar to the
keyframe processing step, take place.
If a frame is labeled a keyframe, the estimated depth map
ORB SLAM. ORB SLAM’s local mapping thread is re-
from the previous keyframe is projected onto it and serves as
sponsible for keyframe insertion, map point triangulation,
its initial depth map. Spatial regularization then takes place
map point culling, keyframe culling and local bundle adjust-
by replacing each projected depth value by the average of its
ment. The keyframe insertion step is responsible for updat-
surrounding values and the variance is chosen as the mini-
ing the co-visibility and essential graphs with the appropri-
mal variance value of the neighboring measurements.
ate edges, as well as computing the bag of words represent-
In LSD SLAM, outliers are detected by monitoring the ing the newly added keyframe in the map. The co-visibility
probability of the projected depth hypothesis at each pixel to graph is a pose graph that represents all keyframes in the
be an outlier or not. To make the outliers detection step pos- system by nodes, in contrast to the essential graph that al-
sible, LSD SLAM keeps records of all successfully matched low every node to have two or less edges, by keeping the
pixels during the tracking thread, and accordingly increases strongest two edges only for every node. The map point cre-
or decreases the probability of it being an outlier. ation module spawns new landmarks by triangulating ORB
A survey on non-filter-based monocular Visual SLAM systems 15

Establish data association Outlier

Keyframe
MAP between landmarks and all Landmark
culling
KFs they are visible in culling

T Is Dense
Perform dense Map optimization through LBA Search for
reconstruction
reconstruction and/or GBA and/or PGO Loop closures
required?

Fig. 13 Generic map maintenance pipeline.

features that appear in two or more views from connected computationally involved and intractable if performed on all
keyframes in the co-visibility graph. Triangulated landmarks frames and all poses. The breakthrough that enabled its ap-
are tested for positive depth, re-projection error, and scale plication in PTAM is the notion of keyframes, where only
consistency in all keyframes they are observed in, in order select frames labeled as keyframes are used in the map cre-
to accommodate them into the map. ation and passed to the bundle adjustment process in contrast
DPPTAM. Landmark triangulation in DPPTAM takes to SfM methods that uses all available frames. Different al-
place over several overlapping observations of the scene gorithms apply different criteria for keyframe labeling, as
using inverse depth parametrization; the map maker aims well as different strategies for BA, some use jointly a local
to minimize the photometric error between a high gradient (over a local number of keyframes) LBA and global (over
pixel patch in the last added keyframe and the correspond- the entire map) GBA, while others argue that a local BA
ing patch of pixels, found by projecting the feature from the only is sufficient to maintain a good quality map. To reduce
keyframe onto the current frame. The minimization is re- the computational expenses of bundle adjustment, Strasdat
peated ten times for all high-gradient pixels when the frame et al (2011) proposed to represent the visual SLAM map
exhibits enough translation; the threshold for translation is by both a Euclidean map for LBA, along with a topological
increased from an iteration to another to ensure enough map for pose graph optimization that explicitly distributes
baseline between the frames. The end result is ten hypothe- the accumulated drift along the entire map.
ses for the depth of each high-gradient pixel. To deduce the Pose Graph Optimization (PGO) returns inferior results
final depth estimate from the hypotheses, three consecutive to those produced by GBA. The reasons is that while PGO
tests are performed, including gradient direction test, tem- optimizes only for the keyframe poses–and accordingly ad-
poral consistency, and spatial consistency. justs the 3D structure of landmarks–GBA jointly optimizes
for both keyframe poses and 3D structure. The stated ad-
vantage of GBA comes at the cost of computational time,
3.6 Map maintenance with PGO exhibiting significant speed up compared to GBA.
However, pose graph optimization requires efficient loop
Map maintenance takes care of optimizing the map through closure detection and may not yield an optimal result as the
either bundle adjustment or pose graph optimization (Kum- errors are distributed along the entire map, leading to lo-
merle et al, 2011). Figure 13 presents the steps required for cally induced inaccuracies in regions that were not originally
map maintenance of a generic Visual SLAM. During a map wrong.
exploration phase, new 3D landmarks are triangulated based Map maintenance is also responsible for detecting and
on the camera pose estimates. After some time, system drift removing outliers in the map due to noisy and faulty
manifests itself in wrong camera pose measurements, due matched features. While the underlying assumption of most
to accumulated errors in previous camera poses that were Visual SLAM algorithms is that the environment is static,
used to expand the map. Figure 14 describes the map main- some algorithms such as RD SLAM exploits map main-
tenance effect where the scene’s map is refined through out- tenance methods to accommodate slowly varying scenes
lier removal and error minimizations, to yield a more accu- (lighting and structural changes).
rate scene representation. PTAM. The map making thread in PTAM runs parallel
Bundle adjustment (BA) is inherited from SfM and con- to the tracking thread and does not operate on a frame by
sists of a nonlinear optimization process for refining a vi- frame basis; instead, it only processes keyframes. When the
sual reconstruction, to jointly produce an optimal structure map making thread is not processing new keyframes, it per-
and coherent camera pose estimates. Bundle adjustment is forms various optimizations and maintenance to the map,
16 Georges Younes et al.

keyframe in the co-visibility graph and all other keyframes

that observe any landmark present in the current keyframe.
DPPTAM. DPPTAM does not employ optimization
tasks to its map as previous systems do, however it produces
Actual!scene! real-time dense maps by employing a dense mapping thread
Map!before!maintenance!
Map!after!maintenance!
that exploits planar properties of man-made indoor environ-
ments. Keyframes are first segmented into a set of 2D super-
pixels and all 3D landmarks from the map are projected onto
Fig. 14 Map maintenance effects on the map. the keyframe and assigned to different superpixels according
to the distance of their projections to the appropriate super-
pixel in the keyframe. 3D points belonging to contours of
such as an LBA for local map convergence and a global the superpixels are used to fit 3D planes to each superpixel
adjustment for the map’s global convergence. The compu- before three tests determine if the superpixel’s plane is to
tational cost in PTAM scales with the map and becomes be added into the map. The tests include (1) the normalized
intractable as the number of keyframes get large; for this residual test, (2) the degenerate case detection, and (3) tem-
reason PTAM is designed to operate in small workspaces. poral consistency. Finally a full dense map is reconstructed
Finally, the optimization thread applies data refinement by by estimating depth at every pixel using depth priors of the
first searching and updating landmark observations in all 3D planes associated with the superpixels.
keyframes, and then by removing all landmarks that failed Table 7 summarizes the map maintenance procedure
many times to successfully match features. adopted by different Visual SLAM system.
SVO. For runtime efficiency reasons, SVO’s map man-
agement maintains only a fixed number of keyframes in 3.7 Failure recovery
the map and removes distant ones when new keyframes are
added. This is performed so that the algorithm maintains Whether due to wrong user movement (abrupt change in the
real-time performance after prolonged periods of operation camera pose and motion blur), or camera observing a fea-
over large distances. tureless region or failure to match enough features, or for
DT SLAM. Aside the map generation module, DT any other reason, Visual SLAM methods can eventually fail.
SLAM employ a third thread that continuously optimize the A key module essential for the usability of any visual SLAM
entire map in the background through a sparse GBA. system is its ability to correctly recover from such failures.
PTAM. Upon detecting failure, PTAM’s tracker initiates
LSD SLAM. LSD SLAM runs a third parallel thread a recovery procedure, where the SBI (small blurry image) of
that continuously optimizes the map in the background by a each incoming frame is compared to the database of SBIs
generic implementation of a pose graph optimization using for all keyframes. If the intensity difference between the
the g2o-framework (Kummerle et al, 2011). This however incoming frame and its nearest looking keyframe is below
leads to an inferior accuracy when compared to other meth- a certain threshold, the current frame’s pose is assumed to
ods for reasons stated earlier. be equivalent to that of the corresponding keyframe. ESM
ORB SLAM. To maintain a good quality map and tracking takes place (similar to that of the tracking thread)
counter the effect of frequently adding features from to estimate the rotational change between the keyframe
keyframes, ORB SLAM employs rigorous landmark culling and the current frame. If converged, a PVS of landmarks
to ensure few outliers in the map. A landmark must be cor- from the map is projected onto the estimated pose and the
rectly matched to twenty five percent of the frames in which tracker attempts to match the landmarks to the features in
it is predicted to be visible. It also must be visible from at the frame. If enough features are correctly matched, the
least three keyframes after more than one keyframe has been tracker resumes normally, otherwise a new frame is acquired
accommodated into the map since it was spawned. Other- and the tracker remains lost. For successful re-localization,
wise, the landmark is removed. To maintain lifelong opera- this method requires the lost camera’s pose to be near the
tion and to counter the side effects (computational complex- recorded keyframe’s pose, and otherwise would fail when
ity, outliers, redundancy) of the presence of high number of there is a large displacement between the two.
keyframes in the map, a rigorous keyframe culling proce- SVO. When tracking quality is deemed insufficient,
dure takes place as well. Keyframes that have ninety per- SVO initiates its failure recovery method. The first proce-
cent of their associated landmarks observed in three other dure in the recovery process is to apply image alignment
keyframes are deemed redundant and removed. The local between the incoming frame and the closest keyframe to the
mapping thread also performs a local bundle adjustment last known correctly tracked frame. If more than thirty fea-
over all keyframes connected to the latest accommodated tures are correctly matched during the image alignment step,
A survey on non-filter-based monocular Visual SLAM systems 17

Table 7 Map maintenance. Abbreviations used: Local Bundle Adjustment (LBA), Global Bundle Adjustment (GBA), Pose Graph Optimization
(PGO),
PTAM SVO DT SLAM LSD SLAM ORB SLAM DPPTAM
Optimization LBA & GBA LBA LBA & GBA PGO PGO& LBA Dense mapping
Scene type static & small uniform depth static & small static &small or large static & small or large static & indoor planar

then the re-localizer considers itself converged and contin- 3.8 Loop closure
ues tracking regularly; otherwise, it attempts to re-localize
using new incoming frames. Such a re-localizer is sensitive Since Visual SLAM is an optimization problem, it is prone
to any change in the lighting conditions of the scene, and the to drifts in camera pose estimates. Returning to a certain
lost frame location should be close enough to the queried pose after an exploration phase may not yield the same cam-
keyframe for successful re-localization to take place. era pose measurement as it was at the start of the run (See
LSD SLAM. LSD SLAM’s recovery procedure first Fig. 15). Such camera pose drift can also manifest itself in a
chooses randomly a keyframe from the map that has more map scale drift that will eventually lead the system to erro-
than two neighboring keyframes connected to it in the pose neous measurements and fatal failure. To address this issue,
graph. It then attempts to align the currently lost frame to some algorithms detect loop closures in an on-line Visual
it. If the outlier-to-inlier ratio is large, the keyframe is dis- SLAM session and optimize the loops track, in an effort to
carded and replaced by another keyframe at random; other- correct the drift and the error in the camera pose and in all
wise, all neighboring keyframes connected to it in the pose relevant map data that were created during the loop. The
graph are then tested. If the number of neighbors with a large loop closure thread attempts to establish loops upon the in-
inlier-to-outlier ratio is larger than the number of neighbors sertion of a new keyframe in order to correct and minimize
with a large outlier-to-inlier ratio, or if there are more than any accumulated drift by the system over time.
five neighbors with a large inlier-to-outlier ratio, the neigh-
boring keyframe with the largest ratio is set as the active
keyframe and regular tracking is accordingly resumed.
ORB SLAM. Triggered by tracking failure, ORB
SLAM invokes its global place recognition module. Upon
running, the re-localizer transforms the current frame into a
bag of words and queries the database of keyframes for all
possible keyframes that might be used to re-localize from.
The place recognition module implemented in ORB SLAM,
used for both loop detection and failure recovery, relies on
Actual path
bags of words as frames observing the same scene share
a big number of common visual vocabulary. In contrast to Estimated path
other bag of words methods that return the best queried hy-
pothesis from the database of keyframes, the place recogni- Fig. 15 Drift suffered by the Visual SLAM pose estimate after return-
ing to its starting point.
tion module of ORB SLAM returns all possible hypotheses
that have a probability of being a match larger than seventy
five percent of the best match. The combined added value LSD SLAM. Whenever a keyframe is processed by LSD
of the ORB features, along with the bag of words imple- SLAM, loop closures are searched for within its ten nearest
mentation of the place recognition module, manifests itself keyframes as well as through the appearance based model
in a real-time, high recall, and relatively high tolerance to of FABMAP (Glover et al, 2012) to establish both ends of
viewpoint changes during re-localization and loop detection. a loop. Once a loop edge is detected, a pose graph opti-
All hypotheses are then tested through a RANSAC imple- mization minimizes the similarity error established at the
mentation of the PnP algorithm (Lepetit et al, 2009) that loops edge by distributing the error over the loops keyframes
determines the camera pose from a set of 3D to 2D corre- poses.
spondences. The found camera pose with the most inliers is ORB SLAM. Loop detection in ORB SLAM takes place
then used to establish more matches to features associated via its global place recognition module that returns all hy-
with the candidate keyframe, before an optimization over potheses of keyframes from the database that might corre-
the camera’s pose using the established matches takes place. spond to the opposing loop end. To ensure enough distance
Table 8 summarizes the failure recovery mechanisms change has taken place, they compute what they refer to as
used by different Visual SLAM system. the similarity transform between all connected keyframes to
18 Georges Younes et al.

Table 8 Failure recovery. Abbreviations used: photometric error minimization of SBIs (p.e.m.), image alignment with last correctly tracked
keyframe (i.a.l.), image alignment with random keyframe (i.a.r.), bag of words place recognition (b.w.)

DT LSD ORB
PTAM SVO DPPTAM
SLAM SLAM SLAM
Failure recovery p.e.m. i.a.l. none i.a.r. b.w. i.a.l

the current keyframe in the thresholded co-visibility graph. map. To ensure sufficient baseline, the system is initialized
It the similarity score is less than a threshold, the hypothesis by automatic insertion of three keyframes. However the sys-
of a loop is removed. If enough inliers support the refined tem does not use the 3 views to perform the initialization,
similarity transform, the queried keyframe is considered to instead it solves for the initialization using the 5-point al-
be the other end of the loop and loop fusion takes place. gorithm of Nistér (2004) between the 1st and 3rd keyframes
The loop fusion first merges duplicate map points in both only therefore, it is still susceptible to planar scenes.
keyframes and inserts a new edge in the co-visibility graph Silveira et al (2008) proposed a real-time direct solution
that closes the loop by correcting the Sim(3) pose of the by assuming relatively large patches of pixels surrounding
current keyframe using the similarity transform. Using the regions of high intensity gradients as planar, and perform-
corrected pose, all landmarks associated with the queried ing image alignment by minimizing the photometric error
keyframe and its neighbors are projected to and searched of these patches across incoming frames, in a single op-
for in all keyframes associated with the current keyframe in timization step that incorporates Cheirality, geometric and
the co-visibility graph. The initial set of inliers, as well as photometric constraints. To gain resilience against lighting
the found matches are used to update the co-visibility and changes and outliers, the system employs a photogeometric
Essential graphs, establishing many edges between the two generative model and monitors the errors in the minimiza-
ends of the loop. Finally, a pose graph optimization over the tion process to flag outliers.
essential graph takes place similar to that of LSD SLAM, In 2010, =Strasdat et al (2010a) introduced similarity
which minimizes and distributes the loop closing error along transforms into Visual SLAM, allowing for scale drift esti-
the loop nodes. mation and correction once the system detects loop closure.
Table 9 summarizes the Loop closure mechanisms used Feature tracking is performed by a mixture of top-bottom
by different Visual SLAM system. and bottom-up approach, using a dense variational optical
flow and a search over a window surrounding the projected
landmarks. Landmarks are triangulated by updating infor-
Table 9 Loop closure. Abbreviations used: Bag of Words place recog-
nition (B.W.p.r), sim(3) optimization (s.o.) mation filters and loop detection is performed using a bag of
words discretization of SURF features (Bay et al, 2008). The
DT LSD ORB loop is finally closed by applying a pose graph optimization
PTAM SVO DPPTAM
SLAM SLAM SLAM
over the similarity transforms relating the keyframes.
Loop FabMap B.W.p.r.
none none none none Also in 2010, Newcombe and Davison (2010) suggested
closure +s.o. +s.o.
a hybrid Visual SLAM system that relied on feature-based
SLAM (PTAM) to fit a dense surface estimate of the en-
vironment that is refined using direct methods. A surface-
based model is then computed and polygonized to best fit
4 Closed source systems the triangulated landmarks from the feature-based front end.
A parallel process chooses a batch of frames that have a po-
We have discussed so far methods presented in open-source tentially overlapping surface visibility in order to estimate a
Visual SLAM systems; however, plenty of closed source dense refinement over the base mesh using a GPU acceler-
methods exist in the literature. This section aims to provide ated implementation of variational optical flow.
a quick overview of these systems, which include many in- In an update to this work, Newcombe released in 2011
teresting ideas for the reader. Table 10 lists in chronological Dense Tracking and Mapping (DTAM) (Newcombe et al,
order each of these systems. To avoid repetition, we will not 2011) that removed the need for PTAM as a front-end to
outline the complete details of each system; rather, we will the system and generalized the dense reconstruction to fully
focus on what we feel has additive value to the reader from solve the Visual SLAM pipeline, by performing on-line a
the information provided in Section 3. In 2006, Mouragnon dense reconstruction, given camera pose estimates that are
et al (2006) were the first to introduce in the concept of found through whole image alignment.
keyframes in Visual SLAM and employed a local Bundle Similar to the work of Newcombe(Newcombe et al,
Adjustment in real-time over a subset of keyframes in the 2011), Pretto et al (2011) modeled the environment as a 3D
A survey on non-filter-based monocular Visual SLAM systems 19

Table 10 Closed source Visual SLAM systems

Year Paper Additive information beyond open-source techniques

Real Time Localization and 3D Reconstruction – Introduced keyframes and LBA for real-time Visual SLAM
2006 – Utilize 3 keyframes for initialization
Mouragnon et al (2006)

– Real-time direct image alignment

An efficient direct approach to Visual SLAM Sil- – Employed a photogeometric generative model
2008
veira et al (2008) – Outlier handling using phometric and geometric errors

– Introduced similarity transforms to visual SLAM

– Data association using dense variational optical flow and search win-
Scale-Drift Aware Large Scale Monocular SLAM dows.
2010 – Use information filters to triangulate landmarks
Strasdat et al (2010a)
– Loop detection and closure techniques using bags of words discretiza-
tion of SURF features and a pose graph optimization

Live dense reconstruction with a single moving – Fit a base mesh to a sparse set of landmarks triangulated using PTAM
2010 – GPU parallelization of variational optical flow to refine the base mesh
camera Newcombe and Davison (2010)

– Modeled the environment as a 3D piecewise smooth surface.

Omnidirectional dense large scale mapping and – Meshing using delaunay triangulation
2011 navigation based on meaningful triangulation – incorporated edgelets into the mesh
Pretto et al (2011) – interpolated a dense reconstruction using the mesh vertices.

Dense Tracking and Mapping (DTAM) New- – Removed the need for PTAM as a front-end from Pretto et al (2011)
2011
combe et al (2011)

– Handle short and long term environmental changes

Continuous Localization and Mapping in a Dy- – Use HOC descriptors to gain rotational change immunity
2011 – re-localization through FABMAP
namic world Pirker et al (2011)
– Loop closure through pose graph optimization

– Handle occlusions and slowly varying scenes and illumination changes

– Prior-based adaptive RANSAC sampling of features
Robust Monocular SLAM in Dynamic Environ- – Employed a GPU accelerated SIFT extraction and KD-Tree feature
2013
ments Tan et al (2013) matching
– Landmark and keyframe update and culling using color histograms.

Handling pure camera rotation in keyframe based – panoramic submaps (2D landmarks) to handle pure camera rotation
2013 – phonySIFT feature extraction and matching using hierarchical k-means.
SLAM Pirchheim et al (2013)

Real-Time 6-DOF Monocular Visual SLAM in a – Hybrid topological and metric map
2014 – Tracking, mapping and loop closure all using the same binary descriptor
large scale environment Lim et al (2014)

– Off-line method
Robust Large Scale monocular Visual SLAM – Divide the map into submaps stored in a graph
2015 – Suggested a loop closure outlier detection mechanism in submaps
Bourmaud and Megret (2015)
– Employed a loopy belief propagation algorithm (LS-RSA)

piecewise smooth surface and used a sparse feature based significant rotational changes, CD SLAM suggests the use
front-end as a base for a Delaunay triangulation to fit a mesh of a modified Histogram of Oriented Cameras descriptor
that is used to interpolate a dense reconstruction of the envi- (HOC) (Pirker, 2010) with a GPU accelerated descriptor
ronment. update and a probabilistic weighting scheme to handle out-
Pirker et al (2011) released CD SLAM in 2011 with liers. Furthermore, it suggests the use of large-scale nested
the objectives to handle short- and long-term environmen- loop closures with scale drift correction and provide a ge-
tal changes and to handle mixed indoor/outdoor environ- ometric adaptation to update the feature descriptors after
ments. To limit the map size and gain robustness against loop closure. Keyframes are organized in an undirected, un-
20 Georges Younes et al.

weighted pose graph. Re-localization is performed using a To cope with a very large number of submaps, a loopy belief
non-linear least squared minimization initialized, with the propagation algorithm cuts the main graph into subgraphs,
pose of the best matching candidate keyframe from the map before a non-linear optimization takes place.
found through FABMAP; whereas loop closure takes place
using pose graph optimization.
In 2013, RD SLAM (Tan et al, 2013) was released 5 Conclusions
with the aim to handle occlusions and slowly varying, dy-
namic scenes. RD SLAM employs a heavily parallelized During the course of this work, we have outlined the build-
GPU accelerated SIFT and stores them in a KD-Tree (Bent- ing blocks of a generic Visual SLAM system; including data
ley, 1975) that further accelerates feature matching based type, initialization, data association, pose estimation, map
on the nearest neighbor of the queried feature in the tree. generation, map maintenance, failure recovery and loop clo-
To cope with dynamic objects (moving) and slowly vary- sure. We also discussed the details of the latest open-source
ing scenes, RD SLAM suggests a prior-based adaptive state of the art systems in Visual SLAM including PTAM,
RANSAC scheme, that samples, based on the outlier ratio of SVO, DT SLAM, LSD SLAM, ORB SLAM and DPP-
features in previous frames, the features in the current frame TAM. Finally, we compiled and summarized what added in-
from which to estimate the camera pose along with a land- formation closed-source non-filter-based monocular Visual
mark and keyframe culling mechanism, using histograms of SLAM systems have to offer.
colors to detect and update changed image locations, while Although extensive research has been dedicated to this
sparing temporarily occluded landmarks. field, it is our opinion that each of the building blocks dis-
Pirchheim et al (2013) dealt with the problem of pure cussed above could benefit from improvements. Robust data
rotations in the camera motion by building local panorama association against illumination changes, dynamic scenes
maps, whenever the system explores a new scene with pure and occluded environments; an initialization method that
rotational motion. The system extracts phonySIFT descrip- can operate without an initial scene assumption nor a large
tors as described in Wagner et al (2010)and establish fea- number of processed frames; an accurate camera pose esti-
ture correspondences using an accelerated matching method mate that is not affected by sudden movements, blur, noise,
through hierarchical k-means. When insufficient 3D land- large depth variations nor moving objects; a map making
marks are observed during pose estimation, the system tran- module capable of generating an efficient dense scene rep-
sitions into a rotation-only estimate mode and starts building resentation in regions of little texture, a map maintenance
a panorama map until the camera observes part of the finite method that improves the map with resilience against dy-
map. namic, changing small and large scale environments; a fail-
In the work of Lim et al (2014) the sought after objective ure recovery procedure capable of reviving the system from
is to handle tracking, mapping, and loop closure, all using significantly large changes in camera view points, are all de-
the same binary feature, through a hybrid map representa- sired properties that unfortunately most state of the art sys-
tion (topological and metric). Whenever a loop is detected, tems lack and remain challenging topics in the field.
the map is converted to its metric form where a local Bundle We are currently working on creating a set of experi-
Adjustment take place before returning the map back to its ments, which cater to the requirements of each of the above
topological form. open-source systems–for initialization, camera frame rate,
In 2015, Bourmaud and Megret (2015) released an of- and depth homogeneity. Upon completion of these experi-
fline Visual SLAM system (requiring around two hours and ments, we will benchmark all the open-source Visual SLAM
a half for a dataset of 10000 images). The system employs systems to better identify the advantages and disadvantages
a divide- and-conquer strategy, by segmenting the map into of each system module. The long-term goal would then be
submaps. A similarity transform is estimated between each to leverage on all of the advantages to create a superior non-
submap and its ten nearest neighbors. A global similarity filter-based Visual SLAM.
transform, relating every submap to a single global reference
frame, is computed by a pose graph optimization, where Acknowledgements This work was funded by the ENPI (European
the reference frames are stored in a graph of submaps. The Neighborhood Partnership Instrument) grant # I-A/1.2/113 as well as
above procedure is susceptible to outliers in the loop detec- the Lebanese National Research Council (LNCSR).
tion module and hence the need for an efficient outlier han-
dling mechanism. For such purpose, and to prevent outliers,
temporally consecutive similarity measurements are always References
considered as inliers. The outlier rejection module proceeds
then by integrating the similarities over the shortest loop it Alahi A, Ortiz R, Vandergheynst P (2012) Freak: Fast retina
can find, and monitors the closure error to accept the loop. keypoint. In: Computer Vision and Pattern Recognition
A survey on non-filter-based monocular Visual SLAM systems 21

(CVPR), 2012 IEEE Conference on, pp 510–517, DOI IROS.2015.7354184

10.1109/CVPR.2012.6247715 Dalal N, Triggs B (2005) Histograms of oriented gradi-
Baker S, Matthews I (2004) Lucas-Kanade 20 Years On: A ents for human detection. In: Computer Vision and Pat-
Unifying Framework. International Journal of Computer tern Recognition, 2005. CVPR 2005. IEEE Computer So-
Vision 56(3):221–255 ciety Conference on, vol 1, pp 886–893 vol. 1, DOI
Bay H, Ess A, Tuytelaars T, Van Gool L (2008) 10.1109/CVPR.2005.177
Speeded-up robust features (surf). Comput Vis Im- Davison A (2003) Real-time simultaneous localisation and
age Underst 110(3):346–359, DOI 10.1016/j.cviu.2007. mapping with a single camera. In: Ninth IEEE Interna-
09.014, URL http://dx.doi.org/10.1016/j.cviu. tional Conference on Computer Vision, pp 1403–1410
2007.09.014 vol.2
Beaudet PR (1978) Rotationally invariant image operators. Davison A, Cid YG, Kita N (2004) Real-time 3D SLAM
In: International Conference on Pattern Recognition with wide-angle vision. In: Proc. IFAC Symposium on In-
Benhimane S, Malis E (2007) Homography-based 2D Vi- telligent Autonomous Vehicles, Lisbon
sual Tracking and Servoing. International Journal of Davison AJ, Reid ID, Molton ND, Stasse O (2007)
Robotics Research 26(7):661–676 MonoSLAM: real-time single camera SLAM. Pattern
Bentley JL (1975) Multidimensional Binary Search Trees Analysis and Machine Intelligence (PAMI), IEEE Trans-
Used for Associative Searching. Communications ACM actions on 29(6):1052–67
18(9):509–517 Eade E, Drummond T (2006) Scalable monocular slam. In:
Bourmaud G, Megret R (2015) Robust large scale monocu- Computer Vision and Pattern Recognition, 2006 IEEE
lar visual slam. In: Computer Vision and Pattern Recogni- Computer Society Conference on, vol 1, pp 469–476,
tion (CVPR), 2015 IEEE Conference on, pp 1638–1647, DOI 10.1109/CVPR.2006.263
DOI 10.1109/CVPR.2015.7298772 Eade E, Drummond T (2007) Monocular slam as a graph
Calonder M, Lepetit V, Ozuysal M, Trzcinski T, Strecha of coalesced observations. In: Computer Vision, 2007.
C, Fua P (2012) Brief: Computing a local binary de- ICCV 2007. IEEE 11th International Conference on, pp
scriptor very fast. IEEE Transactions on Pattern Anal- 1–8, DOI 10.1109/ICCV.2007.4409098
ysis and Machine Intelligence 34(7):1281–1298, DOI Engel J, Sturm J, Cremers D (2013) Semi-dense Visual
10.1109/TPAMI.2011.222 Odometry for a Monocular Camera. In: Computer Vi-
Castle RO, Klein G, Murray DW (2008) Video-rate Local- sion (ICCV), IEEE International Conference on, IEEE,
ization in Multiple Maps for Wearable Augmented Real- pp 1449–1456
ity. In: Proc 12th {IEEE} Int Symp on Wearable Comput- Engel J, Schöps T, Cremers D (2014) LSD-SLAM: Large-
ers, pp 15–22 Scale Direct Monocular SLAM. In: Fleet D, Pajdla T,
Celik K, Somani AK (2013) Monocular vision slam for in- Schiele B, Tuytelaars T (eds) Computer Vision – ECCV
door aerial vehicles. Journal of Electrical and Computer 2014 SE - 54, Lecture Notes in Computer Science, vol
Engineering 2013:15, URL http://dx.doi.org/10. 8690, Springer International Publishing, pp 834–849
1155/2013/374165%]374165 Engel J, Stuckler J, Cremers D (2015) Large-scale direct
Civera J, Davison AJ, Montiel JM (2007) Dimen- slam with stereo cameras. In: Intelligent Robots and Sys-
sionless monocular slam. In: Proceedings of the tems (IROS), 2015 IEEE/RSJ International Conference
3rd Iberian Conference on Pattern Recognition on, pp 1935–1942, DOI 10.1109/IROS.2015.7353631
and Image Analysis, Part II, Springer-Verlag, Faugeras O, Lustman F (1988) Motion and structure from
Berlin, Heidelberg, IbPRIA ’07, pp 412–419, motion in a piecewise planar environment. International
DOI 10.1007/978-3-540-72849-8 52, URL http: Journal of Pattern Recognition and Artificial Intelligence
//dx.doi.org/10.1007/978-3-540-72849-8_52 02(03):485–508
Civera J, Davison A, Montiel J (2008) Inverse Depth Fernández-Moral E, Arévalo V, González-Jiménez J (2015)
Parametrization for Monocular SLAM. IEEE Transac- Hybrid Metric-topological Mapping for Large Scale
tions on Robotics 24(5):932–945 Monocular SLAM . Springer International Publishing, pp
Clemente L, Davison A, Reid I, Neira J, Tardós J 217–232
(2007) Mapping Large Loops with a Single Hand- Forster C, Pizzoli M, Scaramuzza D (2014) SVO : Fast
Held Camera. Atlanta, GA, USA, URL http://www. Semi-Direct Monocular Visual Odometry. In: Robotics
roboticsproceedings.org/rss03/p38.html and Automation (ICRA), IEEE International Conference
Concha A, Civera J (2015) Dpptam: Dense piecewise planar on
tracking and mapping from a monocular sequence. In: In- Fuentes-Pacheco J, Ruiz-Ascencio J, Rendón-Mancha JM
telligent Robots and Systems (IROS), 2015 IEEE/RSJ In- (2012) Visual simultaneous localization and mapping:
ternational Conference on, pp 5686–5693, DOI 10.1109/ a survey. Artificial Intelligence Review 43(1):55–81,
22 Georges Younes et al.

DOI 10.1007/s10462-012-9365-8, URL http://dx. Klein G, Murray D (2007) Parallel Tracking and Mapping
doi.org/10.1007/s10462-012-9365-8 for Small AR Workspaces. 6th IEEE and ACM Interna-
Galvez-López D, Tardos JD (2012) Bags of Binary Words tional Symposium on Mixed and Augmented Reality pp
for Fast Place Recognition in Image Sequences. Robotics, 1–10
IEEE Transactions on 28(5):1188–1197, DOI 10.1109/ Klein G, Murray D (2008) Improving the Agility of
TRO.2012.2197158 Keyframe-Based {SLAM}. In: Proc. 10th European Con-
Glover A, Maddern W, Warren M, Reid S, Milford M, ference on Computer Vision (ECCV), Marseille, pp 802–
Wyeth G (2012) OpenFABMAP: An open source tool- 815
box for appearance-based loop closure detection. In: 2012 Kneip L, Siegwart R, Pollefeys M (2012) Finding the
IEEE International Conference on Robotics and Automa- Exact Rotation between Two Images Independently
tion (ICRA), IEEE, pp 4730–4735 of the Translation, Springer Berlin Heidelberg, Berlin,
Grasa O, Bernal E, Casado S, Gil I, Montiel J (2014) Vi- Heidelberg, chap Finding th, pp 696–709. DOI
sual slam for handheld monocular endoscope. Medical 10.1007/978-3-642-33783-3{\ }50, URL http://dx.
Imaging, IEEE Transactions on 33(1):135–146, DOI doi.org/10.1007/978-3-642-33783-3{_}50
10.1109/TMI.2013.2282997 Konolige K (2010) Sparse sparse bundle adjustment. In:
Hall BC (2015) Lie Groups, Lie Algebras, and Representa- Proceedings of the British Machine Vision Conference,
tions, vol 222, number 102 edn. Springer- Verlag BMVA Press, pp 102.1–102.11, doi:10.5244/C.24.102
Harris C, Stephens M (1988) A combined corner and edge Kummerle R, Grisetti G, Strasdat H, Konolige K, Burgard
detector. In: In Proc. of Fourth Alvey Vision Conference, W (2011) G2o: A general framework for graph optimiza-
pp 147–151 tion. In: Robotics and Automation (ICRA), IEEE Interna-
Hartley R, Zisserman A (2003) Multiple View Geometry in tional Conference on, IEEE, pp 3607–3613
Computer Vision. Cambridge University Press Kwon J, Lee KM (2010) Monocular slam with locally pla-
Hartmann J, Klussendorff JH, Maehle E (2013) A compar- nar landmarks via geometric rao-blackwellized particle
ison of feature descriptors for visual SLAM. In: Mobile filtering on lie groups. In: Computer Vision and Pat-
Robots (ECMR), 2013 European Conference on, pp 56– tern Recognition (CVPR), 2010 IEEE Conference on, pp
61, DOI 10.1109/ECMR.2013.6698820 1522–1529, DOI 10.1109/CVPR.2010.5539789
Herrera D, Kannala J, Pulli K, Heikkila J (2014) DT-SLAM: Lee SH (2014) Real-time camera tracking using a parti-
Deferred Triangulation for Robust SLAM. In: 3D Vision, cle filter combined with unscented kalman filters. Jour-
2nd International Conference on, IEEE, vol 1, pp 609– nal of Electronic Imaging 23(1):013,029, DOI 10.1117/1.
616 JEI.23.1.013029, URL http://dx.doi.org/10.1117/
Hietanen A, Lankinen J, Kämäräinen JK, Buch AG, Krüger 1.JEI.23.1.013029
N (2016) A comparison of feature detectors and descrip- Lemaire T, Lacroix S (2007) Monocular-vision based slam
tors for object class matching. Neurocomputing using line segments. In: Robotics and Automation, 2007
Hochdorfer S, Schlegel C (2009) Towards a robust visual IEEE International Conference on, pp 2791–2796, DOI
slam approach: Addressing the challenge of life-long op- 10.1109/ROBOT.2007.363894
eration. In: Advanced Robotics, 2009. ICAR 2009. Inter- Lepetit V, Moreno-Noguer F, Fua P (2009) EPnP: An Accu-
national Conference on, pp 1–6 rate O(n) Solution to the PnP Problem. International Jour-
Holmes SA, Klein G, Murray DW (2008) A square root un- nal of Computer Vision 81(2):155–166
scented kalman filter for visual monoslam. IEEE Trans- Leutenegger S, Chli M, Siegwart RY (2011) Brisk: Bi-
actions on Pattern Analysis and Machine Intelligence nary robust invariant scalable keypoints. In: Computer Vi-
31(7):1251–1263, DOI http://doi.ieeecomputersociety. sion (ICCV), 2011 IEEE International Conference on, pp
org/10.1109/TPAMI.2008.189 2548–2555, DOI 10.1109/ICCV.2011.6126542
Horn BKP (1987) Closed-form solution of absolute orienta- Lim H, Lim J, Kim HJ (2014) Real-time 6-DOF monocular
tion using unit quaternions. Journal of the Optical Society visual SLAM in a large-scale environment. In: Robotics
of America A 4(4):629–642 and Automation (ICRA), IEEE International Conference
Jeong W, Lee KM (2005) Cv-slam: a new ceiling vision- on, pp 1532–1539
based slam technique. In: Intelligent Robots and Sys- Lim J, Frahm JM, Pollefeys M (2011) Online environment
tems, 2005. (IROS 2005). 2005 IEEE/RSJ International mapping. In: Computer Vision and Pattern Recognition
Conference on, pp 3195–3200, DOI 10.1109/IROS.2005. (CVPR), 2011 IEEE Conference on, pp 3489–3496, DOI
1545443 10.1109/CVPR.2011.5995511
Jérôme Martin JLC (1995) Experimental Comparison of Lindeberg T (1998) Feature detection with automatic scale
Correlation Techniques. In: IAS-4, International Confer- selection. Int J Comput Vision 30(2):79–116, DOI 10.
ence on Intelligent Autonomous Systems 1023/A:1008045108935, URL http://dx.doi.org/
A survey on non-filter-based monocular Visual SLAM systems 23

10.1023/A:1008045108935 Mur-Artal R, Tardós JD (2014) Fast relocalisation and loop

Lowe D (1999) Object recognition from local scale- closing in keyframe-based slam. In: Robotics and Au-
invariant features. In: International Conference on Com- tomation (ICRA), 2014 IEEE International Conference
puter Vision (ICCV), the seventh IEEE, IEEE, vol 2, pp on, pp 846–853, DOI 10.1109/ICRA.2014.6906953
1150–1157 vol.2 Mur-Artal R, Montiel JMM, Tardos JD (2015) ORB-SLAM:
Lowe DG (2004) Distinctive image features from scale- A Versatile and Accurate Monocular SLAM System.
invariant keypoints. Int J Comput Vision 60(2):91– IEEE Transactions on Robotics PP(99):1–17
110, DOI 10.1023/B:VISI.0000029664.99615.94, Newcombe RA, Davison AJ (2010) Live dense reconstruc-
URL http://dx.doi.org/10.1023/B:VISI. tion with a single moving camera. In: Computer Vision
0000029664.99615.94 and Pattern Recognition (CVPR), 2010 IEEE Conference
Lucas BD, Kanade T (1981) An Iterative Image Registra- on, IEEE, pp 1498–1505
tion Technique with an Application to Stereo Vision. In: Newcombe RA, Lovegrove SJ, Davison AJ (2011) Dtam:
International Joint Conference on Artificial Intelligence - Dense tracking and mapping in real-time. In: Proceed-
Volume 2, Morgan Kaufmann Publishers Inc., San Fran- ings of the 2011 International Conference on Computer
cisco, CA, USA, IJCAI’81, pp 674–679 Vision, IEEE Computer Society, Washington, DC, USA,
Mahon I, Williams SB, Pizarro O, Johnson-Roberson M ICCV ’11, pp 2320–2327, DOI 10.1109/ICCV.2011.
(2008) Efficient view-based slam using visual loop clo- 6126513, URL http://dx.doi.org/10.1109/ICCV.
sures. IEEE Transactions on Robotics 24(5):1002–1014, 2011.6126513
DOI 10.1109/TRO.2008.2004888 Nistér D (2004) An efficient solution to the five-point rela-
Mair E, Hager GD, Burschka D, Suppa M, Hirzinger G tive pose problem. Pattern Analysis and Machine Intelli-
(2010) Adaptive and Generic Corner Detection Based gence (PAMI), IEEE Transactions on 26(6):756–77
on the Accelerated Segment Test. In: Proceedings of the Pinies P, Tardos J (2008) Large-scale slam building condi-
European Conference on Computer Vision (ECCV’10), tionally independent local maps: Application to monoc-
DOI 10.1007/978-3-642-15552-9{\ }14 ular vision. Robotics, IEEE Transactions on 24(5):1094–
Matas J, Chum O, Urban M, Pajdla T (2002) Robust wide 1106, DOI 10.1109/TRO.2008.2004636
baseline stereo from maximally stable extremal regions. Pirchheim C, Reitmayr G (2011) Homography-based planar
In: Proc. BMVC, pp 36.1–36.10, doi:10.5244/C.16.36 mapping and tracking for mobile phones. In: Mixed and
Meltzer J, Gupta R, Yang MH, Soatto S (2004) Simul- Augmented Reality (ISMAR), 2011 10th IEEE Interna-
taneous localization and mapping using multiple view tional Symposium on, pp 27–36, DOI 10.1109/ISMAR.
feature descriptors. In: Intelligent Robots and Systems, 2011.6092367
2004. (IROS 2004). Proceedings. 2004 IEEE/RSJ Inter- Pirchheim C, Schmalstieg D, Reitmayr G (2013) Han-
national Conference on, vol 2, pp 1550–1555 vol.2, DOI dling pure camera rotation in keyframe-based slam. In:
10.1109/IROS.2004.1389616 Mixed and Augmented Reality (ISMAR), 2013 IEEE In-
Migliore D, Rigamonti R, Marzorati D, Matteucci M, Sor- ternational Symposium on, pp 229–238, DOI 10.1109/
renti DG (2009) Use a single camera for simultaneous lo- ISMAR.2013.6671783
calization and mapping with mobile object tracking in dy- Pirker K (2010) Histogram of oriented cameras - a new
namic environments descriptor for visual slam in dynamic environments. In:
Moranna Ra, Martin RD, Yohai VJ (2006) Robust Statistics. Proceedings of the British Machine Vision Conference,
Wiley BMVA Press, pp 76.1–76.12, doi:10.5244/C.24.76
Moreels P, Perona P (2007) Evaluation of features detectors Pirker K, Rüther M, Bischof H (2011) CD SLAM - Con-
and descriptors based on 3d objects. Int J Comput Vision tinuous localization and mapping in a dynamic world. In:
73(3):263–284, DOI 10.1007/s11263-006-9967-1, URL IEEE International Conference on Intelligent Robots Sys-
http://dx.doi.org/10.1007/s11263-006-9967-1 tems (IROS), IEEE, pp 3990–3997
Mouragnon E, Lhuillier M, Dhome M, Dekeyser F, Sayd Pretto A, Menegatti E, Pagello E (2011) Omnidirectional
P (2006) Real time localization and 3d reconstruction. dense large-scale mapping and navigation based on
In: Computer Vision and Pattern Recognition, 2006 IEEE meaningful triangulation. In: Robotics and Automation
Computer Society Conference on, vol 1, pp 363–370, (ICRA), 2011 IEEE International Conference on, pp
DOI 10.1109/CVPR.2006.236 3289–3296, DOI 10.1109/ICRA.2011.5980206
Muja M, Lowe DG (2009) Fast approximate nearest neigh- Pupilli M, Calway A (2005) Real-time camera tracking us-
bors with automatic algorithm configuration. In: In VIS- ing a particle filter. In: In Proc. British Machine Vision
APP International Conference on Computer Vision The- Conference, pp 519–528
ory and Applications, pp 331–340 Rey-Otero I, Delbracio M, Morel J (2014) Comparing fea-
ture detectors: A bias in the repeatability criteria, and
24 Georges Younes et al.

how to correct it. CoRR abs/1409.2465, URL http:// Weiss S, Achtelik MW, Lynen S, Achtelik MC, Kneip
arxiv.org/abs/1409.2465 L, Chli M, Siegwart R (2013) Monocular vision for
Rosten E, Drummond T (2006) Machine Learning for High- long-term micro aerial vehicle state estimation: A com-
speed Corner Detection. In: 9th European Conference on pendium. Journal of Field Robotics 30(5):803–831,
Computer Vision - Volume Part I, Proceedings of the, DOI 10.1002/rob.21466, URL http://dx.doi.org/
Springer-Verlag, Berlin, Heidelberg, ECCV’06, pp 430– 10.1002/rob.21466
443 Williams B, Reid I (2010) On combining visual slam and
Rublee E, Rabaud V, Konolige K, Bradski G (2011) ORB: visual odometry. In: Proc. International Conference on
An efficient alternative to SIFT or SURF. In: International Robotics and Automation
Conference on Computer Vision (ICCV), pp 2564–2571 Yousif K, Bab-Hadiashar A, Hoseinnezhad R (2015) An
Scaramuzza D, Fraundorfer F (2011) Visual odometry [tuto- overview to visual odometry and visual slam: Applica-
rial]. IEEE Robotics Automation Magazine 18(4):80–92, tions to mobile robotics. Intelligent Industrial Systems
DOI 10.1109/MRA.2011.943233 1(4):289–311
Shi J, Tomasi C (1994) Good features to track. In: Com- Zhou H, Zou D, Pei L, Ying R, Liu P, Yu W (2015) Struct-
puter Vision and Pattern Recognition, 1994. Proceedings slam: Visual slam with building structure lines. Vehicu-
CVPR ’94., 1994 IEEE Computer Society Conference on, lar Technology, IEEE Transactions on 64(4):1364–1375,
pp 593–600 DOI 10.1109/TVT.2015.2388780
Silveira G, Malis E, Rives P (2008) An efficient direct ap-
proach to visual slam. Robotics, IEEE Transactions on
24(5):969–979, DOI 10.1109/TRO.2008.2004829
Smith P, Reid I, Davison A (2006) Real-time monocular
slam with straight lines. pp 17–26, URL http://hdl.
handle.net/10044/1/5648
Strasdat H, Montiel J, Davison A (2010a) Scale drift-aware
large scale monocular slam. The MIT Press, URL http:
//www.roboticsproceedings.org/rss06/
Strasdat H, Montiel JMM, Davison AJ (2010b) Real-time
monocular SLAM: Why filter? In: Robotics and Automa-
tion (ICRA), IEEE International Conference on, pp 2657–
2664
Strasdat H, Davison AJ, Montiel JMM, Konolige K (2011)
Double Window Optimisation for Constant Time Visual
SLAM. In: International Conference on Computer Vision,
Proceedings of the, IEEE Computer Society, Washington,
DC, USA, ICCV ’11, pp 2352–2359
Tan W, Liu H, Dong Z, Zhang G, Bao H (2013) Robust
monocular SLAM in dynamic environments. 2013 IEEE
International Symposium on Mixed and Augmented Re-
ality (ISMAR) pp 209–218
Tomasi C, Kanade T (1991) Detection and Tracking of Point
Features. Tech. rep., International Journal of Computer
Vision
Torr P, Fitzgibbon A, Zisserman A (1999) The Problem
of Degeneracy in Structure and Motion Recovery from
Uncalibrated Image Sequences. International Journal of
Computer Vision 32(1):27–44
Torr PHS, Zisserman A (2000) MLESAC. Computer Vision
and Image Understanding 78(1):138–156
Wagner D, Reitmayr G, Mulloni A, Drummond T, Schmal-
stieg D (2010) Real-time detection and tracking for aug-
mented reality on mobile phones. Visualization and Com-
puter Graphics, IEEE Transactions on 16(3):355–368,
DOI 10.1109/TVCG.2009.99

View publication stats

Kinetek Controlador Fusion K mc1000 Ingles
No ratings yet
Kinetek Controlador Fusion K mc1000 Ingles
149 pages
Recent Advances in Simultaneous Localiza
No ratings yet
Recent Advances in Simultaneous Localiza
34 pages
How Nerfs and 3D Gaussian Splatting Are Reshaping Slam: A Survey
No ratings yet
How Nerfs and 3D Gaussian Splatting Are Reshaping Slam: A Survey
31 pages
Visual Simultaneous Localization and Mapping - A Survey
No ratings yet
Visual Simultaneous Localization and Mapping - A Survey
33 pages
electronics-12-02006
No ratings yet
electronics-12-02006
29 pages
State of The Art in Vision-Based Localization Tech
No ratings yet
State of The Art in Vision-Based Localization Tech
30 pages
typeinst
No ratings yet
typeinst
29 pages
saputra2018
No ratings yet
saputra2018
36 pages
Adaptive Monocular Visual-Inertial SLAM For Real-T
No ratings yet
Adaptive Monocular Visual-Inertial SLAM For Real-T
25 pages
Deep Learning Techniques For Visual SLAM A Survey
No ratings yet
Deep Learning Techniques For Visual SLAM A Survey
25 pages
Multiperspective 3 DPanoramas
No ratings yet
Multiperspective 3 DPanoramas
22 pages
Robotics 11 00024 With Cover
No ratings yet
Robotics 11 00024 With Cover
28 pages
Vdo Slam
No ratings yet
Vdo Slam
15 pages
SVM Active Learning Approach For Image Classification Using Spatial Information
No ratings yet
SVM Active Learning Approach For Image Classification Using Spatial Information
19 pages
OMNIVIS09
No ratings yet
OMNIVIS09
9 pages
Dynamic SLAM A Visual SLAM in Outdoor Dynamic Scen
No ratings yet
Dynamic SLAM A Visual SLAM in Outdoor Dynamic Scen
15 pages
Gil 2010
No ratings yet
Gil 2010
13 pages
2012-Dense Reconstruction On The Fly
No ratings yet
2012-Dense Reconstruction On The Fly
9 pages
Structure and Motion From Image Sequences
No ratings yet
Structure and Motion From Image Sequences
9 pages
Articulo
No ratings yet
Articulo
10 pages
Article Soumis V1 PDF
No ratings yet
Article Soumis V1 PDF
14 pages
Teaching Architectural History Through Virtual Reality
No ratings yet
Teaching Architectural History Through Virtual Reality
7 pages
Mary Project
No ratings yet
Mary Project
7 pages
OpenSeqSLAM (Matlab)
No ratings yet
OpenSeqSLAM (Matlab)
7 pages
Autonomous Design Report ASURT FS AI 2021 v4
No ratings yet
Autonomous Design Report ASURT FS AI 2021 v4
6 pages
A Review of Monocular Visual Odometry: The Visual Computer June 2019
No ratings yet
A Review of Monocular Visual Odometry: The Visual Computer June 2019
16 pages
A survey of state-of-the-art on visual SLAM - 1-s2.0-S0957417422010156-main
No ratings yet
A survey of state-of-the-art on visual SLAM - 1-s2.0-S0957417422010156-main
17 pages
Coslam: Collaborative Visual Slam in Dynamic Environments: Danping Zou and Ping Tan, Member, Ieee
No ratings yet
Coslam: Collaborative Visual Slam in Dynamic Environments: Danping Zou and Ping Tan, Member, Ieee
13 pages
hcis21-030
No ratings yet
hcis21-030
11 pages
117 FullManuscript 495 2 10 20220402
No ratings yet
117 FullManuscript 495 2 10 20220402
13 pages
Visual SLAM algorithms and their application for AR, mapping, localization
No ratings yet
Visual SLAM algorithms and their application for AR, mapping, localization
14 pages
Active Monocular LocalizationTowards Autonomous Monocular Exploration For Multirotor MAVs
No ratings yet
Active Monocular LocalizationTowards Autonomous Monocular Exploration For Multirotor MAVs
8 pages
A Robust Deep Learning Enhanced Monocular SLAM System For Dynamic Environments
No ratings yet
A Robust Deep Learning Enhanced Monocular SLAM System For Dynamic Environments
8 pages
PID42646391
No ratings yet
PID42646391
6 pages
Application of machine vision image feature
No ratings yet
Application of machine vision image feature
9 pages
sensors-23-07921-v2
No ratings yet
sensors-23-07921-v2
17 pages
A_flexible_and_scalable_SLAM_system_with_full_3D_m
No ratings yet
A_flexible_and_scalable_SLAM_system_with_full_3D_m
7 pages
Velodyne_SLAM
No ratings yet
Velodyne_SLAM
7 pages
Simultaneous Localization & Mapping of Textured Surfaces
No ratings yet
Simultaneous Localization & Mapping of Textured Surfaces
1 page
Lesson 1.3 Computer Hardware and Software
No ratings yet
Lesson 1.3 Computer Hardware and Software
37 pages
Sensors: A Robust Approach For A Filter-Based Monocular Simultaneous Localization and Mapping (SLAM) System
No ratings yet
Sensors: A Robust Approach For A Filter-Based Monocular Simultaneous Localization and Mapping (SLAM) System
22 pages
The SLAM Problem a Survey
No ratings yet
The SLAM Problem a Survey
10 pages
Automatic Driving Technology Based on SLAM Technol
No ratings yet
Automatic Driving Technology Based on SLAM Technol
6 pages
IEEE-Review-VSLAM
No ratings yet
IEEE-Review-VSLAM
1 page
Map Comparison Lidar v9
No ratings yet
Map Comparison Lidar v9
6 pages
Neuromorphic Visual Odometry System For Intelligent Vehicle Application With Bio-Inspired Vision Sensor
No ratings yet
Neuromorphic Visual Odometry System For Intelligent Vehicle Application With Bio-Inspired Vision Sensor
8 pages
Sensors-24-01274 From Pixels To Precision A Survey of Monocular Visual
No ratings yet
Sensors-24-01274 From Pixels To Precision A Survey of Monocular Visual
17 pages
e3sconf_wfces2024_05004
No ratings yet
e3sconf_wfces2024_05004
10 pages
V-SLAM: Vision-Based Simultaneous Localization and Map Building For An Autonomous Mobile Robot
No ratings yet
V-SLAM: Vision-Based Simultaneous Localization and Map Building For An Autonomous Mobile Robot
6 pages
Applsci 09 02105 PDF
No ratings yet
Applsci 09 02105 PDF
17 pages
Detection of Floor From Image Captured by Camera by Izaak Van Crombrugge, Luc Mertens, Rudi Penne1
No ratings yet
Detection of Floor From Image Captured by Camera by Izaak Van Crombrugge, Luc Mertens, Rudi Penne1
9 pages
Uw Cse 11 02 02 PDF
No ratings yet
Uw Cse 11 02 02 PDF
8 pages
Vision-Based Mobile Robot Localization and Mapping Using Scale-Invariant Features
No ratings yet
Vision-Based Mobile Robot Localization and Mapping Using Scale-Invariant Features
8 pages
MonoSLAM Real-Time Single Camera SLAM
No ratings yet
MonoSLAM Real-Time Single Camera SLAM
16 pages
Camera-based simultaneous localization and mapping: methods, camera types, and deep learning trends
No ratings yet
Camera-based simultaneous localization and mapping: methods, camera types, and deep learning trends
11 pages
Enabling Robust Visual Navigation: PHD Research Proposal
No ratings yet
Enabling Robust Visual Navigation: PHD Research Proposal
14 pages
GRED HD User Manual
No ratings yet
GRED HD User Manual
107 pages
Visual Slam
No ratings yet
Visual Slam
16 pages
Real-Time 6-DOF Monocular Visual SLAM in A Large-Scale Environment
No ratings yet
Real-Time 6-DOF Monocular Visual SLAM in A Large-Scale Environment
8 pages
Camera Motion PDF
No ratings yet
Camera Motion PDF
13 pages
Building Mapping 3
No ratings yet
Building Mapping 3
9 pages
VIJAY BRAOU-UG Semester-I Assignments
No ratings yet
VIJAY BRAOU-UG Semester-I Assignments
29 pages
Mans io-AbtAejs9
No ratings yet
Mans io-AbtAejs9
91 pages
Afs Andrew File System
No ratings yet
Afs Andrew File System
28 pages
ISIS Manual
No ratings yet
ISIS Manual
178 pages
DX Diag
No ratings yet
DX Diag
34 pages
Ict Syllabus For Primary School Students
No ratings yet
Ict Syllabus For Primary School Students
12 pages
DWIN COF Screen User Guide
No ratings yet
DWIN COF Screen User Guide
19 pages
MARIE: An Introduction To A Simple Computer
No ratings yet
MARIE: An Introduction To A Simple Computer
68 pages
9th Grade Digital Document
No ratings yet
9th Grade Digital Document
3 pages
Manual Coriolis Interface Module For Roc800 Series Floboss 107 Flow Managers Micro Motion en 132276 PDF
No ratings yet
Manual Coriolis Interface Module For Roc800 Series Floboss 107 Flow Managers Micro Motion en 132276 PDF
82 pages
2d Arcade Game
No ratings yet
2d Arcade Game
20 pages
Fujitsu Siemens - Primergy - tx1320 m2 d3373
No ratings yet
Fujitsu Siemens - Primergy - tx1320 m2 d3373
8 pages
D2230311
No ratings yet
D2230311
10 pages
1Z0-820-Demo
No ratings yet
1Z0-820-Demo
6 pages
Computer Fundamentals 15 Marks
No ratings yet
Computer Fundamentals 15 Marks
7 pages
TVL-ICT (Computer System Servicing) Activity Sheet Quarter 2 - Lesson 3
No ratings yet
TVL-ICT (Computer System Servicing) Activity Sheet Quarter 2 - Lesson 3
24 pages
Quiz Mid Test
No ratings yet
Quiz Mid Test
3 pages
Led Blink With Switch: 1152EC255-Embedded C Laboratory
No ratings yet
Led Blink With Switch: 1152EC255-Embedded C Laboratory
7 pages
Achieving Secure, Scalable, and Fine-Grained Data Access Control in Cloud Computing
No ratings yet
Achieving Secure, Scalable, and Fine-Grained Data Access Control in Cloud Computing
49 pages
Overview 30 60 90 Days Action Plan Powerpoint Slide Themes Powerpoint Templates
No ratings yet
Overview 30 60 90 Days Action Plan Powerpoint Slide Themes Powerpoint Templates
7 pages
Fariz Rachman Yusuf
No ratings yet
Fariz Rachman Yusuf
1 page
CV - Chelsea Chen Xi
No ratings yet
CV - Chelsea Chen Xi
4 pages
Satellite C645 Detailed Product Specification: Genuine
No ratings yet
Satellite C645 Detailed Product Specification: Genuine
3 pages
Taurus Series Multimedia Player TB3 Specifications V1.6.5
No ratings yet
Taurus Series Multimedia Player TB3 Specifications V1.6.5
9 pages
Panasonic VRF User Guide
No ratings yet
Panasonic VRF User Guide
13 pages
P.A.Hilton LTD Product Data Sheet: HSM34 - Creep Testing Machine
No ratings yet
P.A.Hilton LTD Product Data Sheet: HSM34 - Creep Testing Machine
3 pages
Comp8 - Quarter 4 Module 2-3
No ratings yet
Comp8 - Quarter 4 Module 2-3
4 pages
Optical Flow: Exploring Dynamic Visual Patterns in Computer Vision
From Everand
Optical Flow: Exploring Dynamic Visual Patterns in Computer Vision
Fouad Sabry
No ratings yet
Underwater Computer Vision: Exploring the Depths of Computer Vision Beneath the Waves
From Everand
Underwater Computer Vision: Exploring the Depths of Computer Vision Beneath the Waves
Fouad Sabry
No ratings yet

A Survey On Non-Filter-Based Monocular Visual SLAM

Uploaded by

A Survey On Non-Filter-Based Monocular Visual SLAM

Uploaded by

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

A survey on non-ﬁlter-based monocular Visual SLAM systems

Article · July 2016

Georges Younes Daniel Asmar

SEE PROFILE SEE PROFILE

Measurement of Soil Water Content View project

motion planning View project

The user has requested enhancement of the downloaded file.

A survey on non-filter-based monocular Visual SLAM systems

Received: date / Accepted: date

Frame Poses Frame Poses

Map Landmarks Map Landmarks

(a) Filter based systems (b) Non-filter based systems

Year Name Method Type Reference

3.2 Data association

Table 3 Feature extractors and descriptors. Abbreviations:local patch of pixels (L.P.P.)

PTAM SVO DT SLAM LSD SLAM ORB SLAM DPPTAM

SVO. SVO generates a five level pyramid representation

Process Keyframe: For every Extract and Is 2nd

Fig. 6 Generic initialization pipeline.

ing the median of the baseline between features is not a good

Previous Data association

Fig. 8 Generic pose estimation pipeline.

3.5 Map generation

The map generation module represents the world as a dense

Data association Triangulate new

Fig. 10 Generic map generation pipeline.

Establish data association Outlier

Fig. 13 Generic map maintenance pipeline.

keyframe in the co-visibility graph and all other keyframes

Table 10 Closed source Visual SLAM systems

– Real-time direct image alignment

– Introduced similarity transforms to visual SLAM

– Modeled the environment as a 3D piecewise smooth surface.

– Handle short and long term environmental changes

– Handle occlusions and slowly varying scenes and illumination changes

(CVPR), 2012 IEEE Conference on, pp 510–517, DOI IROS.2015.7354184

10.1023/A:1008045108935 Mur-Artal R, Tardós JD (2014) Fast relocalisation and loop

View publication stats

You might also like