Visual Simultaneous Localization and Mapping - A Survey
Ruiz-Ascencio José.
Centro Nacional de Investigación y Desarrollo Tecnológico, Morelos, México.
[email protected]
Abstract: Visual SLAM (Simultaneous Localization and Mapping) refers to the problem of using
images, as the only source of external information, to establish the position of a robot, a
vehicle, or a moving camera in an environment and, at the same time, to construct a representation
of the explored zone. SLAM is an essential task for the autonomy of a robot. Nowadays, the
problem of SLAM is considered solved when range sensors such as lasers or sonar are used to
build 2D maps of small static environments. However, SLAM for dynamic, complex and large-scale
environments, using vision as the sole external sensor, is an active area of research. The
computer vision techniques employed in visual SLAM, such as detection, description and
matching of salient features, and image recognition and retrieval, among others, are still susceptible
to improvement. The objective of this article is to provide new researchers in the field of visual
SLAM with a brief and comprehensible review of the state of the art.
Keywords: visual SLAM, salient feature selection, image matching, data association,
topological and metric maps.
1. Introduction
The problem of autonomous navigation of mobile robots is divided into three
main areas: localization, mapping and path planning (Cyrill 2009). Localization
consists in determining exactly the current pose of the robot in its
environment (Where am I?). Mapping integrates the partial observations of the
surroundings into a single consistent model (What does the world look like?), and
path planning determines the best route through the map for navigating the
environment (How can I reach a given location?).
Localization and mapping are interdependent: in order for a robot to be properly localized
in an environment, a correct map is necessary, but in order to construct a good
map it is necessary to be properly localized when elements are added to the map.
Currently, this problem is known as Simultaneous Localization and Mapping
(SLAM). When cameras are employed as the only exteroceptive sensor, it is
called visual SLAM. The terms vision-based SLAM (Se et al. 2005; Lemaire et al.
2007 ) or vSLAM (Solà 2007) are also used. In this article the term visual SLAM
is used because it is the best known. Visual SLAM systems can be complemented
with information from proprioceptive sensors, with the aim of increasing accuracy
and robustness. This approach is known as visual-inertial SLAM (Jones and Soatto
2011). However, when vision is used as the only system of perception (without
making use of information extracted from the robot odometry or inertial sensors)
it can be called vision-only SLAM (Paz et al. 2008; Davison et al. 2007) or
camera-only SLAM (Milford and Wyeth 2008).
Many visual SLAM systems fail when operating under the following conditions: in
outdoor environments, in dynamic environments, in environments with too many
or too few salient features, in large-scale environments, during erratic
movements of the camera, and when partial or total occlusions of the sensor occur.
A key to a successful visual SLAM system is the ability to operate correctly
despite these difficulties.
In this article a detailed study of visual SLAM is presented, covering the most
recent contributions and various open problems. Previously, Durrant and Bailey
presented a tutorial divided into two parts that summarizes the SLAM problem
(Durrant and Bailey 2006; Bailey and Durrant 2006). The latter tutorial describes
works that are centered on the use of laser range-finder sensors, building 2D maps
under a probabilistic approach. Similarly, Thrun and Leonard (2008) presented an
introduction to the SLAM problem, analyzed three paradigms of solution (the first
is based on the Extended Kalman Filter, and the other two use optimization
techniques based on graphs and particle filters) and proposed a taxonomy of the
problem. Nevertheless, the above-mentioned articles are not focused on methods
using vision as the only external sensor. On the other hand, Kragic and Vincze
(2009) present a review of computer vision for robotics in a general context,
considering the visual SLAM problem but not in the detail intended in this
article.
This article is structured in the following way: Section 2 describes the SLAM
problem in general. In section 3, the use of cameras as the only external sensor is
discussed and the weak points of such systems are mentioned. Section 4 describes
the type of salient features that can be extracted and the descriptors used to
achieve invariance to various transformations that the images may suffer. Section
5 deals with image matching and the data association problem. Section 6 gives a
detailed review of the different methods to solve the visual SLAM problem, and
the weaknesses and strengths of each are discussed. The different ways of
representing the observed world are described in section 7. Section 8 provides
conclusions and potential problems for further investigation. Finally, section 9
presents bibliographic references.
2. The SLAM problem

In order to build a map of the environment, the entity must possess sensors that
allow it to perceive and obtain measurements of the elements of the
surrounding world. These sensors are classified as exteroceptive and
proprioceptive. Among the exteroceptive sensors it is possible to find: sonar
(Tardós et al. 2002; Ribas et al. 2008), range lasers (Nüchter et al. 2007; Thrun et
al. 2006), cameras (Se et al. 2005; Lemaire et al. 2007; Davison 2003; Bogdan et
al. 2009) and global positioning systems (GPS) (Thrun et al. 2005a). All of these
sensors are noisy and have limited range capabilities. In addition, only local views
of the environment can be obtained using the first three aforementioned sensors.
Laser sensors and sonar provide precise and very dense information about the
structure of the environment. Nevertheless, they have the following problems: they are not useful
in highly cluttered environments or for recognizing objects; and both are expensive,
heavy and bulky pieces of equipment, making their use difficult for
airborne robots or humanoids. On the other hand, a GPS sensor does not work
well in narrow streets (urban canyons), under water, or on other planets, and
is often not available indoors.
Works also exist that make use of multi-camera rigs, with or without overlap
between the views (Kaess and Dellaert 2010; Carrera et al. 2011), and cameras
with special lenses, such as wide-angle (Davison et al. 2004) or omnidirectional
(Scaramuzza and Siegwart 2008) lenses, with the goal of increasing the visual range and thus
decreasing, to some extent, the cumulative error of pose estimation. Recently, RGB-
D (color images and depth maps) sensors have been used to map indoor
environments (Huang et al. 2011), proving to be a promising alternative for
SLAM applications.
If the camera calibration is performed off-line, then it is assumed that the intrinsic
properties of the camera will not change during the entire period of operation of
the SLAM system. This is the most popular option, since it reduces the number of
parameters calculated online. Nevertheless, the intrinsic camera parameters may
change due to environmental factors, such as humidity
or temperature. Furthermore, a robot that works in real-world conditions can be hit
or damaged, which could invalidate the previously acquired calibration (Koch et
al. 2010).
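For reference, the intrinsic properties estimated by such a calibration are usually collected in the matrix of the pinhole camera model (Hartley and Zisserman 2003); in standard notation,

```latex
K = \begin{pmatrix}
      f_x & s   & c_x \\
      0   & f_y & c_y \\
      0   & 0   & 1
    \end{pmatrix}
```

where f_x and f_y are the focal lengths in pixels, (c_x, c_y) is the principal point and s is the skew (commonly taken as zero). These are the parameters assumed to stay constant when calibration is performed off-line.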
The idea of utilizing one camera has become popular since the emergence of
single camera SLAM or MonoSLAM (Davison 2003). This is probably also
because it is now easier to access a single camera than a stereo pair, through cell
phones, personal digital assistants or personal computers. This monocular
approach offers a very simple, flexible and economic solution in terms of
hardware and processing times.
Even though many contributions have been made to visual SLAM, there are still
many problems. The solutions proposed for the visual SLAM problem are
reviewed in section 6. Many visual SLAM systems suffer from large accumulated
errors while the environment is being explored (or fail completely in visually
complex environments), which leads to inconsistent estimates of robot position
and totally incongruous maps. Three primary reasons exist:
1) First, it is generally assumed that camera movement is smooth and that the
appearance of salient features will be consistent (Davison 2003; Nistér et
al. 2004), but in general this is not true. These assumptions are closely related
to the choice of the salient feature detector and of the matching technique used.
This causes inaccuracy in the camera position estimate when capturing images with little
texture or images blurred by rapid movements of the sensor (e.g. due to
vibration or quick changes of direction) (Pupilli and Calway 2006). These
phenomena are typical when the camera is carried by a person, a humanoid robot,
or a quad-rotor helicopter, among others. One way of alleviating this problem to
some extent is the use of keyframes (see Appendix I) (Mouragnon et al. 2006;
Klein and Murray 2008). Alternatively, Pretto et al. (2007) and Mei et al. (2008)
analyze the problem of real-time visual tracking over image sequences blurred
by camera motion.
2) Second, most researchers assume that the environment to be explored is static,
containing only stationary and rigid elements; in reality, the majority of
environments contain people and objects in motion. If this is not taken into account, the
moving elements will produce false matches and consequently generate
unpredictable errors throughout the system. The first approaches to this problem were
proposed by Wang et al. (2007), Wangsiripitak and Murray (2009), Migliore et al.
(2009), and Lin and Wang (2010).
3) Third, the world is visually repetitive. There are many similar textures, such as
repeated architectural elements, foliage, and walls of brick or stone. Some
objects, such as traffic signs, also appear repeatedly within urban outdoor
environments. This makes it difficult to recognize a previously explored area and
to perform SLAM over large extents of terrain.
4. Feature extraction

The salient features that are easiest to locate are those produced by artificial
landmarks (Frintrop and Jensfelt 2008). These landmarks are added intentionally
to the environment with the purpose of serving as an aid for navigation, e.g. red
squares or circles situated on the floor or walls. These landmarks have the
advantage that their appearance is known in advance, making them easy to detect
at any time. However, the environment has to be prepared by a person before the
system is initialized. Natural landmarks are those that already exist in the
environment (Se et al. 2002). In indoor environments it is common to use the
corners of doors or windows as natural landmarks. In outdoor environments, tree
trunks (Asmar 2006), regions (Matas et al. 2002), or interest points (Lowe 2004)
are used. An interest point is an image pixel whose neighborhood makes it easy
to distinguish from other pixels using a given detector.
A good-quality feature has the following properties: it is notable (easy to
extract), precise (it can be measured with precision), and invariant to rotation,
translation, scale and illumination changes (Lemaire et al. 2007). Therefore, a
good-quality landmark has a similar appearance from different viewpoints in 3D
space. The salient feature extraction process is composed of two phases: detection
and description. The detection consists in processing the image to obtain a
number of salient elements. The description consists in building a feature vector
based on the visual appearance in the image. Invariance of the descriptor to
changes in position and orientation makes it possible to improve the image matching
and data association processes (described in section 5).
4.1. Detectors
There is a large number of salient feature detectors. Some examples are: the Harris
corner detector (Harris and Stephens 1988); the Harris-Laplace and Hessian-Laplace
point detectors, as well as their respective affine-invariant versions, Harris-
Affine and Hessian-Affine (Mikolajczyk and Schmid 2002); the Difference of
Gaussians (DoG) used in SIFT (Scale Invariant Feature Transform) (Lowe 2004);
Maximally Stable Extremal Regions (MSERs) (Matas et al. 2002); FAST (Features
from Accelerated Segment Test) (Rosten and Drummond 2006); and the Fast-
Hessian used in SURF (Speeded Up Robust Features) (Bay et al. 2006).
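To make the first of these concrete, the sketch below computes the Harris and Stephens (1988) corner response with NumPy/SciPy; the smoothing scale and the constant k = 0.04 are conventional choices assumed for the example, not values prescribed by this survey.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(image, sigma=1.5, k=0.04):
    """Harris corner response R = det(M) - k * trace(M)^2 at every pixel,
    where M is the Gaussian-smoothed second-moment matrix of the gradients."""
    img = image.astype(float)
    ix = sobel(img, axis=1)            # horizontal image gradient
    iy = sobel(img, axis=0)            # vertical image gradient
    sxx = gaussian_filter(ix * ix, sigma)
    syy = gaussian_filter(iy * iy, sigma)
    sxy = gaussian_filter(ix * iy, sigma)
    det_m = sxx * syy - sxy ** 2
    trace_m = sxx + syy
    return det_m - k * trace_m ** 2    # large positive responses mark corners
```

Corners are then taken as local maxima of the response above a threshold; detectors such as FAST replace this convolution-based pipeline with direct intensity tests on a circle of pixels, which is why they run much faster.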
Mikolajczyk et al. (2005) evaluated the performance of these
algorithms with respect to viewpoint, zoom, rotation, blur, JPEG
compression and lighting changes. The Hessian-Affine and MSER detectors had
the best performance: MSER was the most robust with respect to viewpoint and
lighting changes, and Hessian-Affine was the best in the presence of blur
and JPEG compression. In (Tuytelaars and Mikolajczyk 2008) these
detectors and some others are classified according to their
repeatability, precision, robustness, efficiency and invariance properties.
The majority of visual SLAM systems use corners as landmarks, due to their
invariance properties and their extensive study in the computer vision literature. However,
Eade and Drummond (2006a) propose the use of edge segments, called edgelets, in a
real-time MonoSLAM system, allowing the construction of maps with high levels
of geometrical information. The authors demonstrated that edges are good
features for tracking and SLAM, due to their invariance to lighting, orientation
and scale changes. The use of edges as features looks promising, since edges are
little affected by blurring caused by the sudden movements of the camera (Klein
and Murray 2008). Nonetheless, edges have the limitation of not being easy to
extract and match. On the other hand, Gee et al. (2008) and Martínez et al. (2010)
investigate the fusion of features (i.e. points, lines and planar structures) in a
single map, with the purpose of increasing the precision of SLAM systems and
creating a better representation of the environment.
4.2. Descriptors
One of the most commonly used descriptors for object recognition is the
histogram-type SIFT descriptor, proposed by Lowe (2004), which is based on the
distribution of local gradients in the neighborhood of the salient point,
yielding a vector of 128 components. Ke and Sukthankar (2004) propose a
modification to SIFT called PCA-SIFT, whose main idea is to obtain a descriptor
as distinctive and robust as SIFT but with a vector of fewer components. The
reduction is accomplished by means of the Principal Component Analysis
technique. The histogram-type descriptors have the property of being invariant to
translation, rotation, and scale, and partially invariant to lighting and viewpoint
changes. An exhaustive evaluation of several description algorithms and a
proposal for an extension of the SIFT descriptor (the Gradient Location-Orientation
Histogram, GLOH) may be found in (Moreels and Perona 2005) and
(Mikolajczyk and Schmid 2005), respectively.
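The dimensionality-reduction idea behind PCA-SIFT can be illustrated as follows; note that PCA-SIFT actually applies PCA to normalized gradient patches rather than to the 128-component SIFT vector, so this sketch, with random stand-in descriptors, only shows the projection step.

```python
import numpy as np

def pca_project(descriptors, dim=36):
    """Project n x d descriptors onto their top `dim` principal components."""
    centered = descriptors - descriptors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T

rng = np.random.default_rng(0)
descs = rng.normal(size=(1000, 128))     # stand-in for real descriptors
compact = pca_project(descs, dim=36)     # 36 components, as in PCA-SIFT
print(compact.shape)                     # (1000, 36)
```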
5. Image matching and data association

The baseline is the line joining the optical centers of two cameras used to capture a
pair of images. When the difference between the images taken from the two
viewpoints is small, a corresponding point will have almost the same position
and appearance in both images, reducing the complexity of the problem. In this
case, the point is characterized simply by the intensity values of a set of sampled
pixels from a rectangular window (also known as a patch) that is centered on the
salient feature. The intensity values of the pixels are compared by means of
correlation measures such as cross-correlation, the sum of squared differences and the
sum of absolute differences, among others. In (Cyganek and Siebert 2009) there is a list of
formulae for determining the similarity between two patches. (Konolige and
Agrawal 2008) and (Nistér et al. 2004) report that normalized cross-correlation
(NCC) is the measure exhibiting the best results. Normalization makes this
method invariant to uniform changes in brightness. In (Davison 2003) and
(Molton et al. 2004) a homography is calculated to deform the patch and make
the NCC correspondences invariant to viewpoint changes, which allows greater
freedom of camera movement. Unfortunately, correspondence with NCC is
susceptible to false positives and false negatives: in an image region with repeated
texture, two or more points inside a search region can produce a strong NCC
response.
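A minimal implementation of the NCC measure between two patches might look as follows; the zero-mean normalization is what provides the invariance to uniform brightness changes mentioned above.

```python
import numpy as np

def ncc(patch_a, patch_b):
    """Normalized cross-correlation of two equal-sized patches, in [-1, 1]."""
    a = patch_a.astype(float).ravel()
    b = patch_b.astype(float).ravel()
    a -= a.mean()                     # remove the mean intensity ...
    b -= b.mean()                     # ... so uniform brightness cancels out
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0
```

Matching then keeps, for each patch in the first image, the candidate in the search region with the highest score; as noted above, repeated texture can yield several equally strong responses.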
When working with long baselines, the images present large changes in scale or
perspective, so a point in one image may appear almost anywhere in the
other image. This creates a difficult correspondence problem. The image data
in the neighborhood of a point are distorted by changes in viewpoint and lighting,
and correlation measures will no longer give good results.
The easiest way to find correspondences is to compare all the features of one image
against all the features of another image (an approach known as "brute force").
Unfortunately, the cost of this process grows quadratically with the number of
extracted features, which is impractical for many applications that must work in
real time.
In recent years, there has been considerable progress in the development of
matching algorithms for long baselines that are invariant to several image
transformations. Many of these algorithms obtain a descriptor for each detected
feature, calculate dissimilarity measures between descriptors, and use data
structures to search for matching pairs quickly and efficiently.
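One common realization of this scheme, assuming descriptors are real-valued vectors (e.g. SIFT), is nearest-neighbour search in a kd-tree combined with Lowe's distance-ratio test (Beis and Lowe 1997; Lowe 2004); the 0.8 ratio is the threshold suggested by Lowe.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Return (i, j) index pairs that pass the distance-ratio test."""
    tree = cKDTree(desc_b)
    # Two nearest neighbours in desc_b for every descriptor in desc_a.
    dists, idxs = tree.query(desc_a, k=2)
    matches = []
    for i, ((d1, d2), (j, _)) in enumerate(zip(dists, idxs)):
        if d1 < ratio * d2:        # accept only unambiguous matches
            matches.append((i, int(j)))
    return matches
```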
Other works, such as (Lepetit and Fua 2006; Grauman 2010; Kulis et al. 2009;
Özuysal et al. 2010), use learning strategies to determine the similarity between
features. This reformulates the correspondence problem as a classification
problem, which seems very promising. In the specific case of real-time SLAM
applications, this may not be entirely adequate, since constant
on-line training is necessary. Nonetheless, (Hinterstoisser et al. 2009; Taylor
and Drummond 2009) have proposed faster methods for on-line
learning, which could be utilized in the future for SLAM applications.
Aguilar et al. (2009), Li et al. (2010) and Gu et al. (2010) propose a different image
correspondence approach, in which the neighborhood relationships between points are
represented by means of a graph; corresponding graphs are those that are
identical or similar in both images. In the same way, Sanromá et al. (2010) propose
an iterative graph-based matching algorithm, which is used to recover the
pose of a mobile robot. Unfortunately, these approaches are still limited, because
they do not work in real time and cannot handle temporary occlusions.
Loop closure detection consists in recognizing a place that has already been
visited in a cyclical excursion of arbitrary length (Ho and Newman 2007;
Clemente et al. 2007; Mei et al. 2010). This problem has been one of the greatest
impediments to performing large-scale SLAM and to recovering from critical errors. From
this problem arises another one, called perceptual aliasing (Angeli et al. 2008;
Cummins and Newman 2008), where two different places in the environment
are recognized as the same one. This is a problem even when using cameras
as sensors, owing to the repetitive features of the environment, e.g. hallways, similar
architectural elements, or zones with a large quantity of bushes. A good loop
closure detection method must not return any false positives and must produce a
minimum of false negatives.
According to Williams et al. (2009a) detection methods for loop closures in visual
SLAM can be divided into three categories: i) map to map; ii) image to image;
and iii) image to map. The categories differ mainly in where the data for
association are taken from (metric map space or image space). However, the ideal would be to
build a system that combines the advantages of all three categories. Loop closure
detection is an important problem for any SLAM system and, given that
cameras have become a very common sensor for robotic applications, many
researchers focus on vision-based methods to solve it.
All the loop closure works described above aim to achieve a precision of
100%. This is because a single false positive can cause irremediable
failures during the creation of the map. In the context of SLAM, false positives
are graver than false negatives (Magnusson et al. 2009): false negatives reduce
the recall percentage, but have no impact on the precision percentage. Thus, in order to
determine the efficiency of a loop closure detector, the recall rate should be as
high as possible, with a precision of 100%.
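Both rates follow directly from the counts of true positives, false positives and false negatives; the numbers in the example below are illustrative only.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall of a loop closure detector."""
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# 48 closures detected correctly, 0 false alarms, 12 closures missed:
print(precision_recall(48, 0, 12))   # (1.0, 0.8) -> usable for SLAM
# A single false alarm barely lowers precision but may ruin the map:
print(precision_recall(48, 1, 12))   # (~0.98, 0.8) -> unacceptable
```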
In the kidnapped robot problem, the pose of the robot in the map must be determined
without prior information about its whereabouts. This case can occur if the robot
is put back into an already mapped zone without knowledge of the
displacement it underwent while being transported to that place, or when the robot performs
blind movements due to occlusions, temporary sensor malfunction, or fast camera
movements (Eade and Drummond 2008; Chekhlov et al. 2008; Williams et al.
2007).
Multi-session and cooperative mapping consists in aligning two or more partial
maps of the environment, collected either by one robot in different periods of operation or
by several robots at the same time (visual cooperative SLAM) (Ho and Newman
2007; Gil et al. 2010; Vidal et al. 2011).
In the past, the problem of associating measurements with landmarks on the map
was solved through algorithms such as Nearest Neighbor, Sequential
Compatibility Nearest Neighbor and Joint Compatibility Branch and Bound
(Neira and Tardós 2001). However, these techniques share the limitation that they
work only if a good initial estimate of the pose of the robot in the map is available (Cummins
and Newman 2008).
6. Visual SLAM methods

6.1. Probabilistic filters
The MonoSLAM system (Davison 2003) uses a motion model with constant linear and angular
velocities. This is inconvenient because the model cannot
properly deal with sudden movements, which limits camera mobility. Consequently,
the distance that salient features may move between frames must be very small
in order to ensure tracking (otherwise tracking becomes very expensive, since
a large region must be searched for each feature).
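The essence of such a constant-velocity model can be sketched as an EKF prediction step; for brevity the sketch tracks position and linear velocity only, omitting the quaternion orientation and angular velocity that MonoSLAM also carries, and the process noise variance is an assumed value.

```python
import numpy as np

def predict_constant_velocity(r, v, P, dt, accel_var=0.25):
    """EKF prediction under a constant-velocity model.

    State x = [r, v] (3D position and linear velocity). Unknown accelerations
    enter only as zero-mean process noise, so the covariance P grows at each
    step; sudden real accelerations violate the model, which is why erratic
    camera motion breaks tracking.
    """
    F = np.block([[np.eye(3), dt * np.eye(3)],
                  [np.zeros((3, 3)), np.eye(3)]])   # x' = F x
    G = np.vstack([0.5 * dt**2 * np.eye(3), dt * np.eye(3)])
    Q = accel_var * (G @ G.T)                       # process noise covariance
    x = F @ np.concatenate([r, v])
    return x[:3], x[3:], F @ P @ F.T + Q
```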
To cope with erratic camera movement in MonoSLAM, Gemeiner et al. (2008)
developed an optimized version, capable of operating at 200 Hz, that uses an extended
motion model taking into account acceleration as well as linear and angular
velocities; however, its real-time performance is limited to only a few seconds,
because the map size and the computational cost grow extremely fast.
Structure from motion (SfM) achieves high precision in camera localization but does not
necessarily aim to create consistent maps. Despite this, several proposals have
been made that use SfM to localize precisely while creating a good
representation of the environment.
One method that solves the SfM problem incrementally is the visual odometry
published by Nistér et al. (2004). Visual odometry consists in simultaneously determining
the camera pose for each video frame and the positions of features
in the 3D world, using only images, in a causal way and in real time. Mouragnon et al.
(2006; 2009) use a visual odometry scheme similar to Nistér's proposal, but add a
technique called local bundle adjustment, reporting trajectories of up to 500
meters. Visual odometry allows working with thousands of features per frame,
while probabilistic techniques handle only a few features.
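A minimal frame-to-frame step of such a visual odometry pipeline can be sketched with OpenCV's routines built on the five-point algorithm (Nistér 2004); the RANSAC threshold is an assumed value, and a full system would additionally triangulate points, chain the poses and resolve the unknown scale.

```python
import cv2
import numpy as np

def relative_pose(pts_prev, pts_curr, K):
    """Rotation R and unit-length translation t between two frames, from
    matched pixel coordinates (N x 2 float arrays) and intrinsics K."""
    E, inliers = cv2.findEssentialMat(pts_prev, pts_curr, K,
                                      method=cv2.RANSAC, threshold=1.0)
    # Disambiguate the four (R, t) decompositions of E by cheirality.
    _, R, t, _ = cv2.recoverPose(E, pts_prev, pts_curr, K, mask=inliers)
    return R, t   # with one camera, t is recoverable only up to scale
```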
Klein and Murray (2007) present a monocular method called Parallel Tracking
and Mapping (PTAM). It uses an approach based on keyframes (see Appendix I)
with two parallel processing threads: the first thread robustly tracks a large
number of features, while the other builds a 3D point map using bundle adjustment
(BA) techniques. The PTAM system presents tracking failures in the presence
of similar textures and moving objects.
In (Konolige et al. 2008) and (Konolige et al. 2009) the authors use techniques
called FrameSLAM and View-Based Maps, respectively. The two methods are
based on representing the map as a "skeleton": a graph of non-linear
constraints between frames (rather than between individual 3D features). The
authors use a stereo device mounted on a wheeled robot. Their results show
good performance on long trajectories (approximately 10 km) under changing
conditions, such as passing through an urban environment.
Recently, Strasdat et al. (2010b) have argued that, in order to increase the accuracy
of a monocular SLAM system, it is preferable to increase the
number of features (an essential property of SfM) rather than the number of frames,
and that bundle adjustment optimization techniques outperform filters.
However, they note that filtering might be beneficial in situations of high
uncertainty. The ideal SLAM system would exploit the benefits of both SfM
techniques and probabilistic filters.
Milford et al. (2004) use models of the rodent hippocampus (the brain area responsible
for spatial memory) to create a localization and mapping system called RatSLAM.
RatSLAM can generate consistent and stable representations of complex
environments using a single camera. The experiments carried out in (Milford and
Wyeth 2008; Glover et al. 2010) show good performance in real-time tasks in
both indoor and outdoor environments. In addition, the system was able to close more
than 51 loops of up to 5 km in length, at different times of day. In (Milford
2008) a larger study of RatSLAM, along with the biological navigation systems of
bees, ants, primates and humans, is presented.
Collett (2010) examines the behavior of desert ants to analyze how they are
guided by visual landmarks rather than pheromone trails. Although this research
focuses on understanding how ants navigate using visual information, the author
states that the proposed solution would be viable and easy to implement in a
robot.
7. Map representations

Representations of the environment used in visual SLAM can be
divided into metric and topological maps. Metric maps capture the geometric
properties of the environment, whereas topological maps describe the connectivity
between different locations.
The metric map category includes occupancy grid maps
(Gutmann et al. 2008) and landmark-based maps (Klein and Murray 2007; Se et
al. 2002; Sáez and Escolano 2006; Mouragnon et al. 2006). Grid maps model free
and occupied space by discretizing the environment into
cells, which may contain 2D, 2.5D or 3D information. Landmark-based maps
identify and store the 3D locations of certain salient features of the environment.
Thrun (2002) performs a detailed study on the topic of robotic mapping using
probabilistic techniques in indoor environments.
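As an illustration of how a grid map is maintained, the sketch below uses the standard Bayesian log-odds update of per-cell occupancy (cf. Thrun 2002); the 0.7/0.3 inverse sensor model probabilities are assumed for the example.

```python
import numpy as np

LO_HIT = np.log(0.7 / 0.3)    # log-odds increment when a cell is seen occupied
LO_MISS = np.log(0.3 / 0.7)   # log-odds decrement when it is seen free

def update_cell(log_odds, occupied):
    """Fold one observation of a cell into its accumulated log-odds."""
    return log_odds + (LO_HIT if occupied else LO_MISS)

def occupancy_probability(log_odds):
    """Convert accumulated log-odds back to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-log_odds))
```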
Topological maps represent the environment as a list of significant places that are
connected by arcs (similar to a graph) (Fraundorfer et al. 2007; Eade and
Drummond 2008; Konolige et al. 2009; Botterill et al. 2010). A representation of
the world based on graphs simplifies the problem of mapping large areas.
However, it is necessary to perform a global optimization of the map to reduce the
local error (Frese et al. 2005; Olson et al. 2006). A tutorial on formulating the SLAM
problem by means of graphs can be found in (Grisetti et al. 2010). Other
relevant schemes based on graphs are the following. Konolige et al. (2008; 2009)
build a sequence of relative poses between frames, which can recover from critical
errors. They show results over 10 km trajectories using stereo vision, although the system
requires positions generated by an IMU (inertial measurement unit) when
an occlusion of the cameras occurs. The authors state that their scheme is
applicable to monocular SLAM, although this is not demonstrated. Another
alternative is presented by Mei et al. (2009), who manage to maintain
constant-time complexity when optimizing local sub-maps consisting of the closest
nodes, using a technique called relative bundle adjustment. They generate a
trajectory of approximately two kilometers using stereo cameras.
Several datasets containing real image sequences for the evaluation of visual
SLAM systems are described in Appendix III.
The key characteristics of some of the visual SLAM systems reviewed in this paper are
summarized in Table 1. The first column gives the authors and the
respective reference, while the other columns give the most important
elements of each system. Specifically, we report: 2) the type of sensing device
used; 3) the core of the visual SLAM solution; 4) the kind of environment
representation; 5) details of the feature extraction process; 6) the ability and
robustness of the system to operate under a variety of conditions (moving objects,
abrupt movements and large environments) and to perform loop closures; and
7) the type of environment used to test the performance of the system.
8. Conclusions
This work verifies that there is great interest in solving the SLAM problem using
vision as the only exteroceptive sensor. This is due mainly to the fact that a
camera is close to an ideal sensor: it is light and passive, has low energy consumption,
and captures abundant and distinctive information about a scene. However, the use of
vision requires reliable algorithms that perform well and consistently under
variable lighting conditions, occlusions, changes in the appearance of the environment
due to moving people or objects, the emergence of featureless regions, transitions
between day and night, and any other unforeseen situation. Therefore, SLAM
systems using vision as the only sensor remain a challenging and promising
research area.
Image matching and data association are still open research areas in the fields
of computer vision and robotic vision, respectively. The chosen detector and descriptor
directly affect the ability of the system to track salient features,
recognize previously seen areas, build a consistent model of the environment, and
work in real time. Particular to data association is the need for long-term
navigation, in spite of a growing database and changing, extremely loopy
environments. Accepting a bad association will cause serious errors in the
entire SLAM system, meaning that both the computed location and the constructed map
will be inconsistent. Therefore, it is important to propose new
strategies to reduce the rate of false positives.
Appearance-based methods have been very popular for solving the data association
problem in visual SLAM. The most common technique in this category is the
bag of visual words (BoVW, see Appendix II), owing to its speed in finding similar images.
However, BoVW is affected by the phenomenon of perceptual aliasing. Likewise,
this technique has not yet been thoroughly tested for detecting images with large
variations in viewpoint or scale, transformations that often occur during loop
closure detection, the kidnapped robot problem, and multi-session and cooperative
mapping. Also, it does not take into account the spatial distribution of the detected
features or 3D geometric information, both of which could be useful when establishing
associations.
Although there have been several proposals to build lifelong maps, this issue
remains a topic of interest, as does the ability to build maps in spite of all the
problems caused by working in real-world environments.
To date, there are no standards for evaluating and comparing the general
efficiency and effectiveness of a complete visual SLAM system. Nonetheless,
there are several indicators that may characterize its performance, such as the
degree of human intervention, localization accuracy, map consistency, real-time
operation, and the control of the computational cost that arises with the growth of the
map, among others.
Appendix I – Keyframes
A keyframe is a video frame that is sufficiently different from its predecessors in the
sequence to represent a new location. Keyframes are also used to estimate the pose
of the camera efficiently and to reduce the redundancy of information. The
easiest way to classify a video frame as a keyframe is to compare it with a frame
taken earlier, selecting those frames that maximize both the distance at which they
were captured and the number of feature matches that exist between them. In
(Zhang et al. 2010) a comparative study of different keyframe detection techniques,
oriented to the visual SLAM problem, is presented.
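A simple keyframe test in this spirit might look as follows; all thresholds are illustrative assumptions, not values taken from a particular system.

```python
def is_keyframe(n_matches, distance, min_distance=0.10,
                min_matches=50, max_matches=300):
    """Heuristic test of a frame against the previous keyframe.

    Promote the frame when the camera has moved far enough to add new
    information and parallax, yet still shares enough feature matches with
    the last keyframe for reliable pose estimation (too many matches means
    the frame is largely redundant).
    """
    moved_enough = distance > min_distance
    still_overlaps = min_matches <= n_matches <= max_matches
    return moved_enough and still_overlaps
```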
Appendix II – Bag of visual words (BoVW)

Recently, most contributions to solving data association in visual SLAM use the bag
of visual words (BoVW) (Sivic and Zisserman 2003) and its improved version, the vocabulary
tree (Nistér and Stewenius 2006). BoVW has seen great success in the area of
information retrieval (Manning et al. 2008) and in the content-based image retrieval
developed by the computer vision community, owing to its speed in finding similar
images. However, this technique is not completely precise, because it yields
several false positives. To solve this problem to some extent, spatial information
is normally introduced in the last phase of retrieval, conducting a post-verification
that takes into account the epipolar constraint (Angeli et al. 2008) or, more recently,
Conditional Random Fields (Cadena et al. 2010). This verification
allows rejecting those retrieved images that are not geometrically consistent with
the reference image.
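A sketch of such a post-verification step, assuming matched pixel coordinates between the query image and a retrieved candidate, is given below; OpenCV's RANSAC fundamental-matrix estimator enforces the epipolar constraint, and the thresholds are illustrative.

```python
import cv2
import numpy as np

def geometrically_consistent(pts_query, pts_candidate, min_inliers=20):
    """Accept a retrieved image only if enough of its feature matches agree
    with a single epipolar geometry (a RANSAC-fitted fundamental matrix)."""
    if len(pts_query) < 8:            # eight-point minimum for F
        return False
    F, mask = cv2.findFundamentalMat(np.float64(pts_query),
                                     np.float64(pts_candidate),
                                     cv2.FM_RANSAC, 3.0, 0.99)
    return F is not None and mask is not None and int(mask.sum()) >= min_inliers
```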
The classic BoVW model describes an image as a set of local features called
visual words; the full set of these words is known as the visual vocabulary. Many
BoVW schemes generate the vocabulary off-line by K-means
clustering (though any other clustering method can be used) of descriptors from a large corpus of
training images (Ho and Newman 2007; Cummins and Newman 2008). An
alternative and more effective approach is to construct the vocabulary dynamically,
from the features that are found as the environment is explored. Such a
scheme is described by Angeli et al. (2008) and Botterill et al. (2010).
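In code, the off-line variant of this pipeline reduces to clustering and vector quantization; the vocabulary size k is an assumed value (real systems use thousands to millions of words).

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def build_vocabulary(training_descriptors, k=1000):
    """K-means clustering of a descriptor corpus into k visual words."""
    words, _ = kmeans2(training_descriptors.astype(float), k, minit='points')
    return words                       # k x d matrix of cluster centers

def bovw_histogram(image_descriptors, vocabulary):
    """Describe one image as a histogram of visual word occurrences."""
    labels, _ = vq(image_descriptors.astype(float), vocabulary)
    return np.bincount(labels, minlength=len(vocabulary))
```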
Some visual words are more useful than others for deciding whether two images show the
same place. The most common scheme for assigning each word a specific weight is
TF-IDF, which combines the importance of a word in the image (TF, term
frequency) and its importance in the collection (IDF, inverse document
frequency). In addition, there are other schemes, divided
into local (squared TF, frequency logarithm, binary, BM25 TF, among others)
and global (probabilistic IDF, squared IDF, etc.) (Tirilly et al. 2010). An inverted
index is used to speed up queries; it organizes the entire set of visual words
representing the images. An inverted index is structured like a book index: it has one
entry for each word in the image collection, followed by a list of all the images in
which that word is present.
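A compact sketch of both mechanisms, operating on the word histograms produced by a BoVW front end, might be:

```python
import numpy as np
from collections import defaultdict

def tfidf_weights(histograms):
    """TF-IDF weighting of BoVW histograms (one row per image)."""
    H = np.asarray(histograms, dtype=float)
    tf = H / np.maximum(H.sum(axis=1, keepdims=True), 1.0)
    df = np.count_nonzero(H, axis=0)            # images containing each word
    idf = np.log(len(H) / np.maximum(df, 1.0))  # rare words weigh more
    return tf * idf

def build_inverted_index(histograms):
    """Map each visual word to the list of images in which it appears."""
    index = defaultdict(list)
    for image_id, hist in enumerate(histograms):
        for word in np.nonzero(hist)[0]:
            index[int(word)].append(image_id)
    return index
```

A query then touches only the index entries of its own words, instead of scoring every stored image.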
Appendix III – Datasets

Some public datasets available for testing visual SLAM systems are: a) the New
College and City Centre datasets (outdoor) (Cummins 2008), used by Cummins
and Newman (2008); b) the New College Vision and Laser Data Set (outdoor)
(Smith 2012), captured by Smith et al. (2009); c) the Bovisa (outdoor) and Bicocca
(indoor) datasets of the Rawseeds project (Rawseeds 2012), captured by Ceriani et
al. (2009); d) the Cheddar Gorge Data Set (outdoor), captured by Simpson et al.
(2011); and e) RGB-D datasets (indoor) (Sturm 2012; Sturm et al. 2011).
Acknowledgments
This paper has been made possible thanks to the generous support from the following
institutions which we are pleased to acknowledge: CONACYT (Consejo Nacional de Ciencia y
Tecnología) and CENIDET (Centro Nacional de Investigación y Desarrollo Tecnológico).
Table 1. Summary of some of the systems reviewed. The four "copes with" columns indicate whether each system handles moving objects, loop closure, the kidnapped robot problem and large-scale mapping.

| Author | Sensing device | Core of the solution | Type of map | Detector | Descriptor | Moving objects? | Loop closure? | Kidnapped robot? | Large-scale mapping? | Environment |
|---|---|---|---|---|---|---|---|---|---|---|
| (Davison 2003) | Monocular camera | MonoSLAM (EKF) | Metric | Shi and Tomasi operator | Image patches | No | No | No | No | Indoor |
| (Nistér et al. 2004) | Stereo or monocular cameras | Visual odometry | Metric | Harris corners | Image patches | No | No | No | No | Outdoor |
| (Sáez and Escolano 2006) | Stereo camera | Global entropy minimization algorithm | Metric | Nitzberg operator | Image patches | No | No | No | No | Outdoor/Indoor |
| (Mouragnon et al. 2006) | Monocular camera | Visual odometry + local bundle adjustment | Metric | Harris corners | Image patches | Yes | No | No | Yes | Outdoor/Indoor |
| (Klein and Murray 2007) | Monocular camera | Parallel Tracking and Mapping (visual odometry + bundle adjustment) | Metric | FAST-10 | Image patches | No | No | No | No | Indoor |
| (Ho and Newman 2007) | Monocular camera and laser | Delayed state formulation | Metric | Harris-Affine regions | 128D SIFT | No | Yes | No | Yes | Outdoor/Indoor |
| (Clemente et al. 2007) | Monocular camera | Hierarchical map + EKF | Metric | Shi and Tomasi operator | Image patches | Yes | Yes | No | Yes | Outdoor |
| (Lemaire et al. 2007) | Stereo or monocular cameras | EKF | Metric | Harris corners | Image patches | No | Yes | No | No | Outdoor |
| (Milford 2008) | Monocular camera | RatSLAM (models of the rodent hippocampus) | Topological | Appearance-based matching | - | Yes | Yes | Yes | Yes | Outdoor |
| (Scaramuzza and Siegwart 2008) | Omnidirectional camera | Visual odometry | Metric | SIFT (Difference of Gaussians) | Image patches | No | No | No | Yes | Outdoor |
| (Eade and Drummond 2008) | Monocular camera | GraphSLAM | Topological | Scale-space extrema detector | 16D SIFT | Yes | Yes | Yes | Yes | Outdoor/Indoor |
| (Paz et al. 2008) | Stereo camera | Conditionally independent divide and conquer (EKF) | Metric | Shi and Tomasi operator | Image patches | No | No | No | Yes | Outdoor/Indoor |
| (Angeli et al. 2008) | Monocular camera | EKF | Topological + metric | SIFT (Difference of Gaussians) | 128D SIFT + local hue histograms | No | Yes | No | No | Indoor |
| (Cummins and Newman 2008) | Monocular camera mounted on a pan-tilt unit | Fast Appearance-Based Mapping (FAB-MAP) | Topological | Harris-Affine | U-SURF 128D | Yes | Yes | No | Yes | Outdoor |
| (Piniés and Tardós 2009) | Monocular camera | Conditionally independent local maps (EKF) | Metric | Harris corners | Image patches | Yes | Yes | No | Yes | Outdoor |
| (Konolige et al. 2009) | Stereo camera + IMU | Visual odometry + sparse bundle adjustment | Topological | FAST | Random tree signatures | Yes | Yes | Yes | Yes | Outdoor |
| (Williams 2009b) | Monocular camera | Hierarchical map + EKF + visual odometry | Metric | FAST | Image patches + 16D SIFT | Yes | Yes | Yes | Yes | Outdoor |
| (Kaess and Dellaert 2010) | Multi-camera rig | Expectation maximization + standard bundle adjustment | Metric | Harris corners | Image patches | No | Yes | No | No | Outdoor |
| (Botterill et al. 2010) | Monocular camera | Visual odometry + bag of words | Topological | FAST | Image patches | Yes | Yes | Yes | Yes | Outdoor/Indoor |
| (Mei et al. 2010) | Stereo camera | Visual odometry + relative bundle adjustment + FAB-MAP | Topological + metric | FAST | 128D SIFT | Yes | Yes | Yes | Yes | Outdoor |
References
Aguilar W, Frauel Y, Escolano F, et al. (2009) A robust graph matching for non-rigid registration.
Image Vis Comput, 27(7): 897-910.
Andrade J, Sanfeliu A (2002) Concurrent map building and localization with landmark validation.
In: Proceedings of the 16th IAPR International Conference on Pattern Recognition, 2:693-696.
Angeli A, Doncieux S, Filliat D (2008) Real time visual loop closure detection. In: Proceedings of
the IEEE International Conference on Robotics and Automation.
Angeli A, Doncieux S, Meyer J (2009) Visual topological SLAM and global localization. In:
Proceedings of the IEEE International Conference on Robotics and Automation, pp. 4300-4305.
Artieda J, Sebastian J, Campoy P, et al. (2009) Visual 3-D SLAM from UAVs. J Intell Robot Syst,
55(4):299-321.
Auat C, Lopez N, Soria C, et al. (2010) SLAM algorithm applied to robotics assistance for
navigation in unknown environments. Journal of NeuroEngineering and Rehabilitation,
doi:10.1186/1743-0003-7-10.
Bailey T, Durrant H (2006) Simultaneous Localization and Mapping (SLAM): Part II. IEEE Robot
Autom Mag, 13(3): 108-117.
Bay H, Tuytelaars T, Van L (2006) SURF: Speeded Up Robust Features. In: Proceedings of the
European Conference on Computer Vision.
Bazeille S, Filliat D (2010) Combining Odometry and Visual Loop-Closure Detection for Consistent
Topo-Metrical Mapping. RAIRO: An International Journal on Operations Research, 44(4):365-377.
Beis J, Lowe D (1997) Shape indexing using approximate nearest neighbour search in high-
dimensional spaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 1000-1006.
Bogdan R, Sundaresan A, Morisset B, et al. (2009) Leaving flatland: efficient real-time three-
dimensional perception and motion planning. J Field Robot: Special Issue on Three-Dimensional
Mapping, 26(10): 841-862.
Bosse M, Newman P, Leonard J, et al. (2003) An atlas framework for scalable mapping. In:
Proceedings of the IEEE International Conference on Robotics and Automation, pp. 1899-1906.
Cadena C, Gálvez-López D, Ramos F, et al. (2010) Robust Place Recognition with Stereo Cameras.
In: Proceedings of the IEEE International Conference on Intelligent Robots and Systems, pp. 5182-
5189.
Calonder M, Lepetit V, et al. (2010) BRIEF: Binary Robust Independent Elementary Features. In:
Proceedings of the European Conference on Computer Vision.
Cannons K (2008) A review of visual tracking. Technical Report CSE-2008-07, York University,
Department of Computer Science and Engineering.
Castellanos J, Tardós JD, Neira J (2001) Multisensor fusion for simultaneous localization and map
building. IEEE Trans Robot Autom, 17(6):908 – 914.
Ceriani S, Fontana G, Giusti A, et al. (2009) Rawseeds ground truth collection systems for indoor
self-localization and mapping. Journal of Autonomous Robots, 27(4):353-371.
Chatila R, Laumond J (1985) Position referencing and consistent world modeling for mobile
robots. In: Proceedings of the IEEE International Conference on Robotics and Automation, 2:138-
145.
Chekhlov D, Mayol W, Calway A (2007) Ninja on a plane: automatic discovery of physical planes
for augmented reality using visual SLAM. In: Proceedings of the 6th IEEE and ACM International
Symposium on Mixed and Augmented Reality, pp. 1-4.
Chekhlov D, Mayol W, Calway A (2008) Appearance based indexing for relocalisation in real-time
visual SLAM. In: Proceedings of the British Machine Vision Conference, pp. 363-372.
Chli M, Davison A (2008) Active matching. In: Proceedings of the European Conference on
Computer Vision: Part I. doi:10.1007/978-3-540-88682-2_7
Chli M, Davison A (2009) Active matching for visual tracking. Robot Autonom Syst, 57(12), pp.
1173-1187.
Clemente L, Davison A, Reid I, et al. (2007) Mapping large loops with a single hand-held camera.
In: Proceedings of Robotics: Science and Systems Conference.
Collett M (2010) How desert ants use a visual landmark for guidance along a habitual route. In:
Psychological and Cognitive Sciences, 107(25):11638-11643.
Cummins M, Newman P (2008) FAB-MAP: Probabilistic localization and mapping in the space of
appearance. Int J Robot Res, 27(6): 647-665.
Cyrill S (2009) Robotic mapping and exploration. Springer Tracts in Advanced Robotics, vol. 55,
ISBN: 978-3-642-01096-5.
Davison A (2003) Real-time simultaneous localisation and mapping with a single camera. In:
Proceedings of the IEEE International Conference on Computer Vision, 2:1403-1410.
Davison A, González Y, Kita N (2004) Real-time 3D SLAM with wide-angle vision. In: 5th
IFAC/EURON Symposium on Intelligent Autonomous Vehicles.
Davison A, Reid I, Molton N (2007) MonoSLAM: real-time single camera SLAM, IEEE Trans Pattern
Anal Mach Intell, 29(6):1052-1067.
Dufournaud Y, Schmid C, Horaud R (2004) Image matching with scale adjustment. Comput Vis
Image Understand, 93(2):175-194.
Durrant H, Bailey T (2006) Simultaneous Localization and Mapping (SLAM): Part I the Essential
Algorithms. IEEE Robot Autom Mag, 13(2): 99-110.
Eade E, Drummond T (2006a) Edge landmarks in monocular SLAM. In: Proceedings of the British
Machine Vision Conference.
Eade E, Drummond T (2006b) Scalable monocular SLAM. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 1:469-476.
Eade E, Drummond T (2008) Unified loop closing and recovery for real time monocular SLAM. In
Proceedings of the British Machine Vision Conference.
Engels C, Stewénius H, Nistér D (2006) Bundle adjustment rules. In: Photogrammetric Computer
Vision.
Fawcett T (2006) An introduction to ROC analysis. Pattern Recogn Lett, pp. 861-874.
Fraundorfer F, Engels C, Nister C (2007) Topological mapping, localization and navigation using
image collections. In: Proceedings of the IEEE International Conference on Intelligent Robots and
Systems, pp. 3872-3877.
Frintrop S, Jensfelt P (2008) Attentional landmarks and active gaze control for visual SLAM. IEEE
Trans Robot, 24(5): 1054-1065.
Gee A, Chekhlov D, Calway A, Mayol W (2008) Discovering higher level structure in visual SLAM.
IEEE Trans Robot, 24(5):980-990.
Gil A, Reinoso O, Ballesta M, Juliá M (2010) Multi-robot visual SLAM using a rao-blackwellized
particle filter. Robot Autonom Syst, 58(1): 68-80.
Glover A, Maddern W, Milford M, et al. (2010) FAB-MAP + RatSLAM: appearance-based slam for
multiple times of day. In: Proceedings of the IEEE International Conference on Robotics and
Automation.
Grasa O, Civera J, Montiel J (2011) EKF monocular SLAM with relocalization for laparoscopic
sequences. In: Proceedings of the IEEE International Conference on Robotics and Automation,
pp:4816-4821.
Grauman K, Darrell T (2007) Pyramid match hashing: sub-linear time indexing over partial
correspondences. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition.
Grauman K (2010) Efficiently searching for similar images. Comm ACM, 53(6): 84-94.
Gu S, Zheng Y, Tomasi C (2010) Critical nets and beta-stable features for image matching. In:
Proceedings of the European Conference on Computer Vision, pp. 663-676.
Gutmann J, Fukuchi M, Fujita M (2008) 3D Perception and Environment Map Generation for
Humanoid Robot. Int J Robot Res, 27(10): 1117-1134.
Handa A, Chli M, Strasdat H, Davison A (2010) Scalable active matching. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 1546-1533.
Harris C, Stephens M (1988) A combined corner and edge detector. In: Proceedings of the fourth
Alvey Vision Conference, pp. 147-151.
Hartley R, Zisserman A (2003) Multiple View Geometry in computer vision. Cambridge (ed),
Second edn., ISBN: 0521540518.
Ho K, Newman P (2007) Detecting loop closure with scene sequences. Int J Comput Vis, 74(3):
261-286.
Huang A, Bachrach A, Henry P, et al. (2011) Visual odometry and mapping for autonomous flight
using rgb-d camera. International Symposium on Robotics Research.
Jones E, Soatto S (2011) Visual-inertial navigation, mapping and localization: a scalable real-time
causal approach. Int J Robot Res, 30 (4): 407-430.
Kaess M, Dellaert F (2010) Probabilistic structure matching for Visual SLAM with a multi-camera
rig. Comput Vis Image Understand, 114: 286-296.
Ke Y, Sukthankar R (2004) PCA-SIFT: a more distinctive representation for local image descriptors.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2:506-513.
Klein G, Murray D (2007) Parallel tracking and mapping for small AR workspaces. In: Proceedings
of the 6th IEEE and ACM International Symposium on Mixed and Augmented Reality.
Klein G, Murray D (2008) Improving the agility of keyframe-based SLAM. In: Proceedings of the
European Conference on Computer Vision, pp. 802-815.
Koch O, Walter M, Huang A, Teller S (2010) Ground robot navigation using uncalibrated cameras.
In: Proceedings of the IEEE International Conference on Robotics and Automation, pp. 2423-
2430.
Konolige K, Bowman J, Chen J (2009) View-Based Maps, In: Proceedings of Robotics: Science and
Systems.
Kragic D, Vincze M (2009) Vision for robotics. Foundations and Trends in Robotics, 1(1): 1-78,
ISBN: 978-1-60198-260-5.
Kulis B, Jain P, Grauman K (2009) Fast similarity search for learned metrics. IEEE Trans Pattern
Anal Mach Intell, 31(12):2143-2157.
Lemaire T, Berger C, Jung I, et al. (2007) Vision-Based SLAM: stereo and monocular approaches.
Int J Comput Vis, 74( 3):343-364.
Lepetit V, Fua P (2005) Monocular model-based 3D tracking of rigid objects. In: Foundations and
Trends in Computer Graphics and Computer Vision, 1(1):1-89.
Lepetit V, Fua P (2006) Keypoint recognition using randomized trees. IEEE Trans Pattern Anal
Mach Intell, 28(9).
Li H, Kim E, Huang X, He L (2010) Object matching with a locally affine-invariant constraint. In:
Proceedings of the International Conference on Pattern Recognition, pp. 1641-1648.
Lin K, Wang C (2010) Stereo-based simultaneous localization, mapping and moving object
tracking. In: Proceedings of the IEEE International Conference on Intelligent Robots and Systems
pp. 3975-3980.
Lowe D (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis,
60(2):91-110.
Majumder S, Scheding S, Durrant H (2005) Sensor fusion and map building for underwater
navigation. In: Proceedings of Australian Conference on Robotics and Automation.
Martínez J, Calway A (2010) Unifying planar and point mapping in monocular SLAM. In:
Proceedings of the British Machine Vision Conference, pp. 1-11.
Matas J, Chum O, et al. (2002) Robust wide baseline stereo from maximally stable extremal
regions. In: Proceedings of the British Machine Vision Conference, 22(10):761-767.
Mei C, Reid I (2008) Modeling and generating complex motion blur for real-time tracking. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8.
Mei C, Sibley G, Cummins M, et al. (2009) A constant-time efficient stereo SLAM system. In:
Proceedings of the British Machine Vision Conference.
Mei C, Sibley G, Cummins M, et al. (2010) RSLAM: a system for large-scale mapping in constant-
time using stereo. Int J Comput Vis, 94(2): 1-17.
Mei C, Sommerlade E, Sibley C, et al. (2011) Hidden view synthesis using real-time visual SLAM
for simplifying video surveillance analysis. In: Proceedings of the IEEE International Conference
on Robotics and Automation, 8: 4240-4245.
Migliore D, Rigamonti R, Marzorati D, et al. (2009) Use a single camera for simultaneous
localization and mapping with mobile object tracking in dynamic environments. In: ICRA
Workshop on Safe navigation in open and dynamic environments: Application to autonomous
vehicles.
Mikolajczyk K, Schmid C (2002) An affine invariant interest point detector. In: Proceedings of the
European Conference on Computer Vision, pp. 128-142.
Mikolajczyk K, Tuytelaars T, Schmid C, et al. (2005) A comparison of affine region detectors. Int J
Comput Vis, 65:43-72.
Mikolajczyk K, Schmid C (2005) A performance evaluation of local descriptors. IEEE Trans Pattern
Anal Mach Intell, 27(10): 1615-1630.
Milford M, Wyeth G (2008) Mapping a suburb with a single camera using a biologically inspired
SLAM system. IEEE Trans Robot, 24(5):1038-1053.
Milford M (2008) Robot navigation from nature: simultaneous, localisation, mapping, and path
planning based on hippocampal models. Springer Tracts in Advanced Robotics, ISBN:
3540775196, vol. 41.
Molton N, Davison A, Reid I (2004) Locally planar patch features for real-time structure from
motion. In: Proceedings of the British Machine Vision Conference.
Montemerlo M, Thrun S, Koller D, et al. (2002) FastSLAM: a factored solution to the simultaneous
localization and mapping problem. In: Proceedings of the AAAI National Conference on Artificial
Intelligence, pp. 593-598.
Montiel J, Civera J, Davison A (2006) Unified inverse depth parametrization for monocular SLAM.
In: Proceedings of Robotics: Science and Systems.
Morel J, Yu G (2009) ASIFT: a new framework for fully affine invariant image comparison. SIAM
Journal on Imaging Sciences, 2(2).
Moreels P, Perona P (2005) Evaluation of features detectors and descriptors based on 3D objects.
In: Proceedings of the IEEE International Conference on Computer Vision, pp. 800-807.
Mouragnon E, Dhome M, Dekeyser F, et al. (2006) Monocular vision based SLAM for mobile
robots. In: Proceedings of the International Conference on Pattern Recognition, pp. 1027-1031.
Mouragnon E, Lhuillier M, Dhome M, et al. (2009) Generic and real time structure from motion
using local bundle adjustment. Image Vis Comput, pp. 1178-1193, ISSN: 0262-8856.
Neira J, Tardós JD (2001) Data association in stochastic mapping using the joint compatibility test.
IEEE Trans Robot Autom, 17(6).
Newman P, Leonard J, Neira J, Tardós J (2002) Explore and return: experimental validation of real
time concurrent mapping and localization. In: Proceedings of the IEEE International Conference
on Robotics and Automation, 2:1802-1809.
Nistér D (2004) An efficient solution to the five-point relative pose problem. IEEE Trans Pattern
Anal Mach Intell, 26(6): 756-770.
Nistér D, Naroditsky O, Bergen J (2004) Visual Odometry. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 1:652-659.
Nistér D, Stewenius H (2006) Scalable Recognition with a Vocabulary Tree. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2161-2168.
Nützi G, Weiss S, Scaramuzza D, Siegwart R (2010) Fusion of IMU and vision for absolute scale
estimation in monocular SLAM. J Intell Robot Syst, doi:10.1007/s10846-010-9490-z.
Olson C, Matthies L, Schoppers M, Maimone M (2003) Rover navigation using stereo ego-motion.
Robot Autonom Syst, 43(4): 215-229.
Olson E, Leonard J, Teller S (2006) Fast iterative optimization of pose graphs with poor initial
estimates. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp.
2262-2269.
Olson C, Matthies L, Wright J, et al. (2007) Visual terrain mapping for mars exploration. Comput
Vis Image Understand, 105(1):73-85.
Özuysal M, Calonder M, Lepetit V, Fua P (2010) Fast keypoint recognition using random ferns.
IEEE Trans Pattern Anal Mach Intell, 32(3): 448 – 461.
Paz L, Piniés P, Tardós JD, Neira J (2008) Large-Scale 6DOF SLAM with stereo-in-hand. IEEE Trans
Robot, 24(5): 946-957.
Piniés P, Tardós JD, Neira J (2006) Localization of avalanche victims using robocentric SLAM. In:
Proceedings of the IEEE International Conference on Intelligent Robots and Systems, pp. 3074 –
3079.
Piniés P, Tardós JD (2008) Large scale SLAM building conditionally independent local maps:
application to monocular vision. IEEE Trans Robot, 24(5): 1094-1106.
Pollefeys M, Van L, Vergauwen M, et al. (2004) Visual modeling with a hand-held camera. Int J
Comput Vis, 59(3): 207-232.
Pretto A, Menegatti E, Pagello E (2007) Reliable features matching for humanoid robots. In: IEEE-
RAS International Conference on Humanoid Robots, pp. 532-538.
Pupilli M, Calway A (2006) Real-time visual SLAM with resilience to erratic motion. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1:1244–1249.
Ribas D, Ridao P, Tardós JD, et al. (2008) Underwater SLAM in man-made structured
environments. J Field Robot, 25(11): 898-921.
Rosten E, Drummond T (2006) Machine learning for high-speed corner detection. In: Proceedings
of the European Conference on Computer Vision, pp. 430-443.
Rublee E, Rabaud V, Konolige K, Bradski G (2011) ORB: an efficient alternative to SIFT or SURF. In:
Proceedings of the IEEE International Conference on Computer Vision.
Scaramuzza D (2011) OcamCalib Toolbox: Omnidirectional Camera and Calibration Toolbox for
Matlab. https://sites.google.com/site/scarabotix/ocamcalib-toolbox. Accessed 06 March 2012.
Se S, Lowe D, Little J (2002) Mobile robot localization and mapping with uncertainty using scale-
invariant visual landmarks. Int J Robot Res, 21(8): 735-758.
Se S, Lowe D, Little J (2005) Vision- based global localization and mapping for mobile robots. IEEE
Trans Robot, 21(3): 364-375.
Sáez J, Escolano F (2006) 6DOF entropy minimization SLAM. In: Proceedings of the IEEE
International Conference on Robotics and Automation, pp. 1548-1555.
Sanromá G, Alquézar R, Serratosa F (2010) Graph matching using SIFT descriptors – an application
to pose recovery of a mobile robot. In: 13th Joint IAPR International Workshop on Structural,
Syntactic and Statistical Pattern Recognition, pp. 254-263.
Silpa C, Hartley R (2008) Optimised KD-trees for fast image descriptor matching. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition.
Sinha S, Frahm J, Pollefeys M, Genc Y (2006) GPU-based video feature tracking and matching. In:
Workshop on Edge Computing Using New Commodity Architectures.
Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos.
In: Proceedings of the IEEE International Conference on Computer Vision.
Smith M, Baldwin I, Churchill W, et al. (2009) The New College Vision and Laser Data Set. Int J
Robot Res, 28(5):595-599.
Smith M (2012) The New College Vision and Laser Data Set.
http://www.robots.ox.ac.uk/NewCollegeData/. Accessed 06 March 2012.
Solà J (2007) Multi-camera VSLAM: from former information losses to self-calibration. In:
Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Workshop
on Visual SLAM.
Steder B, Grisetti G, Stachniss C, et al. (2008) Visual SLAM for flying vehicles. IEEE Trans Robot,
24(5): 1088-1093.
Sturm J, Magnenat S, et al (2011) Towards a benchmark for RGB-D SLAM evaluation. In:
Proceedings of the RGB-D Workshop on Advanced Reasoning with Depth Cameras at Robotics:
Science and Systems Conference.
Strasdat H, Montiel J, Davison A (2010a) Scale drift-aware large scale monocular SLAM. In
Proceeding of Robotics: Science and Systems.
Strasdat H, Montiel J, Davison A (2010b) Real-time monocular SLAM: Why filter?. In: Proceedings
of the IEEE International Conference on Robotics and Automation.
Tardós JD, Neira J, Newman P, et al. (2002) Robust mapping and localization in indoor
environments using sonar data. Int J Robot Res, 21:311-330.
Taylor S, Drummond T (2009) Multiple target localization at over 100 FPS. In: Proceedings of the
British Machine Vision Conference.
Thrun S. (2002) Robotic mapping: a survey. Exploring artificial intelligence in the new millennium,
ISBN:1-55860-811-7.
Thrun S, Koller D, Ghahramani Z, et al. (2002) Simultaneous mapping and localization with sparse
extended information filters: theory and initial results. Technical Report CMU-CS-02-112,
Carnegie Mellon University.
Thrun S (2003) A system for volumetric robotic mapping of abandoned mines. In: Proceedings of
the IEEE International Conference on Robotics and Automation, 3: 4270-4275.
Thrun S, Montemerlo M, Dahlkamp H, et al. (2005a) Stanley: the robot that won the DARPA
Grand Challenge. J Field Robot, 23(9):661-692.
Thrun S, Burgard W, Fox D (2005b) Probabilistic Robotics. The MIT Press, ISBN: 0262201623.
Thrun S, Montemerlo M, Aron A (2006) Probabilistic terrain analysis for high speed desert driving.
In: Proceedings of Robotics: Science and Systems.
Tirilly P, Claveau V, Gros P (2010) Distances and weighting schemes for bag of visual words image
retrieval. In: Proceedings of the International Conference on Multimedia Information Retrieval,
pp. 323-333.
Tuytelaars T, Van-Gool L (2004) Matching widely separated views based on affine invariant
regions. Int J Comput Vis, 59(1):61-85.
Tuytelaars T, Mikolajczyk K (2008) Local Invariant Feature Detectors: A survey. In: Foundations
and Trends in Computer Graphics and Vision.
Vidal T, Bryson M, Sukkarieh S, et al. (2007) On the observability of bearing-only SLAM. In:
Proceedings of the IEEE International Conference on Robotics and Automation, pp. 4114-4119.
Vidal T, Berger C, Sola J, Lacroix S (2011) Large scale multiple robot visual mapping with
heterogeneous landmarks in semi-structured terrain. Robot Autonom Syst, pp. 654-674.
Wang C, Thorpe Ch, Thrun S, et al. (2007) Simultaneous localization, mapping and moving object
tracking. Int J Robot Res, 26(9): 889-916.
Wangsiripitak S, Murray D (2009) Avoiding moving outliers in visual SLAM by tracking moving
objects. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp.
375-380.
Williams B, Klein G, Reid I (2007) Real-time SLAM relocalisation. In: Proceedings of the IEEE
International Conference on Computer Vision.
Williams B (2009b) Simultaneous Localisation and Mapping Using a Single Camera. PhD. Thesis,
Oxford University, England.
Willson (1995) Tsai Camera Calibration Software. http://www.cs.cmu.edu/~rgw/TsaiCode.html.
Accessed 06 March 2012.
Yilmaz A, Javed O, Shah M (2006) Object tracking: a survey. ACM Comput Surv, 38(4).
Zhang H, Li B, Yang D (2010) Keyframe detection for appearance-based visual SLAM. In:
Proceedings of the IEEE International Conference on Intelligent Robots and Systems, pp. 2071-
2076.
Zhang W, Kosecka J (2006) Image based localization in urban environments. In: Proceedings of
the Third International Symposium on 3D Data Processing, Visualization, and Transmission.
Zhang Z, Deriche R, Faugeras O, Luong Q (1994) A robust technique for matching two
uncalibrated images through the recovery of the unknown epipolar geometry. Artificial
Intelligence, Special Volume on Computer Vision, vol. 78.
Zhang Z (2000) A flexible new technique for camera calibration. IEEE Trans Pattern Anal Mach
Intell, 22(11): 1330-1334.