
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 29, NO. 6, JUNE 2007

MonoSLAM: Real-Time Single Camera SLAM


Andrew J. Davison, Ian D. Reid, Member, IEEE, Nicholas D. Molton, and Olivier Stasse, Member, IEEE

A.J. Davison is with the Department of Computing, Imperial College, 180 Queen's Gate, SW7 2AZ, London, UK. E-mail: [email protected].
I.D. Reid is with the Robotics Research Group, Department of Engineering Science, University of Oxford, OX1 3PJ, UK. E-mail: [email protected].
N.D. Molton is with Imagineer Systems Ltd., The Surrey Technology Centre, 40 Occam Road, The Surrey Research Park, Guildford GU2 7YG, UK. E-mail: [email protected].
O. Stasse is with the Joint Japanese-French Robotics Laboratory (JRL), CNRS/AIST, AIST Central 2, 1-1-1 Umezono, Tsukuba, Ibaraki, 305-8568, Japan. E-mail: [email protected].

Manuscript received 13 Dec. 2005; revised 29 June 2006; accepted 6 Sept. 2006; published online 18 Jan. 2007. Recommended for acceptance by C. Taylor. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-0705-1205. Digital Object Identifier no. 10.1109/TPAMI.2007.1049.

Abstract—We present a real-time algorithm which can recover the 3D trajectory of a monocular camera, moving rapidly through a
previously unknown scene. Our system, which we dub MonoSLAM, is the first successful application of the SLAM methodology from
mobile robotics to the “pure vision” domain of a single uncontrolled camera, achieving real time but drift-free performance inaccessible
to Structure from Motion approaches. The core of the approach is the online creation of a sparse but persistent map of natural
landmarks within a probabilistic framework. Our key novel contributions include an active approach to mapping and measurement, the
use of a general motion model for smooth camera movement, and solutions for monocular feature initialization and feature orientation
estimation. Together, these add up to an extremely efficient and robust algorithm which runs at 30 Hz with standard PC and camera
hardware. This work extends the range of robotic systems in which SLAM can be usefully applied, but also opens up new areas. We
present applications of MonoSLAM to real-time 3D localization and mapping for a high-performance full-size humanoid robot and live
augmented reality with a hand-held camera.

Index Terms—Autonomous vehicles, 3D/stereo scene analysis, tracking.

1 INTRODUCTION

THE last 10 years have seen significant progress in autonomous robot navigation and, specifically, Simultaneous Localization and Mapping (SLAM) has become well-defined in the robotics community as the question of a moving sensor platform constructing a representation of its environment on the fly while concurrently estimating its ego-motion. SLAM is today routinely achieved in experimental robot systems using modern methods of sequential Bayesian inference and SLAM algorithms are now starting to cross over into practical systems. Interestingly, however, and despite the large computer vision research community, until very recently the use of cameras has not been at the center of progress in robot SLAM, with much more attention given to other sensors such as laser range-finders and sonar.

This may seem surprising since for many reasons vision is an attractive choice of SLAM sensor: cameras are compact, accurate, noninvasive, and well-understood—and today cheap and ubiquitous. Vision, of course, also has great intuitive appeal as the sense humans and animals primarily use to navigate. However, cameras capture the world's geometry only indirectly through photometric effects and it was thought too difficult to turn the sparse sets of features popping out of an image into reliable long-term maps generated in real-time, particularly since the data rates coming from a camera are so much higher than those from other sensors.

Instead, vision researchers concentrated on reconstruction problems from small image sets, developing the field known as Structure from Motion (SFM). SFM algorithms have been extended to work on longer image sequences (e.g., [1], [2], [3]), but these systems are fundamentally offline in nature, analyzing a complete image sequence to produce a reconstruction of the camera trajectory and scene structure observed. To obtain globally consistent estimates over a sequence, local motion estimates from frame-to-frame feature matching are refined in a global optimization moving backward and forward through the whole sequence (called bundle adjustment). These methods are perfectly suited to the automatic analysis of short image sequences obtained from arbitrary sources—movie shots, consumer video, or even decades-old archive footage—but do not scale to consistent localization over arbitrarily long sequences in real time.

Our work is highly focused on high frame-rate real-time performance (typically 30Hz) as a requirement. In applications, real-time algorithms are necessary only if they are to be used as part of a loop involving other components in the dynamic world—a robot that must control its next motion step, a human that needs visual feedback on his actions or another computational process which is waiting for input. In these cases, the most immediately useful information to be obtained from a moving camera in real time is where it is, rather than a fully detailed "final result" map of a scene ready for display. Although localization and mapping are intricately coupled problems and it has been proven in SLAM research that solving either requires solving both, in this work we focus on localization as the main output of interest. A map is certainly built, but it is a sparse map of landmarks optimized toward enabling localization.

Further, real-time camera tracking scenarios will often involve extended and looping motions within a restricted environment (as a humanoid performs a task, a domestic robot cleans a home, or a room is viewed from different angles with graphical augmentations). Repeatable localization, in which gradual drift from ground truth does not occur, will be essential here and much more important than in cases where a moving camera continually explores new regions without returning. This is where our fully-probabilistic SLAM approach comes into its own: It will naturally construct a persistent map of scene landmarks to be referenced indefinitely in a state-based framework and permit loop closures to correct long-term drift. Forming a persistent world map means that if camera motion is restricted, the processing requirement of the algorithm is bounded and continuous real-time operation can be maintained, unlike in tracking approaches such as [4], where loop-closing corrections are achieved by matching to a growing history of past poses.

1.1 Contributions of This Paper
Our key contribution is to show that it is indeed possible to achieve real-time localization and mapping with a single freely moving camera as the only data source. We achieve this by applying the core of the probabilistic SLAM methodology with novel insights specific to what here is a particularly difficult SLAM scenario. The MonoSLAM algorithm we explain and demonstrate achieves the efficiency required for real-time operation by using an active, guided approach to feature mapping and measurement, a general motion model for smooth 3D camera movement to capture the dynamical prior information inherent in a continuous video stream and a novel top-down solution to the problem of monocular feature initialization.

In a nutshell, when compared to SFM approaches to sequence analysis, using SLAM we are able both to implement on-the-fly probabilistic estimation of the state of the moving camera and its map and benefit from this in using the running estimates to guide efficient processing. This aspect of SLAM is often overlooked. Sequential SLAM is very naturally able for instance to select a set of highly salient and trackable but efficiently spaced features to put into its visual map, with the use of only simple mapping heuristics. Sensible confidence bound assumptions allow all but the most important image processing to be avoided and at high frame-rates all but tiny search regions of incoming images are completely ignored by our algorithm. Our approach to mapping can be summarized as "a sparse map of high quality features."

In this paper, we are able to demonstrate real-time MonoSLAM indoors in room-sized domains. A long term goal in SLAM shared by many would be to achieve a system with the following performance: A single low-cost camera attached to a portable computer would be switched on at an arbitrary location in an unknown scene, then carried off by a fast-moving robot (perhaps flying or jumping) or even a running human through an arbitrarily large domain, all the time effortlessly recovering its trajectory in real time and building a detailed, persistent map of all it has seen. While others attack the large map issue, but continue to work with the same slow-moving robots and multisensor platforms as before, we are approaching the problem from the other direction and solve issues relating to highly dynamic 3D motion, commodity vision-only sensing, processing efficiency and relaxing platform assumptions. We believe that our results are of both theoretical and practical importance because they open up completely new avenues for the application of SLAM techniques.

The current paper draws on earlier work published in conference papers [5], [6], [7]. We also present new unpublished results demonstrating the advanced use of the algorithm in humanoid robotics and augmented reality applications.

2 RELATED WORK
The work of Harris and Pike [8], whose DROID system built visual maps sequentially using input from a single camera, is perhaps the grandfather of our research and was far ahead of its time. Impressive results showed 3D maps of features from long image sequences, and a later real-time implementation was achieved. A serious oversight of this work, however, was the treatment of the locations of each of the mapped visual features as uncoupled estimation problems, neglecting the strong correlations introduced by the common camera motion. Closely-related approaches were presented by Ayache [9] and later Beardsley et al. [10] in an uncalibrated geometrical framework, but these approaches also neglected correlations, the result being overconfident mapping and localization estimates and an inability to close loops and correct drift.

Smith et al. [11] and, at a similar time, Moutarlier and Chatila [12], had proposed taking account of all correlations in general robot localization and mapping problems within a single state vector and covariance matrix updated by the Extended Kalman Filter (EKF). Work by Leonard [13], Manyika [14], and others demonstrated increasingly sophisticated robot mapping and localization using related EKF techniques, but the single state vector and "full covariance" approach of Smith et al. did not receive widespread attention until the mid to late 1990s, perhaps when computing power reached the point where it could be practically tested. Several early implementations [15], [16], [17], [18], [19] proved the single EKF approach for building modest-sized maps in real robot systems and demonstrated convincingly the importance of maintaining estimate correlations. These successes gradually saw very widespread adoption of the EKF as the core estimation technique in SLAM and its generality as a Bayesian solution was understood across a variety of different platforms and sensors.

In the intervening years, SLAM systems based on the EKF and related probabilistic filters have demonstrated impressive results in varied domains. The methods deviating from the standard EKF have mainly aimed at building large scale maps, where the EKF suffers problems of computational complexity and inaccuracy due to linearization, and have included submapping strategies (e.g., [20], [21]) and factorized particle filtering (e.g., [22]). The most impressive results in terms of mapping accuracy and scale have come from robots using laser range-finder sensors. These directly return accurate range and bearing scans over a slice of the nearby scene, which can either be processed to extract repeatable features to insert into a map (e.g., [23]) or simply matched whole-scale with other overlapping scans to accurately measure robot displacement and build a map of historic robot locations each with a local scan reference (e.g., [24], [25]).

2.1 Vision-Based SLAM
Our algorithm uses vision as the only outward-looking sense. In Section 1, we mentioned the additional challenges posed by vision over laser sensors, which include the very high input data rate, the inherent 3D quality of visual data, the lack of direct depth measurement and the difficulty in extracting long-term features to map. These factors have combined to mean that there have been relatively few successful vision-only SLAM systems (where now we define a SLAM system as one able to construct persistent maps on the fly while closing loops to correct drift). In this section, we review some of the most interesting and place our work into context.

Neira et al. presented a simple system mapping vertical line segments in 2D in a constrained indoor environment [26], but the direct ancestor of the approach in the current paper was the work by Davison and Murray [18], [27], [28] whose system using fixating active stereo was the first visual SLAM system with processing in real time (at 5Hz), able to build a 3D map of natural landmarks on the fly and control a mobile robot. The robotic active head that was used forced a one-by-one choice of feature measurements and sparse mapping. Nevertheless, it was proven that a small set of landmarks could provide a very accurate SLAM reference if carefully chosen and spread. Davison and Kita [29] extended this method to the case of a robot able to localize while traversing nonplanar ramps by combining stereo vision with an inclinometer.

In more recent work, vision-based SLAM has been used in a range of different systems. Jung and Lacroix [30] presented a stereo vision SLAM system using a downward-looking stereo rig to localize a robotic airship and perform terrain mapping. Their implementation was sequential, but did not run in real time and relied on a wide baseline fixed stereo rig to obtain depth measurements directly. Kim and Sukkarieh [31] used monocular vision in combination with accurate inertial sensing to map ground-based targets from a dynamically maneuvering UAV in an impressive system, though the targets were artificially placed and estimation of their locations is made easier by the fact that they can be assumed to lie in a plane.

Bosse et al. [20], [32] used omnidirectional vision in combination with other sensors in their ATLAS submapping framework, making particular use of lines in a man-made environment as consistent bearing references. Most recently Eustice et al. [33] have used a single downward-looking camera and inertial sensing to localize an underwater remote vehicle and produce detailed seabed reconstructions from low frame-rate image sequences. Using an efficient sparse information filter their approach scales well to large-scale mapping in their experimental setup where loop closures are relatively infrequent.

Recently published work by Sim et al. [34] uses an algorithm combining SIFT features [35] and FastSLAM filtering [22] to achieve particularly large-scale vision-only SLAM mapping. Their method is processor-intensive and at an average of 10 seconds processing time per frame is currently a large factor away from real-time operation. The commercial vSLAM system [36] also uses SIFT features, though within a SLAM algorithm which relies significantly on odometry to build a connected map of recognizable locations rather than fully continuous accurate localization. There is little doubt that invariant features such as SIFT provide a high level of performance in matching and permit high fidelity "location recognition" in the same way as they were designed for use in visual object recognition. Their value in loop-closing or for localizing a "lost robot," which involve matching with very weak priors, is clear. They are less suited to continuous tracking, however, due to the high-computational cost of extracting them—a method like ours using active search will always outperform invariant matching for speed.

A stress of our work is to simplify the hardware required for SLAM to the simplest case possible, a single camera connected to a computer, and to require a minimum of assumptions about this camera's free 3D movement. Several authors have presented real-time camera tracking systems with goals similar to our own. McLauchlan and Murray [37] introduced the VSDF (Variable State-Dimension Filter) for simultaneous structure and motion recovery from a moving camera using a sparse information filter framework, but were not able to demonstrate long-term tracking or loop closing. The approach of Chiuso et al. [38] shared several of the ideas of our work, including the propagation of map and localization uncertainty using a single Extended Kalman Filter, but only limited results of tracking small groups of objects with small camera motions were presented. Their method used simple gradient descent feature tracking and was therefore unable to match features during high acceleration or close loops after periods of neglect. Nistér et al. [39] presented a real-time system based very much on the standard structure from motion methodology of frame-to-frame matching of large numbers of point features which was able to recover instantaneous motions impressively but again had no ability to rerecognize features after periods of neglect and, therefore, would lead inevitably to rapid drift in augmented reality or localization. Foxlin [40] has taken a different approach in a single camera system by using fiducial markers attached to the ceiling in combination with high-performance inertial sensing. This system achieved very impressive and repeatable localization results, but with the requirement for substantial extra infrastructure and cost. Burschka and Hager [41] demonstrated a small-scale visual localization and mapping system, though by separating the localization and mapping steps they neglect estimate correlations and the ability of this method to function over long time periods is doubtful.

In the following section, we will present our method step by step in a form accessible to readers unfamiliar with the details of previous SLAM approaches.

3 METHOD
3.1 Probabilistic 3D Map
The key concept of our approach, as in [11], is a probabilistic feature-based map, representing at any instant a snapshot of the current estimates of the state of the camera and all features of interest and, crucially, also the uncertainty in these estimates. The map is initialized at system start-up and persists until operation ends, but evolves continuously and dynamically as it is updated by the Extended Kalman Filter. The probabilistic state estimates of the camera and features are updated during camera motion and feature observation. When new features are observed the map is enlarged with new states and, if necessary, features can also be deleted.

The probabilistic character of the map lies in the propagation over time not only of the mean "best" estimates of the states of the camera and features but a first order uncertainty distribution describing the size of possible deviations from these values.

Fig. 1. (a) A snapshot of the probabilistic 3D map, showing camera position estimate and feature position uncertainty ellipsoids. In this, and other
figures in the paper, the feature color code is as follows: red = successfully measured, blue = attempted but failed measurement, and yellow = not
selected for measurement on this step. (b) Visually salient feature patches detected to serve as visual landmarks and the 3D planar regions deduced
by back-projection to their estimated world locations. These planar regions are projected into future estimated camera positions to predict patch
appearance from new viewpoints.

Mathematically, the map is represented by a state vector x̂ and covariance matrix P. State vector x̂ is composed of the stacked state estimates of the camera and features and P is a square matrix of equal dimension which can be partitioned into submatrix elements as follows:

    \hat{x} = \begin{pmatrix} \hat{x}_v \\ \hat{y}_1 \\ \hat{y}_2 \\ \vdots \end{pmatrix}, \qquad
    P = \begin{pmatrix} P_{xx} & P_{xy_1} & P_{xy_2} & \cdots \\ P_{y_1 x} & P_{y_1 y_1} & P_{y_1 y_2} & \cdots \\ P_{y_2 x} & P_{y_2 y_1} & P_{y_2 y_2} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}.    (1)

In doing this, the probability distribution over all map parameters is approximated as a single multivariate Gaussian distribution in a space of dimension equal to the total state vector size.

Explicitly, the camera's state vector x_v comprises a metric 3D position vector r^W, orientation quaternion q^{WR}, velocity vector v^W, and angular velocity vector ω^R relative to a fixed world frame W and "robot" frame R carried by the camera (13 parameters):

    x_v = \begin{pmatrix} r^W \\ q^{WR} \\ v^W \\ \omega^R \end{pmatrix}.    (2)

In this work, feature states y_i are the 3D position vectors of the locations of point features. Camera and feature geometry and coordinate frames are defined in Fig. 3a. The camera is modeled as a rigid body needing translation and rotation parameters to describe its position and we also maintain estimates of its linear and angular velocity: This is important in our algorithm since we will make use of motion dynamics as will be explained in Section 3.4.

The map can be pictured as in Fig. 1a: All geometric estimates can be considered as surrounded by ellipsoidal regions representing uncertainty bounds (here corresponding to three standard deviations). What Fig. 1a cannot show is that the various ellipsoids are potentially correlated to various degrees: In sequential mapping, a situation which commonly occurs is that spatially close features which are often observed simultaneously by the camera will have position estimates whose difference (relative position) is very well-known, while the position of the group as a whole relative to the global coordinate frame may not be. This situation is represented in the map covariance matrix P by nonzero entries in the off-diagonal matrix blocks and comes about naturally through the operation of the algorithm.

The total size of the map representation is order O(N^2), where N is the number of features and the complete SLAM algorithm we use has O(N^2) complexity. This means that the number of features which can be maintained with real-time processing is bounded—in our system to around 100 in the current 30 Hz implementation.

There are strong reasons why we have chosen, in this work, to use the "standard" single, full covariance EKF approach to SLAM rather than variants which use different probabilistic representations. As we have stated, our current goal is long-term, repeatable localization within restricted volumes. The pattern of observation of features in one of our maps is quite different from that seen in many other implementations of SLAM for robot mapping, such as [25], [34], or [22]. Those robots move largely through corridor-like topologies, following exploratory paths until they infrequently come back to places they have seen before, at that stage correcting drift around loops. Relatively ad hoc approaches can be taken to distributing the correction around the well-defined loops, whether this is through a chain of uncertain pose-to-pose transformations or submaps or by selecting from a potentially impoverished discrete set of trajectory hypotheses represented by a finite number of particles.

In our case, as a free camera moves and rotates in 3D around a restricted space, individual features will come in and out of the field of view in varying sequences, various subsets of features at different depths will be covisible as the camera rotates, and loops of many different sizes and interlinking patterns will be routinely closed. We have considered it very important to represent the detailed, flexible correlations which will arise between different parts of the map accurately. Within the class of known methods, this is only computationally feasible with a sparse map of features maintained within a single state vector and covariance matrix. One hundred well-chosen features turn out to be sufficient with careful map management to span a room. In our opinion, it remains to be proven whether a method (for instance, FastSLAM [22], [42]) which can cope with a much larger number of features, but represent correlations less accurately, will be able to give such good repeatable localization results in agile single camera SLAM.
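For readers who prefer code, the following minimal sketch (our own illustration, not the authors' implementation; the class and variable names are invented) shows one way the stacked state vector of (1) and (2) and the full covariance matrix could be stored and enlarged when a feature is added, preserving the cross-correlation blocks discussed above.

```python
import numpy as np

class MonoSlamState:
    """Minimal sketch of the stacked EKF state of Section 3.1 (illustrative only).

    x = [camera (13 params: r^W, q^WR, v^W, omega^R), y_1, y_2, ...]
    P is the full, dense covariance over the whole stacked state.
    """

    CAM_DIM = 13  # 3 position + 4 quaternion + 3 velocity + 3 angular velocity

    def __init__(self, x_cam, P_cam):
        self.x = np.asarray(x_cam, dtype=float)   # stacked state vector
        self.P = np.asarray(P_cam, dtype=float)   # full covariance matrix
        self.feature_offsets = []                 # start index of each y_i in x

    def add_feature(self, y_new, P_yy, P_xy):
        """Enlarge the map with a new 3D point feature.

        P_yy: 3x3 covariance of the new feature; P_xy: cross-covariance between
        the existing state and the new feature. Keeping these blocks is exactly
        the correlation information that the uncoupled approaches of Section 2
        throw away.
        """
        n = self.x.size
        self.feature_offsets.append(n)
        self.x = np.concatenate([self.x, y_new])
        P_new = np.zeros((n + 3, n + 3))
        P_new[:n, :n] = self.P
        P_new[:n, n:] = P_xy
        P_new[n:, :n] = P_xy.T
        P_new[n:, n:] = P_yy
        self.P = P_new
```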

Fig. 2. (a) Matching the four known features of the initialization target on the first frame of tracking. The large circular search regions reflect the high uncertainty assigned to the starting camera position estimate. (b) Visualization of the model for "smooth" motion: At each camera position, we predict a most likely path together with alternatives with small deviations.

Fig. 3. (a) Frames and vectors in camera and feature geometry. (b) Active search for features in the raw images from the wide-angle camera. Ellipses show the feature search regions derived from the uncertainty in the relative positions of camera and features and only these regions are searched.

3.2 Natural Visual Landmarks
Now, we turn specifically to the features which make up the map. We have followed the approach of Davison and Murray [5], [27], who showed that relatively large (11 × 11 pixels) image patches are able to serve as long-term landmark features, the large templates having more unique signatures than standard corner features. However, we extend the power of such features significantly by using the camera localization information we have available to improve matching over large camera displacements and rotations.

Salient image regions are originally detected automatically (at times and in locations guided by the strategies of Section 3.7) using the detection operator of Shi and Tomasi [43] from the monochrome images obtained from the camera (note that, in the current work, we use monochrome images primarily for reasons of efficiency). The goal is to be able to identify these same visual landmarks repeatedly during potentially extreme camera motions and, therefore, straightforward 2D template matching (as in [5]) is very limiting, as after only small degrees of camera rotation and translation the appearance of a landmark can change greatly. To improve on this, we make the approximation that each landmark lies on a locally planar surface—an approximation that will be very good in many cases and bad in others, but a great deal better than assuming that the appearance of the patch will not change at all. Further, since we do not know the orientation of this surface, we make the assignment that the surface normal is parallel to the vector from the feature to the camera at initialization (in Section 3.8, we will present a method for updating estimates of this normal direction). Once the 3D location, including depth, of a feature has been fully initialized using the method of Section 3.6, each feature is stored as an oriented planar texture (Fig. 1b). When making measurements of a feature from new camera positions, its patch can be projected from 3D to the image plane to produce a template for matching with the real image. This template will be a warped version of the original square template captured when the feature was first detected. In general, this will be a full projective warping, with shearing and perspective distortion, since we just send the template through backward and forward camera models. Even if the orientation of the surface on which the feature lies is not correct, the warping will still take care successfully of rotation about the cyclotorsion axis and scale (the degrees of freedom to which the SIFT descriptor is invariant) and some amount of other warping.

Note that we do not update the saved templates for features over time—since the goal is repeatable localization, we need the ability to exactly remeasure the locations of features over arbitrarily long time periods. Templates which are updated over time will tend to drift gradually from their initial positions.

3.3 System Initialization
In most SLAM systems, the robot has no specific knowledge about the structure of the world around it when first switched on. It is free to define a coordinate frame within which to estimate its motion and build a map and the obvious choice is to fix this frame at the robot's starting position, defined as the origin. In our single camera SLAM algorithm, we choose to aid system start-up with a small amount of prior information about the scene in the shape of a known target placed in front of the camera. This provides several features (typically four) with known positions and of known appearance. There are two main reasons for this:

1. In single camera SLAM, with no direct way to measure feature depths or any odometry, starting from a target of known size allows us to assign a precise scale to the estimated map and motion—rather than running with scale as a completely unknown degree of freedom. Knowing the scale of the map is desirable whenever it must be related to other information such as priors on motion dynamics or feature depths and makes it much easier to use in real applications.

2. Having some features in the map right from the start means that we can immediately enter our normal predict-measure-update tracking sequence without any special first step. With a single camera, features cannot be initialized fully into the map after only one measurement because of their unknown depths and, therefore, within our standard framework we would be stuck without features to match to estimate the camera motion from frames one to two. (Of course, standard stereo algorithms provide a separate approach which could be used to bootstrap motion and structure estimation.)

Fig. 2a shows the first step of tracking with a typical initialization target. The known features—in this case, the corners of the black rectangle—have their measured positions placed into the map at system start-up with zero uncertainty. It is now these features which define the world coordinate frame for SLAM. On the first tracking frame, the camera is held in a certain approximately known location relative to the target for tracking to start. In the state vector the initial camera position is given an initial level of uncertainty corresponding to a few degrees and centimeters. This allows tracking to "lock on" very robustly in the first frame just by starting the standard tracking cycle.

3.4 Motion Modeling and Prediction
After start-up, the state vector is updated in two alternating ways: 1) the prediction step, when the camera moves in the "blind" interval between image capture and 2) the update step, after measurements have been achieved of features. In this section, we consider prediction.

Constructing a motion model for an agile camera which is carried by an unknown person, robot, or other moving body may, at first glance, seem to be fundamentally different to modeling the motion of a wheeled robot moving on a plane: The key difference is that, in the robot case, one is in possession of the control inputs driving the motion, such as "move forward 1 m with steering angle 5 degrees," whereas we do not have such prior information about the agile camera's movements. However, it is important to remember that both cases are just points on the continuum of types of model for representing physical systems. Every model must stop at some level of detail and a probabilistic assumption made about the discrepancy between this model and reality: This is what is referred to as process uncertainty (or noise). In the case of a wheeled robot, this uncertainty term takes account of factors such as potential wheel slippage, surface irregularities, and other predominantly unsystematic effects which have not been explicitly modeled. In the case of an agile camera, it takes account of the unknown dynamics and intentions of the human or robot carrier, but these too can be probabilistically modeled.

We currently use a "constant velocity, constant angular velocity model." This means not that we assume that the camera moves at a constant velocity over all time, but that our statistical model of its motion in a time step is that on average we expect undetermined accelerations to occur with a Gaussian profile. The model is visualized in Fig. 2b. The implication of this model is that we are imposing a certain smoothness on the camera motion expected: very large accelerations are relatively unlikely. This model is subtly effective and gives the whole system important robustness even when visual measurements are sparse.

We assume that, in each time step, unknown acceleration a^W and angular acceleration α^R processes of zero mean and Gaussian distribution cause an impulse of velocity and angular velocity:

    n = \begin{pmatrix} V^W \\ \Omega^R \end{pmatrix} = \begin{pmatrix} a^W \Delta t \\ \alpha^R \Delta t \end{pmatrix}.    (3)

Depending on the circumstances, V^W and Ω^R may be coupled together (for example, by assuming that a single force impulse is applied to the rigid shape of the body carrying the camera at every time step, producing correlated changes in its linear and angular velocity). Currently, however, we assume that the covariance matrix of the noise vector n is diagonal, representing uncorrelated noise in all linear and rotational components. The state update produced is:

    f_v = \begin{pmatrix} r^W_{new} \\ q^{WR}_{new} \\ v^W_{new} \\ \omega^R_{new} \end{pmatrix}
        = \begin{pmatrix} r^W + (v^W + V^W)\Delta t \\ q^{WR} \times q((\omega^R + \Omega^R)\Delta t) \\ v^W + V^W \\ \omega^R + \Omega^R \end{pmatrix}.    (4)

Here, the notation q((ω^R + Ω^R)Δt) denotes the quaternion trivially defined by the angle-axis rotation vector (ω^R + Ω^R)Δt.

In the EKF, the new state estimate f_v(x_v, u) must be accompanied by the increase in state uncertainty (process noise covariance) Q_v for the camera after this motion. We find Q_v via the Jacobian calculation:

    Q_v = \frac{\partial f_v}{\partial n} P_n \frac{\partial f_v}{\partial n}^T,    (5)

where P_n is the covariance of noise vector n. EKF implementation also requires calculation of the Jacobian ∂f_v/∂x_v. These Jacobian calculations are complicated but a tractable matter of differentiation; we do not present the results here.

The rate of growth of uncertainty in this motion model is determined by the size of P_n and setting these parameters to small or large values defines the smoothness of the motion we expect. With small P_n, we expect a very smooth motion with small accelerations, and would be well placed to track motions of this type, but unable to cope with sudden rapid movements. High P_n means that the uncertainty in the system increases significantly at each time step and while this gives the ability to cope with rapid accelerations the very large uncertainty means that a lot of good measurements must be made at each time step to constrain estimates.
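The prediction step of (3), (4), and (5) can be sketched as below. This is only an illustration under our own assumptions (quaternion stored as [qw, qx, qy, qz], a numerical rather than analytic Jacobian for Q_v); it is not the paper's code.

```python
import numpy as np

def quat_from_angle_axis(w):
    """Quaternion q(w) for an angle-axis rotation vector w (convention [qw, qx, qy, qz] assumed)."""
    angle = np.linalg.norm(w)
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = w / angle
    return np.concatenate([[np.cos(angle / 2.0)], np.sin(angle / 2.0) * axis])

def quat_mult(a, b):
    """Hamilton product of two quaternions."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw*bw - ax*bx - ay*by - az*bz,
        aw*bx + ax*bw + ay*bz - az*by,
        aw*by - ax*bz + ay*bw + az*bx,
        aw*bz + ax*by - ay*bx + az*bw,
    ])

def predict_camera(r, q, v, w, dt, V=np.zeros(3), Omega=np.zeros(3)):
    """Constant velocity, constant angular velocity state update f_v of (4).

    V and Omega are the velocity impulses of (3); they are zero for the mean prediction.
    """
    r_new = r + (v + V) * dt
    q_new = quat_mult(q, quat_from_angle_axis((w + Omega) * dt))
    q_new /= np.linalg.norm(q_new)   # keep the quaternion normalized
    return r_new, q_new, v + V, w + Omega

def process_noise(r, q, v, w, dt, Pn, eps=1e-6):
    """Q_v = (df_v/dn) P_n (df_v/dn)^T of (5); the Jacobian is taken numerically
    here for brevity, whereas the paper computes it analytically."""
    def f(n):
        rp, qp, vp, wp = predict_camera(r, q, v, w, dt, V=n[:3], Omega=n[3:])
        return np.concatenate([rp, qp, vp, wp])
    f0 = f(np.zeros(6))
    J = np.zeros((13, 6))
    for i in range(6):
        d = np.zeros(6)
        d[i] = eps
        J[:, i] = (f(d) - f0) / eps
    return J @ Pn @ J.T
```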
3.5 Active Feature Measurement and Map Update
In this section, we consider the process of measuring a feature already in the SLAM map (we will discuss initialization in the next section).

A key part of our approach is to predict the image position of each feature before deciding which to measure. Feature matching itself is carried out using a straightforward normalized cross-correlation search for the template patch projected into the current camera estimate using the method of Section 3.2 and the image data—the template is scanned over the image and tested for a match at each location until a peak is found. This searching for a match is computationally expensive; prediction is an active approach, narrowing search to maximize efficiency.

First, using the estimates, x_v of camera position and y_i of feature position, the position of a point feature relative to the camera is expected to be:

    h^R_L = R^{RW} (y^W_i - r^W).    (6)

With a perspective camera, the position (u, v) at which the feature would be expected to be found in the image is found using the standard pinhole model:

    h_i = \begin{pmatrix} u \\ v \end{pmatrix}
        = \begin{pmatrix} u_0 - f k_u \, h^R_{Lx} / h^R_{Lz} \\ v_0 - f k_v \, h^R_{Ly} / h^R_{Lz} \end{pmatrix},    (7)

where f k_u, f k_v, u_0, and v_0 are the standard camera calibration parameters.
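As a small illustrative sketch of (6) and (7) (not the paper's code; the function and variable names are ours), the predicted pinhole image position of a mapped point feature is:

```python
import numpy as np

def predict_pinhole(y_w, r_w, R_rw, fku, fkv, u0, v0):
    """Project a world point feature into the image using (6) and (7).

    y_w: feature position in the world frame; r_w: camera position; R_rw: rotation
    from world to camera ("robot") frame, assumed derived from the current quaternion estimate.
    """
    h = R_rw @ (y_w - r_w)        # feature position relative to the camera, (6)
    u = u0 - fku * h[0] / h[2]    # (7)
    v = v0 - fkv * h[1] / h[2]
    return np.array([u, v])
```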

In the current work, however, we are using wide-angle cameras with fields of view of nearly 100 degrees, since we showed in [6] that the accuracy of SLAM is significantly improved by trading per-pixel angular resolution for increased field of view—camera and map estimates are much better constrained when features at very different viewing angles can be simultaneously observed. The imaging characteristics of such cameras are not well approximated as perspective—as Fig. 3b shows, their images show significant nonperspective distortion (straight lines in the 3D world do not project to straight lines in the image). Nevertheless, we perform feature matching on these raw images rather than undistorting them first (note that the images later must be transformed to a perspective projection for display in order to use them for augmented reality since OpenGL only supports perspective camera models).

We therefore warp the perspective-projected coordinates u = (u, v) with a radial distortion to obtain the final predicted image position u_d = (u_d, v_d): The following radial distortion model was chosen because, to a good approximation, it is invertible [44]:

    u_d - u_0 = \frac{u - u_0}{\sqrt{1 + 2 K_1 r^2}},    (8)

    v_d - v_0 = \frac{v - v_0}{\sqrt{1 + 2 K_1 r^2}},    (9)

where

    r = \sqrt{(u - u_0)^2 + (v - v_0)^2}.    (10)

Typical values from a calibration of the camera used in Section 4, calibrated using standard software and a calibration grid, were f k_u = f k_v = 195 pixels, (u_0, v_0) = (162, 125), K_1 = 6 × 10^{-6} for capture at 320 × 240 resolution.

The Jacobians of this two-step projection function with respect to camera and feature positions are also computed (this is a straightforward matter of differentiation easily performed on paper or in software). These allow calculation of the uncertainty in the prediction of the feature image location, represented by the symmetric 2 × 2 innovation covariance matrix S_i:

    S_i = \frac{\partial u_{di}}{\partial x_v} P_{xx} \frac{\partial u_{di}}{\partial x_v}^T
        + \frac{\partial u_{di}}{\partial x_v} P_{x y_i} \frac{\partial u_{di}}{\partial y_i}^T
        + \frac{\partial u_{di}}{\partial y_i} P_{y_i x} \frac{\partial u_{di}}{\partial x_v}^T
        + \frac{\partial u_{di}}{\partial y_i} P_{y_i y_i} \frac{\partial u_{di}}{\partial y_i}^T
        + R.    (11)

The constant noise covariance R of measurements is taken to be diagonal with magnitude determined by image resolution.

Knowledge of S_i is what permits a fully active approach to image search; S_i represents the shape of a 2D Gaussian PDF over image coordinates and choosing a number of standard deviations (gating, normally at 3) defines an elliptical search window within which the feature should lie with high probability. In our system, correlation searches always occur within gated search regions, maximizing efficiency and minimizing the chance of mismatches. See Fig. 3b.

S_i has a further role in active search; it is a measure of the information content expected of a measurement. Feature searches with high S_i (where the result is difficult to predict) will provide more information [45] about estimates of camera and feature positions. In Davison and Murray's work on vision-based SLAM for a robot with steerable cameras [27] this led directly to active control of the viewing direction toward profitable measurements; here we cannot control the camera movement, but in the case that many candidate measurements are available we select those with high innovation covariance, limiting the maximum number of feature searches per frame to the 10 or 12 most informative. Choosing measurements like this aims to squash the uncertainty in the system along the longest axis available and helps ensure that no particular component of uncertainty in the estimated state gets out of hand.

The obvious points of comparison for our active search technique are very fast bottom-up feature detection algorithms, which treat an image indiscriminately, but can extract all of the features in it in a few milliseconds. With active search, we will always be able to reduce the amount of image processing work, but at the potentially significant cost of extra calculations to work out where to search [45]. We do not claim that active search is sensible if the camera were to become lost—a different process would be needed to relocalize in the presence of very high uncertainty.
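Continuing the illustrative sketch (our own code, not the authors'), the radial warp of (8)-(10) and the innovation covariance of (11) might be applied as follows; the 3-sigma gate test is one plausible reading of the "normally at 3" standard-deviation gating described above.

```python
import numpy as np

def distort(u, v, u0, v0, K1):
    """Radial distortion of (8)-(10): ideal pinhole coordinates to image coordinates."""
    r = np.hypot(u - u0, v - v0)                      # (10)
    s = 1.0 / np.sqrt(1.0 + 2.0 * K1 * r * r)
    return u0 + (u - u0) * s, v0 + (v - v0) * s       # (8), (9)

def innovation_covariance(J_x, J_y, Pxx, Pxy, Pyy, R):
    """S_i of (11); J_x = d(u_d)/dx_v and J_y = d(u_d)/dy_i are the 2-row Jacobians."""
    return (J_x @ Pxx @ J_x.T + J_x @ Pxy @ J_y.T
            + J_y @ Pxy.T @ J_x.T + J_y @ Pyy @ J_y.T + R)

def in_search_ellipse(z, z_pred, S, n_sigma=3.0):
    """Gate a candidate image location against the predicted measurement and S_i."""
    d = z - z_pred
    return d @ np.linalg.solve(S, d) <= n_sigma ** 2
```

With the calibration quoted above (f k_u = f k_v = 195, (u_0, v_0) = (162, 125), K_1 = 6 × 10^{-6}), distort() reproduces the warp applied to each predicted feature position before the correlation search.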
3.6 Feature Initialization
With our monocular camera, the feature measurement model cannot be directly inverted to give the position of a new feature given an image measurement and the camera position since the feature depth is unknown. Estimating the depth of a feature will require camera motion and several measurements from different viewpoints. However, we avoid the approach of tracking the new feature in the image for several frames without attempting to estimate its 3D location at all, then performing a minibatch estimation step to initialize its depth from multiple view triangulation. This would violate our top-down methodology and waste available information: 2D tracking is potentially very difficult when the camera is moving fast. Additionally, we will commonly need to initialize features very quickly because a camera with a narrow field of view may soon pass them by.

The method we use instead after the identification and first measurement of a new feature is to initialize a 3D line into the map along which the feature must lie. This is a semiinfinite line, starting at the estimated camera position and heading to infinity along the feature viewing direction, and like other map members has Gaussian uncertainty in its parameters. Its representation in the SLAM map is:

    y_{pi} = \begin{pmatrix} r^W_i \\ \hat{h}^W_i \end{pmatrix},

where r_i is the position of its one end and \hat{h}^W_i is a unit vector describing its direction.

All possible 3D locations of the feature point lie somewhere along this line, but we are left with one degree of freedom of its position undetermined—its depth or distance along the line from the endpoint. A set of discrete depth hypotheses is uniformly distributed along this line, which can be thought of as a one-dimensional probability density over depth represented by a 1D particle distribution or histogram.

Fig. 4. A close-up view of image search in successive frames during feature initialization. In the first frame, a candidate feature image patch is
identified within a search region. A 3D ray along which the feature must lie is added to the SLAM map, and this ray is projected into subsequent
images. A distribution of depth hypotheses from 0.5 m to 5 m translates via the uncertainty in the new camera position relative to the ray into a set of
ellipses which are all searched to produce likelihoods for Bayesian reweighting of the depth distribution. A small number of time-steps is normally
sufficient to reduce depth uncertainty sufficiently to approximate as Gaussian and enable the feature to be converted to a fully-initialized point
representation.

Now, we make the approximation that over the next few time-steps as this new feature is reobserved, measurements of its image location provide information only about this depth coordinate, and that their effect on the parameters of the line is negligible. This is a good approximation because the amount of uncertainty in depth is very large compared with the uncertainty in the line's direction. While the feature is represented in this way with a line and set of depth hypotheses we refer to it as partially initialized. Once we have obtained a good depth estimate in the form of a peaked depth PDF, we convert the feature to "fully initialized" with a standard 3D Gaussian representation.

At each subsequent time step, the hypotheses are all tested by projecting them into the image, where each is instantiated as an elliptical search region. The size and shape of each ellipse is determined by the uncertain parameters of the line: Each discrete hypothesis at depth λ has 3D world location y_{λi} = r^W_i + λ \hat{h}^W_i. This location is projected into the image via the standard measurement function and relevant Jacobians of Section 3.5 to obtain the search ellipse for each depth. Note that, in the case of a nonperspective camera (such as the wide-angle cameras we normally use), the centers of the ellipses will not lie along a straight line, but a curve. This does not present a problem as we treat each hypothesis separately.

We use an efficient algorithm to make correlation searches for the same feature template over this set of ellipses, which will typically be significantly overlapping (the algorithm builds a look-up table of correlation scores so that image processing work is not repeated for the overlapping regions). Feature matching within each ellipse produces a likelihood for each, and their probabilities are reweighted via Bayes' rule: The likelihood score is simply the probability indicated by the 2D Gaussian PDF in image space implied by the elliptical search region. Note that, in the case of many small ellipses with relatively small overlaps (true when the camera localization estimate is very good), we get much more resolving power between different depth hypotheses than when larger, significantly overlapping ellipses are observed, and this affects the speed at which the depth distribution will collapse to a peak.

Fig. 4 illustrates the progress of the search over several frames and Fig. 5 shows a typical evolution of the distribution over time, from uniform prior to sharp peak. When the ratio of the standard deviation of depth to depth estimate drops below a threshold (currently 0.3), the distribution is safely approximated as Gaussian and the feature initialized as a point into the map. Features which have just crossed this threshold typically retain large depth uncertainty (see Fig. 1a which shows several uncertainty ellipsoids elongated along the approximate camera viewing direction), but this shrinks quickly as the camera moves and further standard measurements are obtained.

The important factor of this initialization is the shape of the search regions generated by the overlapping ellipses. A depth prior has removed the need to search along the entire epipolar line and improved the robustness and speed of initialization. In real-time implementation, the speed of collapse of the particle distribution is aided (and correlation search work saved) by deterministic pruning of the weakest hypotheses at each step, and during typical motions around 2-4 frames is sufficient. It should be noted that most of the experiments we have carried out have involved mostly sideways camera motions and this initialization approach would perform more poorly with motions along the optic axis where little parallax is measured.

Since the initialization algorithm of this section was first published in [5], some interesting developments to the essential idea have been published. In particular, Solà et al. [46] have presented an algorithm which represents the uncertainty in a just-initialized feature by a set of overlapping 3D Gaussian distributions spaced along the 3D initialization line. Appealing aspects of this approach are first the distribution of the Gaussians, which is uniform in inverse depth rather than uniform in depth as in our technique—this appears to be a more efficient use of multiple samples. Also, their technique allows measurements of new features immediately to have an effect on refining the camera localization estimate, improving on our need to wait until the feature is "fully-initialized." Most recently, Montiel et al. [47] have shown that a reparametrization in terms of inverse depth permits even more straightforward and efficient initialization within the standard EKF framework, in an approach similar to that used by Eade and Drummond [42] in a new FastSLAM-based monocular SLAM system.

Fig. 5. Frame-by-frame evolution of the probability density over feature depth represented by a particle set. One hundred equally-weighted particles are initially spread evenly along the range 0.5 m to 5.0 m; with each subsequent image measurement the distribution becomes more closely Gaussian.

3.7 Map Management


An important part of the overall algorithm is sensible management of the number of features in the map, and on-the-fly decisions need to be made about when new features should be identified and initialized, as well as when it might be necessary to delete a feature. Our map-maintenance criterion aims to keep the number of reliable features visible from any camera location close to a predetermined value determined by the specifics of the measurement process, the required localization accuracy and the computing power available: we have found that with a wide-angle camera a number in the region of 12 gives accurate localization without overburdening the processor. An important part of our future work plan is to put heuristics such as this on a firm theoretical footing using methods from information theory as discussed in [45].

Feature "visibility" (more accurately, predicted measurability) is calculated based on the relative position of the camera and feature and the saved position of the camera from which the feature was initialized. The feature must be predicted to lie within the image, but also the camera must not have translated too far from its initialization viewpoint of the feature or we would expect correlation to fail (note that we can cope with a full range of rotation). Features are added to the map only if the number visible in the area the camera is passing through is less than this threshold—it is undesirable to increase the number of features and add to the computational complexity of filtering without good reason. Features are detected by running the image interest operator of Shi and Tomasi to locate the best candidate within a box of limited size (around 80 × 60 pixels) placed within the image. The position of the search box is currently chosen randomly, with the constraints only that it should not overlap with any existing features and that based on the current estimates of camera velocity and angular velocity any detected features are not expected to disappear from the field of view immediately.

A feature is deleted from the map if, after a predetermined number of detection and matching attempts when the feature should be visible, more than a fixed proportion (in our work, 50 percent) are failures. This criterion prunes features which are "bad" for a number of possible reasons: They are not true 3D points (lying at occlusion boundaries such as T-junctions), lie on moving objects, are caused by specular highlights on a curved surface, or importantly are just often occluded.

Over a period of time, a "natural selection" of features takes place through these map management criteria which leads to a map of stable, static, widely-observable point features. Clutter in the scene can be dealt with even if it sometimes occludes these landmarks since attempted measurements of the occluded landmarks simply fail and do not lead to a filter update. Problems only arise if mismatches occur due to a similarity in appearance between clutter and landmarks and this can potentially lead to catastrophic failure. Note, however, that mismatches of any kind are extremely rare during periods of good tracking since the large feature templates give a high degree of uniqueness and the active search method means that matching is usually only attempted within very small image regions (typically, 15-20 pixels across).
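A loose sketch of the map-management bookkeeping just described (the counters and the attempt threshold are our own stand-ins; the target of roughly 12 visible features and the 50 percent failure ratio are from the text):

```python
def manage_map(features, n_visible_target=12, min_attempts=10, max_fail_ratio=0.5):
    """Sketch of the map-maintenance decisions of Section 3.7 (illustrative only).

    Each feature record is assumed to expose predicted_visible, attempts and failures
    attributes; min_attempts stands in for the paper's "predetermined number of
    detection and matching attempts".
    """
    # Delete features whose attempted measurements fail more than half the time.
    for f in list(features):
        if f.attempts >= min_attempts and f.failures / f.attempts > max_fail_ratio:
            features.remove(f)

    # Request detection of a new feature only when too few are measurable from here.
    visible = [f for f in features if f.predicted_visible]
    return len(visible) < n_visible_target
```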
Fig. 6. (a) Geometry of a camera in two positions observing a surface with normal n. (b) Processing cycle for estimating the 3D orientation of the planar feature surfaces.

3.8 Feature Orientation Estimation
In Section 3.2, we described how visual patch features extracted from the image stream are inserted into the map as oriented, locally-planar surfaces, but explained that the orientations of these surfaces are initially just postulated, this proving sufficient for calculating the change of appearance of the features over reasonable viewpoint changes. This is the approach used in the applications presented in Sections 4 and 5.

In this section, we show as in [7] that it is possible to go further, and use visual measurement within real-time SLAM to actually improve the arbitrarily assigned orientation for each feature and recover real information about local surface normals at the feature locations. This improves the range of measurability of each feature, but also takes us a step further toward a possible future goal of recovering detailed 3D surface maps in real-time rather than sets of sparse landmarks.

Our approach shares some of the ideas of Jin et al. [48] who described a sequential (but not real-time) algorithm they described as "direct structure from motion" which estimated feature positions and orientations. Their concept of their method as "direct" in globally tying together feature tracking and geometrical estimation is the same as the principles of probabilistic SLAM and active search used over several years in our work [5], [27]. They achieve impressive patch orientation estimates as a camera moves around a highly textured object.

Since we assume that a feature corresponds to a locally planar region in 3D space, as the camera moves its image appearance will be transformed by changes in viewpoint by warping the initial template captured for the feature. The exact nature of the warp will depend on the initial and current positions of the camera, the 3D position of the center of the feature, and the orientation of its local surface. The SLAM system provides a running estimate of camera pose and 3D feature positions. We now additionally maintain estimates of the initial camera position and the local surface orientation for each point. This allows a prediction of the feature's warped appearance from the current viewpoint. In the image, we then make a measurement of the current warp, and the difference between the prediction and measurement is used to update the surface orientation estimate.
Fig. 7. Results from real-time feature patch orientation estimation, for an outdoor scene containing one dominant plane and an indoor scene
containing several. These views are captured from the system running in real time after several seconds of motion and show the initial hypothesized
orientations with wire-frames and the current estimates with textured patches.
Fig. 6a shows the geometry of a camera in two positions viewing an oriented planar patch. The warping which relates the appearance of the patch in one view to the other is described by the homography:

H = CR[n^T x_p I - t n^T]C^{-1},        (12)

where C is the camera's calibration matrix, describing perspective projection or a local approximate perspective projection in our images with radial distortion, R and t describe the camera motion, n is the surface normal, and x_p is the image projection of the center of the patch (I is the 3 × 3 identity matrix).

It is assumed that this prediction of appearance is sufficient for the current image position of the feature to be found using a standard exhaustive correlation search over the two image coordinates within an elliptical uncertainty region derived from the SLAM filter. The next step is to measure the change in warp between the predicted template and the current image. Rather than widening the exhaustive search to include all of the degrees of freedom of potential warps, having locked down the template's 2D image position, we proceed with a more efficient probabilistic inverse-compositional gradient-descent image alignment step [49], [50] to search through the additional parameters, on the assumption that the change in warp will be small and that this search will find the globally best fit.
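For illustration, the homography of (12) can be assembled directly from its factors. The sketch below uses deliberately minimal 3 × 3 helpers; it is not code from the system, and a real implementation would use a proper linear algebra library together with the calibrated (distortion-corrected) camera model rather than an idealized C.

#include <array>

// Minimal 3x3 matrix / 3-vector helpers purely for illustration.
using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;

Mat3 multiply(const Mat3& A, const Mat3& B) {
    Mat3 M{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                M[i][j] += A[i][k] * B[k][j];
    return M;
}

double dot(const Vec3& a, const Vec3& b) { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }

// H = C R [ (n^T x_p) I - t n^T ] C^{-1}   -- cf. (12)
// C, Cinv : camera calibration matrix and its inverse (assumed precomputed)
// R, t    : rotation and translation between the two camera positions
// n       : estimated surface normal of the patch
// xp      : homogeneous image projection of the patch centre
Mat3 patchHomography(const Mat3& C, const Mat3& Cinv,
                     const Mat3& R, const Vec3& t,
                     const Vec3& n, const Vec3& xp) {
    const double s = dot(n, xp);
    Mat3 inner{};                       // s*I - t*n^T
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            inner[i][j] = (i == j ? s : 0.0) - t[i] * n[j];
    return multiply(multiply(C, R), multiply(inner, Cinv));
}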
Fig. 6b displays graphically the processing steps in feature orientation estimation. When a new feature is added to the map, we initialize an estimate of its surface normal which is parallel to the current viewing direction, but with large uncertainty. We currently make the simplifying approximation that estimates of feature normals are only weakly correlated to those of camera and feature positions. Normal estimates are therefore not stored in the main SLAM state vector, but maintained in a separate two-parameter EKF for each feature.

Fig. 7 shows results from the patch orientation algorithm in two different scenes: an outdoor scene which contains one dominant plane and an indoor scene where several boxes present planes at different orientations. Over a period of several seconds of tracking in both cases, the orientations of most of the mapped feature patches are recovered well.

In general terms, it is clear that orientation estimation only works well with patches which are large and have significant interesting texture over their area because, in this case, the image alignment operation can accurately estimate the warping. This is a limitation as far as estimating an accurate normal vector at every feature location is concerned, since many features have quite simple texture patterns, like a black-on-white corner, where full warp estimation is badly constrained. The scenes in our examples are somewhat artificial in that both contain large planar areas with significant flat texture. However, it should be remembered that the current motivation for estimating feature orientations in our work is to improve the range of camera motion over which each long-term landmark will be measurable. Those features for which it is difficult to get an accurate normal estimate are exactly those where doing so is less important in the first place, since they exhibit a natural degree of viewpoint-invariance in their appearance. It does not matter if the normal estimate for these features is incorrect because it will still be possible to match them. We see this work on estimating feature surface orientation as part of a general direction toward recovering more complete scene geometry from a camera moving in real time.

4 RESULTS: INTERACTIVE AUGMENTED REALITY

Before presenting a robotics application of MonoSLAM in Section 5, in this section we give results from the use of our algorithm in an augmented reality scenario, as virtual objects are inserted interactively into live video. We show how virtual furniture can be stably added to a 30 Hz image stream captured as a hand-held camera is moved around a room. Fig. 8 gives a storyboard for this demonstration, which is featured in the video submitted with this paper.

In Augmented Reality (AR), computer graphics are added to images of the real world from a camera to generate a composite scene. A convincing effect is created if the graphics move in the image as if they are anchored to the 3D scene observed by the camera. For this to be achieved, the motion of the camera must be accurately known—its location can then be fed into a standard 3D graphics engine such as OpenGL, which will then render the graphics correctly on top of the real images. Here, we use MonoSLAM to estimate the hand-held camera's motion in real time from the live image stream and feed this directly into the rendering engine.

There are various ways to recover the motion of a moving camera which have been used for augmented reality, usually featuring additional hardware such as magnetic or ultrasonic sensors attached to the camera. It is appealing to achieve camera tracking using only the images from the actual moving camera for motion estimation, but previous approaches have either operated offline, such as [3], which is used in movie postproduction, or required prior knowledge about the structure of the observed scene, either via the placement of fiducial targets or a prior map-learning stage (see [51] for a review). Our approach here is the first which can achieve convincing real-time and drift-free AR as the camera moves through a scene it observes for the first time.
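The step of feeding the estimated camera location into the renderer, described above, amounts to converting the SLAM camera pose into the view matrix expected by the graphics engine. The following sketch shows one plausible version of that conversion for a fixed-function OpenGL-style pipeline; the conventions (camera-to-world rotation Rwc, world-frame position r, column-major output, e.g., for glLoadMatrixd) are our assumptions and would have to be matched to the particular rendering setup.

#include <array>

using Mat3 = std::array<std::array<double, 3>, 3>;
using Vec3 = std::array<double, 3>;

// Build a column-major 4x4 view matrix (the inverse of the camera-to-world
// transform): rotation part Rwc^T, translation part -Rwc^T * r.
std::array<double, 16> viewMatrixFromPose(const Mat3& Rwc, const Vec3& r) {
    std::array<double, 16> m{};
    for (int col = 0; col < 3; ++col)
        for (int row = 0; row < 3; ++row)
            m[4 * col + row] = Rwc[col][row];              // element (row,col) of Rwc^T
    for (int row = 0; row < 3; ++row) {
        double v = 0.0;
        for (int k = 0; k < 3; ++k) v += Rwc[k][row] * r[k];   // (Rwc^T r)[row]
        m[12 + row] = -v;
    }
    m[15] = 1.0;
    return m;                                              // pass to the renderer each frame
}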
Fig. 8. Frames from a demonstration of real-time augmented reality using MonoSLAM, all acquired directly from the system running in real time at
30 Hz. Virtual furniture is inserted into the images of the indoor scene observed by a hand-held camera. By clicking on features from the SLAM map
displayed with real-time graphics, 3D planes are defined to which the virtual objects are attached. These objects then stay clamped to the scene in
the image view as the camera continues to move. We show how the tracking is robust to fast camera motion, extreme rotation, and significant
occlusion.
In implementation, the linear acceleration noise components in P_n were set to a standard deviation of 10 m s^-2 (approximately the acceleration due to gravity), and the angular components to a standard deviation of 6 rad s^-2. These magnitudes of acceleration empirically describe the approximate dynamics of a camera moved quickly but smoothly in the hand (the algorithm cannot cope with very sudden, jerky movement). The camera used was a low-cost IEEE 1394 webcam with a wide-angle lens, capturing at 30 Hz. The software-controlled shutter and gain controls were set to remove most of the effects of motion blur but retain a high-contrast image—this is practically achievable in a normal bright room.
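For concreteness, these noise settings can be expressed as the diagonal entries of the process noise covariance driving the constant-velocity motion model. The sketch below assumes, purely as an illustration, that acceleration noise of the stated standard deviation acts as a velocity impulse over one frame interval; the names and the exact construction in the real system may differ.

#include <array>

// Diagonal process noise variances: 3 linear + 3 angular velocity-impulse
// components (layout is our assumption, not SceneLib's).
std::array<double, 6> processNoiseVariances(double dt,
                                            double sigmaA = 10.0,      // m s^-2
                                            double sigmaAlpha = 6.0) { // rad s^-2
    std::array<double, 6> var{};
    for (int i = 0; i < 3; ++i) var[i]     = (sigmaA * dt) * (sigmaA * dt);
    for (int i = 0; i < 3; ++i) var[3 + i] = (sigmaAlpha * dt) * (sigmaAlpha * dt);
    return var;   // e.g., dt = 1.0 / 30.0 for 30 Hz operation
}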
Following initialization from a simple target as in Fig. 2a, the camera was moved to observe most of the small room in which the experiment was carried out, dynamically mapping a representative set of features within a few seconds. Tracking was continued over a period of several minutes (several thousand frames) with SLAM initializing new features as necessary—though of course little new initialization was needed as previously-seen areas were revisited. There is little doubt that the system would run for much longer periods of time without problems because, once the uncertainty in the mapped features becomes small, they are very stable and lock the map into a drift-free state. Note that once sufficient nearby features are mapped it is possible to remove the initialization target from the wall completely.

Augmented reality was achieved with some interaction from a human user, who was presented with displays of both the image stream, with the tracked features highlighted, and the estimated position of the camera and features within a 3D display whose viewpoint could be manipulated. By selecting three of the mapped features with a mouse, by clicking in either of the two displays, the user defined a plane to which a virtual object could be attached. In this demonstration, the objects were items of virtual furniture to be added to the indoor scene—perhaps as in a virtual "kitchen fitting" application—and four objects were added to three different planes in the scene corresponding to two vertical walls and a counter-top surface.

In general terms in this scenario, the algorithm gives robust real-time performance within a small room with relatively few constraints on the movement of the camera, and arbitrarily long periods of localization are routinely achievable. Clearly, situations where no useful features are found in the field of view (when the camera faces a blank wall or ceiling) cannot be coped with, although the tracking will regularly survive periods when as few as two or three features are visible, the localization uncertainty growing bigger during these times, but good tracking being recaptured once more features come back into view.
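The plane definition from three clicked features reduces to elementary geometry. A minimal sketch follows (our own illustrative code, not the system's, with the obvious caveat that the three selected points must not be collinear):

#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

Vec3 sub(const Vec3& a, const Vec3& b) { return {a[0]-b[0], a[1]-b[1], a[2]-b[2]}; }
Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0]};
}
double dot(const Vec3& a, const Vec3& b) { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }

struct Plane { Vec3 normal; double d; };   // points x on the plane satisfy dot(normal, x) = d

// Plane through three mapped feature positions selected by the user; a real
// system would also build an in-plane frame for placing the furniture model.
Plane planeFromFeatures(const Vec3& p0, const Vec3& p1, const Vec3& p2) {
    Vec3 n = cross(sub(p1, p0), sub(p2, p0));
    double len = std::sqrt(dot(n, n));       // nonzero if the points are not collinear
    for (double& c : n) c /= len;
    return {n, dot(n, p0)};
}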
Small and large loops are routinely and seamlessly closed by the system. A rotational movement of the camera to point into a new unobserved part of the scene and then return would lead to a small loop closure, tying the newly initialized features in the unobserved region to the accurate part of the map, whereas a translational motion right around the room would require a larger loop closure. The active feature selection mechanism (Section 3.5) leads to particularly satisfactory behavior in loop closing: Its desire to make measurements that add the most information to the map by maximally reducing uncertainty demands reobservation of features which come back into view after periods of neglect. The S_i innovation covariance scores of these features are much higher than those of nearby features which have been recently measured, due to the increase in uncertainty in their positions relative to the camera. In particular, small loops are closed immediately whenever possible, reducing the larger growth in uncertainty which could cause problems with closing bigger loops.

Fig. 9. HRP-2 walking in a circle. The robot is walking autonomously and tether-free with SLAM processing on-board and a wireless Ethernet link to a control workstation. The support cradle seen is only for safety purposes.

5 RESULTS: HUMANOID ROBOT SLAM

In this section, we present the use of MonoSLAM to provide real-time SLAM for one of the leading humanoid robot platforms, HRP-2 [52], as it moves around a cluttered indoor workspace.

Most advanced research humanoids have vision systems, but there have been only limited attempts at vision-based mapping and localization. Takaoka et al. [53] presented interesting results using stereo vision and a visual odometry approach to estimate the motion of a humanoid while building a dense 3D reconstruction of the cluttered floor near the robot. The local motion estimation was good, but this approach lacks the ability to close loops and will lead to drift over time with repeated motion. Sabe et al. [54] have used occupancy grid mapping and plane detection with stereo to detect free-space areas in front of a miniature humanoid, but relied on odometry (or in other work artificial floor markers) for localization, so this was also not true SLAM.

Using the MonoSLAM algorithm, we build online only a sparse map of point landmarks, rather than the dense representations of [53] or [54], and show that despite the challenges presented by high-acceleration 3D motion we can form a persistent map which permits drift-free real-time localization over a small area. Typical humanoid activities of the near future (during some kind of handling or service task, for instance) will involve agile but repeated movement within a small area such as a single room. The important requirement is that localization and mapping should be repeatable so that uncertainty in the robot's position does not increase with time during these repeated movements.

5.1 Vision

As standard, HRP-2 is fitted with a high-performance forward-looking trinocular camera rig, providing the capability to make accurate 3D measurements in a focused observation area close in front of the robot, suitable for grasping or interaction tasks. Since it has been shown that, by contrast, a wide field of view is advantageous for localization and mapping, for this and other related work it was decided to equip HRP-2 with an additional wide-angle camera (field of view around 90 degrees) and use output from only this camera for SLAM. The wide-angle camera was calibrated with a one-parameter radial distortion model as in Section 3.5.

Since the robot started its motions from a position observing a far wall, a mixture of natural and artificially placed features in measured positions, mostly on this wall, were used for SLAM initialization rather than the standard target.

5.2 Gyro

Along with other proprioceptive sensors, HRP-2 is equipped with a 3-axis gyro in the chest which reports measurements of the body's angular velocity at 200 Hz. In the humanoid SLAM application, although it was quite possible to progress with vision-only MonoSLAM, the ready availability of this extra information argued strongly for its inclusion in SLAM estimation, and it played a role in reducing the rate of growth of uncertainty around looped motions.

We sampled the gyro at the 30 Hz rate of vision for use within the SLAM filter. We assessed the standard deviation of each element of the angular velocity measurement to be 0.01 rad s^-1. Since our single-camera SLAM state vector contains the robot's angular velocity expressed in the frame of reference of the robot, we can incorporate these measurements in the EKF directly as an "internal measurement" of the robot's own state—an additional Kalman update step before visual processing.
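Because the gyro reading is a direct observation of angular velocity elements already present in the state vector, the corresponding Kalman update is particularly simple: the measurement Jacobian just selects those three elements. The following sketch shows this reduced update with a dense covariance stored row-major; it illustrates the idea rather than reproducing the actual filter code, and the state indexing and matrix representation are assumptions.

#include <array>
#include <cstddef>
#include <vector>

using Mat = std::vector<double>;   // row-major N x N covariance

// Direct ("internal") EKF update of the angular velocity block at index w of
// the state x, from a gyro measurement z with per-axis noise std dev sigma.
void gyroUpdate(std::vector<double>& x, Mat& P, std::size_t N, std::size_t w,
                const std::array<double, 3>& z, double sigma = 0.01) {
    // Innovation covariance S = P_ww + R (3x3), R = sigma^2 * I.
    double S[3][3];
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            S[i][j] = P[(w + i) * N + (w + j)] + (i == j ? sigma * sigma : 0.0);
    // Closed-form 3x3 inverse of S.
    double det = S[0][0]*(S[1][1]*S[2][2]-S[1][2]*S[2][1])
               - S[0][1]*(S[1][0]*S[2][2]-S[1][2]*S[2][0])
               + S[0][2]*(S[1][0]*S[2][1]-S[1][1]*S[2][0]);
    double Si[3][3] = {
        {(S[1][1]*S[2][2]-S[1][2]*S[2][1])/det, (S[0][2]*S[2][1]-S[0][1]*S[2][2])/det, (S[0][1]*S[1][2]-S[0][2]*S[1][1])/det},
        {(S[1][2]*S[2][0]-S[1][0]*S[2][2])/det, (S[0][0]*S[2][2]-S[0][2]*S[2][0])/det, (S[0][2]*S[1][0]-S[0][0]*S[1][2])/det},
        {(S[1][0]*S[2][1]-S[1][1]*S[2][0])/det, (S[0][1]*S[2][0]-S[0][0]*S[2][1])/det, (S[0][0]*S[1][1]-S[0][1]*S[1][0])/det}};
    // Kalman gain K = P H^T S^{-1}; H^T just selects columns w..w+2 of P.
    std::vector<double> K(N * 3, 0.0);
    for (std::size_t r = 0; r < N; ++r)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                K[r * 3 + j] += P[r * N + (w + k)] * Si[k][j];
    // State update x += K * (z - omega_estimated).
    double nu[3] = { z[0] - x[w], z[1] - x[w + 1], z[2] - x[w + 2] };
    for (std::size_t r = 0; r < N; ++r)
        x[r] += K[r * 3 + 0] * nu[0] + K[r * 3 + 1] * nu[1] + K[r * 3 + 2] * nu[2];
    // Covariance update P -= K S K^T (equivalent to (I - K H) P here).
    std::vector<double> KS(N * 3, 0.0);
    for (std::size_t r = 0; r < N; ++r)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                KS[r * 3 + j] += K[r * 3 + k] * S[k][j];
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t c = 0; c < N; ++c) {
            double acc = 0.0;
            for (int k = 0; k < 3; ++k) acc += KS[r * 3 + k] * K[c * 3 + k];
            P[r * N + c] -= acc;
        }
}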
5.3 Results

We performed an experiment which was a real SLAM test, in which the robot was programmed to walk in a circle of radius 0.75 m (Fig. 9). This was a fully exploratory motion, involving observation of new areas before closing one large loop at the end of the motion. For safety and monitoring reasons, the motion was broken into five parts with short stationary pauses between them: first, a forward diagonal motion to the right without rotation, in which the robot put itself in position to start the circle, and then four 90 degree arcing turns to the left where the robot followed a circular path, always walking tangentially. The walking was at HRP-2's standard speed and the total walking time was around 30 seconds (though the SLAM system continued to track continuously at 30 Hz even while the robot paused).
Fig. 10. Snapshots from MonoSLAM as a humanoid robot walks in a circular trajectory of radius 0.75 m. The yellow trace is the estimated robot trajectory, and ellipses show feature location uncertainties, with color coding as in Fig. 1a. The uncertainty in the map can be seen growing until the loop is closed and drift corrected. (a) Early exploration and first turn. (b) Mapping back wall and greater uncertainty. (c) Just before loop close, maximum uncertainty. (d) End of circle with closed loop and drift corrected.
Fig. 11. Ground truth characterization experiment. (a) The camera flies over the desktop scene with initialization target and rectangular track in view.
(b) The track is followed around a corner with real-time camera trace in external view and image view augmented with coordinate axes. In both
images, the hanging plumb-line can be seen, but does not significantly occlude the camera’s field of view.
Fig. 10 shows the results of this experiment. Classic SLAM behavior is observed, with a steady growth in the uncertainty of newly-mapped features until an early feature can be reobserved, the loop closed, and the drift corrected. A large number of features are seen to swing into better estimated positions simultaneously, thanks to the correlations stored in the covariance matrix. This map of features is now suitable for long-term use, and it would be possible to complete any number of loops without drift in localization accuracy.

6 SYSTEM DETAILS

6.1 System Characterization against Ground Truth

An experiment was conducted to assess the accuracy of camera localization estimation within a typical environment. The camera, motion model parameters, and 30 Hz frame-rate of the experiment were as in the Interactive Augmented Reality implementation of Section 4. A horizontal desktop cluttered with various objects was marked out with a precisely measured rectangular track and the standard initialization target of Section 3.3 located in one corner, defining the origin and orientation of the world coordinate frame. Nothing else about the scene was known a priori.

A hand-held camera equipped with a plumb-line of known length was then moved such that a vertically hanging weight closely skimmed the track (Fig. 11). In this way, the ground truth 3D coordinates of the camera could be accurately known (to an assessed 1 cm precision) as it arrived in sequence at the four corner "way-points." Following a short initial motion of a few seconds during which an initial map was built, the camera was moved to and paused one by one at positions over the four corners of the rectangle. The following table gives the ground truth coordinates of the camera at the four corners, followed by averaged estimated values from MonoSLAM over several looped revisits. The ± variation values reported indicate the standard deviations of the sampled estimated values.

    Ground Truth (m)           Estimated (m)
    x      y      z            x            y            z
    0.00   0.00   0.62         0.00±0.01    0.01±0.01    0.64±0.01
    1.00   0.00   0.62         0.93±0.03    0.06±0.02    0.63±0.02
    1.00   0.50   0.62         0.98±0.03    0.46±0.02    0.66±0.02
    0.00   0.50   0.62         0.01±0.01    0.47±0.02    0.64±0.02

These figures show that on this "tabletop" scale of motion, MonoSLAM typically gives localization results accurate to a few centimeters, with "jitter" levels on the order of 1-2 cm. Some coordinates, such as the 0.93 mean x-value reported at the second way-point, display consistent errors larger than the jitter level, which persist during loops although reducing slowly on each revisit (in our experiment by around 1 cm per loop) as probabilistic SLAM gradually pulls the whole map into consistency.
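The ± values in the table are simply the sample mean and standard deviation of the position estimates gathered at each way-point over the repeated visits. A small sketch of that computation (not the evaluation code used for the paper, and it assumes a non-empty sample set) is:

#include <cmath>
#include <vector>

struct Stats { double mean; double stddev; };

// Mean and (population) standard deviation of repeated estimates of one
// coordinate at a single way-point.
Stats sampleStats(const std::vector<double>& samples) {
    double mean = 0.0;
    for (double s : samples) mean += s;
    mean /= samples.size();
    double var = 0.0;
    for (double s : samples) var += (s - mean) * (s - mean);
    var /= samples.size();
    return {mean, std::sqrt(var)};
}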
6.2 Processing Requirements

On a 1.6 GHz Pentium M processor, a typical breakdown of the processing time required at each frame at 30 Hz (such that 33 ms is available for processing each image) is as follows:

    Image loading and administration    2 ms
    Image correlation searches          3 ms
    Kalman Filter update                5 ms
    Feature initialization search       4 ms
    Graphical rendering                 5 ms
    Total                              19 ms

This indicates that 30 Hz performance is easily achieved—in fact, this type of safe margin is desirable since processing time fluctuates from frame to frame and dropped frames should be avoided whenever possible. We see, however, that doubling the frequency of operation to 60 Hz should be possible in the very near future if the graphical rendering were simplified or perhaps performed at a lower update rate.

6.3 Software

The C++ library SceneLib, in which the systems described in this paper are built, including example real-time MonoSLAM applications, is available as an open source project under the LGPL license from the Scene homepage at http://www.doc.ic.ac.uk/~ajd/Scene/.

6.4 Movies

Videos illustrating the results in this paper can be obtained from the following files, all available on the Web in the directory http://www.doc.ic.ac.uk/~ajd/Movies/:

1. kitchen.mp4.avi (basic method and augmented reality),
2. CircleHRP2.mpg (humanoid external view), and
3. hrploopclose.mpg (humanoid MonoSLAM output).

7 CONCLUSIONS

In this paper, we have explained MonoSLAM, a real-time algorithm for simultaneous localization and mapping with a single freely-moving camera. The chief tenets of our approach are probabilistic mapping, motion modeling, and active measurement and mapping of a sparse map of high-quality features. Efficiency is provided by active feature search, ensuring that no image processing effort is wasted—this is truly a Bayesian, "top-down" approach. We have presented experimental implementations which demonstrate the wide applicability of the algorithm, and hope that it will have an impact in application areas including both low-cost and advanced robotics, wearable computing, augmented reality for industry and entertainment, and user interfaces.

In future work, we plan to continue to improve the performance of the algorithm to cope with larger environments (indoors and outdoors), more dynamic motions, and more complicated scenes with significant occlusions, complicated objects, and changing lighting conditions to create genuinely practical systems. We will maintain our focus on hard real-time operation, commodity cameras, and minimal assumptions.

This work will involve several strands. To increase the dynamic performance of the algorithm, and to be able to cope with even faster motion than currently, a promising possibility is to investigate cameras which can capture at rates greater than 30 Hz. An interesting aspect of our active search image processing is that a doubling in frame-rate would not imply a doubling of image processing effort as in bottom-up feature detection schemes (see [45]) because search regions would become correspondingly smaller due to reduced motion uncertainty. There are currently CMOS IEEE 1394 cameras which offer 100 Hz capture at full resolution and even higher rates in programmable subwindows—a technology our active image search would be well-suited to benefit from. We are keen to work with such cameras in the near future.

There will certainly be a payoff for developing the sparse maps currently generated into denser representations from which to reason more completely about the geometry of the environment, initially by attempting to detect higher-order entities such as surfaces. Our work on feature patch orientation estimation gives a strong hint that this will be achievable, and therefore that we should be able to build more complete but still efficient scene representations while maintaining real-time operation. These efficient high-order maps may give our SLAM system a human-like ability to quickly capture an idea of the basic shape of a room.

Finally, to extend the algorithm to very large-scale environments, some type of submapping strategy certainly seems appropriate, though as discussed earlier it remains unclear how maps of visual features can be cleanly divided into meaningful subblocks. As shown in other work on submaps (e.g., [20]), a network of accurate small-scale maps can be very successfully joined by a relatively loose set of estimated transformations as long as there is the ability to "map-match" submaps in the background. This is closely related to being able to solve the "lost robot" problem of localizing against a known map with only a weak position prior, and has proven relatively straightforward with 2D laser data. With vision-only sensing, this type of matching can be achieved with invariant visual feature types like SIFT [35] (an idea used by Newman and Ho for loop-closing in a system with both laser and vision [55]), or perhaps more interestingly in our context by matching higher-level scene features such as gross 3D surfaces.

ACKNOWLEDGMENTS

This work was primarily performed while A.J. Davison, I.D. Reid, and N.D. Molton worked together at the Active Vision Laboratory, University of Oxford. The authors are very grateful to David Murray, Ben Tordoff, Walterio Mayol, Nobuyuki Kita, and others at Oxford, AIST, and Imperial College London for discussions and software collaboration. They would like to thank Kazuhito Yokoi at JRL for support with HRP-2. This research was supported by EPSRC grants GR/R89080/01 and GR/T24685, an EPSRC Advanced Research Fellowship to AJD, and CNRS/AIST funding at JRL.

REFERENCES

[1] A.W. Fitzgibbon and A. Zisserman, "Automatic Camera Recovery for Closed or Open Image Sequences," Proc. European Conf. Computer Vision, pp. 311-326, June 1998.
[2] M. Pollefeys, R. Koch, and L.V. Gool, "Self-Calibration and Metric Reconstruction in Spite of Varying and Unknown Internal Camera Parameters," Proc. Sixth Int'l Conf. Computer Vision, pp. 90-96, 1998.
[3] "2d3 Web Based Literature," URL http://www.2d3.com/, 2005.
[4] A. Rahimi, L.P. Morency, and T. Darrell, "Reducing Drift in Parametric Motion Tracking," Proc. Eighth Int'l Conf. Computer Vision, pp. 315-322, 2001.
[5] A.J. Davison, "Real-Time Simultaneous Localisation and Mapping with a Single Camera," Proc. Ninth Int'l Conf. Computer Vision, 2003.
[6] A.J. Davison, Y.G. Cid, and N. Kita, "Real-Time 3D SLAM with Wide-Angle Vision," Proc. IFAC Symp. Intelligent Autonomous Vehicles, July 2004.
[7] N.D. Molton, A.J. Davison, and I.D. Reid, "Locally Planar Patch Features for Real-Time Structure from Motion," Proc. 15th British Machine Vision Conf., 2004.
[8] C.G. Harris and J.M. Pike, "3D Positional Integration from Image Sequences," Proc. Third Alvey Vision Conf., pp. 233-236, 1987.
[9] N. Ayache, Artificial Vision for Mobile Robots: Stereo Vision and Multisensory Perception. MIT Press, 1991.
[10] P.A. Beardsley, I.D. Reid, A. Zisserman, and D.W. Murray, "Active Visual Navigation Using Non-Metric Structure," Proc. Fifth Int'l Conf. Computer Vision, pp. 58-65, 1995.
[11] R. Smith, M. Self, and P. Cheeseman, "A Stochastic Map for Uncertain Spatial Relationships," Proc. Fourth Int'l Symp. Robotics Research, 1987.
[12] P. Moutarlier and R. Chatila, "Stochastic Multisensory Data Fusion for Mobile Robot Location and Environement Modelling," Proc. Int'l Symp. Robotics Research, 1989.
[13] J.J. Leonard, "Directed Sonar Sensing for Mobile Robot Navigation," PhD dissertation, Univ. of Oxford, 1990.
[14] J. Manyika, "An Information-Theoretic Approach to Data Fusion and Sensor Management," PhD dissertation, Univ. of Oxford, 1993.
[15] S. Betgé-Brezetz, P. Hébert, R. Chatila, and M. Devy, "Uncertain Map Making in Natural Environments," Proc. IEEE Int'l Conf. Robotics and Automation, 1996.
[16] M. Csorba, "Simultaneous Localisation and Mapping," PhD dissertation, Univ. of Oxford, 1997.
[17] J.A. Castellanos, "Mobile Robot Localization and Map Building: A Multisensor Fusion Approach," PhD dissertation, Universidad de Zaragoza, Spain, 1998.
[18] A.J. Davison, "Mobile Robot Navigation Using Active Vision," PhD dissertation, Univ. of Oxford, 1998.
[19] P.M. Newman, "On the Structure and Solution of the Simultaneous Localization and Map Building Problem," PhD dissertation, Univ. of Sydney, 1999.
[20] M. Bosse, P. Newman, J. Leonard, M. Soika, W. Feiten, and S. Teller, "An Atlas Framework for Scalable Mapping," Proc. IEEE Int'l Conf. Robotics and Automation, 2003.
[21] P.M. Newman and J.J. Leonard, "Consistent Convergent Constant Time SLAM," Proc. Int'l Joint Conf. Artificial Intelligence, 2003.
[22] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, "FastSLAM: A Factored Solution to the Simultaneous Localization and Mapping Problem," Proc. AAAI Nat'l Conf. Artificial Intelligence, 2002.
[23] P.M. Newman, J.J. Leonard, J. Neira, and J. Tardós, "Explore and Return: Experimental Validation of Real Time Concurrent Mapping and Localization," Proc. IEEE Int'l Conf. Robotics and Automation, pp. 1802-1809, 2002.
[24] K. Konolige and J.-S. Gutmann, "Incremental Mapping of Large Cyclic Environments," Proc. Int'l Symp. Computational Intelligence in Robotics and Automation, 1999.
[25] S. Thrun, W. Burgard, and D. Fox, "A Real-Time Algorithm for Mobile Robot Mapping with Applications to Multi-Robot and 3D Mapping," Proc. IEEE Int'l Conf. Robotics and Automation, 2000.
[26] J. Neira, M.I. Ribeiro, and J.D. Tardos, "Mobile Robot Localisation and Map Building Using Monocular Vision," Proc. Int'l Symp. Intelligent Robotics Systems, 1997.
[27] A.J. Davison and D.W. Murray, "Mobile Robot Localisation Using Active Vision," Proc. Fifth European Conf. Computer Vision, pp. 809-825, 1998.
[28] A.J. Davison and D.W. Murray, "Simultaneous Localization and Map-Building Using Active Vision," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 865-880, July 2002.
[29] A.J. Davison and N. Kita, "3D Simultaneous Localisation and Map-Building Using Active Vision for a Robot Moving on Undulating Terrain," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2001.
[30] I. Jung and S. Lacroix, "High Resolution Terrain Mapping Using Low Altitude Aerial Stereo Imagery," Proc. Ninth Int'l Conf. Computer Vision, 2003.
[31] J.H. Kim and S. Sukkarieh, "Airborne Simultaneous Localisation and Map Building," Proc. IEEE Int'l Conf. Robotics and Automation, pp. 406-411, 2003.
[32] M. Bosse, R. Rikoski, J. Leonard, and S. Teller, "Vanishing Points and 3D Lines from Omnidirectional Video," Proc. IEEE Int'l Conf. Image Processing, 2002.
[33] R.M. Eustice, H. Singh, J.J. Leonard, M. Walter, and R. Ballard, "Visually Navigating the RMS Titanic with SLAM Information Filters," Proc. Robotics: Science and Systems, 2005.
[34] R. Sim, P. Elinas, M. Griffin, and J.J. Little, "Vision-Based SLAM Using the Rao-Blackwellised Particle Filter," Proc. IJCAI Workshop Reasoning with Uncertainty in Robotics, 2005.
[35] D.G. Lowe, "Object Recognition from Local Scale-Invariant Features," Proc. Seventh Int'l Conf. Computer Vision, pp. 1150-1157, 1999.
[36] N. Karlsson, E.D. Bernardo, J. Ostrowski, L. Goncalves, P. Pirjanian, and M.E. Munich, "The vSLAM Algorithm for Robust Localization and Mapping," Proc. IEEE Int'l Conf. Robotics and Automation, 2005.
[37] P.F. McLauchlan and D.W. Murray, "A Unifying Framework for Structure and Motion Recovery from Image Sequences," Proc. Fifth Int'l Conf. Computer Vision, 1995.
[38] A. Chiuso, P. Favaro, H. Jin, and S. Soatto, ""MFm": 3-D Motion from 2-D Motion Causally Integrated over Time," Proc. Sixth European Conf. Computer Vision, 2000.
[39] D. Nistér, O. Naroditsky, and J. Bergen, "Visual Odometry," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2004.
[40] E. Foxlin, "Generalized Architecture for Simultaneous Localization, Auto-Calibration and Map-Building," Proc. IEEE/RSJ Conf. Intelligent Robots and Systems, 2002.
[41] D. Burschka and G.D. Hager, "V-GPS(SLAM): Vision-Based Inertial System for Mobile Robots," Proc. IEEE Int'l Conf. Robotics and Automation, 2004.
[42] E. Eade and T. Drummond, "Scalable Monocular SLAM," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.
[43] J. Shi and C. Tomasi, "Good Features to Track," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 593-600, 1994.
[44] R. Swaminathan and S.K. Nayar, "Nonmetric Calibration of Wide-Angle Lenses and Polycameras," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1172-1178, 2000.
[45] A.J. Davison, "Active Search for Real-Time Vision," Proc. 10th Int'l Conf. Computer Vision, 2005.
[46] J. Solà, M. Devy, A. Monin, and T. Lemaire, "Undelayed Initialization in Bearing Only SLAM," Proc. IEEE/RSJ Conf. Intelligent Robots and Systems, 2005.
[47] J.M.M. Montiel, J. Civera, and A.J. Davison, "Unified Inverse Depth Parametrization for Monocular SLAM," Proc. Robotics: Science and Systems, 2006.
[48] H. Jin, P. Favaro, and S. Soatto, "A Semi-Direct Approach to Structure from Motion," The Visual Computer, vol. 19, no. 6, pp. 377-394, 2003.
[49] N.D. Molton, A.J. Davison, and I.D. Reid, "Parameterisation and Probability in Image Alignment," Proc. Asian Conf. Computer Vision, 2004.
[50] S. Baker and I. Matthews, "Lucas-Kanade 20 Years on: A Unifying Framework: Part 1," Int'l J. Computer Vision, vol. 56, no. 3, pp. 221-255, 2004.
[51] V. Lepetit and P. Fua, "Monocular Model-Based 3D Tracking of Rigid Objects: A Survey," Foundations and Trends in Computer Graphics and Vision, vol. 1, no. 1, pp. 1-89, Oct. 2005.
[52] K. Kaneko, F. Kanehiro, S. Kajita, H. Hirukawa, T. Kawasaki, M. Hirata, K. Akachi, and T. Isozumi, "Humanoid Robot HRP-2," Proc. IEEE Int'l Conf. Robotics and Automation, 2004.
[53] Y. Takaoka, Y. Kida, S. Kagami, H. Mizoguchi, and T. Kanade, "3D Map Building for a Humanoid Robot by Using Visual Odometry," Proc. IEEE Int'l Conf. Systems, Man, and Cybernetics, pp. 4444-4449, 2004.
[54] K. Sabe, M. Fukuchi, J.-S. Gutmann, T. Ohashi, K. Kawamoto, and T. Yoshigahara, "Obstacle Avoidance and Path Planning for Humanoid Robots Using Stereo Vision," Proc. IEEE Int'l Conf. Robotics and Automation, 2004.
[55] P.M. Newman and K. Ho, "SLAM Loop-Closing with Visually Salient Features," Proc. IEEE Int'l Conf. Robotics and Automation, 2005.
Andrew J. Davison read physics at the University of Oxford, receiving the BA degree in 1994. In his doctoral research in Oxford's Robotics Research Group under the supervision of Professor David Murray, he developed one of the first robot SLAM systems using vision. On receiving the DPhil degree in 1998, he took up an EU Science and Technology Fellowship and spent two years at AIST in Japan, expanding his work on visual robot navigation. He returned to further postdoctoral work with Dr. Ian Reid at Oxford in 2000, was awarded a five year EPSRC Advanced Research Fellowship in 2002, and moved to Imperial College London in 2005 to take up a lectureship. He continues to work on advancing the basic technology of real-time localization and mapping using vision while collaborating to apply these techniques in robotics and related areas.

Ian D. Reid received the BSc degree from the University of Western Australia in 1987, and came to Oxford University on a Rhodes Scholarship in 1988, where he received the DPhil degree in 1991. He is a reader in engineering science at the University of Oxford and a fellow of Exeter College. After a period as an EPSRC Advanced Research Fellow, he was appointed to his current post in 2000. His research concentrates on a variety of problems within computer vision, usually with an emphasis on real-time processing; in particular, he is interested in algorithms for visual tracking, visual control of active head/eye robotic platforms for surveillance and navigation, visual geometry and camera self-calibration, and human motion capture and activity recognition. He has published more than 80 papers on these topics. He is a member of the IEEE.

Nicholas D. Molton received the BEng degree in engineering at Brunel University and the DPhil degree in robotics at the University of Oxford. He has worked primarily in areas related to structure from motion and estimation, both at the University of Oxford and at 2d3 Ltd in Oxford, and is currently working as a vision scientist with Imagineer Systems in Guildford, United Kingdom. His work has recently been focused on areas with application to the visual effects industry, and he shared the Emmy award won by 2d3 for technical contribution to television in 2002.

Olivier Stasse received the MSc degree in operations research (1996) and the PhD degree in intelligent systems (2000), both from University of Paris 6. He is assistant professor of computer science at the University of Paris 13. His research interests include humanoid robots as well as distributed and real-time computing applied to vision problems for complex robotic systems. He was involved in the humanoid project led by Professor Yasuo Kuniyoshi (Tokyo University) from 1997 to 2000. From 2000 to 2003, he was at the Laboratoire de Transport et Traitement de l'Information (L2TI), and then joined the SONY robot-soccer team of the Laboratoire de Robotique de Versailles (LRV). Since 2003, he has been a member of the Joint French-Japanese Robotics Laboratory (JRL) in a secondment position as a CNRS researcher. He is a member of the IEEE and the IEEE Computer Society.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.