A Multi-State Constraint Kalman Filter for Vision-aided Inertial Navigation
This work was supported by the University of Minnesota (DTC), the NASA Mars Technology Program (MTP-1263201), and the National Science Foundation (EIA-0324864, IIS-0643680). The authors would like to thank Faraz Mirzaei for his invaluable help with the experiments. The authors are with the Dept. of Computer Science & Engineering, University of Minnesota, Minneapolis, MN 55455. Emails: {mourikis|stergios}@cs.umn.edu

Abstract— In this paper, we present an Extended Kalman Filter (EKF)-based algorithm for real-time vision-aided inertial navigation. The primary contribution of this work is the derivation of a measurement model that is able to express the geometric constraints that arise when a static feature is observed from multiple camera poses. This measurement model does not require including the 3D feature position in the state vector of the EKF and is optimal, up to linearization errors. The vision-aided inertial navigation algorithm we propose has computational complexity only linear in the number of features, and is capable of high-precision pose estimation in large-scale real-world environments. The performance of the algorithm is demonstrated in extensive experimental results, involving a camera/IMU system localizing within an urban area.

I. INTRODUCTION

In the past few years, the topic of vision-aided inertial navigation has received considerable attention in the research community. Recent advances in the manufacturing of MEMS-based inertial sensors have made it possible to build small, inexpensive, and very accurate Inertial Measurement Units (IMUs), suitable for pose estimation in small-scale systems such as mobile robots and unmanned aerial vehicles. These systems often operate in urban environments where GPS signals are unreliable (the “urban canyon”), as well as indoors, in space, and in several other environments where global position measurements are unavailable. The low cost, weight, and power consumption of cameras make them ideal alternatives for aiding inertial navigation in cases where GPS measurements cannot be relied upon.

An important advantage of visual sensing is that images are high-dimensional measurements with rich information content. Feature extraction methods can typically detect and track hundreds of features in images, which, if properly used, can result in excellent localization results. However, the high volume of data also poses a significant challenge for estimation algorithm design. When real-time localization performance is required, one is faced with a fundamental trade-off between the computational complexity of an algorithm and the resulting estimation accuracy.

In this paper we present an algorithm that is able to optimally utilize the localization information provided by multiple measurements of visual features. Our approach is motivated by the observation that, when a static feature is viewed from several camera poses, it is possible to define geometric constraints involving all these poses. The primary contribution of our work is a measurement model that expresses these constraints without including the 3D feature position in the filter state vector, resulting in computational complexity only linear in the number of features. After a brief discussion of related work in the next section, the details of the proposed estimator are presented in Section III. In Section IV we describe the results of a large-scale experiment in an uncontrolled urban environment, which demonstrate that the proposed estimator enables accurate, real-time pose estimation. Finally, in Section V the conclusions of this work are drawn.

II. RELATED WORK

One family of algorithms for fusing inertial measurements with visual feature observations follows the Simultaneous Localization and Mapping (SLAM) paradigm. In these methods, the current IMU pose, as well as the 3D positions of all visual landmarks, are jointly estimated [1]–[4]. These approaches share the same basic principles with SLAM-based methods for camera-only localization (e.g., [5], [6], and references therein), with the difference that IMU measurements, instead of a statistical motion model, are used for state propagation. The fundamental advantage of SLAM-based algorithms is that they account for the correlations that exist between the pose of the camera and the 3D positions of the observed features. On the other hand, the main limitation of SLAM is its high computational complexity; properly treating these correlations is computationally costly, and thus performing vision-based SLAM in environments with thousands of features remains a challenging problem.

Several algorithms exist that, contrary to SLAM, estimate the pose of the camera only (i.e., do not jointly estimate the feature positions), with the aim of achieving real-time operation. The most computationally efficient of these methods utilize the feature measurements to derive constraints between pairs of images. For example, in [7] an image-based motion estimation algorithm is applied to consecutive pairs of images, to obtain displacement estimates that are subsequently fused with inertial measurements. Similarly, in [8], [9] constraints between the current and the previous image are defined using the epipolar geometry, and combined with IMU measurements in an Extended Kalman Filter (EKF). In [10], [11] the epipolar geometry is employed in conjunction with a statistical motion model, while in [12] epipolar constraints are fused with the dynamical model of an airplane. The use of feature measurements for imposing constraints between pairs of images is similar in philosophy to the method proposed in this paper. However, one fundamental difference is that our
algorithm can express constraints between multiple camera poses, and can thus attain higher estimation accuracy, in cases where the same feature is visible in more than two images.

Pairwise constraints are also employed in algorithms that maintain a state vector comprised of multiple camera poses. In [13], an augmented-state Kalman filter is implemented, in which a sliding window of robot poses is maintained in the filter state. On the other hand, in [14], all camera poses are simultaneously estimated. In both of these algorithms, pairwise relative-pose measurements are derived from the images, and used for state updates. The drawback of this approach is that when a feature is seen in multiple images, the additional constraints between the multiple poses are discarded, thus resulting in loss of information. Furthermore, when the same image measurements are processed for computing several displacement estimates, these are not statistically independent, as shown in [15].

One algorithm that, similarly to the method proposed in this paper, directly uses the landmark measurements for imposing constraints between multiple camera poses is presented in [16]. This is a visual odometry algorithm that temporarily initializes landmarks, uses them for imposing constraints on windows of consecutive camera poses, and then discards them. This method, however, does not incorporate inertial measurements. Moreover, the correlations between the landmark estimates and the camera trajectory are not properly accounted for, and as a result, the algorithm does not provide any measure of the covariance of the state estimates.

A window of camera poses is also maintained in the Variable State Dimension Filter (VSDF) [17]. The VSDF is a hybrid batch/recursive method that (i) uses delayed linearization to increase robustness against linearization inaccuracies, and (ii) exploits the sparsity of the information matrix that naturally arises when no dynamic motion model is used. However, in cases where a dynamic motion model is available (such as in vision-aided inertial navigation) the computational complexity of the VSDF is at best quadratic in the number of features [18].

In contrast to the VSDF, the multi-state constraint filter that we propose in this paper is able to exploit the benefits of delayed linearization while having complexity only linear in the number of features. By directly expressing the geometric constraints between multiple camera poses it avoids the computational burden and loss of information associated with pairwise displacement estimation. Moreover, in contrast to SLAM-type approaches, it does not require the inclusion of the 3D feature positions in the filter state vector, but still attains optimal pose estimation. As a result of these properties, the described algorithm is very efficient, and as shown in Section IV, is capable of high-precision vision-aided inertial navigation in real time.

III. ESTIMATOR DESCRIPTION

The goal of the proposed EKF-based estimator is to track the 3D pose of the IMU-affixed frame {I} with respect to a global frame of reference {G}. In order to simplify the treatment of the effects of the earth's rotation on the IMU measurements (cf. Eqs. (7)-(8)), the global frame is chosen as an Earth-Centered, Earth-Fixed (ECEF) frame in this paper.

Algorithm 1 Multi-State Constraint Filter
Propagation: For each IMU measurement received, propagate the filter state and covariance (cf. Section III-B).
Image registration: Every time a new image is recorded,
• augment the state and covariance matrix with a copy of the current camera pose estimate (cf. Section III-C), and
• the image processing module begins operation.
Update: When the feature measurements of a given image become available, perform an EKF update (cf. Sections III-D and III-E).

An overview of the algorithm is given in Algorithm 1. The IMU measurements are processed immediately as they become available, for propagating the EKF state and covariance (cf. Section III-B). On the other hand, each time an image is recorded, the current camera pose estimate is appended to the state vector (cf. Section III-C). State augmentation is necessary for processing the feature measurements, since during EKF updates the measurements of each tracked feature are employed for imposing constraints between all camera poses from which the feature was seen. Therefore, at any time instant the EKF state vector comprises (i) the evolving IMU state, X_IMU, and (ii) a history of up to N_max past poses of the camera. In the following, we describe the various components of the algorithm in detail.
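To make this state organization concrete, the following minimal sketch (our own illustration, not the authors' implementation; all class and field names are hypothetical) holds the evolving IMU state together with the sliding window of camera poses, whose error-state covariance grows to 15 + 6N dimensions as poses are stored.

import numpy as np

class MsckfState:
    """Illustrative container for the EKF state of Algorithm 1 (names are ours).

    The IMU block holds the orientation quaternion, gyro bias, velocity,
    accelerometer bias, and position; each stored camera pose adds 6
    error-state dimensions (3 attitude, 3 position), so the covariance
    is (15 + 6N) x (15 + 6N) when N poses are in the window.
    """

    def __init__(self, n_max=30):
        self.q_IG = np.array([0.0, 0.0, 0.0, 1.0])  # {G}->{I} unit quaternion
        self.b_g = np.zeros(3)                      # gyroscope bias
        self.v_I = np.zeros(3)                      # velocity in {G}
        self.b_a = np.zeros(3)                      # accelerometer bias
        self.p_I = np.zeros(3)                      # position in {G}
        self.cam_poses = []                         # window of (q_CG, p_C) pairs
        self.P = np.zeros((15, 15))                 # error-state covariance
        self.n_max = n_max                          # cap on stored camera poses

    def error_state_dim(self):
        return 15 + 6 * len(self.cam_poses)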
A. Structure of the EKF state vector

The evolving IMU state is described by the vector:

\[
X_{IMU} = \begin{bmatrix} {}^I_G\bar{q}^T & b_g^T & {}^Gv_I^T & b_a^T & {}^Gp_I^T \end{bmatrix}^T \tag{1}
\]

where ^I_G q̄ is the unit quaternion [19] describing the rotation from frame {G} to frame {I}, ^G p_I and ^G v_I are the IMU position and velocity with respect to {G}, and finally b_g and b_a are 3 × 1 vectors that describe the biases affecting the gyroscope and accelerometer measurements, respectively. The IMU biases are modeled as random walk processes, driven by the white Gaussian noise vectors n_wg and n_wa, respectively. Following Eq. (1), the IMU error-state is defined as:

\[
\widetilde{X}_{IMU} = \begin{bmatrix} \delta\theta_I^T & \widetilde{b}_g^T & {}^G\widetilde{v}_I^T & \widetilde{b}_a^T & {}^G\widetilde{p}_I^T \end{bmatrix}^T \tag{2}
\]

For the position, velocity, and biases, the standard additive error definition is used (i.e., the error in the estimate x̂ of a quantity x is defined as x̃ = x − x̂). However, for the quaternion a different error definition is employed. In particular, if q̄ˆ is the estimated value of the quaternion q̄, then the orientation error is described by the error quaternion δq̄, which is defined by the relation q̄ = δq̄ ⊗ q̄ˆ. In this expression, the symbol ⊗ denotes quaternion multiplication. The error quaternion is

\[
\delta\bar{q} \simeq \begin{bmatrix} \tfrac{1}{2}\delta\theta^T & 1 \end{bmatrix}^T \tag{3}
\]

Intuitively, the quaternion δq̄ describes the (small) rotation that causes the true and estimated attitude to coincide. Since attitude corresponds to 3 degrees of freedom, using δθ to describe the attitude errors is a minimal representation.

Assuming that N camera poses are included in the EKF state vector at time-step k, this vector has the following form:

\[
\hat{X}_k = \begin{bmatrix} \hat{X}_{IMU}^T & {}^{C_1}_G\hat{\bar{q}}^T & {}^G\hat{p}_{C_1}^T & \cdots & {}^{C_N}_G\hat{\bar{q}}^T & {}^G\hat{p}_{C_N}^T \end{bmatrix}^T \tag{4}
\]

where ^{C_i}_G q̄ˆ and ^G p̂_{C_i}, i = 1 . . . N, are the estimates of the camera attitude and position, respectively. The EKF error-state vector is defined accordingly:

\[
\widetilde{X}_k = \begin{bmatrix} \widetilde{X}_{IMU_k}^T & \delta\theta_{C_1}^T & {}^G\widetilde{p}_{C_1}^T & \cdots & \delta\theta_{C_N}^T & {}^G\widetilde{p}_{C_N}^T \end{bmatrix}^T \tag{5}
\]
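The quaternion error convention of Eqs. (2)-(3) can be checked numerically. The sketch below is our own illustration: it stores quaternions with the vector part first, defines a quaternion product consistent with that storage (the specific product used here is an assumption; the paper follows the conventions of [19]), composes a small rotation δθ onto an estimate via q̄ = δq̄ ⊗ q̄ˆ, and then recovers δθ from the product of q̄ with the inverse of q̄ˆ.

import numpy as np

def quat_mul(q, p):
    # Quaternion product for [x, y, z, w] storage (vector part first).
    qv, qw = q[:3], q[3]
    pv, pw = p[:3], p[3]
    vec = qw * pv + pw * qv + np.cross(qv, pv)
    return np.concatenate([vec, [qw * pw - qv @ pv]])

def quat_inv(q):
    # Inverse of a unit quaternion is its conjugate.
    return np.concatenate([-q[:3], [q[3]]])

# Estimated attitude and a small attitude error delta_theta (rad).
q_hat = np.array([0.1, -0.2, 0.3, 0.9])
q_hat /= np.linalg.norm(q_hat)
delta_theta = np.array([0.01, -0.02, 0.005])

# Small-angle error quaternion, Eq. (3): delta_q ~ [0.5*delta_theta, 1].
dq = np.concatenate([0.5 * delta_theta, [1.0]])
dq /= np.linalg.norm(dq)

# True attitude defined by q = delta_q (x) q_hat; then recover delta_theta.
q_true = quat_mul(dq, q_hat)
dq_rec = quat_mul(q_true, quat_inv(q_hat))
print(2.0 * dq_rec[:3])   # ~ [0.01, -0.02, 0.005]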
B. Propagation

The filter propagation equations are derived by discretization of the continuous-time IMU system model, as described in the following:

1) Continuous-time system modeling: The time evolution of the IMU state is described by [20]:

\[
{}^I_G\dot{\bar{q}}(t) = \tfrac{1}{2}\Omega\big(\omega(t)\big)\,{}^I_G\bar{q}(t), \quad \dot{b}_g(t) = n_{wg}(t), \quad {}^G\dot{v}_I(t) = {}^G a(t), \quad \dot{b}_a(t) = n_{wa}(t), \quad {}^G\dot{p}_I(t) = {}^G v_I(t) \tag{6}
\]

In these expressions ^G a is the body acceleration in the global frame, ω = [ω_x ω_y ω_z]^T is the rotational velocity expressed in the IMU frame, and

\[
\Omega(\omega) = \begin{bmatrix} -\lfloor\omega\times\rfloor & \omega \\ -\omega^T & 0 \end{bmatrix}, \qquad
\lfloor\omega\times\rfloor = \begin{bmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{bmatrix}
\]

The gyroscope and accelerometer measurements, ω_m and a_m respectively, are given by [20]:

\[
\omega_m = \omega + C({}^I_G\bar{q})\,\omega_G + b_g + n_g \tag{7}
\]
\[
a_m = C({}^I_G\bar{q})\big({}^G a - {}^G g + 2\lfloor\omega_G\times\rfloor\,{}^G v_I + \lfloor\omega_G\times\rfloor^2\,{}^G p_I\big) + b_a + n_a \tag{8}
\]

where C(·) denotes a rotational matrix, and n_g and n_a are zero-mean, white Gaussian noise processes modeling the measurement noise. It is important to note that the IMU measurements incorporate the effects of the planet's rotation, ω_G. Moreover, the accelerometer measurements include the gravitational acceleration, ^G g, expressed in the local frame.

Applying the expectation operator in the state propagation equations (Eq. (6)) we obtain the equations for propagating the estimates of the evolving IMU state:

\[
{}^I_G\dot{\hat{\bar{q}}} = \tfrac{1}{2}\Omega(\hat{\omega})\,{}^I_G\hat{\bar{q}}, \quad \dot{\hat{b}}_g = 0_{3\times 1}, \quad
{}^G\dot{\hat{v}}_I = C_{\hat{q}}^T\hat{a} - 2\lfloor\omega_G\times\rfloor\,{}^G\hat{v}_I - \lfloor\omega_G\times\rfloor^2\,{}^G\hat{p}_I + {}^G g, \quad
\dot{\hat{b}}_a = 0_{3\times 1}, \quad {}^G\dot{\hat{p}}_I = {}^G\hat{v}_I \tag{9}
\]

where for brevity we have denoted C_q̂ = C(^I_G q̄ˆ), â = a_m − b̂_a and ω̂ = ω_m − b̂_g − C_q̂ ω_G. The linearized continuous-time model for the IMU error-state is:

\[
\dot{\widetilde{X}}_{IMU} = F\,\widetilde{X}_{IMU} + G\,n_{IMU} \tag{10}
\]

where n_IMU = [n_g^T n_wg^T n_a^T n_wa^T]^T is the system noise. The covariance matrix of n_IMU, Q_IMU, depends on the IMU noise characteristics and is computed off-line during sensor calibration. Finally, the matrices F and G that appear in Eq. (10) are given by:

\[
F = \begin{bmatrix}
-\lfloor\hat{\omega}\times\rfloor & -I_3 & 0_{3\times 3} & 0_{3\times 3} & 0_{3\times 3} \\
0_{3\times 3} & 0_{3\times 3} & 0_{3\times 3} & 0_{3\times 3} & 0_{3\times 3} \\
-C_{\hat{q}}^T\lfloor\hat{a}\times\rfloor & 0_{3\times 3} & -2\lfloor\omega_G\times\rfloor & -C_{\hat{q}}^T & -\lfloor\omega_G\times\rfloor^2 \\
0_{3\times 3} & 0_{3\times 3} & 0_{3\times 3} & 0_{3\times 3} & 0_{3\times 3} \\
0_{3\times 3} & 0_{3\times 3} & I_3 & 0_{3\times 3} & 0_{3\times 3}
\end{bmatrix}
\]

where I_3 is the 3 × 3 identity matrix, and

\[
G = \begin{bmatrix}
-I_3 & 0_{3\times 3} & 0_{3\times 3} & 0_{3\times 3} \\
0_{3\times 3} & I_3 & 0_{3\times 3} & 0_{3\times 3} \\
0_{3\times 3} & 0_{3\times 3} & -C_{\hat{q}}^T & 0_{3\times 3} \\
0_{3\times 3} & 0_{3\times 3} & 0_{3\times 3} & I_3 \\
0_{3\times 3} & 0_{3\times 3} & 0_{3\times 3} & 0_{3\times 3}
\end{bmatrix}
\]
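For reference, F and G can be assembled directly from the current estimates. The sketch below is our own construction of the block matrices above (function and variable names are ours), with the 15 error-state components ordered as in Eq. (2).

import numpy as np

def skew(v):
    # Cross-product (skew-symmetric) matrix  |v x|.
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def continuous_time_F_G(omega_hat, a_hat, C_q_hat, omega_G):
    """Assemble F (15x15) and G (15x12) of Eq. (10); block ordering follows Eq. (2)."""
    Z = np.zeros((3, 3))
    I3 = np.eye(3)
    F = np.block([
        [-skew(omega_hat), -I3, Z, Z, Z],
        [Z, Z, Z, Z, Z],
        [-C_q_hat.T @ skew(a_hat), Z, -2.0 * skew(omega_G), -C_q_hat.T,
         -skew(omega_G) @ skew(omega_G)],
        [Z, Z, Z, Z, Z],
        [Z, Z, I3, Z, Z],
    ])
    G = np.block([
        [-I3, Z, Z, Z],
        [Z, I3, Z, Z],
        [Z, Z, -C_q_hat.T, Z],
        [Z, Z, Z, I3],
        [Z, Z, Z, Z],
    ])
    return F, G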
2) Discrete-time implementation: The IMU samples the signals ω_m and a_m with a period T, and these measurements are used for state propagation in the EKF. Every time a new IMU measurement is received, the IMU state estimate is propagated using 5th order Runge-Kutta numerical integration of Eqs. (9). Moreover, the EKF covariance matrix has to be propagated. For this purpose, we introduce the following partitioning for the covariance:

\[
P_{k|k} = \begin{bmatrix} P_{II_{k|k}} & P_{IC_{k|k}} \\ P_{IC_{k|k}}^T & P_{CC_{k|k}} \end{bmatrix} \tag{11}
\]

where P_{II_{k|k}} is the 15 × 15 covariance matrix of the evolving IMU state, P_{CC_{k|k}} is the 6N × 6N covariance matrix of the camera pose estimates, and P_{IC_{k|k}} is the correlation between the errors in the IMU state and the camera pose estimates. With this notation, the covariance matrix of the propagated state is given by:

\[
P_{k+1|k} = \begin{bmatrix} P_{II_{k+1|k}} & \Phi(t_k+T, t_k)\,P_{IC_{k|k}} \\ P_{IC_{k|k}}^T\,\Phi(t_k+T, t_k)^T & P_{CC_{k|k}} \end{bmatrix}
\]

where P_{II_{k+1|k}} is computed by numerical integration of the Lyapunov equation:

\[
\dot{P}_{II} = F P_{II} + P_{II} F^T + G Q_{IMU} G^T \tag{12}
\]

Numerical integration is carried out for the time interval (t_k, t_k + T), with initial condition P_{II_{k|k}}. The state transition matrix Φ(t_k + T, t_k) is similarly computed by numerical integration of the differential equation

\[
\dot{\Phi}(t_k+\tau, t_k) = F\,\Phi(t_k+\tau, t_k), \qquad \tau \in [0, T] \tag{13}
\]

with initial condition Φ(t_k, t_k) = I_15.
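A simplified covariance-propagation step might look as follows. This sketch is our own: it holds F and G constant over the sampling interval and integrates Eqs. (12)-(13) with plain Euler steps, whereas the paper uses higher-order numerical integration; the interface and names are hypothetical.

import numpy as np

def propagate_covariance(P, F, G, Q_imu, T, n_steps=10):
    """Propagate the full (15 + 6N)-dim covariance over one IMU interval T.

    Simplification: F and G are treated as constant over [t_k, t_k + T] and
    Eqs. (12)-(13) are integrated with Euler steps.
    """
    dt = T / n_steps
    P_II = P[:15, :15].copy()
    Phi = np.eye(15)
    GQG = G @ Q_imu @ G.T
    for _ in range(n_steps):
        P_II = P_II + dt * (F @ P_II + P_II @ F.T + GQG)   # Eq. (12)
        Phi = Phi + dt * (F @ Phi)                          # Eq. (13)

    P_new = P.copy()
    P_new[:15, :15] = P_II
    P_new[:15, 15:] = Phi @ P[:15, 15:]                     # IMU-camera cross terms
    P_new[15:, :15] = P_new[:15, 15:].T
    # The camera-camera block P_CC is unchanged by propagation.
    return P_new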
C. State Augmentation

Upon recording a new image, the camera pose estimate is computed from the IMU pose estimate as:

\[
{}^C_G\hat{\bar{q}} = {}^C_I\bar{q} \otimes {}^I_G\hat{\bar{q}}, \qquad {}^G\hat{p}_C = {}^G\hat{p}_I + C_{\hat{q}}^T\,{}^I p_C \tag{14}
\]

where ^C_I q̄ is the quaternion expressing the rotation between the IMU and camera frames, and ^I p_C is the position of the origin of the camera frame with respect to {I}, both of which are known. This camera pose estimate is appended to the state vector, and the covariance matrix of the EKF is augmented accordingly:

\[
P_{k|k} \leftarrow \begin{bmatrix} I_{6N+15} \\ J \end{bmatrix} P_{k|k} \begin{bmatrix} I_{6N+15} \\ J \end{bmatrix}^T \tag{15}
\]

where the Jacobian J is derived from Eqs. (14) as:

\[
J = \begin{bmatrix} C({}^C_I\bar{q}) & 0_{3\times 9} & 0_{3\times 3} & 0_{3\times 6N} \\ \lfloor C_{\hat{q}}^T\,{}^I p_C\times\rfloor & 0_{3\times 9} & I_3 & 0_{3\times 6N} \end{bmatrix} \tag{16}
\]
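In code, the augmentation of Eqs. (15)-(16) amounts to appending six new rows and columns through the Jacobian J. The following sketch is our own illustration, with argument names that are not from the paper.

import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def augment_covariance(P, C_q_hat, C_CI, p_C_in_I, n_cam):
    """Append a new camera pose to the covariance, Eqs. (15)-(16).

    P        : current (15 + 6*n_cam) covariance matrix
    C_q_hat  : rotation matrix C(q) of the current IMU attitude estimate
    C_CI     : rotation matrix of the (known) IMU-to-camera quaternion
    p_C_in_I : (known) camera position expressed in the IMU frame
    n_cam    : number of camera poses already in the state
    """
    dim = 15 + 6 * n_cam
    J = np.zeros((6, dim))
    J[0:3, 0:3] = C_CI                           # attitude rows
    J[3:6, 0:3] = skew(C_q_hat.T @ p_C_in_I)     # position rows w.r.t. attitude error
    J[3:6, 12:15] = np.eye(3)                    # position rows w.r.t. IMU position
    T = np.vstack([np.eye(dim), J])
    return T @ P @ T.T                           # Eq. (15)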
D. Measurement Model

We now present the measurement model employed for updating the state estimates, which is the primary contribution of this paper. Since the EKF is used for state estimation, for constructing a measurement model it suffices to define a residual, r, that depends linearly on the state errors, X̃, according to the general form:

\[
r = H\,\widetilde{X} + \text{noise} \tag{17}
\]

For each observation z_i^{(j)} of a feature f_j from the camera pose C_i, we compute the measurement residual:

\[
r_i^{(j)} = z_i^{(j)} - \hat{z}_i^{(j)} \tag{20}
\]

where

\[
\hat{z}_i^{(j)} = \frac{1}{{}^{C_i}\hat{Z}_j}\begin{bmatrix} {}^{C_i}\hat{X}_j \\ {}^{C_i}\hat{Y}_j \end{bmatrix}, \qquad
\begin{bmatrix} {}^{C_i}\hat{X}_j \\ {}^{C_i}\hat{Y}_j \\ {}^{C_i}\hat{Z}_j \end{bmatrix} = C({}^{C_i}_G\hat{\bar{q}})\big({}^G\hat{p}_{f_j} - {}^G\hat{p}_{C_i}\big)
\]

Linearizing about the estimates for the camera pose and for the feature position, the residual of Eq. (20) can be approximated as:

\[
r_i^{(j)} \simeq H_{X_i}^{(j)}\,\widetilde{X} + H_{f_i}^{(j)}\,{}^G\widetilde{p}_{f_j} + n_i^{(j)} \tag{21}
\]

In the preceding expression H_{X_i}^{(j)} and H_{f_i}^{(j)} are the Jacobians of the measurement z_i^{(j)} with respect to the state and the feature position, respectively, and ^G p̃_{f_j} is the error in the position estimate of f_j. The exact values of the Jacobians in this expression are provided in [21]. By stacking the residuals of all M_j measurements of this feature, we obtain:

\[
r^{(j)} \simeq H_X^{(j)}\,\widetilde{X} + H_f^{(j)}\,{}^G\widetilde{p}_{f_j} + n^{(j)} \tag{22}
\]
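To illustrate the structure of Eqs. (20)-(22), the sketch below forms the predicted normalized image coordinates and the residual for a single observation, and builds a feature-position Jacobian by finite differences as a stand-in for H_f; the analytic Jacobians actually used by the filter are given in [21], so this numerical version is purely illustrative and the names are ours.

import numpy as np

def project(R_CG, p_C, p_f):
    # Predicted normalized image coordinates of feature p_f seen from a
    # camera with rotation R_CG = C(q) ({G}->{C}) and position p_C in {G}.
    Xc, Yc, Zc = R_CG @ (p_f - p_C)
    return np.array([Xc / Zc, Yc / Zc])

def residual_and_Hf(R_CG, p_C, p_f_hat, z_meas, eps=1e-6):
    # Residual of Eq. (20) and a finite-difference stand-in for H_f of Eq. (21).
    z_hat = project(R_CG, p_C, p_f_hat)
    r = z_meas - z_hat
    H_f = np.zeros((2, 3))
    for k in range(3):
        dp = np.zeros(3)
        dp[k] = eps
        H_f[:, k] = (project(R_CG, p_C, p_f_hat + dp) - z_hat) / eps
    return r, H_f

# Concatenating the residuals and Jacobian blocks of all M_j observations of a
# feature yields the stacked form of Eq. (22).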
[Figure: three columns of plots, with rows corresponding to the x, y, and z axes, plotted against time (sec).]
Estimates α̂_j, β̂_j, and ρ̂_j are obtained using Gauss-Newton least-squares minimization. Then, the global feature position is computed by:

\[
{}^G\hat{p}_{f_j} = \frac{1}{\hat{\rho}_j}\,C^T({}^{C_n}_G\hat{\bar{q}})\begin{bmatrix} \hat{\alpha}_j \\ \hat{\beta}_j \\ 1 \end{bmatrix} + {}^G\hat{p}_{C_n} \tag{38}
\]

We note that during the least-squares minimization process the camera pose estimates are treated as known constants, and their covariance matrix is ignored. As a result, the minimization can be carried out very efficiently, at the expense of the optimality of the feature position estimates. Recall, however, that up to a first-order approximation, the errors in these estimates do not affect the measurement residual (cf. Eq. (23)). Thus, no significant degradation of performance is inflicted.
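The back-substitution of Eq. (38) is straightforward once the Gauss-Newton estimates are available; the sketch below is our own illustration, assuming the inverse-depth parameters are expressed in the camera frame {C_n} as in the equation.

import numpy as np

def feature_position_global(alpha, beta, rho, R_CnG, p_Cn):
    # Eq. (38): recover the global feature position from the inverse-depth
    # parameters (alpha, beta, rho) estimated in camera frame {C_n}.
    # R_CnG is the rotation matrix C(q) from {G} to {C_n}; p_Cn is the
    # position of camera n expressed in {G}.
    return (1.0 / rho) * R_CnG.T @ np.array([alpha, beta, 1.0]) + p_Cn

# Example: a feature roughly 5 m in front of camera n (rho = 0.2 m^-1).
print(feature_position_global(0.02, -0.01, 0.2, np.eye(3), np.zeros(3)))
# -> [ 0.1  -0.05  5.  ]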
REFERENCES

[1] J. W. Langelaan, “State estimation for autonomous flight in cluttered environments,” Ph.D. dissertation, Stanford University, Department of Aeronautics and Astronautics, 2006.
[2] D. Strelow, “Motion estimation from image and inertial measurements,” Ph.D. dissertation, Carnegie Mellon University, 2004.
[3] L. L. Ong, M. Ridley, J. H. Kim, E. Nettleton, and S. Sukkarieh, “Six DoF decentralised SLAM,” in Australasian Conf. on Robotics and Automation, Brisbane, Australia, December 2003, pp. 10–16.
[4] E. Eade and T. Drummond, “Scalable monocular SLAM,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 17-26 2006, pp. 469–476.
[5] A. Chiuso, P. Favaro, H. Jin, and S. Soatto, “Structure from motion causally integrated over time,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 523–535, April 2002.
[6] A. J. Davison and D. W. Murray, “Simultaneous localisation and map-building using active vision,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 865–880, July 2002.
[7] S. Roumeliotis, A. Johnson, and J. Montgomery, “Augmenting inertial navigation with image-based motion estimation,” in IEEE International Conference on Robotics and Automation, Washington D.C., 2002, pp. 4326–4333.
[8] D. D. Diel, “Stochastic constraints for vision-aided inertial navigation,” Master's thesis, MIT, January 2005.
[9] D. S. Bayard and P. B. Brugarolas, “An estimation algorithm for vision-based exploration of small bodies in space,” in American Control Conference, June 8-10 2005, pp. 4589–4595.
[10] S. Soatto, R. Frezza, and P. Perona, “Motion estimation via dynamic vision,” IEEE Transactions on Automatic Control, vol. 41, no. 3, pp. 393–413, March 1996.
[11] S. Soatto and P. Perona, “Recursive 3-D visual motion estimation using subspace constraints,” IEEE Transactions on Automatic Control, vol. 22, no. 3, pp. 235–259, 1997.
[12] R. J. Prazenica, A. Watkins, and A. J. Kurdila, “Vision-based Kalman filtering for aircraft state estimation and structure from motion,” in Proceedings of the AIAA Guidance, Navigation, and Control Conference, no. AIAA 2005-6003, San Francisco, CA, Aug. 15-18 2005.
[13] R. Garcia, J. Puig, P. Ridao, and X. Cufi, “Augmented state Kalman filtering for AUV navigation,” in IEEE International Conference on Robotics and Automation, Washington D.C., 2002, pp. 4010–4015.
[14] R. Eustice, H. Singh, J. Leonard, M. Walter, and R. Ballard, “Visually navigating the RMS Titanic with SLAM information filters,” in Proceedings of Robotics: Science and Systems, Cambridge, MA, June 2005.
[15] A. I. Mourikis and S. I. Roumeliotis, “On the treatment of relative-pose measurements for mobile robot localization,” in Proceedings of the IEEE International Conference on Robotics and Automation, Orlando, FL, May 15-19 2006, pp. 2277–2284.
[16] D. Nister, O. Naroditsky, and J. Bergen, “Visual odometry for ground vehicle applications,” Journal of Field Robotics, vol. 23, no. 1, pp. 3–20, January 2006.
[17] P. McLauchlan, “The variable state dimension filter,” Centre for Vision, Speech and Signal Processing, University of Surrey, UK, Tech. Rep., 1999.
[18] M. C. Deans, “Maximally informative statistics for localization and mapping,” in IEEE International Conference on Robotics and Automation, Washington D.C., May 2002, pp. 1824–1829.
[19] W. G. Breckenridge, “Quaternions proposed standard conventions,” JPL, Tech. Rep. Interoffice Memorandum IOM 343-79-1199, 1999.
[20] A. B. Chatfield, Fundamentals of High Accuracy Inertial Navigation. Reston, VA: AIAA, 1997.
[21] A. I. Mourikis and S. I. Roumeliotis, “A multi-state constraint Kalman filter for vision-aided inertial navigation,” Dept. of Computer Science and Engineering, University of Minnesota, Tech. Rep., 2006, www.cs.umn.edu/~mourikis/tech_reports/TR_MSCKF.pdf.
[22] G. Golub and C. van Loan, Matrix Computations. The Johns Hopkins University Press, London, 1996.
[23] J. Oliensis, “A new structure-from-motion ambiguity,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 7, pp. 685–700, July 2000.
[24] A. Huster, “Relative position sensing by fusing monocular vision and inertial rate sensors,” Ph.D. dissertation, Department of Electrical Engineering, Stanford University, 2003.
[25] J. M. M. Montiel, J. Civera, and A. J. Davison, “Unified inverse depth parametrization for monocular SLAM,” in Proceedings of Robotics: Science and Systems, Philadelphia, PA, June 2006.
[26] http://www.cs.umn.edu/~mourikis/icra07video.htm
[27] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[28] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon, “Bundle adjustment – a modern synthesis,” in Vision Algorithms: Theory and Practice. Springer Verlag, 2000, pp. 298–375.