VISUAL-Inertial Navigation Systems (VINS) have seen tremendous success in recent years, enabling a wide range of applications from mobile devices to autonomous systems. Using only light-weight cameras and Inertial Measurement Units (IMUs), VINS achieve high accuracy in tracking at low cost and provide one of the best solutions for localization and navigation on constrained platforms. With the unique combination of these advantages, VINS have been the de facto standard for demanding applications such as Virtual/Augmented Reality (VR/AR) on mobile phones or headsets.

Despite the impressive performance of state-of-the-art VINS algorithms, demands from consumer AR/VR products are posing new challenges on state estimation, pushing the research frontier. Visual-Inertial Odometry (VIO) relies largely on consistent image tracking, leading to failure cases under extreme lighting or camera blockage conditions, such as in a dark room or inside a pocket. High-frequency image processing makes power consumption a bottleneck for sustainable long-term operation. In addition, widespread camera usage carries privacy implications. Targeting an alternative to state-of-the-art VINS algorithms for pedestrian applications, this paper focuses on the consumer-grade, IMU-only state estimation problem.

[…] down pedestrian IMU navigation faces the challenge of the accumulation of sensor errors, since the IMU kinematic model provides only relative state estimates. To compensate for these errors and reduce drift without the aid of other sensors or floor maps, existing approaches rely on prior knowledge of human walking motion, in particular steps. One way to make use of steps is the Zero-velocity UPdaTe (ZUPT) [3]. It detects when the foot touches the ground to generate pseudo-measurement updates in an Extended Kalman Filter (EKF) framework to calibrate IMU biases. However, it only works well when the sensor is attached to the foot, where step detection is obvious. Another category is step counting [4], which does not require the sensor to be attached to the foot. Such systems consist of multiple submodules: the identification and classification of steps, data segmentation, and step-length prediction, all of which require heavy tuning of hand-designed heuristics or machine learning.

In parallel, recent research advances (IONet [5] and RoNIN [6]) have shown that integrating average-velocity estimates from a trained neural network yields highly accurate 2D trajectory reconstruction using only IMU data from pedestrian hand-carried devices. These results showed the ability of networks to learn a translation motion model from pedestrian data. In this work, we draw inspiration from these new findings to build an IMU-only dead-reckoning system trained with data collected from a head-mounted device, similar […]

¹ GRASP Lab, University of Pennsylvania, Philadelphia, USA
² Facebook Reality Labs, Redmond, USA
The EKF propagates with raw IMU samples and uses the network outputs for measurement updates. We define the measurement in a local gravity-aligned frame to decouple global yaw information from the relative state measurement (see Sec. V-D). The propagation from raw IMU data provides a model-based kinematic motion model, and the neural network provides a statistical motion model. The filter tightly couples these two sources of information.

At runtime, the raw IMU samples are interpolated to the network input frequency and rotated to a local gravity-aligned frame using rotations estimated from the filter state and gyroscope data. A gravity-aligned frame always has gravity pointing downward; placing data in this frame implicitly gives the gravity direction information to the network.

Note that in the proposed approach, IMU data is used twice: as direct input for state propagation and, indirectly through the network, as the measurement. This violates the independence assumptions the EKF is based on: errors in the IMU network input (initial orientation, sensor bias, and sensor noise) propagate to its output, which is then used to correct propagation results polluted with the same noise, and this could lead to estimation inconsistency. We address this issue with training techniques (see Sec. IV) that reduce this error propagation by adding random perturbations to the IMU bias and gravity direction during training. We show in Section VII that these techniques successfully improve the robustness of the network to sensor bias and rotation inaccuracies.
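For concreteness, the window preprocessing described above (interpolate, then rotate into a local gravity-aligned frame using the filter's orientation estimate) can be sketched as follows. This is an illustration of the idea, not the implementation used in this work: the helper name, the use of SciPy rotations, and the choice of applying a single anchor rotation to the whole window are assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def gravity_aligned_window(imu_win, R_world_imu):
    """Rotate a window of IMU samples into a local gravity-aligned frame.

    imu_win     : (N, 6) array, columns = [gyro_xyz (rad/s), accel_xyz (m/s^2)],
                  expressed in the IMU frame.
    R_world_imu : scipy Rotation, world-from-IMU orientation at the start of
                  the window (e.g. taken from the filter state).
    """
    # Remove the yaw of the anchor orientation so that the resulting frame has
    # gravity pointing down but a fixed, arbitrary heading.
    yaw = R_world_imu.as_euler("ZYX")[0]
    R_ga_imu = R.from_euler("ZYX", [-yaw, 0.0, 0.0]) * R_world_imu

    gyro_ga = R_ga_imu.apply(imu_win[:, :3])
    accel_ga = R_ga_imu.apply(imu_win[:, 3:])
    return np.hstack([gyro_ga, accel_ga])  # (N, 6) network input
```

Dropping the anchor yaw keeps gravity pointing down while hiding the global heading from the network, which is consistent with the yaw-decoupled measurement used by the filter.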
IV. NEURAL STATISTICAL MOTION MODEL

A. Architecture and Loss Function Design

Our network uses a 1D version of the ResNet18 architecture proposed in [17]. The input to the network has dimension N × 6, consisting of N IMU samples in the gravity-aligned frame. The output of the network contains two 3D vectors: the displacement estimate $\hat{d}$ and its uncertainty $\hat{u}$, which parametrizes the diagonal entries of the covariance. The two vectors have independent fully-connected blocks extending the main ResNet architecture.

We make use of two different loss functions during training: the Mean Square Error (MSE) and the Gaussian Maximum Likelihood loss. The MSE loss on the training dataset is defined as:

$$\mathcal{L}_{\mathrm{MSE}}(\hat{d}, d) = \frac{1}{n}\sum_{i=1}^{n} \lVert d_i - \hat{d}_i \rVert^2 \quad (1)$$

where $\hat{d} = \{\hat{d}_i\}_{i\le n}$ are the 3D displacement outputs of the network, $d = \{d_i\}_{i\le n}$ are the ground-truth displacements, and n is the number of data points in the training set.

We define the Maximum Likelihood loss as the negative log-likelihood of the displacement under the regressed Gaussian distribution:

$$\mathcal{L}_{\mathrm{ML}}(\hat{d}, \hat{\Sigma}, d) = -\frac{1}{n}\sum_{i=1}^{n} \log\!\left(\frac{1}{\sqrt{8\pi^{3}\det(\hat{\Sigma}_i)}}\, e^{-\frac{1}{2}\lVert d_i - \hat{d}_i\rVert^{2}_{\hat{\Sigma}_i}}\right) = \frac{1}{n}\sum_{i=1}^{n}\left(\tfrac{1}{2}\log\det(\hat{\Sigma}_i) + \tfrac{1}{2}\lVert d_i - \hat{d}_i\rVert^{2}_{\hat{\Sigma}_i}\right) + \mathrm{Cst} \quad (2)$$

where $\hat{\Sigma} = \{\hat{\Sigma}_i\}_{i\le n}$ are the 3 × 3 covariance matrices for the i-th data point, given as a function of the network uncertainty output vector $\hat{u}_i$. $\hat{\Sigma}_i$ has 6 degrees of freedom, and there are various covariance parametrizations for neural network uncertainty estimation [18]. In this paper, we simply assume a diagonal covariance output, parametrized by 3 coefficients:

$$\hat{\Sigma}_i(\hat{u}_i) = \mathrm{diag}\!\left(e^{2\hat{u}_{xi}},\, e^{2\hat{u}_{yi}},\, e^{2\hat{u}_{zi}}\right) \quad (3)$$

The diagonal assumption decouples the axes, while regressing the logarithm of the standard deviations removes the singularity around zero in the loss function, adding numerical stabilization and helping convergence of the optimization. This choice constrains the principal axes of the uncertainty ellipses to be aligned with the axes of the gravity-aligned frame.
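To make the loss design concrete, a minimal PyTorch-style sketch of Eqs. (1)-(3) is given below. It is our own illustration with assumed tensor shapes, not the training code of this work; constant terms of the likelihood are dropped.

```python
import torch

def mse_loss(d_hat, d):
    # Eq. (1): mean squared displacement error over the batch.
    return (d - d_hat).pow(2).sum(dim=1).mean()

def gaussian_nll_loss(d_hat, u_hat, d):
    """Eqs. (2)-(3): negative log-likelihood with diagonal covariance.

    d_hat : (B, 3) regressed displacements
    u_hat : (B, 3) regressed log standard deviations (sigma = exp(u_hat))
    d     : (B, 3) ground-truth displacements
    """
    inv_var = torch.exp(-2.0 * u_hat)          # 1 / sigma^2 per axis
    # 0.5 * log det(Sigma_i) = sum(u_hat); constant terms are dropped.
    nll = 0.5 * ((d - d_hat).pow(2) * inv_var).sum(dim=1) + u_hat.sum(dim=1)
    return nll.mean()
```

Regressing log standard deviations keeps the implied covariance strictly positive without any explicit constraint on the network output.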
B. Data Collection and Implementation Details

We use a dataset collected with a custom rig in which an IMU (Bosch BMI055) is mounted on a headset rigidly attached to the cameras. The full dataset contains more than 400 sequences totaling 60 hours of pedestrian data covering a variety of activities, including walking, standing still, organizing the kitchen, playing pool, and going up and down stairs. It was captured with multiple physical devices by more than 5 people, so as to cover a wide range of individual motion patterns and IMU systematic errors. A state-of-the-art visual-inertial filter based on [19] provides position estimates at 1000 Hz on the entire dataset. We use these results both as supervision data in the training set and as ground truth in the test set. The dataset is split randomly into 80% training, 10% validation, and 10% test subsets.

For network training, we use an overlapping sliding window on each sequence to collect input samples. Each window contains N IMU samples, for a total size of N × 6. In our final system we choose N = 200 for 200 Hz IMU data. We want the network to capture a motion model with respect to the gravity-aligned IMU frame; therefore the IMU samples in each window are rotated from the IMU frame to a gravity-aligned frame built from the orientation at the beginning of the window, using the visual-inertial ground-truth rotation. The supervision data for the network output is computed as the difference of the ground-truth position between two instants, expressed in the same headset-centered, gravity-aligned frame.

During training, because we assume the headset can be worn at an arbitrary heading angle with respect to the walking direction, we augment the input data to make the network yaw invariant, giving a random horizontal rotation to each sample, following RoNIN [6]. In our final estimator, the network is fed with potentially inaccurate input from the filter, especially at the initialization stage. We simulate this at training time with random perturbations of the sensor bias and the gravity direction, to reduce the network's sensitivity to these input errors. To simulate bias, we generate an additive bias vector for each input sample, with each component independently sampled from a uniform distribution in [−0.2, 0.2] m/s² (accelerometer) or [−0.05, 0.05] rad/s (gyroscope). The gravity direction is perturbed by rotating the samples about a random horizontal rotation axis, with magnitude sampled from [0, 5°].
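The input perturbations above can be summarized by a small augmentation routine. The sketch below (NumPy, hypothetical helper name) mirrors the quoted ranges but is not the actual data pipeline; note that the same yaw rotation must also be applied to the ground-truth displacement, which is omitted here.

```python
import numpy as np

def perturb_window(imu_win, rng):
    """Augment one (N, 6) IMU window, columns = [gyro_xyz, accel_xyz].

    Ranges follow the text: additive bias uniform in [-0.05, 0.05] rad/s
    (gyro) and [-0.2, 0.2] m/s^2 (accel); gravity direction tilted by up to
    5 degrees about a random horizontal axis; random yaw for heading
    invariance (the supervision displacement must get the same yaw).
    """
    out = imu_win.copy()
    out[:, :3] += rng.uniform(-0.05, 0.05, size=3)   # gyroscope bias
    out[:, 3:] += rng.uniform(-0.2, 0.2, size=3)     # accelerometer bias

    def rot(axis, angle):                            # Rodrigues' formula
        K = np.array([[0, -axis[2], axis[1]],
                      [axis[2], 0, -axis[0]],
                      [-axis[1], axis[0], 0]])
        return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * K @ K

    phi = rng.uniform(0.0, 2.0 * np.pi)
    R_tilt = rot(np.array([np.cos(phi), np.sin(phi), 0.0]),   # horizontal axis
                 rng.uniform(0.0, np.deg2rad(5.0)))
    R_yaw = rot(np.array([0.0, 0.0, 1.0]), rng.uniform(0.0, 2.0 * np.pi))
    R_aug = R_yaw @ R_tilt
    out[:, :3] = out[:, :3] @ R_aug.T
    out[:, 3:] = out[:, 3:] @ R_aug.T
    return out
```

Usage would simply be `aug = perturb_window(win, np.random.default_rng())` applied independently to each training window.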
Optimization is done with the Adam optimizer. We use an initial learning rate of 0.0001, zero weight decay, and dropout with probability 0.5 for the fully-connected layers. We observe that training with $\mathcal{L}_{\mathrm{ML}}$ directly does not converge. Therefore we first train with $\mathcal{L}_{\mathrm{MSE}}$ for 10 epochs until the network stabilizes, then switch to $\mathcal{L}_{\mathrm{ML}}$ until the network fully converges. This takes around another 10 epochs, and a total of 4 hours of training time is needed on an NVIDIA DGX computer.
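The two-stage schedule can be pictured with a short training loop. This is a sketch rather than the actual code: `model` is assumed to return the displacement and log-sigma heads, and the loss functions are the ones sketched in Sec. IV-A above.

```python
import torch

def train(model, train_loader, num_epochs=20):
    """Two-stage schedule: MSE warm-up, then the Gaussian likelihood loss.

    `mse_loss` and `gaussian_nll_loss` refer to the loss sketches above.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.0)
    for epoch in range(num_epochs):
        use_nll = epoch >= 10                 # switch after the MSE warm-up
        for imu_win, d_gt in train_loader:
            d_hat, u_hat = model(imu_win)
            loss = (gaussian_nll_loss(d_hat, u_hat, d_gt) if use_nll
                    else mse_loss(d_hat, d_gt))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```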
V. STOCHASTIC CLONING EXTENDED KALMAN FILTER

The EKF in our system tightly integrates the displacements predicted by the network with a statistical IMU model, as used in other inertial navigation systems. As the displacement estimates from the network express constraints on pairs of past states, we adopt a stochastic cloning framework [20]. Similar to [19], we maintain a sliding window of m poses in the filter state. In contrast to [19], however, we only apply constraints between pairs of poses, and these constraints are derived from the network described in the preceding section rather than from camera data.

A. State Definition

At each instant, the full state of the EKF is defined as:

$$X = (\xi_1, \dots, \xi_m, s)$$

where $\xi_i$, $i = 1, \dots, m$ are past (cloned) states and $s$ is the current state. More specifically,

$$\xi_i = ({}^{w}_{i}R_i,\, {}^{w}p_i), \qquad s = ({}^{w}_{i}R,\, {}^{w}v,\, {}^{w}p,\, b_g,\, b_a).$$

We express ${}^{w}_{i}R$ as the rotation matrix that transforms a point from the IMU frame to the world frame, and ${}^{w}v$ and ${}^{w}p$ are respectively the velocity and position expressed in the world frame. In the following, we drop the explicit superscript for conciseness. $b_g$ and $b_a$ are the IMU gyroscope and accelerometer biases.

As commonly done in such setups, we apply the error-state filtering methodology to linearize locally on the manifold of the minimal parametrization of the rotation. More specifically, the filter covariance is defined as the covariance of the following error-state:

$$\tilde{\xi}_i = (\tilde{\theta}_i,\, \delta\tilde{p}_i), \qquad \tilde{s} = (\tilde{\theta},\, \delta\tilde{v},\, \delta\tilde{p},\, \delta\tilde{b}_g,\, \delta\tilde{b}_a)$$

where the tilde indicates the error of each state, defined as the difference between the estimate and the true value, except for the rotation error, which is defined as $\tilde{\theta} = \log_{SO3}(R\hat{R}^{-1}) \in \mathfrak{so}(3)$, where $\log_{SO3}$ denotes the logarithm map of rotation. The full error-state is of dimension 6m + 15, where m is the total number of past states, and $\tilde{s}$ has dimension 15.

B. IMU Model

We assume that the IMU sensor provides direct measurements of the non-gravitational acceleration $a$ and the angular velocity $\omega$, expressed in the IMU frame. We assume, as is common for this class of sensor, that the measurements are polluted with noise $n_g$, $n_a$ and biases $b_g$, $b_a$:

$$\omega = \omega_{\mathrm{true}} + b_g + n_g, \qquad a = a_{\mathrm{true}} + b_a + n_a.$$

$n_g$ and $n_a$ are random noise variables following a zero-centered Gaussian distribution. The evolution of the biases is modeled as a random walk process with discrete parameters $\eta_{gd}$ and $\eta_{ad}$ over the IMU sampling period $\Delta t$.

C. State Propagation and Augmentation

The filter propagates the current state $s$ with IMU data using a kinematic motion model. If the current timestamp is associated with a measurement update, stochastic cloning is performed together with propagation in one step. During cloning, a new state $\xi$ is appended to the past states.

1) Propagation Model: We use the strapdown inertial kinematic equations, assuming a uniform gravity field ${}^{w}g$ and ignoring Coriolis forces and the earth's curvature:

$${}^{w}_{i}\hat{R}_{k+1} = {}^{w}_{i}\hat{R}_k \exp_{SO3}\!\big((\omega_k - \hat{b}_{gk})\Delta t\big) \quad (4)$$
$${}^{w}\hat{v}_{k+1} = {}^{w}\hat{v}_k + {}^{w}g\,\Delta t + {}^{w}_{i}\hat{R}_k (a_k - \hat{b}_{ak})\Delta t \quad (5)$$
$${}^{w}\hat{p}_{k+1} = {}^{w}\hat{p}_k + {}^{w}\hat{v}_k\,\Delta t + \tfrac{1}{2}\Delta t^2\big({}^{w}g + {}^{w}_{i}\hat{R}_k (a_k - \hat{b}_{ak})\big) \quad (6)$$
$$\hat{b}_{g(k+1)} = \hat{b}_{gk} + \eta_{gdk} \quad (7)$$
$$\hat{b}_{a(k+1)} = \hat{b}_{ak} + \eta_{adk}. \quad (8)$$

Here, $\exp_{SO3}$ denotes the SO(3) exponential map, the inverse function of $\log_{SO3}$. The discrete-time noise terms $\eta_{gdk}$ and $\eta_{adk}$ have been defined above and follow a Gaussian distribution.

The linearized error propagation can be written as:

$$\tilde{s}_{k+1} = A^{s}_{k(15,15)}\,\tilde{s}_k + B^{s}_{k(15,12)}\,n_k \quad (9)$$

where $n_k = [n_{\omega k},\, n_{ak},\, \eta_{gdk},\, \eta_{adk}]^T$ is a vector containing all the random inputs. The superscript $s$ indicates that these matrices correspond to the current state $s$, and the subscript brackets indicate matrix dimensions. The corresponding propagation of the state covariance $P$ is:

$$P_{k+1} = A_k P_k A_k^T + B_k W B_k^T, \quad (10)$$
$$A_k = \begin{bmatrix} I_{6m} & 0 \\ 0 & A^{s}_k \end{bmatrix}, \qquad B_k = \begin{bmatrix} 0 \\ B^{s}_k \end{bmatrix} \quad (11)$$

with $W_{(12,12)}$ the covariance matrix of the sensor noise and of the bias random-walk noise. The state covariance $P$ has the dimension (6m + 15) of the full state $X$. In the implementation, multiple propagation steps can be combined together to reduce the computational cost [20].
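Read literally, Eqs. (4)-(8) translate into a few lines of code. The snippet below is a plain NumPy sketch under the stated model assumptions (uniform gravity, no Coriolis or earth-curvature terms), not the filter implementation.

```python
import numpy as np

def so3_exp(w):
    """SO(3) exponential map of a rotation vector w (rad)."""
    theta = np.linalg.norm(w)
    K = np.array([[0, -w[2], w[1]],
                  [w[2], 0, -w[0]],
                  [-w[1], w[0], 0]])
    if theta < 1e-9:
        return np.eye(3) + K                      # small-angle approximation
    return (np.eye(3) + np.sin(theta) / theta * K
            + (1 - np.cos(theta)) / theta**2 * K @ K)

def propagate(R, v, p, bg, ba, gyro, accel, dt,
              g=np.array([0.0, 0.0, -9.81])):
    """One strapdown step, Eqs. (4)-(8). Bias estimates are carried over;
    their random-walk noise enters only through the covariance."""
    a_world = R @ (accel - ba)
    R_next = R @ so3_exp((gyro - bg) * dt)              # Eq. (4)
    v_next = v + g * dt + a_world * dt                  # Eq. (5)
    p_next = p + v * dt + 0.5 * dt**2 * (g + a_world)   # Eq. (6)
    return R_next, v_next, p_next, bg, ba               # Eqs. (7)-(8)
```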
2) State Augmentation: In our system, state augmentations are performed at the measurement update frequency. During an augmentation step, the state dimension is incremented through propagation with cloning:

$$P_{k+1} = \bar{A}_k P_k \bar{A}_k^T + \bar{B}_k W \bar{B}_k^T, \quad (12)$$
$$\bar{A}_k = \begin{bmatrix} I_{6m} & 0 \\ 0 & A^{\xi}_k \\ 0 & A^{s}_k \end{bmatrix}, \qquad \bar{B}_k = \begin{bmatrix} 0 \\ B^{\xi}_k \\ B^{s}_k \end{bmatrix} \quad (13)$$

$\bar{A}_k$ is now a copy operation plus the augmented and current state propagation operations. $A^{\xi}_k$ and $B^{\xi}_k$ are partial propagation matrices for rotation and position only. After the increment, the dimension of the state vector increases by 6. Old past states will be pruned in the marginalization step, see Sec. V-E.
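The covariance bookkeeping of the cloning step can be illustrated by the pure-copy case, where the new clone duplicates the current rotation and position error states. The paper additionally folds one propagation step into $\bar{A}_k$, which is omitted in this hedged sketch.

```python
import numpy as np

def clone_state_covariance(P):
    """Append a cloned pose (rotation + position error, 6 dof) to the state.

    P is the (6m+15, 6m+15) covariance, ordered as [clones (6m) | current
    state (15)], with the current-state block ordered [theta, v, p, bg, ba].
    The clone copies the current rotation and position errors, so its
    covariance rows/columns are copies of the corresponding rows/columns.
    """
    n = P.shape[0]
    s0 = n - 15                           # start of the current-state block
    J = np.zeros((6, n))
    J[0:3, s0:s0 + 3] = np.eye(3)         # copy rotation error
    J[3:6, s0 + 6:s0 + 9] = np.eye(3)     # copy position error
    T = np.vstack([np.eye(n)[:s0], J, np.eye(n)[s0:]])  # insert clone before s
    return T @ P @ T.T                    # dimension grows by 6
```

After this operation the covariance dimension grows by 6, matching the appended clone in the state vector.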
B. System Performance

Figure 3 shows the distribution of the metrics across the entire test set. It demonstrates that our system consistently performs better than the decoupled orientation and position estimator 3D-RONIN on all metrics.

Fig. 3: Comparing TLIO to 3D-RONIN on the error metrics with respect to the ground truth. Each plot shows the cumulative density function of the chosen metric on the entire test set. The steeper the plots are, the better the performance. TLIO shows significantly lower values than 3D-RONIN for all presented metrics.

On 3D position, TLIO performs better than 3D-RONIN, which integrates average velocities. The integration approach has the advantage of being robust to outliers, since the measurements at each instant are decoupled. The result shows not only the benefit of integrating the IMU kinematic model, but also the overall robustness of the filter, which comes from the quality of the covariance output of the network and the effectiveness of outlier rejection with the χ² test.

Our system also has a smaller yaw drift than 3D-RONIN, indicating that even without any hand-engineered heuristics or step detection, using displacement estimates from a trained statistical model outperforms a smartphone AHRS attitude filter. This shows that the EKF can accurately estimate not only position and velocity, but also gyroscope biases.

Figure 4 shows a hand-picked collection of 7 trajectories containing the best, typical, and worst cases of TLIO and 3D-RONIN, while Figure 1 shows a 3D visualization of one additional sequence. See the corresponding captions for more details about the trajectories and failure cases.
Fig. 6: Uncertainty returned by the network (standard deviation σ) plotted against errors (m) on the three axes x, y, z respectively. Points with larger errors have larger variance outputs, and over 99% of the points fall inside the 3σ cone region indicated by the dashed red line. We have 0.70% of error points outside the 3σ bounds for the x and y axes and 0.47% for the z axis.

2) Consistency of learned covariance: We collected around 60 000 samples at 5 Hz on the entire test set for this analysis and the sensitivity analysis section below. Fig. 6 plots the sigma outputs of the network against the errors on the displacement. We observe that over 99% of the error points fall inside the 3σ region, showing that the network outputs are statistically consistent. In addition, we compute the Mahalanobis distance of the output displacement errors scaled by the output covariances in 3D. We found that only 0.30% of the samples were beyond the 99-percentile critical value of the χ² distribution, all of which are from the failure case shown in Fig. 4.3.b). From this analysis, we conclude that the uncertainty output of the network grows as expected with its own error, albeit being often slightly conservative. We observe an asymmetry between the horizontal and vertical error distributions; along the vertical axis, errors and sigmas concentrate on lower values. This is not surprising: people usually walk either on flat ground or on stairs, which may make expressing a prior easier on the z axis.
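This consistency check is easy to reproduce: compute the squared Mahalanobis distance of each displacement error under the regressed diagonal covariance and compare it to the χ² critical value for 3 degrees of freedom. The snippet below is a sketch with assumed array shapes, not the evaluation code of this work.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outlier_rate(err, log_sigma, p=0.99):
    """err, log_sigma : (n, 3) displacement errors and regressed log-sigmas.

    Returns the fraction of samples whose squared Mahalanobis distance
    exceeds the chi-square critical value at probability p (3 dof)."""
    d2 = np.sum(err**2 * np.exp(-2.0 * log_sigma), axis=1)   # e^T Sigma^-1 e
    return np.mean(d2 > chi2.ppf(p, df=3))
```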
3) Sensitivity Analysis: Figure 7 shows the robustness of the network output to input perturbations on bias and gravity direction. We observe that the models trained with bias perturbation are more robust to IMU bias, in particular accelerometer bias. The model trained with gravity perturbation is significantly more […]

C. EKF System

1) Ablation Study: We compare system variants using a hand-tuned constant covariance and networks trained without the likelihood loss, to show the benefit of using the regressed uncertainty in TLIO. To find the best constant covariance parameters, we use the standard deviation of the measurement errors on the test set, diag(0.051, 0.051, 0.013) m, multiplied by a factor yielding the best median ATE, tuned by grid search. Fig. 8 shows the cumulative density function of ATE, Drift, and AYE over the test set for various system configurations. We observe that, used as 3D-RONIN, the network trained with only the MSE loss gives a more accurate estimate. However, with a fixed covariance, such network variants in a filter system achieve similar performance. What makes a difference is the regressed covariance from the network, which significantly improves the filter in ATE and Drift. We also notice a dataset where using a fixed covariance loses track due to an initialization failure, showing the adaptive property of the regressed uncertainty for different inputs, which improves system robustness. Compared to 3D-RONIN-mse, TLIO shows an improvement of 27% and 31% on the average yaw and position drift respectively.

2) Timing parameters: Fig. 9 shows the performance comparison between variations of the network IMU frequency, ∆t_ij, and the filter measurement-update frequency. ATE and drift are reduced when using high-frequency measurements. Once again, we observe minor differences between network parameter variations on the monitored metrics. This shows that our entire pipeline is not sensitive to these parameter choices, and TLIO outperforms 3D-RONIN in all cases.

VIII. CONCLUSION

In this paper we propose a tightly-coupled inertial odometry algorithm introducing a learned component in an EKF framework. We present the network regressing 3D displacement and […]