
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
arXiv:2007.01867v3 [cs.RO] 10 Jul 2020
IEEE ROBOTICS AND AUTOMATION LETTERS. PREPRINT VERSION. ACCEPTED JUNE, 2020

TLIO: Tight Learned Inertial Odometry


Wenxin Liu1,2 , David Caruso2 , Eddy Ilg2 , Jing Dong2 , Anastasios I. Mourikis2 ,
Kostas Daniilidis1 , Vijay Kumar1 , and Jakob Engel2

Abstract—In this work we propose a tightly-coupled Extended Kalman Filter framework for IMU-only state estimation. Strap-down IMU measurements provide relative state estimates based on the IMU kinematic motion model. However, the integration of measurements is sensitive to sensor bias and noise, causing significant drift within seconds. Recent research by Yan et al. (RoNIN) and Chen et al. (IONet) showed the capability of using trained neural networks to obtain accurate 2D displacement estimates from segments of IMU data, and obtained good position estimates from concatenating them. This paper demonstrates a network that regresses 3D displacement estimates and their uncertainty, giving us the ability to tightly fuse the relative state measurement into a stochastic cloning EKF to solve for pose, velocity and sensor biases. We show that our network, trained with pedestrian data from a headset, can produce statistically consistent measurements and uncertainties to be used as the update step in the filter, and that the tightly-coupled system outperforms velocity integration approaches in position estimates, and an AHRS attitude filter in orientation estimates. Video materials and code can be found on our project page: cathias.github.io/TLIO

Fig. 1: An example 3D trajectory estimated by TLIO from a single MEMS IMU, showing the user repeatedly walking up and down a staircase. Our method estimates full orientation and translation in 3D: on the left, we show the estimated x, y, z position (blue) as well as associated uncertainties (±3σ in red); on the right, we show the resulting 3D trajectory. For both plots, we show the output of state-of-the-art visual-inertial odometry (green) for comparison.
Index Terms—Localization, AI-Based Methods, Pedestrian Dead Reckoning, Inertial State Estimation

1 GRASP Lab, University of Pennsylvania, Philadelphia, USA [email protected]
2 Facebook Reality Labs, Redmond, USA [email protected]

I. INTRODUCTION

Visual-Inertial Navigation Systems (VINS) in recent years have seen tremendous success, enabling a wide range of applications from mobile devices to autonomous systems. Using only light-weight cameras and Inertial Measurement Units (IMUs), VINS achieve high accuracy in tracking at low cost and provide one of the best solutions for localization and navigation on constrained platforms. With the unique combination of these advantages, VINS have been the de facto standard for demanding applications such as Virtual/Augmented Reality (VR/AR) on mobile phones or headsets.

Despite the impressive performance of state-of-the-art VINS algorithms, demands from consumer AR/VR products are posing new challenges on state estimation, pushing the research frontier. Visual-Inertial Odometry (VIO) largely relies on consistent image tracking, leading to failure cases under extreme lighting or camera blockage conditions, such as in a dark room or inside a pocket. High-frequency image processing makes power consumption a bottleneck for sustainable long-term operation. In addition, widespread camera usage carries privacy implications. Targeting an alternative to state-of-the-art VINS algorithms for pedestrian applications, this paper focuses on the consumer-grade IMU-only state estimation problem known as strap-down Inertial Navigation System (INS) or Dead Reckoning [1].

Inertial Pedestrian Dead Reckoning (PDR) has gained increasing interest with the price decline and widespread usage of MEMS sensors in the past two decades [2]. Strap-down pedestrian IMU navigation faces the challenge of the accumulation of sensor errors, since the IMU kinematic model provides only relative state estimates. To compensate for the errors and reduce drift without the aid of other sensors or floor maps, existing approaches rely on prior knowledge of human walking motion, in particular, steps. One way to make use of steps is the Zero-velocity UPdaTe (ZUPT) [3]: it detects when the foot touches the ground to generate pseudo-measurement updates in an Extended Kalman Filter (EKF) framework to calibrate IMU biases. However, it only works well when the sensor is attached to the foot, where step detection is obvious. Another category is step counting [4], which does not require the sensor to be attached to the foot. Such systems consist of multiple submodules: the identification and classification of steps, data segmentation, and step length prediction, all of which require heavy tuning of hand-designed heuristics or machine learning.

In parallel, recent research advances (IONet [5] and RoNIN [6]) have shown that integrating average velocity estimates from a trained neural network results in highly accurate 2D trajectory reconstruction using only IMU data from pedestrian hand-carried devices. These results showed the ability of networks to learn a translation motion model from pedestrian data. In this work, we draw inspiration from these new findings to build an IMU-only dead reckoning system trained with data collected from a head-mounted device, similar

to a VR headset. This paper has two major contributions:

1) We propose a network design to regress both the 3D displacement and the corresponding covariance, and show the network's capability of providing both accurate and statistically consistent outputs trained on our pedestrian dataset.
2) We propose a complete state estimation system combining the neural network with an EKF in a tightly-coupled formulation that jointly estimates position, orientation, velocity, and IMU biases with only pedestrian IMU data.

Fig. 2: System block diagram. The IMU buffer provides segments of gravity-aligned IMU measurements to the network, using rotation from the filter state. The network outputs displacement d̂ and uncertainty û used as measurement update to the filter. The filter estimates rotation, velocity, position and IMU biases at IMU rate.

Our approach presents significant advantages over other IMU-only state estimation approaches. Compared to traditional PDR methods, it avoids the limitations and complexities of step-counting/stride or gait detection by learning a displacement motion model valid for any data segments, simplifying the process and improving accuracy and robustness. Compared to common deep learning approaches like RoNIN, it elevates the problem to the 3D domain and does not require an external ZUPT-based orientation estimator. This tight fusion approach reduces average yaw and position drift by 27% and 33% respectively on our test dataset compared to the best performing RoNIN velocity concatenation baseline approach. Fig. 1 shows an example of the estimated trajectory. We name our method TIGHT LEARNED INERTIAL ODOMETRY (TLIO).

II. RELATED WORKS

VIO. Visual-Inertial Odometry has been well studied in the literature [7]. Without power and privacy issues, and when static visual features are clearly trackable, VIO is the standard method for state estimation.

PDR. Solutions to Pedestrian Dead Reckoning problems can be divided into two categories: Bayesian filtering [8] and step counting [4]. With the IMU as the only source of information for both propagation and update steps of an EKF, the measurements need to be carefully constructed. Different types of pseudo-measurement approaches include the Zero-velocity UPdaTe (ZUPT) [3] and the Zero Angular Rate Update (ZARU) [9], which detect when the system is static, and heuristic heading reduction (HDR) [10], which identifies walking in a straight line to calibrate gyroscope bias. A foot-mounted IMU sensor enables reliable detection of contact, from which accurate velocity constraints can be derived [11]; however, such information is not present for hand-carried devices, and various signal processing and machine learning algorithms, such as peak detection in the time/frequency domain and activity classification, have been explored [4]. These algorithms are complex and involve much tuning to accurately estimate a trajectory. It is also an active research direction, as deep learning approaches solving subproblems such as gait classification [12] and stride detection [13] are investigated.

Deep Learning. The emergence of deep learning provides new possibilities to extract information from IMU data. Estimating average velocity from IMU data segments has seen great success in position estimation. IONet [5] first proposed an LSTM structure to output relative displacement in 2D polar coordinates and concatenate it to obtain position. RIDI [14] and RoNIN [6] both assume orientation is known from an Android device to rotate IMU data into a gravity-aligned frame. While RIDI regresses velocity to optimize bias but still uses double integration from the corrected IMU data for position, RoNIN directly integrates the regressed velocity and showed better results. This shows that a statistical IMU motion model can outperform the traditional kinematic IMU model in scenarios that can be captured with training data.

In addition to using networks alone for pose estimates, Backprop KF [15] proposes an end-to-end differentiable Kalman filter framework. Because the loss function is on the accuracy of the filter outputs, the noise parameters are trained to produce the best state estimate, and do not necessarily best capture the measurement error model. AI-IMU [16] uses this approach to estimate IMU noise parameters and measurement uncertainties for car applications. In this work, we also combine deep learning with a Kalman filter. However, instead of training an end-to-end system, we leverage a state-of-the-art visual-inertial fusion algorithm as ground truth for supervised learning of the measurement function itself. We use a likelihood loss function to jointly train for 3D displacement and uncertainty, which are directly used as input to the EKF in the measurement update.

III. SYSTEM DESIGN

Our estimation system takes IMU data as input and has two major components: the network and the filter. Figure 2 shows a high-level diagram of the system.

The first component is a convolutional neural network trained to regress the 3D relative displacement and uncertainty between two time instants given the segment of IMU data in between. The network is required to infer positional displacement over a short timespan from acceleration and angular velocity measurements alone, without access to the initial velocity. This is intentional: it forces the network to learn only a prior expressing "typical human motion" and leaves model-based state propagation, i.e. acceleration integration to propagate velocity, and the respective uncertainty propagation, to the second system component, the EKF. In order for this to work well, we found that it is important to include the inferred prior's uncertainty in the network output, allowing the network to encode how much motion model prior it obtained from the input measurements.

The EKF estimates the current state: 3D position, velocity, orientation and IMU biases, and a sparse set of past poses.
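As a concrete illustration of the gravity-aligned preprocessing shown in Fig. 2, the sketch below rotates a window of IMU samples into a frame built from the orientation at the start of the window, with its yaw removed so that gravity points along a fixed axis. This is only a sketch under our assumptions; the function name and the yaw-extraction convention are ours, not the paper's implementation.

```python
import numpy as np

def gravity_align_window(R_ws, gyro, accel):
    """Rotate a window of IMU samples into a gravity-aligned frame.

    R_ws  : (n, 3, 3) world-from-sensor rotations over the window.
    gyro  : (n, 3) angular velocities in the sensor frame.
    accel : (n, 3) accelerations in the sensor frame.

    The frame is built from the first orientation of the window with
    its yaw removed, so gravity has a fixed direction for every sample.
    """
    # Yaw of the first orientation (extrinsic Z rotation, our convention)
    yaw = np.arctan2(R_ws[0, 1, 0], R_ws[0, 0, 0])
    c, s = np.cos(yaw), np.sin(yaw)
    Rz_inv = np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])
    out_g = np.empty_like(gyro)
    out_a = np.empty_like(accel)
    for i in range(len(gyro)):
        R = Rz_inv @ R_ws[i]  # gravity-aligned-frame-from-sensor rotation
        out_g[i] = R @ gyro[i]
        out_a[i] = R @ accel[i]
    return out_g, out_a
```

If all orientations in the window are pure yaw rotations, the output equals the raw sensor data, since the yaw of the anchor orientation is removed.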

The EKF propagates with raw IMU samples and uses network outputs for measurement updates. We define the measurement in a local gravity-aligned frame to decouple global yaw information from the relative state measurement (see Sec. V-D). The propagation from raw IMU data provides a model-based kinematic motion model, and the neural network provides a statistical motion model. The filter tightly couples these two sources of information.

At runtime, the raw IMU samples are interpolated to the network input frequency and rotated to a local gravity-aligned frame using rotations estimated from the filter state and gyroscope data. A gravity-aligned frame always has gravity pointing downward. Placing data in this frame implicitly gives the gravity direction information to the network.

Note that in the proposed approach, IMU data is used twice, both as direct input for state propagation and indirectly through the network as the measurement. This violates the independence assumptions the EKF is based on: the errors in the IMU network input (initial orientation, sensor bias, and sensor noise) would propagate to its output, which is then used to correct propagation results polluted with the same noise, which could lead to estimation inconsistency. We address this issue with training techniques (see Sec. IV) that reduce this error propagation by adding random perturbations to the IMU bias and gravity direction during training. We show in Section VII that these techniques successfully improve the robustness of the network to sensor bias and rotation inaccuracies.

IV. NEURAL STATISTICAL MOTION MODEL

A. Architecture and Loss Function Design

Our network uses a 1D version of the ResNet18 architecture proposed in [17]. The input dimension to the network is N × 6, consisting of N IMU samples in the gravity-aligned frame. The output of the network contains two 3D vectors: the displacement estimates d̂ and their uncertainties û, which parametrize the diagonal entries of the covariance. The two vectors have independent fully-connected blocks extending the main ResNet architecture.

We make use of two different loss functions during training: the Mean Square Error (MSE) and the Gaussian Maximum Likelihood loss. The MSE loss on the training dataset is defined as:

$$L_{MSE}(\hat{d}, d) = \frac{1}{n} \sum_{i=1}^{n} \| d_i - \hat{d}_i \|^2 \quad (1)$$

where $\hat{d} = \{\hat{d}_i\}_{i \le n}$ are the 3D displacement outputs of the network and $d = \{d_i\}_{i \le n}$ are the ground-truth displacements. $n$ is the number of data samples in the training set.

We define the Maximum Likelihood loss as the negative log-likelihood of the displacement according to the regressed Gaussian distribution:

$$L_{ML}(\hat{d}, \hat{\Sigma}, d) = -\frac{1}{n} \sum_{i=1}^{n} \log \left( \frac{1}{\sqrt{8\pi^3 \det(\hat{\Sigma}_i)}} \, e^{-\frac{1}{2} \| d_i - \hat{d}_i \|^2_{\hat{\Sigma}_i}} \right) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{2} \log \det(\hat{\Sigma}_i) + \frac{1}{2} \| d_i - \hat{d}_i \|^2_{\hat{\Sigma}_i} \right) + \mathrm{Cst} \quad (2)$$

where $\hat{\Sigma} = \{\hat{\Sigma}_i\}_{i \le n}$ are the 3 × 3 covariance matrices for the $i$th data sample as a function of the network uncertainty output vector $\hat{u}_i$. $\hat{\Sigma}_i$ has 6 degrees of freedom, and there are various covariance parametrizations for neural network uncertainty estimation [18]. In this paper, we simply assume a diagonal covariance output, parametrized by 3 coefficients written as:

$$\hat{\Sigma}_i(\hat{u}_i) = \mathrm{diag}(e^{2\hat{u}_{xi}}, e^{2\hat{u}_{yi}}, e^{2\hat{u}_{zi}}) \quad (3)$$

The diagonal assumption decouples each axis, while regressing the logarithm of the standard deviations removes the singularity around zero in the loss function, adding numerical stabilization and helping convergence in the optimization process. This choice constrains the principal axes of the uncertainty ellipses to be along the gravity-aligned frame axes.

B. Data Collection and Implementation Details

We use a dataset collected by a custom rig where an IMU (Bosch BMI055) is mounted on a headset rigidly attached to the cameras. The full dataset contains more than 400 sequences totaling 60 hours of pedestrian data that pictures a variety of activities including walking, standing still, organizing the kitchen, playing pool, going up and down the stairs, etc. It was captured with multiple different physical devices by more than 5 people, to depict a wide range of individual motion patterns and IMU systematic errors. A state-of-the-art visual-inertial filter based on [19] provides position estimates at 1000 Hz on the entire dataset. We use these results both as supervision data in the training set and as ground truth in the test set. The dataset is split into 80% training, 10% validation and 10% test subsets randomly.

For network training, we use an overlapping sliding window on each sequence to collect input samples. Each window contains N IMU samples of total size N × 6. In our final system we choose N = 200 for 200 Hz IMU data. We want the network to capture a motion model with respect to the gravity-aligned IMU frame; therefore the IMU samples in each window are rotated from the IMU frame to a gravity-aligned frame built from the orientation at the beginning of the window. We use the visual-inertial ground-truth rotation for that purpose. The supervision data for the network output is computed as the difference of the ground-truth position between two instants, expressed in the same headset-centered, gravity-aligned frame.

During training, because we assume the headset can be worn at an arbitrary heading angle with respect to the walking direction, we augment the input data for the network to be yaw invariant by applying a random horizontal rotation to each sample, following RoNIN [6]. In our final estimator, the network is fed with potentially inaccurate input from the filter, especially at the initialization stage. We simulate this at training time by random perturbations on the sensor bias and the gravity direction to reduce network sensitivity to these input errors. To simulate bias, we generate additive bias vectors with each component independently sampled from a uniform distribution in [−0.2, 0.2] m/s² or [−0.05, 0.05] rad/s for each input sample. Gravity direction is perturbed by rotating those samples around a random horizontal rotation axis with magnitude sampled from [0, 5°].
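The likelihood loss of Eq. (2) under the diagonal, log-standard-deviation parametrization of Eq. (3) can be sketched in a few lines of NumPy. This is our illustrative reconstruction, not the authors' code; the function and argument names are ours, and the additive constant of Eq. (2) is dropped.

```python
import numpy as np

def ml_loss(d, d_hat, u_hat):
    """Negative log-likelihood loss of Eq. (2) with the diagonal
    covariance parametrization of Eq. (3).

    d, d_hat : (n, 3) ground-truth and predicted displacements.
    u_hat    : (n, 3) regressed log standard deviations.
    Returns the mean loss over the n samples (constant term dropped).
    """
    # Diagonal covariance entries: sigma^2 = exp(2 * u_hat), Eq. (3)
    var = np.exp(2.0 * u_hat)
    # 0.5 * log det(Sigma_i) for a diagonal matrix reduces to sum(u_hat)
    log_det_term = np.sum(u_hat, axis=1)
    # 0.5 * squared Mahalanobis distance under the diagonal covariance
    maha_term = 0.5 * np.sum((d - d_hat) ** 2 / var, axis=1)
    return np.mean(log_det_term + maha_term)
```

Regressing û as a log standard deviation keeps the variance strictly positive without any clamping, which is the numerical-stability point made above.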

Optimization is done with the Adam optimizer. We used an initial learning rate of 0.0001, zero weight decay, and dropout with a probability of 0.5 for the fully connected layers. We observe that training with $L_{ML}$ directly does not converge. Therefore we first train with $L_{MSE}$ for 10 epochs until the network stabilizes, then switch to $L_{ML}$ until the network fully converges. It takes around another 10 epochs to converge, and a total of 4 hours of training time is needed on an NVIDIA DGX computer.

V. STOCHASTIC CLONING EXTENDED KALMAN FILTER

The EKF in our system tightly integrates the displacements predicted by the network with a statistical IMU model as used in other inertial navigation systems. As the displacement estimates from the network express constraints on pairs of past states, we adopt a stochastic cloning framework [20]. Similar to [19], we maintain a sliding window of m poses in the filter state. In contrast to [19], however, we only apply constraints between pairs of poses, and these constraints are derived from the network described in the preceding section, rather than from camera data.

A. State Definition

At each instant, the full state of the EKF is defined as:

$$X = (\xi_1, \ldots, \xi_m, s)$$

where $\xi_i$, $i = 1, \ldots, m$ are past (cloned) states, and $s$ is the current state. More specifically,

$$\xi_i = ({}^{w}_{i}R_i, {}^{w}p_i), \quad s = ({}^{w}_{i}R, {}^{w}v, {}^{w}p, b_g, b_a)$$

We express ${}^{w}_{i}R$ as the rotation matrix that transforms a point from the IMU frame to the world frame, and ${}^{w}v$ and ${}^{w}p$ are respectively the velocity and position expressed in the world frame. In the following, we will drop the explicit superscript for conciseness. $b_g$, $b_a$ are the IMU gyroscope and accelerometer biases.

As commonly done in such setups, we apply the error-state filtering methodology to linearize locally on the manifold of the minimal parametrization of the rotation. More specifically, the filter covariance is defined as the covariance of the following error-state:

$$\tilde{\xi}_i = (\tilde{\theta}_i, \delta\tilde{p}_i), \quad \tilde{s} = (\tilde{\theta}, \delta\tilde{v}, \delta\tilde{p}, \delta\tilde{b}_g, \delta\tilde{b}_a)$$

where the tilde indicates the error for every state, defined as the difference between the estimate and the real value, except for the rotation error, which is defined as $\tilde{\theta} = \log_{SO3}(R\hat{R}^{-1}) \in so(3)$, where $\log_{SO3}$ denotes the logarithm map of rotation. The full error-state is of dimension $6m + 15$, where $m$ is the total number of past states, and $\tilde{s}$ has dimension 15.

B. IMU Model

We assume that the IMU sensor provides direct measurements of the non-gravitational acceleration $a$ and angular velocity $\omega$, expressed in the IMU frame. We assume, as is common for this class of sensor, that the measurements are polluted with noise $n_{g,a}$ and bias $b_{g,a}$:

$$\omega = \omega_{true} + b_g + n_g, \quad a = a_{true} + b_a + n_a.$$

$n_g$ and $n_a$ are random noise variables following a zero-centered Gaussian distribution. The evolution of the biases is modeled as a random walk process with discrete parameters $\eta_{gd}$ and $\eta_{ad}$ over the IMU sampling period $\Delta t$.

C. State Propagation and Augmentation

The filter propagates the current state $s$ with IMU data using a kinematic motion model. If the current timestamp is associated with a measurement update, stochastic cloning is performed together with propagation in one step. During cloning, a new state $\xi$ is appended to the past states.

1) Propagation Model: We use the strapdown inertial kinematic equations, assuming a uniform gravity field ${}^{w}g$ and ignoring Coriolis forces and the earth's curvature:

$${}^{w}_{i}\hat{R}_{k+1} = {}^{w}_{i}\hat{R}_k \exp_{SO3}((\omega_k - \hat{b}_{gk})\Delta t) \quad (4)$$
$${}^{w}\hat{v}_{k+1} = {}^{w}\hat{v}_k + {}^{w}g\,\Delta t + {}^{w}_{i}\hat{R}_k (a_k - \hat{b}_{ak})\Delta t \quad (5)$$
$${}^{w}\hat{p}_{k+1} = {}^{w}\hat{p}_k + {}^{w}\hat{v}_k \Delta t + \tfrac{1}{2}\Delta t^2 \left({}^{w}g + {}^{w}_{i}\hat{R}_k (a_k - \hat{b}_{ak})\right) \quad (6)$$
$$\hat{b}_{g(k+1)} = \hat{b}_{gk} + \eta_{gdk} \quad (7)$$
$$\hat{b}_{a(k+1)} = \hat{b}_{ak} + \eta_{adk}. \quad (8)$$

Here, $\exp_{SO3}$ denotes the SO(3) exponential map, the inverse function of $\log_{SO3}$. The discrete-time noises $\eta_{gdk}$, $\eta_{adk}$ have been defined above and follow a Gaussian distribution.

The linearized error propagation can be written as:

$$\tilde{s}_{k+1} = A^{s}_{k(15,15)} \tilde{s}_k + B^{s}_{k(15,12)} n_k \quad (9)$$

where $n_k = [n_{\omega k}, n_{ak}, \eta_{gdk}, \eta_{adk}]^T$ is a vector containing all random inputs. The superscript $s$ indicates that these matrices correspond to the current state $s$, and the subscript brackets indicate matrix dimensions. The corresponding propagation of the state covariance $P$ is:

$$P_{k+1} = A_k P_k A_k^T + B_k W B_k^T, \quad (10)$$
$$A_k = \begin{bmatrix} I_{6m} & 0 \\ 0 & A^{s}_k \end{bmatrix}, \quad B_k = \begin{bmatrix} 0 \\ B^{s}_k \end{bmatrix} \quad (11)$$

with $W_{(12,12)}$ the covariance matrix of the sensor noise and bias random walk noise. The state covariance $P$ is of the dimension $(6m + 15)$ of the full state $X$. In the implementation, multiple propagation steps can be combined together to reduce the computational cost [20].

2) State Augmentation: In our system, state augmentations are performed at the measurement update frequency. During an augmentation step, the state dimension is incremented through propagation with cloning:

$$P_{k+1} = \bar{A}_k P_k \bar{A}_k^T + \bar{B}_k W \bar{B}_k^T, \quad (12)$$
$$\bar{A}_k = \begin{bmatrix} I_{6m} & 0 \\ 0 & A^{\xi}_k \\ 0 & A^{s}_k \end{bmatrix}, \quad \bar{B}_k = \begin{bmatrix} 0 \\ B^{\xi}_k \\ B^{s}_k \end{bmatrix} \quad (13)$$

$\bar{A}_k$ is now a copy operation plus the augmented and current state propagation operations. $A^{\xi}_k$ and $B^{\xi}_k$ are partial propagation matrices for rotation and position only. After the increment, the dimension of the state vector increases by 6. Old past states will be pruned in the marginalization step, see Sec. V-E.
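The strapdown propagation of Eqs. (4)–(6) can be sketched as follows. This is our illustrative reconstruction under stated assumptions: the z-up gravity constant and function names are ours, and the bias random walk of Eqs. (7)–(8) is left out (biases held constant over the step).

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])  # world gravity, z-up convention (assumption)

def exp_so3(phi):
    """SO(3) exponential map via Rodrigues' formula."""
    angle = np.linalg.norm(phi)
    if angle < 1e-12:
        return np.eye(3)
    axis = phi / angle
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def propagate(R, v, p, bg, ba, omega, a, dt):
    """One strapdown step implementing Eqs. (4)-(6).

    R, v, p : world-frame rotation, velocity, position estimates.
    bg, ba  : gyroscope and accelerometer bias estimates.
    omega, a: raw IMU angular velocity and acceleration samples.
    """
    acc_w = R @ (a - ba)                               # bias-corrected accel in world frame
    R_new = R @ exp_so3((omega - bg) * dt)             # Eq. (4)
    v_new = v + GRAVITY * dt + acc_w * dt              # Eq. (5)
    p_new = p + v * dt + 0.5 * dt**2 * (GRAVITY + acc_w)  # Eq. (6)
    return R_new, v_new, p_new
```

For a stationary sensor whose accelerometer reads exactly the gravity reaction, velocity and position remain unchanged, which is a quick sanity check on the sign conventions.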

D. Measurement Update

It would be natural to define the measurement function as the displacement $d$ expressed in the world frame; however, such a measurement function would imply heading observability at the filter level. Heading is theoretically unobservable, as the IMU propagation model and the learned prior are invariant to a change of yaw angle. In order to prevent injecting spurious information into the filter, we carefully define the measurement function $h$ as the 3D displacement in a local gravity-aligned frame. This frame is anchored to a clone state $i$. Vectors in this frame can be obtained by rotating the corresponding world frame vectors by the yaw rotation matrix $R_\gamma$ of the state rotation matrix $R_i$. We decompose $R_i$ using the extrinsic "XYZ" Euler angle convention: $R_i = R_\gamma R_\beta R_\alpha$, where $\alpha$, $\beta$, $\gamma$ correspond to roll, pitch and yaw respectively. $h$ then writes:

$$h(X) = R_\gamma^T ({}^{w}p_j - {}^{w}p_i) = \hat{d}_{ij} + \eta_{d_{ij}}. \quad (14)$$

$\hat{d}_{ij}$ is the network output displacement between states $i$ and $j$, using IMU samples rotated to the local gravity-aligned frame anchored at pose $i$ as input. $\eta_{d_{ij}}$ is a random variable that, we assume, follows the normal distribution $\mathcal{N}(0, \hat{\Sigma}_{ij})$ given by the network.

The linearization of the measurement function $h$ with respect to the error state yields the linearized measurement matrix $H_{(3,6m+15)}$. It has non-zero values only in the 3 × 3 blocks corresponding to $\tilde{\theta}_i$, $\delta\tilde{p}_i$ and $\delta\tilde{p}_j$:

$$H_{\tilde{\theta}_i} = \frac{\partial h(X)}{\partial \tilde{\theta}_i} = \hat{R}_\gamma^T \lfloor {}^{w}\hat{p}_j - {}^{w}\hat{p}_i \rfloor_\times H_z \quad (15)$$
$$H_{\delta\tilde{p}_i} = \frac{\partial h(X)}{\partial \delta\tilde{p}_i} = -\hat{R}_\gamma^T \quad (16)$$
$$H_{\delta\tilde{p}_j} = \frac{\partial h(X)}{\partial \delta\tilde{p}_j} = \hat{R}_\gamma^T \quad (17)$$
$$\text{where } H_z = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ \cos\gamma\tan\beta & \sin\gamma\tan\beta & 1 \end{bmatrix} \quad (18)$$

In the above expression, $\lfloor x \rfloor_\times$ is the skew-symmetric matrix built from vector $x$. The singularity $\cos\beta = 0$ occurs when the person wearing the headset is looking straight down or straight up: we simply discard the update in these cases. Note that these situations are unlikely and never occur in our dataset.

Finally, $H$ and $\hat{\Sigma}_{ij}$ are used to compute the Kalman gain to update the state $X$ and the covariance $P$ as follows:

$$K = P H^T (H P H^T + \hat{\Sigma}_{ij})^{-1} \quad (19)$$
$$X \leftarrow X \oplus K(h(X) - \hat{d}_{ij}) \quad (20)$$
$$P \leftarrow (I - KH)P(I - KH)^T + K \hat{\Sigma}_{ij} K^T \quad (21)$$

The operator $\oplus$ denotes the regular addition operation, except for rotation, where the update operation writes $R \leftarrow \mathrm{Exp}(\tilde{\theta})R$.

We use a $\chi^2$ test to protect the filter estimate against occasional wrong network outputs: we discard the update when the normalized innovation error $\|d_{ij} - \hat{d}_{ij}\|_{HPH^T + \hat{\Sigma}_{ij}}$ is greater than a threshold. We choose here the threshold value of 11.345, which corresponds to the 99th percentile of the $\chi^2$ distribution with 3 degrees of freedom.

In practice, when using the measurement covariance $\hat{\Sigma}_{ij}$ of the network, we scale the covariance by a factor of 10 to compensate for the temporal correlation of measurements, as noted in Sec. VII-A1.

E. State Size, Marginalization and Initialization

As soon as an update is processed, all states prior to this update window are marginalized. This is performed by removing the corresponding $\xi$ from the state and the corresponding covariance entries. The number of states kept in the filter depends on two parameters: (i) the displacement duration window for which the network is trained and (ii) the update frequency. For instance, for an update frequency of 20 Hz and a window of 1 s, there will be at most 21 past states in the filter. The two parameters can be set independently: if the update period is smaller than the displacement window, the consecutive measurements will be generated from overlapping windows of IMU data, while if the update period is too long, some IMU measurements would not be given to the network. We observe that our system performs better with a higher update frequency (see Sec. VII-B2), and we use 20 Hz in our final system.

Our EKF needs a good initialization to converge. In this work we assume the initial state is given by an external procedure. In our experiments we initialize the speed, roll, and pitch with the ground-truth estimate, assigning a large covariance, while the yaw and the position are initialized with a strong prior in order to fix the gauge. Biases are initialized with the values of an initial factory calibration, and we refine the biases online to account for their evolution due to turn-on change, temperature effects, or aging.

In practice, we use the following values to initialize the state covariance: $\sigma_{wv} = 0.1$ m/s, $\sigma_{ba} = 0.2$ m/s², $\sigma_{bg} = 1 \cdot 10^{-4}$ rad/s, $\sigma_\theta = \mathrm{diag}(10, 10, 0.1)$ deg. We also compensate all input IMU samples for non-orthogonality, scale factor and g-sensitivity from the factory calibration.

VI. EXPERIMENTAL RESULTS

We evaluate our system on the test split of our dataset. Each of these 37 trajectories contains between 3 and 7 minutes of human activities. We compare the performance of two pose estimators against a state-of-the-art VIO implementation based on [19] and considered as ground truth (GT):

• TLIO: the tightly-coupled EKF with displacement updates from the trained network presented in this work.
• 3D-RONIN: an estimator where the displacements from the same trained network are concatenated in the direction given by an engineered AHRS attitude filter, resembling what smartphones have. We borrow the name from [6], but we extended the method to 3D and trained the network on our dataset to get a fair comparison. Note that our dataset contains trajectories involving climbing staircases, sitting motions, and walking outdoors on uneven terrain, where the 2D PDR method of [6] is not directly applicable.

A. Metrics Definitions

In order to assess our system performance, for each dataset of length n, we define the following metrics:

TLIO 3D-RONIN • Yaw-DR (◦ /hour): (γn − γ̂n )/(sequence-duration) is the


Yaw Drift over time.

B. System Performance
Figure 3 shows the distribution of the metrics across the
entire test set. It demonstrates that our system consistently
performs better than the decoupled orientation and position
estimator 3 D - RONIN on all metrics.
On 3D position, TLIO performs better than 3 D - RONIN which
integrates average velocities. Integration approach has the
advantage of being robust to outliers since the measurements
at each instant are decoupled. The result shows not only the
Fig. 3: Comparing TLIO to 3 D - RONIN on the error metrics with respect benefit of integrating IMU kinematic model, but also the overall
to the ground-truth. Each plot shows the cumulative density function robustness of the filter, which comes from the quality of the
of the chosen metric on the entire test set. The steeper the plots are covariance output from the network and the effectiveness of
the better the performance. TLIO shows significantly lower value than outlier rejection with χ2 test.
3 D - RONIN for all presented metrics.
Our system also has a smaller yaw drift than 3 D - RONIN,
indicating that even without any hand engineered heuristics
or step detection, using displacement estimates from a trained
statistical model outperforms a smartphone AHRS attitude
filter. This shows that the EKF can accurately estimate not
only position and velocity, but also gyroscope biases.
Figure 4 shows a hand-picked collection of 7 trajectories
containing the best, typical and worst cases of TLIO and
3 D - RONIN, while Figure 1 shows a 3D visualization of one
additional sequence. See the corresponding captions for more
details about the trajectories and failure cases.

Fig. 4: Selection of trajectories. 1.x and 2.x are the good and hard cases, and 3.x shows examples of failures. 2.a shows a case with very wrong network outputs. 2.b shows a case of bad initialization in the filter leading to yaw drift. In 3.a and 3.b, unusual motions not present in the training set are performed: side-stepping repeatedly for 3 minutes and rolling on a chair.

• ATE (m): $\sqrt{\frac{1}{n}\sum_{i=1}^{n} \| {}^{w}p_i - {}^{w}\hat{p}_i \|^2}$
The Absolute Translation Error indicates the spatial closeness of the position estimate to the GT over the whole trajectory, computed as the root-mean-square error (RMSE) between these sets of values.

• RTE-$\Delta t$ (m): $\sqrt{\frac{1}{n}\sum_{i=1}^{n} \| ({}^{w}p_{i+\Delta t} - {}^{w}p_i) - R_{\gamma}\hat{R}_{\gamma}^{T}({}^{w}\hat{p}_{i+\Delta t} - {}^{w}\hat{p}_i) \|^2}$
The Relative Translation Error indicates the local spatial closeness of the position estimate to the GT over a window of duration $\Delta t$. We use 1 s in this analysis. We remove the yaw drift at the beginning of the window so that the relative measure is not affected by accumulated errors.

• DR (%): $\| {}^{w}p_n - {}^{w}\hat{p}_n \| / \text{trajectory-length}$
The final translation drift over the distance traveled.

We compute similar metrics for the yaw angle $\gamma$, measuring the quality of the unobservable orientation around gravity:

• AYE (°): $\sqrt{\frac{1}{n}\sum_{i=1}^{n} \| \gamma_i - \hat{\gamma}_i \|^2}$ is the Absolute Yaw Error.

• RYE-$\Delta t$ (°): $\sqrt{\frac{1}{n}\sum_{i=1}^{n} \| (\gamma_{i+\Delta t} - \gamma_i) - (\hat{\gamma}_{i+\Delta t} - \hat{\gamma}_i) \|^2}$ is the Relative Yaw Error.

VII. COMPONENTS AND VARIATION STUDIES

We investigate different variations of the network settings in three aspects: the input IMU frequency, the time interval $\Delta t_{ij}$ over which the displacement is computed, and the total time interval included in the network input. We do not consider future data because we aim for a causal filter. The naming convention we use is built from these aspects. For example, 200hz-05s-3s is a model that takes 3 s of 200 Hz input data and regresses the displacement over the last 0.5 s. To easily keep track of the different models, in all following figures we prefix with "•" the version of our algorithm whose results were presented in Sec. VI: 200hz-1s-1s.

A. Network Statistical Model

In this section, we evaluate the deep learning component of our system in isolation.

1) Does using more data help?: Figure 5 shows a comparison of several variations of the window duration, keeping the IMU frequency fixed at 200 Hz. Surprisingly, a lower MSE loss over the same displacement window does not imply a lower ATE. This is because the MSE does not say anything about the sum of the errors: the network can make more accurate predictions per sample by seeing more data, while at the same time producing more correlated results as the inputs overlap over a longer period of time. This introduces a trade-off when integrating displacements, and we observe similar trajectory ATEs across the network time window variants. For further experiments we use 200hz-1s-1s except where noted otherwise.
Fig. 5: Network variations with different past and total window sizes evaluated on the test set. Upper left: ATE of the 3D-RONIN concatenation result compared to GT trajectories. Upper right: average MSE loss on the test set. Adding past data to the network input reduces the average MSE loss over samples, but this does not imply a lower ATE over trajectories. Left: auto-correlation function of the norm of the network displacement error over one sequence. The norms are computed at 20 Hz using overlapping windows of IMU data. Models with more input data have lower MSE but greater temporal correlation of errors.

Fig. 7: Sensitivity of the network prediction to bias and gravity direction noise. At test time the input IMU samples are perturbed with accelerometer bias, gyroscope bias, and gravity direction noise as described in Sec. IV-B but with various ranges. The three models under comparison are trained with different data augmentation procedures: no perturbation (noptrb), perturbing only the bias (bias), and perturbing both the bias and the gravity direction (bias-grav). We observe that training with perturbation significantly improves robustness to accelerometer bias and gravity direction errors.
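The bias and gravity-direction perturbations evaluated in Fig. 7 can be sketched roughly as below. This is a hypothetical illustration of such an augmentation, not the paper's implementation; `perturb_window` and the numeric ranges are made up for the example.

```python
import numpy as np

def perturb_window(gyr, acc, rng, bias_range=0.02, grav_angle_range=np.pi / 36):
    """Hypothetical IMU-window augmentation sketch: add a random constant bias
    to each channel, then tilt the whole window by a small rotation about a
    random horizontal axis (perturbing the assumed gravity direction).
    `gyr`, `acc` are (n, 3) arrays; the ranges are illustrative only."""
    gyr = gyr + rng.uniform(-bias_range, bias_range, size=3)
    acc = acc + rng.uniform(-bias_range, bias_range, size=3)
    # Small rotation about a random unit axis in the horizontal plane.
    theta = rng.uniform(0.0, grav_angle_range)
    phi = rng.uniform(0.0, 2.0 * np.pi)
    axis = np.array([np.cos(phi), np.sin(phi), 0.0])
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    # Rodrigues' formula: R = I + sin(theta) K + (1 - cos(theta)) K^2
    R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    return gyr @ R.T, acc @ R.T
```

Because the tilt is a proper rotation, it preserves sample norms; only the direction of gravity relative to the sensor frame changes.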

Fig. 6: Uncertainty returned by the network (standard deviation σ) plotted against the errors (m) on the three axes x, y, z respectively. Points with larger errors have larger variance outputs, and over 99% of the points fall inside the 3σ cone region indicated by the dashed red line. We have 0.70% (x, y) and 0.47% (z) of error points outside the 3σ bounds respectively.

2) Consistency of learned covariance: We collected around 60 000 samples at 5 Hz on the entire test set for this analysis and the sensitivity analysis section below. Fig. 6 plots the sigma outputs of the network against the errors on the displacement. We observe that over 99% of the error points fall inside the 3σ region, showing that the network outputs are statistically consistent. In addition, we compute the Mahalanobis distance of the output displacement errors scaled by the output covariances in 3D. We found that only 0.30% of the samples were beyond the 99-percentile critical value of the χ² distribution, all of which are from the failure case shown in Fig. 4.3.b. From this analysis, we conclude that the uncertainty output of the network grows as expected with its own error, albeit often being slightly conservative. We observe an asymmetry between the horizontal and vertical error distributions; along the vertical axis, errors and sigmas concentrate at lower values. This is not surprising: people usually walk either on flat ground or on stairs, which may make expressing a prior easier on the z axis.

3) Sensitivity Analysis: Figure 7 shows the robustness of the network output to input perturbations on bias and gravity direction. We observe that the models trained with bias perturbation are more robust to IMU bias, in particular accelerometer bias. The model trained with gravity perturbation is significantly more robust to gravity direction errors. These training techniques effectively reduce the propagation of errors from input to output, protecting the independence assumption required in a Kalman Filter framework. We therefore choose 200hz-1s-1s trained with bias and gravity perturbation as our system model.

B. EKF System

1) Ablation Study: We compare system variants using a hand-tuned constant covariance and networks trained without the likelihood loss to show the benefit of using the regressed uncertainty in TLIO. To find the best constant covariance parameters, we use the standard deviation of the measurement errors on the test set, diag(0.051, 0.051, 0.013) m, multiplied by a factor yielding the best median ATE, tuned by grid search. Fig. 8 shows the Cumulative Density Function of ATE, Drift and AYE over the test set for the various system configurations. We observe that with 3D-RONIN, training the network with only the MSE loss gives a more accurate estimate. However, with a fixed covariance, these network variants achieve similar performance in a filter system. What makes a difference is the covariance regressed from the network, which significantly improves the filter in ATE and Drift. We also notice one dataset where using a fixed covariance loses track due to an initialization failure, showing that the adaptive property of the regressed uncertainty for different inputs improves system robustness. Compared to 3D-RONIN-mse, TLIO improves the average yaw and position drift by 27% and 31% respectively.

2) Timing parameters: Fig. 9 shows the performance comparison between variations of the network IMU frequency, $\Delta t_{ij}$, and the filter measurement-update frequency. ATE and drift are reduced when using high-frequency measurements. Once again, we observe only minor differences between the network parameter variations on the monitored metrics. This shows that our entire pipeline is not sensitive to these parameter choices, and TLIO outperforms 3D-RONIN in all cases.
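The covariance consistency check described above, comparing Mahalanobis distances of displacement errors against a χ² critical value (also the basis of the filter's outlier rejection), can be sketched as follows. A diagonal regressed covariance is assumed; the function names are illustrative, not the paper's code.

```python
import numpy as np

# 99th-percentile critical value of the chi-square distribution with 3 DoF
# (scipy.stats.chi2.ppf(0.99, 3) ≈ 11.345)
CHI2_3DOF_99 = 11.345

def mahalanobis_sq(d_err, sigma):
    """Squared Mahalanobis distance of (n, 3) displacement errors,
    assuming a diagonal covariance with per-axis std-devs `sigma`."""
    return np.sum((d_err / sigma) ** 2, axis=1)

def chi2_outlier_mask(d_err, sigma, crit=CHI2_3DOF_99):
    """True for samples beyond the critical value (candidates for rejection)."""
    return mahalanobis_sq(d_err, sigma) > crit
```

For a statistically consistent estimator, roughly 1% of samples should exceed the 99-percentile threshold; a much smaller fraction, as reported above, indicates a slightly conservative uncertainty estimate.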
Fig. 8: Performance comparison showing the effectiveness of using a learned covariance. TLIO-mse and 3D-RONIN-mse use a network trained with only the MSE loss. TLIO-fixcov and TLIO-mse use a hand-tuned constant covariance for all the measurements. 3D-RONIN and 3D-RONIN-mse concatenate displacement estimates directly without considering uncertainty. TLIO achieves the best performance with the covariance regressed from the network. Constant covariance approaches do not reach 100% for ATE and drift in this illustration due to a failure case.
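Why a regressed covariance helps can be seen from the structure of the Kalman update: the measurement noise R scales the gain, so an adaptive R automatically down-weights uncertain network outputs. The sketch below is a generic linear update with an identity measurement model, not the paper's stochastic-cloning EKF.

```python
import numpy as np

def kalman_update(x, P, z, R):
    """Standard Kalman measurement update with measurement model H = I.
    A larger (regressed) R yields a smaller gain K, down-weighting the
    network's displacement measurement z when it reports high uncertainty."""
    H = np.eye(len(x))
    S = H @ P @ H.T + R             # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new
```

With equal prior and measurement covariances the update moves halfway toward the measurement; inflating R by 9× shrinks the step to one tenth, which is the mechanism the learned covariance exploits per sample.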
sensors,” Masters Degree Project, Stockholm, Sweden, 2008.
[10] J. Borenstein, L. Ojeda, and S. Kwanmuang, “Heuristic reduction of gyro
2hz 4hz 10hz • 20hz ronin
drift in IMU-based personnel tracking systems,” in Optics and Photonics
ATE (m) Drift (%) in Global Homeland Security V and Biometric Technology for Human
6 8
4 6 Identification VI, vol. 7306, 2009, p. 73061H.
4 [11] H. Ju, S. Y. Park, and C. G. Park, “Foot mounted inertial navigation
2
2 system using estimated velocity during the contact phase,” in IPIN, 2016.
0 0 [12] O. Dehzangi, M. Taherisadr, and R. ChangalVala, “IMU-based gait
AYE (m) RYE (deg) recognition using convolutional neural networks and multi-sensor fusion,”
6 0.2 Sensors, vol. 17, no. 12, p. 2735, 2017.
4 [13] B. Beaufils, F. Chazal, M. Grelet, and B. Michel, “Robust stride detector
0.1
2 from ankle-mounted inertial sensors for pedestrian navigation and activity
0 0.0 recognition with machine learning approaches,” Sensors, vol. 19, no. 30,
s
-.5 -1s-1
s 2s -.5
s
-1s 2s-2
s s
-.5 -1s-1
s 2s -.5
s
-1s 2s-2
s p. 4491, 2019.
.5s 2s- .5s -1s .5s 2s- .5s -1s
hz- hz hz- hz- hz 200h
z-
hz- hz hz- hz- hz 200h
z- [14] H. Yan, Q. Shan, and Y. Furukawa, “RIDI: Robust IMU double
100 100 100 200 2 00 100 100 100 200 2 00
• • integration,” in Proceedings of the European Conference on Computer
Vision (ECCV), 2018, pp. 621–636.
Fig. 9: Performance of different system configurations. Each group corresponds to a network model, as indicated by the x-axis labels. Within each group, the colored boxplots differ by update frequency, and the grey ones show 3D-RONIN as a baseline. The filtering approach consistently outperforms 3D-RONIN on all the experimented models. A high update frequency yields the best ATE and drift in spite of the temporal correlation of the measurements. RYE-1s increases with update frequency, indicating more jitter in the yaw estimates.
VIII. CONCLUSION

In this paper we propose a tightly-coupled inertial odometry algorithm introducing a learned component in an EKF framework. We present a network regressing 3D displacement and uncertainty, and an EKF fusing the displacement measurements to estimate the full state. We train a network that learns a prior on the displacement distribution given IMU data from statistical motion patterns. We show through experimental results and variation studies that the network outputs are statistically consistent, and that the filter outperforms the state-of-the-art trajectory estimator using velocity integration on position estimates, as well as a model-based AHRS attitude filter on orientation. This demonstrates that with a learned prior, an IMU sensor alone can provide enough information for low-drift pose estimation and calibration for pedestrian dead-reckoning. As is common with learning approaches, the system is limited by the scope of the training data. Unusual motions cause system failure, as discussed with examples. Whether a similar approach can generalize to wider use cases such as legged robots remains an unexplored but promising field of research.
