

DPC-Net: Deep Pose Correction for Visual Localization

Valentin Peretroukhin1 and Jonathan Kelly1

arXiv:1709.03128v3 [cs.CV] 10 Sep 2018

Abstract—We present a novel method to fuse the power of deep networks with the computational efficiency of geometric and probabilistic localization algorithms. In contrast to other methods that completely replace a classical visual estimator with a deep network, we propose an approach that uses a convolutional neural network to learn difficult-to-model corrections to the estimator from ground-truth training data. To this end, we derive a novel loss function for learning SE(3) corrections based on a matrix Lie groups approach, with a natural formulation for balancing translation and rotation errors. We use this loss to train a Deep Pose Correction network (DPC-Net) that predicts corrections for a particular estimator, sensor and environment. Using the KITTI odometry dataset, we demonstrate significant improvements to the accuracy of a computationally-efficient sparse stereo visual odometry pipeline that render it as accurate as a modern computationally-intensive dense estimator. Further, we show how DPC-Net can be used to mitigate the effect of poorly calibrated lens distortion parameters.

Index Terms—Deep Learning in Robotics and Automation, Localization

Fig. 1. We propose a Deep Pose Correction network (DPC-Net) that learns SE(3) corrections to classical visual localizers. The classical pipeline (tracking, outlier rejection, non-linear optimization) maps stereo images to an SE(3) estimate, while DPC-Net passes the same stereo images through a stack of convolution, PReLU, and dropout layers to produce \xi \in \mathbb{R}^6; its exponential, \exp(\xi^\wedge), is an SE(3) correction that is fused with the estimate through pose graph relaxation to yield the corrected pose.

I. INTRODUCTION

Deep convolutional neural networks (CNNs) are at the core of many state-of-the-art classification and segmentation algorithms in computer vision [1]. These CNN-based techniques achieve accuracies previously unattainable by classical methods. In mobile robotics and state estimation, however, it remains unclear to what extent these deep architectures can obviate the need for classical geometric modelling. In this work, we focus on visual odometry (VO): the task of computing the egomotion of a camera through an unknown environment with no external positioning sources. Visual localization algorithms like VO can suffer from several systematic error sources that include estimator biases [2], poor calibration, and environmental factors (e.g., a lack of scene texture). While machine learning approaches can be used to better model specific subsystems of a localization pipeline (e.g., the heteroscedastic feature track error covariance modelling of [3], [4]), much recent work [5]–[9] has been devoted to completely replacing the estimator with a CNN-based system.

We contend that this type of complete replacement places an unnecessary burden on the CNN. Not only must it learn projective geometry, but it must also understand the environment, and account for sensor calibration and noise. Instead, we take inspiration from recent results [10] that demonstrate that CNNs can infer difficult-to-model geometric quantities (e.g., the direction of the sun) to improve an existing localization estimate. In a similar vein, we propose a system (Figure 1) that takes as its starting point an efficient, classical localization algorithm that computes high-rate pose estimates. To it, we add a Deep Pose Correction network (DPC-Net) that learns low-rate, 'small' corrections from training data that we then fuse with the original estimates. DPC-Net does not require any modification to an existing localization pipeline, and can learn to correct multi-faceted errors from estimator bias, sensor mis-calibration or environmental effects.

Although in this work we focus on visual data, the DPC-Net architecture can be readily modified to learn SE(3) corrections for estimators that operate with other sensor modalities (e.g., lidar). For this general task, we derive a pose regression loss function and a closed-form analytic expression for its Jacobian. Our loss permits a network to learn an unconstrained Lie algebra coordinate vector, but derives its Jacobian with respect to SE(3) geodesic distance.

In summary, the main contributions of this work are
1) the formulation of a novel deep corrective approach to egomotion estimation,
2) a novel cost function for deep SE(3) regression that naturally balances translation and rotation errors, and
3) an open-source implementation of DPC-Net in PyTorch1.

Manuscript received: September 10, 2017; Revised: November 24, 2017; Accepted: November 8, 2017.
1 Both authors are with the Space & Terrestrial Autonomous Robotic Systems (STARS) laboratory at the University of Toronto Institute for Aerospace Studies (UTIAS), Canada. <firstname>.<lastname>@robotics.utias.utoronto.ca
1 See https://github.com/utiasSTARS/dpc-net.

II. RELATED WORK

Visual state estimation has a rich history in mobile robotics. We direct the reader to [11] and [12] for detailed surveys of what we call classical, geometric approaches to visual odometry (VO) and visual simultaneous localization and mapping (SLAM).

In the past decade, much of machine learning and its sub-disciplines has been revolutionized by carefully constructed deep neural networks (DNNs) [1]. For tasks like image segmentation, classification, and natural language processing, most prior state-of-the-art algorithms have been replaced by their DNN alternatives.

In mobile robotics, deep neural networks have ushered in a new paradigm of end-to-end training of visuomotor policies [13] and significantly impacted the related field of reinforcement learning [14]. In state estimation, however, most successful applications of deep networks have aimed at replacing a specific sub-system of a localization and mapping pipeline (for example, object detection [15], place recognition [16], or bespoke discriminative observation functions [17]).

Nevertheless, a number of recent approaches have presented convolutional neural network architectures that purport to obviate the need for classical visual localization algorithms. For example, Kendall et al. [7], [18] presented extensive work on PoseNet, a CNN-based camera re-localization approach that regresses the 6-DOF pose of a camera within a previously explored environment. Building upon PoseNet, Melekhov et al. [8] applied a similar CNN learning paradigm to relative camera motion. In related work, Costante et al. [5] presented a CNN-based VO technique that uses pre-processed dense optical flow images. Focusing on RGB-D odometry, Handa et al. [19] detailed an approach to learning relative poses with 3D Spatial Transformer modules and a dense photometric loss. Finally, Oliveira et al. [9] and Clark et al. [6] described techniques for more general sensor fusion and mapping. The former work outlined a DNN-based topometric localization pipeline with separate VO and place recognition modules, while the latter presented VINet, a CNN paired with a recurrent neural network for visual-inertial sensor fusion and online calibration.

With this recent surge of work in end-to-end learning for visual localization, one may be tempted to think this is the only way forward. It is important to note, however, that these deep CNN-based approaches do not yet report state-of-the-art localization accuracy, focusing instead on proof-of-concept validation. Indeed, at the time of writing, the most accurate visual odometry approach on the KITTI odometry benchmark leaderboard2 remains a variant of a sparse feature-based pipeline with carefully hand-tuned optimization [20].
2 See http://www.cvlibs.net/datasets/kitti/eval_odometry.php.

Taking inspiration from recent results that show that CNNs can be used to inject global orientation information into a visual localization pipeline [10], and leveraging ideas from recent approaches to trajectory tracking in the field of controls [21], [22], we formulate a system that learns pose corrections to an existing estimator, instead of learning the entire localization problem ab initio.

III. SYSTEM OVERVIEW: DEEP POSE CORRECTION

We base our network structure on that of [19] but learn SE(3) corrections from stereo images and require no pre-training. Similar to [19], we use primarily 3x3 convolutions and avoid the use of max pooling to preserve spatial information. We achieve downsampling by setting the stride to 2 where appropriate (see Figure 2 for a full description of the network structure). We derive a novel loss function, unlike that used in [7]–[9], based on SE(3) geodesic distance. Our loss naturally balances translation and rotation error without requiring a hand-tuned scalar hyper-parameter. Similar to [5], we test our final system on the KITTI odometry benchmark, and evaluate how it copes with degraded visual data.

Fig. 2. The Deep Pose Correction network with stereo RGB image data (a 6xHxW stereo pair at t_i and another at t_{i+\Delta p}). The network learns a map from two stereo pairs to a vector of Lie algebra coordinates, \xi \in \mathbb{R}^6, which is compared to the target SE(3) correction T^* through the SE(3) cost L(\xi, T^*). Each darker blue block consists of a convolution, a PReLU non-linearity, and a dropout layer. We opt to not use MaxPooling in the network, following [19]. The labels correspond to the stride, kernel size, and number of input and output channels for each convolution: 2,3x3,6,64; 2,3x3,64,64; 1,3x3,64,128 in each input branch, followed by 2,3x3,256,256; 2,3x3,256,512; 2,3x3,512,1024; 2,3x3,1024,4096; 2,3x3,4096,4096; 2,3x3,4096,6; and a final 2,1x2,6,6 convolution.
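As a rough illustration of the structure summarized in Fig. 2, the following PyTorch sketch stacks convolution/PReLU/dropout blocks with the listed strides and channel counts. The two-branch front end (one branch per stereo pair, with shared weights), the padding, the dropout rate, and the final spatial averaging are our assumptions and are not specified by the figure; the authors' open-source implementation should be consulted for the exact architecture.

import torch
import torch.nn as nn

def conv_block(stride, in_ch, out_ch, kernel=3, dropout=0.3):
    # Each block: convolution -> PReLU -> dropout. No max pooling is used;
    # downsampling comes from stride-2 convolutions, as described above.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=kernel, stride=stride, padding=kernel // 2),
        nn.PReLU(),
        nn.Dropout(dropout),
    )

class DPCNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # One branch per 6-channel stereo pair (left + right RGB).
        self.branch = nn.Sequential(
            conv_block(2, 6, 64),
            conv_block(2, 64, 64),
            conv_block(1, 64, 128),
        )
        # Trunk operating on the concatenated branch outputs (128 + 128 = 256 channels).
        self.trunk = nn.Sequential(
            conv_block(2, 256, 256),
            conv_block(2, 256, 512),
            conv_block(2, 512, 1024),
            conv_block(2, 1024, 4096),
            conv_block(2, 4096, 4096),
            conv_block(2, 4096, 6),
        )
        # Final 1x2 convolution collapsing the remaining spatial extent.
        self.final = nn.Conv2d(6, 6, kernel_size=(1, 2), stride=2)

    def forward(self, pair_i, pair_ip):
        # pair_i, pair_ip: (B, 6, H, W) stereo pairs at t_i and t_{i+dp}.
        features = torch.cat([self.branch(pair_i), self.branch(pair_ip)], dim=1)
        features = self.final(self.trunk(features))
        # Assumed reduction to a 6-vector of Lie algebra coordinates.
        return features.mean(dim=(2, 3))

With 120x400 inputs (the resized KITTI images discussed in Section IV-A), the listed strides reduce the feature map to a single spatial location, so the final averaging is effectively a squeeze.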
Given two coordinate frames F_i and F_{i+\Delta p} that represent a camera's pose at times t_i and t_{i+\Delta p} (where \Delta p is an integer hyper-parameter that allows DPC-Net to learn corrections across multiple temporally consecutive poses), we assume that our visual localizer gives us an estimate3, \hat{T}_{i,i+\Delta p}, of the true transform T_{i,i+\Delta p} \in SE(3) between the two frames. We aim to learn a target correction,

T^*_i = T_{i,i+\Delta p} \hat{T}_{i,i+\Delta p}^{-1},   (1)

from two pairs of stereo images (collectively referred to as I_{t_i,t_{i+\Delta p}}) captured at t_i and t_{i+\Delta p}4. Given a dataset, \{T^*_i, I_{t_i,t_{i+\Delta p}}\}_{i=1}^{N}, we now turn to the problem of selecting an appropriate loss function for learning SE(3) corrections.

3 We use (\hat{\cdot}) to denote an estimated quantity throughout the paper.
4 Note that the visual estimator does not necessarily compute \hat{T}_{i,i+\Delta p} directly. \hat{T}_{i,i+\Delta p} may be compounded from several estimates.
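A minimal numpy sketch of assembling the training targets of Eq. (1) from a ground-truth trajectory and an estimator trajectory. It assumes both are given as lists of 4x4 homogeneous camera-to-world matrices; the frame conventions here are assumptions and must match the particular estimator being corrected.

import numpy as np

def relative_transform(T_w_a, T_w_b):
    # Transform expressing frame b in frame a, T_{a,b} = T_w_a^-1 T_w_b.
    return np.linalg.inv(T_w_a) @ T_w_b

def build_targets(T_w_gt, T_w_vo, delta_p):
    """Return 4x4 target corrections T*_i = T_{i,i+dp} * T_hat_{i,i+dp}^-1 (Eq. 1)."""
    targets = []
    for i in range(len(T_w_gt) - delta_p):
        T_gt = relative_transform(T_w_gt[i], T_w_gt[i + delta_p])  # true T_{i,i+dp}
        T_vo = relative_transform(T_w_vo[i], T_w_vo[i + delta_p])  # estimated T_hat_{i,i+dp}
        targets.append(T_gt @ np.linalg.inv(T_vo))
    return targets

Note that a sequence with M poses yields M - \Delta p samples for each \Delta p, matching the sample counts reported in Section IV-A.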
A. Loss Function: Correcting SE(3) Estimates

One possible approach to learning T^* is to break it into constituent parts and then compose a loss function as the weighted sum of translational and rotational error norms (as done in [7]–[9]). This, however, does not account for the possible correlation between the two losses, and requires the careful tuning of a scalar weight.

In this work, we instead choose to parametrize our correction prediction as T = \exp(\xi^\wedge), where \xi \in \mathbb{R}^6, a vector of Lie algebra coordinates, is the output of our network (similar to [19]). We define a loss for \xi as

L(\xi) = \frac{1}{2} g(\xi)^T \Sigma^{-1} g(\xi),   (2)

where

g(\xi) \triangleq \ln\left( \exp(\xi^\wedge)\, (T^*)^{-1} \right)^\vee.   (3)

Here, \Sigma is the covariance of our estimator (expressed using unconstrained Lie algebra coordinates), and (\cdot)^\wedge, (\cdot)^\vee convert vectors of Lie algebra coordinates to matrix vectorspace quantities and back, respectively. Given two stereo image pairs, we use the output of DPC-Net, \xi, to correct our estimator as follows:

\hat{T}^{\mathrm{corr}} = \exp(\xi^\wedge)\, \hat{T},   (4)

where we have dropped subscripts for clarity.

B. Loss Function: SE(3) Covariance

Since we are learning estimator corrections, we can compute an empirical covariance over the training set as

\Sigma = \frac{1}{N-1} \sum_{i=1}^{N} (\xi^*_i - \bar{\xi}^*)(\xi^*_i - \bar{\xi}^*)^T,   (5)

where

\xi^*_i \triangleq \ln(T^*_i)^\vee, \quad \bar{\xi}^* \triangleq \frac{1}{N} \sum_{i=1}^{N} \xi^*_i.   (6)

The term \Sigma balances the rotational and translational loss terms based on their magnitudes in the training set, and accounts for potential correlations. We stress that if we were learning poses directly, the pose targets and their associated mean would be trajectory dependent and would render this type of covariance estimation meaningless. Further, we find that, in our experiments, \Sigma weights translational and rotational errors similarly to that presented in [7] based on the diagonal components, but contains relatively large off-diagonal terms.
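The empirical covariance of Eqs. (5)-(6) reduces to a few lines of numpy once the targets have been mapped to Lie algebra coordinates \xi^*_i = \ln(T^*_i)^\vee; that mapping is left to a Lie-group utility of your choice and is not shown here.

import numpy as np

def correction_covariance(xi_star):
    """xi_star: (N, 6) array of target corrections in Lie algebra coordinates."""
    xi_mean = xi_star.mean(axis=0)                          # Eq. (6)
    centered = xi_star - xi_mean
    return centered.T @ centered / (xi_star.shape[0] - 1)   # Eq. (5)

# Usage sketch: Sigma_inv = np.linalg.inv(correction_covariance(xi_star)) is the
# weight matrix that balances translation and rotation in the loss of Eq. (2).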
C. Loss Function: SE(3) Jacobians

In order to use Equation (2) to train DPC-Net with back-propagation, we need to compute its Jacobian with respect to our network output, \xi. Applying the chain rule, we begin with the expression

\frac{\partial L(\xi)}{\partial \xi} = g(\xi)^T \Sigma^{-1} \frac{\partial g(\xi)}{\partial \xi}.   (7)

The term \frac{\partial g(\xi)}{\partial \xi} is of importance. We can derive it in two ways. To start, note two important identities [23]. First,

\exp\left((\xi + \delta\xi)^\wedge\right) \approx \exp\left((\mathcal{J} \delta\xi)^\wedge\right) \exp(\xi^\wedge),   (8)

where \mathcal{J} \triangleq \mathcal{J}(\xi) is the left SE(3) Jacobian. Second, if T_1 \triangleq \exp(\xi_1^\wedge) and T_2 \triangleq \exp(\xi_2^\wedge), then

\ln(T_1 T_2)^\vee = \ln\left( \exp(\xi_1^\wedge) \exp(\xi_2^\wedge) \right)^\vee \approx \begin{cases} \mathcal{J}(\xi_2)^{-1} \xi_1 + \xi_2 & \text{if } \xi_1 \text{ is small} \\ \xi_1 + \mathcal{J}(-\xi_1)^{-1} \xi_2 & \text{if } \xi_2 \text{ is small.} \end{cases}   (9)

See [23] for a detailed treatment of matrix Lie groups and their use in state estimation.

1) Deriving \frac{\partial g(\xi)}{\partial \xi}, Method I: If we assume that only \xi is 'small', we can apply Equation (9) directly to define

\frac{\partial g(\xi)}{\partial \xi} = \mathcal{J}(-\xi^*)^{-1},   (10)

with \xi^* \triangleq \ln(T^*)^\vee. Although attractively compact, note that this expression for \frac{\partial g(\xi)}{\partial \xi} assumes that \xi is small, and may be inaccurate for 'larger' T^* (since we will therefore require \xi to be commensurately 'large').

2) Deriving \frac{\partial g(\xi)}{\partial \xi}, Method II: Alternatively, we can linearize Equation (3) about \xi by considering a small change \delta\xi and applying Equation (8):

g(\xi + \delta\xi) = \ln\left( \exp\left((\xi + \delta\xi)^\wedge\right) (T^*)^{-1} \right)^\vee   (11)
\approx \ln\left( \exp\left((\mathcal{J} \delta\xi)^\wedge\right) \exp(\xi^\wedge) (T^*)^{-1} \right)^\vee.   (12)

Now, assuming that \mathcal{J} \delta\xi is 'small', and using Equation (3), Equation (9) gives:

g(\xi + \delta\xi) \approx \mathcal{J}(g(\xi))^{-1} \mathcal{J}(\xi)\, \delta\xi + g(\xi).   (13)

Comparing this to the first-order Taylor expansion, g(\xi + \delta\xi) \approx g(\xi) + \frac{\partial g(\xi)}{\partial \xi} \delta\xi, we see that

\frac{\partial g(\xi)}{\partial \xi} = \mathcal{J}(g(\xi))^{-1} \mathcal{J}(\xi).   (14)

Although slightly more computationally expensive, this expression makes no assumptions about the 'magnitude' of our correction and works reliably for any target. Note further that if \xi is small, then \mathcal{J}(\xi) \approx \mathbf{1} and \exp(\xi^\wedge) \approx \mathbf{1}. Thus,

g(\xi) \approx \ln\left( (T^*)^{-1} \right)^\vee = -\xi^*,   (15)

and Equation (14) becomes

\frac{\partial g(\xi)}{\partial \xi} = \mathcal{J}(-\xi^*)^{-1},   (16)

which matches Method I. To summarize, to apply back-propagation to Equation (2), we use Equation (7) and Equation (14).
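A minimal numpy sketch of Eqs. (2), (7) and (14). It assumes a Lie-group utility that provides the SE(3) exponential/logarithm maps and left Jacobians; the names below follow the authors' open-source liegroups package (SE3.exp, .dot, .inv, .log, SE3.left_jacobian, SE3.inv_left_jacobian), but we have not verified that exact API here, so treat the calls as placeholders.

import numpy as np
from liegroups import SE3  # assumed API; verify against the installed package

def se3_loss_and_grad(xi, T_star, Sigma_inv):
    """Return L(xi) from Eq. (2) and dL/dxi from Eqs. (7) and (14).
    xi: length-6 Lie algebra coordinates, T_star: SE3 target, Sigma_inv: (6, 6)."""
    g = SE3.exp(xi).dot(T_star.inv()).log()                      # g(xi), Eq. (3)
    loss = 0.5 * g @ Sigma_inv @ g                               # Eq. (2)
    dg_dxi = SE3.inv_left_jacobian(g) @ SE3.left_jacobian(xi)    # Eq. (14)
    grad = g @ Sigma_inv @ dg_dxi                                # Eq. (7)
    return loss, grad

In a PyTorch implementation this pair would naturally live inside a custom autograd Function, with the analytic gradient above returned from its backward pass instead of relying on automatic differentiation through the Lie-group operations.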
D. Loss Function: Correcting SO(3) Estimates

Our framework can be easily modified to learn SO(3) corrections only. We can parametrize a similar objective for \phi \in \mathbb{R}^3,

L(\phi, C^*) = \frac{1}{2} f(\phi)^T \Sigma^{-1} f(\phi),   (17)

where

f(\phi) \triangleq \ln\left( \exp(\phi^\wedge)\, (C^*)^{-1} \right)^\vee.   (18)

Equations (8) and (9) have analogous SO(3) formulations:

\exp\left((\phi + \delta\phi)^\wedge\right) \approx \exp\left((J \delta\phi)^\wedge\right) \exp(\phi^\wedge),   (19)

and

\ln(C_1 C_2)^\vee = \ln\left( \exp(\phi_1^\wedge) \exp(\phi_2^\wedge) \right)^\vee \approx \begin{cases} J(\phi_2)^{-1} \phi_1 + \phi_2 & \text{if } \phi_1 \text{ is small} \\ \phi_1 + J(-\phi_1)^{-1} \phi_2 & \text{if } \phi_2 \text{ is small,} \end{cases}   (20)

where J \triangleq J(\phi) is the left SO(3) Jacobian. Accordingly, the final loss Jacobians are identical in structure to Equation (7) and Equation (14), with the necessary SO(3) replacements.
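For these SO(3) analogues of Eqs. (7) and (14), the only non-trivial ingredients are the left SO(3) Jacobian and its inverse, whose closed forms are standard (see [23]). A minimal numpy sketch, with first-order small-angle branches:

import numpy as np

def skew(phi):
    x, y, z = phi
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def so3_left_jacobian(phi):
    theta = np.linalg.norm(phi)
    S = skew(phi)
    if theta < 1e-8:
        return np.eye(3) + 0.5 * S                       # first-order approximation
    return (np.eye(3)
            + (1.0 - np.cos(theta)) / theta**2 * S
            + (theta - np.sin(theta)) / theta**3 * (S @ S))

def so3_inv_left_jacobian(phi):
    theta = np.linalg.norm(phi)
    S = skew(phi)
    if theta < 1e-8:
        return np.eye(3) - 0.5 * S                       # first-order approximation
    coef = 1.0 / theta**2 - (1.0 + np.cos(theta)) / (2.0 * theta * np.sin(theta))
    return np.eye(3) - 0.5 * S + coef * (S @ S)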
Fig. 3. We apply pose graph relaxation to fuse high-rate visual localization estimates (\hat{T}_{i,i+1}) with low-rate deep pose corrections (\hat{T}^{\mathrm{corr}}_{i,i+\Delta p}).

E. Pose Graph Relaxation

In practice, we find that using camera poses several frames apart (i.e., \Delta p > 1) often improves test accuracy and reduces overfitting. As a result, we turn to pose graph relaxation to fuse low-rate corrections with higher-rate visual pose estimates. For a particular window of \Delta p + 1 poses (see Figure 3), we solve the non-linear minimization problem

\{T_{t_i,n}\}_{i=0}^{\Delta p} = \operatorname{argmin}_{\{T_{t_i,n}\}_{i=0}^{\Delta p} \in SE(3)} O_t,   (21)

where n refers to a common navigation frame, and where we define the total cost, O_t, as a sum of visual estimation and correction components:

O_t = O_v + O_c.   (22)

The former cost sums over each estimated transform,

O_v \triangleq \sum_{i=0}^{\Delta p - 1} e_{t_i,t_{i+1}}^T \Sigma_v^{-1} e_{t_i,t_{i+1}},   (23)

while the latter incorporates a single pose correction,

O_c \triangleq e_{t_0,t_{\Delta p}}^T \Sigma_c^{-1} e_{t_0,t_{\Delta p}},   (24)

with the pose error defined as

e_{1,2} = \ln\left( \hat{T}_{1,2}\, T_2\, T_1^{-1} \right)^\vee.   (25)

We refer the reader to [23] for a detailed treatment of pose-graph relaxation.

IV. EXPERIMENTS

To assess the power of our proposed deep corrective paradigm, we trained DPC-Net on visual data with localization estimates from a sparse stereo visual odometry (S-VO) estimator. In addition to training the full SE(3) DPC-Net, we modified the loss function and input data to learn simpler SO(3) rotation corrections, and simpler still, yaw angle corrections. For reference, we compared S-VO with different DPC-Net corrections to a state-of-the-art dense estimator. Finally, we trained DPC-Net on visual data and localization estimates from radially-distorted and cropped images.

A. Training & Testing

For all experiments, we used the KITTI odometry benchmark training set [24]. Specifically, our data consisted of the eight sequences 00, 02 and 05-10 (we removed sequences 01, 03, 04 to ensure that all data originated from the 'residential' category for training and test consistency).

For testing and validation, we selected the first three sequences (00, 02, and 05). For each test sequence, we selected the subsequent sequence for validation (i.e., for test sequence 00 we validated with 02, for 02 with 05, etc.) and used the remaining sequences for training. We note that, by design, we train DPC-Net to predict corrections for a specific sensor and estimator pair. A pre-trained DPC-Net may further serve as a useful starting point to fine-tune new models for other sensors and estimators. In this work, however, we focus on the aforementioned KITTI sequences and leave a thorough investigation of generalization for future work.

To collect training samples, \{T^*_i, I_{t_i,t_{i+\Delta p}}\}_{i=0}^{N}, we used a stereo visual odometry estimator and GPS-INS ground-truth from the KITTI odometry dataset5. We resized all images to [400, 120] pixels, approximately preserving their original aspect ratio6. For non-distorted data, we use \Delta p \in [3, 4, 5] for training, and test with \Delta p = 4. For distorted data, we reduce this to \Delta p \in [2, 3, 4] and \Delta p = 3, respectively, to compensate for the larger estimation errors.

Our training datasets contained between 35,000 and 52,000 training samples7, depending on the test sequence. We trained all models for 30 epochs using the Adam optimizer, and selected the best epoch based on the lowest validation loss.
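A minimal PyTorch sketch of this training procedure (Adam, 30 epochs, keeping the weights from the epoch with the lowest validation loss). The model, loss function, and data loaders are supplied by the caller; the learning rate and batch handling are placeholders, not values reported in the paper.

import copy
import torch

def train_dpc_net(model, loss_fn, train_loader, val_loader, n_epochs=30, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, best_state = float('inf'), None
    for epoch in range(n_epochs):
        model.train()
        for images, T_star in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), T_star)
            loss.backward()
            optimizer.step()
        # Evaluate and keep the best epoch by validation loss.
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(images), T_star).item()
                           for images, T_star in val_loader) / max(len(val_loader), 1)
        if val_loss < best_val:
            best_val, best_state = val_loss, copy.deepcopy(model.state_dict())
    if best_state is not None:
        model.load_state_dict(best_state)
    return model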
1) Rotation: To train rotation-only corrections, we extracted the SO(3) component of T^*_i and trained our network using Equation (17). Further, owing to the fact that rotation information can be extracted from monocular images, we replaced the input stereo pairs in DPC-Net with monocular images from the left camera8.

5 We used RGB stereo images to train DPC-Net but grayscale images for the estimator.
6 Because our network is fully convolutional, it can, in principle, operate on different image resolutions with no modifications, though we do not investigate this ability in this work.
7 If a sequence has M poses, we collect M - \Delta p training samples for each \Delta p.
8 The stereo VO estimator remained unchanged.
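A minimal sketch of producing the rotation-only targets used for this variant: the SO(3) block of each T^*_i is converted to Lie algebra coordinates, here with scipy. The yaw-only variant discussed next further reduces this to a single angle about the camera's vertical axis, which depends on the chosen camera-frame convention and is not shown.

import numpy as np
from scipy.spatial.transform import Rotation

def so3_target(T_star):
    """Return phi* = ln(C*)^v for the rotation block C* of a 4x4 SE(3) target."""
    C_star = T_star[:3, :3]
    return Rotation.from_matrix(C_star).as_rotvec()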
2) Yaw: To further simplify the corrections, we extracted a single-degree-of-freedom yaw rotation correction angle9 from T^*_i, and trained DPC-Net with monocular images and a mean squared loss.
9 We define yaw in the camera frame as the rotation about the camera's vertical y axis.

B. Estimators

1) Sparse Visual Odometry: To collect T^*_i, we first used a frame-to-frame sparse visual odometry pipeline similar to that presented in [3]. We briefly outline the pipeline here.

Using the open-source libviso2 package [25], we detect and track sparse stereo image key-points, y_{l,t_{i+1}} and y_{l,t_i}, between stereo image pairs (assumed to be undistorted and rectified). We model reprojection errors (due to sensor noise and quantization) as zero-mean Gaussians with a known covariance, \Sigma_y:

e_{l,t_i} = y_{l,t_{i+1}} - \pi\left( T_{t_{i+1},t_i}\, \pi^{-1}(y_{l,t_i}) \right) \sim \mathcal{N}(0, \Sigma_y),   (26, 27)

where \pi(\cdot) is the stereo camera projection function. To generate an initial guess and to reject outliers, we use three-point Random Sample Consensus (RANSAC) based on stereo reprojection error. Finally, we solve for the maximum likelihood transform, T^*_{t_{i+1},t_i}, through a Gauss-Newton minimization:

T^*_{t_{i+1},t_i} = \operatorname{argmin}_{T_{t_{i+1},t_i} \in SE(3)} \sum_{l=1}^{N_{t_i}} e_{l,t_i}^T \Sigma_y^{-1} e_{l,t_i}.   (28)

2) Sparse Visual Odometry with Radial Distortion: Similar to [5], we modified our input images to test our network's ability to correct estimators that compute poses from degraded visual data. Unlike [5], who darken and blur their images, we chose to simulate a poorly calibrated lens model by applying radial distortion to the (rectified) KITTI dataset using a plumb-bob distortion model. The model computes radially-distorted image coordinates, x_d, y_d, from the normalized coordinates x_n, y_n as

\begin{bmatrix} x_d \\ y_d \end{bmatrix} = \left(1 + \kappa_1 r^2 + \kappa_2 r^4 + \kappa_3 r^6\right) \begin{bmatrix} x_n \\ y_n \end{bmatrix},   (29)

where r = \sqrt{x_n^2 + y_n^2}. We set the distortion coefficients \kappa_1, \kappa_2, and \kappa_3 to -0.3, 0.2, and 0.01, respectively, to approximately match the KITTI radial distortion parameters. We solved Equation (29) iteratively and used bilinear interpolation to compute the distorted images for every stereo pair in a sequence. Finally, we cropped each distorted image to remove any whitespace. Figure 4 illustrates this process.

With this distorted dataset, we computed S-VO localization estimates and then trained DPC-Net to correct for the effects of the radial distortion and the effective intrinsic parameter shift due to the cropping process.

3) Dense Visual Odometry: Finally, we present localization estimates from a computationally-intensive keyframe-based dense, direct visual localization pipeline [26] that computes relative camera poses by minimizing photometric error with respect to a keyframe image. To compute the photometric error, the pipeline relies on an inverse compositional approach to map the image coordinates of a tracking image to the image coordinates of the reference depth image. As the camera moves through an environment, a new keyframe depth image is computed and stored when the camera field-of-view differs sufficiently from the last keyframe.

We used this dense, direct estimator as our benchmark for a state-of-the-art visual localizer, and compared its accuracy to that of a much less computationally expensive sparse estimator paired with DPC-Net.

C. Evaluation Metrics

To evaluate the performance of DPC-Net, we use three error metrics: mean absolute trajectory error, cumulative absolute trajectory error, and mean segment error. For clarity, we describe each of these three metrics explicitly and stress the importance of carefully selecting and defining error metrics when comparing relative localization estimates, as results can be subtly deceiving.

1) Mean Absolute Trajectory Error (m-ATE): The mean absolute trajectory error averages the magnitude of the rotational or translational error10 of estimated poses with respect to a ground truth trajectory defined within the same navigation frame. Concretely, e_{m-ATE} is defined as

e_{\text{m-ATE}} \triangleq \frac{1}{N} \sum_{p=1}^{N} \left\| \ln\left( \hat{T}_{p,0}^{-1} T_{p,0} \right)^\vee \right\|.   (30)

Although widely used, m-ATE can be deceiving because a single poor relative transform can significantly affect the final statistic.
10 For brevity, the notation \ln(\cdot)^\vee returns rotational or translational components depending on context.

2) Cumulative Absolute Trajectory Error (c-ATE): Cumulative absolute trajectory error sums rotational or translational e_{m-ATE} up to a given point in a trajectory. It is defined as

e_{\text{c-ATE}}(q) \triangleq \sum_{p=1}^{q} \left\| \ln\left( \hat{T}_{p,0}^{-1} T_{p,0} \right)^\vee \right\|.   (31)

c-ATE can show clearer trends than m-ATE (because it is less affected by fortunate trajectory overlaps), but it still suffers from the same susceptibility to poor (but isolated) relative transforms.

3) Segment Error: Our final metric, segment error, averages the end-point error for all the possible segments of a given length within a trajectory, and then normalizes by the segment length. Since it considers multiple starting points within a trajectory, segment error is much less sensitive to isolated degradations. Concretely, e_{seg}(s) is defined as

e_{\text{seg}}(s) \triangleq \frac{1}{s N_s} \sum_{p=1}^{N_s} \left\| \ln\left( \hat{T}_{p+s_p,p}^{-1} T_{p+s_p,p} \right)^\vee \right\|,   (32)

where N_s and s_p (the number of segments of a given length, and the number of poses in each segment, respectively) are computed based on the selected segment length s. In this work, we follow the KITTI benchmark and report the mean segment error norms for all s \in [100, 200, 300, ..., 800] (m).
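A minimal numpy sketch of the segment error of Eq. (32). Poses are assumed to be 4x4 matrices expressed in a common frame; errors are returned per metre of segment length (scaling to percent or millidegrees per metre is left to the caller), and the KITTI development kit's exact segment bookkeeping may differ in minor details.

import numpy as np

def rotation_angle(R):
    return np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))

def segment_errors(T_est, T_gt, seg_length):
    # Cumulative distance travelled along the ground-truth trajectory.
    steps = [np.linalg.norm(T_gt[p + 1][:3, 3] - T_gt[p][:3, 3]) for p in range(len(T_gt) - 1)]
    dist = np.concatenate([[0.0], np.cumsum(steps)])
    trans_errs, rot_errs = [], []
    for p in range(len(T_gt)):
        # First pose index q whose travelled distance exceeds seg_length from pose p.
        q = np.searchsorted(dist, dist[p] + seg_length)
        if q >= len(T_gt):
            break
        rel_est = np.linalg.inv(T_est[p]) @ T_est[q]
        rel_gt = np.linalg.inv(T_gt[p]) @ T_gt[q]
        E = np.linalg.inv(rel_est) @ rel_gt          # end-point error transform
        trans_errs.append(np.linalg.norm(E[:3, 3]) / seg_length)
        rot_errs.append(rotation_angle(E[:3, :3]) / seg_length)
    return np.mean(trans_errs), np.mean(rot_errs)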
Fig. 4. Illustration of our image radial distortion procedure. Left: rectified RGB image (frame 280 from KITTI odometry sequence 05). Middle: the same image with radial distortion applied. Right: distorted, cropped, and scaled image.
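A minimal sketch of the distortion procedure illustrated in Fig. 4: for each pixel of the output image we invert Eq. (29) by fixed-point iteration (the 'iterative' solution mentioned above) to find the undistorted normalized coordinates, then sample the rectified image bilinearly, here via OpenCV. The intrinsics are placeholders that must be read from the KITTI calibration files, and the final whitespace cropping is omitted.

import cv2
import numpy as np

K1, K2, K3 = -0.3, 0.2, 0.01  # distortion coefficients used in the paper

def distort_image(img, fx, fy, cx, cy, n_iters=10):
    h, w = img.shape[:2]
    u, v = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
    # Normalized (distorted) coordinates of every output pixel.
    xd, yd = (u - cx) / fx, (v - cy) / fy
    # Fixed-point iteration for the undistorted coordinates:
    # (xd, yd) = (1 + k1 r^2 + k2 r^4 + k3 r^6) (xn, yn)  =>  xn = xd / factor.
    xn, yn = xd.copy(), yd.copy()
    for _ in range(n_iters):
        r2 = xn**2 + yn**2
        factor = 1.0 + K1 * r2 + K2 * r2**2 + K3 * r2**3
        xn, yn = xd / factor, yd / factor
    # Map back to pixel coordinates in the rectified image and sample bilinearly.
    map_x = (xn * fx + cx).astype(np.float32)
    map_y = (yn * fy + cy).astype(np.float32)
    return cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR)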

Fig. 5. c-ATE and mean segment errors for S-VO with and without DPC-Net. Panels: (a) Sequence 00: cumulative errors; (b) Sequence 00: segment errors; (c) Sequence 02: cumulative errors; (d) Sequence 02: segment errors; (e) Sequence 05: cumulative errors; (f) Sequence 05: segment errors. Each panel compares S-VO, S-VO + DPC (yaw), S-VO + DPC (rot), S-VO + DPC (pose), and the dense estimator.

Fig. 6. Top-down (Northing vs. Easting) trajectory projections for S-VO with and without DPC-Net: (a) Sequence 00, (b) Sequence 02, (c) Sequence 05, each shown alongside ground truth, S-VO + DPC (pose), and the dense estimator.

V. RESULTS & DISCUSSION

A. Correcting Sparse Visual Odometry

Figure 5 plots c-ATE and mean segment errors for test sequences 00, 02 and 05 for three different DPC-Net models paired with our S-VO pipeline.
Table I summarizes the results quantitatively, while Figure 6 plots the North-East projection of each trajectory. On average, DPC-Net trained with the full SE(3) loss reduced translational m-ATE by 72%, rotational m-ATE by 75%, translational mean segment errors by 40% and rotational mean segment errors by 44% (relative to the uncorrected estimator). Mean segment errors of the sparse estimator with DPC approached those observed from the dense estimator on sequence 00, and outperformed the dense estimator on 02. Sequence 05 produced two notable results: (1) although DPC-Net significantly reduced S-VO errors, the dense estimator still outperformed it in all statistics, and (2) the full SE(3) corrections performed slightly worse than their SO(3) counterparts. We suspect the latter effect is a result of motion estimates with predominantly rotational errors, which are easier to learn with an SO(3) loss.

In general, coupling DPC-Net with a simple frame-to-frame sparse visual localizer yielded a final localization pipeline with accuracy similar to that of a dense pipeline while requiring significantly less visual data (recall that DPC-Net uses resized images).

Fig. 7. c-ATE and segment errors for S-VO with radially distorted images, with and without DPC-Net: (a) Sequence 00 (distorted): segment errors; (b) Sequence 02 (distorted): segment errors; (c) Sequence 05 (distorted): segment errors. Each panel compares S-VO, S-VO + DPC (rot), and S-VO + DPC (pose).

B. Distorted Images

Figure 7 plots mean segment errors for the radially distorted dataset. On average, DPC-Net trained with the full SE(3) loss reduced translational mean segment errors by 50% and rotational mean segment errors by 35% (relative to the uncorrected sparse estimator, see Table II). The yaw-only DPC-Net corrections did not produce consistent improvements (we suspect due to the presence of large errors in the remaining degrees of freedom as a result of the distortion procedure). Nevertheless, DPC-Net trained with SE(3) and SO(3) losses was able to significantly mitigate the effect of a poorly calibrated camera model. We are actively working on modifications to the network that would allow the corrected results to approach those of the undistorted case.

VI. CONCLUSIONS

In this work, we presented DPC-Net, a novel way of fusing the power of deep, convolutional networks with classical geometric localization pipelines. Using a novel loss function based on matrix Lie groups, DPC-Net learns SE(3) pose corrections to improve a baseline estimator and mitigates the effect of estimator bias, environmental factors, and poor sensor calibrations. We demonstrated how DPC-Net can render a sparse stereo visual odometry pipeline as accurate as a state-of-the-art dense estimator, and significantly improve estimates computed with a poorly calibrated lens distortion model.

In future work, we plan to investigate the addition of a memory state (through recurrent neural network structures), extend DPC-Net to other sensor modalities (e.g., lidar), and incorporate prediction uncertainty through the use of modern probabilistic approaches to deep learning [27].
TABLE I
m-ATE and Mean Segment Errors for VO results with and without DPC-Net.

Sequence (Length)   Estimator        Corr. Type   m-ATE Trans. (m)   m-ATE Rot. (deg)   Seg. Err. Trans. (%)   Seg. Err. Rot. (millideg/m)
00 (3.7 km)1        S-VO             —            60.22              18.25              2.88                   11.18
                    Dense            —            12.41              2.45               1.28                   5.42
                    S-VO + DPC-Net   Pose         15.68              3.07               1.62                   5.59
                    S-VO + DPC-Net   Rotation     26.67              7.41               1.70                   6.14
                    S-VO + DPC-Net   Yaw          29.27              8.32               1.94                   7.47
02 (5.1 km)2        S-VO             —            83.17              14.87              2.05                   7.25
                    Dense            —            16.33              3.19               1.21                   4.67
                    S-VO + DPC-Net   Pose         17.69              2.86               1.16                   4.36
                    S-VO + DPC-Net   Rotation     20.66              3.10               1.21                   4.28
                    S-VO + DPC-Net   Yaw          49.07              10.17              1.53                   6.56
05 (2.2 km)3        S-VO             —            27.59              9.54               1.99                   9.47
                    Dense            —            5.83               2.05               0.69                   3.20
                    S-VO + DPC-Net   Pose         9.82               3.57               1.34                   5.62
                    S-VO + DPC-Net   Rotation     9.67               2.53               1.10                   4.68
                    S-VO + DPC-Net   Yaw          18.37              6.15               1.37                   6.90

1 Training sequences 05, 06, 07, 08, 09, 10. Validation sequence 02.
2 Training sequences 00, 06, 07, 08, 09, 10. Validation sequence 05.
3 Training sequences 00, 02, 07, 08, 09, 10. Validation sequence 06.
4 All models trained for 30 epochs. The final model is selected based on the epoch with the lowest validation error.

TABLE II
m-ATE and Mean Segment Errors for VO results with and without DPC-Net for distorted images.

Sequence (Length)       Estimator    Corr. Type   m-ATE Trans. (m)   m-ATE Rot. (deg)   Seg. Err. Trans. (%)   Seg. Err. Rot. (millideg/m)
00-distorted (3.7 km)   S-VO         —            168.27             37.15              14.52                  46.43
                        S-VO + DPC   Pose         114.35             28.64              6.73                   29.93
                        S-VO + DPC   Rotation     84.54              21.90              9.58                   25.28
02-distorted (5.1 km)   S-VO         —            335.82             51.05              13.74                  34.37
                        S-VO + DPC   Pose         196.90             23.66              7.49                   25.20
                        S-VO + DPC   Rotation     269.90             53.11              9.25                   25.99
05-distorted (2.2 km)   S-VO         —            73.44              12.27              14.99                  42.45
                        S-VO + DPC   Pose         47.50              10.54              7.11                   24.60
                        S-VO + DPC   Rotation     71.42              13.10              8.14                   23.56

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] V. Peretroukhin, J. Kelly, and T. D. Barfoot, "Optimizing camera perspective for stereo visual odometry," in Canadian Conference on Computer and Robot Vision, May 2014, pp. 1–7.
[3] V. Peretroukhin, W. Vega-Brown, N. Roy, and J. Kelly, "PROBE-GK: Predictive robust estimation using generalized kernels," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May 2016, pp. 817–824.
[4] V. Peretroukhin, L. Clement, M. Giamou, and J. Kelly, "PROBE: Predictive robust estimation for visual-inertial navigation," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Syst. (IROS), 2015, pp. 3668–3675.
[5] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia, "Exploring representation learning with CNNs for frame-to-frame ego-motion estimation," IEEE Robot. Autom. Letters, vol. 1, no. 1, pp. 18–25, Jan. 2016.
[6] R. Clark, S. Wang, H. Wen, A. Markham, and N. Trigoni, "VINet: Visual-inertial odometry as a sequence-to-sequence learning problem," 2017. arXiv: 1701.08376 [cs.CV].
[7] A. Kendall and R. Cipolla, "Geometric loss functions for camera pose regression with deep learning," 2017. arXiv: 1704.00390 [cs.CV].
[8] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu, "Relative camera pose estimation using convolutional neural networks," 2017. arXiv: 1702.01381 [cs.CV].
[9] G. L. Oliveira, N. Radwan, W. Burgard, and T. Brox, "Topometric localization with deep learning," 2017. arXiv: 1706.08775 [cs.CV].
[10] V. Peretroukhin, L. Clement, and J. Kelly, "Reducing drift in visual odometry by inferring sun direction using a Bayesian convolutional neural network," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May 2017.
[11] D. Scaramuzza and F. Fraundorfer, "Visual odometry [tutorial]," IEEE Robot. Autom. Mag., vol. 18, no. 4, pp. 80–92, Dec. 2011.
[12] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Trans. Rob., vol. 32, no. 6, pp. 1309–1332, Dec. 2016.
[13] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," J. Mach. Learn. Res., 2016.
[14] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, "Benchmarking deep reinforcement learning for continuous control," in Proc. Int. Conf. on Machine Learning (ICML), 2016, pp. 1329–1338.
[15] F. Yang, W. Choi, and Y. Lin, "Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers," in Proc. IEEE Int. Conf. Comp. Vision and Pattern Recognition (CVPR), 2016, pp. 2129–2137.
[16] N. Sunderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, "Place recognition with ConvNet landmarks: Viewpoint-robust, condition-robust, training-free," in Proc. Robotics: Science and Systems XII, Jul. 2015.
[17] T. Haarnoja, A. Ajay, S. Levine, and P. Abbeel, "Backprop KF: Learning discriminative deterministic state estimators," in Proc. Neural Information Processing Systems (NIPS), 2016.
[18] A. Kendall, M. Grimes, and R. Cipolla, "PoseNet: A convolutional network for real-time 6-DOF camera relocalization," in Proc. IEEE Int. Conf. on Computer Vision (ICCV), 2015.
[19] A. Handa, M. Bloesch, V. Pătrăucean, S. Stent, J. McCormac, and A. Davison, "gvnn: Neural network library for geometric computer vision," in Computer Vision – ECCV 2016 Workshops, Springer, Cham, 2016, pp. 67–82.
[20] I. Cvišić and I. Petrović, "Stereo odometry based on careful feature selection and tracking," in Proc. European Conf. on Mobile Robots, 2015, pp. 1–6.
[21] Q. Li, J. Qian, Z. Zhu, X. Bao, M. K. Helwa, and A. P. Schoellig, "Deep neural networks for improved, impromptu trajectory tracking of quadrotors," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), 2017, pp. 5183–5189.
[22] A. Punjani and P. Abbeel, "Deep learning helicopter dynamics models," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May 2015, pp. 3223–3230.
[23] T. D. Barfoot, State Estimation for Robotics. Cambridge University Press, Jul. 2017.
[24] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," Int. J. Rob. Res., vol. 32, no. 11, pp. 1231–1237, 2013.
[25] A. Geiger, J. Ziegler, and C. Stiller, "StereoScan: Dense 3D reconstruction in real-time," in Proc. Intelligent Vehicles Symp. (IV), IEEE, Jun. 2011, pp. 963–968.
[26] L. Clement and J. Kelly, "How to train a CAT: Learning canonical appearance transformations for robust direct localization under illumination change," 2017. arXiv preprint: arXiv:submit/2002132 (cs.CV).
[27] A. Kendall and Y. Gal, "What uncertainties do we need in Bayesian deep learning for computer vision?" 2017. arXiv: 1703.04977 [cs.CV].
