DPC-Net: Deep Pose Correction for Visual Localization
Valentin Peretroukhin and Jonathan Kelly
niques achieve accuracies previously unattainable by classical methods. In mobile robotics and state estimation, however, it remains unclear to what extent these deep architectures can obviate the need for classical geometric modelling. In this work, we focus on visual odometry (VO): the task of computing the egomotion of a camera through an unknown environment with no external positioning sources. Visual localization algorithms like VO can suffer from several systematic error sources that include estimator biases [2], poor calibration, and environmental factors (e.g., a lack of scene texture). While machine learning approaches can be used to better model specific subsystems of a localization pipeline (e.g., the heteroscedastic feature track error covariance modelling of [3], [4]), much recent work [5]–[9] has been devoted to completely replacing the estimator with a CNN-based system.

We contend that this type of complete replacement places an unnecessary burden on the CNN. Not only must it learn projective geometry, but it must also understand the environment, and account for sensor calibration and noise. Instead, we take inspiration from recent results [10] that demonstrate that CNNs can infer difficult-to-model geometric quantities (e.g., the direction of the sun) to improve an existing localization estimate. In a similar vein, we propose a system (Figure 1) that takes as its starting point an efficient, classical localization algorithm that computes high-rate pose estimates. To it, we add a Deep Pose Correction Network (DPC-Net) that learns low-rate, 'small' corrections from training data that we then fuse with the original estimates. DPC-Net does not require any modification to an existing localization pipeline, and can learn to correct multi-faceted errors from estimator bias, sensor mis-calibration, or environmental effects.

Although in this work we focus on visual data, the DPC-Net architecture can be readily modified to learn SE(3) corrections for estimators that operate with other sensor modalities (e.g., lidar). For this general task, we derive a pose regression loss function and a closed-form analytic expression for its Jacobian. Our loss permits a network to learn an unconstrained Lie algebra coordinate vector, but derives its Jacobian with respect to SE(3) geodesic distance.
In summary, the main contributions of this work are
1) the formulation of a novel deep corrective approach to egomotion estimation,
2) a novel cost function for deep SE(3) regression that naturally balances translation and rotation errors, and
3) an open-source implementation of DPC-Net.

Manuscript received September 10, 2017; revised November 24, 2017; accepted November 8, 2017. Both authors are with the Space & Terrestrial Autonomous Robotic Systems (STARS) laboratory at the University of Toronto Institute for Aerospace Studies (UTIAS), Canada. <firstname>.<lastname>@robotics.utias.utoronto.ca
2) Yaw: To further simplify the corrections, we extracted a single-degree-of-freedom yaw rotation correction angle from $\mathbf{T}^*_i$ (we define yaw in the camera frame as the rotation about the camera's vertical $y$ axis), and trained DPC-Net with monocular images and a mean squared loss.
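As a concrete illustration of this step, the short sketch below (a hypothetical helper of our own, not the paper's released code) extracts an approximate rotation-about-$y$ angle from a 4x4 SE(3) correction matrix; it assumes the camera convention stated above, with yaw taken about the camera's vertical $y$ axis.

import numpy as np

def extract_yaw(T_corr):
    """Approximate yaw (radians) of the rotation block of a 4x4 SE(3) correction."""
    R = T_corr[:3, :3]
    # For a pure rotation about y, R = [[cos, 0, sin], [0, 1, 0], [-sin, 0, cos]],
    # so atan2 of the (0, 2) and (2, 2) entries recovers the angle; for general
    # small corrections this serves as a reasonable single-DOF yaw approximation.
    return np.arctan2(R[0, 2], R[2, 2])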
B. Estimators

1) Sparse Visual Odometry: To collect $\mathbf{T}^*_i$, we first used a frame-to-frame sparse visual odometry pipeline similar to that presented in [3]. We briefly outline the pipeline here.

Using the open-source libviso2 package [25], we detect and track sparse stereo image key-points, $\mathbf{y}_{l,t_{i+1}}$ and $\mathbf{y}_{l,t_i}$, between stereo image pairs (assumed to be undistorted and rectified). We model reprojection errors (due to sensor noise and quantization) as zero-mean Gaussians with a known covariance, $\boldsymbol{\Sigma}_y$,
\[
\mathbf{e}_{l,t_i} = \mathbf{y}_{l,t_{i+1}} - \boldsymbol{\pi}\!\left(\mathbf{T}_{t_{i+1},t_i}\,\boldsymbol{\pi}^{-1}(\mathbf{y}_{l,t_i})\right) \tag{26}
\]
\[
\mathbf{e}_{l,t_i} \sim \mathcal{N}\!\left(\mathbf{0}, \boldsymbol{\Sigma}_y\right), \tag{27}
\]
where $\boldsymbol{\pi}(\cdot)$ is the stereo camera projection function. To generate an initial guess and to reject outliers, we use three-point Random Sample Consensus (RANSAC) based on stereo reprojection error. Finally, we solve for the maximum likelihood transform, $\mathbf{T}^*_{t_{i+1},t_i}$, through a Gauss-Newton minimization:
\[
\mathbf{T}^*_{t_{i+1},t_i} = \operatorname*{argmin}_{\mathbf{T}_{t_{i+1},t_i} \in SE(3)} \; \sum_{l=1}^{N_{t_i}} \mathbf{e}_{l,t_i}^{T} \boldsymbol{\Sigma}_y^{-1} \mathbf{e}_{l,t_i}. \tag{28}
\]
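To make Equations (26) to (28) concrete, the sketch below implements the Gauss-Newton step for a single frame pair under simplifying assumptions of our own: a rectified stereo model parameterized by focal length f, principal point (cu, cv), and baseline b, observations y = (u_left, v_left, u_right), and numerical rather than analytic Jacobians. It is a toy illustration, not the libviso2-based implementation used in the paper, and it omits the three-point RANSAC stage.

import numpy as np

def so3_wedge(phi):
    return np.array([[0.0, -phi[2], phi[1]],
                     [phi[2], 0.0, -phi[0]],
                     [-phi[1], phi[0], 0.0]])

def se3_exp(xi):
    """SE(3) exponential map: xi = (rho, phi) -> 4x4 homogeneous transform."""
    rho, phi = xi[:3], xi[3:]
    theta = np.linalg.norm(phi)
    Phi = so3_wedge(phi)
    if theta < 1e-9:
        R = np.eye(3) + Phi
        V = np.eye(3) + 0.5 * Phi
    else:
        R = (np.eye(3) + np.sin(theta) / theta * Phi
             + (1 - np.cos(theta)) / theta**2 * Phi @ Phi)
        V = (np.eye(3) + (1 - np.cos(theta)) / theta**2 * Phi
             + (theta - np.sin(theta)) / theta**3 * Phi @ Phi)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ rho
    return T

def stereo_project(p, f, cu, cv, b):
    """pi(.): 3D point in the camera frame -> (u_left, v_left, u_right)."""
    x, y, z = p
    return np.array([f * x / z + cu, f * y / z + cv, f * (x - b) / z + cu])

def stereo_backproject(y, f, cu, cv, b):
    """pi^{-1}(.): stereo observation -> 3D point (depth from disparity)."""
    ul, vl, ur = y
    z = f * b / (ul - ur)
    return np.array([(ul - cu) * z / f, (vl - cv) * z / f, z])

def residuals(T, y_prev, y_next, cam):
    """Stacked reprojection errors e_{l,t_i} of Equation (26)."""
    return np.concatenate([
        y1 - stereo_project((T @ np.append(stereo_backproject(y0, *cam), 1.0))[:3], *cam)
        for y0, y1 in zip(y_prev, y_next)])

def gauss_newton_pose(y_prev, y_next, cam, sigma_y, T0=np.eye(4), iters=10):
    """Minimize the weighted cost of Equation (28) with left-perturbed Gauss-Newton updates."""
    W = np.kron(np.eye(len(y_prev)), np.linalg.inv(sigma_y))  # block-diagonal weights
    T = T0.copy()
    for _ in range(iters):
        e = residuals(T, y_prev, y_next, cam)
        J = np.zeros((e.size, 6))
        eps = 1e-6
        for k in range(6):  # numerical Jacobian w.r.t. a left se(3) perturbation
            pert = np.zeros(6)
            pert[k] = eps
            J[:, k] = (residuals(se3_exp(pert) @ T, y_prev, y_next, cam) - e) / eps
        dxi = np.linalg.solve(J.T @ W @ J, -J.T @ W @ e)
        T = se3_exp(dxi) @ T
        if np.linalg.norm(dxi) < 1e-10:
            break
    return T

In practice the update would be computed with analytic Jacobians and wrapped in the RANSAC loop described above; the sketch only shows the structure of the minimization.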
2) Sparse Visual Odometry with Radial Distortion: Similar to [5], we modified our input images to test our network's ability to correct estimators that compute poses from degraded visual data. Unlike [5], who darken and blur their images, we chose to simulate a poorly calibrated lens model by applying radial distortion to the (rectified) KITTI dataset using a plumb-bob distortion model. The model computes radially-distorted image coordinates, $x_d, y_d$, from the normalized coordinates $x_n, y_n$ as
\[
\begin{bmatrix} x_d \\ y_d \end{bmatrix} = \left(1 + \kappa_1 r^2 + \kappa_2 r^4 + \kappa_3 r^6\right) \begin{bmatrix} x_n \\ y_n \end{bmatrix}, \tag{29}
\]
where $r = \sqrt{x_n^2 + y_n^2}$. We set the distortion coefficients $\kappa_1$, $\kappa_2$, and $\kappa_3$ to $-0.3$, $0.2$, and $0.01$ respectively, to approximately match the KITTI radial distortion parameters. We solved Equation (29) iteratively and used bilinear interpolation to compute the distorted images for every stereo pair in a sequence. Finally, we cropped each distorted image to remove any whitespace. Figure 4 illustrates this process.

With this distorted dataset, we computed S-VO localization estimates and then trained DPC-Net to correct for the effects of the radial distortion and the effective intrinsic parameter shift due to the cropping process.
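The sketch below illustrates one way to carry out this procedure (it is not the exact script used for the paper): for every pixel of the distorted output image, Equation (29) is inverted by fixed-point iteration to recover the corresponding rectified normalized coordinates, which are then sampled with bilinear interpolation. The intrinsics (fx, fy, cx, cy) and the function name are our own assumptions, and the final cropping step is omitted.

import numpy as np
from scipy.ndimage import map_coordinates

def distort_image(img, fx, fy, cx, cy, k1=-0.3, k2=0.2, k3=0.01, iters=20):
    """Synthesize a radially distorted image from a rectified one (Eq. (29))."""
    h, w = img.shape[:2]
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    # Normalized coordinates of each *distorted* output pixel.
    xd, yd = (u - cx) / fx, (v - cy) / fy
    # Invert Eq. (29) by fixed-point iteration: find (xn, yn) with (xd, yd) = f(r) * (xn, yn).
    xn, yn = xd.copy(), yd.copy()
    for _ in range(iters):
        r2 = xn**2 + yn**2
        f = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
        xn, yn = xd / f, yd / f
    # Map back to pixel coordinates in the rectified image and sample bilinearly.
    u_src, v_src = fx * xn + cx, fy * yn + cy
    channels = img[..., None] if img.ndim == 2 else img
    out = np.stack(
        [map_coordinates(channels[..., c].astype(np.float64), [v_src, u_src],
                         order=1, mode='constant', cval=0.0)
         for c in range(channels.shape[-1])], axis=-1)
    return out.squeeze()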
3) Dense Visual Odometry: Finally, we present localization estimates from a computationally-intensive, keyframe-based dense, direct visual localization pipeline [26] that computes relative camera poses by minimizing photometric error with respect to a keyframe image. To compute the photometric error, the pipeline relies on an inverse compositional approach to map the image coordinates of a tracking image to the image coordinates of the reference depth image. As the camera moves through an environment, a new keyframe depth image is computed and stored when the camera field of view differs sufficiently from that of the last keyframe.

We used this dense, direct estimator as our benchmark for a state-of-the-art visual localizer, and compared its accuracy to that of a much less computationally expensive sparse estimator paired with DPC-Net.
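For intuition, the following sketch computes the per-pixel photometric residuals that such a dense, direct method minimizes, by warping keyframe pixels with known depth into the tracking image. It uses a straightforward forward warp rather than the inverse compositional formulation of [26], and the function signature, intrinsics (fx, fy, cx, cy), and float-valued image arrays are illustrative assumptions.

import numpy as np
from scipy.ndimage import map_coordinates

def photometric_residuals(I_key, D_key, I_track, T_track_key, fx, fy, cx, cy):
    """Intensity differences I_track(w(x; T)) - I_key(x) for all valid-depth keyframe pixels."""
    h, w = I_key.shape
    v, u = np.mgrid[0:h, 0:w]
    valid = D_key > 0
    z = D_key[valid]
    # Back-project keyframe pixels to 3D and transform them into the tracking frame.
    X = np.stack([(u[valid] - cx) * z / fx, (v[valid] - cy) * z / fy, z, np.ones_like(z)])
    Xt = T_track_key @ X
    # Project into the tracking image and sample intensities bilinearly.
    ut = fx * Xt[0] / Xt[2] + cx
    vt = fy * Xt[1] / Xt[2] + cy
    warped = map_coordinates(I_track, [vt, ut], order=1, mode='nearest')
    return warped - I_key[valid]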
C. Evaluation Metrics

To evaluate the performance of DPC-Net, we use three error metrics: mean absolute trajectory error, cumulative absolute trajectory error, and mean segment error. For clarity, we describe each of these three metrics explicitly and stress the importance of carefully selecting and defining error metrics when comparing relative localization estimates, as results can be subtly deceiving.

1) Mean Absolute Trajectory Error (m-ATE): The mean absolute trajectory error averages the magnitude of the rotational or translational error of estimated poses with respect to a ground truth trajectory defined within the same navigation frame (for brevity, the notation $\ln(\cdot)^{\vee}$ returns rotational or translational components depending on context). Concretely, $e_{\text{m-ATE}}$ is defined as
\[
e_{\text{m-ATE}} \triangleq \frac{1}{N} \sum_{p=1}^{N} \left\lVert \ln\!\left(\hat{\mathbf{T}}_{p,0}^{-1}\,\mathbf{T}_{p,0}\right)^{\vee} \right\rVert. \tag{30}
\]
Although widely used, m-ATE can be deceiving because a single poor relative transform can significantly affect the final statistic.

2) Cumulative Absolute Trajectory Error (c-ATE): Cumulative absolute trajectory error sums rotational or translational $e_{\text{m-ATE}}$ up to a given point in a trajectory. It is defined as
\[
e_{\text{c-ATE}}(q) \triangleq \sum_{p=1}^{q} \left\lVert \ln\!\left(\hat{\mathbf{T}}_{p,0}^{-1}\,\mathbf{T}_{p,0}\right)^{\vee} \right\rVert. \tag{31}
\]
c-ATE can show clearer trends than m-ATE (because it is less affected by fortunate trajectory overlaps), but it still suffers from the same susceptibility to poor (but isolated) relative transforms.

3) Segment Error: Our final metric, segment error, averages the end-point error for all possible segments of a given length within a trajectory, and then normalizes by the segment length. Since it considers multiple starting points within a trajectory, segment error is much less sensitive to isolated degradations. Concretely, $e_{\text{seg}}(s)$ is defined as
\[
e_{\text{seg}}(s) \triangleq \frac{1}{s N_s} \sum_{p=1}^{N_s} \left\lVert \ln\!\left(\hat{\mathbf{T}}_{p+s_p,p}^{-1}\,\mathbf{T}_{p+s_p,p}\right)^{\vee} \right\rVert, \tag{32}
\]
where $N_s$ and $s_p$ (the number of segments of a given length, and the number of poses in each segment, respectively) are computed based on the selected segment length $s$. In this work, we follow the KITTI benchmark and report the mean segment error norms for all $s \in \{100, 200, 300, \ldots, 800\}$ (m).
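A compact sketch of all three metrics is given below, assuming lists of 4x4 homogeneous pose matrices expressed in a common frame; following the convention noted above, the translational and rotational components of ln(.)^v are approximated here by the translation norm and rotation angle of the pose error, a common simplification rather than the full SE(3) log map.

import numpy as np

def pose_error(T_est, T_gt):
    """Return (translational, rotational) magnitude of the error T_est^{-1} T_gt."""
    E = np.linalg.inv(T_est) @ T_gt
    trans = np.linalg.norm(E[:3, 3])
    rot = np.arccos(np.clip((np.trace(E[:3, :3]) - 1.0) / 2.0, -1.0, 1.0))
    return trans, rot

def m_ate(T_est, T_gt):
    """Equation (30): mean error magnitude over all poses."""
    errs = np.array([pose_error(a, b) for a, b in zip(T_est, T_gt)])
    return errs.mean(axis=0)  # (mean translational, mean rotational)

def c_ate(T_est, T_gt, q):
    """Equation (31): cumulative error magnitude up to pose q."""
    errs = np.array([pose_error(a, b) for a, b in zip(T_est[:q + 1], T_gt[:q + 1])])
    return errs.sum(axis=0)

def segment_error(T_est, T_gt, length):
    """Equation (32): mean end-point error over all segments of a given path
    length (metres), normalized by that length (KITTI-style)."""
    dist = np.concatenate([[0.0], np.cumsum(
        [np.linalg.norm(T_gt[i + 1][:3, 3] - T_gt[i][:3, 3]) for i in range(len(T_gt) - 1)])])
    errs = []
    for p in range(len(T_gt)):
        ends = np.where(dist >= dist[p] + length)[0]
        if ends.size == 0:
            break
        q = ends[0]
        # Relative motion over the segment for the estimate and the ground truth.
        rel_est = np.linalg.inv(T_est[p]) @ T_est[q]
        rel_gt = np.linalg.inv(T_gt[p]) @ T_gt[q]
        errs.append(pose_error(rel_est, rel_gt))
    return np.mean(errs, axis=0) / length if errs else None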
Fig. 4. Illustration of our image radial distortion procedure. Left: rectified RGB image (frame 280 from KITTI odometry sequence 05). Middle: the same image with radial distortion applied. Right: distorted, cropped, and scaled image.
Fig. 5. c-ATE and mean segment errors for S-VO with and without DPC-Net. Panels (a), (c), (e): cumulative translational and rotational error norms (m, deg) versus timestep for sequences 00, 02, and 05; panels (b), (d), (f): translational and rotational segment errors versus segment length (m) for the same sequences.
V. RESULTS & DISCUSSION

A. Correcting Sparse Visual Odometry

Figure 5 plots c-ATE and mean segment errors for test sequences 00, 02, and 05 for three different DPC-Net models paired with our S-VO pipeline.
Table I summarizes the results quantitatively, while Figure 6 plots the North-East projection of each trajectory. On average, DPC-Net trained with the full SE(3) loss reduced translational m-ATE by 72%, rotational m-ATE by 75%, translational mean segment errors by 40%, and rotational mean segment errors by 44% (relative to the uncorrected estimator). Mean segment errors of the sparse estimator with DPC-Net approached those observed from the dense estimator on sequence 00, and outperformed the dense estimator on 02. Sequence 05 produced two notable results: (1) although DPC-Net significantly reduced S-VO errors, the dense estimator still outperformed it in all statistics, and (2) the full SE(3) corrections performed slightly worse than their SO(3) counterparts. We suspect the latter effect is a result of motion estimates with predominantly rotational errors, which are easier to learn with an SO(3) loss.

In general, coupling DPC-Net with a simple frame-to-frame sparse visual localizer yielded a final localization pipeline with accuracy similar to that of a dense pipeline while requiring significantly less visual data (recall that DPC-Net uses resized images).

B. Distorted Images

Figure 7 plots mean segment errors for the radially distorted dataset. On average, DPC-Net trained with the full SE(3) loss reduced translational mean segment errors by 50% and rotational mean segment errors by 35% (relative to the uncorrected sparse estimator; see Table II). The yaw-only DPC-Net corrections did not produce consistent improvements (we suspect due to the presence of large errors in the remaining degrees of freedom as a result of the distortion procedure). Nevertheless, DPC-Net trained with SE(3) and SO(3) losses was able to significantly mitigate the effect of a poorly calibrated camera model. We are actively working on modifications to the network that would allow the corrected results to approach those of the undistorted case.

Fig. 7. c-ATE and segment errors for S-VO with radially distorted images with and without DPC-Net. Panels: (a) Sequence 00 (Distorted): segment errors; (c) Sequence 05 (Distorted): segment errors. Each panel shows translational error (average error, %) and rotational error (average error, deg/m) versus segment length (m) for S-VO, S-VO + DPC (rot), and S-VO + DPC (pose).

VI. CONCLUSIONS

We showed that coupling DPC-Net with a simple sparse visual odometry pipeline can produce localization accuracy similar to that of a state-of-the-art dense estimator, and significantly improve estimates computed with a poorly calibrated lens distortion model. In future work, we plan to investigate the addition of a memory state (through recurrent neural network structures), extend DPC-Net to other sensor modalities (e.g., lidar), and incorporate prediction uncertainty through the use of modern probabilistic approaches to deep learning [27].
REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] V. Peretroukhin, J. Kelly, and T. D. Barfoot, "Optimizing camera perspective for stereo visual odometry," in Canadian Conference on Computer and Robot Vision, May 2014, pp. 1–7.
[3] V. Peretroukhin, W. Vega-Brown, N. Roy, and J. Kelly, "PROBE-GK: Predictive robust estimation using generalized kernels," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May 2016, pp. 817–824.
[4] V. Peretroukhin, L. Clement, M. Giamou, and J. Kelly, "PROBE: Predictive robust estimation for visual-inertial navigation," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Syst. (IROS), 2015, pp. 3668–3675.
[5] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia, "Exploring representation learning with CNNs for frame-to-frame ego-motion estimation," IEEE Robot. Autom. Letters, vol. 1, no. 1, pp. 18–25, Jan. 2016.
[6] R. Clark, S. Wang, H. Wen, A. Markham, and N. Trigoni, "VINet: Visual-inertial odometry as a sequence-to-sequence learning problem," 2017. arXiv: 1701.08376 [cs.CV].
[7] A. Kendall and R. Cipolla, "Geometric loss functions for camera pose regression with deep learning," 2017. arXiv: 1704.00390 [cs.CV].
[8] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu, "Relative camera pose estimation using convolutional neural networks," 2017. arXiv: 1702.01381 [cs.CV].
[9] G. L. Oliveira, N. Radwan, W. Burgard, and T. Brox, "Topometric localization with deep learning," 2017. arXiv: 1706.08775 [cs.CV].
[10] V. Peretroukhin, L. Clement, and J. Kelly, "Reducing drift in visual odometry by inferring sun direction using a Bayesian convolutional neural network," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May 2017.
[11] D. Scaramuzza and F. Fraundorfer, "Visual odometry [tutorial]," IEEE Robot. Autom. Mag., vol. 18, no. 4, pp. 80–92, Dec. 2011.
[12] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Trans. Rob., vol. 32, no. 6, pp. 1309–1332, Dec. 2016.
TABLE I. m-ATE and mean segment errors for VO results with and without DPC-Net.

TABLE II. m-ATE and mean segment errors for VO results with and without DPC-Net for distorted images.