0% found this document useful (0 votes)
40 views10 pages

Geometric Loss Functions For Camera Pose Regression With Deep Learning

This document summarizes a research paper presented at the 2017 IEEE Conference on Computer Vision and Pattern Recognition. The paper proposes new geometric loss functions to improve the performance of PoseNet, a deep learning model for camera pose regression using single images. Specifically, it explores loss functions based on scene reprojection error and automatically learning optimal weightings of position and orientation during training. Evaluation on multiple datasets ranging from indoor rooms to urban areas shows the geometric methods enhance PoseNet's accuracy, narrowing the gap with traditional SIFT-based localization approaches.

Uploaded by

Item Songs
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
40 views10 pages

Geometric Loss Functions For Camera Pose Regression With Deep Learning

This document summarizes a research paper presented at the 2017 IEEE Conference on Computer Vision and Pattern Recognition. The paper proposes new geometric loss functions to improve the performance of PoseNet, a deep learning model for camera pose regression using single images. Specifically, it explores loss functions based on scene reprojection error and automatically learning optimal weightings of position and orientation during training. Evaluation on multiple datasets ranging from indoor rooms to urban areas shows the geometric methods enhance PoseNet's accuracy, narrowing the gap with traditional SIFT-based localization approaches.

Uploaded by

Item Songs
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 10

2017 IEEE Conference on Computer Vision and Pattern Recognition

Geometric loss functions for camera pose regression with deep learning

Alex Kendall and Roberto Cipolla


University of Cambridge
{agk34, rc10001}@cam.ac.uk

Abstract

Deep learning has shown to be effective for robust and


real-time monocular image relocalisation. In particular,
PoseNet [22] is a deep convolutional neural network which
learns to regress the 6-DOF camera pose from a single im-
age. It learns to localize using high level features and is
robust to difficult lighting, motion blur and unknown cam- Single RGB Input Image
era intrinsics, where point based SIFT registration fails.
However, it was trained using a naive loss function, with
hyper-parameters which require expensive tuning. In this
paper, we give the problem a more fundamental theoreti-
cal treatment. We explore a number of novel loss functions
for learning camera pose which are based on geometry and
scene reprojection error. Additionally we show how to au- 6-DoF Camera Pose
tomatically learn an optimal weighting to simultaneously
regress position and orientation. By leveraging geometry, Figure 1: PoseNet [22] is trained end-to-end to estimate the cam-
we demonstrate that our technique significantly improves era’s six degree of freedom pose from a single monocular image.
PoseNet’s performance across datasets ranging from indoor In this paper we show how to apply a principled loss function
rooms to a small city. based on the scene’s geometry to learn camera pose without any
hyper-parameters.

tures are often not robust enough for localising across dif-
1. Introduction ferent weather, lighting or environmental conditions. Addi-
Designing a system for reliable large scale localisa- tionally, they lack the ability to capture global context, and
tion is a challenging problem. The discovery of the po- require robust aggregation of hundreds of points to form a
sitioning system in mammalian brains, located in the hip- consensus to predict pose [57].
pocampus, was awarded the 2014 Nobel prize in Physiol- To address this problem, we introduced PoseNet [22, 19]
ogy or Medicine [36, 32]. It is an important problem for which uses end-to-end deep learning to predict camera pose
computer vision too, with localisation technology essential from a single input image. It was shown to be able lo-
for many applications including autonomous vehicles, un- calise more robustly using deep learning, compared with
manned aerial vehicles and augmented reality. State of the point features such as SIFT [30]. PoseNet learns a represen-
art localisation systems perform very well within controlled tation using the entire image context based on appearance
environments [24, 34, 12, 33, 44]. However, we are yet to and shape. These features generalise well and can localise
see their wide spread use in the wild because of their inabil- across challenging lighting and appearances changes. It is
ity to cope with large viewpoint or appearance changes. also fast, being able to regress the camera’s pose in only a
Many of the visual localisation systems use point land- few milliseconds. It is very scalable as it does not require
marks such as SIFT [30] or ORB [40] to localise. These a large database of landmarks. Rather, it learns a mapping
features perform well for incremental tracking and estimat- from pixels to a high dimensional space linear with pose.
ing ego-motion [33]. However, these point features are not The main weakness of PoseNet [22] was that despite
able to create a representation which is sufficiently robust to its scalability and robustness it did not produce metric ac-
challenging real-world scenarios. For example, point fea- curacy which is comparable to other geometric methods

1063-6919/17 $31.00 © 2017 IEEE 6555


DOI 10.1109/CVPR.2017.694
[44, 49]. In this paper we argue that a contributing factor methods require a 3D model with a large database of fea-
to this was because PoseNet naively applied a deep learning tures and efficient retrieval methods. They are expensive to
model end-to-end to learn camera pose. In this work, we compute, often do not scale well, and are often not robust to
reconsider this problem with a grounding in geometry. We changing environmental conditions [54].
wish to build upon the decades of research into multi-view In this work, we address the more challenging problem
geometry [14] to improve our ability to use deep learning to of metric localisation with deep learning. PoseNet [22] in-
regress camera pose. troduced the technique of training a convolutional neural
The main contribution of this paper is improving the per- network to regress camera pose. It combines the strengths
formance of PoseNet with geometrically formed loss func- of place recognition and localisation approaches: it can
tions. It is not trivial to simply regress position and rota- globally relocalise without a good initial pose estimate, and
tion quantities using supervised learning. PoseNet required produces a continuous metric pose. Rather than building a
a weighting factor to balance these two properties, but was map (or database of landmark features), the neural network
not tolerant to the selection of this hyperparameter. In Sec- learns features whose size, unlike a map, does not require
tion 3.3 we explore loss functions which remove this hy- memory linearly proportional to the size of the scene.
perparameter, or optimise it directly from the data. In Sec- Later work has extended PoseNet to use RGB-D input
tion 3.3.4 we show how to train directly from the scene ge- [25], learn relative ego-motion [31], improve the context of
ometry using the reprojection error. features [54], localise over video sequences [8] and interpret
In Section 4 we demonstrate our system on an array relocalisation uncertainty with Bayesian Neural Networks
of datasets, ranging from individual indoor rooms, to the [19]. Additionally, [54] demonstrate PoseNet’s efficacy on
Dubrovnik city dataset [26]. We show that our geometric featureless indoor environments, where they demonstrate
approach can improve PoseNet’s efficacy across many dif- that SIFT based structure from motion techniques fail in the
ferent datasets – narrowing the deficit to traditional SIFT same environment.
feature-based algorithms. For outdoor scenes ranging from Although PoseNet is scalable and robust [22], it does not
50, 000m2 to 2km2 we can achieve relocalisation accura- produce sufficiently accurate estimates of Pose compared
cies of a few meters and a few degrees. In small rooms we to traditional methods [44]. It was designed with a naive
are able to achieve accuracies of 0.2 − 0.4m. regression loss function which trains the network end-to-
end without any consideration for geometry. This problem
2. Related work is the focus of this paper – we do not want to throw away
the decades of research into multi view geometry [14]. We
Large scale localisation research can be divided into two improve PoseNet’s performance by learning camera pose
categories; place recognition and metric localisation. Place with a fundamental treatment of scene geometry.
recognition discretises the world into a number of land-
marks and attempts to identify which place is visible in a 3. Model for camera pose regression
given image. Traditionally, this has been modelled as an
image retrieval problem [6, 9, 53, 45] enabling the use of In this section we describe the details of the convolu-
efficient and scalable retrieval approaches [35, 38] such as tional neural network model we train to estimate camera
Bag-of-Words (BoW) [47], VLAD [17, 10], and Fisher vec- pose directly from a monocular image, I. Our network out-
tors [18]. Deep learning models have also been shown to puts an estimate, p̂, for pose, p, given by a 3-D camera po-
be effective for creating efficient descriptors. Many ap- sition x̂ and orientation q̂. We use a quaternion to represent
proaches leverage classification networks [39, 13, 3, 52], orientation, for reasons discussed in Section 3.2. Pose p is
and fine tune them on localisation datasets [4]. Other work defined relative to an arbitrary global reference frame. In
of note is PlaNet [55] which trained a classification network practice we centre this global reference frame at the mean
to localise images on a world scale. However, all these net- location of all camera poses. We train the model with su-
works must discretise the world into places and are unable pervised learning using pose labels, p = [x, q], obtained
to produce a fine grained estimate of 6-DOF pose. through structure from motion, or otherwise (Section 4.1).
In contrast, metric localisation techniques estimate the
3.1. Architecture
metric position and orientation of the camera. Tradition-
ally, this has been approached by computing the pose from Our pose regression formulation is capable of being ap-
2D-3D correspondences between 2D features in the query plied to any neural network trained through back propaga-
image and 3D points in the model, which are determined tion. For the experiments in this paper we adapt a state of
through descriptor matching [7, 28, 27, 42, 49]. This as- the art deep neural network architecture for classification,
sumes that the scene is represented by a 3D structure-from- GoogLeNet [51], as a basis for developing our pose regres-
motion model. The full 6 degree-of-freedom pose of a query sion network. This allows us to use pretrained weights,
image can be estimated very precisely [44]. However these for example from a model trained to classify images in the

6556
ImageNet dataset [11]. We observe that these pretrained pensate for this.
features regularise and improve performance in PoseNet
3.3. Loss function
through transfer learning [37]. Although, to generalise
PoseNet, we may apply it to any deep architecture designed This far, we have described the structure of the pose rep-
for image classification as follows: resentation we would like our network to learn. Next, we
1. Remove the final linear regression and softmax layers discuss how to design an effective loss function to learn to
used for classification estimate the camera’s 6 degree of freedom pose. This is a
2. Append a linear regression layer. This fully connected particularly challenging objective because it involves learn-
layer is designed to output a seven dimensional pose ing two distinct quantities - rotation and translation - with
vector representing position (3 dimensions) and orien- different units and scales.
tation (4 dimensional quaternion) This section defines a number of loss functions and ex-
plores their efficacy for camera pose regression. We be-
3. Insert a normalisation layer to normalise the four di-
gin in Section 3.3.2 by describing the original weighted loss
mensional quaternion orientation vector to unit length
function which was proposed by PoseNet [22]. We improve
3.2. Pose representation on this in Section 3.3.3 by introducing a novel loss function
which can learn the weighting between rotation and transla-
An important consideration when designing a machine tion automatically, using an estimate of the homoscedastic
learning system is the representation space of the output. task uncertainty. Further, in Section 3.3.4 we describe a
We can easily learn camera position in Euclidean space loss function which combines position and orientation as a
[22]. However, learning orientation is more complex. In single scalar using the reprojection error geometry. In Sec-
this section we compare a number of different parametri- tion 4.2 we compare the performance of these loss func-
sations used to express rotational quantities; Euler angles, tions, and discusses their trade-offs.
axis-angle, SO(3) rotation matrices and quaternions [2].
We evaluate their efficacy for deep learning.
3.3.1 Learning position and orientation
Firstly, Euler angles are easily understandable as an in-
terpretable parametrisation of 3-D rotation. However, they We can learn to estimate camera position by forming a
have two problems. Euler angles wrap around at 2π radi- smooth, continuous and injective regression loss in Eu-
ans, having multiple values representing the same angle. clidean space, Lx (I) = x − x̂γ , with norm given by γ
Therefore they are not injective, which causes them to be ([22] used the L2 Euclidean norm).
challenging to learn as a uni-modal scalar regression task. However, learning camera orientation is not as simple.
Additionally, they do not provide a unique parametrisation In Section 3.2 we described a number of options for repre-
for a given angle and suffer from the well-studied problem senting orientation. Quaternions are an attractive choice for
of gimbal lock [2]. The axis-angle representation is another deep learning because they are easily formulated in a con-
three dimensional vector representation. However like Eu- tinuous and differentiable way. The set of rotations lives
ler angles, it too suffers from a repetition around the 2π on the unit sphere in quaternion space. We can easily map
radians representation. any four dimensional vector to a valid quaternion rotation
Rotation matrices are a over-parametrised representation by normalising it to unit length. [22] demonstrates how to
of rotation. For 3-D problems, the set of rotation matrices learn to regress quaternion values:
are 3×3 dimensional members of the special orthogonal Lie  
 q̂ 
group, SO(3). These matrices have a number of interesting 
Lq (I) = q −  (1)
q̂ 
γ
properties, including orthonormality. However, it is diffi-
cult to enforce the orthogonality constraint when learning a Using a distance norm, γ, in Euclidean space makes no ef-
SO(3) representation through back-propagation. fort to keep q on the unit sphere. We find, however, that
In this work, we chose quaternions as our orientation rep- during training, q becomes close enough to q̂ such that the
resentation. Quaternions are favourable because arbitrary distinction between spherical distance and Euclidean dis-
four dimensional values are easily mapped to legitimate ro- tance becomes insignificant. For simplicity, and to avoid
tations by normalizing them to unit length. This is a simpler hampering the optimization with unnecessary constraints,
process than the orthonormalization required of rotation we chose to omit the spherical constraint. The main prob-
matrices. Quaternions are a continuous and smooth repre- lem with Quaternions is that they are not injective because
sentation of rotation. They lie on the unit manifold, which they have two unique values (from each hemisphere) which
is a simple constraint to enforce through back-propagation. map to a single rotation. This is because quaternion, q, is
Their main downside is that they have two mappings for identical to −q. To address this, we constrain all quater-
each rotation, one on each hemisphere. However, in Sec- nions to one hemisphere such that there is a unique value
tion 3.3.1 we show how to adjust the loss function to com- for each rotation.

6557
3
3.3.2 Simultaneously learning position and orientation Position
Position

Median Orientation Error (◦ )


6

Median Position Error (m)


Orientation
The challenging aspect of learning camera pose is design-
2.5
ing a loss function which is able to learn both position and
orientation. Initially, we proposed a method to combine po- 4

sition and orientation into a single loss function with a linear 2

weighted sum [22], shown in (2): 2


1.5
Lβ (I) = Lx (I) + βLq (I) (2) 128 256 512 1,024
Beta Weight β
Because x and q are expressed in different units, a scal-
ing factor, β, is used to balance the losses. This hyper- Figure 2: Relative performance of position and orientation regres-
parameter attempts to keep the expected value of position sion on a single model with a range of scale factors for an indoor
and orientation errors approximately equal. scene from the King’s College scene in Cambridge Landmarks, us-
Interestingly, we observe that a model which is jointly ing the loss function in (2). This demonstrates that learning with
trained to regress the camera’s position and orientation per- the optimum scale factor leads to the network uncovering a more
forms better than separate models trained on each task indi- accurate pose function.
vidually. Figure 2 shows that with just position, or just ori-
entation information, the network was not able to determine
the function representing camera pose with as great accu- which is a function of the input data [20]. Rather, it captures
racy. The model learns a better representation for pose when the uncertainty of the task itself. In [21] we show how to use
supervised with both translation and orientation labels. We this insight to combine losses for different tasks in a prob-
also experimented with branching the network lower down abilistic manner. Here we show how to apply this to learn
into two separate components to regress position and orien- camera position and orientation (with a Laplace likelihood):
tation. However, we found that it too was less effective, for Lσ (I) = Lx (I)σ̂x−2 + log σ̂x2 + Lq (I)σ̂q−2 + log σ̂q2 (3)
similar reasons: separating into distinct position and ori-
entation features denies each the information necessary to where we optimise the homoscedastic uncertainties, σ̂x2 , σ̂q2 ,
factor out orientation from position, or vice versa. through backpropagation with respect to the loss function.
However the consequence of this was that the hyper- These uncertainties are free scalar values, not model out-
parameter β required significant tuning to get reasonable puts. They represent the homoscedastic (task) noise.
results. In the loss function (2) a balance β must be struck This loss consists of two components; the residual re-
between the orientation and translation penalties (Figure 2). gressions and the uncertainty regularization terms. We learn
They are highly coupled as they are regressed from the same the variance, σ 2 , implicitly from the loss function. As the
model weights. We found β to be greater for outdoor scenes variance is larger, it has a tempering effect on the residual
as position errors tended to be relatively greater. Following regression term; larger variances (or uncertainty) results in
this intuition it is possible to fine tune β using grid search. a smaller residual loss. The second regularization term pre-
For the indoor scenes it was between 120 to 750 and out- vents the network from predicting infinite uncertainty (and
door scenes between 250 to 2000. This is an expensive task therefore zero loss). As we expect quaternion values to have
in practice, as each experiment can take days to complete. much smaller values (they are constrained to the unit man-
It is desirable to find a loss function which removes this hy- ifold), their noise, σq2 should be much smaller than the po-
perparameter. Therefore, the remainder of this section ex- sition noise, σx2 , which can be many meters in magnitude.
plores different loss functions which aim to find an optimal As σq2 should be much smaller than σx2 , orientation regres-
weighting automatically. sion should be weighted much higher than position – with a
similar effect to β in (2).
3.3.3 Learning an optimal weighting In practice, we learn ŝ := log σ̂ 2 because it is more nu-
merically stable [21]:
Ideally, we would like a loss function which is able to learn
position and orientation optimally, without including any Lσ (I) = Lx (I) exp(−ŝx )+ŝx +Lq (I) exp(−ŝq )+ŝq (4)
hyper parameters. For this reason, we propose a novel loss
function which is able to learn a weighting between the po- This is more numerically stable than regressing the vari-
sition and orientation objective functions. We formulate it ance, σ 2 , because the loss avoids a potential division by
using homoscedastic uncertainty which we can learn us- zero. The exponential mapping also allows us to regress
ing probabilistic deep learning [20]. Homoscedastic uncer- unconstrained scalar values, where exp(−si ) is resolved to
tainty is a measure of uncertainty which does not depend the positive domain giving valid values for variance. We
on the input data, as opposed to heteroscedastic uncertainty find that this loss is very robust to our initialisation choice

6558
for the homoscedastic task uncertainty values. Only an ap- The key advantage of this loss is that it allows the model
proximate initial guess is required, we arbitrarily use initial to vary the weighting between position and orientation, de-
values of ŝx = 0.0, ŝq = −3.0, for all scenes. pending on the specific geometry in the training image. For
example, training images with geometry which is far away
3.3.4 Learning from geometric reprojection error would balance rotational and translational loss differently
to images with geometry very close to the camera.
Perhaps a more desirable loss is one that does not require
balancing of rotational and positional quantities at all. Re- Interestingly, when experimenting with the original
projection error of scene geometry is a representation which weighted loss in (2) we observed that the hyperparameter
combines rotation and translation naturally in a single scalar β was an approximate function of the scene geometry. We
loss [14]. Reprojection error is given by the residual be- observed that it was a function of the landmark distance and
tween 3-D points in the scene projected onto a 2-D image size in the scene. Our intuition was that the optimal choice
plane using the ground truth and predicted camera pose. It for β was approximating the reprojection error in the scene
therefore converts rotation and translation quantities into geometry. For example, if the scene is very far away, then
image coordinates. This naturally weights translation and rotation is more significant than translation and vice versa.
rotation quantities depending on the scene and camera ge- This function is not trivial to model for complex scenes with
ometry. a large number of landmarks. It will vary significantly with
To formulate this loss, we first define a function, π, each training example in the dataset. By learning with re-
which maps a 3-D point, g, to 2-D image coordinates, projection error we can use our knowledge of the scene ge-
(u, v)T : ometry more directly to automatically infer this weighting.
 
u Projecting geometry through a projection model is a
π(x, q, g) → (5)
v differentiable operation involving matrix multiplication.
Therefore we can use this loss to train our model with
where x and q represent the camera position and orienta-
stochastic gradient descent. It is important to note that
tion. This function, π, is defined as:
we do not need to know the intrinsic camera parameters to
⎛ ⎞
u      project this 3-D geometry. This is because we apply the
⎝ v  ⎠ = K(Rg + x), u u /w same projection to both the model prediction and ground
= (6)
 v v  /w truth measurement, so we can use arbitrary values.
w
It should be noted that we need to have some knowledge
where K is the intrinsic calibration matrix of the camera, of the scene’s geometry in order to have 3-D points to repro-
and R is the mapping of q to its SO(3) rotation matrix, ject. The geometry is often known; if our data is obtained
q4×1 → R3×3 . through structure from motion, RGBD data or other sensory
We formulate this loss by taking the norm of the repro- data (see Section 4.1). Only points from the scene which
jection error between the predicted and ground truth camera are visible in the image I are used to compute the loss. We
pose. We take the subset, G  , of all 3-D points in the scene, also found it was important for numerical stability to ignore
G, which are visible in the image I. The final loss (7) is points which are projected outside the image bounds.
given by the mean of all the residuals from points, gi ∈ G  :
1
Lg (I) = π(x, q, gi ) − π(x̂, q̂, gi )γ (7)
|G  |
gi ∈G 
3.3.5 Regression norm
where x̂ and q̂ are the predicted camera poses from
PoseNet, with x and q the ground truth label, with norm, An important choice for these losses is the regression norm,
γ, which is discussed in Section 3.3.5.  γ . Typically, deep learning models use an L1 =  1 or
Note that because we are projecting 3-D points using L2 =  2 . We can also use robust norms such as Huber’s
both the ground truth and predicted camera pose we can loss [16] and Tukey’s loss [15], which have been success-
apply any arbitrary camera model, as long as we use the fully applied to deep learning [5]. For camera pose regres-
same intrinsic parameters for both cameras. Therefore for sion, we found that they negatively impacted performance
simplicity, we set the camera intrinsics, K, to the identity by over-attenuating difficult examples. We suspect that for
matrix – camera calibration is not required. more noisy datasets these robust regression functions might
This loss implicitly combines rotation and translational be beneficial. With the datasets used in this paper, we found
quantities into image coordinates. Minimising reprojec- the L1 norm to perform best and therefore use use γ = 1.
tion error is often the most desirable balance between these It does not increase quadratically with magnitude or over-
quantities for many applications, such as augmented reality. attenuate large residuals.

6559
(a) 7 Scenes Dataset - 43,000 images from seven scenes in small indoor environments [46].

(b) Cambridge Landmarks Dataset - over 10,000 images from six scenes around Cambridge, UK [22].

(c) Dubrovnik 6K Dataset - 6,000 images from a variety of camera types in Dubrovnik, Croatia [29].

Figure 3: Example images randomly chosen from each dataset. This illustrates the wide variety of settings and scales and the challenging
array of environmental factors such as lighting, occlusion, dynamic objects and weather which are captured in each dataset.

Dataset Type Scale Imagery Scenes Train Images Test Images 3-D Points Spatial Area
7 Scenes [46] Indoor Room RGB-D sensor (Kinect) 7 26,000 17,000 - 4×3m
Cambridge Landmarks [22] Outdoor Street Mobile phone camera 6 8,380 4,841 2,097,191 100×500m
Dubrovnik 6K [26] Outdoor Small town Internet images (Flikr) 1 6,044 800 2,106,456 1.5×1.5km

Table 1: Summary of the localisation datasets used in this paper’s experiments. These datasets are all publicly available. They
demonstrate our method’s performance over a range of scales for both indoor and outdoor applications.

4. Experiments Flow [1]. All models are optimised end-to-end with ADAM
[23] using the default parameters and a learning rate of
To train and benchmark our model on a number of 1 × 10−4 . We train each model until the training loss con-
datasets we rescale the input images such that the short- verges. We use a batch size of 64 on a NVIDIA Titan X
est side is of length 256. We normalise the images so that (Pascal) GPU, training takes approximately 20k - 100k iter-
input pixel intensities range from −1 to 1. We train our ations, or 4 hours - 1 day.
PoseNet architecture using an implementation in Tensor-

6560
Cambridge Landmarks, King’s College [22] Dubrovnik 6K [29]
Median Error Accuracy Median Error Accuracy
Loss function x [m] q [◦ ] < 2m, 5◦ [%] x [m] q [◦ ] < 10m, 10◦ [%]
Linear sum, β = 500 (2) 1.52 1.19 65.0% 13.1 4.68 30.1%
Learn weighting with homoscedastic uncertainty (3) 0.99 1.06 85.3% 9.88 4.73 41.7%
Reprojection loss does not converge
Learn weighting pretraining → Reprojection loss (7) 0.88 1.04 90.3% 7.90 4.40 48.6%

Table 2: Comparison of different loss functions. We use an L1 distance for the residuals in each loss. Linear sum combines position and
orientation losses with a constant scaling parameter β [19] and is defined in (2). Learn weighting is the loss function in (3) which learns to
combine position and orientation using homoscedastic uncertainty. Reprojection error implicitly combines rotation and translation by using
the reprojection error of the scene geometry as the loss (7). We find that homoscedastic uncertainty is able to learn an effective weighting
between position and orientation quantities. The reprojection loss was not able to converge from random initialisation. However, when
used to fine-tune a network pretrained with (3) it yields the best results.

4.1. Datasets 4.2. Comparison of loss functions


Deep learning performs extremely well on large datasets. In Table 2 we compare different combinations of losses
However annotating ground truth labels on these datasets is and regression norms. We compare results for a scene in the
often expensive or very labour intensive. We can leverage Cambridge Landmarks dataset [22] and the Dubrovnik 6K
structure from motion [48], or similar algorithms [46], to dataset [26], which has imagery from a range of cameras.
autonomously generate training labels (camera poses) from We find that modelling homoscedastic uncertainty with
image data [22]. We use three datasets to benchmark our the loss in (3) is able to effectively learn a weighting be-
approach. These datasets are summarised in Table 1 and ex- tween position and orientation. It outperforms the constant
ample imagery is shown in Figure 3. We use these datasets weighting used in loss (2). The reprojection loss in (7) is
to demonstrate our method’s performance across a range of unable to train the model from a random initialisation. We
settings and scales. We endeavour to demonstrate the gen- observe that the model gets stuck in a poor local minima,
eral applicability of the approach. when using any of the regression norms. However, the re-
Cambridge Landmarks [22] provides labelled video projection loss is able to improve localisation performance
data to train and test pose regression algorithms in an out- when using weights pretrained with any of the other losses.
door urban setting. It was collected using a smart phone For example, we can take the best performing model using
and structure from motion was used to generate the pose la- the loss from (3) and fine tune with the reprojection loss
bels [56]. Significant urban clutter such as pedestrians and (7). We observe that this loss is then able to converge ef-
vehicles were present and data was collected from many fectively. This shows that the reprojection loss is not robust
different points in time representing different lighting and to large residuals. This is because reprojected points can
weather conditions. Train and test images are taken from be easily placed far from the image centre if the network
distinct walking paths and not sampled from the same tra- makes a poor pose prediction. Therefore, we recommend
jectory making the regression challenging. the following two-step training scheme:
1. Train the model using the loss in (3), learning the
7 Scenes [46] is an indoor dataset which was collected
weighting between position and orientation.
with a Kinect RGB-D sensor. Ground truth poses were
computed using Kinect Fusion [46]. The dataset contains 2. If the scene geometry is known (for example from
seven scenes which were captured around an office build- structure from motion or RGBD camera data) then
ing. Each scene typically consists of a single room. The fine-tune the model using the reprojection loss in (7).
dataset was originally created for RGB-D relocalization. It
is extremely challenging for purely visual relocalization us-
4.3. Benchmarking localisation accuracy
ing SIFT-like features, as it contains many ambiguous tex- In Table 3 we show that our geometry based loss
tureless features. outperforms the original PoseNet’s naive loss function
Dubrovnik 6K [26] is a dataset consisting of 6,044 train [22]. We observe a consistent and significant improvement
and 800 test images which were obtained from the internet. across both indoor 7 Scenes outdoor Cambridge Landmarks
They are taken from Dubrovnik’s old town in Croatia which datasets. We conclude that we can simultaneously learn
is a UNESCO world heritage site. The images are predom- both position and orientation more effectively by consid-
inantly captured by tourists with a wide variety of camera ering scene geometry. The improvement is notably more
types. Ground truth poses for this dataset were computed pronounced for the 7Scenes dataset. We believe this is
using structure from motion. due to the significantly larger amount of training data for

6561
Area or Active Search PoseNet Bayesian PoseNet PoseNet (this work) PoseNet (this work)
Scene Volume (SIFT) [43] (β weight) [22] PoseNet [19] Spatial LSTM [54] Learn σ 2 Weight Geometric Reprojection
Great Court 8000m2 – – – – 7.00m, 3.65◦ 6.83m, 3.47◦
King’s College 5600m2 0.42m, 0.55◦ 1.66m, 4.86◦ 1.74m, 4.06◦ 0.99m, 3.65◦ 0.99m, 1.06◦ 0.88m, 1.04◦
Old Hospital 2000m2 0.44m, 1.01◦ 2.62m, 4.90◦ 2.57m, 5.14◦ 1.51m, 4.29◦ 2.17m, 2.94◦ 3.20m, 3.29◦
Shop Façade 875m2 0.12m, 0.40◦ 1.41m, 7.18◦ 1.25m, 7.54◦ 1.18m, 7.44◦ 1.05m, 3.97◦ 0.88m, 3.78◦
St Mary’s Church 4800m2 0.19m, 0.54◦ 2.45m, 7.96◦ 2.11m, 8.38◦ 1.52m, 6.68◦ 1.49m, 3.43◦ 1.57m, 3.32◦
Street 50000m2 0.85m, 0.83◦ – – – 20.7m, 25.7◦ 20.3m, 25.5◦

Chess 6m3 0.04m, 1.96◦ 0.32m, 6.60◦ 0.37m, 7.24◦ 0.24m, 5.77◦ 0.14m, 4.50◦ 0.13m, 4.48◦
Fire 2.5m3 0.03m, 1.53◦ 0.47m, 14.0◦ 0.43m, 13.7◦ 0.34m, 11.9◦ 0.27m, 11.8◦ 0.27m, 11.3◦
Heads 1m3 0.02m, 1.45◦ 0.30m, 12.2◦ 0.31m, 12.0◦ 0.21m, 13.7◦ 0.18m, 12.1◦ 0.17m, 13.0◦
Office 7.5m3 0.09m, 3.61◦ 0.48m, 7.24◦ 0.48m, 8.04◦ 0.30m, 8.08◦ 0.20m, 5.77◦ 0.19m, 5.55◦
Pumpkin 5m3 0.08m, 3.10◦ 0.49m, 8.12◦ 0.61m, 7.08◦ 0.33m, 7.00◦ 0.25m, 4.82◦ 0.26m, 4.75◦
Red Kitchen 18m3 0.07m, 3.37◦ 0.58m, 8.34◦ 0.58m, 7.54◦ 0.37m, 8.83◦ 0.24m, 5.52◦ 0.23m, 5.35◦
Stairs 7.5m3 0.03m, 2.22◦ 0.48m, 13.1◦ 0.48m, 13.1◦ 0.40m, 13.7◦ 0.37m, 10.6◦ 0.35m, 12.4◦

Table 3: Median localization results for the Cambridge Landmarks [22] and 7 Scenes [46] datasets. We compare the performance of
various RGB-only algorithms. Active search [43] is a state-of-the-art traditional SIFT keypoint based baseline. We demonstrate a notable
improvement over PoseNet’s [22] baseline performance using the learned σ 2 and reprojection error proposed in this paper, narrowing the
margin to the state of the art SIFT technique.

each scene in this dataset, compared with Cambridge Land- Position Orientation
Method Mean [m] Median [m] Mean [◦ ] Median [◦ ]
marks. We also outperform the improved PoseNet archi- PoseNet (this work) 40.0 7.9 11.2 4.4
tecture with spatial LSTMs [54]. However, this method APE [50] - 0.56 - -
is complimentary to the loss functions in this paper, and it Voting [57] - 1.69 - -
would be interesting to explore the union of these ideas. Sattler, et al. [41] 14.9 1.3 - -
P2F [29] 18.3 9.3 - -
We observe a difference in relative performance between
position and orientation when optimising with respect to re- Table 4: Localisation results on the Dubrovnik dataset [26],
projection error (7) or homoscedastic uncertainty (3). Over- comparing to a number of state-of-the-art point-feature tech-
all, optimising reprojection loss improves rotation accuracy, niques. Our method is the first deep learning approach to bench-
sometimes at the expense of some positional precision. mark on this challenging dataset. We achieve comparable perfor-
mance, while our method only requires a 256×256 pixel image
4.4. Comparison to SIFT-feature approaches and is much faster to compute.

Table 3 also compares to a state-of-the-art traditional


SIFT feature based localisation algorithm, Active Search faster than these approaches. Furthermore, it is worth not-
[43]. This method outperforms PoseNet, and is effec- ing that PoseNet only sees a 256 × 256 resolution image,
tive in feature-rich outdoor environments. However, in the while these methods register the full size images, often with
7Scenes dataset this deficit is less pronounced. The indoor a few million pixels.
scenes contain much fewer point features and there is signif-
icantly more training data. As an explanation for the deficit
in these results, PoseNet only uses 256 × 256 pixel images, 5. Conclusions
while SIFT based methods require images of a few mega-
pixels in size [43]. Additionally, PoseNet is able to localise We have investigated a number of loss functions for
an image in 5ms, scaling constantly with scene area, while learning to regress position and orientation simultaneously
traditional SIFT feature approaches require over 100ms, with scene geometry. We present an algorithm for training
and scale with scene size [43]. PoseNet which does not require any hyper-parameter tun-
In Table 4 we compare our approach on the Dubrovnik ing. We demonstrate PoseNet’s efficacy on three large scale
dataset to other geometric techniques which localise by reg- datasets. We observe a large improvement of results com-
istering SIFT features [30] to a large 3-D model [26]. Al- pared to the original loss proposed by PoseNet, narrowing
though our method improves significantly over the original the performance gap to traditional point-feature approaches.
PoseNet model, it is still yet to reach the fine grained accu- For many applications which require localization, such
racy of these methods [50, 57, 41, 29]. We hypothesise that as mobile robotics, video data is readily available. Ulti-
this is due to a lack of training data, with only 6k images mately, we would like to extend the architecture to video
across the town. However, our algorithm is significantly input with further use of multi-view stereo [14].

6562
References [21] A. Kendall, Y. Gal, and R. Cipolla. Multi-task deep learning using
task-dependant homoscedastic uncertainty for depth regression, se-
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, mantic and instance segmentation. arXiv preprint arXiv, 2017.
G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Good- [22] A. Kendall, M. Grimes, and R. Cipolla. Posenet: A convolutional
fellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, network for real-time 6-dof camera relocalization. arXiv preprint
M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, arXiv:1505.07427, 2015.
C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, [23] D. Kingma and J. Ba. Adam: A method for stochastic optimization.
P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. War- arXiv preprint arXiv:1412.6980, 2014.
den, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: [24] G. Klein and D. Murray. Parallel tracking and mapping for small
Large-scale machine learning on heterogeneous systems, 2015. Soft- ar workspaces. In Mixed and Augmented Reality, IEEE and ACM
ware available from tensorflow.org. International Symposium on, pages 225–234. IEEE, 2007.
[2] S. L. Altmann. Rotations, quaternions, and double groups. Courier [25] R. Li, Q. Liu, J. Gui, D. Gu, and H. Hu. Indoor relocalization in
Corporation, 2005. challenging environments with dual-stream convolutional neural net-
[3] A. Babenko and V. Lempitsky. Aggregating deep convolutional fea- works. IEEE Transactions on Automation Science and Engineering,
tures for image retrieval. In International Conference on Computer 2017.
Vision (ICCV), 2015. [26] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua. Worldwide pose esti-
[4] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural mation using 3d point clouds. In European Conference on Computer
codes for image retrieval. In European Conference on Computer Vision, pages 15–29. Springer, 2012.
Vision, 2014. [27] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua. Worldwide Pose Es-
[5] V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab. Robust timation Using 3D Point Clouds. In European Conference on Com-
optimization for deep regression. In International Conference on puter Vision, 2012.
Computer Vision (ICCV), pages 2830–2838. IEEE, 2015. [28] Y. Li, N. Snavely, and D. P. Huttenlocher. Location Recognition us-
[6] D. Chen, G. Baatz, K. Köser, S. Tsai, R. Vedantham, T. Pylvänäinen, ing Prioritized Feature Matching. In European Conference on Com-
K. Roimela, X. Chen, J. Bach, M. Pollefeys, B. Girod, and puter Vision, 2010.
R. Grzeszczuk. City-scale landmark identification on mobile devices. [29] Y. Li, N. Snavely, and D. P. Huttenlocher. Location recognition using
In Proceedings of the IEEE Conference on Computer Vision and Pat- prioritized feature matching. In European Conference on Computer
tern Recognition, 2011. Vision, pages 791–804. Springer, 2010.
[7] S. Choudhary and P. J. Narayanan. Visibility probability structure [30] D. G. Lowe. Distinctive image features from scale-invariant key-
from sfm datasets and applications. In European Conference on points. International journal of computer vision, 60(2):91–110,
Computer Vision, 2012. 2004.
[8] R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen. Vidloc: [31] I. Melekhov, J. Kannala, and E. Rahtu. Relative camera pose
6-dof video-clip relocalization. arXiv preprint arXiv:1702.06521, estimation using convolutional neural networks. arXiv preprint
2017. arXiv:1702.01381, 2017.
[9] M. Cummins and P. Newman. FAB-MAP: Probabilistic localization
[32] E. I. Moser, E. Kropff, and M.-B. Moser. Place cells, grid cells,
and mapping in the space of appearance. The International Journal
and the brain’s spatial representation system. Annu. Rev. Neurosci.,
of Robotics Research, 27(6):647–665, 2008.
31:69–89, 2008.
[10] J. Delhumeau, P.-H. Gosselin, H. Jégou, and P. Pérez. Revisiting the
[33] R. Mur-Artal, J. Montiel, and J. D. Tardós. Orb-slam: a versatile and
VLAD image representation. In ACM Multimedia, Barcelona, Spain,
accurate monocular slam system. IEEE Transactions on Robotics,
Oct. 2013.
31(5):1147–1163, 2015.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Ima-
[34] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. DTAM: Dense
genet: A large-scale hierarchical image database. In Proceedings of
tracking and mapping in real-time. In International Conference on
the IEEE Conference on Computer Vision and Pattern Recognition,
Computer Vision (ICCV), pages 2320–2327. IEEE, 2011.
pages 248–255. IEEE, 2009.
[35] D. Nister and H. Stewenius. Scalable recognition with a vocabulary
[12] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct
tree. In Proceedings of the IEEE Conference on Computer Vision
monocular slam. In European Conference on Computer Vision, pages
and Pattern Recognition, 2006.
834–849. Springer, 2014.
[36] J. O’keefe and L. Nadel. The hippocampus as a cognitive map, vol-
[13] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless
ume 3. Clarendon Press Oxford, 1978.
pooling of deep convolutional activation features. In European Con-
ference on Computer Vision, 2014. [37] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transfer-
[14] R. Hartley and A. Zisserman. Multiple view geometry in computer ring mid-level image representations using convolutional neural net-
vision. Cambridge university press, 2003. works. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 1717–1724. IEEE, 2014.
[15] D. C. Hoaglin, F. Mosteller, and J. W. Tukey. Understanding robust
and exploratory data analysis, volume 3. Wiley New York, 1983. [38] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object Re-
trieval with Large Vocabularies and Fast Spatial Matching. In Pro-
[16] P. J. Huber. Robust statistics. Springer, 2011.
ceedings of the IEEE Conference on Computer Vision and Pattern
[17] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local
Recognition, 2007.
descriptors into a compact image representation. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, [39] A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson. A baseline
pages 3304–3311, jun 2010. for visual instance retrieval with deep convolutional networks. In
arXiv:1412.6574, 2014.
[18] H. Jegou, F. Perronnin, M. Douze, J. Sánchez, P. Perez, and
C. Schmid. Aggregating local image descriptors into compact codes. [40] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. Orb: An efficient
IEEE transactions on pattern analysis and machine intelligence, alternative to sift or surf. In International conference on computer
34(9):1704–1716, 2012. vision (ICCV), pages 2564–2571. IEEE, 2011.
[19] A. Kendall and R. Cipolla. Modelling uncertainty in deep learning [41] T. Sattler, B. Leibe, and L. Kobbelt. Fast image-based localization
for camera relocalization. arXiv preprint arXiv:1509.05909, 2015. using direct 2d-to-3d matching. In International Conference on Com-
[20] A. Kendall and Y. Gal. What uncertainties do we need in puter Vision, pages 667–674. IEEE, 2011.
bayesian deep learning for computer vision? arXiv preprint [42] T. Sattler, B. Leibe, and L. Kobbelt. Improving Image-Based Local-
arXiv:1703.04977, 2017. ization by Active Correspondence Search. 2012.

6563
[43] T. Sattler, B. Leibe, and L. Kobbelt. Efficient & effective prioritized
matching for large-scale image-based localization. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 2016.
[44] T. Sattler, C. Sweeney, and M. Pollefeys. On sampling focal length
values to solve the absolute pose problem. In European Conference
on Computer Vision, pages 828–843. Springer, 2014.
[45] G. Schindler, M. Brown, and R. Szeliski. City-scale location recogni-
tion. In Computer Vision and Pattern Recognition, 2007. CVPR’07.
IEEE Conference on, pages 1–7. IEEE, 2007.
[46] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and
A. Fitzgibbon. Scene coordinate regression forests for camera relo-
calization in RGB-D images. In Computer Vision and Pattern Recog-
nition (CVPR), IEEE Conference on, pages 2930–2937. IEEE, 2013.
[47] J. Sivic, A. Zisserman, et al. Video google: A text retrieval approach
to object matching in videos. In International Conference on Com-
puter Vision (ICCV), volume 2, pages 1470–1477, 2003.
[48] N. Snavely, S. M. Seitz, and R. Szeliski. Modeling the world from
internet photo collections. International Journal of Computer Vision,
80(2):189–210, 2008.
[49] L. Svarm, O. Enqvist, M. Oskarsson, and F. Kahl. Accurate local-
ization and pose estimation for large 3d models. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition,
pages 532–539, 2014.
[50] L. Svarm, O. Enqvist, M. Oskarsson, and F. Kahl. Accurate local-
ization and pose estimation for large 3d models. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition,
pages 532–539, 2014.
[51] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Er-
han, V. Vanhoucke, and A. Rabinovich. Going deeper with convolu-
tions. arXiv preprint arXiv:1409.4842, 2014.
[52] G. Tolias, R. Sicre, and H. Jgou. Particular object retrieval with inte-
gral max-pooling of cnn activations. In ICLR, 2016.
[53] A. Torii, J. Sivic, T. Pajdla, and M. Okutomi. Visual place recogni-
tion with repetitive structures. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 883–890, 2013.
[54] F. Walch, C. Hazirbas, L. Leal-Taixé, T. Sattler, S. Hilsenbeck, and
D. Cremers. Image-based localization with spatial lstms. arXiv
preprint arXiv:1611.07890, 2016.
[55] T. Weyand, I. Kostrikov, and J. Philbin. Planet-photo geolocation
with convolutional neural networks. In European Conference on
Computer Vision, pages 37–55. Springer, 2016.
[56] C. Wu. Towards linear-time incremental structure from motion. In
3D Vision-3DV 2013, 2013 International Conference on, pages 127–
134. IEEE, 2013.
[57] B. Zeisl, T. Sattler, and M. Pollefeys. Camera pose voting for large-
scale image-based localization. In International Conference on Com-
puter Vision (ICCV), 2015.

6564

You might also like