Geometric Loss Functions for Camera Pose Regression with Deep Learning
1. Introduction

Designing a system for reliable large scale localisation is a challenging problem. The discovery of the positioning system in mammalian brains, located in the hippocampus, was awarded the 2014 Nobel Prize in Physiology or Medicine [36, 32]. It is an important problem for computer vision too, with localisation technology essential for many applications including autonomous vehicles, unmanned aerial vehicles and augmented reality. State-of-the-art localisation systems perform very well within controlled environments [24, 34, 12, 33, 44]. However, we are yet to see their widespread use in the wild because of their inability to cope with large viewpoint or appearance changes.

Many visual localisation systems use point landmarks such as SIFT [30] or ORB [40] to localise. These features perform well for incremental tracking and estimating ego-motion [33]. However, they are not able to create a representation which is sufficiently robust to challenging real-world scenarios. For example, point features are often not robust enough for localising across different weather, lighting or environmental conditions. Additionally, they lack the ability to capture global context, and require robust aggregation of hundreds of points to form a consensus to predict pose [57].

To address this problem, we introduced PoseNet [22, 19], which uses end-to-end deep learning to predict camera pose from a single input image. It was shown to localise more robustly using deep learning, compared with point features such as SIFT [30]. PoseNet learns a representation using the entire image context, based on appearance and shape. These features generalise well and can localise across challenging lighting and appearance changes. It is also fast, being able to regress the camera's pose in only a few milliseconds. It is very scalable, as it does not require a large database of landmarks; rather, it learns a mapping from pixels to a high dimensional space linear with pose.

The main weakness of PoseNet [22] was that, despite its scalability and robustness, it did not produce metric accuracy which is comparable to other geometric methods.
We initialise the network with weights pretrained on the ImageNet dataset [11]. We observe that these pretrained features regularise and improve performance in PoseNet through transfer learning [37]. To generalise PoseNet, we may apply it to any deep architecture designed for image classification as follows:

1. Remove the final linear regression and softmax layers used for classification.
2. Append a linear regression layer. This fully connected layer is designed to output a seven dimensional pose vector representing position (3 dimensions) and orientation (4 dimensional quaternion).
3. Insert a normalisation layer to normalise the four dimensional quaternion orientation vector to unit length.
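To make these three steps concrete, the following is a minimal sketch of such a pose regression head, assuming a PyTorch-style backbone that outputs a feature vector; the class name, feature dimension and usage comments are illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseRegressionHead(nn.Module):
    """Replaces a classifier's softmax head with a 7-dimensional pose regressor."""

    def __init__(self, feature_dim=1024):
        super().__init__()
        # Fully connected layer regressing a 7-D vector:
        # 3-D position followed by a 4-D quaternion.
        self.fc_pose = nn.Linear(feature_dim, 7)

    def forward(self, features):
        pose = self.fc_pose(features)
        xyz = pose[:, :3]
        # Normalisation layer: map the raw 4-D output to a unit quaternion.
        quat = F.normalize(pose[:, 3:], p=2, dim=1)
        return xyz, quat

# Usage (illustrative): features = backbone(images), where `backbone` is any
# classification CNN with its final softmax and classifier layers removed.
# xyz, quat = PoseRegressionHead(feature_dim=features.shape[1])(features)
```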
3.2. Pose representation

An important consideration when designing a machine learning system is the representation space of the output. We can easily learn camera position in Euclidean space [22]. However, learning orientation is more complex. In this section we compare a number of different parametrisations used to express rotational quantities: Euler angles, axis-angle, SO(3) rotation matrices and quaternions [2]. We evaluate their efficacy for deep learning.
3.3.1 Learning position and orientation
Firstly, Euler angles are easily understandable as an in-
terpretable parametrisation of 3-D rotation. However, they We can learn to estimate camera position by forming a
have two problems. Euler angles wrap around at 2π radi- smooth, continuous and injective regression loss in Eu-
ans, having multiple values representing the same angle. clidean space, Lx (I) = x − x̂γ , with norm given by γ
Therefore they are not injective, which causes them to be ([22] used the L2 Euclidean norm).
challenging to learn as a uni-modal scalar regression task. However, learning camera orientation is not as simple.
Additionally, they do not provide a unique parametrisation In Section 3.2 we described a number of options for repre-
for a given angle and suffer from the well-studied problem senting orientation. Quaternions are an attractive choice for
of gimbal lock [2]. The axis-angle representation is another deep learning because they are easily formulated in a con-
three dimensional vector representation. However like Eu- tinuous and differentiable way. The set of rotations lives
ler angles, it too suffers from a repetition around the 2π on the unit sphere in quaternion space. We can easily map
radians representation. any four dimensional vector to a valid quaternion rotation
Rotation matrices are an over-parametrised representation of rotation. For 3-D problems, the set of rotation matrices comprises the 3×3 members of the special orthogonal Lie group, SO(3). These matrices have a number of interesting properties, including orthonormality. However, it is difficult to enforce the orthogonality constraint when learning an SO(3) representation through back-propagation.
In this work we chose quaternions as our orientation representation. Quaternions are favourable because arbitrary four dimensional values are easily mapped to legitimate rotations by normalising them to unit length. This is a simpler process than the orthonormalisation required of rotation matrices. Quaternions are a continuous and smooth representation of rotation. They lie on the unit manifold, which is a simple constraint to enforce through back-propagation. Their main downside is that they have two mappings for each rotation, one on each hemisphere. However, in Section 3.3.1 we show how to adjust the loss function to compensate for this.
3.3. Loss function

So far we have described the structure of the pose representation we would like our network to learn. Next, we discuss how to design an effective loss function to learn to estimate the camera's 6 degree of freedom pose. This is a particularly challenging objective because it involves learning two distinct quantities, rotation and translation, with different units and scales.

This section defines a number of loss functions and explores their efficacy for camera pose regression. We begin in Section 3.3.2 by describing the original weighted loss function which was proposed by PoseNet [22]. We improve on this in Section 3.3.3 by introducing a novel loss function which can learn the weighting between rotation and translation automatically, using an estimate of the homoscedastic task uncertainty. Further, in Section 3.3.4 we describe a loss function which combines position and orientation as a single scalar using the reprojection error geometry. In Section 4.2 we compare the performance of these loss functions and discuss their trade-offs.

3.3.1 Learning position and orientation

We can learn to estimate camera position by forming a smooth, continuous and injective regression loss in Euclidean space, L_x(I) = ||x − x̂||_γ, with norm given by γ ([22] used the L2 Euclidean norm).

However, learning camera orientation is not as simple. In Section 3.2 we described a number of options for representing orientation. Quaternions are an attractive choice for deep learning because they are easily formulated in a continuous and differentiable way. The set of rotations lives on the unit sphere in quaternion space. We can easily map any four dimensional vector to a valid quaternion rotation by normalising it to unit length. [22] demonstrates how to learn to regress quaternion values:

\[
\mathcal{L}_q(I) = \left\| \mathbf{q} - \frac{\hat{\mathbf{q}}}{\lVert \hat{\mathbf{q}} \rVert} \right\|_\gamma \tag{1}
\]

Using a distance norm, γ, in Euclidean space makes no effort to keep q on the unit sphere. We find, however, that during training q becomes close enough to q̂ that the distinction between spherical distance and Euclidean distance becomes insignificant. For simplicity, and to avoid hampering the optimisation with unnecessary constraints, we chose to omit the spherical constraint. The main problem with quaternions is that they are not injective: two values, one from each hemisphere, map to a single rotation, because a quaternion q is identical to −q. To address this, we constrain all quaternions to one hemisphere such that there is a unique value for each rotation.
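As an illustration, a small NumPy sketch of this quaternion loss and the hemisphere constraint is shown below; the (w, x, y, z) ordering, the function names and the default γ = 1 are assumptions for the sketch rather than part of the original implementation.

```python
import numpy as np

def to_hemisphere(q):
    """Constrain a quaternion to one hemisphere: q and -q encode the same rotation."""
    return -q if q[0] < 0 else q

def quaternion_loss(q_true, q_pred, gamma=1):
    """Distance between the ground-truth quaternion and the normalised prediction, as in (1)."""
    q_true = to_hemisphere(q_true)            # unique label for each rotation
    q_pred = q_pred / np.linalg.norm(q_pred)  # map the raw 4-D output to a unit quaternion
    return np.linalg.norm(q_true - q_pred, ord=gamma)
```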
3.3.2 Simultaneously learning position and orientation

The original PoseNet loss [22] learns position and orientation simultaneously by combining the two objectives with a linear weighted sum,

\[
\mathcal{L}_\beta(I) = \mathcal{L}_x(I) + \beta \mathcal{L}_q(I), \tag{2}
\]

where the hyperparameter β balances the position and orientation losses and must be tuned separately for each scene.

3.3.3 Learning an optimal weighting

Rather than hand-tuning β, we can learn the relative weighting of the two losses from the data using an estimate of the homoscedastic task uncertainty of each quantity,

\[
\mathcal{L}_\sigma(I) = \mathcal{L}_x(I)\,\hat{\sigma}_x^{-2} + \log \hat{\sigma}_x^2 + \mathcal{L}_q(I)\,\hat{\sigma}_q^{-2} + \log \hat{\sigma}_q^2 . \tag{3}
\]

In practice we regress the log variance, ŝ := log σ̂², which is more numerically stable than regressing σ² directly:

\[
\mathcal{L}_\sigma(I) = \mathcal{L}_x(I)\exp(-\hat{s}_x) + \hat{s}_x + \mathcal{L}_q(I)\exp(-\hat{s}_q) + \hat{s}_q . \tag{4}
\]
Only an approximate initial guess is required for the homoscedastic task uncertainty values; we arbitrarily use initial values of ŝ_x = 0.0 and ŝ_q = −3.0 for all scenes.
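As an illustration, a minimal PyTorch-style sketch of this learned weighting is given below, assuming the position and orientation losses are computed elsewhere; the class and argument names are ours rather than the original implementation.

```python
import torch
import torch.nn as nn

class LearnedWeightingLoss(nn.Module):
    """Combines position and orientation losses with learned log-variance weights."""

    def __init__(self, init_s_x=0.0, init_s_q=-3.0):
        super().__init__()
        # s = log(sigma^2), one homoscedastic uncertainty per task,
        # optimised jointly with the network weights.
        self.s_x = nn.Parameter(torch.tensor(init_s_x))
        self.s_q = nn.Parameter(torch.tensor(init_s_q))

    def forward(self, loss_x, loss_q):
        return (loss_x * torch.exp(-self.s_x) + self.s_x
                + loss_q * torch.exp(-self.s_q) + self.s_q)
```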
3.3.4 Learning from geometric reprojection error

Perhaps a more desirable loss is one that does not require balancing of rotational and positional quantities at all. Reprojection error of scene geometry is a representation which combines rotation and translation naturally in a single scalar loss [14]. Reprojection error is given by the residual between 3-D points in the scene projected onto the 2-D image plane using the ground truth and predicted camera pose. It therefore converts rotation and translation quantities into image coordinates. This naturally weights translation and rotation quantities depending on the scene and camera geometry.

To formulate this loss, we first define a function, π, which maps a 3-D point, g, to 2-D image coordinates, (u, v)^T:
\[
\pi(\mathbf{x}, \mathbf{q}, \mathbf{g}) \mapsto \begin{pmatrix} u \\ v \end{pmatrix} \tag{5}
\]

where x and q represent the camera position and orientation. This function, π, is defined as:

\[
\begin{pmatrix} u' \\ v' \\ w' \end{pmatrix} = K(R\mathbf{g} + \mathbf{x}), \qquad
\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} u'/w' \\ v'/w' \end{pmatrix} \tag{6}
\]

where K is the intrinsic calibration matrix of the camera, and R is the mapping of q to its SO(3) rotation matrix, q_{4×1} → R_{3×3}.
We formulate this loss by taking the norm of the reprojection error between the predicted and ground truth camera pose. We take the subset, G′, of all 3-D points in the scene, G, which are visible in the image I. The final loss (7) is given by the mean of all the residuals from the points g_i ∈ G′:

\[
\mathcal{L}_g(I) = \frac{1}{|G'|} \sum_{g_i \in G'} \left\| \pi(\mathbf{x}, \mathbf{q}, g_i) - \pi(\hat{\mathbf{x}}, \hat{\mathbf{q}}, g_i) \right\|_\gamma \tag{7}
\]
where x̂ and q̂ are the predicted camera position and orientation from PoseNet, x and q are the ground truth labels, and γ is the norm, which is discussed in Section 3.3.5.
Note that because we are projecting 3-D points using both the ground truth and predicted camera pose, we can apply any arbitrary camera model, as long as we use the same intrinsic parameters for both cameras. Therefore, for simplicity, we set the camera intrinsics, K, to the identity matrix; camera calibration is not required.
This loss implicitly combines rotation and translational quantities into image coordinates. Minimising reprojection error is often the most desirable balance between these quantities for many applications, such as augmented reality.

The key advantage of this loss is that it allows the model to vary the weighting between position and orientation depending on the specific geometry in the training image. For example, training images with geometry which is far away would balance rotational and translational loss differently to images with geometry very close to the camera.

Interestingly, when experimenting with the original weighted loss in (2) we observed that the hyperparameter β was an approximate function of the scene geometry: it depended on the distance and size of the landmarks in the scene. Our intuition was that the optimal choice for β approximates the reprojection error of the scene geometry. For example, if the scene is very far away, then rotation is more significant than translation, and vice versa. This function is not trivial to model for complex scenes with a large number of landmarks, and it varies significantly with each training example in the dataset. By learning with reprojection error we can use our knowledge of the scene geometry more directly to infer this weighting automatically.

Projecting geometry through a projection model is a differentiable operation involving matrix multiplication. Therefore we can use this loss to train our model with stochastic gradient descent. It is important to note that we do not need to know the intrinsic camera parameters to project this 3-D geometry, because we apply the same projection to both the model prediction and the ground truth measurement, so we can use arbitrary values.

It should be noted that we need some knowledge of the scene's geometry in order to have 3-D points to reproject. The geometry is often known, for example when the data is obtained through structure from motion, RGB-D capture or other sensory data (see Section 4.1). Only points from the scene which are visible in the image I are used to compute the loss. We also found it important for numerical stability to ignore points which are projected outside the image bounds.

3.3.5 Regression norm

An important choice for these losses is the regression norm, ||·||_γ. Typically, deep learning models use an L1 norm, ||·||_1, or an L2 norm, ||·||_2. We can also use robust norms such as Huber's loss [16] and Tukey's loss [15], which have been successfully applied to deep learning [5]. For camera pose regression, we found that they negatively impacted performance by over-attenuating difficult examples; we suspect that these robust regression functions might be beneficial for noisier datasets. With the datasets used in this paper, we found the L1 norm to perform best and therefore use γ = 1. It does not increase quadratically with magnitude or over-attenuate large residuals.
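As a concrete illustration, the following NumPy sketch implements the projection of (5) and (6) and the mean L1 reprojection residual of (7); the (w, x, y, z) quaternion ordering and the function names are assumptions, the intrinsics default to the identity as described above, and a training implementation would use differentiable tensor operations instead.

```python
import numpy as np

def quat_to_rotation_matrix(q):
    """Map a (possibly unnormalised) quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def project(x, q, points, K=np.eye(3)):
    """Project Nx3 scene points to 2-D image coordinates with pose (x, q), as in (6)."""
    R = quat_to_rotation_matrix(q)
    uvw = (points @ R.T + x) @ K.T   # K(Rg + x) for every point g
    return uvw[:, :2] / uvw[:, 2:3]  # perspective division: (u'/w', v'/w')

def reprojection_loss(x, q, x_hat, q_hat, visible_points):
    """Mean L1 residual between projections under ground-truth and predicted poses, as in (7)."""
    residual = project(x, q, visible_points) - project(x_hat, q_hat, visible_points)
    return np.abs(residual).sum(axis=1).mean()
```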
Figure 3: Example images randomly chosen from each dataset. This illustrates the wide variety of settings and scales and the challenging array of environmental factors such as lighting, occlusion, dynamic objects and weather which are captured in each dataset. (a) 7 Scenes Dataset: 43,000 images from seven scenes in small indoor environments [46]. (b) Cambridge Landmarks Dataset: over 10,000 images from six scenes around Cambridge, UK [22]. (c) Dubrovnik 6K Dataset: 6,000 images from a variety of camera types in Dubrovnik, Croatia [29].
Dataset | Type | Scale | Imagery | Scenes | Train Images | Test Images | 3-D Points | Spatial Area
7 Scenes [46] | Indoor | Room | RGB-D sensor (Kinect) | 7 | 26,000 | 17,000 | - | 4×3 m
Cambridge Landmarks [22] | Outdoor | Street | Mobile phone camera | 6 | 8,380 | 4,841 | 2,097,191 | 100×500 m
Dubrovnik 6K [26] | Outdoor | Small town | Internet images (Flickr) | 1 | 6,044 | 800 | 2,106,456 | 1.5×1.5 km

Table 1: Summary of the localisation datasets used in this paper's experiments. These datasets are all publicly available. They demonstrate our method's performance over a range of scales for both indoor and outdoor applications.
4. Experiments

To train and benchmark our model on a number of datasets, we rescale the input images such that the shortest side is of length 256. We normalise the images so that input pixel intensities range from −1 to 1. We train our PoseNet architecture using an implementation in TensorFlow [1]. All models are optimised end-to-end with ADAM [23] using the default parameters and a learning rate of 1×10⁻⁴. We train each model until the training loss converges. Using a batch size of 64 on an NVIDIA Titan X (Pascal) GPU, training takes approximately 20k-100k iterations, or 4 hours to 1 day.
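For illustration, a minimal sketch of this input preprocessing is shown below, assuming Pillow and NumPy; the function name and argument defaults are ours.

```python
import numpy as np
from PIL import Image

def preprocess(path, shortest_side=256):
    """Rescale so the shortest side is 256 pixels and map pixel intensities to [-1, 1]."""
    image = Image.open(path).convert("RGB")
    w, h = image.size
    scale = shortest_side / min(w, h)
    image = image.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    return np.asarray(image, dtype=np.float32) / 127.5 - 1.0
```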
Loss function | x [m] | q [°] | < 2 m, 5° [%] | x [m] | q [°] | < 10 m, 10° [%]
Linear sum, β = 500 (2) | 1.52 | 1.19 | 65.0% | 13.1 | 4.68 | 30.1%
Learn weighting with homoscedastic uncertainty (3) | 0.99 | 1.06 | 85.3% | 9.88 | 4.73 | 41.7%
Reprojection loss | does not converge | | | | |
Learn weighting pretraining → Reprojection loss (7) | 0.88 | 1.04 | 90.3% | 7.90 | 4.40 | 48.6%

Table 2: Comparison of different loss functions. The first three result columns report median error and accuracy on Cambridge Landmarks, King's College [22]; the last three on Dubrovnik 6K [29]. We use an L1 distance for the residuals in each loss. Linear sum combines position and orientation losses with a constant scaling parameter β [19] and is defined in (2). Learn weighting is the loss function in (3) which learns to combine position and orientation using homoscedastic uncertainty. Reprojection error implicitly combines rotation and translation by using the reprojection error of the scene geometry as the loss (7). We find that homoscedastic uncertainty is able to learn an effective weighting between position and orientation quantities. The reprojection loss was not able to converge from random initialisation. However, when used to fine-tune a network pretrained with (3) it yields the best results.
Scene | Area or Volume | Active Search (SIFT) [43] | PoseNet (β weight) [22] | Bayesian PoseNet [19] | Spatial LSTM [54] | PoseNet (this work), Learn σ² Weight | PoseNet (this work), Geometric Reprojection
Great Court | 8000 m² | - | - | - | - | 7.00 m, 3.65° | 6.83 m, 3.47°
King's College | 5600 m² | 0.42 m, 0.55° | 1.66 m, 4.86° | 1.74 m, 4.06° | 0.99 m, 3.65° | 0.99 m, 1.06° | 0.88 m, 1.04°
Old Hospital | 2000 m² | 0.44 m, 1.01° | 2.62 m, 4.90° | 2.57 m, 5.14° | 1.51 m, 4.29° | 2.17 m, 2.94° | 3.20 m, 3.29°
Shop Façade | 875 m² | 0.12 m, 0.40° | 1.41 m, 7.18° | 1.25 m, 7.54° | 1.18 m, 7.44° | 1.05 m, 3.97° | 0.88 m, 3.78°
St Mary's Church | 4800 m² | 0.19 m, 0.54° | 2.45 m, 7.96° | 2.11 m, 8.38° | 1.52 m, 6.68° | 1.49 m, 3.43° | 1.57 m, 3.32°
Street | 50000 m² | 0.85 m, 0.83° | - | - | - | 20.7 m, 25.7° | 20.3 m, 25.5°
Chess | 6 m³ | 0.04 m, 1.96° | 0.32 m, 6.60° | 0.37 m, 7.24° | 0.24 m, 5.77° | 0.14 m, 4.50° | 0.13 m, 4.48°
Fire | 2.5 m³ | 0.03 m, 1.53° | 0.47 m, 14.0° | 0.43 m, 13.7° | 0.34 m, 11.9° | 0.27 m, 11.8° | 0.27 m, 11.3°
Heads | 1 m³ | 0.02 m, 1.45° | 0.30 m, 12.2° | 0.31 m, 12.0° | 0.21 m, 13.7° | 0.18 m, 12.1° | 0.17 m, 13.0°
Office | 7.5 m³ | 0.09 m, 3.61° | 0.48 m, 7.24° | 0.48 m, 8.04° | 0.30 m, 8.08° | 0.20 m, 5.77° | 0.19 m, 5.55°
Pumpkin | 5 m³ | 0.08 m, 3.10° | 0.49 m, 8.12° | 0.61 m, 7.08° | 0.33 m, 7.00° | 0.25 m, 4.82° | 0.26 m, 4.75°
Red Kitchen | 18 m³ | 0.07 m, 3.37° | 0.58 m, 8.34° | 0.58 m, 7.54° | 0.37 m, 8.83° | 0.24 m, 5.52° | 0.23 m, 5.35°
Stairs | 7.5 m³ | 0.03 m, 2.22° | 0.48 m, 13.1° | 0.48 m, 13.1° | 0.40 m, 13.7° | 0.37 m, 10.6° | 0.35 m, 12.4°

Table 3: Median localisation results (position, orientation) for the Cambridge Landmarks [22] and 7 Scenes [46] datasets. We compare the performance of various RGB-only algorithms. Active Search [43] is a state-of-the-art traditional SIFT keypoint based baseline. We demonstrate a notable improvement over PoseNet's [22] baseline performance using the learned σ² and reprojection error proposed in this paper, narrowing the margin to the state-of-the-art SIFT technique.
each scene in this dataset, compared with Cambridge Landmarks. We also outperform the improved PoseNet architecture with spatial LSTMs [54]. However, this method is complementary to the loss functions in this paper, and it would be interesting to explore the union of these ideas.

We observe a difference in relative performance between position and orientation when optimising with respect to reprojection error (7) or homoscedastic uncertainty (3). Overall, optimising reprojection loss improves rotation accuracy, sometimes at the expense of some positional precision.

4.4. Comparison to SIFT-feature approaches

Method | Position Mean [m] | Position Median [m] | Orientation Mean [°] | Orientation Median [°]
PoseNet (this work) | 40.0 | 7.9 | 11.2 | 4.4
APE [50] | - | 0.56 | - | -
Voting [57] | - | 1.69 | - | -
Sattler, et al. [41] | 14.9 | 1.3 | - | -
P2F [29] | 18.3 | 9.3 | - | -

Table 4: Localisation results on the Dubrovnik dataset [26], comparing to a number of state-of-the-art point-feature techniques. Our method is the first deep learning approach to benchmark on this challenging dataset. We achieve comparable performance, while our method only requires a 256×256 pixel image and is much faster to compute.
References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] S. L. Altmann. Rotations, quaternions, and double groups. Courier Corporation, 2005.
[3] A. Babenko and V. Lempitsky. Aggregating deep convolutional features for image retrieval. In International Conference on Computer Vision (ICCV), 2015.
[4] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In European Conference on Computer Vision, 2014.
[5] V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab. Robust optimization for deep regression. In International Conference on Computer Vision (ICCV), pages 2830–2838. IEEE, 2015.
[6] D. Chen, G. Baatz, K. Köser, S. Tsai, R. Vedantham, T. Pylvänäinen, K. Roimela, X. Chen, J. Bach, M. Pollefeys, B. Girod, and R. Grzeszczuk. City-scale landmark identification on mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[7] S. Choudhary and P. J. Narayanan. Visibility probability structure from sfm datasets and applications. In European Conference on Computer Vision, 2012.
[8] R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen. Vidloc: 6-dof video-clip relocalization. arXiv preprint arXiv:1702.06521, 2017.
[9] M. Cummins and P. Newman. FAB-MAP: Probabilistic localization and mapping in the space of appearance. The International Journal of Robotics Research, 27(6):647–665, 2008.
[10] J. Delhumeau, P.-H. Gosselin, H. Jégou, and P. Pérez. Revisiting the VLAD image representation. In ACM Multimedia, Barcelona, Spain, Oct. 2013.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[12] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular slam. In European Conference on Computer Vision, pages 834–849. Springer, 2014.
[13] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In European Conference on Computer Vision, 2014.
[14] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003.
[15] D. C. Hoaglin, F. Mosteller, and J. W. Tukey. Understanding robust and exploratory data analysis, volume 3. Wiley New York, 1983.
[16] P. J. Huber. Robust statistics. Springer, 2011.
[17] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3304–3311, Jun. 2010.
[18] H. Jegou, F. Perronnin, M. Douze, J. Sánchez, P. Perez, and C. Schmid. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1704–1716, 2012.
[19] A. Kendall and R. Cipolla. Modelling uncertainty in deep learning for camera relocalization. arXiv preprint arXiv:1509.05909, 2015.
[20] A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977, 2017.
[21] A. Kendall, Y. Gal, and R. Cipolla. Multi-task deep learning using task-dependent homoscedastic uncertainty for depth regression, semantic and instance segmentation. arXiv preprint, 2017.
[22] A. Kendall, M. Grimes, and R. Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. arXiv preprint arXiv:1505.07427, 2015.
[23] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[24] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Mixed and Augmented Reality, IEEE and ACM International Symposium on, pages 225–234. IEEE, 2007.
[25] R. Li, Q. Liu, J. Gui, D. Gu, and H. Hu. Indoor relocalization in challenging environments with dual-stream convolutional neural networks. IEEE Transactions on Automation Science and Engineering, 2017.
[26] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua. Worldwide pose estimation using 3d point clouds. In European Conference on Computer Vision, pages 15–29. Springer, 2012.
[27] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua. Worldwide Pose Estimation Using 3D Point Clouds. In European Conference on Computer Vision, 2012.
[28] Y. Li, N. Snavely, and D. P. Huttenlocher. Location Recognition using Prioritized Feature Matching. In European Conference on Computer Vision, 2010.
[29] Y. Li, N. Snavely, and D. P. Huttenlocher. Location recognition using prioritized feature matching. In European Conference on Computer Vision, pages 791–804. Springer, 2010.
[30] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[31] I. Melekhov, J. Kannala, and E. Rahtu. Relative camera pose estimation using convolutional neural networks. arXiv preprint arXiv:1702.01381, 2017.
[32] E. I. Moser, E. Kropff, and M.-B. Moser. Place cells, grid cells, and the brain's spatial representation system. Annu. Rev. Neurosci., 31:69–89, 2008.
[33] R. Mur-Artal, J. Montiel, and J. D. Tardós. Orb-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
[34] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. DTAM: Dense tracking and mapping in real-time. In International Conference on Computer Vision (ICCV), pages 2320–2327. IEEE, 2011.
[35] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[36] J. O'Keefe and L. Nadel. The hippocampus as a cognitive map, volume 3. Clarendon Press Oxford, 1978.
[37] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1717–1724. IEEE, 2014.
[38] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object Retrieval with Large Vocabularies and Fast Spatial Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[39] A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson. A baseline for visual instance retrieval with deep convolutional networks. arXiv preprint arXiv:1412.6574, 2014.
[40] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. Orb: An efficient alternative to sift or surf. In International Conference on Computer Vision (ICCV), pages 2564–2571. IEEE, 2011.
[41] T. Sattler, B. Leibe, and L. Kobbelt. Fast image-based localization using direct 2d-to-3d matching. In International Conference on Computer Vision, pages 667–674. IEEE, 2011.
[42] T. Sattler, B. Leibe, and L. Kobbelt. Improving Image-Based Localization by Active Correspondence Search. 2012.
[43] T. Sattler, B. Leibe, and L. Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[44] T. Sattler, C. Sweeney, and M. Pollefeys. On sampling focal length values to solve the absolute pose problem. In European Conference on Computer Vision, pages 828–843. Springer, 2014.
[45] G. Schindler, M. Brown, and R. Szeliski. City-scale location recognition. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pages 1–7. IEEE, 2007.
[46] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pages 2930–2937. IEEE, 2013.
[47] J. Sivic, A. Zisserman, et al. Video google: A text retrieval approach to object matching in videos. In International Conference on Computer Vision (ICCV), volume 2, pages 1470–1477, 2003.
[48] N. Snavely, S. M. Seitz, and R. Szeliski. Modeling the world from internet photo collections. International Journal of Computer Vision, 80(2):189–210, 2008.
[49] L. Svarm, O. Enqvist, M. Oskarsson, and F. Kahl. Accurate localization and pose estimation for large 3d models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 532–539, 2014.
[50] L. Svarm, O. Enqvist, M. Oskarsson, and F. Kahl. Accurate localization and pose estimation for large 3d models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 532–539, 2014.
[51] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
[52] G. Tolias, R. Sicre, and H. Jégou. Particular object retrieval with integral max-pooling of cnn activations. In ICLR, 2016.
[53] A. Torii, J. Sivic, T. Pajdla, and M. Okutomi. Visual place recognition with repetitive structures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 883–890, 2013.
[54] F. Walch, C. Hazirbas, L. Leal-Taixé, T. Sattler, S. Hilsenbeck, and D. Cremers. Image-based localization with spatial lstms. arXiv preprint arXiv:1611.07890, 2016.
[55] T. Weyand, I. Kostrikov, and J. Philbin. Planet-photo geolocation with convolutional neural networks. In European Conference on Computer Vision, pages 37–55. Springer, 2016.
[56] C. Wu. Towards linear-time incremental structure from motion. In 3D Vision-3DV 2013, 2013 International Conference on, pages 127–134. IEEE, 2013.
[57] B. Zeisl, T. Sattler, and M. Pollefeys. Camera pose voting for large-scale image-based localization. In International Conference on Computer Vision (ICCV), 2015.