Resch Scalable Structure From 2015 CVPR Paper
bration in substantially less time than previous approaches, and on datasets with 1000s of high resolution frames. Furthermore, our approach generalizes to an arbitrary number of input video sequences, allowing for rapid, globally consistent calibration and scene reconstruction across multiple capture devices. In this paper we describe all of these contributions, and present detailed pseudocode for reimplementation in the supplemental material [1].

In our system we leverage established existing approaches, such as the commonly used KLT tracker [7], SIFT histogram based frame similarity cost matrices [30], a global linear solver that integrates relative camera pose constraints [14], and a robust depth-based point parameterization [33].

2. Related Work

Depending on underlying methodology and applications, a multitude of different terminologies exists for geometric camera auto-calibration, the most common being variants of structure from motion (SfM) and simultaneous localization and mapping (SLAM). Here we classify prior works into two categories according to their preferred input data: unstructured and sparse vs. coherent and dense sampling.

Unstructured, sparsely sampled input. A key challenge for methods focusing on sparsely sampled input such as photo collections is that the data is generally unstructured and heterogeneous, with significant appearance changes between images. The current state of the art is therefore generally based on iterating (see [12]): (i) robust detection and matching of feature points, (ii) n-point algorithms to establish initial geometric relationships between views, and (iii) global BA. This approach has been successfully extended to massive, very large scale datasets [6, 11, 26], with various publicly available implementations [2, 4].

A central problem for such techniques is bootstrapping, i.e., finding a good global initialization for BA that includes all images, without having to run many iterations of BA on parts of the reconstruction. Martinec and Pajdla [20] present a robust solution for finding global camera poses, concentrating on the camera orientation. Wilson and Snavely [32] and Jiang et al. [14] show how to find global camera positions given known orientations. For massive datasets like Internet photo collections, a second problem is the sheer amount of images, often on the order of millions. To this end, techniques such as skeletal graphs [25] have been proposed, which remove unnecessary data by focusing on stable subsets of cameras. Agarwal et al. [5] showed that it is necessary to reconsider well established strategies in order to tackle large datasets consisting of tens of thousands of images. A further alternative is to perform an incremental, piecewise reconstruction of a scene [23], and later assemble individual fragments based, e.g., on extracted scene point descriptors.

All these methods are tuned towards heterogeneous, unstructured data, and as a consequence have difficulty when applied to densely sampled, coherent image sequences. This is due to per-frame feature point detection and pose estimation using n-point algorithms causing unstable reconstructions, as well as computational inefficiency. In our experiments, we show that by explicitly considering coherence in the data, it is possible to achieve high quality reconstructions at significantly faster convergence and computation times.

Coherent, densely sampled input. In contrast to the above, most techniques for densely sampled input such as video sequences are based on continuously tracking feature points throughout image sequences and iterative pose optimization techniques [12, 24, 27]. These original methods were designed for short, low resolution video sequences and did not consider multi-loop closing.

Particularly related are SLAM approaches and their variants, as their aim is to compute accurate camera poses of a dynamically moving camera from a video stream. Often, however, such techniques are limited with respect to the supported scene size [16] or require additional sensor modalities [17]. Real-time methods based on feature points [9] or dense, per-pixel tracking [21] are generally designed to provide as good as possible results with a small input lag, rather than a final, fully consistent and high quality reconstruction that globally optimizes the poses of all input frames. CoSLAM [35] combines data from cooperatively acquired videos as long as some of the cameras see the same content at the same time. Other approaches achieve real time performance, but only on preconstructed scenes [18], i.e., with known geometry.

Recently, a direct SLAM method (LSD-SLAM) was proposed [10], which does not require detection and tracking of feature points, but instead recovers sparse depth maps based directly on epipolar line scanning. However, such an approach does not scale well to high image resolutions, as it requires depth estimates for many pixels. In addition, depth recovery is very sensitive to accurate intrinsic calibration. Our approach instead focuses on a subset of reliable feature tracks, which is more efficient and less sensitive to image distortions, especially for high resolution input.

Specific light field calibration techniques have been proposed for dense spatio-temporal-angular sampling using camera arrays [13, 15, 31] and plenoptic cameras [22]. However, these methods generally focus on the static geometric calibration of a light field, rather than computing both structure and motion, and hence cannot be applied to the acquisition scenarios we discuss in this paper. The work on unstructured light field acquisition [8] explores this to some extent, but only supports small scale scenes and focuses on an interactive interface for guiding the user during the acquisition process.
Our method focuses on densely sampled image sequences and overcomes several of the previously mentioned limitations. This results in an SfM approach that is stable and globally consistent over long, high resolution sequences, while still being able to robustly handle wide baseline matches.

3. Method

The input to our method is one or more image sequences. We focus on extrinsic calibration and assume the intrinsics to be fixed and known (in practice they can be computed from a few frames of the image sequences by using Bundler [2]).

On a high level our strategy is as follows. First, we perform a modified 2D tracking of feature points utilizing data coherence to reduce drift. Next, we apply a window BA strategy on a set of confident frames only. These are frames that are well connected via continuous tracks. To incorporate loop closing, we further establish global anchor links between carefully selected frame pairs of different parts of the video or even different video streams altogether. In addition to these global constraints, relative camera pose constraints from the window BA are integrated with an efficient linear camera pose estimation [14]. We then perform global BA, and finally add all the less confident images by interpolation and BA of their poses. During this step, we keep the scene structure fixed as determined by the confident images. The final result is a globally consistent calibration of all input frames from all input sequences.

Figure 1 shows an overview of the key steps of our algorithm. The following sections discuss each step in detail.

[Figure 1 diagram: Feature detection / tracking; Camera pose constraints; Global anchor constraints; Linear camera pose estimation; Final BA optimization.]
Figure 1. Algorithm overview. The feature detection stage computes KLT tracks for window BA and SIFT features for wide baseline handling. The window BA sweeps through the input, selects confident frames and computes camera pose constraints between interleaving sets of the selected frames. The global anchor stage uses the SIFT features to establish global links between different sections of the sequence. A linear camera pose estimation produces an initial arrangement of cameras which is further refined by bundle adjustment steps.

3.1. Drift reduced feature tracking

Detectors optimized for wide baseline matching such as SIFT [19] compute incoherent feature point sets even between neighboring video frames. For continuous sequences, feature point tracking produces more reliable and efficient results. We build on the standard KLT tracker implemented in OpenCV [3], which is also the basis for earlier video-centered SfM techniques [27]. There are, however, two limitations of standard continuous KLT that have to be addressed in our application setting.

Firstly, we observed that for densely sampled video sequences, feature tracks that are visible for hundreds of frames exhibit noticeable drift. Note that for high frame rate cameras, this often corresponds to just a second of video. We therefore modify the basic tracking to perform a simple drift correction: when adding a frame, we track each feature from the previous frame to the new one, and then refine the feature position in the new frame using the original frame where the feature was detected. In our experiments this simple modification led to considerably reduced drift and higher reconstruction quality (see Figure 2).

Figure 2. Influence of KLT drift on the reconstruction result. Note how drift free KLT tracks reduce the drift of the camera positions as well as the average reprojection error.
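As an illustration, the following minimal sketch shows this two-step drift correction on top of OpenCV's pyramidal Lucas-Kanade tracker; the Track bookkeeping, the parameter values, and the function names are illustrative assumptions rather than our actual implementation.

    import cv2
    import numpy as np

    LK = dict(winSize=(21, 21), maxLevel=3,
              criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

    class Track:
        """One KLT track: the grayscale frame it was detected in, its position
        there, and its position in the most recent frame."""
        def __init__(self, ref_gray, ref_pt):
            self.ref_gray = ref_gray                       # detection frame (template source)
            self.ref_pt = ref_pt.reshape(1, 1, 2).astype(np.float32)
            self.cur_pt = self.ref_pt.copy()
            self.alive = True

    def advance_tracks(tracks, prev_gray, new_gray):
        """Advance all live tracks from prev_gray to new_gray with drift correction:
        a frame-to-frame KLT step gives an initial guess, which is then refined
        against the original detection frame of each track."""
        alive = [t for t in tracks if t.alive]
        if not alive:
            return
        pts_prev = np.concatenate([t.cur_pt for t in alive])
        # Step 1: standard frame-to-frame KLT step.
        pts_new, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, new_gray, pts_prev, None, **LK)
        for t, p, ok in zip(alive, pts_new, status.ravel()):
            if not ok:
                t.alive = False
                continue
            # Step 2: re-localize against the original template, using the
            # frame-to-frame result only as the initial guess.
            guess = p.reshape(1, 1, 2).astype(np.float32)
            refined, ok_ref, _ = cv2.calcOpticalFlowPyrLK(
                t.ref_gray, new_gray, t.ref_pt, guess,
                flags=cv2.OPTFLOW_USE_INITIAL_FLOW, **LK)
            if ok_ref.ravel()[0]:
                t.cur_pt = refined       # drift-corrected position in the new frame
            else:
                t.alive = False

In a full implementation the second, template-based pass would be batched per detection frame; it is written per track here only for clarity. The point of the scheme is that small per-frame errors are no longer chained over hundreds of frames.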
Secondly, simply tracking points over an image sequence cannot guarantee any form of global consistency of the reconstructed cameras and scene. For example, when the camera revisits the same scene elements multiple times over a longer image sequence with intermediate occlusions, a single scene point will be represented by multiple, individually tracked and reconstructed points. This is the so-called loop-closing problem in SLAM. For each feature track we therefore extract SIFT descriptors [19] in confident frames after the window BA, which are later used to re-identify points and for the generation of global anchor constraints.
3.2. Interleaved window bundle adjustment

or jointly processing multiple individual sequences, both of which will be addressed later.

3.2.1 Window initialization
Initializing camera geometry is usually accomplished using n-point algorithms [12], which are (in contrast to BA) limited in the number of constraints and therefore generally not sufficiently accurate given the very small camera baselines encountered in high framerate video sequences. Our method is inspired by the approach of Yu and Gallup [33] designed for accidental small baseline camera motion.

We initialize a window by picking the first N consecutive images from an image sequence and immediately perform a BA step using the parameterization proposed in [33], where points are represented by inverse depth values projected from a reference frame (we use the center image in the window). We found, however, that identical initialization of all cameras [33] may cause BA to get stuck in local minima. According to our observations, this can reliably be avoided by starting from different linearly displaced configurations (see Figure 3) and optimizing first for the camera orientation and then for all extrinsics. Finally we pick the best result in terms of reprojection error. Moreover, we observed more robust results when initializing scene points with uniform instead of random depth [33]. The original method of Yu and Gallup requires a comparably large number of images for robust convergence. With our above modifications we observed stable convergence already with N = 11. For high frame rate handheld video, spacing between frames (e.g., 3 in our experiments) for slightly increased baselines led to improved convergence.
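To make the parameterization concrete, the sketch below spells out one possible reading of the inverse-depth representation and the multi-start initialization. The use of scipy.optimize.least_squares, the number and range of displaced starts, and the helper names are illustrative assumptions; the two-stage orientation-then-extrinsics optimization and the gauge fixing of the overall scale are omitted for brevity.

    import numpy as np
    from scipy.optimize import least_squares

    def rodrigues(rvec):
        """Axis-angle vector to rotation matrix."""
        theta = np.linalg.norm(rvec)
        if theta < 1e-12:
            return np.eye(3)
        k = rvec / theta
        Kx = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        return np.eye(3) + np.sin(theta) * Kx + (1.0 - np.cos(theta)) * (Kx @ Kx)

    def window_residuals(params, K, obs, ref_idx):
        """Reprojection residuals for a small window. obs: (N, P, 2) pixel positions
        of P tracked points in N frames. Each point is parameterized by one inverse
        depth in the reference (center) frame; every other camera by a 6-DoF pose
        relative to that frame."""
        n_cams, n_pts, _ = obs.shape
        poses = params[:6 * (n_cams - 1)].reshape(n_cams - 1, 6)
        inv_depth = params[6 * (n_cams - 1):]
        rays = (np.linalg.inv(K) @ np.hstack([obs[ref_idx], np.ones((n_pts, 1))]).T).T
        X_ref = rays / inv_depth[:, None]            # back-projected points, reference frame
        res, cam = [], 0
        for i in range(n_cams):
            if i == ref_idx:
                continue
            R, t = rodrigues(poses[cam, :3]), poses[cam, 3:]
            cam += 1
            Xc = X_ref @ R.T + t
            proj = Xc @ K.T
            proj = proj[:, :2] / proj[:, 2:3]
            res.append((proj - obs[i]).ravel())
        return np.concatenate(res)

    def initialize_window(K, obs, ref_idx, n_starts=5):
        """Multi-start small-baseline initialization: uniform inverse depth for all
        points, cameras displaced linearly along one axis by a different amount per
        start, keeping the solution with the lowest residual cost."""
        n_cams, n_pts, _ = obs.shape
        best = None
        for step in np.linspace(-0.02, 0.02, n_starts):
            poses = np.zeros((n_cams - 1, 6))
            poses[:, 3] = step * np.arange(1, n_cams)   # linear displacement
            x0 = np.concatenate([poses.ravel(), np.ones(n_pts)])
            sol = least_squares(window_residuals, x0, args=(K, obs, ref_idx))
            if best is None or sol.cost < best.cost:
                best = sol
        return best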
The next step is to grow this window. To this end, we first describe a subsampling scheme of the scene points that allows us to reduce the BA computational cost significantly at similar reconstruction quality.

3.2.2 Scene point subsampling

Following the observation that BA requires a certain minimum number of scene points but does not improve significantly with many more points, we employ a subsampling scheme on the available scene points in all BA steps. We found that sampling points randomly can lead to unstable or underconstrained optimization. We therefore choose point samples according to three rules explained below, achieving similar reconstruction quality at a fraction of the optimization time.

First, a minimum number τ of points should be visible from each camera. Second, the reprojected 2D positions of the points should be uniformly distributed in all camera images. Finally, the points should be visible in “sufficiently many” images, since in general a point provides more reliable constraints when seen by more cameras. At the same time, however, we observed that points visible in a very large number of frames, i.e., points with very long 2D tracks, are more likely to be affected by tracking errors along object silhouettes, corrupting the result. Each point is therefore picked with a probability proportional to the length of its track, capped at 10 frames. This subsampling strategy resulted in a 3-fold speedup without sacrificing result quality (see Figure 4). For all experiments, we use a value of τ = 100.

Figure 4. Comparison of reconstruction with and without point subsampling. Our subsampling strategy leads to comparable reconstruction results while reducing the computation time by roughly a factor of 3. This factor is expected to increase for larger image resolutions since more points can be excluded there.
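One possible reading of this sampling rule in code is sketched below; the overall sampling budget and the greedy per-camera top-up are our own simplifications (the uniform image-space distribution rule is omitted), and all names are illustrative.

    import numpy as np

    def subsample_points(track_len, visibility, tau=100, cap=10, rng=None):
        """track_len: (P,) length of each point's 2D track.
        visibility: list of P arrays holding the camera indices observing each point.
        Returns indices of the selected scene points."""
        rng = np.random.default_rng() if rng is None else rng
        P = len(track_len)
        weights = np.minimum(track_len, cap).astype(float)
        probs = weights / weights.sum()
        # Base sample: probability proportional to the (capped) track length.
        budget = max(tau, P // 4)                     # illustrative overall budget
        base = rng.choice(P, size=min(budget, P), replace=False, p=probs)
        selected = set(base.tolist())
        # Top-up: make sure every camera still sees at least tau selected points.
        n_cams = 1 + max(c for vis in visibility for c in vis)
        seen = np.zeros(n_cams, dtype=int)
        for j in selected:
            seen[np.asarray(visibility[j])] += 1
        for cam in range(n_cams):
            if seen[cam] >= tau:
                continue
            candidates = [j for j in range(P)
                          if cam in set(visibility[j]) and j not in selected]
            rng.shuffle(candidates)
            for j in candidates[: tau - seen[cam]]:
                selected.add(j)
                seen[np.asarray(visibility[j])] += 1
        return np.array(sorted(selected))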
3.2.3 Confidence criteria

For efficiency reasons, and to improve result quality, we compute scene structure and camera poses initially only for a sparse set of confident frames. In order to find this set, we test cameras with a linearly increasing step size and add the furthest possible frame fulfilling a set of confidence criteria. We define the following three confidence criteria ξ1, ξ2, ξ3 for measuring whether a tested camera c is suitable for window BA.
The first term ξ1 measures the number of features of the camera that can be matched to the points pi of the window with a low reprojection error. This ensures that there are sufficiently many constraints for BA.

ξ2 represents how far the camera moved around the scene points. We use the median of all points’ angular differences φ(→pcn, →pcp) between the vectors from a point p to the new camera cn and to the previous camera cp. This term makes sure that two cameras are not too far apart from each other and ensures that the visual appearance of the feature points does not change too much, so that the next confident frame has mostly the same feature tracks.

The last term ξ3 is set to the median reprojection error ē of the tested camera ct and its visible points pt: ξ3 = ē(ct, pt). This ensures that no cameras are added to the optimization which are too inconsistent with the content of the window.

We label a camera as sufficiently confident when the following criteria are fulfilled: ξ1 ≥ 30, ξ2 ≤ 5°, ξ3 ≤ 5px, i.e., the camera must be linked to at least 30 points, must not rotate more than five degrees around at least half of these points, and at least half of the points must have less than five pixels reprojection error. Similarly, a camera is labeled as candidate for removal from the current window as soon as it does not satisfy the following confidence constraints anymore: ξ1 ≥ 70, ξ2 ≤ 10°, i.e., at least 70 points and less than ten degrees of camera rotation around at least 50% of the points.
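The sketch below evaluates the three criteria for a candidate camera whose pose has already been fitted against the current window; the thresholds follow the text, while the data layout and the reuse of the 5px bound to decide which features count as "matched" for ξ1 are assumptions.

    import numpy as np

    def confidence(K, R_new, t_new, R_prev, t_prev, pts3d, obs2d,
                   xi1_min=30, xi2_max_deg=5.0, xi3_max_px=5.0):
        """pts3d: (P, 3) window scene points; obs2d: (P, 2) their detections in the
        candidate frame. Poses map world to camera coordinates (X_cam = R X + t).
        Returns (xi1, xi2, xi3, is_confident)."""
        # Reprojection errors of the candidate camera against the window points.
        Xc = pts3d @ R_new.T + t_new
        proj = Xc @ K.T
        proj = proj[:, :2] / proj[:, 2:3]
        err = np.linalg.norm(proj - obs2d, axis=1)
        xi1 = int(np.sum(err < xi3_max_px))          # matched points with low error
        xi3 = float(np.median(err))                  # median reprojection error
        # Median angular change of the point-to-camera vectors between the
        # previous confident camera and the candidate.
        c_new = -R_new.T @ t_new                     # camera centers in world coords
        c_prev = -R_prev.T @ t_prev
        v_new = c_new - pts3d
        v_prev = c_prev - pts3d
        cosang = np.sum(v_new * v_prev, axis=1) / (
            np.linalg.norm(v_new, axis=1) * np.linalg.norm(v_prev, axis=1))
        xi2 = float(np.degrees(np.median(np.arccos(np.clip(cosang, -1.0, 1.0)))))
        ok = xi1 >= xi1_min and xi2 <= xi2_max_deg and xi3 <= xi3_max_px
        return xi1, xi2, xi3, ok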
3.2.4 Window processing

Given the confidence criteria for addition and removal, in each iteration of the algorithm, we first remove images labeled for removal from the current window, keeping a minimum of 5 cameras in the window at all times. After this step, the current window usually contains about five to ten confident cameras.

However, for some camera (sub-)trajectories, stable windows can be much larger. To retain the efficiency of BA while keeping as much information as possible, we select a subset of the cameras in the window on which to perform the actual BA. We pick cameras with increased spacing for older images (see Figure 5). This subset is then optimized using standard BA.

After BA, all cameras in the current window are made consistent with a linear camera pose estimation technique [14], using the relative camera pose constraints of former windows. This solver works on the camera poses only, producing faster results than BA at comparable quality as long as the input is consistent.

Figure 5. Selecting keyframes for interleaved window BA. Offsets between the keyframes selected for BA increase linearly towards older frames. To determine a consistent pose for a camera which was not part of the BA, we use the relative pose constraints that were generated in previous windows where the camera was part of the BA.

We experimented with various offsetting strategies besides the growth strategy described above. The linear increase provided the best results in terms of algorithm stability, camera sampling, and computation time. The output of this stage is a set of camera pose constraints from each window, which we will later use for initializing the global scene optimization. We also keep all the windows for finding global anchor constraints as described in the following section.
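A small helper illustrating one way to pick such keyframes, densest near the newest frame; the exact growth rate of the offsets is not specified above, so the unit step increment is an assumption.

    def select_keyframes(window_frames, max_keyframes=8):
        """window_frames: frame indices in the current window, oldest to newest.
        Returns a subset whose spacing grows linearly towards older frames,
        i.e., dense near the newest frame, sparse for older ones."""
        selected = []
        pos = len(window_frames) - 1    # start from the newest frame
        step = 1
        while pos >= 0 and len(selected) < max_keyframes:
            selected.append(window_frames[pos])
            pos -= step
            step += 1                   # offsets 1, 2, 3, ... towards older frames
        return list(reversed(selected))

    # Example: a window of 20 consecutive frames.
    print(select_keyframes(list(range(100, 120))))
    # -> [104, 109, 113, 116, 118, 119]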
3.3. Global anchor constraints

The goal of these constraints is to establish global links between different parts (possibly different subsequences) of a video that have shared scene content. These links can later be used in the linear camera pose estimation stage to obtain a good global initialization.

We establish these constraints by importance sampling frame pairs from the set of confident frames and by joining them based on SIFT features and the previously reconstructed window scene structure. To do this, we extract SIFT descriptors for all KLT features in the confident frames, and for each pair, try to integrate the camera of one frame using the scene structure associated with another frame using BA. The optimized camera pose is rated based on a confidence measure. Stable matching pairs among all possible confident frame pairs in the video sequence(s) are used as relative camera pose constraints for the global linear pose estimation stage (see Figure 1).

3.3.1 Camera stability

The stability of cameras for being used as global anchor constraints is based on the following measures ζ:

• The number of remaining points attached to a camera: ζ1 = n. This makes sure that there are enough constraints for optimization.

• The distribution of the point projections ρ in the image: ζ2 = min(Std(ρx), Std(ρy)). This avoids unstable configurations with very localized feature positions.

• The ratio between the smallest and the largest principal component PCA(p)min, PCA(p)max of the scene point positions p: ζ3 = PCA(p)min / PCA(p)max. This avoids using two-dimensional scenes, which tend to be ambiguous (e.g., camera in plane) or unstable (e.g., frontoparallel plane).
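These measures translate almost directly into code. The sketch below assumes the surviving point set and its projections in the anchored frame are given; it uses the singular values of the centered point cloud for ζ3, and takes the smaller image dimension for the ζ2 threshold, which the text leaves unspecified.

    import numpy as np

    def camera_stability(pts3d, proj2d):
        """pts3d: (n, 3) scene points still attached to the camera after verification;
        proj2d: (n, 2) their projections in the image. Returns (zeta1, zeta2, zeta3)."""
        zeta1 = len(pts3d)                                        # remaining points
        zeta2 = min(np.std(proj2d[:, 0]), np.std(proj2d[:, 1]))   # spread in the image
        # Principal components of the 3D point cloud: smallest over largest.
        centered = pts3d - pts3d.mean(axis=0)
        sing = np.linalg.svd(centered, compute_uv=False)          # descending order
        zeta3 = (sing[-1] / sing[0]) if sing[0] > 0 else 0.0
        return zeta1, zeta2, zeta3

    def is_stable(zeta1, zeta2, zeta3, image_size):
        """Thresholds from the geometric verification below; taking the smaller
        image dimension for the zeta2 bound is an assumption."""
        return zeta1 >= 25 and zeta2 >= 0.075 * min(image_size) and zeta3 >= 0.1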
3.3.2 Anchor selection

To find good anchor candidates for wide baseline links we use two measures: the cost for matching two frames, and the uncertainty of relative camera poses computed during the window BA.

Cost estimation. For robust estimation of the basic linking cost, we compute frame similarity based on histograms of SIFT features [30]. The output is a cost matrix C representing the cost for matching two frames of a video sequence (see Figure 6).

Uncertainty estimation. For uncertainty estimation, we approximate the variance of the camera poses for every pair of confident cameras in the following three steps:

1. Estimate the variance of each camera's pose ci relative to each window's structure, i.e., the 3D points computed from cameras in window wj:

   Var(ci, wj) ∝ 1 / (min(25, ζ1) · ζ2 · ζ3)²    (1)

   We assume that 25 reprojections are sufficiently many constraints.

2. Use this information to estimate the variance between windows by averaging the summed variances to common camera poses:

   Var(wj1, wj2) = (1/n²) · Σ i=1..n [ Var(ci, wj1) + Var(ci, wj2) ]    (2)

3. Find the camera→window→...→window→camera path with the lowest summed variance for each camera pair. While step 2 only considers variances for windows that share a camera, this step propagates the variance information to arbitrary indirectly connected camera pairs.
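Interpreted in code, the three steps amount to per camera/window variances (Eq. 1), pairwise window variances over shared cameras (Eq. 2), and a shortest-path propagation over a joint camera/window graph. Using scipy's shortest_path for step 3 and the dictionary-based bookkeeping are our own illustrative choices.

    import numpy as np
    from scipy.sparse.csgraph import shortest_path

    def camera_window_variance(zeta1, zeta2, zeta3):
        """Eq. (1): variance of one camera pose relative to one window's structure."""
        return 1.0 / (min(25.0, zeta1) * zeta2 * zeta3) ** 2

    def window_window_variance(var_cw, cams_j1, cams_j2, j1, j2):
        """Eq. (2): average of the summed variances over the n cameras the two
        windows have in common; var_cw[(i, j)] holds Var(c_i, w_j)."""
        common = set(cams_j1) & set(cams_j2)
        n = len(common)
        if n == 0:
            return None
        return sum(var_cw[(i, j1)] + var_cw[(i, j2)] for i in common) / n ** 2

    def propagate_variances(n_cams, n_windows, var_cw, var_ww):
        """Step 3: treat cameras and windows as nodes of one graph, use Eqs. (1)
        and (2) as edge weights, and take the lowest summed variance over all
        camera -> window -> ... -> window -> camera paths (zero entries = no edge)."""
        n = n_cams + n_windows
        W = np.zeros((n, n))
        for (i, j), v in var_cw.items():               # camera <-> window edges
            W[i, n_cams + j] = W[n_cams + j, i] = v
        for (j1, j2), v in var_ww.items():             # window <-> window edges
            if v is not None:
                W[n_cams + j1, n_cams + j2] = W[n_cams + j2, n_cams + j1] = v
        dist = shortest_path(W, method='D', directed=False)
        return dist[:n_cams, :n_cams]                  # variance matrix V over camera pairs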
This results in a variance matrix V (see Figure 6). We can now estimate a matrix S representing potential anchor frames to be used as global links:

   S = (1 − C) ◦ V    (3)

Note that Cij ∈ (0, 1) and ◦ is the element-wise product of matrices. We importance sample S to get frame pairs (f1, f2) that represent useful anchor constraints.

Figure 6. Cost, variance and sampling matrices for wide baseline candidate picking. The camera circled an object twice. Dark parts of C indicate regions where good global anchor constraints are likely to be found. S shows where we sample for global anchor constraints.
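A compact sketch of Eq. (3) and the subsequent importance sampling, assuming C has already been normalized to (0, 1) and V computed as above; skipping temporally adjacent frame pairs via min_gap is an addition of ours, not part of the method description.

    import numpy as np

    def sample_anchor_pairs(C, V, n_pairs=20, min_gap=30, rng=None):
        """C: (F, F) matching cost in (0, 1); V: (F, F) pose variance between
        confident frames. Returns (f1, f2) candidate anchor pairs drawn with
        probability proportional to S = (1 - C) * V (element-wise)."""
        rng = np.random.default_rng() if rng is None else rng
        F = C.shape[0]
        S = (1.0 - C) * V                        # Eq. (3), element-wise product
        # Keep each unordered pair once and ignore temporally adjacent frames.
        i, j = np.triu_indices(F, k=min_gap)
        weights = np.where(np.isfinite(S[i, j]), S[i, j], 0.0)
        probs = weights / weights.sum()
        picks = rng.choice(len(weights),
                           size=min(n_pairs, np.count_nonzero(probs)),
                           replace=False, p=probs)
        return [(int(i[k]), int(j[k])) for k in picks]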
Geometrical verification. To ensure that a global anchor constraint is truly useful, we perform a geometrical verification. We pick the window with the most available scene points that contains f1 and BA for the pose of f2's camera based on those points, utilizing SIFT matches for linking f2's features to f1's points.

In our experiments we observed that up to 40% of the matches were outliers when matching SIFT features extracted from KLT keypoints. Therefore, we exploit the already known scene geometry to gain robustness in this process. We apply four passes of BA for the camera pose parameters while removing all the points with reprojection errors worse than the average between the passes. Since BA tends to prefer consistent constraints, inconsistent reprojections are removed by this procedure. If there is not enough consistent data, BA diverges, which leads to a violation of our stability constraints. We consider the geometric verification successful if it passes the following stability thresholds: ζ1 ≥ 25, ζ2 ≥ 0.075 · ImageSize and ζ3 ≥ 0.1, which worked well in all our experiments. When a pair of frames representing a global anchor constraint fulfils these thresholds, we add the respective relative camera pose constraints to the existing set of constraints. In all our experiments, these thresholds reliably removed all outliers.
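The verification loop could be sketched as follows, with OpenCV's solvePnP acting as a stand-in for the pose-only BA described above; the four passes and the mean-error cutoff follow the text, everything else is illustrative (K is a 3x3 float camera matrix, distortion is assumed to be zero).

    import cv2
    import numpy as np

    def verify_anchor(K, pts3d, pts2d, n_passes=4):
        """pts3d: (M, 3) scene points of the window containing f1, matched via SIFT
        to pts2d: (M, 2) feature positions in f2. Returns (rvec, tvec, inlier_mask)
        of f2's camera, or None if too few consistent matches survive."""
        dist = np.zeros(5)                       # images are assumed undistorted
        mask = np.ones(len(pts3d), dtype=bool)
        rvec = tvec = None
        for _ in range(n_passes):
            if mask.sum() < 6:
                return None
            ok, rvec, tvec = cv2.solvePnP(pts3d[mask].astype(np.float64),
                                          pts2d[mask].astype(np.float64), K, dist)
            if not ok:
                return None
            proj, _ = cv2.projectPoints(pts3d.astype(np.float64), rvec, tvec, K, dist)
            err = np.linalg.norm(proj.reshape(-1, 2) - pts2d, axis=1)
            # Drop every match whose reprojection error is worse than the current mean.
            mask &= err <= err[mask].mean()
        return rvec, tvec, mask

The surviving point set and its projections would then be fed to the stability measures ζ1, ζ2, ζ3 described above to accept or reject the anchor pair.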
Figure 7 illustrates the effect of using the anchor constraints, based on sampling costs C and S, which additionally takes into account the variance matrix V. Using anchor constraints considerably reduces camera drift. By taking V into account in addition to the basic matching cost, drift can be reduced by another 40%.

3.4. Final optimization

The window BA and the global anchors now provide a large set of pose constraints. Using all these constraints we again apply global linear optimization [14] in order to compute a globally consistent 3D scene and camera calibration for all input frames.

We then apply a series of nonlinear least squares optimization passes based on the following three strategies:

A No Field of View (FoV) optimization, no bad point removal.
Figure 7. Comparison of loop closing strategies. Beside camera drift and reprojection error, we also show the number of samples needed to get 20 verified wide baseline links and the computation time for wide baseline handling. Without loop closing, the camera drift is quite high (out of scale: 0.08). Cost based frame selection for wide baseline handling reduces the drift drastically (use). Choosing frames also based on their value for wide baseline handling (encoded in V) reduces the drift by another 40% for a fair amount of extra samples/runtime (use+cost). Note that the reprojection error increases because of the extra constraints that have to be satisfied.

[Figure 9 plots: standard deviation and processing time for 176 and 637 frames, including and excluding tracking; legend: PTAM, LSD, Voodoo, VisSfM.]
Figure 9. Evaluation based on synthetic ground truth. We give the standard deviation of the reconstructions fitted to the ground truth with an affine transformation, plus timings. Our approach runs orders of magnitude faster than other SfM systems while producing results which are an order of magnitude more accurate than SLAM systems. PTAM failed after 176 frames due to too slow map update.

Figure 10. Breakdown of reconstruction timings to the individual pipeline parts.

Figure 11. Computation time comparison for a FullHD (1080p) image sequence. Our technique exhibits the lowest computation time of all tested approaches. Note that if only the SfM part without feature processing is considered, we are about 6x faster than the nearest competitor. For 400, 800 and 1600 frames, we did not obtain reconstruction timings from all methods. Separate timings for IO/features and SfM reconstruction could not be obtained with Voodoo.
typical to SfM. However, Figure 9 shows that it runs orders of magnitude faster than SfM systems, at comparable speed to current SLAM systems, while producing results which are an order of magnitude more accurate than SLAM systems.

Timings. Figure 10 breaks down the reconstruction timings for the scenes used in this paper to the individual parts of our pipeline. Reconstruction timings of our approach are further compared with several other techniques in Figure 11, analyzing timings for varying numbers of frames of the FullHD outdoor video sequence. We compare to two approaches designed for handling images (Bundler [2] and VisualSfM [4], GPU accelerated and parallelized) as well as two approaches for video sequences (Voodoo Camera Tracker and the recent LSD-SLAM [10]).

Methods intended for sparse, unstructured data suffer from n² runtime for searching corresponding images. The Voodoo Camera Tracker performs well for small tracks but becomes much slower when BA has to correct accumulated drift in the end. Even in comparison to recent efficient SLAM approaches such as LSD-SLAM, our method is faster. We also observed that increased image resolution can lead to significant drops in performance for the tested methods, whereas our method scales well due to the proposed subsampling. We expect further significant speed gains from improved preprocessing such as feature extraction and tracking, as the majority of computing time is spent on these steps, and not on our core optimization procedure (Figure 11).

Please also refer to the supplemental material on the project webpage [1] for reconstruction results on several other scenes, including very high resolution 5k video, multiple video sequence reconstructions and reconstructions from the Stanford Light Field datasets.

Limitations and future work. Our method currently computes only extrinsic camera parameters. As future work it would be interesting to support uncalibrated cameras with changing intrinsics. Moreover, the algorithm is limited by some of the components used. For instance, replacing the current OpenCV KLT tracking by a GPU based implementation and improving our point subsampling strategy, e.g., using stratified sampling, could lead to improved reconstruction quality and speed.

5. Conclusion

We introduced a novel pipeline that enables efficient computation of extrinsic camera poses and scene structure on high spatiotemporal resolution, densely sampled video sequences. One of the key insights in this work is that the coherence of such data enables the use of modified tracking, subsampling, and global optimization schemes, which in combination allow for considerably faster and more robust computation, similar to observations made in previous works [15, 33] in the context of 3D reconstruction. In particular we found that common choices in SfM such as n-point algorithms for initialization are problematic in this context and can be entirely replaced by BA-based approaches.

Given the constant increase of camera resolution and frame rate, and the advent of light field sensors by companies such as Lytro or Pelican Imaging, we believe that algorithms specifically designed for densely sampled input represent a great opportunity for future research in this area.

References

[1] http://www.disneyresearch.com/project/scalablesfm.
[2] Bundler Structure from Motion Toolkit. https://github.com/snavely/bundler_sfm. [Online; accessed 09-Nov-2014].
[3] Open Source Computer Vision Library. http://opencv.org/. [Online; accessed 09-Nov-2014].
[4] VisualSFM: A Visual Structure from Motion System. http://ccwu.me/vsfm/. [Online; accessed 09-Nov-2014].
[5] S. Agarwal, N. Snavely, S. M. Seitz, and R. Szeliski. Bundle adjustment in the large. In ECCV, 2010.
[6] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski. Building Rome in a day. In ICCV, 2009.
[7] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
[8] A. Davis, M. Levoy, and F. Durand. Unstructured light fields. Comp. Graph. Forum, 31(2pt1):305–314, May 2012.
[9] A. J. Davison, I. D. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE TPAMI, 29(6):1052–1067, 2007.
[10] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In ECCV, 2014.
[11] J. Frahm, P. F. Georgel, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y. Jen, E. Dunn, B. Clipp, and S. Lazebnik. Building Rome on a cloudless day. In ECCV, 2010.
[12] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2006.
[13] M. B. Hullin, J. Hanika, B. Ajdin, H.-P. Seidel, J. Kautz, and H. P. A. Lensch. Acquisition and analysis of bispectral bidirectional reflectance and reradiation distribution functions. ACM Trans. Graph., 29(4):97:1–97:7, July 2010.
[14] N. Jiang, Z. Cui, and P. Tan. A global linear method for camera pose registration. In ICCV, pages 481–488, 2013.
[15] C. Kim, H. Zimmer, Y. Pritch, A. Sorkine-Hornung, and M. Gross. Scene reconstruction from high spatio-angular resolution light fields. ACM Trans. Graph., 32(4):73:1–73:12, 2013.
[16] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In ISMAR, 2007.
[17] M. Li and A. I. Mourikis. High-precision, consistent EKF-based visual-inertial odometry. Int. J. Robotics Research, 32(6):690–711, 2013.
[18] H. Lim, S. N. Sinha, M. F. Cohen, M. Uyttendaele, and H. J. Kim. Real-time monocular image-based 6-DoF localization. Int. J. Robotics Research, 2014.
[19] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[20] D. Martinec and T. Pajdla. Robust rotation and translation estimation in multiview reconstruction. In CVPR, 2007.
[21] R. A. Newcombe, S. Lovegrove, and A. J. Davison. DTAM: Dense tracking and mapping in real-time. In ICCV, 2011.
[22] R. Ng. Digital Light Field Photography. PhD thesis, 2006.
[23] R. Parys and A. Schilling. Incremental large scale 3D reconstruction. In 3DIMPVT, 2012.
[24] M. Pollefeys, L. J. V. Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch. Visual modeling with a hand-held camera. IJCV, 59(3):207–232, 2004.
[25] N. Snavely, S. Seitz, and R. Szeliski. Skeletal graphs for efficient structure from motion. In CVPR, 2008.
[26] N. Snavely, S. M. Seitz, and R. Szeliski. Modeling the world from internet photo collections. IJCV, 80(2):189–210, 2008.
[27] R. Szeliski and S. B. Kang. Recovering 3D shape and motion from image streams using nonlinear least squares. In CVPR, 1993.
[28] T. Roosendaal (Producer). Sintel. Blender Foundation, Durian Open Movie Project. http://www.sintel.org/, 2010.
[29] T. Tuytelaars and K. Mikolajczyk. Local invariant feature detectors: A survey. Foundations and Trends in Computer Graphics and Vision, pages 177–280, 2007.
[30] O. Wang, C. Schroers, H. Zimmer, M. H. Gross, and A. Sorkine-Hornung. VideoSnapping: Interactive synchronization of multiple videos. ACM Trans. Graph., 33(4):77, 2014.
[31] B. Wilburn, N. Joshi, V. Vaish, E.-V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy. High performance imaging using large camera arrays. ACM Trans. Graph., 24(3):765–776, 2005.
[32] K. Wilson and N. Snavely. Robust global translations with 1DSfM. In ECCV, 2014.
[33] F. Yu and D. Gallup. 3D reconstruction from accidental motion. In CVPR, 2014.
[34] C. Zhang, J. Gao, O. Wang, P. Georgel, R. Yang, J. Davis, J. Frahm, and M. Pollefeys. Personal photograph enhancement using internet photo collections. IEEE Trans. Vis. Comput. Graph., 20(2):262–275, 2014.
[35] D. Zou and P. Tan. CoSLAM: Collaborative visual SLAM in dynamic environments. IEEE TPAMI, 35(2):354–366, 2013.