
2013 International Conference on 3D Vision

Towards Linear-time Incremental Structure from Motion

Changchang Wu
University of Washington

Abstract

The time complexity of incremental structure from motion (SfM) is often known as O(n^4) with respect to the number of cameras. As bundle adjustment (BA) has recently been significantly improved by preconditioned conjugate gradient (PCG), it is worth revisiting how fast incremental SfM is. We introduce a novel BA strategy that provides a good balance between speed and accuracy. Through algorithm analysis and extensive experiments, we show that incremental SfM requires only O(n) time on many major steps, including BA. Our method maintains high accuracy by regularly re-triangulating the feature matches that initially fail to triangulate. We test our algorithm on large photo collections and long video sequences with various settings, and show that our method offers state-of-the-art performance for large-scale reconstructions. The presented algorithm is available as part of VisualSFM at http://homes.cs.washington.edu/˜ccwu/vsfm/.

1. Introduction

Structure from motion (SfM) has been successfully used for the reconstruction of increasingly large uncontrolled photo collections [8, 2, 5, 4, 7]. Although a large photo collection can be reduced to a subset of iconic images [8, 5] or skeletal images [15, 2] for reconstruction, the commonly known O(n^4) cost of incremental SfM is still high for large scene components. Recently an O(n^3) method was demonstrated by using a combination of discrete and continuous optimization [4]. Nevertheless, it remains crucial for SfM to achieve low complexity and high scalability. This paper demonstrates our efforts at further improving the efficiency of SfM through the following contributions:

• We introduce a preemptive feature matching that can reduce the number of image pairs to match by up to 95% while still recovering sufficient good matches for reconstruction.
• We analyze the time complexity of conjugate gradient bundle adjustment methods. Through theoretical analysis and experimental validation, we show that bundle adjustment requires O(n) time in practice.
• We show that many sub-steps of incremental SfM, including BA and point filtering, require only O(n) time in practice when using a novel BA strategy.
• Without sacrificing the time complexity, we introduce a re-triangulation step to deal with the problem of accumulated drift without explicit loop closing.

The rest of the paper is organized as follows: Section 2 gives the background and related work. We introduce preemptive feature matching in Section 3, analyze the time complexity of bundle adjustment in Section 4, and propose our new SfM algorithm in Section 5. The experiments and conclusions are given in Sections 6 and 7.

2. Related Work

In a typical incremental SfM system (e.g. [14]), two-view reconstructions are first estimated upon successful feature matching between two images; 3D models are then reconstructed by initializing from good two-view reconstructions, repeatedly adding matched images, triangulating feature matches, and bundle-adjusting the structure and motion. The time complexity of such an incremental SfM algorithm is commonly known to be O(n^4) for n images, and this high computation cost impedes the application of such a simple incremental SfM to large photo collections.

Large-scale data often contains a lot of redundant information, which allows many computations of 3D reconstruction to be approximated in favor of speed, for example:

Image Matching. Instead of matching all the images to each other, Agarwal et al. [2] first identify a small number of candidates for each image by using vocabulary tree recognition [11], and then match the features by using approximate nearest neighbors. It is shown that this two-fold approximation of image matching still preserves enough feature matches for SfM. Frahm et al. [5] exploit approximate GPS tags and match images only to the nearby ones. In this paper, we present a preemptive feature matching to further improve the matching speed.

Bundle Adjustment. Since BA already exploits linear approximations of the non-linear optimization problem, it is often unnecessary to solve the exact descent steps. Recent algorithms have achieved significant speedups by using Preconditioned Conjugate Gradient (PCG) to approximately

978-0-7695-5067-1/13 $26.00 © 2013 IEEE
DOI 10.1109/3DV.2013.25
solve the linear systems [2, 3, 1, 16]. Similarly, there is no need to run a full BA for every new image, or to always wait until BA/PCG converges to very small residuals. In this paper, we show that linear time is required by bundle adjustments for large-scale uncontrolled photo collections.

Scene Graph. Large photo collections often contain more than enough images for high-quality reconstructions. The efficiency can be improved by reducing the number of images for the high-cost SfM. [8, 5] use the iconic images as the main skeleton of scene graphs, while [15, 2] extract skeletal graphs from the dense scene graphs. In practice, other types of improvements to large-scale SfM should be used jointly with scene graph simplifications. However, to push the limits of incremental SfM, we consider the reconstruction of a single connected component without simplifying the scene graphs.

In contrast to incremental SfM, other work tries to avoid the greedy manner. Gherardi et al. [6] proposed a hierarchical SfM through balanced branching and merging, which lowers the time complexity by requiring fewer BAs of large models. Sinha et al. [13] recover the relative camera rotations from vanishing points, and convert SfM to efficient 3D model merging. Recently, Crandall et al. [4] exploit GPS-based initialization and model SfM as a global MRF optimization. This work is a re-investigation of incremental SfM, and still shares some of its limitations, such as initialization affecting completeness, but does not rely on additional calibrations, GPS tags or vanishing point detection.

Notations. We use n, p and q to respectively denote the number of cameras, points and observations during a reconstruction. Given that each image has a limited number of features, we have p = O(n) and q = O(n). Since this paper considers a single connected scene graph, n is also used as the number of input images with abuse of notation.

3. Preemptive Feature Matching

Image matching is one of the most time-consuming steps of SfM. The full pairwise matching takes O(n^2) time for n input images; however, it is fine for large datasets to compute only a subset of the matching (e.g. by using a vocabulary tree [2]), and the overall computation can then be reduced to O(n). In addition, image matching can be easily parallelized onto multiple machines [2] or multiple GPUs [5] for more speedup. Nevertheless, feature matching is still one of the bottlenecks because typical images contain several thousands of features, which is the aspect we try to handle.

Due to the diversity of viewpoints in large photo collections, the majority of image pairs do not match (75%-98% for the large datasets we experimented with). A large portion of the matching time can be saved if we can identify the good pairs robustly and efficiently. By exploiting the scales of invariant features [10], we propose the following preemptive feature matching for this purpose:

1. Sort the features of each image into decreasing scale order. This is a one-time O(n) preprocessing step.
2. Generate the list of pairs that need to be matched, using either all the pairs or a selected subset ([2, 5]).
3. For each image pair (in parallel), do the following:
   (a) Match the first h features of the two images.
   (b) If the number of matches from the subset is smaller than th, return and skip the next step.
   (c) Do the regular matching and geometry estimation.

Here h is the parameter for the subset size, and th is the threshold for the expected number of matches. The feature matching of the subset and of the full set uses the same nearest neighbor algorithm with a distance ratio test [10], and requires the matched features to be mutually nearest.

Let k1 and k2 be the numbers of features of two images, and k = max(k1, k2). Let mp(h) and mi(h) be the numbers of putative and inlier matches when using up to h top-scale features. We define the yields of feature matching as

    Yp(h) = mp(h) / h   and   Yi(h) = mi(h) / h.   (1)

We are interested in how the yields of the feature subset correlate with the final yield Yi(k). As shown in Figure 1(a), the distributions of Yp(h) and Yi(h) are closely related to Yi(k) even for very small h, such as 100. Figure 1(b) shows that the chances of having a match within the top-scale features are on par with other features. This means that the top-scale subset has roughly an h/k chance of preserving a match, which is much higher than the h^2/(k1 k2) chance of randomly sampling h features. Additionally, the matching time for the majority of pairs is reduced to a factor of h^2/(k1 k2), along with better caching. In this paper, we choose h = 100 in consideration of both efficiency and robustness.

The top-scale features match well for several reasons: 1) A small number of top-scale features can cover a relatively large scale range due to the decreasing number of features at higher Gaussian levels. 2) Feature matching is well structured such that the large-scale features in one image often match with the large-scale features in another. The scale variation between two matched features is jointly determined by the camera motion and the scene structure, so the scale variations of multiple feature matches are not independent of each other. Figure 1(c) shows our statistics of the feature scale variations. The feature scale variations of an image pair usually have a small variance due to small viewpoint changes or well-structured scene depths. For the same reasons, we use up to 8192 features for the regular feature matching, which are sufficient for most scenarios. While large photo collections contain redundant views and features, our preemptive feature matching allows putting most of the effort into the pairs that are more likely to be matched.

(a) Yield when using a subset of features (medians of Yp(100) and of Yi(h) for h = 50, 100, 200, 400, plotted against the final inlier yield Yi(k)). (b) Histogram of the max index of a feature match. (c) The distribution of feature scale variations (difference of log2 scales).
Figure 1. (a) shows the relationship between the final yield and the yield from a subset of top-scale features. For each set of image pairs that have roughly the same final yield, we compute the median of their subset yields for different h. (b) gives the histogram of the max of the two indices of a feature match, where the index is the position in the scale-decreasing order. We can see that the chances of matching within the top-scale features are similar to other features. (c) shows two distributions of the scale variations computed from 130M feature matches. The mean scale variations between two images are given by the red curve, while the deviations of the scale variations from the means are given by the blue one. The variances of scale changes are often small due to small viewpoint changes or structured scene depths.
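The h/k versus h^2/(k1 k2) argument above can be checked with a quick back-of-the-envelope computation (the feature counts below are made-up values for illustration):

```python
# Chance that a correct match survives subset matching: keeping the h
# top-scale features per image preserves a match with probability ~h/k,
# while sampling h random features per image only gives h^2/(k1*k2).
k1, k2, h = 4000, 6000, 100               # hypothetical feature counts
k = max(k1, k2)

top_scale_chance = h / k                  # ~0.0167
random_sample_chance = h * h / (k1 * k2)  # ~0.0004
speed_factor = (k1 * k2) / (h * h)        # matching work shrinks by this factor

print(top_scale_chance / random_sample_chance)  # top-scale is ~40x likelier
print(speed_factor)                             # while doing 2400x less work
```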

4. How Fast Is Bundle Adjustment?

Bundle adjustment (BA) is the joint non-linear optimization of structure and motion parameters, for which Levenberg-Marquardt (LM) is the method of choice. Recently, the performance of large-scale bundle adjustment has been significantly improved by Preconditioned Conjugate Gradient (PCG) [1, 3, 16], hence it is worth re-examining the time complexity of BA and SfM.

Let x be a vector of parameters and f(x) be the vector of reprojection errors for a 3D reconstruction. The optimization we wish to solve is the non-linear least squares problem x* = arg min_x ||f(x)||^2. Let J be the Jacobian of f(x) and Λ a non-negative diagonal matrix for regularization; then LM repeatedly solves the following linear system

    (J^T J + Λ) δ = −J^T f,

and updates x ← x + δ if ||f(x + δ)|| < ||f(x)||. The matrix H_Λ = J^T J + Λ is known as the augmented Hessian matrix. Gaussian elimination is typically used to first obtain a reduced linear system on the camera parameters, which is called the Schur complement.

The Hessian matrix requires O(q) = O(n) space and time to compute, while the Schur complement requires O(n^2) space and time to compute. It takes cubic time O(n^3) or O(p^3) to solve the linear system by Cholesky factorization. Because the number of cameras is much smaller than the number of points, the Schur complement method can reduce the factorization time by a large constant factor. One impressive example of this algorithm is Lourakis and Argyros' SBA [9], used by Photo Tourism [14].

For conjugate gradient methods, the dominant computation is the matrix-vector multiplication in the CG iterations, whose time complexity is determined by the size of the involved matrices. By using only the O(n) space Hessian matrices and avoiding the Schur complement, each CG iteration achieves O(n) time complexity [1, 3]. Recently, the multicore bundle adjustment [16] goes one step further by using implicit multiplication of the Hessian matrices and Schur complements, which requires constructing only the O(n) space Jacobian matrices. In this work, we use the GPU version of multicore bundle adjustment. Figure 2(a) shows the timing of CG iterations from our bundle adjustment problems, which exhibits a linear relationship between the time Tcg of a CG iteration and n.

In each LM step, PCG requires O(√κ) iterations to accurately solve a linear system [12], where κ is the condition number of the linear system. Small condition numbers can be obtained by using good preconditioners; for example, quick convergence has been demonstrated with the block-Jacobi preconditioner [1, 16]. In our experiments, PCG uses an average of 20 iterations to solve a linear system.

Surprisingly, the time complexity of bundle adjustment has already reached O(n), provided that there are O(1) CG iterations in a BA. Although the actual number of CG/LM iterations depends on the difficulty of the input problems, the O(1) assumption for CG/LM iterations is well supported by the statistics we collect from a large number of BAs of varying problem sizes. Figure 2(b) gives the distribution of LM iterations used by a BA, where the average is 37 and 93% of BAs converge within less than 100 LM iterations. Figure 2(c) shows the distribution of the total CG iterations (used by all LM steps of a BA), which is also normally small. In practice, we choose to use at most 100 LM iterations per BA and at most 100 CG iterations per LM step, which guarantees good convergence for most problems.

5. Incremental Structure from Motion

This section presents the design of our SfM, which practically has a linear run time. The fact that bundle adjustment can be done in O(n) time opens an opportunity of push-

(a) Time spent on a single CG iteration (Tcg in seconds against n, for Central Rome, Arts Quad, Loop, St. Peter's Basilica and Colosseum). (b) Number of LM iterations used by a BA (mean = 37, median = 29; 93% smaller than 100). (c) Number of CG iterations used by a BA (mean = 881, median = 468; 90% smaller than 2000; histogram bin size = 100).
Figure 2. Bundle adjustment statistics. (a) shows that Tcg is roughly linear in n regardless of the scene graph structure. (b) and (c) show the distributions of the number of LM and CG iterations used by a BA. It can be seen that BA typically converges within a small number of LM and CG iterations. Note that we consider a BA converged if the mean squared error drops below 0.25 or cannot be decreased.
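The LM inner step of Section 4 can be sketched in a matrix-free way, solving (J^T J + Λ)δ = −J^T f with plain (unpreconditioned) conjugate gradient so that only Jacobian-vector products are needed, in the spirit of the implicit multiplication used by [16]. The dense toy Jacobian and the diagonal Λ = λI are simplifying assumptions; real BA solvers exploit the sparse block structure and a block-Jacobi preconditioner.

```python
import numpy as np

def cg_solve(matvec, b, iters=100, tol=1e-12):
    """Conjugate gradient for H x = b, with H given only as a matvec."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = matvec(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def lm_step(f, J, x, lam=1e-8):
    """One Levenberg-Marquardt step: solve (J^T J + lam*I) delta = -J^T f
    without ever forming J^T J, and accept only improving updates."""
    fx, Jx = f(x), J(x)
    hessian_vec = lambda v: Jx.T @ (Jx @ v) + lam * v   # H_Lambda @ v, implicitly
    delta = cg_solve(hessian_vec, -Jx.T @ fx)
    return x + delta if np.linalg.norm(f(x + delta)) < np.linalg.norm(fx) else x
```

On a linear least-squares toy problem, a single accepted step with small lam already lands near the normal-equation solution.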

ing incremental reconstruction time closer to O(n). As illustrated in Figure 3, our algorithm adds a single image at each iteration, and then runs either a full BA or a partial BA. After BA and filtering, we take another choice between continuing to the next iteration and a re-triangulation step.

Figure 3. A single iteration of the proposed incremental SfM (add camera → partial BA or full BA → filtering → optionally re-triangulation and full BA), which is repeated until no more images can be added to the reconstruction.

5.1. How Fast Is Incremental SfM?

The cameras and 3D points normally get stabilized quickly during reconstruction, thus it is unnecessary to optimize all camera and point parameters at every iteration. A typical strategy is to perform full BAs after adding a constant number of cameras (e.g. α), for which the accumulated time of all the full BAs is

    sum_{i=1}^{n/α} T_BA(i α) = O( sum_{i=1}^{n/α} i α ) = O( n^2 / α ),   (2)

when using PCG. This is already a significant reduction from the O(n^4) time of Cholesky factorization based BA. As the models grow larger and more stable, it is affordable to skip more costly BAs, so we would like to study how much further we can go without losing accuracy.

In this paper, we find that the accumulated time of BAs can be reduced even further, to O(n), by using a geometric sequence for full BAs. We propose to perform full optimizations only when the size of a model increases relatively by a certain ratio r (e.g. 5%), and the resulting time spent on full BAs approximately becomes

    T_BA = sum_i O( n / (1+r)^i ) = O( n / r ).   (3)

Although the later added cameras are optimized by fewer full BAs, there are normally no accuracy problems because the full BAs always improve more the parts that have larger errors. As the model gets larger, more cameras are added before running a full BA. In order to reduce the accumulated errors, we keep running local optimizations by using partial BAs on a constant number of recently added cameras (we use 20) and their associated 3D points. Such partial optimizations involve O(1) camera and point parameters, so each partial BA takes O(1) time. Therefore, the time spent on full BAs and partial BAs adds up to O(n), which is experimentally validated by the time curves in Figure 6.

Following the BA step, we filter the points that have large reprojection errors or small triangulation angles. The time complexity of a full point filtering step is O(n). Fortunately, we only need to process the 3D points that have been changed, so the point filtering after a partial BA can be done in O(1) time. Although each point filtering after a full BA takes O(n) time, these add up to only O(n) due to the geometric sequence. Therefore, the accumulated time of all point filtering is also O(n).

Another expensive step is to organize the resection candidates and to add new cameras to the 3D models. We keep track of the potential 2D-3D correspondences during SfM by using the feature matches of newly added images. If each image is matched to O(1) images, which is a reasonable assumption for large-scale photo collections, it requires O(1) time to update the correspondence information at each iteration. Another O(1) time is needed to add the camera to the model. The accumulated time of these steps is again O(n).

We have shown that the major steps of incremental SfM contribute an O(n) time complexity. However, the above analysis ignored several things: 1) finding the portion of data for partial BA and partial filtering; 2) finding the subset of images that match with a single image in the resection stage; 3) comparison of cameras during the resection stage. These steps require O(n) scan time at each step and add up to O(n^2) time in theory. It is possible to keep track of these subsets for further reduction of time complexity. However,
since the O(n) part dominates the reconstruction in our experiments (up to 15K cameras, see Table 2 for details), we have not tried to further optimize these O(n^2) steps.

5.2. Re-triangulation (RT)

Incremental SfM is known to have drifting problems due to the accumulated errors of relative camera poses. The constraint between two camera poses is provided by their triangulated feature matches. Because the initially estimated poses, and even the poses after a partial BA, may not be accurate enough, some correct feature matches may fail to triangulate for a given triangulation threshold and filtering threshold. The accumulated loss of correct feature matches is one of the main reasons for the drifting.

To deal with this problem, we propose to re-triangulate (RT) the failed feature matches regularly (with delay) during incremental SfM. A good indication of a possibly bad relative pose between two cameras is a low ratio between their common points and their feature matches, which we call under-reconstructed. In order to wait for the poses to get more stabilized, we re-triangulate the under-reconstructed pairs under a geometric sequence (e.g. r = 25%, when the size of a model increases by 25%). To obtain more points, we also increase the threshold for reprojection errors during RT. After re-triangulating the feature matches, we run full BA and point filtering to improve the reconstruction. Each RT step requires O(n) time, and these accumulate to the same O(n) time thanks to the geometric sequence.

The proposed RT step is similar to loop closing, which however deals with drift only when loops are detected. By looking for the under-reconstructed pairs, our method is able to reduce the drift errors without explicit loop detection, given that there are sufficient feature matches. In fact, the RT step is more general because it works for the relative pose between any matched images, and it also makes loop detection even easier. Figure 4 shows our incremental SfM with RT on a 4K image loop, which correctly reconstructs the long loop without explicit loop closing.

Figure 4. Our reconstruction of the Loop dataset of 4342 frames, shown without RT and with RT (the blue dots are the camera positions). Our algorithm correctly handles the drifting problem in this long loop by using RT.

6. Experiments

We apply our algorithms to five datasets of different sizes. The Central Rome and Arts Quad datasets are obtained from the authors of [4], and contain 32768 and 6514 images respectively. The Loop dataset consists of 4342 frames of a high-resolution video sequence of a street block. The St. Peter's Basilica and Colosseum datasets have 1275 and 1164 images respectively. We run all the reconstructions on a PC with an Intel Xeon 5680 3.33GHz CPU (24 cores), 12GB RAM, and an nVidia GTX 480 GPU.

6.1. Feature Matching

To allow experiments on the largest possible models, we first try to match sufficient image pairs. Full pairwise matching is computed for the St. Peter's Basilica and Colosseum datasets. For the Arts Quad and Loop datasets, each image is matched to the nearby ones according to GPS. For the Central Rome dataset, our preemptive matching with h = 100 and th = 4 is applied.

We then run our incremental SfM with the subset of image matches that satisfy the preemptive matching for h = 100 and different th. Table 1 shows the statistics of the feature matches and the number of cameras of the largest reconstructed models. With preemptive matching, we are able to reconstruct the largest SfM model of 15065 cameras for Rome and complete models for the other datasets. Preemptive matching significantly reduces the number of image pairs for regular matching while still preserving a large portion of the correct feature matches. For example, 43% of the feature matches are obtained with 6% of the image pairs for St. Peter's. It is worth noting that it is extremely fast to match the top 100 features: our system has an average speed of 73K pairs per second with 24 threads.

Preemptive matching has a small chance of losing weak links when using a large threshold. Complete models are reconstructed for all the datasets when th = 2. However, for th = 4, we find that one building in Arts Quad is missing due to occlusions, and the Colosseum model breaks into the interior and the exterior. We believe a more adaptive scheme of preemptive matching should be explored in the future.

6.2. Incremental SfM

We run our algorithm with all computed feature matches for the five datasets using the same settings. Table 2 summarizes the statistics and timing of the experiments for full BA r = 5% and RT r = 25%. Figures 4, 5 and 7 show screenshots of our SfM models. Our method efficiently reconstructs large, accurate and complete models with high point density. In particular, the two 1K image datasets are reconstructed in less than 10
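The under-reconstructed test that drives RT in Section 5.2 can be sketched as a ratio check over matched image pairs; the 0.2 threshold below is an illustrative assumption, not a value from the paper:

```python
def under_reconstructed_pairs(pair_stats, min_ratio=0.2):
    """pair_stats maps (image_a, image_b) -> (common_points, feature_matches).
    A low ratio of common 3D points to feature matches suggests a bad
    relative pose, so the pair is scheduled for re-triangulation."""
    return [pair for pair, (common, matches) in pair_stats.items()
            if matches > 0 and common / matches < min_ratio]
```

After re-triangulating the flagged pairs with a relaxed reprojection threshold, a full BA and point filtering would follow, as described in Section 5.2.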

               Without Preemptive Matching                        Using Preemptive Matching (h = 100)
Dataset        Pairs to   Pairs with    Feature    n        th   Pairs to       Pairs with    Feature       n
               Match      15+ Inliers   Matches                  Match          15+ Inliers   Matches
Central Rome   N/A        N/A           N/A        N/A      4    13551K         540K          67M           15065
Arts Quad      15402K     192K          32M        5624     4    521K (3%)      62K (32%)     25M (78%)     4272
                                                            2    4308K (28%)    121K (63%)    29M (91%)     5393
Loop           709K       329K          158M       4342     4    269K (38%)     235K (71%)    150M (95%)    4342
                                                            8    151K (21%)     150K (46%)    135M (85%)    4342
St. Peter's    812K       217K          21M        1267     4    220K (27%)     100K (46%)    14M (67%)     1262
                                                            8    46K (6%)       38K (18%)     9.1M (43%)    1211
Colosseum      677K       54K           6.8M       1157     4    23K (3%)       13K (24%)     4.1M (60%)    517+426
                                                            2    149K (22%)     28K (52%)     5.4M (79%)    1071

Table 1. Reconstruction comparison for different feature matching. We first try th = 4, and then try th = 8 if the result is complete, or th = 2 if the resulting reconstruction is incomplete. Comparably complete models are reconstructed when using preemptive matching, where only a small set of image pairs needs to be matched. All reconstructions use the same settings, r = 5% for full BA and r = 25% for RT.

Dataset        Input    Cameras   Points     Observations   Time (seconds)
               Images   n         p          q              Overall (t)   Full BA   Partial BA   Adding   Filtering
Central Rome   32768    15065     1660415    12903348       6010          2008      2957         549      247
Arts Quad      6514     5624      819292     5838784        2132          1042      807          122      57
Loop           4342     4342      1101515    7195960        3251          1731      478          523      47
St. Peter's    1275     1267      292379     2706250        583           223       268          48       20
Colosseum      1164     1157      293724     1759136        591           453       100          19       9

Table 2. Reconstruction summary and timing (in seconds). Only the reconstruction of the largest model is reported for each dataset. The reported time of full BA includes the BAs after the RT steps. The "adding" part includes the time for updating resection candidates, estimating camera poses, and adding new cameras and points to a 3D model.

(a) Sparse reconstruction of Central Rome (15065 cameras) (b) Overlay on the aerial image
Figure 5. Our Rome reconstruction (the blue dots are the camera positions). The recovered structure is accurate compared to the map.

minutes, and the 15K camera model of Rome is computed within 1.67 hours. Additional results under various settings can be found in Table 3 and the supplemental material.

The timing in Table 2 shows that BAs (full + partial) account for the majority of the reconstruction time, so the time complexity of the BAs approximates that of the incremental SfM. We keep track of the time of reconstruction and the accumulated time of each step. As shown in Figure 6, the reconstruction time (as well as the accumulated time of full BAs and partial BAs) increases roughly linearly as the models grow larger, which validates our analysis of the time complexity. The Loop reconstruction has a higher slope mainly due to its higher number of feature matches.

6.3. Reconstruction Quality and Speed

Figures 4, 5 and 7 demonstrate high quality reconstructions that are comparable to other methods, and we have also reconstructed more cameras than DISCO [4] for both

(a) Reconstruction time. (b) Accumulated time of full BA. (c) Accumulated time of partial BA. (Each panel plots time in seconds against n for Central Rome, Arts Quad, Loop, St. Peter's Basilica and Colosseum.)
Figure 6. The reconstruction time in seconds as the 3D models grow larger. Here we report the timing right after each full BA. The reconstruction time and the accumulated time of full BAs and partial BAs increase roughly linearly with respect to the number of cameras. Note the higher slope of the Loop reconstruction is due to the higher number of feature matches between nearby video frames.
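The linear growth in Figure 6(b) is what Equations (2) and (3) predict. The contrast between the constant-interval and geometric full-BA schedules can be reproduced with a toy cost model, where each full BA is charged a cost proportional to the current model size (a stand-in for one O(n) PCG-based BA); this is an illustrative simulation, not the paper's measurement code:

```python
def accumulated_full_ba_cost(n, alpha=None, r=None):
    """Sum of model sizes at which a full BA fires: every alpha cameras
    (constant interval) or whenever the model grows by ratio r (geometric)."""
    cost, next_trigger = 0, 1
    for size in range(1, n + 1):          # one camera added per iteration
        if size >= next_trigger:
            cost += size                  # one full BA, O(size) with PCG
            next_trigger = size + alpha if alpha else int(size * (1 + r)) + 1
    return cost

# Doubling n: the constant-interval cost grows ~4x (O(n^2/alpha)),
# while the geometric-schedule cost grows only ~2x (O(n/r)).
for n in (2000, 4000):
    print(n, accumulated_full_ba_cost(n, alpha=100),
             accumulated_full_ba_cost(n, r=0.05))
```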

(a) Arts Quad (5624 cameras) (b) St. Peter’s Basilica (1267 cameras) (c) Colosseum (1157 cameras)
Figure 7. Our reconstruction of Arts Quad, St. Peter’s Basilica and Colosseum.

Central Rome and Arts Quad. In particular, Figure 5 shows the correct overlay of our 15065-camera model of Central Rome on an aerial image. The Loop reconstruction shows that the RT steps correctly deal with the drifting problem without explicit loop closing. The reconstruction is first pushed away from local minima by RT and then improved by the subsequent BAs. The robustness and accuracy of our method are due to the strategy of mixed BA and RT.

We evaluate the accuracy of the reconstructed cameras by comparing their positions to the ground-truth locations. For the Arts Quad dataset, our reconstruction (r = 5% and r = 25%) contains 261 of the 348 images whose ground-truth GPS locations are provided by [4]. We used RANSAC to estimate a 3D similarity transformation between the 261 camera locations and their Euclidean coordinates. With the best transformation found, our 3D model has a mean error of 2.5 meters and a median error of 0.89 meters, which is smaller than the 1.16-meter error reported by [4].

We evaluate the reconstruction speed by two measures:
• t/n: the time needed to reconstruct a camera.
• t/q: the time needed to reconstruct an observation.
Table 3 compares our reconstructions under various settings with the DISCO and bundler reconstructions of [4]. While producing comparably large models, our algorithm normally requires less than half a second to reconstruct a camera in terms of overall speed. By t/n, our method is 8-19X faster than DISCO and 55-163X faster than bundler. Similarly, by t/q, our method is 5-11X faster than DISCO and 56-186X faster than bundler. It is worth pointing out that our system uses only a single PC (12GB RAM) while DISCO uses a 200-core cluster.

6.4. Discussions

Although our experiments show approximately linear running times, the reconstruction takes O(n^2) time in theory and will exhibit such a trend as the problem grows even larger. In addition, it is possible that the proposed strategy will fail for extremely large reconstruction problems due to larger accumulated errors. Thanks to the stability of SfM and the mixed BAs and RT, our algorithm works without quality problems even for 15K cameras.

This paper focuses on the incremental SfM stage, while the bottleneck of 3D reconstruction is sometimes the image matching. In order to test our incremental SfM algorithm on the largest possible models, we have matched up to O(n^2) image pairs for several datasets, which is more than the vocabulary-tree strategy that chooses O(n) pairs to match [2]. However, this does not prevent our method from working with fewer image matches. From a different angle, we have contributed the preemptive matching that can significantly reduce the matching cost.
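The accuracy evaluation above aligns the reconstructed camera positions to the GPS ground truth with a RANSAC-estimated 3D similarity transform and then measures residual distances. The paper does not specify its solver, so the following sketch is an assumption: it uses the closed-form Umeyama least-squares fit as the minimal-sample solver, and the iteration count and inlier threshold are illustrative, not the paper's values.

```python
import numpy as np

def similarity_from_points(src, dst):
    # Closed-form (Umeyama) least-squares similarity s, R, t with dst ~ s*R*src + t.
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # keep R a proper rotation
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / xs.var(axis=0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def ransac_align(src, dst, iters=2000, thresh=2.0, seed=0):
    # Sample minimal 3-point sets, keep the similarity with the most inliers,
    # then refit on that inlier set and return per-camera residuals (in the
    # units of dst, e.g. meters for GPS coordinates).
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        s, R, t = similarity_from_points(src[idx], dst[idx])
        err = np.linalg.norm(src @ (s * R).T + t - dst, axis=1)
        inliers = err < thresh
        if best is None or inliers.sum() > best.sum():
            best = inliers
    s, R, t = similarity_from_points(src[best], dst[best])
    err = np.linalg.norm(src @ (s * R).T + t - dst, axis=1)
    return (s, R, t), err
```

Applied to the 261 camera/GPS correspondences, `err.mean()` and `np.median(err)` would correspond to the mean and median errors reported above.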

Dataset        Full BA   Partial BA       RT        n      q        t          t/n    t/q
Central Rome   r = 5%    Every Image      r = 25%   15065  12903K   1.67 hour  0.40s  0.47ms
               r = 25%   Every Image      r = 50%   15113  12958K   1.32 hour  0.31s  0.37ms
               r = 5%    Every 3 Images   r = 25%   14998  12599K   1.03 hour  0.25s  0.29ms
               DISCO result of [4]                  14754  21544K   13.2 hour  3.2s   2.2ms
               Bundler result of [4]                13455  5411K    82 hour    22s    54ms
Arts Quad      r = 5%    Every Image      r = 25%   5624   5839K    0.59 hour  0.38s  0.37ms
               r = 25%   Every Image      r = 50%   5598   5850K    0.42 hour  0.27s  0.26ms
               r = 5%    Every 3 Images   r = 25%   5461   5530K    0.53 hour  0.35s  0.35ms
               DISCO result of [4]                  5233   9387K    7.7 hour   5.2s   2.9ms
               Bundler result of [4]                5028   10521K   62 hour    44s    21ms
Loop           r = 5%    Every Image      r = 25%   4342   7196K    3251s      0.75s  0.45ms
               r = 25%   Every Image      r = 50%   4342   7574K    1985s      0.46s  0.26ms
               r = 5%    Every 3 Images   r = 25%   4341   7696K    3207s      0.74s  0.41ms
St. Peter's    r = 5%    Every Image      r = 25%   1267   2706K    583s       0.46s  0.22ms
               r = 25%   Every Image      r = 50%   1267   2760K    453s       0.36s  0.16ms
               r = 5%    Every 3 Images   r = 25%   1262   2668K    367s       0.29s  0.14ms
Colosseum      r = 5%    Every Image      r = 25%   1157   1759K    591s       0.51s  0.34ms
               r = 25%   Every Image      r = 50%   1087   1709K    205s       0.19s  0.12ms
               r = 5%    Every 3 Images   r = 25%   1091   1675K    471s       0.43s  0.28ms

Table 3. The statistics of our reconstructions under various settings. We reconstruct larger models than DISCO and bundler under the various settings, and our method also runs significantly faster. Additional results will be presented in the supplemental material.
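The t/n (seconds per camera) and t/q (milliseconds per observation) columns of Table 3 are simple ratios of the table's own n, q, and t columns. As a check, the derived values for the three Central Rome runs can be reproduced as:

```python
# Reproduce the derived t/n and t/q columns of Table 3 for Central Rome
# (q is given in thousands of observations, t in hours for these rows).
rows = [  # (setting, n cameras, q observations, t in hours)
    ("r=5%, BA every image",    15065, 12903e3, 1.67),
    ("r=25%, BA every image",   15113, 12958e3, 1.32),
    ("r=5%, BA every 3 images", 14998, 12599e3, 1.03),
]
for setting, n, q, hours in rows:
    t = hours * 3600.0  # total seconds
    print(f"{setting}: t/n = {t / n:.2f}s, t/q = {t / q * 1e3:.2f}ms")
```

This prints 0.40 s / 0.47 ms, 0.31 s / 0.37 ms, and 0.25 s / 0.29 ms, matching the corresponding table entries.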

7. Conclusions and Future Work

This paper revisits and improves the classic incremental SfM algorithm. We propose a preemptive matching method that can significantly reduce the feature matching cost for large-scale SfM. Through algorithmic analysis and extensive experimental validation, we show that incremental SfM is of O(n^2) time complexity, but requires only O(n) time for its major steps while still maintaining the reconstruction accuracy using mixed BAs and RT. The practical run time is approximately O(n) for large problems of up to 15K cameras. Our system demonstrates state-of-the-art performance with an average speed of reconstructing about 2 cameras and more than 2000 feature points per second for very large photo collections.

In the future, we wish to explore a more adaptive preemptive feature matching and guide the full BAs according to the accumulation of reprojection errors.

References

[1] S. Agarwal, N. Snavely, S. Seitz, and R. Szeliski. Bundle adjustment in the large. In ECCV, pages II: 29–42, 2010.
[2] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski. Building Rome in a day. In ICCV, 2009.
[3] M. Byrod and K. Astrom. Conjugate gradient bundle adjustment. In ECCV, pages II: 114–127, 2010.
[4] D. Crandall, A. Owens, N. Snavely, and D. P. Huttenlocher. Discrete-continuous optimization for large-scale structure from motion. In CVPR, 2011.
[5] J. Frahm, P. Fite Georgel, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y. Jen, E. Dunn, B. Clipp, S. Lazebnik, and M. Pollefeys. Building Rome on a cloudless day. In ECCV, pages IV: 368–381, 2010.
[6] R. Gherardi, M. Farenzena, and A. Fusiello. Improving the efficiency of hierarchical structure-and-motion. In CVPR, pages 1594–1600, 2010.
[7] A. Kushal, B. Self, Y. Furukawa, C. Hernandez, D. Gallup, B. Curless, and S. Seitz. Photo tours. In 3DimPVT, 2012.
[8] X. Li, C. Wu, C. Zach, S. Lazebnik, and J. Frahm. Modeling and recognition of landmark image collections using iconic scene graphs. In ECCV, 2008.
[9] M. A. Lourakis and A. Argyros. SBA: A software package for generic sparse bundle adjustment. ACM Trans. Math. Software, 36(1):1–30, 2009.
[10] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60:91–110, 2004.
[11] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, pages 2161–2168, 2006.
[12] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994.
[13] S. N. Sinha, D. Steedly, and R. Szeliski. A multi-stage linear approach to structure from motion. In ECCV RMLE workshop, 2010.
[14] N. Snavely, S. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. In SIGGRAPH, pages 835–846, 2006.
[15] N. Snavely, S. M. Seitz, and R. Szeliski. Skeletal graphs for efficient structure from motion. In CVPR, 2008.
[16] C. Wu, S. Agarwal, B. Curless, and S. M. Seitz. Multicore bundle adjustment. In CVPR, 2011.
