Towards Linear-Time Incremental Structure from Motion
Changchang Wu
University of Washington
[Figure 1: (a) yield when using a subset of features (medians of Yp(100), Yi(400), Yi(200) and Yi(50) vs. the final inlier yield Yi(k)); (b) histogram of the max index of a feature match, split into three parts [0-200], [200-1000], [1000-]; (c) the distribution of feature scale variations (difference of log2 scales): the mean of each image pair and the deviations from the means.]
Figure 1. (a) shows the relationship between the final yield and the yield from a subset of top-scale features. For each set of image pairs
that have roughly the same final yield, we compute the median of their subset yield for different h. (b) gives the histogram of the max
of two indices of a feature match, where the index is the position in the scale-decreasing order. We can see that the chances of matching
within top-scale features are similar to other features. (c) shows two distributions of the scale variations computed from 130M feature
matches. The mean scale variations between two images are given by the red curve, while the deviations of scale variation from means are
given by the blue one. The variances of scale changes are often small due to small viewpoint changes or structured scene depths.
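For concreteness, the subset yield plotted in (a) can be computed directly from a putative match list. The following is our sketch, not the paper's code; it assumes feature indices are ranks in the scale-decreasing order described in (b):

```python
import numpy as np

def subset_yield(matches, inlier_mask, h):
    """Inlier yield Y(h) over the matches whose two feature indices both
    fall within the top h by scale (indices are scale-decreasing ranks)."""
    in_subset = (matches[:, 0] < h) & (matches[:, 1] < h)
    if not in_subset.any():
        return 0.0
    return float(inlier_mask[in_subset].mean())

# toy match list: (index in image 1, index in image 2) plus inlier flags
matches = np.array([[0, 2], [1, 0], [3, 5], [40, 7], [80, 90], [120, 60]])
inliers = np.array([True, True, False, True, False, False])
print(subset_yield(matches, inliers, h=50))  # 0.75, yield among top-50-scale matches
```

The point of Figure 1(a) is that this cheap statistic, computed on top-scale features only, predicts the final yield well enough to decide whether a pair is worth full matching.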
4. How Fast Is Bundle Adjustment?

Bundle adjustment (BA) is the joint non-linear optimization of structure and motion parameters, for which Levenberg-Marquardt (LM) is the method of choice. Recently, the performance of large-scale bundle adjustment has been significantly improved by Preconditioned Conjugate Gradient (PCG) [1, 3, 16], hence it is worth re-examining the time complexity of BA and SfM.

Let x be a vector of parameters and f(x) be the vector of reprojection errors for a 3D reconstruction. The optimization we wish to solve is the non-linear least squares problem x* = argmin_x ||f(x)||^2. Let J be the Jacobian of f(x) and Λ a non-negative diagonal matrix for regularization; LM then repeatedly solves the linear system

$$(J^T J + \Lambda)\,\delta = -J^T f,$$

and updates x ← x + δ if ||f(x + δ)|| < ||f(x)||. The matrix H_Λ = J^T J + Λ is known as the augmented Hessian matrix. Gaussian elimination is typically used to first obtain a reduced linear system in the camera parameters, which is called the Schur complement.

The Hessian matrix requires O(q) = O(n) space and time to compute, while the Schur complement requires O(n^2) space and time to compute. It takes cubic time O(n^3) or O(p^3) to solve the linear system by Cholesky factorization. Because the number of cameras is much smaller than the number of points, the Schur complement method can reduce the factorization time by a large constant factor. One impressive example of this algorithm is Lourakis and Argyros' SBA [9], used by Photo Tourism [14].

For conjugate gradient methods, the dominant computation is the matrix-vector multiplication in the CG iterations, whose time complexity is determined by the size of the involved matrices. By using only the O(n)-space Hessian matrices and avoiding the Schur complement, the CG iteration has achieved O(n) time complexity [1, 3]. Recently, multicore bundle adjustment [16] takes one step further by using implicit multiplication of the Hessian matrices and Schur complements, which requires constructing only the O(n)-space Jacobian matrices. In this work, we use the GPU version of multicore bundle adjustment. Figure 2(a) shows the timing of CG iterations from our bundle adjustment problems, which exhibits a linear relationship between the time Tcg of a CG iteration and n.

In each LM step, PCG requires O(√κ) iterations to accurately solve a linear system [12], where κ is the condition number of the linear system. Small condition numbers can be obtained by using good preconditioners; for example, quick convergence has been demonstrated with the block-Jacobi preconditioner [1, 16]. In our experiments, PCG uses an average of 20 iterations to solve a linear system.

Surprisingly, the time complexity of bundle adjustment has already reached O(n), provided that there are O(1) CG iterations in a BA. Although the actual number of CG/LM iterations depends on the difficulty of the input problems, the O(1) assumption for CG/LM iterations is well supported by the statistics we collect from a large number of BAs of varying problem sizes. Figure 2(b) gives the distribution of LM iterations used by a BA, where the average is 37 and 93% of BAs converge within 100 LM iterations. Figure 2(c) shows the distribution of the total CG iterations (used by all LM steps of a BA), which is also normally small. In practice, we choose to use at most 100 LM iterations per BA and at most 100 CG iterations per LM step, which guarantees good convergence for most problems.

5. Incremental Structure from Motion

This section presents the design of our SfM that practically has a linear run time. The fact that bundle adjustment can be done in O(n) time opens an opportunity of push-
[Figure 2: (a) time spent on a single CG iteration (Tcg in seconds vs. n, for Central Rome, Arts Quad and Loop); (b) number of LM iterations used by a BA (mean = 37, median = 29, 93% smaller than 100); (c) number of CG iterations used by a BA (mean = 881, median = 468, 90% smaller than 2000).]
Figure 2. Bundle adjustment statistics. (a) shows that Tcg is roughly linear in n regardless of the scene graph structure. (b) and (c) show the distributions of the number of LM and CG iterations used by a BA. It can be seen that BA typically converges within a small number of LM and CG iterations. Note that we consider a BA converged if the mean squared error drops below 0.25 or cannot be decreased.
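These statistics come from repeatedly solving the augmented system (J^T J + Λ)δ = -J^T f of Section 4. Below is a minimal dense sketch of one such Hessian-free solve with a Jacobi-preconditioned CG. It is only an illustration with names of our choosing; the paper's solver is the block-Jacobi, GPU-based multicore BA of [16], not this code:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def lm_solve(J, f, lam):
    """Solve (J^T J + lam*I) delta = -J^T f with Jacobi-preconditioned CG.

    The Hessian is never formed: matrix-vector products use J only, which
    is what makes each CG iteration O(n) for sparse SfM Jacobians."""
    m, n = J.shape
    H = LinearOperator((n, n), matvec=lambda v: J.T @ (J @ v) + lam * v)
    d = (J * J).sum(axis=0) + lam                       # diag of J^T J + lam*I
    M = LinearOperator((n, n), matvec=lambda v: v / d)  # Jacobi preconditioner
    delta, info = cg(H, -J.T @ f, M=M)
    assert info == 0, "CG did not converge"
    return delta

# toy problem: random dense Jacobian and residual vector
rng = np.random.default_rng(0)
J = rng.standard_normal((30, 8))
f = rng.standard_normal(30)
delta = lm_solve(J, f, lam=1e-3)
```

Here the preconditioner is plain Jacobi on a dense toy Jacobian; with sparse SfM Jacobians the same implicit products cost O(n) per CG iteration, which is the point of the analysis above.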
ing incremental reconstruction time closer to O(n). As illustrated in Figure 3, our algorithm adds a single image at each iteration, and then runs either a full BA or a partial BA. After BA and filtering, we take another choice between continuing to the next iteration and a re-triangulation step.

[Figure 3 diagram: Add Camera → Full BA or Partial BA → Filtering → continue, or Re-triangulation & Full BA]

Figure 3. A single iteration of the proposed incremental SfM, which is repeated until no images can be added to the reconstruction.

5.1. How Fast Is Incremental SfM?

The cameras and 3D points normally get stabilized quickly during reconstruction, so it is unnecessary to optimize all camera and point parameters at every iteration. A typical strategy is to perform full BAs after adding a constant number of cameras (e.g. α), for which the accumulated time of all the full BAs is

$$\sum_{i=1}^{n/\alpha} T_{BA}(i\,\alpha) = \sum_{i=1}^{n/\alpha} O(i\,\alpha) = O\!\left(\frac{n^2}{\alpha}\right), \qquad (2)$$

when using PCG. This is already a significant reduction from the O(n^4) time of Cholesky-factorization-based BA. As the models grow larger and more stable, it is affordable to skip more of the costly BAs, so we would like to study how much further we can go without losing accuracy.

In this paper, we find that the accumulated time of BAs can be reduced even further, to O(n), by using a geometric sequence for the full BAs. We propose to perform full optimizations only when the size of a model increases relatively by a certain ratio r (e.g. 5%), and the resulting time spent on full BAs approximately becomes

$$\sum_{i=0}^{\infty} T_{BA}\!\left(\frac{n}{(1+r)^i}\right) = \sum_{i=0}^{\infty} O\!\left(\frac{n}{(1+r)^i}\right) = O\!\left(\frac{n}{r}\right). \qquad (3)$$

Although the later-added cameras are optimized by fewer full BAs, there are normally no accuracy problems because the full BAs always improve more for the parts that have larger errors. As the model gets larger, more cameras are added before running a full BA. In order to reduce the accumulated errors, we keep running local optimizations by using partial BAs on a constant number of recently added cameras (we use 20) and their associated 3D points. Such partial optimizations involve O(1) camera and point parameters, so each partial BA takes O(1) time. Therefore, the time spent on full BAs and partial BAs adds up to O(n), which is experimentally validated by the time curves in Figure 6.

Following the BA step, we filter out the points that have large reprojection errors or small triangulation angles. The time complexity of a full point-filtering step is O(n). Fortunately, we only need to process the 3D points that have been changed, so the point filtering after a partial BA can be done in O(1) time. Although each point filtering after a full BA takes O(n) time, these add up to only O(n) due to the geometric sequence. Therefore, the accumulated time of all point filtering is also O(n).

Another expensive step is to organize the resection candidates and to add new cameras to the 3D models. We keep track of the potential 2D-3D correspondences during SfM by using the feature matches of newly added images. If each image is matched to O(1) images, which is a reasonable assumption for large-scale photo collections, it requires O(1) time to update the correspondence information at each iteration. Another O(1) time is needed to add the camera to the model. The accumulated time of these steps is again O(n).

We have shown that the major steps of incremental SfM contribute an O(n) time complexity. However, the above analysis ignores several things: 1) finding the portion of data for a partial BA and partial filtering; 2) finding the subset of images that match with a single image in the resection stage; 3) the comparison of cameras during the resection stage. These steps require O(n) scan time at each iteration and add up to O(n^2) time in theory. It is possible to keep track of these subsets for a further reduction of time complexity. However,
since the O(n) part dominates the reconstruction in our experiments (models up to 15K cameras; see Table 2 for details), we have not tried to further optimize these O(n^2) steps.
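The geometric schedule behind Eq. (3) can be sanity-checked with a tiny simulation. This sketch is ours, with arbitrary cost units: it charges each full BA its current model size, reflecting the linear-time BA of Section 4, and each partial BA a constant:

```python
def ba_cost(n_cameras, r=0.05, partial_cost=1.0):
    """Accumulated cost of the BA schedule: a full BA (cost proportional
    to the current model size) whenever the model has grown by ratio r
    since the last full BA, plus an O(1) partial BA per added camera."""
    total, last_full = 0.0, 1
    for size in range(1, n_cameras + 1):
        total += partial_cost                  # partial BA: O(1) per camera
        if size >= last_full * (1 + r):       # model grew by the ratio r
            total += size                      # full BA: O(size)
            last_full = size
    return total

# per-camera cost stays roughly constant as n grows, consistent with Eq. (3)
print(ba_cost(1000) / 1000, ba_cost(10000) / 10000)
```

The per-camera cost converges to roughly (1 + r)/r plus the constant partial-BA charge, matching the O(n/r) total of Eq. (3) plus O(n) for the partial BAs.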
5.2. Re-triangulation (RT)
Without preemptive matching:

| Dataset | Pairs to Match | Pairs With 15+ Inliers | Feature Matches | n |
|---|---|---|---|---|
| Central Rome | N/A | N/A | N/A | N/A |
| Arts Quad | 15402K | 192K | 32M | 5624 |
| Loop | 709K | 329K | 158M | 4342 |
| St. Peter's | 812K | 217K | 21M | 1267 |
| Colosseum | 677K | 54K | 6.8M | 1157 |

Using preemptive matching (h = 100):

| Dataset | th | Pairs to Match | Pairs With 15+ Inliers | Feature Matches | n |
|---|---|---|---|---|---|
| Central Rome | 4 | 13551K | 540K | 67M | 15065 |
| Arts Quad | 4 | 521K, 3% | 62K, 32% | 25M, 78% | 4272 |
| Arts Quad | 2 | 4308K, 28% | 121K, 63% | 29M, 91% | 5393 |
| Loop | 4 | 269K, 38% | 235K, 71% | 150M, 95% | 4342 |
| Loop | 8 | 151K, 21% | 150K, 46% | 135M, 85% | 4342 |
| St. Peter's | 4 | 46K, 6% | 38K, 18% | 9.1M, 43% | 1211 |
| St. Peter's | 8 | 220K, 27% | 100K, 46% | 14M, 67% | 1262 |
| Colosseum | 4 | 23K, 3% | 13K, 24% | 4.1M, 60% | 517+426 |
| Colosseum | 2 | 149K, 22% | 28K, 52% | 5.4M, 79% | 1071 |

Table 1. Reconstruction comparison for different feature matching; percentages are relative to the corresponding numbers without preemptive matching. We first try th = 4, and then try th = 8 if the result is complete, or th = 2 if the resulting reconstruction is incomplete. Comparably complete models are reconstructed when using preemptive matching, where only a small set of image pairs needs to be matched. All reconstructions use the same setting (r = 5% for full BA and r = 25% for RT).
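The preemptive matching evaluated in Table 1 can be sketched as a two-stage matcher: match only the top-h scale features of an image pair first, and run the full match only if at least th of those subset features survive. The sketch below is ours; a brute-force matcher with Lowe's ratio test [10] stands in for the paper's GPU SIFT matching:

```python
import numpy as np

def ratio_test_match(d1, d2, ratio=0.8):
    """Brute-force nearest-neighbor matching with Lowe's ratio test."""
    dist = np.linalg.norm(d1[:, None, :] - d2[None, :, :], axis=2)
    order = np.argsort(dist, axis=1)
    return [(i, order[i, 0]) for i in range(len(d1))
            if dist[i, order[i, 0]] < ratio * dist[i, order[i, 1]]]

def preemptive_match(d1, d2, h=100, th=4):
    """Match the top-h scale features first (descriptors assumed sorted
    by scale, descending); do the full match only if >= th matches."""
    if len(ratio_test_match(d1[:h], d2[:h])) < th:
        return []                       # pair rejected at subset cost only
    return ratio_test_match(d1, d2)

# identical feature sets should survive the subset test and match fully
rng = np.random.default_rng(1)
desc = rng.standard_normal((10, 4))
print(len(preemptive_match(desc, desc, h=5, th=2)))  # 10
```

The saving comes from the rejected pairs: for them only h descriptors per image are ever compared, instead of the full feature sets.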
Table 2. Reconstruction summary and timing (in seconds). Only the reconstruction of the largest model is reported for each dataset. The reported time of full BA includes the BAs after the RT steps. The "adding" part includes the time spent on updating resection candidates, estimating camera poses, and adding new cameras and points to a 3D model.
(a) Sparse reconstruction of Central Rome (15065 cameras) (b) Overlay on the aerial image
Figure 5. Our Rome reconstruction (the blue dots are the camera positions). The recovered structure is accurate compared to the map.
minutes, and the 15K camera model of Rome is computed within 1.67 hours. Additional results under various settings can be found in Table 3 and the supplemental material.

The timing in Table 2 shows that BAs (full + partial) account for the majority of the reconstruction time, so the time complexity of the BAs approximates that of the incremental SfM. We keep track of the time of reconstruction and the accumulated time of each step. As shown in Figure 6, the reconstruction time (as well as the accumulated time of full BAs and partial BAs) increases roughly linearly as the models grow larger, which validates our analysis of the time complexity. The Loop reconstruction has a higher slope mainly due to its higher number of feature matches.

6.3. Reconstruction Quality and Speed

Figures 4, 5 and 7 demonstrate high-quality reconstructions that are comparable to other methods, and we have also reconstructed more cameras than DISCO [4] for both
[Figure 6: time in seconds vs. n (number of cameras, 0 to 14000) for Central Rome, Arts Quad, Loop, St. Peter's Basilica and Colosseum: (a) reconstruction time; (b) accumulated time of full BA; (c) accumulated time of partial BA.]
Figure 6. The reconstruction time in seconds as the 3D models grow larger. Here we report the timing right after each full BA. The
reconstruction time and the accumulated time of full BAs and partial BAs increase roughly linearly with respect to the number of cameras.
Note the higher slope of the Loop reconstruction is due to the higher number of feature matches between nearby video frames.
(a) Arts Quad (5624 cameras) (b) St. Peter’s Basilica (1267 cameras) (c) Colosseum (1157 cameras)
Figure 7. Our reconstruction of Arts Quad, St. Peter’s Basilica and Colosseum.
Central Rome and Arts Quad. In particular, Figure 5 shows the correct overlay of our 15065-camera model of Central Rome on an aerial image. The Loop reconstruction shows that the RT steps correctly deal with the drifting problem without explicit loop closing. The reconstruction is first pushed away from local minima by RT and then improved by the subsequent BAs. The robustness and accuracy of our method are due to the strategy of mixed BA and RT.

We evaluate the accuracy of the reconstructed cameras by comparing their positions to the ground-truth locations. For the Arts Quad dataset, our reconstruction (r = 5% and r = 25%) contains 261 of the 348 images whose ground-truth GPS locations are provided by [4]. We used RANSAC to estimate a 3D similarity transformation between the 261 camera locations and their Euclidean coordinates. With the best found transformation, our 3D model has a mean error of 2.5 meters and a median error of 0.89 meters, the latter smaller than the 1.16-meter error reported by [4].

We evaluate the reconstruction speed by two measures:
• t/n: the time needed to reconstruct a camera.
• t/q: the time needed to reconstruct an observation.

Table 3 shows the comparison between our reconstructions under various settings and the DISCO and bundler reconstructions of [4]. While producing comparably large models, our algorithm normally requires less than half a second to reconstruct a camera in terms of overall speed. By t/n, our method is 8-19X faster than DISCO and 55-163X faster than bundler. Similarly, by t/q, our method is 5-11X faster than DISCO and 56-186X faster than bundler. It is worth pointing out that our system uses only a single PC (12GB RAM) while DISCO uses a 200-core cluster.

6.4. Discussions

Although our experiments show approximately linear running times, the reconstruction takes O(n^2) time in theory and will exhibit such a trend as the problems get even larger. In addition, it is possible that the proposed strategy will fail for extremely large reconstruction problems due to larger accumulated errors. Thanks to the stability of SfM and the mixed BAs and RT, our algorithm works without quality problems even for 15K cameras.

This paper focuses on the incremental SfM stage, while the bottleneck of 3D reconstruction is sometimes the image matching. In order to test our incremental SfM algorithm on the largest possible models, we have matched up to O(n^2) image pairs for several datasets, which is more than the vocabulary-tree strategy that chooses O(n) pairs to match [2]. However, this does not prevent our method from working with fewer image matches. From a different angle, we have contributed the preemptive matching that can significantly reduce the matching cost.
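The camera-accuracy evaluation in Section 6.3 aligns the reconstructed camera positions to GPS ground truth with a RANSAC-estimated 3D similarity transform. A minimal sketch of that procedure (ours, using Umeyama's closed-form similarity fit; the threshold and the synthetic data are illustrative):

```python
import numpy as np

def fit_similarity(src, dst):
    """Closed-form similarity (scale s, rotation R, translation t) mapping
    src to dst in least squares (Umeyama's method), from >= 3 point pairs."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(xd.T @ xs)
    d = np.array([1.0, 1.0, np.sign(np.linalg.det(U) * np.linalg.det(Vt))])
    R = (U * d) @ Vt                        # det(R) = +1, no reflections
    s = (S * d).sum() / (xs ** 2).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def ransac_similarity(src, dst, iters=500, tol=5.0, seed=0):
    """RANSAC over minimal 3-point samples; keeps the fit with most inliers."""
    rng = np.random.default_rng(seed)
    best, best_inl = None, -1
    for _ in range(iters):
        i = rng.choice(len(src), size=3, replace=False)
        s, R, t = fit_similarity(src[i], dst[i])
        err = np.linalg.norm(src @ (s * R).T + t - dst, axis=1)
        n_inl = int((err < tol).sum())
        if n_inl > best_inl:
            best, best_inl = (s, R, t), n_inl
    return best, best_inl

# synthetic check: 50 cameras, a known similarity, 5 gross GPS outliers
rng = np.random.default_rng(2)
src = rng.standard_normal((50, 3))
Q = np.linalg.qr(rng.standard_normal((3, 3)))[0]
if np.linalg.det(Q) < 0:
    Q[:, 0] = -Q[:, 0]
dst = 2.0 * src @ Q.T + np.array([1.0, -2.0, 3.0])
dst[:5] += 100.0 * rng.standard_normal((5, 3))
(s, R, t), n_inl = ransac_similarity(src, dst)
```

With noiseless inliers the minimal 3-point samples recover the transform exactly, so the outliers are cleanly separated by the inlier threshold.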
| Dataset | Full BA | Partial BA | RT | n | q | t | t/n | t/q |
|---|---|---|---|---|---|---|---|---|
| Central Rome | r = 5% | Every Image | r = 25% | 15065 | 12903K | 1.67 hour | 0.40s | 0.47ms |
| Central Rome | r = 25% | Every Image | r = 50% | 15113 | 12958K | 1.32 hour | 0.31s | 0.37ms |
| Central Rome | r = 5% | Every 3 Images | r = 25% | 14998 | 12599K | 1.03 hour | 0.25s | 0.29ms |
| Central Rome | DISCO result of [4] | | | 14754 | 21544K | 13.2 hour | 3.2s | 2.2ms |
| Central Rome | Bundler result of [4] | | | 13455 | 5411K | 82 hour | 22s | 54ms |
| Arts Quad | r = 5% | Every Image | r = 25% | 5624 | 5839K | 0.59 hour | 0.38s | 0.37ms |
| Arts Quad | r = 25% | Every Image | r = 50% | 5598 | 5850K | 0.42 hour | 0.27s | 0.26ms |
| Arts Quad | r = 5% | Every 3 Images | r = 25% | 5461 | 5530K | 0.53 hour | 0.35s | 0.35ms |
| Arts Quad | DISCO result of [4] | | | 5233 | 9387K | 7.7 hour | 5.2s | 2.9ms |
| Arts Quad | Bundler result of [4] | | | 5028 | 10521K | 62 hour | 44s | 21ms |
| Loop | r = 5% | Every Image | r = 25% | 4342 | 7196K | 3251s | 0.75s | 0.45ms |
| Loop | r = 25% | Every Image | r = 50% | 4342 | 7574K | 1985s | 0.46s | 0.26ms |
| Loop | r = 5% | Every 3 Images | r = 25% | 4341 | 7696K | 3207s | 0.74s | 0.41ms |
| St. Peter's | r = 5% | Every Image | r = 25% | 1267 | 2706K | 583s | 0.46s | 0.22ms |
| St. Peter's | r = 25% | Every Image | r = 50% | 1267 | 2760K | 453s | 0.36s | 0.16ms |
| St. Peter's | r = 5% | Every 3 Images | r = 25% | 1262 | 2668K | 367s | 0.29s | 0.14ms |
| Colosseum | r = 5% | Every Image | r = 25% | 1157 | 1759K | 591s | 0.51s | 0.34ms |
| Colosseum | r = 25% | Every Image | r = 50% | 1087 | 1709K | 205s | 0.19s | 0.12ms |
| Colosseum | r = 5% | Every 3 Images | r = 25% | 1091 | 1675K | 471s | 0.43s | 0.28ms |
Table 3. The statistics of our reconstructions under various settings. We reconstruct larger models than DISCO and bundler under the
various settings, and our method also runs significantly faster. Additional results will be presented in the supplemental material.
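The 8-19X and 55-163X t/n speedups quoted in Section 6.3 follow directly from the t/n column of Table 3; a quick arithmetic check (values copied from the table; the grouping is ours):

```python
# t/n values (seconds per camera) copied from Table 3
ours    = {"Central Rome": [0.40, 0.31, 0.25], "Arts Quad": [0.38, 0.27, 0.35]}
disco   = {"Central Rome": 3.2,  "Arts Quad": 5.2}
bundler = {"Central Rome": 22.0, "Arts Quad": 44.0}

disco_speedups   = [disco[d] / v for d in ours for v in ours[d]]
bundler_speedups = [bundler[d] / v for d in ours for v in ours[d]]
print(round(min(disco_speedups), 1), "-", round(max(disco_speedups), 1))      # 8.0 - 19.3
print(round(min(bundler_speedups), 1), "-", round(max(bundler_speedups), 1))  # 55.0 - 163.0
```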
7. Conclusions and Future Work

This paper revisits and improves the classic incremental SfM algorithm. We propose a preemptive matching method that can significantly reduce the feature matching cost for large-scale SfM. Through algorithmic analysis and extensive experimental validation, we show that incremental SfM is of O(n^2) time complexity, but requires only O(n) time for its major steps while still maintaining the reconstruction accuracy by using mixed BAs and RT. The practical run time is approximately O(n) for large problems of up to 15K cameras. Our system demonstrates state-of-the-art performance with an average speed of reconstructing about 2 cameras and more than 2000 feature points per second for very large photo collections.

In the future, we wish to explore a more adaptive preemptive feature matching, and to guide the full BAs according to the accumulation of reprojection errors.

References

[1] S. Agarwal, N. Snavely, S. Seitz, and R. Szeliski. Bundle adjustment in the large. In ECCV, pages II: 29-42, 2010.
[2] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski. Building Rome in a day. In ICCV, 2009.
[3] M. Byrod and K. Astrom. Conjugate gradient bundle adjustment. In ECCV, pages II: 114-127, 2010.
[4] D. Crandall, A. Owens, N. Snavely, and D. P. Huttenlocher. Discrete-continuous optimization for large-scale structure from motion. In CVPR, 2011.
[5] J. Frahm, P. Fite Georgel, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y. Jen, E. Dunn, B. Clipp, S. Lazebnik, and M. Pollefeys. Building Rome on a cloudless day. In ECCV, pages IV: 368-381, 2010.
[6] R. Gherardi, M. Farenzena, and A. Fusiello. Improving the efficiency of hierarchical structure-and-motion. In CVPR, pages 1594-1600, 2010.
[7] A. Kushal, B. Self, Y. Furukawa, C. Hernandez, D. Gallup, B. Curless, and S. Seitz. Photo tours. In 3DimPVT, 2012.
[8] X. Li, C. Wu, C. Zach, S. Lazebnik, and J. Frahm. Modeling and recognition of landmark image collections using iconic scene graphs. In ECCV, 2008.
[9] M. A. Lourakis and A. Argyros. SBA: A software package for generic sparse bundle adjustment. ACM Trans. Math. Software, 36(1):1-30, 2009.
[10] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60:91-110, 2004.
[11] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, pages 2161-2168, 2006.
[12] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994.
[13] S. N. Sinha, D. Steedly, and R. Szeliski. A multi-stage linear approach to structure from motion. In ECCV RMLE workshop, 2010.
[14] N. Snavely, S. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. In SIGGRAPH, pages 835-846, 2006.
[15] N. Snavely, S. M. Seitz, and R. Szeliski. Skeletal graphs for efficient structure from motion. In CVPR, 2008.
[16] C. Wu, S. Agarwal, B. Curless, and S. M. Seitz. Multicore bundle adjustment. In CVPR, 2011.