Multi-View Stereo: A Tutorial
Yasutaka Furukawa
Washington University in St. Louis
[email protected]
Carlos Hernández
Google Inc.
[email protected]
Boston — Delft
Foundations and Trends® in Computer Graphics and Vision
The preferred citation for this publication is
Y. Furukawa and C. Hernández. Multi-View Stereo: A Tutorial. Foundations and Trends® in Computer Graphics and Vision, vol. 9, no. 1-2, pp. 1–148, 2013.
This Foundations and Trends® issue was typeset in LaTeX using a class file designed by Neal Parikh. Printed on acid-free paper.
ISBN: 978-1-60198-836-2
© 2015 Y. Furukawa and C. Hernández
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, mechanical, photocopying, recording
or otherwise, without prior written permission of the publishers.
Foundations and Trends® in Computer Graphics and Vision, Volume 9, Issue 1-2, 2013
Editorial Board
Editors-in-Chief
William T. Freeman
Massachusetts Institute of Technology
United States
Contents
1 Introduction 3
1.1 Imagery collection . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Camera projection models . . . . . . . . . . . . . . . . . . 8
1.3 Structure from Motion . . . . . . . . . . . . . . . . . . . 10
1.4 Bundle Adjustment . . . . . . . . . . . . . . . . . . . . . 13
1.5 Multi-View Stereo . . . . . . . . . . . . . . . . . . . . . . 14
2 Multi-view Photo-consistency 17
2.1 Photo-consistency measures . . . . . . . . . . . . . . . . . 18
2.2 Visibility estimation in state-of-the-art algorithms . . . . . 32
Acknowledgements 135
References 137
The goal of multi-view stereo is to estimate the most likely 3D shape that explains the input photographs, under the assumption of known materials, viewpoints, and lighting conditions (see Figure 1.1). This definition highlights the difficulty of the task,
namely the assumption that materials, viewpoints, and lighting are
known. If these are not known, the problem is generally ill-posed since
multiple combinations of geometry, materials, viewpoints, and lighting
can produce exactly the same photographs. As a result, without fur-
ther assumptions, no single algorithm can correctly reconstruct the 3D
geometry from photographs alone. However, under a set of reasonable
extra assumptions, e.g. rigid Lambertian textured surfaces, state-of-
the-art techniques can produce highly detailed reconstructions even
from millions of photographs.
There exist many cues that can be used to extract geometry from
photographs: texture, defocus, shading, contours, and stereo correspon-
dence. The latter three have been very successful, with stereo corre-
spondence being the most successful in terms of robustness and the
number of applications. Multi-view stereo (MVS) is the general term
given to a group of techniques that use stereo correspondence as their
main cue and use more than two images [165, 176].
All the MVS algorithms described in the following chapters assume
the same input: a set of images and their corresponding camera param-
eters. This chapter gives an overview of an MVS pipeline, starting from imagery collection, followed by camera parameter estimation, and finally 3D geometry reconstruction.
Figure 1.3: Different MVS capture setups. From left to right: a controlled MVS
capture using diffuse lights and a turn table, outdoor capture of small-scale scenes,
and crowd-sourcing from online photo-sharing websites.
In this chapter we will give more insight into the first three main
stages of MVS: imagery collection, camera parameters estimation, and
3D geometry reconstruction. Chapter 2 develops the notion of photo-
consistency as the main signal being optimized by MVS algorithms.
Chapter 3 presents and compares some of the most successful MVS al-
gorithms. Chapter 4 discusses the use of domain knowledge, in particu-
lar, structural priors in improving the reconstruction quality. Chapter 5
gives an overview of successful applications, available software, and best
practices. Finally Chapter 6 describes some of the current limitations
of MVS as well as research directions to solve them.
One can roughly classify MVS capture setups into three categories (see Figure 1.3): a controlled laboratory setting, outdoor capture of small-scale scenes, and crowd-sourced imagery from online photo-sharing websites. Early MVS research focused on the laboratory setting, where cameras could be easily calibrated, e.g. with a robotic arm [165], a rotation table [93], fiducial markers [2, 43, 192], or early SfM algorithms [62]. MVS
algorithms went through two major developments that took them to
their current state: they first moved from the laboratory setting to small-scale outdoor scenes [174, 102, 85, 169, 190], e.g. a building facade or a fountain, and then scaled up to much larger scenes, e.g. entire buildings and cities [129, 153, 97, 69].
These major changes were not solely due to the developments in the
MVS field itself. It was a combination of new hardware to capture bet-
ter images, more computation power, and scalable camera estimation
algorithms.
Improvements in hardware: Two areas of hardware improvements
had the most impact on MVS: digital cameras and computation power.
Digital photography became mainstream and digital image sensors constantly improved in terms of resolution and quality. Additionally, mass production and miniaturization of Global Positioning System (GPS) receivers made them ubiquitous in digital cameras, tablets, and mobile phones. Although the precision of commercial units is not enough for MVS purposes, it does provide an initial estimate of the camera parameters that can be refined using Computer Vision techniques. The second signifi-
cant hardware improvement was computation power. The rise of inex-
pensive computer clusters [5] or GPU general computation [6] enabled
SfM algorithms [25, 64] and MVS algorithms [69] to easily handle tens
of thousands of images.
Improvements in Structure-from-Motion algorithms: Re-
searchers have been working on visual reconstruction algorithms for
decades [183, 182]. However, only relatively recently have these tech-
niques matured enough to be used in large-scale industrial applications.
Nowadays industrial algorithms are able to estimate camera parameters
for millions of images. Two slightly different techniques have made great
progress in recent years: Structure-from-Motion (SfM) [88] and Visual
Simultaneous Localization and Mapping (VSLAM) [53]. Both rely on
the correspondence cue and the assumption that the scene is rigid. SfM
is most commonly used to compute camera models of unordered sets
of images, usually offline, while VSLAM specializes in computing the pose of a single camera from an ordered image sequence, e.g. a video, typically in real time.
Figure 1.4: Common deviations from pinhole camera model. Left: a fish eye lens
exhibiting large radial distortion (top) and a rectified version of the same image
after removing radial distortion (bottom). Right: rolling shutter artifacts caused by
a fast moving object in the scene [155].
A pinhole camera can be described by a 3 × 4 projection matrix K[R|T], where K contains the intrinsic parameters, R is the rotation of the camera and T is the translation of the camera. Note
that, due to the quality of digital sensors, one rarely estimates the 11
parameters of the projection matrix. In particular, pixels are assumed
to have no skew (s = 0), and be square (fx = fy ). Also, if an image
has not been cropped, it is safe to assume the principal point is at the
center of the image. As a result, a common pinhole camera model is
just composed of 7 parameters: the focal length f , the rotation matrix
R and the translation vector T .
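To make the model concrete, the following Python sketch projects 3D points with this simplified 7-parameter pinhole camera. The function name and conventions (row-major intrinsics, points as rows) are our own illustrative choices, and the principal point is assumed to sit exactly at the image center, as discussed above.

import numpy as np

def project_pinhole(X, f, R, T, image_size):
    """Project 3D points X (Nx3, world coordinates) with a simplified
    7-parameter pinhole model: focal length f, rotation R (3x3),
    translation T (3,).  Pixels are assumed square with zero skew and the
    principal point is assumed to lie at the image center."""
    cx, cy = image_size[0] / 2.0, image_size[1] / 2.0
    K = np.array([[f, 0.0, cx],
                  [0.0, f, cy],
                  [0.0, 0.0, 1.0]])
    # Transform to camera coordinates and apply the perspective division.
    Xc = (R @ X.T + T.reshape(3, 1))      # 3xN camera-frame points
    uvw = K @ Xc                          # 3xN homogeneous pixel coordinates
    return (uvw[:2] / uvw[2]).T           # Nx2 pixel coordinates

# Example: a point one meter in front of a camera placed at the origin
# projects to the image center.
pts = np.array([[0.0, 0.0, 1.0]])
print(project_pinhole(pts, f=1000.0, R=np.eye(3), T=np.zeros(3),
                      image_size=(1920, 1080)))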
If the attached lens is low quality, or wide-angle (See Figure 1.4 left),
the pure pinhole model is not enough and is often extended with a radial
distortion model. Radial distortion is particularly important for high-
resolution photographs, where small deviations from the pure pinhole
model can amount to multiple pixels near the image boundaries.
Radial distortion can typically be removed from the photographs
before they enter the MVS pipeline. If the radial distortion parameters
of an image have been estimated, one can undistort the image by resam-
pling as if it had been taken with an ideal lens without distortion (See
Figure 1.4 bottom left). Undistorting the images simplifies the MVS
algorithm and often leads to faster processing times. Some cameras,
e.g. those in mobile phones, incorporate dedicated hardware to remove
radial distortion during the processing of the image just after its cap-
ture. Note however that rectifying wide-angle images will introduce
resampling artifacts as well as field of view cropping. To avoid these is-
sues MVS pipelines can support radial distortion and more complicated
camera models directly, at the expense of extra complexity.
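As a rough illustration of the resampling step, the sketch below undistorts an image under an assumed two-coefficient polynomial radial model with known coefficients k1 and k2; real pipelines use calibrated distortion models and better interpolation than the nearest-neighbour lookup used here.

import numpy as np

def undistort_image(img, f, cx, cy, k1, k2):
    """Resample an image as if taken by an ideal (distortion-free) lens.
    Assumed model: a point at normalized radius r in the ideal image maps
    to radius r * (1 + k1*r^2 + k2*r^4) in the distorted image."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Normalized coordinates of the undistorted output grid.
    xn = (xs - cx) / f
    yn = (ys - cy) / f
    r2 = xn**2 + yn**2
    scale = 1.0 + k1 * r2 + k2 * r2**2
    # Where each output pixel must be fetched from in the distorted input.
    xd = np.clip(np.round(xn * scale * f + cx).astype(int), 0, w - 1)
    yd = np.clip(np.round(yn * scale * f + cy).astype(int), 0, h - 1)
    return img[yd, xd]

# Example with a synthetic image.
img = (np.indices((480, 640)).sum(axis=0) % 256).astype(np.uint8)
out = undistort_image(img, f=500.0, cx=320.0, cy=240.0, k1=-0.2, k2=0.05)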
Finally, rolling shutter is another source of complexity particularly
important for video processing applications (See Figure 1.4 right). A
digital sensor with an electronic rolling shutter exposes each row of an
image at slightly different times. This is in contrast to global shutters
where the whole image is exposed at the same time. A rolling shut-
ter often provides higher sensor throughput at the expense of a more
complicated camera model. As a result, if the camera or the scene is moving while the image is being captured, each row of the image effectively captures a slightly different scene. If the camera or scene motion is slow
w.r.t. the shutter speed, rolling shutter effects can be small enough to
be ignored. Otherwise the camera projection model needs to incorpo-
rate the effects [63].
Figure 1.5: Main stages of a generic SfM pipeline, clockwise: feature detection,
feature matching, track generation, structure-from-motion and bundle adjustment.
Figure 1.6: Large scale SfM examples from [25]. Left: SfM model of the city of
Dubrovnik. Right: SfM model of San Marco Square in Venice.
Figure 1.7: Matching images with known camera parameters. Left: The 3D geom-
etry of the scene defines a correspondence between pixels in different images. Right:
when camera parameters are known, matching a pixel in one image with pixels in
another image is a 1D search problem.
When the camera parameters are unknown, a pixel in one image can potentially match any pixel in the other image. That is, for each pixel one has to do a 2D search in the other
image. However, when the camera parameters are known (and the scene
is rigid), the image matching problem is simplified from a 2D search
to a 1D search (See Figure 1.7 right). A pixel in an image generates a
3D optic ray that passes through the pixel and the camera center of
the image. The corresponding pixel on another image can only lie on
the projection of that optic ray into the second image. The different
geometric constraints that originate when multiple cameras look at
the same 3D scene from different viewpoints are known as epipolar
geometry [88].
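The following sketch makes the 1D search explicit: given a pixel in one image and the two projection matrices, it projects two points of the corresponding optic ray into the second image, so the candidate matches lie on the segment between those two projections. The helper name and the near/far ray distances are illustrative assumptions.

import numpy as np

def epipolar_segment(x1, P1, P2, distances=(0.5, 100.0)):
    """Given a pixel x1 in image 1 and 3x4 projection matrices P1, P2,
    return the projections into image 2 of two points on the optic ray
    through x1, at the given (near, far) distances along the ray.  The
    true match of x1 must lie on the line through these two projections,
    which is what reduces matching to a 1D search."""
    K_R = P1[:, :3]                       # K*R part of P1
    t = P1[:, 3]
    center = -np.linalg.inv(K_R) @ t      # camera center of image 1
    ray_dir = np.linalg.inv(K_R) @ np.array([x1[0], x1[1], 1.0])
    ray_dir /= np.linalg.norm(ray_dir)
    pts2 = []
    for d in distances:
        Xw = center + d * ray_dir         # 3D point on the optic ray
        u = P2 @ np.append(Xw, 1.0)       # project into image 2
        pts2.append(u[:2] / u[2])
    return np.array(pts2)

# Example with two axis-aligned cameras one unit apart.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
print(epipolar_segment((400.0, 260.0), P1, P2))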
As for measures of how likely a candidate match is, there is a vast literature on how to build so-called photo-consistency measures
that estimate the likelihood of two pixels (or groups of pixels) being
in correspondence. Photo-consistency measures in the context of MVS
are presented in more detail in Chapter 2.
2 Multi-view Photo-consistency
Given a set of N input images and a 3D point p seen by all the images,
one can define the photo-consistency of p w.r.t. each pair of images Ii
and Ij as:
Cij (p) = ρ(Ii (Ω(πi (p))), Ij (Ω(πj (p)))), (2.1)
where ρ(f, g) is a similarity measure that compares two vectors, πi (p)
denotes the projection of p into image i, Ω(x) defines a support domain
around point x, and Ii (x) denotes the image intensities sampled within
the domain. Every photo-consistency measure can be described as a
particular choice of ρ and Ω.
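A minimal sketch of Eq. (2.1) follows, with a square fronto-parallel window as the support domain Ω and negative SSD as the similarity measure ρ. Boundary handling and sub-pixel sampling are omitted, and all names are illustrative.

import numpy as np

def pairwise_photo_consistency(p, img_i, P_i, img_j, P_j, half=1):
    """Sketch of Eq. (2.1): project the 3D point p into the two images,
    sample a (2*half+1)^2 window Omega around each projection, and compare
    the two intensity vectors with a similarity measure rho (here negative
    SSD, so larger means more photo-consistent)."""
    def sample(img, P):
        u = P @ np.append(p, 1.0)
        x, y = int(round(u[0] / u[2])), int(round(u[1] / u[2]))
        patch = img[y - half:y + half + 1, x - half:x + half + 1]
        return patch.astype(float).ravel()
    f, g = sample(img_i, P_i), sample(img_j, P_j)
    return -np.sum((f - g) ** 2)   # rho = negative SSD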
Some photo-consistency measures do not need the support domain
Ω to be defined while others do (see Table 2.1). The main purpose
of the support domain Ω is to define the size of a region where the
appearance of the scene is expected to be unique and somewhat invari-
ant to illumination and viewpoint changes. Note that uniqueness and
invariance are often two competing properties of a photo-consistency
measure. The larger the domain Ω is, the more unique the local ap-
pearance inside the domain is, which makes it easier to match to other
images. At the same time, the larger the domain is, the harder it is to remain invariant to viewpoint and illumination changes.
Table 2.1: Summary table of different similarity measures used to compute photo-
consistency.
and define the photo-consistency w.r.t. all the images by averaging the individual photo-consistencies:

C(p) = \frac{1}{N} \sum_i C_i(p) = \frac{1}{N(N-1)} \sum_i \sum_{j \neq i} C_{ij}(p).    (2.3)
Given color images, different strategies exist to deal with the dif-
ferent channels:
• Convert the color image into gray scale before computing the
photo-consistency.
• Concatenate the vectors from all the color channels into a single
larger vector.
Figure 2.3: Different photo-consistency measures computed along the epipolar line
for the textured example of Figure 2.2 top. The correct depth is roughly at 0.5 depth
units. All the measures use a 3×3 domain Ω except for MI, which uses a 9×9 domain.
top. Figure 2.4 shows the photo-consistency plots for the untextured
region of Figure 2.2 bottom.
Figure 2.4: Different photo-consistency measures computed for the textureless ex-
ample of Figure 2.2 bottom. All the measures use a 3×3 domain Ω except for MI, which uses a 9×9 domain.
are often subtle shading and shadowing effects on its surface. We want NCC to capture the subtle spatial intensity variations in each color channel, which are much smaller than the intensity variation across color channels. With the simple concatenation, NCC amounts to only cap-
turing the intensity variation across color channels. A better solution
is to compute NCC per color channel independently and return the
average NCC score. A more sophisticated approach is to compute and
subtract the average intensity per color channel independently (f¯ and
ḡ), but concatenate all the color channels together as a single vector
when computing its variance (σf and σg ). This allows NCC to capture
spatial intensity variations in each color channel, while down weighting
color channels with smaller intensity variations.
Note that SSD does not require a support domain Ω to be defined, but
it easily generalizes to use one.
The use of the L2 norm makes SSD sensitive to outliers, e.g. visi-
bility outliers or bias and gain perturbations. A normalized variant of
SSD exists that helps mitigate some of these issues:
\rho_{NSSD}(f, g) = \left\| \frac{f - \bar{f}}{\sigma_f} - \frac{g - \bar{g}}{\sigma_g} \right\|^2    (2.4)

= \left\| \frac{f - \bar{f}}{\sigma_f} \right\|^2 + \left\| \frac{g - \bar{g}}{\sigma_g} \right\|^2 - 2\, \frac{f - \bar{f}}{\sigma_f} \cdot \frac{g - \bar{g}}{\sigma_g}    (2.5)

= 2 \left( 1 - \frac{(f - \bar{f}) \cdot (g - \bar{g})}{\sigma_f\, \sigma_g} \right)    (2.6)

= 2\, (1 - NCC(f, g)).    (2.7)
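The identity above is easy to verify numerically. In the sketch below, σ is taken to be the norm of the mean-subtracted vector (one common convention), under which ρ_NSSD(f, g) equals 2(1 − NCC(f, g)) exactly.

import numpy as np

def ncc(f, g, eps=1e-9):
    """Normalized cross-correlation between two intensity vectors."""
    f = f - f.mean()
    g = g - g.mean()
    return float(f @ g / (np.linalg.norm(f) * np.linalg.norm(g) + eps))

def nssd(f, g, eps=1e-9):
    """Normalized SSD, with sigma taken as the norm of the mean-subtracted
    vector so that rho_NSSD = 2 * (1 - NCC) holds exactly."""
    fn = (f - f.mean()) / (np.linalg.norm(f - f.mean()) + eps)
    gn = (g - g.mean()) / (np.linalg.norm(g - g.mean()) + eps)
    return float(np.sum((fn - gn) ** 2))

# Numerical check of Eqs. (2.4)-(2.7): both printed values agree.
rng = np.random.default_rng(0)
f, g = rng.random(25), rng.random(25)
print(nssd(f, g), 2 * (1 - ncc(f, g)))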
The sum of absolute differences (SAD) is very similar to SSD, but uses
an L1 norm instead of an L2 norm, which makes it more robust to
outliers
\rho_{SAD}(f, g) = \| f - g \|_1.
Similarly to SSD it is sensitive to bias and gain, so it is rarely used in
algorithms that match images with a wide variability in illumination.
It is however a very good measure for applications that can guaran-
tee similar capture conditions for the different images, e.g., real-time
applications or mobile applications.
2.1.4 Census
2.1.5 Rank
Rank was proposed at the same time as Census [206], and shares some
of its characteristics. Like Census, it is invariant to changes in bias and
gain and requires an explicit support domain to be computed. Unlike
Census, it is also invariant to rotation. Instead of summarizing the
values in the domain as a bit string, Rank summarizes the domain
by computing the percentile of the reference pixel w.r.t. all the other
values in the domain
\mathrm{rank}(f) = \sum_{q \in \Omega} \xi(f(p), f(q)).
where P (x, y) is the joint probability of X and Y , and P (x) and P (y)
are the marginals. In the context of image similarity, the mutual infor-
mation between two images (or image regions) measures how similar
the two images are, i.e., how well one image predicts the other image.
The photo-consistency measure is defined as
\rho_{MI}(f, g) = -MI(f, g).
The joint probability is estimated using a Parzen window method [151]
P(x, y) = \frac{1}{|\Omega|} \sum_{q \in \Omega} K\big(f(q) - x,\; g(q) - y\big),
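As a rough stand-in for the Parzen-window estimate, the sketch below estimates the joint density with a 2D histogram and computes MI from it; the bin count is an arbitrary assumption, and real implementations use smoother kernel estimates.

import numpy as np

def mutual_information(f, g, bins=16):
    """Histogram-based estimate of the mutual information between two
    intensity vectors (a coarse substitute for the Parzen-window estimate
    of the joint density described in the text)."""
    joint, _, _ = np.histogram2d(f, g, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# rho_MI = -MI(f, g): more negative means one patch predicts the other better.
rng = np.random.default_rng(1)
f = rng.random(81)
print(-mutual_information(f, f), -mutual_information(f, rng.random(81)))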
\rho_{SAD}(f^{\min}, f^{\max}, g^{\min}, g^{\max}) = \max\big(0,\; g^{\min} - f^{\max},\; f^{\min} - g^{\max}\big).
Photo-consistency normalization
The photo-consistency values computed by the measures described
above are rarely used "as-is" in the later stages of MVS. Instead they
are usually transformed through a non-linear operation with two goals:
i) normalize different photo-consistency values to the same range, ii)
transform the original photo-consistency into something closer to "like-
lihood of geometry". Normalization is important for parameter tuning as
well as combining photo-consistency measures together, or with other
cues. Typical transforms include the “exponential”, “linear truncated”
and “smooth step” functions, see Figure 2.5. The normalization func-
tion also serves as a “black box” that transforms the particular photo-
consistency score into a geometry likelihood measure, i.e., it decides
the range of scores where there is likely to be 3D geometry.
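The sketch below shows one possible parameterization of the three transforms mentioned above; the particular constants (σ, c_max, and the smooth-step interval) are illustrative assumptions that would normally be tuned per measure and per dataset.

import numpy as np

def exponential_norm(c, sigma=0.5):
    """Exponential normalization: maps a matching cost c >= 0 into (0, 1]."""
    return np.exp(-c / sigma)

def linear_truncated_norm(c, c_max=1.0):
    """Linearly decreasing likelihood, truncated to 0 above a cost c_max."""
    return np.clip(1.0 - c / c_max, 0.0, 1.0)

def smooth_step_norm(s, lo=0.4, hi=0.8):
    """Smooth-step normalization of a similarity score s (e.g. NCC):
    0 below lo, 1 above hi, with a smooth cubic ramp in between."""
    t = np.clip((s - lo) / (hi - lo), 0.0, 1.0)
    return t * t * (3.0 - 2.0 * t)

print(exponential_norm(0.2), linear_truncated_norm(0.2), smooth_step_norm(0.7))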
Figure 2.6: Cost volume filtering, courtesy of [105]. Anisotropic filters such as
the bilateral filter [181] or guided filter [90] provide significantly better results than
simple approaches around color boundaries. (a) A close-up view of the green line
in the input image. (b) Slice of the cost volume (white/black/red: high/low/lowest
costs) for line in (a). (c-e) Cost slice smoothed along x and y-axes (y is not shown
here) with box filter, bilateral filter [181] and guided filter [90], respectively. (f)
Ground truth labeling.
computing geometry. In the extreme case, one can see the geometry
and its attached confidence as a sparse definition of photo-consistency.
Processing time, image variability, and surface coverage are three important variables in the design of the photo-consistency measure, and their relative importance depends on the application, from real-time MVS to cloud processing of controlled aerial sequences. Some measures such as SAD
or SSD are extremely fast to compute and easily adapted to specialized
hardware such as GPUs [202, 101, 76, 142, 145]. Others are particularly
effective in the presence of bias and gain changes across the images, e.g.,
NCC and Census. Finally, the more images see the same piece of a sur-
face, the more strict the normalization of the photo-consistency can be
(See Section 2.1.7). Two photo-consistency measures that are particu-
larly popular are NCC and SAD. NCC is mainly used to match images
with varying lighting conditions and usually good coverage. SAD on the
other hand is used when the images are not expected to have bias or
gain changes, and coverage is low, e.g., two view stereo. Note that, even
when the illumination is constant and the camera response is known,
images can still show local gain changes due to the presence of non-
Lambertian materials. As a result, some techniques [141] combine SAD
with another measure such as NCC or Census.
The work of space carving [166, 128] was seminal in the proposal of
a theory of visibility consistency. Given a 3D volume partitioned into
a 3D grid of voxels, the volume is iteratively carved out by removing
voxels that are not photo-consistent. The main contribution of the work
was the proposal of a geometric constraint on the camera centers such
that there exists an ordinal visibility constraint on all the 3D voxels
in the scene. That is, starting from an initial bounding volume, one
can visit all the voxels in an order that guarantees that any potential
occluder voxel is always visited before its potential occluded voxel.
This property effectively solves the visibility issue as it provides a
voxel visiting order that guarantees that we never use an image where
the voxel being tested is occluded. The ordinal visibility constraint is
satisfied whenever no 3D point of the scene is contained in the con-
vex hull of the camera centers. This is a strong constraint, but there
are many useful capture configurations that satisfy this, for example,
when the camera centers lie on a 3D plane while the cameras look downwards, as is commonly done to capture 3D geometry at large scale with high-altitude airplanes or drones.
Figure 2.8: Voxel coloring [166] results of Dinosaur toy and Rose. The objects were
rotated 360◦ below a camera. Left shows one of the 21 input images. Right shows
two different views of the reconstruction.
Figure 2.9: Large scale view clustering of internet images of the Colosseum [69].
Top: The first step of the algorithm is to merge SFM points to enrich visibility
information (SFM filter). Bottom: Sample results of the view clustering algorithm.
View points belonging to extracted clusters are illustrated in different colors.
algorithm (See Figure 2.7). One popular approach is to use the current
reconstructed geometry to compute occlusions (e.g., a z-buffer testing),
select which views see which parts of the geometry, and iterate visibility
estimation and reconstruction. This iterative approach assumes there
is an initial geometry that is being refined [58, 175, 154, 78, 56, 55].
Therefore, the main disadvantage of these methods is that they depend
on a good initialization. If the initialization is not accurate, they can
easily get stuck in a local minimum. For this reason, iterative methods
are often used as a refinement step, once an initial coarser solution is
available, e.g. from a volumetric fusion approach [193].
Another popular solution is to rely on robust photo-consistency
statistics without explicitly estimating occlusion. An intuition is that
“bad” photographs would yield poor photo-consistency measures,
whose effects can be suppressed by the use of robust statistics if there
are enough “good” images present. The exact choice and usage of ro-
bust statistics is algorithm dependent, see Chapter 3 for more details.
3 Algorithms: From Photo-Consistency to 3D Reconstruction
Figure 3.1: MVS algorithms can be classified based on the output scene represen-
tation. The four popular representations are a depthmap(s), a point cloud, a volume
scalar-field, and a mesh. Note that a point cloud is very dense and may look like a
textured mesh model, but is simply a collection of 3D points. Reconstruction exam-
ples are from state-of-the-art MVS algorithms presented in [48], [74], [94], and [93]
respectively, from top to bottom.
1 There exist fully automated silhouette extraction algorithms for scenes where there is a clear separation between an object and a background [47, 49, 50]. However, they are outside the scope of this article and the reader is referred to those papers for details.
geometric proxy for the sky per view for better rendering. This is not
easy for a mesh, a point cloud, or a voxel, whose definitions are usually
independent of the views.
Free-viewpoint rendering, on the other hand, allows one to move
freely in the space, which serves better for navigation and exploration
purposes. Google Earth is a good example of free-viewpoint render-
ing. However, the rendering is usually view-independent and lacks re-
alism. For this task, a mesh or a point cloud is naturally more suitable.
Texture-mapped MVS meshes have been successfully used in real prod-
ucts for outdoor city visualization [30, 108, 144]. Point based rendering
techniques have been extensively studied in the Computer Graphics lit-
erature [83]. High quality visualization results have been demonstrated
for high fidelity MVS point clouds or depthmaps, which can be treated
as a point cloud [69, 117, 104]. However, relatively little work exists on the point-based rendering of MVS point clouds specifically. MVS
point clouds often suffer from severe noise and large reconstruction
holes, and the rendering quality may degrade significantly.
The last application is geometry manipulation, which is becoming increasingly important as MVS techniques evolve and make it possible to reconstruct complex and large scenes. Handling multiple MVS reconstructions is often necessary to complete a model of a scene. A mesh representation faces challenges for this task, as it is often difficult to control the topology of the mesh through merging and splitting operations while enforcing manifoldness.
In Figure 3.2, polished voxels (volumetric scalar-field) and mesh are
at the bottom of the diagram. However, this may not be the goal of
every MVS system. For example, if view-dependent texture mapping is
the application, one should simply pick a depthmap reconstruction al-
gorithm. If free-viewpoint rendering is the application, one can conduct
point-cloud reconstruction from images, then use a point-based rendering technique without running any of the other steps in the diagram. Of course, a high-quality polygonal mesh model is
often a preferred scene representation, and all the processing converges
to the mesh reconstruction in the diagram.
Evaluations
Figure 3.3: Winner-takes-all strategy for depthmap reconstruction. The figure il-
lustrates a process to estimate a depth value for a pixel highlighted by a black
rectangle in the left image. The global maximum of the photo-consistency function
such as the NCC score is chosen to be the reconstructed depth for the pixel.
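A minimal sketch of the winner-takes-all strategy follows, assuming the photo-consistency scores have already been evaluated at a discrete set of candidate depths for every pixel (a cost volume); the depth with the globally maximal score is kept, together with that score as a crude confidence.

import numpy as np

def winner_takes_all_depth(cost_volume, depths):
    """Winner-takes-all depthmap extraction.  `cost_volume` is a D x H x W
    array of photo-consistency scores (larger = more consistent), sampled
    at D candidate depths along each pixel's optic ray; the chosen depth
    for every pixel is the global maximum of its score curve."""
    best = np.argmax(cost_volume, axis=0)          # H x W best-depth indices
    peak = np.take_along_axis(cost_volume, best[None], axis=0)[0]
    return depths[best], peak                      # depthmap, peak score

# Example: 32 depth hypotheses for a tiny 4x5 image.
depths = np.linspace(1.0, 10.0, 32)
scores = np.random.default_rng(2).random((32, 4, 5))
depthmap, confidence = winner_takes_all_depth(scores, depths)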
[Figure: correlation score plotted against depth along the optic ray, showing a photo-consistency curve with a local maximum, a simple average over views, and a robustified curve.]
Figure 3.6: Reconstruction results by Goesele, Curless and Seitz [80]. The first
row shows a reference image and two depthmaps with different thresholds for the
robust photo-consistency evaluation. The second row shows that even though a
single depthmap may contain many holes, multiple depthmaps can be merged into
a complete 3D model. (Figure courtesy of Goesele et al.)
Figure 3.7: The effects of the number of input images after depthmap merging for
the two datasets. The algorithm by Goesele, Curless, and Seitz is used [80]. (Figure
courtesy of Goesele et al.)
The first summation is over all the pixels in the image, while the second summation is over all the pairs of neighboring pixels, denoted as N.
Unary Potentials
Optimization
Alpha-expansion can be used as the optimizer when the pairwise cost at every pair of neighboring pixels satisfies the following submodularity condition [122]:
different pixels have different label sets. They also allow the “unknown”
label to indicate that no correct depth can be estimated in certain
cases. In this situation, they acknowledge that the depth at this pixel
is unknown and should therefore offer no contribution for the surface
location. This process means that the returned depth map should only
contain accurate depths, estimated with a high degree of certainty.
The process consists of two phases: 1) extraction of depth labels;
and 2) MRF optimization to assign extracted depth labels. We now
describe the details of the algorithm.
The first phase is to obtain a hypothesis set of possible depths for each
pixel p in a reference image Iref . After computing a photo-consistency
curve within a depth range between Iref and each of the neighboring
images, they store the top K peaks {di (p)|i ∈ [1, K]} with the greatest
scores from all the curves. NCC is the photo-consistency function. As
described before, another key feature of the algorithm is the inclusion
of an unknown state U, which is to be selected when there is insufficient
evidence. Therefore, for each pixel, they form an augmented depth label
set {{di (p)}, U}.
MRF Optimization
neighboring pixels:
\Phi(k_p = x) = \begin{cases} \exp\left[-\beta \cdot C(p, x)\right] & \text{if } x \in \{d_i(p)\} \\ \Phi_U & \text{if } x = U. \end{cases}    (3.7)
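A small sketch of this unary term follows; the constant Φ_U and the scaling β are illustrative values, and C(p, d_i(p)) is assumed to be the NCC photo-consistency at the corresponding candidate depth, so that stronger evidence yields a lower cost.

import numpy as np

def unary_potential(label, peak_scores, beta=1.0, phi_unknown=0.3):
    """Unary cost of Eq. (3.7) for one pixel.  `peak_scores` holds the NCC
    photo-consistency C(p, d_i(p)) of each of the K extracted depth
    candidates, so exp(-beta*C) is small when the evidence is strong;
    `label` is either an index into that list or the string 'unknown'.
    phi_unknown (Phi_U) is a constant whose value here is illustrative."""
    if label == "unknown":
        return phi_unknown
    return float(np.exp(-beta * peak_scores[label]))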
less scale dependent. In the second and the third cases where one of the
labels is the unknown state, the constant penalty is imposed to prevent
frequent switches between depth labels and the unknown state. In the
last case, both labels are the same unknown state and the penalty is
set to 0 to favor the spatial consistency.
The pairwise cost is unfortunately not submodular in this formula-
tion, because depth labels are extracted for each pixel independently,
and the meanings of the ith label are arbitrarily different for different
pixels. For example, Ψ(di (p), di (q)) is 0 in the standard MRF formu-
lation (3.6), because di (p) and di (q) are pixel-independent and corre-
spond to the same depth value. However, that is not the case in this
formulation. Therefore, alpha-expansion is not applicable, but message
passing algorithms such as loopy belief propagation (LBP) [204] and
tree-reweighted message passing (TRW) [194], which are other popu-
lar optimization techniques for MRF, can be used. In particular, TRW
has been successfully applied to solve many Computer Vision problems
including depthmap reconstruction [123, 179], and is used in their work.
Figure 3.8 illustrates the photo-consistency curves and the loca-
tions of their local maxima at ten contiguous pixels across an occlusion
boundary. Notice that the unknown label is assigned to a pixel at the
occlusion boundary (sixth pixel from the top), where the spatial reg-
ularization is enforced to assign a correct depth label even where the
global maximum of the curve corresponds to a false depth (fourth pixel
from the top). Figure 3.9 lists more experimental results together with
some intermediate reconstructions for evaluation. As the figure illus-
trates, a single depthmap contains holes both at the pixels where the
unknown state label is assigned, and at the surface regions that are not
visible in the reference image. However, it is important to only recon-
struct regions with high confidence to minimize the presence of noise
in the following merging step. Figure 3.9g illustrates that the model
becomes near complete in the superimposition of only two depthmaps.
Figure 3.10: Plane sweeping stereo algorithm by Gallup, Frahm, Mordohai, Yang,
and Pollefeys [76]. Top: photo-consistency evaluation becomes the most accurate
and exact when the sweeping plane direction aligns with the surface orientation.
Bottom: multiple sweeping directions are extracted from an SfM point cloud, and
used for the stereo algorithm. (Figure courtesy of Gallup et al.)
Figure 3.13: Left: a patch p is a (3D) rectangle with its center and normal denoted
as c(p) and n(p), respectively.
analyzes consistency of patches across all the views and removes falsely
reconstructed ones. We will first explain several fundamental building
blocks for the algorithm, then provide details of the three processing
steps, namely, initial feature matching, expansion and filtering. Note
that a simplified version of the algorithm is described in this article for conciseness; the reader is referred to their journal paper [74] for full details.
Patch Model
Figure 3.14: A robust function is applied to the raw NCC score. The curve shows that an NCC score below 0.4 is effectively ignored, while the score has more influence (steeper slope in the curve) as its value approaches 1.0.
The robust function makes the measure robust against outlier signals (see Section 3.1.2 for a similar robust photo-consistency technique). Let C denote the NCC score; the robust photo-consistency is then defined as −C′/(3C′ + 1), where C′ = min(τ, 1 − C).
τ is a truncation threshold and set around 0.4. The shape of the robust
function is illustrated in Figure 3.14.
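The following few lines evaluate the robust function in the truncated form given above; the expression and threshold track the description in this section rather than any particular released implementation.

def robust_ncc(C, tau=0.4):
    """Robustified NCC score: scores far from 1 are truncated so outliers
    stop influencing the optimization, while the influence (slope) grows
    as the score approaches 1."""
    Cp = min(tau, 1.0 - C)          # truncated "badness" of the score
    return -Cp / (3.0 * Cp + 1.0)

for C in (1.0, 0.8, 0.4, -0.5):
    print(C, robust_ncc(C))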
Having defined the photometric consistency measure for a patch
as a function of its position and the normal, reconstructing a patch
is simply achieved by maximizing the photo-consistency function with
respect to those parameters. At first sight, the function has five param-
eters to be optimized, because the position consists of three parameters
and the normal consists of two parameters. However, a patch should not move tangentially along the surface during optimization; only the offset of its position perpendicular to the surface should be optimized. That perpendicular direction depends on the patch normal, which is itself being optimized. Therefore, in practice, the direction along which the patch center can move is fixed before and throughout the optimization, so that one parameter for the position and two parameters for the normal are optimized via a standard non-linear least squares technique.
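A sketch of this three-parameter refinement is shown below: the patch centre is constrained to a fixed ray and the normal is parameterized by two angles, while the photo-consistency evaluation itself is passed in as a user-supplied callable. A derivative-free optimizer from SciPy stands in for the solver used in the actual system.

import numpy as np
from scipy.optimize import minimize

def refine_patch(depth0, theta0, phi0, ray_origin, ray_dir, photo_consistency):
    """Refine a patch with three parameters: one offset along a fixed ray
    (position) and two angles (normal).  `photo_consistency` is a callable
    (centre, normal) -> score to be maximized."""
    def normal_from_angles(theta, phi):
        return np.array([np.sin(theta) * np.cos(phi),
                         np.sin(theta) * np.sin(phi),
                         np.cos(theta)])
    def objective(x):
        depth, theta, phi = x
        centre = ray_origin + depth * ray_dir
        return -photo_consistency(centre, normal_from_angles(theta, phi))
    res = minimize(objective, x0=[depth0, theta0, phi0], method="Nelder-Mead")
    depth, theta, phi = res.x
    return ray_origin + depth * ray_dir, normal_from_angles(theta, phi)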
Figure 3.15: Image projections of reconstructed patches in their visible images are
used to perform fundamental tasks such as accessing neighboring patches, enforcing
regularization, etc. See text for more details.
3.2.2 Algorithm
Their patch-based MVS algorithm attempts to reconstruct at least one
patch in every image cell Ci (x, y). It is divided into three steps: (1)
initial feature matching, (2) patch expansion, and (3) patch filtering.
The purpose of the initial feature matching step is to generate a sparse
set of patches (possibly containing some false positives). The expan-
sion and the filtering steps are iterated n times (typically n = 3) to
make patches dense and remove erroneous matches. The three steps
are detailed in the following sections.
The overall algorithm description for this step is given in Fig. 3.16. Of course, this relatively simple procedure may not be perfect and may yield mistakes, but the filtering step will handle such errors.
Expansion
The goal of the expansion step is to reconstruct at least one patch in
every image cell Ci (x, y), where they repeat taking existing patches and
generating new ones in nearby empty spaces. More concretely, given a
patch p, they first identify a set of neighboring image cells Cells(p)
that do not contain any patches yet:
For each collected image cell Ci (x, y) in Cells(p), the following ex-
pansion procedure is performed to generate a new patch p′. They first initialize n(p′) and V(p′) with the corresponding values of p. c(p′) is, in turn, initialized as the point where the viewing ray, passing through the center of Ci(x, y), intersects the plane containing the patch p. They then refine c(p′) and n(p′) by the optimization procedure described in Sect. 3.2.1. They remove from V(p′) the images whose average pairwise photo-consistency score with the remaining images in V(p′) is less than a threshold. They also add images to V(p′) if their average pairwise photo-consistency scores are above the threshold. Finally, if |V(p′)| ≥ γv, they accept the patch as a success and update Qi(x, y) for its visible images. The process repeats until the expansion process
is performed from every patch that has been reconstructed. The overall
algorithm description is given in Fig. 3.17.
Filtering
The expansion step is greedy and relies solely on photo-consistency measures for reconstructing patches, making it difficult to avoid generating erroneous patches. In the last step of the algorithm, the
following two filters are used to remove erroneous patches. The first
filter relies on visibility consistency. Let us define that patches p and p′
Figure 3.17: Patch expansion algorithm. The expansion and the filtering procedure
is iterated n(= 3) times to make patches dense and remove outliers.
are neighbors if their distance along the normals is less than a threshold:
Figure 3.18: The first filter enforces global visibility consistency to remove outliers
(red patches). An arrow pointing from pi to Ij represents a relationship Ij ∈ V (pi ).
In both cases (left and right), U (p) denotes a set of patches that is inconsistent in
visibility information with p.
of patches that are neighbors of p (Eq. 3.13) in this set is lower than
0.25, p is removed as an outlier.
Figure 3.19 shows a sample input image, the image resolution, and the
number of input images for each dataset. Patch based representation
is flexible and can handle both "object"-like datasets (top row of Figure 3.19), where cameras surround an object, and "scene"-like datasets, where cameras are surrounded by a scene. Their reconstructed patches are shown in Figure 3.20. Note that the patches are dense and look like a surface model, but are merely point clouds. The figure illustrates that the reconstructed patches are free from noise, which shows the robustness of the algorithm despite the fact that patches are reconstructed independently
without explicit regularization. The bottom half of Figure 3.20 shows
the polygonal surface models converted from the patches, which veri-
fies the geometric accuracy of reconstructed patches (See Section 3.3
for surface meshing techniques).
are robust against noisy point clouds full of reconstruction holes,
which are more common in Computer Vision applications. However,
more sophisticated optimization and regularization techniques have
also been sought in tackling challenges by Computer Vision researchers.
This section explains such techniques.
Volumetric surface extraction is flexible and the input 3D informa-
tion can come from many different sources such as photo-consistency
volumes, depthmaps, MVS point clouds, laser scanned 3D points, or
any combination of those. It is a challenging task to fuse such a diverse set of 3D measurements into a single clean mesh model with the right
The first term is the summation of per voxel cost over the entire domain,
where Φ(kv ) encodes the cost of assigning a label to voxel v. The second
term is the summation over all the pairs of adjacent voxels denoted as
N . Notice its resemblance to the Markov Random Field formulation
for a depthmap reconstruction (3.2). Φ is a unary term, which depends
on a single variable, while Ψ is a pairwise interaction term. We first
explain how these cost terms should be set, then introduce optimization
algorithms to solve the problem.
4 In the 1990s, Roy and Cox proposed a reconstruction algorithm with the max-flow min-cut optimization method, whose problem formulation is very similar [158]. However, they reconstruct a single depthmap (disparity map) as opposed to a full polygonal mesh model.
Optimization
The MRF formulation in (Equation 3.15) has only two possible la-
bels (“interior” or “exterior”) and is much simpler than that in (Equa-
tion 3.2) for depthmap reconstruction. Therefore, the problem in the
form of (Equation 3.15) can be solved exactly and efficiently with a
graph-cuts algorithm, as long as each pairwise term Ψ(kv , kw ) is sub-
modular [122]:
\Psi(\text{interior}, \text{interior}) + \Psi(\text{exterior}, \text{exterior}) \leq \Psi(\text{interior}, \text{exterior}) + \Psi(\text{exterior}, \text{interior}).    (3.17)
Usually, pairwise terms satisfy the above condition for our reconstruc-
tion problems, because the submodularity goes well with the smooth-
ness prior, and the left hand side of the inequality is typically 0.
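The sketch below sets up this binary labeling problem as an s-t min-cut on a small voxel grid using networkx; production systems use specialized max-flow solvers, and the constant pairwise penalty here is an illustrative stand-in for the photo-consistency-weighted terms discussed in this section.

import networkx as nx
import numpy as np

def fuse_voxels(phi_interior, phi_exterior, smoothness):
    """Exact interior/exterior labelling of a voxel grid by s-t min-cut.
    phi_interior / phi_exterior are H x W x D arrays of unary costs and
    `smoothness` is the constant pairwise penalty Psi(interior, exterior)
    paid across every pair of adjacent voxels with different labels
    (Psi is 0 when the labels agree, so the energy is submodular).
    Returns a boolean array that is True for interior voxels."""
    shape = phi_interior.shape
    G = nx.DiGraph()
    def nid(idx):                 # voxel index -> graph node id
        return "v" + "_".join(map(str, idx))
    for idx in np.ndindex(shape):
        # Terminal edges encode the unary costs.
        G.add_edge("source", nid(idx), capacity=float(phi_exterior[idx]))
        G.add_edge(nid(idx), "sink", capacity=float(phi_interior[idx]))
        # 6-connected pairwise edges encode the smoothness prior.
        for axis in range(3):
            nbr = list(idx)
            nbr[axis] += 1
            if nbr[axis] < shape[axis]:
                G.add_edge(nid(idx), nid(tuple(nbr)), capacity=smoothness)
                G.add_edge(nid(tuple(nbr)), nid(idx), capacity=smoothness)
    _, (source_side, _) = nx.minimum_cut(G, "source", "sink")
    labels = np.zeros(shape, dtype=bool)
    for idx in np.ndindex(shape):
        labels[idx] = nid(idx) in source_side   # source side = interior
    return labels

# Tiny demo: a 2x2x2 grid whose unary costs favour the first slab being interior.
phi_int = np.zeros((2, 2, 2)); phi_int[1] = 5.0
phi_ext = np.zeros((2, 2, 2)); phi_ext[0] = 5.0
print(fuse_voxels(phi_int, phi_ext, smoothness=1.0))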
Reconstruction Results
Figure 3.23 shows the reconstruction results by Vogiatzis, Torr, and
Cipolla [191], who proposed one of the earliest volume fusion techniques
based on graph-cuts in 2005. They used the constant ballooning term
and took the photo-consistency volume as the input.
One limitation of the use of a voxel grid is that the memory allo-
cation quickly becomes very expensive. One effective solution to this
problem is an octree data structure, which is essentially an adaptive
voxel grid. The grid is subdivided based on the input depthmaps so
that the grid is subdivided where the surface likely exists. Figure 3.24
shows input images, and a sample reconstruction result of a volumet-
ric fusion technique with the octree space discretization by Hernández,
Vogiatzis, and Cipolla [94]. The top row shows four of the seventy two
images of a “crouching man” sculpture by the modern sculptor Antony
Gormley. The second row shows the reconstruction when Φ(ku ) is set to
a constant ballooning. In the third row, Φ(ku ) is set to the 3D visibility
information computed from the depthmap data. Due to the shrinkage
bias, the reconstruction with the constant ballooning term fails in re-
constructing deep concavities in various parts of the model. Figure 3.25
illustrates the unary and pairwise potentials on an octree that are collected from multiple depthmaps [94]. The unary potential alone clearly
Figure 3.23: One of the earliest volume fusion techniques based on the volumetric
graph-cuts by Vogiatzis, Torr and Cipolla [191]. (Figure courtesy of Vogiatzis et al.)
shows the separation between the object interior and exterior spaces,
where the pairwise terms concentrate on surface boundaries. The right
of Figure 3.25 shows the octree structure.
Sinha, Mordohai, and Pollefeys presented a similar approach [169],
where a grid of tetrahedra (with a technique similar to the octree to
save space) is used to discretize the space, and the graph-cuts algorithm
is used to extract a surface model. In their work, the photo-consistency
is used to drive the tetrahedral subdivision (See Figure 3.26). They
refine the model for polishing after extracting the mesh from the tetra-
hedral grid. The figure shows the model after the refinement. Refer to
Section 3.4 for the details of similar mesh refinement techniques.
The unary term at p0 enforces that the cell containing the camera
center must be in the scene exterior space, while the term at q2 adds a
bias for the cell to be in the scene interior space. For every pair (p1 , q1 )
of triangles whose common face intersects with the visible ray, where p1
is on the camera side, a pairwise interaction term is defined as follows
to penalize a pairwise configuration (p1 = interior, q1 = exterior):
\Psi(p_1, q_1) = \begin{cases} \beta & \text{if } p_1 = \text{interior and } q_1 = \text{exterior} \\ 0 & \text{in the other three cases.} \end{cases}    (3.21)
For every ray between a 3D point and its visible camera, the above
unary and pairwise terms are computed and accumulated as op-
posed to overwriting. Due to the pairwise term construction, the
above energy is guaranteed to be submodular. Therefore, the graph-
cuts algorithm can be used to find the optimal labeling of the triangular cells, where the label boundary is extracted as a surface model.
[Figure: a visible ray from the camera center to a 3D point on the surface passes through cells p0, p1, q1, q2. The figure annotates Ψ(p1, q1) = β if p1 = interior and q1 = exterior, 0 in the other three cases; Φ(p0) = ∞ if interior, 0 if exterior; Φ(q2) = 0 if interior, α if exterior.]
Figure 3.28: Given a single pixel in a depthmap or a single 3D point, unary and
binary cost terms are defined for cells that intersect with the visible ray.
Figure 3.31: Jancosek and Pajdla modify the Delaunay tetrahedralization based
volumetric graph-cuts approach to handle “weak” surfaces that would have been
missed with a standard method. (Figure courtesy of Jancosek et al.)
iteration [208], although the behavior of these techniques is not well understood and may pose issues in challenging cases.
In the remainder of this section, we first explain the algorithmic details and framework common to many mesh refinement methods, then provide specific examples, and finally show their reconstruction results.
of the mesh. They search for the optimal offset d̂i along the surface
normal direction that maximizes the photometric consistency score on
the tangent plane
The squared distance error metric is prone to outliers, and a more robust function can also be used to define the penalty term [74]. They
recompute the surface normal and the offset at each vertex in every
iteration.
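A sketch of a single refinement step for one vertex is given below, assuming a user-supplied photo-consistency function evaluated on the tangent plane; the search over the offset is a simple 1D sampling, which is only one of several reasonable ways to implement the maximization.

import numpy as np

def refine_vertex(vertex, normal, photo_consistency, search_range=0.05, steps=21):
    """One mesh-refinement step for a single vertex: sample candidate
    offsets d along the surface normal within +/- search_range, evaluate
    the photometric consistency of the tangent plane at each offset, and
    return the displaced vertex with the best score.  The 1D sampling is
    an illustrative stand-in for the gradient-based updates used in
    practice."""
    offsets = np.linspace(-search_range, search_range, steps)
    scores = [photo_consistency(vertex + d * normal, normal) for d in offsets]
    d_hat = offsets[int(np.argmax(scores))]
    return vertex + d_hat * normal, d_hat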
Lastly, Hiep, Keriven, Labatut, and Pons [193] defined the term as
a sum of photometric consistency scores over all the faces of a discrete
triangulated mesh model:
E_p(\{v_i\}) = \sum_{f \in F} \; \sum_{(I_i, I_j) \in V_f} C(f, I_i, I_j).    (3.26)
F denotes the set of faces in the mesh model, Vf is a set of pairs of im-
ages in which face f is visible, and C(f, Ii , Ij ) denotes the photometric
consistency score evaluated on face f based on the two images Ii and Ij .
They handle triangulated mesh models, and the definition of the term implies that its gradient with respect to a vertex position vi is determined by the incident faces. Unlike many methods that use tangent planes or fronto-parallel surface approximations, the use of a triangulated mesh model for photometric consistency evaluation has
Figure 3.32: Regularization force in the mesh refinement framework is given by the
Mesh Laplacian and Mesh Bi-Laplacian operators. The figure shows examples for
1D surfaces. Laplacian becomes zero when the surface is planar, while Bi-Laplacian
becomes zero when the surface has constant or uniform curvature. Regularization
forces are calculated by setting α = 0.5 and β = 0.5 in (3.30).
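A compact sketch of these operators on a triangle mesh follows, using uniform ("umbrella") weights; since Eq. (3.30) itself is not reproduced here, the way the two operators are combined into a force with weights α and β is stated as an assumption.

import numpy as np

def uniform_laplacian(vertices, neighbors):
    """Umbrella (uniform-weight) mesh Laplacian: for each vertex, the
    average of its one-ring neighbours minus the vertex itself."""
    L = np.zeros_like(vertices)
    for i, nbrs in enumerate(neighbors):
        L[i] = vertices[list(nbrs)].mean(axis=0) - vertices[i]
    return L

def regularization_force(vertices, neighbors, alpha=0.5, beta=0.5):
    """Regularization force as an assumed weighted combination of the
    Laplacian and the bi-Laplacian: the Laplacian term pulls the surface
    towards planar neighbourhoods, while the bi-Laplacian term pulls it
    towards uniform curvature."""
    L = uniform_laplacian(vertices, neighbors)
    L2 = uniform_laplacian(L, neighbors)   # bi-Laplacian = Laplacian of the Laplacian
    return alpha * L - beta * L2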
Figure 3.34: An MVS algorithm by Hernández and Schmitt calculates the photo-
consistency volume in an octree data structure (top left), from which the gradient
vector flow (GVF) [200] is used to compute the photometric consistency force (top
middle). The close-ups of the octree data structure and the GVF are shown at the
top right. Bottom left shows a shaded rendering of the photo-consistency volume
before GVF, which is basically a 3D point cloud. Their algorithm enforces silhou-
ette consistency in the mesh refinement process. The bottom middle figure shows
the signed distance to the closest image silhouette for each vertex of a mesh, d_s(v_i),
where the sign is positive for vertices inside the silhouette and negative for those
outside the silhouette. Silhouette consistency should be enforced only near contour
generators, that is, vertices that project near input image silhouettes, where the bot-
tom right figure visualizes the strength of the silhouette consistency to be enforced.
Figure 3.35: Reconstruction results by Hernández and Schmitt [93]. From left to
right, one of the 36 input images, the reconstructed mesh model, and the texture
mapped mesh model for each dataset. The visual hull model is used to initialize the
mesh, which is then iteratively refined.
Silhouette consistency is enforced only near contour generators, that is, surface regions that are visible as a silhouette boundary in an image. Only the Laplacian operator is used to compute
the regularization force without the bi-Laplacian operator, that is, β
is set to 0 in (3.30). An input image, the reconstructed mesh model,
and the texture mapped mesh model of their algorithm are shown for
the three datasets in Figure 3.35. They used a turn-table to acquire 36
images of 2008 × 3040 pixels for each dataset. The numbers of the
vertices in the final model are 45,843, 83,628, and 114,496, respectively,
where more reconstruction results can be found in the paper. In ad-
dition to the intricate details, the method succeeded in reconstructing
thin structures, where silhouette consistency plays an important role.
This work was published in 2004 and was among the first to prove the potential of multi-view stereo algorithms. Their total running time can reach 16 hours for a dataset, but this is a running time from 2004, on a Pentium 4 1.4 GHz machine. The system should run much faster on state-of-the-art multi-core processors, potentially with the use of GPUs for further speedup.
One of the most successful state-of-the-art MVS systems was pro-
posed by Vu, Labatut, Keriven, and Pons in 2009 [193], where the mesh
refinement algorithm is used in the last step. Their overall pipeline is
illustrated in Figure 3.36, where their reconstruction results are shown
4 Multi-view Stereo and Structure Priors
Figure 4.1: Standard MVS algorithms fail in reconstructing even a simple scene
like this, which is full of homogeneous surfaces. The figure shows the reconstruction
result of PMVS software by Furukawa and Ponce [74] followed by Poisson Surface
Reconstruction software by Kazhdan [115].
After the work of Birchfield and Tomasi, MVS research focused rather on the reconstruction of high fidelity 3D models by exploring
effective photo-consistency evaluation schemes without relying much
on priors. After the success of MVS algorithms on near Lambertian
textured surfaces (Chapter 3), MVS researchers revisited the use of
structure priors. Many such algorithms are based on the 2D MRF for-
mulation as in the depthmap estimation in Section 3.1, while resorting
to powerful optimization machinery such as graph-cuts or belief prop-
agation techniques.
In a standard MRF formulation for depthmap estimation, the cost
function is a combination of unary terms and pairwise interaction
terms, where labels encode depth values. We repeat Equation 3.2 de-
scribing the MRF formulation in the following for easy reference.
E(k) = \sum_p \Phi(k_p) + \sum_{(p,q) \in N} \Psi(k_p, k_q).    (4.1)
Figure 4.2: Birchfield and Tomasi proposed a piecewise affine disparity estimation
algorithm back in the 90s. The left is one of the input images, while the middle and
the right show the reconstructed 3D model rendered from two different viewpoints.
(Figure courtesy of Birchfield et al.)
The pairwise term usually penalizes the distance between labels, in our case the amount of discrepancy of the depth values at adjacent pixels, which enforces fronto-parallel surfaces, because the pairwise term becomes zero when all the
pixels have the same depth labels.
As described in Section 3.1.5, the second order smoothness prior by
Woodford, Torr, Reid, and Fitzgibbon is a natural extension to this,
which favors piecewise planar surfaces [196]. However, the introduction
of higher order terms makes the energy non-submodular, which resulted
in much more complex optimization steps. Olsson, Ulen, and Boykov
changed the meaning of labels from depth values to depth values plus
surface normal orientations [148] to also enforce piecewise planarity.
This approach does not require higher order terms, but it does not guarantee submodularity and also needs complicated optimization steps.
In the following sections, we take a close look at MVS algorithms
that change the definition of labels to enforce high-level structure priors
such as planarity, while still keeping the submodularity to enable the simple application of the powerful graph-cuts technique.
Figure 4.3: Manhattan stereo algorithm [67]. Given a set of 3D points reconstructed
by a standard MVS algorithm, they estimate dominant axes and extract hypothesis
planes along each axis by finding point density peaks. Finally, the MRF formulation
is used to assign a hypothesis plane to each pixel.
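The plane-hypothesis extraction can be sketched as follows: project the reconstructed points onto a dominant axis, histogram the resulting offsets, and keep the offsets at local density peaks as candidate planes orthogonal to that axis. The bin size and the peak threshold below are illustrative parameters.

import numpy as np
from scipy.signal import find_peaks

def hypothesis_planes_along_axis(points, axis_dir, bin_size=0.05, min_count=50):
    """Sketch of hypothesis-plane extraction for Manhattan-world stereo:
    project the MVS points (Nx3) onto one dominant axis direction,
    histogram the offsets, and return the offsets at local density peaks.
    Each returned offset d gives a candidate plane { x : axis_dir . x = d }
    orthogonal to that axis."""
    offsets = points @ axis_dir
    edges = np.arange(offsets.min(), offsets.max() + bin_size, bin_size)
    hist, _ = np.histogram(offsets, bins=edges)
    peaks, _ = find_peaks(hist, height=min_count)
    return 0.5 * (edges[peaks] + edges[peaks + 1])   # bin-centre offsets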
1 For details of the other steps in the pipeline, such as the plane extraction, the reader is referred to the paper [67].
to the paper [67].
4.1. Departure from Depthmap to Planemap 103
Figure 4.4: Data term measures visibility conflicts between a plane hypothesis
at a pixel and all the reconstructed points {Pi }. There are three different cases in
which the visibility conflict occurs. The smoothness term in this figure measures the
penalty of assigning a hypothesis H n to pixel q, and a hypothesis H m to pixel p.
Data Term
Smoothness term
The smoothness term Ψ(kp , kq ) enforces spatial consistency and is 0,
if the same plane hypothesis is assigned, that is, hp = hq . Otherwise,
they seek a smoothness function that penalizes the amount of depth
Figure 4.5: Input image and extracted dominant lines, used as cues to control the
smoothness penalty. The red, green and blue components in the right figure show
the results of edge detection along the three dominant directions, respectively. Note
that yellow indicates ambiguity between the red and green directions.
Reconstruction Results
Figure 4.6 shows the reconstruction results of Manhattan-world Stereo
algorithm [67] on both indoor and outdoor architectural scenes, which
are full of planar surfaces that meet at right angles. Notice the pres-
ence of abundant textureless or highly non-Lambertian surfaces such
as white walls or shiny metallic surfaces, which are reconstructed suc-
cessfully. There are also a fair number of curved surfaces, which unfortunately break the Manhattan-world assumption, but their algorithm does its best to approximate their shapes by piecewise planar surfaces.
Figure 4.7 compares results against a state-of-the-art MVS pipeline
without structure priors (PMVS [74] and Poisson Surface Reconstruc-
tion software [115, 41, 116]). The Manhattan-world Stereo algorithm is, in essence, a depthmap algorithm, and reconstructs only the geometry visible in a single image, as opposed to the competing pipeline, which uses all the input images to reconstruct an entire scene model. Nonetheless,
the competing approach often makes gross errors due to the lack of reli-
able textures, which are cleanly reconstructed with the use of structure
priors.
Reconstruction results of a piecewise planar stereo algorithm by
Sinha, Steedly, and Szeliski [170] are presented in Figure 4.8 for various
[Figure: for each scene, the target image, depth map, depth normal map, mesh model, and texture-mapped mesh model.]
The algorithm by Zebedin et al. [209] (see Figure 4.9) goes beyond piecewise planar models and can handle curved surfaces, in particular surfaces of revolution, in addition to planes of arbitrary orientations.
They estimate an elevation model or a height field representation of
cities, which is formulated as a 2D MRF problem in a top down view.
Their system is optimized and tested for aerial photographs of cities,
and yields very impressive 3D building models (See Figure 4.10). The
formulation is very similar to the piecewise planar stereo algorithms in
the previous section, where an MRF is used to assign a primitive ID,
but with two key differences. The first difference is that the domain
is discretized by a grid of much larger rectangular cells (left of Fig-
ure 4.9) based on the 3D line segments reconstructed by a multi-view
line matching algorithm [32]. A primitive ID is assigned to each cell,
which has additional data-aware regularization effects. The second dif-
ference is the handling of surfaces of revolution as geometric primitives,
which may not be effective for arbitrary scenes, but such surfaces often arise in outdoor architectural scenes.
Figure 4.9: Left: Reconstructed 3D lines are used to segment the top down view
of a scene. (Figure courtesy of Zebedin et al.)
Figure 4.10: Extracted geometric primitive IDs are assigned to each 2D cell in
a top down view of a city, yielding impressive 3D building models [209]. (Figure
courtesy of Zebedin et al.)
simply indicate a non-planar structure that has the same depth value as the raw depthmap. The algorithm employs several techniques beyond standard RANSAC to improve the quality of the extracted planes, mostly by exploiting the fact that the input samples (depth values at pixels) have a spatial structure, that is, they are laid out on a 2D grid as an image. They also merge planes extracted from different images to obtain a single set of planes for all the input images. We refer the reader to the paper for further details [77].
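For readers unfamiliar with the baseline being improved upon, the following Python sketch shows standard RANSAC plane fitting on a set of 3D points (e.g., back-projected depthmap samples). It deliberately ignores the 2D grid structure and the cross-image plane merging described above, which are the refinements of [77]; function and parameter names are ours.

    import numpy as np

    def ransac_plane(points, num_iters=500, inlier_thresh=0.01, rng=None):
        # Fit a plane n.x + d = 0 to the Nx3 array `points` with standard RANSAC.
        # Only the generic baseline is shown; the grid structure of depthmap
        # pixels exploited in [77] is ignored here.
        rng = np.random.default_rng() if rng is None else rng
        points = np.asarray(points, dtype=float)
        best_plane, best_inliers = None, np.zeros(len(points), dtype=bool)
        for _ in range(num_iters):
            sample = points[rng.choice(len(points), size=3, replace=False)]
            normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
            norm = np.linalg.norm(normal)
            if norm < 1e-9:              # degenerate (nearly collinear) sample
                continue
            normal /= norm
            d = -normal.dot(sample[0])
            distances = np.abs(points @ normal + d)
            inliers = distances < inlier_thresh
            if inliers.sum() > best_inliers.sum():
                best_plane, best_inliers = (normal, d), inliers
        return best_plane, best_inliers

In practice, one would run such a procedure repeatedly on the remaining points to extract multiple planes; the spatially aware variants simply make each step more reliable.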
The image classification is used to change the data term for these labels, where k_p is the label to be assigned to pixel p:

    Φ(k_p) =  C(k_p) + (1 − a)          if k_p is a plane label,
              C(k_p) + a + η_non_plane   if k_p is a non-plane label,      (4.5)
              η_discard                  if k_p is a discard label.

(The exact definition of the data term Φ(k_p) has an upper bound as a hinge loss, but it is omitted here for simplicity.)
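As an illustration only, the following Python sketch evaluates such a per-pixel data term. Here a is taken to be the classifier's confidence that the pixel lies on a planar surface, and the penalty constants are placeholder values; both are assumptions made for the example rather than the settings of [77].

    def data_term(label, matching_cost, a, eta_non_plane=0.5, eta_discard=1.0):
        # Per-pixel data term in the spirit of Equation (4.5).
        #   label         -- "plane", "non_plane", or "discard"
        #   matching_cost -- photo-consistency cost C(k_p) of the candidate label
        #   a             -- classifier score in [0, 1]; assumed high for plane-like pixels
        #   eta_*         -- constant penalties; the defaults are illustrative only
        if label == "plane":
            return matching_cost + (1.0 - a)   # plane labels become cheaper as a grows
        if label == "non_plane":
            return matching_cost + a + eta_non_plane
        return eta_discard                     # discard label: constant cost

In an actual MRF formulation, this term would be evaluated for every pixel and candidate label and combined with a smoothness term before optimization.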
5.1 Software
Combined with the Poisson Surface Reconstruction software by Michael Misha Kazhdan [115, 41, 116], which turns an oriented point cloud into a mesh model, people have built an end-to-end, fully automated 3D reconstruction pipeline.
More recently, “VisualSFM” by Changchang Wu [198], a GUI application for SfM, and the MVS software “CMPMVS” by Michal Jancosek and Tomas Pajdla [109] were developed as compelling alternatives.
The Multi-View Environment (MVE) is an implementation of a
complete end-to-end pipeline for image-based geometry reconstruction,
developed at TU-Darmstadt [12]. It features SfM, MVS, and Surface
Reconstruction. The individual steps of the pipeline are available as
command line applications, but most features are also available from a
user interface.
Open Multiple View Geometry (OpenMVG) [13] is an open-source library for computer-vision scientists, especially targeted at the multiple view geometry community. The library is designed to provide easy access to classical problems in multiple view geometry, for example, feature detection/description/matching, feature tracking, and bundle adjustment. The library also includes two complete SfM pipelines. OpenMVG provides customizable tools, and the community has built data pipelines based on OpenMVG that feed other multi-view geometry software such as PMVS, CMPMVS, and MVE.
5.2 Best Practices for Image Acquisition
Image acquisition is the first critical step for successful MVS. Here, we summarize best practices and know-how for successful image acquisition.
Accuracy of the camera models: MVS techniques are highly dependent on the accuracy of the camera parameters. The typical reprojection error (RMSE) should be sub-pixel, ideally smaller than 0.5 pixels. If the reprojection error is large, one possibility is to shrink the input images and modify the corresponding camera parameters accordingly, which reduces the reprojection error proportionally to the shrinkage ratio.
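As a concrete illustration of this shrinking trick, the sketch below downscales an image by a factor and scales the pinhole intrinsics by the same factor, so that reprojection errors measured in pixels shrink proportionally. It assumes OpenCV and NumPy are available and a standard 3x3 calibration matrix; it is a minimal example, not part of any particular MVS package.

    import cv2
    import numpy as np

    def shrink_image_and_intrinsics(image, K, scale=0.5):
        # Downscale an image and adjust the 3x3 pinhole intrinsics
        # K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] to match.
        h, w = image.shape[:2]
        small = cv2.resize(image, (int(round(w * scale)), int(round(h * scale))))
        K_small = np.array(K, dtype=float)
        K_small[0, 0] *= scale   # fx
        K_small[1, 1] *= scale   # fy
        K_small[0, 2] *= scale   # cx
        K_small[1, 2] *= scale   # cy
        return small, K_small

The extrinsic parameters (rotation and translation) are unaffected, since only the image plane is rescaled.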
Image resolution: One of the reasons that MVS has been so successful is the improvement in image sensors. Consumer-grade cameras produce high-quality, high-resolution images. High-resolution images bring out details that can be used to uniquely distinguish a pixel from its neighbors, thus strengthening the correspondence cue used by MVS algorithms to find similar pixels across multiple images. Note, however, that by resolution we do not mean simply having many pixels; the quality of the camera lens also matters. A very high-resolution image captured with a poor-quality lens will not improve results, and may actually make them worse, e.g., due to worse results at the SfM stage.
Image overlap: For MVS to work correctly, multiple images need to see the same piece of geometry from multiple viewpoints. Although the bare minimum is two images, each 3D point should typically be observed in at least three images for robustness.
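As a minimal illustration of this rule of thumb, the sketch below keeps only the sparse SfM points that are observed in at least three images; the data layout (a list of points, each carrying a dictionary of observations) is hypothetical and chosen only for the example.

    def filter_points_by_track_length(points, min_views=3):
        # Keep only the 3D points observed in at least `min_views` images.
        # Each point is assumed to carry an 'observations' dict mapping
        # image IDs to 2D detections (a hypothetical layout).
        return [p for p in points if len(p["observations"]) >= min_views]

    sparse_points = [{"xyz": (0.0, 0.0, 1.0), "observations": {0: (10, 20), 3: (12, 22)}}]
    robust_points = filter_points_by_track_length(sparse_points)  # dropped: only 2 views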
Baseline: MVS uses the principle of triangulation to reconstruct 3D geometry from pixel matches across multiple images. This means that the accuracy of the geometry depends directly on the baseline of the triangulation: the larger the baseline, the more accurate the reconstruction. On the other hand, a larger baseline makes it harder to match pixels across images for two reasons: 1) there are more occlusions, and 2) the appearance of the same surface point varies more across images, making it harder to match. As a result, there is a compromise between accuracy, robustness, and density. Typical optimal baselines, expressed as the viewing-angle difference at a 3D point between the input camera locations, are in the range of 5-15 degrees.
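This rule of thumb is easy to check numerically. The sketch below (plain NumPy, with assumed inputs) computes the viewing-angle difference at a 3D point between two camera centers and tests whether it falls within the suggested 5-15 degree range.

    import numpy as np

    def triangulation_angle_deg(point, center_a, center_b):
        # Angle (in degrees) at a 3D `point` between the rays toward two camera centers.
        ray_a = np.asarray(center_a, dtype=float) - np.asarray(point, dtype=float)
        ray_b = np.asarray(center_b, dtype=float) - np.asarray(point, dtype=float)
        cosine = ray_a.dot(ray_b) / (np.linalg.norm(ray_a) * np.linalg.norm(ray_b))
        return float(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))

    # A point 10 units in front of two cameras that are 2 units apart:
    angle = triangulation_angle_deg([0, 0, 10], [-1, 0, 0], [1, 0, 0])  # ~11.4 degrees
    good_baseline = 5.0 <= angle <= 15.0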
Number of images: MVS algorithms are extremely good at exploiting a large number of images together with view selection, as long as the following two conditions are met: 1) the SfM reprojection error is small; and 2) the image set contains high-quality images at small baselines. In other words, the more images are available, the better MVS works. The Middlebury MVS evaluation results [165] clearly show a relationship between the number of images and the reconstruction quality. However, if one has to choose between image resolution and the number of images, there is no easy decision. MVS algorithms reconstruct more details from higher-resolution images, as MVS suffers little from ambiguous matches. On the other hand, high-resolution images become an issue for SfM, as the so-called ratio test [136] would reject many feature matches.
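For reference, the ratio test mentioned above can be sketched as follows: a brute-force nearest-neighbor matcher that keeps a match only when the best descriptor distance is clearly smaller than the second best. The 0.8 threshold is the commonly used value, not one prescribed here, and the implementation is generic rather than that of any specific SfM package.

    import numpy as np

    def ratio_test_matches(desc_a, desc_b, ratio=0.8):
        # Match each descriptor row of desc_a to its nearest neighbor in desc_b,
        # keeping the match only if it passes the ratio test. Requires at least
        # two descriptors in desc_b.
        desc_a = np.asarray(desc_a, dtype=float)
        desc_b = np.asarray(desc_b, dtype=float)
        matches = []
        for i, d in enumerate(desc_a):
            dists = np.linalg.norm(desc_b - d, axis=1)
            best, second = np.argsort(dists)[:2]
            if dists[best] < ratio * dists[second]:
                matches.append((i, int(best)))
        return matches

On high-resolution images of repetitive structures, many correct matches fail this test because the second-best distance is nearly as small as the best, which is the SfM issue noted above.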
5.3 Successful Applications
The success of MVS technology has changed the entire shape of the 3D reconstruction industry by replacing laser range sensors, which are costly to build and maintain, with image-based solutions. As a variety of successful MVS algorithms presented high-quality results in research, industrial applications followed with successful implementations of MVS in real products.
The visual effects industry, for example, relies more and more on image-based dense 3D reconstruction, in particular for close-range 3D scanning of objects and human faces [37] (see Figure 5.1). For small-scale object or scene reconstruction, there exist commercial 3D reconstruction systems by major software companies, such as “123D Catch” by Autodesk [31], which can turn a collection of photographs into a 3D model. With the advancement of mobile devices, the software can be used from just a cell phone or a tablet, and its GUI lets anybody enjoy image-based 3D reconstruction capabilities.
Digital mapping is probably the biggest industry in which MVS plays a crucial role; the ultimate goal is to digitally map the entire world in 3D (see Figure 5.2). The Apple Maps “Flyover” feature provides views of high-quality, texture-mapped city building models. Google Maps “Earth View” also provides stunning 3D views of cities.
Figure 5.1: Passive facial motion capture system developed by Beeler, Hahn, Bradley, Bickel, Beardsley, Gotsman, Sumner, and Gross at Disney Research [37]. The left column shows a sample input image of a subject, and the right two columns show reconstructed face mesh models in two different expressions. (Figure courtesy of Beeler et al.)
Microsoft Bing Maps will have a new 3D mode, too. All these maps provide fully interactive exploration and navigation experiences for effective mapping applications. These companies fly planes over cities to take photographs on a massive scale but in a controlled manner, and the photographs are turned into high-quality 3D city models.
The online cloud has become the standard storage for photographs. An ever-growing number of photographs are uploaded, every day and from all over the world, to online photo-sharing websites such as Flickr [7], Panoramio [14], and Picasa [17].
Figure 5.2: The use of MVS techniques in digital mapping products. From top to
bottom, Apple Maps, Google Maps and Bing Maps.
While one can easily access such images, for example via a simple keyword search [23, 64], community photo collections pose additional challenges to 3D reconstruction.
These online photographs are acquired by different cameras, at different times, and under different imaging conditions.
Figure 5.3: Google Maps Photo Tours provide tens of thousands of 3D photo tours of famous landmarks all over the world, with fluid 3D transition renderings. The top row shows such landmarks as red dots across the globe, across Europe, and in one neighborhood of Paris, from left to right. The middle row shows the sequence of views in the photo tour of Sacré-Coeur in Paris. The bottom row is a screenshot of Google Maps showing a photo tour of St. Peter’s Basilica.
Figure 5.4: Acute3D develops the Smart3DCapture™ system for high-quality 3D reconstructions that enable various applications [1].
Figure 5.5: Pix4D is a company that turns thousands of aerial photographs from UAVs into high-quality 3D models [18].
6
Limitations and Future Directions
MVS success stories are never-ending, but there are still numerous settings in which MVS algorithms perform poorly. This chapter discusses the limitations of the current state of the art and unexplored new research directions, and then concludes the article.
6.1 Limitations
The major failure modes of the current MVS algorithms are classified
into the following three types (see Figure 6.1).
Lack of texture: Homogeneous surfaces, which are unfortunately prevalent in indoor environments, pose challenges to MVS algorithms. The Middlebury multi-view stereo benchmark [165] was a surprise to MVS researchers in that many MVS algorithms successfully reconstructed an apparently textureless object, “Dino”, with high accuracy. It turns out that MVS algorithms are able to pick up very weak and intricate image textures, most of which come from shading and/or shadowing effects (the lighting was fixed during the dataset capture). However, these texture cues are weak and delicate, and the images need to be of very high quality.
Figure 6.1: Despite the successes of MVS algorithms, many failure cases still exist. Standard MVS algorithms cannot handle (left) non-Lambertian surfaces such as metallic or transparent objects, (middle) weakly textured surfaces, and (right) thin structures. (Image source: http://www.ikea.com)
6.3 Conclusions
This article provided a tutorial on multi-view stereo, from data collection and algorithmic details to successful applications in industry. MVS has undoubtedly been one of the most successful fields in Computer Vision in the last decade. It enables high-quality 3D reconstruction from a handful of images taken by consumer-grade cameras. In industry, with the explosion of cell phones and mobile devices equipped with cameras, we are capturing millions of photographs every day, all over the world, at an ever-growing pace. All these photos are a rich source of input data for MVS, with the ultimate goal of reconstructing the entire world.
Acknowledgements
We would like to thank Richard Szeliski and Steve Seitz for fruitful discussions and advice at the early stage of writing this article. We would also like to thank all of our colleagues for providing us with images and figures for the article.
References
[65] Simon Fuhrmann and Michael Goesele. Fusion of depth maps with
multiple scales. ACM Transactions on Graphics, 30(6):148:1–148:8, De-
cember 2011.
[66] Simon Fuhrmann and Michael Goesele. Floating scale surface recon-
struction. ACM Transactions on Graphics, 33(4):46, 2014.
[67] Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard
Szeliski. Manhattan-world stereo. In IEEE Conference on Computer
Vision and Pattern Recognition, 2009.
[68] Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard
Szeliski. Reconstructing building interiors from images. In IEEE Inter-
national Conference on Computer Vision, 2009.
[69] Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard
Szeliski. Towards Internet-scale multiview stereo. In IEEE Conference
on Computer Vision and Pattern Recognition, 2010.
[70] Yasutaka Furukawa and Jean Ponce. Carved visual hulls for image-
based modeling. International Journal of Computer Vision, 81(1):53–
67, 2008.
[71] Yasutaka Furukawa and Jean Ponce. Dense 3d motion capture from
synchronized video streams. In IEEE Conference on Computer Vision
and Pattern Recognition, 2008.
[72] Yasutaka Furukawa and Jean Ponce. Accurate camera calibration from
multi-view stereo and bundle adjustment. International Journal of
Computer Vision, 84(3):257–268, September 2009.
[73] Yasutaka Furukawa and Jean Ponce. Dense 3d motion capture for hu-
man faces. In IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 1674–1681. IEEE, 2009.
[74] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multi-
view stereopsis. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 32(8):1362–1376, August 2010.
[75] David Gallup, J-M Frahm, Philippos Mordohai, and Marc Pollefeys.
Variable baseline/resolution stereo. In IEEE Conference on Computer
Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
[76] David Gallup, Jan-Michael Frahm, Philippos Mordohai, Qingxiong
Yang, and Marc Pollefeys. Real-time plane-sweeping stereo with multi-
ple sweeping directions. In IEEE Conference on Computer Vision and
Pattern Recognition, 2007.
[77] David Gallup, Jan-Michael Frahm, and Marc Pollefeys. Piecewise pla-
nar and non-planar stereo for urban scene reconstruction. In IEEE
Conference on Computer Vision and Pattern Recognition, 2010.
[78] P. Gargallo, E. Prados, and P. Sturm. Minimizing the reprojection
error in surface reconstruction from images. In IEEE International
Conference on Computer Vision, pages 1–8, 2007.
[79] M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S.M. Seitz. Multi-
view stereo for community photo collections. In IEEE International
Conference on Computer Vision, pages 1–8, 2007.
[80] Michael Goesele, Brian Curless, and Steven M. Seitz. Multi-view stereo
revisited. In IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 2402–2409, 2006.
[81] Google. Google maps. http://www.google.com/maps/views/u/0/
streetview.
[82] Brian Gough. GNU Scientific Library Reference Manual - Third Edi-
tion. Network Theory Ltd., 3rd edition, 2009.
[83] Markus Gross and Hanspeter Pfister. Point-Based Graphics. Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 2007.
[84] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik.
Learning rich features from RGB-D images for object detection and seg-
mentation. In European Conference on Computer Vision, pages 345–
360. Springer, 2014.
[85] Martin Habbecke and Leif Kobbelt. A surface-growing approach to
multi-view stereo reconstruction. In IEEE Conference on Computer
Vision and Pattern Recognition, 2007.
[86] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik.
Simultaneous detection and segmentation. In European Conference on
Computer Vision, pages 297–312. Springer, 2014.
[87] C. Harris and M. Stephens. A combined corner and edge detector. In
Alvey Vision Conference, pages 147–151, 1988.
[88] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer
Vision. Cambridge University Press, 2004.
[89] Samuel W. Hasinoff and Kiriakos N. Kutulakos. Confocal stereo. In-
ternational Journal of Computer Vision, 81(1):82–104, 2009.
[90] Kaiming He, Jian Sun, and Xiaoou Tang. Guided image filtering. In Eu-
ropean Conference on Computer Vision, ECCV’10, pages 1–14, Berlin,
Heidelberg, 2010.
[148] Carl Olsson, Johannes Ulen, and Yuri Boykov. In defence of 3d-label
stereo. In IEEE Conference on Computer Vision and Pattern Recogni-
tion, 2013.
[149] G. P. Otto and T. K. W. Chau. ‘region-growing’ algorithm for matching
of terrain images. Image Vision Computing, 7(2):83–94, 1989.
[150] Geoffrey Oxholm and Ko Nishino. Multiview shape and reflectance
from natural illumination. In IEEE Conference on Computer Vision
and Pattern Recognition, pages 2163–2170. IEEE, 2014.
[151] Emanuel Parzen. On estimation of a probability density function and
mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
[152] M. Pollefeys, R. Koch, and L. Van Gool. Self-calibration and metric
reconstruction in spite of varying and unknown internal camera param-
eters. International Journal of Computer Vision, 32(1):7–25, 1999.
[153] M. Pollefeys, D. Nister, J.M. Frahm, A. Akbarzadeh, P. Mordohai,
B. Clipp, C. Engels, D. Gallup, S.J. Kim, P. Merrell, et al. Detailed
real-time urban 3d reconstruction from video. International Journal of
Computer Vision, 78(2):143–167, 2008.
[154] Jean-Philippe Pons, Renaud Keriven, and Olivier Faugeras. Multi-
view stereo reconstruction and scene flow estimation with a global
image-based matching score. International Journal of Computer Vi-
sion, 72(2):179–193, April 2007.
[155] Soren Ragsdale. Airplane prop + cmos rolling shutter, 2009.
[156] X. Ren and J. Malik. Learning a classification model for segmentation.
In IEEE International Conference on Computer Vision, 2003.
[157] Christian Richardt, Douglas Orr, Ian Davies, Antonio Criminisi, and
Neil A. Dodgson. Real-time spatiotemporal stereo matching using the
dual-cross-bilateral grid. In Kostas Daniilidis, Petros Maragos, and
Nikos Paragios, editors, European Conference on Computer Vision, vol-
ume 6313 of Lecture Notes in Computer Science, pages 510–523,
September 2010.
[158] Sébastien Roy and Ingemar J. Cox. A maximum-flow formulation of
the n-camera stereo correspondence problem. In IEEE International
Conference on Computer Vision, 1998.
[159] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In IEEE International Conference on Computer Vision, pages 2564–2571, November 2011.
[160] Szymon Rusinkiewicz and Marc Levoy. Qsplat: A multiresolution point
rendering system for large meshes. In ACM SIGGRAPH, 2000.
[161] S. Savarese, M. Chen, and P. Perona. Local shape from mirror reflec-
tions. International Journal of Computer Vision, 64(1):31–67, 2005.
[162] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense
two-frame stereo correspondence algorithms. International Journal of
Computer Vision, 47(1/2/3):7–42, 2002.
[163] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of
dense two-frame stereo correspondence algorithms. International Jour-
nal of Computer Vision, 47:7–42, 2002.
[164] Christopher Schroers, Henning Zimmer, Levi Valgaerts, Andrés Bruhn,
Oliver Demetz, and Joachim Weickert. Anisotropic range image inte-
gration. In DAGM/OAGM Symposium’12, pages 73–82, 2012.
[165] Steven M. Seitz, Brian Curless, James Diebel, Daniel Scharstein, and
Richard Szeliski. A comparison and evaluation of multi-view stereo
reconstruction algorithms. In IEEE Conference on Computer Vision
and Pattern Recognition, pages 519–528, 2006.
[166] Steven M. Seitz and Charles R. Dyer. Photorealistic scene reconstruc-
tion by voxel coloring. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 1067–, Washington, DC, USA, 1997. IEEE
Computer Society.
[167] Ben Semerjian. A new variational framework for multiview surface re-
construction. In European Conference on Computer Vision, pages 719–
734. Springer, 2014.
[168] Qi Shan, Changchang Wu, Brian Curless, Yasutaka Furukawa, Carlos
Hernandez, and Steven M Seitz. Accurate geo-registration by ground-
to-aerial image matching. In 3D Vision (3DV), 2014 2nd International
Conference on, volume 1, pages 525–532. IEEE, 2014.
[169] S.N. Sinha, P. Mordohai, and M. Pollefeys. Multi-view stereo via graph
cuts on the dual of an adaptive tetrahedral mesh. In IEEE International
Conference on Computer Vision, pages 1–8, 2007.
[170] Sudipta Sinha, Drew Steedly, and Richard Szeliski. Piecewise planar
stereo for image-based rendering. In IEEE International Conference on
Computer Vision, 2009.
[171] Sudipta N Sinha and Marc Pollefeys. Multi-view reconstruction using
photo-consistency and exact silhouette constraints: A maximum-flow
formulation. In IEEE International Conference on Computer Vision,
2005.
[184] R.Y. Tsai. Multiframe image point matching and 3-d surface reconstruc-
tion. IEEE Transactions on Pattern Analysis and Machine Intelligence,
PAMI-5(2):159–174, March 1983.
[185] Ali Osman Ulusoy, Octavian Biris, and Joseph L Mundy. Dynamic
probabilistic volumetric models. In IEEE International Conference on
Computer Vision, pages 505–512. IEEE, 2013.
[186] Ali Osman Ulusoy and Joseph L Mundy. Image-based 4-d reconstruc-
tion using 3-d change detection. In European Conference on Computer
Vision, 2014.
[187] Julien PC Valentin, Sunando Sengupta, Jonathan Warrell, Ali
Shahrokni, and Philip HS Torr. Mesh based semantic modelling for
indoor and outdoor scenes. In IEEE Conference on Computer Vision
and Pattern Recognition, pages 2067–2074. IEEE, 2013.
[188] Yuriy Vasilyev, Todd Zickler, Steven Gortler, and Ohad Ben-Shahar.
Shape from specular flow: Is one flow enough? In IEEE Conference on
Computer Vision and Pattern Recognition, pages 2561–2568, 2011.
[189] P. Viola and W. M. Wells III. Alignment by maximization of mutual
information. In IEEE International Conference on Computer Vision,
pages 16–23, 1995.
[190] G. Vogiatzis, C. Hernández, P. H S Torr, and R. Cipolla. Multi-
view stereo via volumetric graph-cuts and occlusion robust photo-
consistency. IEEE Transactions on Pattern Analysis and Machine In-
telligence, 29(12):2241–2246, 2007.
[191] George Vogiatzis, P.H.S. Torr, and Roberto Cipolla. Multi-view stereo
via volumetric graph-cuts. In IEEE Conference on Computer Vision
and Pattern Recognition, 2005.
[192] George Vogiatzis and Carlos Hernández. Automatic camera pose esti-
mation from dot pattern. http://george-vogiatzis.org/calib/.
[193] H-H. Vu, P. Labatut, R. Keriven, and J.-P Pons. High accuracy and
visibility-consistent dense multi-view stereo. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, 34(5):889–901, May 2012.
[194] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. Map
estimation via agreement on trees: message-passing and linear program-
ming. IEEE Transactions on Information Theory, 51:2005, 2005.
[195] Sven Wanner and Bastian Goldluecke. Spatial and angular variational
super-resolution of 4d light fields. In European Conference on Computer
Vision, pages 608–621, 2012.
References 153