Computer Vision For Visual Effects PDF
RICHARD J. RADKE
Rensselaer Polytechnic Institute
Cambridge University Press
Cambridge, New York, Melbourne, Madrid, Cape Town,
Singapore, São Paulo, Delhi, Mexico City
www.cambridge.org
Information on this title: www.cambridge.org/9780521766876
A catalog record for this publication is available from the British Library.
Cambridge University Press has no responsibility for the persistence or accuracy of URLs
for external or third-party Internet Web sites referred to in this publication and does not
guarantee that any content on such Web sites is, or will remain, accurate or appropriate.
You’re here because we want the best and you are it.
So, who is ready to make some science?
– Cave Johnson
Contents
1 Introduction
1.1 Computer Vision for Visual Effects
1.2 This Book’s Organization
1.3 Background and Prerequisites
1.4 Acknowledgments
2 Image Matting
2.1 Matting Terminology
2.2 Blue-Screen, Green-Screen, and Difference Matting
2.3 Bayesian Matting
2.4 Closed-Form Matting
2.5 Markov Random Fields for Matting
2.6 Random-Walk Methods
2.7 Poisson Matting
2.8 Hard-Segmentation-Based Matting
2.9 Video Matting
2.10 Matting Extensions
2.11 Industry Perspectives
2.12 Notes and Extensions
2.13 Homework Problems
6 Matchmoving
6.1 Feature Tracking for Matchmoving
6.2 Camera Parameters and Image Formation
6.3 Single-Camera Calibration
6.4 Stereo Rig Calibration
6.5 Image Sequence Calibration
6.6 Extensions of Matchmoving
6.7 Industry Perspectives
6.8 Notes and Extensions
6.9 Homework Problems
Bibliography
Index
1 Introduction
43 of the top 50 films of all time are visual effects driven. Today, visual effects are
the “movie stars” of studio tent-pole pictures — that is, visual effects make con-
temporary movies box office hits in the same way that big name actors ensured the
success of films in the past. It is very difficult to imagine a modern feature film or TV
program without visual effects.
The Visual Effects Society, 2011
Neo fends off dozens of Agent Smith clones in a city park. Kevin Flynn confronts a
thirty-years-younger avatar of himself in the Grid. Captain America’s sidekick rolls
under a speeding truck in the nick of time to plant a bomb. Nightcrawler “bamfs” in
and out of rooms, leaving behind a puff of smoke. James Bond skydives at high speed
out of a burning airplane. Harry Potter grapples with Nagini in a ramshackle cottage.
Robert Neville stalks a deer in an overgrown, abandoned Times Square. Autobots
and Decepticons battle it out in the streets of Chicago. Today’s blockbuster movies
so seamlessly introduce impossible characters and action into real-world settings that
it’s easy for the audience to suspend its disbelief. These compelling action scenes are
made possible by modern visual effects.
Visual effects, the manipulation and fusion of live and synthetic images, have
been a part of moviemaking since the first short films were made in the 1900s. For
example, beginning in the 1920s, fantastic sets and environments were created using
huge, detailed paintings on panes of glass placed between the camera and the actors.
Miniature buildings or monsters were combined with footage of live actors using
forced perspective to create photo-realistic composites. Superheroes flew across the
screen using rear-projection and blue-screen replacement technology.
These days, almost all visual effects involve the manipulation of digital and
computer-generated images instead of in-camera, practical effects. Filmgoers over
the past forty years have experienced the transition from the mostly analog effects of
movies like The Empire Strikes Back to the early days of computer-generated imagery
in movies like Terminator 2: Judgment Day to the almost entirely digital effects of
movies like Avatar. While they’re often associated with action and science fiction
movies, visual effects are now so common that they’re imperceptibly incorporated
into virtually all TV series and movies — even medical shows like Grey’s Anatomy and
period dramas like Changeling.
Like all forms of creative expression, visual effects have both an artistic side and
a technological side. On the artistic side are visual effects artists: extremely tal-
ented (and often underappreciated) professionals who expertly manipulate software
packages to create scenes that support a director’s vision. They’re attuned to the film-
making aspects of a shot such as its composition, lighting, and mood. In the middle
are the creators of the software packages: artistically minded engineers at companies
like The Foundry, Autodesk, and Adobe who create tools like Nuke, Maya, and After
Effects that the artists use every day. On the technological side are researchers, mostly
in academia, who conceive, prototype, and publish new algorithms, some of which
eventually get incorporated into the software packages. Many of these algorithms are
from the field of computer vision, the main subject of this book.
Computer vision broadly involves the research and development of algorithms
for automatically understanding images. For example, we may want to design an
algorithm to automatically outline people in a photograph, a job that’s easy for a
human but that can be very difficult for a computer. In the past forty years, computer
vision has made great advances. Today, consumer digital cameras can automatically
identify whether all the people in an image are facing forward and smiling, and smart-
phone camera apps can read bar codes, translate images of street signs and menus,
and identify tourist landmarks. Computer vision also plays a major role in image
analysis problems in medical, surveillance, and defense applications. However, the
application in which the average person most frequently comes into contact with the
results of computer vision — whether he or she knows it or not — is the generation
of visual effects in film and television production.
To understand the types of computer vision problems that are “under the hood”
of the software packages that visual effects artists commonly use, let’s consider a
scene of a human actor fighting a computer-generated creature (for example, Rick
O’Connell vs. Imhotep, Jack Sparrow vs. Davy Jones, or Kate Austen vs. The Smoke
Monster). First, the hero actor is filmed on a partially built set interacting with a
stunt performer who plays the role of the enemy. The built set must be digitally
extended to a larger environment, with props and furniture added and removed after
the fact. The computer-generated enemy’s actions may be created with the help of
the motion-captured performance of a second stunt performer in a separate location.
Next, the on-set stunt performer is removed from the scene and replaced by the digital
character. This process requires several steps: the background pixels behind the stunt
performer need to be recreated, the camera’s motion needs to be estimated so that
the digital character appears in the right place, and parts of the real actor’s body need
to appropriately pass in front of and behind the digital character as they fight. Finally,
the fight sequence may be artificially slowed down or sped up for dramatic effect. All
of the elements in the final shot must seamlessly blend so they appear to “live” in the
same frame, without any noticeable visual artifacts. This book describes many of the
algorithms critical for each of these steps and the principles behind them.
This book, Computer Vision for Visual Effects, explores the technological side of visual
effects, and has several goals:
Finally, while this book uses Hollywood movies as its motivation, not every visual
effects practitioner is working on a blockbuster film with a looming release date
and a rigid production pipeline. It’s easier than ever for regular people to acquire and
manipulate their own high-quality digital images and video. For example, an amateur
filmmaker can now buy a simple green screen kit for a few hundred dollars, down-
load free programs for image manipulation (e.g., GIMP or IrfanView) and numerical
computation (e.g., Python or Octave), and use the algorithms described in this book
to create compelling effects at home on a desktop computer.
Each chapter in this book covers a major topic in visual effects. In many cases, we
can deal with a video sequence as a series of “flat” 2D images, without reference to
the three-dimensional environment that produced them. However, some problems
require a more precise knowledge of where the elements in an image are located in a
3D environment. The book begins with the topics for which 2D image processing is
sufficient, and moves to topics that require 3D understanding.
We begin with the pervasive problem of image matting — that is, the separation
of a foreground element from its background (Chapter 2). The background could be
a blue or green screen, or it could be a real-world natural scene, which makes the
problem much harder. A visual effects artist may semiautomatically extract the fore-
ground from an image sequence using an algorithm for combining its color channels,
or the artist may have to manually outline the foreground element frame by frame.
In either case, we need to produce an alpha matte for the foreground element that
indicates the amount of transparency in challenging regions containing wisps of hair
or motion blur.
Next, we discuss many problems involving image compositing and editing, which
refer to the manipulation of a single image or the combination of multiple images
(Chapter 3). In almost every frame of a movie, elements from several different sources
need to be merged seamlessly into the same final shot. Wires and rigging that support
stunt performers must be removed without leaving perceptible artifacts. Removing
a very large object may require the visual effects artist to create complex, realistic
texture that was never observed by any camera, but that moves undetectably along
with the real background. The aspect ratio or size of an image may also need to be
changed for some shots (for example, to view a wide-aspect-ratio film on an HDTV or
mobile device).
We then turn our attention to the detection, description, and matching of image
features, which visual effects artists use to associate the same point in different views
of a scene (Chapter 4). These features are usually corners or blobs of different sizes.
Our strategy for reliably finding and describing features depends on whether the
images are closely separated in space and time (such as adjacent frames of video
spaced a fraction of a second apart) or widely separated (such as “witness” cameras
that observe a set from different perspectives). Visual effects artists on a movie set also
commonly insert artificial markers into the environment that can be easily recognized
in post-production.
Each chapter also includes several homework problems. The goal of each problem
is to verify understanding of a basic concept, to understand and apply a formula,
or to fill in a derivation skipped in the main text. Most of these problems involve
simple linear algebra and calculus as a means to exercise these important muscles
in the service of a real computer vision scenario. Often, the derivations, or at least a
start on them, are found in one of the papers referenced in the chapter. On the other
hand, this book doesn’t have any problems like “implement algorithm X,” although
it should be easy for an instructor to specify programming assignments based on
the material in the main text. The emphasis here is on thoroughly understand-
ing the underlying mathematics, from which writing good code should (hopefully)
follow.
As a companion to the book, the website cvfxbook.com will be continually
updated with links and commentary on new visual effects algorithms from academia
and industry, examples from behind the scenes of television and films, and demo
reels from visual effects artists and companies.
This book assumes the reader has a basic understanding of linear algebra, such as
setting up a system of equations as a matrix-vector product and solving systems
of overdetermined equations using linear least-squares. These key concepts occur
repeatedly throughout the book. Less frequently, we refer to the eigenvalues and
eigenvectors of a square matrix, the singular value decomposition, and matrix proper-
ties like positive definiteness. Strang’s classic book [469] is an excellent linear algebra
reference.
We also make extensive use of vector calculus, such as forming a Taylor series
and taking the partial derivatives of a function with respect to a vector of parameters
and setting them equal to zero to obtain an optimum. We occasionally mention
continuous partial differential equations, most of the time en route to a specific
discrete approximation. We also use basic concepts from probability and statistics
such as mean, covariance, and Bayes’ rule.
Finally, the reader should have working knowledge of standard image process-
ing concepts such as viewing images as grids of pixels, computing image gradients,
creating filters for edge detection, and finding the boundary of a binary set of pixels.
On the other hand, this book doesn’t assume a lot of prior knowledge about com-
puter vision. In fact, visual effects applications form a great backdrop for learning
about computer vision for the first time. The book introduces computer vision con-
cepts and algorithms naturally as needed. The appendixes include details on the
implementation of several algorithms common to many visual effects problems,
including dynamic programming, graph-cut optimization, belief propagation, and
numerical optimization. Most of the time, the sketches of the algorithms should
enable the reader to create a working prototype. However, not every nitty-gritty
implementation detail is provided, so many references are given to the original
research papers.
1.4 ACKNOWLEDGMENTS
I wrote most of this book during the 2010-11 academic year while on sabbatical
from the Department of Electrical, Computer, and Systems Engineering at Rensselaer
Polytechnic Institute. Thanks to Kim Boyer, David Rosowsky, and Robert Palazzo for
their support. Thanks to my graduate students at the time — Eric Ameres, Siqi Chen,
David Doria, Linda Rivera, and Ziyan Wu — for putting up with an out-of-the-office
advisor for a year.
Many thanks to the visual effects artists and practitioners who generously shared
their time and expertise with me during my trip to Los Angeles in June 2011. At LOOK
Effects, Michael Capton, Christian Cardona, Jenny Foster, David Geoghegan, Buddy
Gheen, Daniel Molina, and Gabriel Sanchez. At Rhythm & Hues, Shish Aikat, Peter
Huang, and Marty Ryan. At Cinesite, Shankar Chatterjee. At Digital Domain, Nick
Apostoloff, Thad Beier, Paul Lambert, Rich Marsh, Som Shankar, Blake Sloan, and
Geoff Wedig. In particular, thanks to Doug Roble at Digital Domain for taking so much
time to discuss his experiences and structure my visit. Special thanks to Pam Hogarth
at LOOK Effects and Tim Enstice at Digital Domain for organizing my trip. Extra
special thanks to Steve Chapman at Gentle Giant Studios for his hospitality during
my visit, detailed comments on Chapter 8, and many behind-the-scenes images of
3D scanning.
This book contains many behind-the-scenes images from movies, which wouldn’t
have been possible without the cooperation and permission of several people. Thanks
to Andy Bandit at Twentieth Century Fox, Eduardo Casals and Shirley Manusiwa
at adidas International Marketing, Steve Chapman at Gentle Giant Studios, Erika
Denton at Marvel Studios, Tim Enstice at Digital Domain, Alexandre Lafortune at
Oblique FX, Roni Lubliner at NBC/Universal, Larry McCallister and Ashelyn Valdez
at Paramount Pictures, Regan Pederson at Summit Entertainment, Don Shay at Cine-
fex, and Howard Schwartz at Muhammad Ali Enterprises. Thanks also to Laila Ali,
Muhammad Ali, Russell Crowe, Jake Gyllenhaal, Tom Hiddleston, Ken Jeong, Dar-
ren Kendrick, Shia LaBeouf, Isabel Lucas, Michelle Monaghan, and Andy Serkis for
approving the use of their likenesses.
At RPI, thanks to Jon Matthis for his time and assistance with my trip to the motion
capture studio, and to Noah Schnapp for his character rig. Many thanks to the stu-
dents in my fall 2011 class “Computer Vision for Visual Effects” for commenting
on the manuscript, finding errors, and doing all of the homework problems: Nimit
Dhulekar, David Doria, Tian Gao, Rana Hanocka, Camilo Jimenez Cruz, Daniel Kruse,
Russell Lenahan, Yang Li, Harish Raviprakash, Jason Rock, Chandroutie Sankar, Evan
Sullivan, and Ziyan Wu.
Thanks to Lauren Cowles, David Jou, and Joshua Penney at Cambridge University
Press and Bindu Vinod at Newgen Publishing and Data Services for their support and
assistance over the course of this book’s conception and publication. Thanks to Alice
Soloway for designing the book cover.
Special thanks to Aaron Hertzmann for many years of friendship and advice,
detailed comments on the manuscript, and for kindling my interest in this area.
Thanks also to Bristol-Myers Squibb for developing Excedrin, without which this
book would not have been possible.
During the course of writing this book, I have enjoyed interactions with Sterling
Archer, Pierre Chang, Phil Dunphy, Lester Freamon, Tony Harrison, Abed Nadir, Kim
Pine, Amelia Pond, Tim Riggins, Ron Swanson, and Malcolm Tucker.
Thanks to my parents for instilling in me interests in both language and engi-
neering (but also an unhealthy perfectionism). Above all, thanks to Sibel, my partner
in science, for her constant support, patience, and love over the year and a half
that this book took over my life and all the flat surfaces in our house. This book is
dedicated to her.
Separating a foreground element of an image from its background for later com-
positing into a new scene is one of the most basic and common tasks in visual effects
production. This problem is typically called matting or pulling a matte when applied
to film, or keying when applied to video.1 At its humblest level, local news stations
insert weather maps behind meteorologists who are in fact standing in front of a
green screen. At its most difficult, an actor with curly or wispy hair filmed in a com-
plex real-world environment may need to be digitally removed from every frame of a
long sequence.
Image matting is probably the oldest visual effects problem in filmmaking, and the
search for a reliable automatic matting system has been ongoing since the early 1900s
[393]. In fact, the main goal of Lucasfilm’s original Computer Division (part of which
later spun off to become Pixar) was to create a general-purpose image processing
computer that natively understood mattes and facilitated complex compositing [375].
A major research milestone was a family of effective techniques for matting against a
blue background developed in the Hollywood effects industry throughout the 1960s
and 1970s. Such techniques have matured to the point that blue- and green-screen
matting is involved in almost every mass-market TV show or movie, even hospital
shows and period dramas.
On the other hand, putting an actor in front of a green screen to achieve an effect
isn’t always practical or compelling, and situations abound in which the foreground
must be separated from the background in a natural image. For example, movie
credits are often inserted into real scenes so that actors and foreground objects
seem to pass in front of them, a combination of image matting, compositing, and
matchmoving. The computer vision and computer graphics communities have only
recently proposed methods for semi-automatic matting with complex foregrounds
and real-world backgrounds. This chapter focuses mainly on these kinds of algo-
rithms for still-image matting, which are still not a major part of the commercial
visual effects pipeline since effectively applying them to video is difficult. Unfortu-
nately, video matting today requires a large amount of human intervention. Entire
teams of rotoscoping artists at visual effects companies still require hours of tedious
work to produce the high-quality mattes used in modern movies.
1 The computer vision and graphics communities typically refer to the problem as matting, even
though the input is always digital video.
where α(x, y) is a number in [0, 1]. That is, the color at (x, y) in I is a mix between the
colors at the same position in F and B, where α(x, y) specifies the relative proportion
of foreground versus background. If α(x, y) is close to 0, the pixel gets almost all of its
color from the background, while if α(x, y) is close to 1, the pixel gets almost all of its
color from the foreground. Figure 2.1 illustrates the idea. We frequently abbreviate
Equation (2.1) to
I = αF + (1 − α)B (2.2)
with the understanding that all the variables depend on the pixel location (x, y). Since
α is a function of (x, y), we can think of it like a grayscale image, which is often called
a matte, alpha matte, or alpha channel. Therefore, in the matting problem, we are
given the image I and want to obtain the images F, B, and α.
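As a concrete illustration of Equation (2.2), the following minimal sketch applies the matting equation per pixel with NumPy. The function name `composite` and the toy one-row image are assumptions made for this example, not part of the text.

```python
import numpy as np

def composite(alpha, F, B):
    """Apply the matting equation I = alpha*F + (1 - alpha)*B at every pixel.

    alpha: (H, W) floats in [0, 1]; F, B: (H, W, 3) float RGB images.
    """
    a = alpha[..., np.newaxis]  # broadcast alpha over the color channels
    return a * F + (1.0 - a) * B

# Toy 1x3 image: pure background, an even mix, and pure foreground.
F = np.tile(np.array([1.0, 0.0, 0.0]), (1, 3, 1))  # red foreground
B = np.tile(np.array([0.0, 0.0, 1.0]), (1, 3, 1))  # blue background
alpha = np.array([[0.0, 0.5, 1.0]])
I = composite(alpha, F, B)
```

The matting problem runs this computation in reverse: only `I` is observed, and `alpha`, `F`, and `B` must be estimated.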
At first, it may seem like α(x, y) should always be either 0 (that is, the pixel is entirely
background) or 1 (that is, the pixel is entirely foreground). However, this isn’t the case
for real images, especially around the edges of foreground objects. The main reason
is that the color of a pixel in a digital image comes from the total light intensity falling
on a finite area of a sensor; that is, each pixel contains contributions from many real-
world optical rays. In lower resolution images, it’s likely that some scene elements
project to regions smaller than a pixel on the image sensor. Therefore, the sensor area
receives some light rays from the foreground object and some from the background.
Even high resolution digital images (i.e., ones in which a pixel corresponds to a very
small sensor area) contain fractional combinations of foreground and background
in regions like wisps of hair. Fractional values of α are also generated by motion
of the camera or foreground object, focal blur induced by the camera aperture, or
Figure 2.1. An illustration of the matting equation I = αF + (1 − α)B. When α is 0, the image
pixel color comes from the background, and when α is 1, the image pixel color comes from the
foreground.
Figure 2.2. Image segmentation is not the same as image matting. (a) An original image, in
which the foreground object has fuzzy boundaries. (b) (top) binary and (bottom) continuous alpha
mattes for the foreground object. (c) Composites of the foreground onto a different background
using the mattes. The hard-segmented result looks bad due to incorrect pixel mixing at the soft
edges of the object, while using the continuous alpha matte results in an image with fewer visual
artifacts. (d) Details of the composites in (c).
Figure 2.3. The matting problem can’t be uniquely solved. The three (alpha, foreground, back-
ground) combinations at right are all mathematically consistent with the image at left. The
bottom combination is most similar to what a human would consider a natural matte.
in seven unknowns at each pixel (the RGB values of F and B as well as the mixing pro-
portion α). One result of this ambiguity is that for any values of I and a user-specified
value of F, we can find values for B and α that satisfy Equation (2.2), as illustrated in
Figure 2.3. Clearly, we need to supply a matting algorithm with additional assump-
tions or guides in order to recover mattes that agree with human perception about
how a scene should be separated. For example, as we will see in the next section,
the assumption that the background is known (e.g., it is a constant blue or green),
removes some of the ambiguity. However, this chapter focuses on methods in which
the background is complex and unknown and there is little external information other
than a few guides specified by the user.
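The ambiguity can be checked numerically on a single pixel. In this small sketch, the pixel value and the three candidate solutions are made-up numbers chosen for illustration; each candidate reproduces the same observed color under Equation (2.2).

```python
import numpy as np

I = np.array([0.6, 0.4, 0.2])  # one observed RGB pixel (made-up values)

def mix(alpha, F, B):
    """Right-hand side of Equation (2.2) for a single pixel."""
    return alpha * F + (1.0 - alpha) * B

# Candidate 1: the pixel is entirely background, so F can be anything.
sol1 = (0.0, np.array([1.0, 1.0, 1.0]), I.copy())
# Candidate 2: the pixel is entirely foreground, so B can be anything.
sol2 = (1.0, I.copy(), np.array([0.0, 0.0, 0.0]))
# Candidate 3: choose alpha and F freely, then solve Equation (2.2) for B.
alpha3, F3 = 0.5, np.array([0.8, 0.6, 0.4])
B3 = (I - alpha3 * F3) / (1.0 - alpha3)
sol3 = (alpha3, F3, B3)

all_consistent = all(np.allclose(mix(*s), I) for s in (sol1, sol2, sol3))
```

All three triples are mathematically valid, which is why a matting algorithm needs extra assumptions or user guidance to pick among them.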
In modern matting algorithms, these additional guides frequently take one of two
forms. The first is a trimap, defined as a coarse segmentation of the input image into
regions that are definitely foreground (F), definitely background (B), or unknown
(U). This segmentation can be visualized as an image with white foreground, black
background, and gray unknown regions (Figure 2.4b). An extreme example of a trimap
is a garbage matte, a roughly drawn region that only specifies certain background B
and assumes the rest of the pixels are unknown. An alternative is a set of scribbles,
which can be quickly sketched by a user to specify pixels that are definitely foreground
and definitely background (Figure 2.4c). Scribbles are generally easier for a user to
create, since not every pixel of the original image needs to be explicitly labeled. On
the other hand, the matting algorithm must determine α for a much larger number of
pixels. Both trimaps and scribbles can be created using a painting program like GIMP
or Adobe Photoshop.
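In code, a trimap painted with the black/gray/white convention described above can be split into three boolean masks. This is a minimal sketch; the gray levels 0, 128, and 255 and the function name `split_trimap` are assumptions for illustration.

```python
import numpy as np

# Hypothetical gray-level convention: black = background, white = foreground.
BG, FG = 0, 255

def split_trimap(trimap):
    """Return boolean masks (fg, bg, unknown) from a uint8 trimap image."""
    fg = trimap == FG
    bg = trimap == BG
    unknown = ~(fg | bg)  # any gray value is treated as unknown
    return fg, bg, unknown

trimap = np.array([[0, 0, 128, 255],
                   [0, 128, 255, 255]], dtype=np.uint8)
fg, bg, unknown = split_trimap(trimap)
```

A matting algorithm would then fix α = 1 on `fg`, α = 0 on `bg`, and estimate α only on `unknown`.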
As mentioned earlier, matting usually precedes compositing, in which an esti-
mated matte is used to place a foreground element from one image onto the
background of another. That is, we estimate α, F, and B from image I, and want
to place F on top of a new background image B̂ to produce the composite Î . The
corresponding compositing equation is:
Î = αF + (1 − α)B̂ (2.3)
Figure 2.4. Several examples of natural images, user-drawn trimaps, and user-drawn scribbles.
(a) The original images. (b) Trimaps, in which black pixels represent certain background, white
pixels represent certain foreground, and gray pixels represent the unknown region for which
fractional α values need to be estimated. (c) Scribbles, in which black scribbles denote back-
ground pixels, and white scribbles denote foreground regions. α must be estimated for the rest
of the image pixels.
No matter what the new background image is, the foreground element F
always appears in Equation (2.3) in the form αF . Therefore, the foreground image
and estimated α matte are often stored together in the pre-multiplied form
(αF_r, αF_g, αF_b, α), to save multiplications in later compositing operations [373].
We’ll talk more about the compositing process in the context of image editing in
Chapter 3.
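The pre-multiplied representation can be sketched as follows. The function names and the small test image are assumptions for this example; the point is that compositing onto any new background reuses the stored αF without multiplying by α again.

```python
import numpy as np

def premultiply(alpha, F):
    """Pack (alpha*F_r, alpha*F_g, alpha*F_b, alpha) into an (H, W, 4) array."""
    a = alpha[..., np.newaxis]
    return np.concatenate([a * F, a], axis=-1)

def composite_premultiplied(premult, new_B):
    """Equation (2.3) with stored alpha*F: I_hat = alpha*F + (1 - alpha)*B_hat."""
    aF, a = premult[..., :3], premult[..., 3:]
    return aF + (1.0 - a) * new_B

alpha = np.array([[0.25, 1.0]])
F = np.array([[[0.8, 0.4, 0.0], [0.2, 0.2, 0.2]]])
new_B = np.ones((1, 2, 3))  # white replacement background
I_hat = composite_premultiplied(premultiply(alpha, F), new_B)
```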
The most important special case of matting is the placement of a blue or green screen
behind the foreground to be extracted, which is known as chromakey. The shades
of blue and green are selected to have little overlap with human skin tones, since in
filmmaking the foreground usually contains actors. Knowing the background color
also reduces the number of degrees of freedom in Equation (2.2), so we only have
four unknowns to determine at each pixel instead of seven.
Figure 2.5. Blue-screen matting using Equation (2.4) with a1 = 12 and a2 = 1. We can see several
errors in the estimated mattes, including in the interiors of foreground objects and the boundaries
of fine structures.
Vlahos [518] proposed many of the early heuristics for blue-screen matting; one
proposed solution was to set
$$\alpha = 1 - a_1 (I_b - a_2 I_g) \tag{2.4}$$
where I_b and I_g are the blue and green channels of the image normalized to the range
[0, 1], and a_1 and a_2 are user-specified tuning parameters. The resulting α values are
clipped to [0, 1]. The general idea is that when a pixel has much more blue than green,
α should be close to 0 (e.g., a pure blue pixel is very likely to be background but a
pure white pixel isn’t). However, this approach only works well for foreground pixels
with certain colors and doesn’t have a strong mathematical basis. For example, we
can see in Figure 2.5 that applying Equation (2.4) results in a matte with several visual
artifacts that would need to be cleaned up by hand.
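Equation (2.4) is simple enough to try directly. The sketch below assumes an RGB channel order and uses illustrative parameter values (a1 = a2 = 1), not values recommended by the text.

```python
import numpy as np

def vlahos_alpha(image, a1, a2):
    """alpha = 1 - a1 * (I_b - a2 * I_g), clipped to [0, 1].

    image: (H, W, 3) float RGB array with channels normalized to [0, 1].
    """
    I_g, I_b = image[..., 1], image[..., 2]
    return np.clip(1.0 - a1 * (I_b - a2 * I_g), 0.0, 1.0)

pixels = np.array([[[0.0, 0.0, 1.0],    # pure blue
                    [1.0, 1.0, 1.0],    # pure white
                    [0.9, 0.6, 0.5]]])  # a warm, skin-like tone
alpha = vlahos_alpha(pixels, a1=1.0, a2=1.0)  # illustrative parameter values
```

With these values the pure blue pixel gets α = 0 and the white and skin-like pixels get α = 1, matching the intuition described above. A bluish foreground pixel, however, would be wrongly pushed toward the background.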
In general, when the background is known, Equation (2.2) corresponds to three
equations at each pixel (one for each color channel) in four unknowns (the fore-
ground color F and the α value). If we had at least one more consistent equation, we
could solve the equations for the unknowns exactly. Smith and Blinn [458] suggested
several special cases that correspond to further constraints — for example, that the
foreground is known to contain no blue or to be a shade of gray — and showed how
these special cases resulted in formulae for α similar to Equation (2.4). However, the
special cases are still fairly restrictive.
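As a worked instance of the "no blue" case, assume the foreground has zero blue component (F_b = 0) and the known background has blue value B_b > 0. These symbols follow Equation (2.2); the derivation below is a sketch under those assumptions. The blue channel alone then determines α:

```latex
\begin{aligned}
I_b &= \alpha F_b + (1 - \alpha) B_b = (1 - \alpha) B_b \\
\Rightarrow\quad \alpha &= 1 - \frac{I_b}{B_b}
\end{aligned}
```

This has the same form as Equation (2.4) with a_1 = 1/B_b and a_2 = 0. Once α is known, the remaining two channel equations can be solved for the red and green components of F wherever α > 0.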
Blue-screen and green-screen matting are related to a common image processing
technique called background subtraction or change detection [379]. In the visual
effects world, the idea is called difference matting and is a common approach when
a blue or green screen is not practical or available. We first take a picture of the empty
background (sometimes known as a clean plate) B, perhaps before a scene is filmed.
We then compare the clean plate to the composite image I given by Equation (2.2).
It seems reasonable that pixels of I whose color differs substantially from B can be
classified as parts of the foreground. Figure 2.6 shows an example in which pixels with
‖I − B‖ greater than a threshold are labeled as foreground pixels with α = 1. However,
Figure 2.6. Difference matting. The difference between the image with foreground (a) and clean
plate (b) can be thresholded to get a hard segmentation (c). Even prior to further estimation of
fractional α values, the rough matte has many tiny errors in places where the foreground and
background have similar colors.
Figure 2.7. (a),(b) Static objects are photographed in front of two backgrounds that differ in
color at every pixel (here, two solid-color backgrounds). (c) Triangulation produces a high-quality
matte. (d) Detail of matte.
since there are still three equations in four unknowns, the matte and foreground
image can’t be determined unambiguously. In particular, since the clean plate may
contain colors similar to the foreground, mattes created in this way are likely to
contain more errors than mattes created using blue or green screens.
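The thresholding step of difference matting can be sketched in a few lines. The Euclidean color distance, the threshold value, and the toy pixels are assumptions chosen for illustration.

```python
import numpy as np

def difference_matte(I, clean_plate, threshold):
    """Hard segmentation: alpha = 1 where the color distance exceeds the threshold."""
    dist = np.linalg.norm(I - clean_plate, axis=-1)  # per-pixel color distance
    return (dist > threshold).astype(float)

clean_plate = np.full((1, 3, 3), 0.5)  # uniform gray background
I = clean_plate.copy()
I[0, 1] = [0.9, 0.1, 0.1]    # foreground pixel that clearly differs
I[0, 2] = [0.52, 0.5, 0.48]  # foreground pixel that resembles the background
alpha = difference_matte(I, clean_plate, threshold=0.1)
```

The third pixel is missed because its color is close to the clean plate, which is exactly the failure mode described above.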
Smith and Blinn observed that if the foreground F was photographed in front of
two different backgrounds B_1 and B_2, producing images I_1 and I_2, we would have six
equations in four unknowns:
$$\begin{aligned} I_1 &= \alpha F + (1 - \alpha) B_1 \\ I_2 &= \alpha F + (1 - \alpha) B_2 \end{aligned} \tag{2.5}$$
difficult to obtain exact knowledge of each background image, to ensure that these
don’t change, and to ensure that F is exactly the same (both in terms of intensity and
position) in front of both backgrounds. Therefore, triangulation is typically limited
to extremely controlled circumstances (for example, a static object in a lab setting).
If Equation (2.5) does not hold exactly due to differences in F and α between back-
grounds or incorrect values of B, the results will be poor. For example, we can see
slight errors in the toy example in Figure 2.7 due to “spill” from the background onto
the foreground, and slight ghosting in the nest example due to tiny registration errors.
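One way to see why the two-background setup of Equation (2.5) is solvable: subtracting the two equations cancels αF and leaves I_1 − I_2 = (1 − α)(B_1 − B_2), which can be solved for α in the least-squares sense at each pixel. The sketch below follows that reasoning; the function name, the `eps` safeguard, and the averaging of the two foreground estimates are choices made for this example and not necessarily the authors' exact formulation.

```python
import numpy as np

def triangulate(I1, I2, B1, B2, eps=1e-8):
    """Estimate alpha and the pre-multiplied foreground alpha*F per pixel.

    All inputs are (H, W, 3) float arrays; B1 and B2 must differ at every pixel.
    """
    dI, dB = I1 - I2, B1 - B2
    # Least-squares solution of dI = (1 - alpha) * dB over the three channels.
    one_minus_alpha = np.sum(dI * dB, axis=-1) / (np.sum(dB * dB, axis=-1) + eps)
    alpha = np.clip(1.0 - one_minus_alpha, 0.0, 1.0)
    a = alpha[..., np.newaxis]
    # Recover alpha*F from each shot and average the two estimates.
    aF = 0.5 * ((I1 - (1.0 - a) * B1) + (I2 - (1.0 - a) * B2))
    return alpha, aF

# Synthetic check with a known answer.
F_true = np.array([[[0.8, 0.3, 0.1]]])
alpha_true = 0.4
B1 = np.array([[[0.0, 0.0, 1.0]]])  # blue backing
B2 = np.array([[[0.0, 1.0, 0.0]]])  # green backing
I1 = alpha_true * F_true + (1 - alpha_true) * B1
I2 = alpha_true * F_true + (1 - alpha_true) * B2
alpha_est, aF_est = triangulate(I1, I2, B1, B2)
```

On this noise-free example the estimate matches the true α. With real photographs, any change in F, α, or the backgrounds between the two shots violates the model and degrades the result.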
Blue-screen, green-screen, and difference matting are pervasive in film and TV
production. A huge part of creating a compelling visual effects shot is the creation
of a matte for each element, which is often a manual process that involves heuristic
combinations and manipulations of color channels, as described in Section 2.11.
These heuristics vary from shot to shot and even vary for different regions of the
same element. For more discussion on these issues, a good place to start is the book
by Wright [553]. The book by Foster [151] gives a thorough discussion of practical
considerations for setting up a green-screen environment.
In the rest of this chapter, we’ll focus on methods where only one image is obtained
and no knowledge of the clean plate is assumed. This problem is called natural
image matting. The earliest natural image matting algorithms assumed that the user
supplied a trimap along with the image to be matted. This means we have two large
collections of pixels known to be background and foreground. The key idea of the
algorithms in this section is to build probability density functions (pdfs) from these
labeled sets, which are used to estimate the α, F , and B values of the set of unknown
pixels in the region U.
We’ll show how to solve this problem using a simple iterative method that results
from making some assumptions about the form of this probability. First, by Bayes’
rule, Equation (2.7) is equal to
$$\arg\max_{F,B,\alpha} \; \frac{1}{P(I)} \, P(I \mid F, B, \alpha) \, P(F, B, \alpha) \tag{2.8}$$
We can disregard P(I ) since it doesn’t depend on the parameters to be estimated,
and we can assume that F , B, and α are independent of each other. This reduces
Equation (2.8) to:
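$$\arg\max_{F,B,\alpha} \; P(I \mid F, B, \alpha) \, P(F) \, P(B) \, P(\alpha) \tag{2.9}$$
Since the logarithm is monotonic, we can equivalently maximize the log of this product:
$$\arg\max_{F,B,\alpha} \; \log P(I \mid F, B, \alpha) + \log P(F) + \log P(B) + \log P(\alpha) \tag{2.10}$$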
The first term in Equation (2.10) is a data term that reflects how likely the image
color is given values for F , B, and α. Since for a good solution the matting equation
(2.2) should hold, the first term can be modeled as:
$$P(I \mid F, B, \alpha) \propto \exp\left( -\frac{1}{\sigma_d^2} \left\| I - (\alpha F + (1 - \alpha) B) \right\|_2^2 \right) \tag{2.11}$$
where σd is a tunable parameter that reflects the expected deviation from the matting
assumption. Thus,
$$\log P(I \mid F, B, \alpha) = -\frac{1}{\sigma_d^2} \left\| I - (\alpha F + (1 - \alpha) B) \right\|_2^2 \tag{2.12}$$
The other terms in Equation (2.10) are prior probabilities on the foreground,
background, and α distributions. This is where the trimap comes in. Figure 2.8 illus-
trates an example of a user-created trimap and scatterplots of pixel colors in RGB
space corresponding to the background and foreground. In this example, since the
background colors are very similar to each other and the foreground mostly contains
shades of gray, we can fit Gaussian distributions to each collection of intensities.
That is, for a color B, we estimate a pdf for the background given by:
$$f_B(B) = \frac{1}{(2\pi)^{3/2} |\Sigma_B|^{1/2}} \exp\left( -\frac{1}{2} (B - \mu_B)^\top \Sigma_B^{-1} (B - \mu_B) \right) \tag{2.13}$$
Figure 2.8. (a) A user-created trimap corresponding to the upper left image in Figure 2.5, and
(b) a scatterplot of the colors in the labeled foreground and background regions. Black dots
represent background and white dots represent foreground. Since the image was taken against
a blue screen, the background colors are tightly clustered in one corner of RGB space. Both the
foreground and background color distributions are well approximated by Gaussians (ellipses).
The mean µB and covariance matrix ΣB can be computed from the collection of
NB background sample locations {Bi} in B using:
$$\mu_B = \frac{1}{N_B} \sum_{i=1}^{N_B} I(B_i), \qquad \Sigma_B = \frac{1}{N_B} \sum_{i=1}^{N_B} (I(B_i) - \mu_B)(I(B_i) - \mu_B)^\top \tag{2.14}$$
We can do the same thing for the foreground pixels in the trimap. Therefore, we
can obtain estimates for the prior distributions in Equation (2.10) as:
where we’ve omitted constants that don’t affect the optimization. For the moment,
let’s also assume P(α) is constant (we’ll relax this assumption shortly). Then sub-
stituting Equation (2.12) and Equation (2.15) into Equation (2.10) and setting the
derivatives with respect to F , B, and α equal to zero, we obtain the following
simultaneous equations:
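$$\begin{bmatrix} \Sigma_F^{-1} + (\alpha^2/\sigma_d^2) I_{3\times 3} & (\alpha(1-\alpha)/\sigma_d^2) I_{3\times 3} \\ (\alpha(1-\alpha)/\sigma_d^2) I_{3\times 3} & \Sigma_B^{-1} + ((1-\alpha)^2/\sigma_d^2) I_{3\times 3} \end{bmatrix} \begin{bmatrix} F \\ B \end{bmatrix} = \begin{bmatrix} \Sigma_F^{-1} \mu_F + (\alpha/\sigma_d^2) I \\ \Sigma_B^{-1} \mu_B + ((1-\alpha)/\sigma_d^2) I \end{bmatrix} \tag{2.16}$$
$$\alpha = \frac{(I - B) \cdot (F - B)}{(F - B) \cdot (F - B)} \tag{2.17}$$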
Equation (2.16) is a 6 × 6 linear system for determining the optimal F and B for
a given α; I3×3 denotes the 3 × 3 identity matrix. Equation (2.17) is a direct solu-
tion for the optimal α given F and B. This suggests a simple strategy for solving the
Bayesian matting problem. First, we make a guess for α at each pixel (for example,
using the input trimap). Then, we alternate between solving Equation (2.16) and
Equation (2.17) until the estimates for F , B, and α converge.
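The alternation just described can be sketched for a single unknown pixel as follows. This is a minimal illustration, assuming one Gaussian each for the foreground and background, a fixed number of iterations instead of a convergence test, and α clipped to [0, 1]; the function names and the values of `sigma_d`, `alpha0`, and `n_iter` are hypothetical choices made for the example.

```python
import numpy as np

def fit_gaussian(samples):
    """Mean and covariance (Equation 2.14) of an (N, 3) array of colors."""
    mu = samples.mean(axis=0)
    d = samples - mu
    return mu, d.T @ d / len(samples)

def bayesian_matte_pixel(I, mu_F, Sig_F, mu_B, Sig_B, sigma_d=0.05,
                         alpha0=0.5, n_iter=50):
    """Alternate Equations (2.16) and (2.17) for a single pixel color I."""
    inv_F = np.linalg.inv(Sig_F)
    inv_B = np.linalg.inv(Sig_B)
    I3 = np.eye(3)
    s = 1.0 / sigma_d**2
    alpha = alpha0
    for _ in range(n_iter):
        # Equation (2.16): 6x6 linear system for F and B given alpha.
        A = np.block([
            [inv_F + alpha**2 * s * I3, alpha * (1 - alpha) * s * I3],
            [alpha * (1 - alpha) * s * I3, inv_B + (1 - alpha)**2 * s * I3],
        ])
        rhs = np.concatenate([inv_F @ mu_F + alpha * s * I,
                              inv_B @ mu_B + (1 - alpha) * s * I])
        FB = np.linalg.solve(A, rhs)
        F, B = FB[:3], FB[3:]
        # Equation (2.17): project I onto the line from B to F.
        alpha = float(np.clip((I - B) @ (F - B) / ((F - B) @ (F - B)), 0, 1))
    return F, B, alpha

# Synthetic labeled samples: light gray foreground, bluish background.
rng = np.random.default_rng(0)
fg = np.array([0.8, 0.8, 0.8]) + 0.02 * rng.standard_normal((200, 3))
bg = np.array([0.1, 0.1, 0.9]) + 0.02 * rng.standard_normal((200, 3))
mu_F, Sig_F = fit_gaussian(fg)
mu_B, Sig_B = fit_gaussian(bg)
I = 0.3 * mu_F + 0.7 * mu_B          # a mixed pixel whose true alpha is 0.3
F, B, alpha = bayesian_matte_pixel(I, mu_F, Sig_F, mu_B, Sig_B)
```

In a full implementation the same update would run over every pixel of the unknown region, with the Gaussians refit from nearby samples.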
Figure 2.9. (a) A tougher example of a scatterplot of the colors in labeled foreground and back-
ground regions. Black dots represent background and white dots represent foreground. In this
case, the foreground and background densities are neither well separated nor well represented
by a single Gaussian. (b) Gaussian mixture models fit to the foreground and background samples
do a better job of separating the distributions.
Figure 2.10. The local foreground and background samples in a window around each
pixel can be used to compute the distributions for Bayesian matting.
distributions remains, but the Gaussian mixture components are better separated
and model the data more tightly.
In the multiple-Gaussian case, solving Equation (2.10) directly is no longer
straightforward, but Chuang et al. [99] suggested a simple approach. We consider
each possible pair of (foreground, background) Gaussians independently, and solve
for the best F , B, and α by alternating Equations (2.16)–(2.17). Then we compute the
log likelihood given by the argument of Equation (2.10) for each result. We need to
include the determinants of ΣF and ΣB when evaluating log P(F) and log P(B) for each
pair, since they are not all the same; these factors were ignored in Equation (2.15).
Finally, we choose the estimates for F , B, and α that produce the largest value of
Equation (2.10).
For complicated foregrounds and backgrounds, it makes sense to determine the
foreground and background distributions in Equation (2.15) locally at a pixel, rather
than globally across the whole image. This can be accomplished by creating a small
(relative to the image size) window around the pixel of interest and using the colors of
F and B inside the window to build the local pdfs (Figure 2.10). As F , B, and α for pixels
inside both the window and the unknown region are estimated, they can supplement
the samples. Generally, the estimation begins at the edges of the unknown area and
proceeds toward its center. We'll say more about the issue of local pixel sampling in
Section 2.6.1.
Figure 2.11. (a) The normalized histogram of α values for the ground-truth matte for the middle
example in Figure 2.4. (b) The normalized histogram of α values just over the trimap's unknown
region, superimposed by a beta distribution with η = τ = 1/4. (Horizontal axes: α; vertical axes: frequency.)
While the original Bayesian matting algorithm treated the prior term P(α) as a
constant, later researchers observed that P(α) is definitely not a uniform distribution.
This stands to reason, since there are a relatively large number of pixels that are
conclusively foreground (α = 1) or background (α = 0) compared to mixed pixels,
which typically occur along object boundaries. Figure 2.11 illustrates the distributions
of α for a real image; the left panel shows that over the whole image the distribution is
highly nonuniform, and the right panel shows that even over the trimap’s uncertain
region, the distribution is biased toward α values close to 0 and 1. Wexler et al. [544]
and Apostoloff and Fitzgibbon [16] suggested modeling this behavior with a beta
distribution of the form
$$P(\alpha) = \frac{\Gamma(\eta + \tau)}{\Gamma(\eta)\,\Gamma(\tau)} \, \alpha^{\eta - 1} (1 - \alpha)^{\tau - 1} \tag{2.18}$$
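To see how such a prior favors nearly binary mattes, the density in Equation (2.18) can be evaluated directly. The sketch below uses only the Python standard library; the function name `beta_prior` and the default values η = τ = 0.25 are illustrative assumptions (any values below 1 put the peaks at 0 and 1).

```python
import math

def beta_prior(alpha, eta=0.25, tau=0.25):
    """Beta density of Equation (2.18); defined for 0 < alpha < 1."""
    norm = math.gamma(eta + tau) / (math.gamma(eta) * math.gamma(tau))
    return norm * alpha ** (eta - 1) * (1 - alpha) ** (tau - 1)

# The density is much larger near the ends of the interval than in the middle.
for a in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(f"alpha={a:4.2f}  P(alpha)={beta_prior(a):.3f}")
```

In a Bayesian matting objective, log `beta_prior(alpha)` would take the place of the constant log P(α) term in Equation (2.10).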
in Figures 2.8 and 2.9 — the fitted Gaussians are generally long and skinny. Levin
et al. [271] exploited this observation in an elegant algorithm called closed-form
matting.
$$\begin{aligned} F_i &= \beta_i F_1 + (1 - \beta_i) F_2 \\ B_i &= \gamma_i B_1 + (1 - \gamma_i) B_2 \end{aligned} \tag{2.19}$$
Here, F1 and F2 are two points on the line of foreground colors, and βi represents
the fraction of the way a given foreground color Fi is between these two points.
The same idea applies to the background colors. This idea, called the color line
assumption, is illustrated in Figure 2.12.
Levin et al.’s first observation was that under the color line assumption, the α value
for every pixel in the window was simply related to the intensity by
$$\alpha_i = a^\top I_i + b \tag{2.20}$$
where a is a 3 × 1 vector, b is a scalar, and the same a and b apply to every pixel
in the window. That is, we can compute α for each pixel in the window as a linear
combination of the RGB values at that pixel, plus an offset. While this may not be
intuitive, let’s show why Equation (2.20) is algebraically true.
First we plug Equation (2.19) into the matting equation (2.2) to obtain:
Figure 2.12. The color line assumption says that each pixel Ii in a small window of the image
is a mix of a foreground color Fi and a background color Bi , where each of these colors lies on a
straight line in RGB space.
matrix on both sides, and denote the rows of this inverse by r’s:
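$$\begin{bmatrix} \alpha_i \\ \alpha_i \beta_i \\ (1 - \alpha_i) \gamma_i \end{bmatrix} = \begin{bmatrix} F_2 - B_2 & F_1 - F_2 & B_1 - B_2 \end{bmatrix}^{-1} (I_i - B_2) = \begin{bmatrix} r_1^\top \\ r_2^\top \\ r_3^\top \end{bmatrix} (I_i - B_2) \tag{2.23}$$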
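The relation in Equation (2.23) is easy to check numerically. The sketch below, with hypothetical color-line endpoints chosen only for illustration, builds pixels that obey Equations (2.19) and (2.2) and confirms that the first row of the inverse gives an affine map from color to α, with a = r1 and b = −r1ᵀB2.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical endpoints of the foreground and background color lines.
F1, F2 = np.array([0.9, 0.8, 0.7]), np.array([0.6, 0.5, 0.5])
B1, B2 = np.array([0.1, 0.2, 0.9]), np.array([0.0, 0.1, 0.6])

# Rows of the inverse in Equation (2.23); r1 is the first row.
H = np.column_stack([F2 - B2, F1 - F2, B1 - B2])
r1 = np.linalg.inv(H)[0]
a, b = r1, -r1 @ B2                  # alpha_i = a^T I_i + b

# Nine random pixels in a window obeying Equations (2.19) and (2.2).
alpha = rng.random(9)
beta = rng.random(9)
gamma = rng.random(9)
F = beta[:, None] * F1 + (1 - beta[:, None]) * F2
B = gamma[:, None] * B1 + (1 - gamma[:, None]) * B2
I = alpha[:, None] * F + (1 - alpha[:, None]) * B

alpha_from_affine = I @ a + b        # should reproduce alpha
```

The same a and b work for all nine pixels even though each pixel has its own βi and γi, which is the content of the color line observation.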
Taking just the first element of the vector on both sides, we see that
$$J(\{\alpha_i, a_i, b_i,\; i = 1, \ldots, N\}) = \sum_{j=1}^{N} \sum_{i \in w_j} \left( \alpha_i - (a_j^\top I_i + b_j) \right)^2 \tag{2.25}$$
This cost function expresses the total error of the linearity assumption in
Equation (2.20) over each window. We want to minimize J to find αi at each pixel
as well as the coefficients ai and bi for every window wi around pixel i. For brevity,
we’ll write the left-hand side as J (α, a, b), where α is an N ×1 vector that collects all the
α values in the image, and a and b represent the collections of affine coefficients for
each window. Since the windows between adjacent pixels overlap, the α estimates at
each pixel are not independent. We also add a regularization term to Equation (2.25):
$$J(\alpha, a, b) = \sum_{j=1}^{N} \left( \sum_{i \in w_j} \left( \alpha_i - (a_j^\top I_i + b_j) \right)^2 + \varepsilon \| a_j \|_2^2 \right) \tag{2.26}$$
This term acts to bias the mattes toward being constant, since if a = 0, αi = bi
within the whole window wi. Usually ε is chosen to be on the order of 10^−7 if each
color channel is in the range [0, 1].
At first glance, this formulation doesn't seem to help us solve the matting problem,
since we still have many more unknowns than equations (i.e., the five values of α, a,
and b at each pixel). However, by a clever manipulation, we can reduce the number
of unknowns to exactly the number of pixels. First, we rearrange Equation (2.26) as a
matrix equation:
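$$J(\alpha, a, b) = \sum_{j=1}^{N} \left\| \begin{bmatrix} I_1^{j\top} & 1 \\ \vdots & \vdots \\ I_W^{j\top} & 1 \\ \sqrt{\varepsilon}\, I_{3\times 3} & 0_{3\times 1} \end{bmatrix} \begin{bmatrix} a_j \\ b_j \end{bmatrix} - \begin{bmatrix} \alpha_1^j \\ \vdots \\ \alpha_W^j \\ 0_{3\times 1} \end{bmatrix} \right\|^2 \tag{2.27}$$
where W is the number of pixels in the window and {I_1^j, ..., I_W^j} and {α_1^j, ..., α_W^j} represent
the ordered lists of image colors and α values inside window j. More compactly,
we can write Equation (2.27) as
$$J(\alpha, a, b) = \sum_{j=1}^{N} \left\| G_j \begin{bmatrix} a_j \\ b_j \end{bmatrix} - \bar{\alpha}_j \right\|^2 \tag{2.28}$$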
That is, the optimal a and b in each window for a given matte α are linear functions
of the α values. This means we can substitute Equation (2.29) into Equation (2.26)
to get
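$$= \sum_{j=1}^{N} \left\| G_j (G_j^\top G_j)^{-1} G_j^\top \bar{\alpha}_j - \bar{\alpha}_j \right\|^2 \tag{2.33}$$
$$= \sum_{j=1}^{N} \bar{\alpha}_j^\top \left( I_{(W+3)\times(W+3)} - G_j (G_j^\top G_j)^{-1} G_j^\top \right) \bar{\alpha}_j \tag{2.34}$$
$$= \alpha^\top L \alpha \tag{2.35}$$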
In the last equation, we’ve collected all of the equations for the windows into a
single matrix equation for the N × 1 vector α. The N × N matrix L is called the matting
Laplacian. It is symmetric, positive semidefinite, and quite sparse if the window size
is small. This matrix plays a key role in the rest of the chapter.
Working out the algebra in Equation (2.34), one can compute the elements of the
matting Laplacian as:
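$$L(i, j) = \sum_{k \mid (i,j) \in w_k} \left( \delta_{ij} - \frac{1}{W} \left( 1 + (I_i - \mu_k)^\top \left( \Sigma_k + \frac{\varepsilon}{W} I_{3\times 3} \right)^{-1} (I_j - \mu_k) \right) \right) \tag{2.36}$$
where µk and Σk are the mean and covariance matrix of the colors in window k and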
δij is the Kronecker delta. Frequently, the windows are taken to be 3 × 3, so W = 9. The
notation k|(i, j) ∈ wk in Equation (2.36) means that we only sum over the windows k
that contain both pixels i and j; depending on the configuration of the pixels, there
could be from 0 to 6 windows in the sum (see Problem 2.11).
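As a concrete illustration of Equation (2.36), the sketch below assembles a dense matting Laplacian for a tiny image by accumulating each window's contribution. It is a simplified sketch under stated assumptions: only 3 × 3 windows that fit entirely inside the image are used, the matrix is stored densely (a practical implementation would use a sparse format), and the function name `matting_laplacian` is hypothetical.

```python
import numpy as np

def matting_laplacian(img, eps=1e-7):
    """Dense matting Laplacian (Equation 2.36) for a small (H, W, 3) image.

    Only 3x3 windows that fit entirely inside the image are used.
    """
    H, W, _ = img.shape
    N = H * W
    idx = np.arange(N).reshape(H, W)
    L = np.zeros((N, N))
    Wn = 9  # number of pixels per 3x3 window
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            win = idx[y - 1:y + 2, x - 1:x + 2].ravel()
            colors = img[y - 1:y + 2, x - 1:x + 2].reshape(Wn, 3)
            mu = colors.mean(axis=0)
            d = colors - mu
            cov = d.T @ d / Wn
            inv = np.linalg.inv(cov + (eps / Wn) * np.eye(3))
            # This window's affinity for every pixel pair (i, j) it contains.
            A_k = (1.0 + d @ inv @ d.T) / Wn
            L[np.ix_(win, win)] += np.eye(Wn) - A_k
    return L

img = np.random.default_rng(0).random((5, 6, 3))
L = matting_laplacian(img)
```

The resulting matrix is symmetric with rows that sum to zero, so a constant matte always lies in its nullspace; this is one reason user constraints are needed to pin down a useful solution.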
where
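$$A(i, j) = \sum_{k \mid (i,j) \in w_k} \frac{1}{W} \left( 1 + (I_i - \mu_k)^\top \left( \Sigma_k + \frac{\varepsilon}{W} I_{3\times 3} \right)^{-1} (I_j - \mu_k) \right) \tag{2.38}$$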
The matrix A specified by Equation (2.38) is sometimes called the matting affinity.
From Equation (2.35) we can see that minimizing J (α) corresponds to solving the
linear system Lα = 0. That is, we must simply find a vector in the nullspace of L.
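$$\begin{aligned} \min_{\alpha} \quad & \alpha^\top L \alpha \\ \text{s.t.} \quad & \alpha_i = 1, \quad i \in F \\ & \alpha_i = 0, \quad i \in B \end{aligned} \tag{2.39}$$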
• the color line model was satisfied exactly in every pixel window,
• the image was formed by exactly applying the matting equation to some
foreground and background images,
• the user scribbles were consistent with the ground-truth matte, and
• ε = 0 in Equation (2.26),
then the ground-truth matte will solve Equation (2.41). However, it’s important to
realize that the user might need to experiment with scribble quantity and placement
to ensure that the solution of Equation (2.41) is acceptable, since the nullspace of the
left-hand side may be non-trivial (see more in Section 2.4.5). Figure 2.13 illustrates an
example of closed-form matting using only a few scribbles on a natural image.
Figure 2.13. (a) An image with (b) foreground and background scribbles. (c) The α matte com-
puted using closed-form matting, showing that good estimates are produced in fine detail
regions.
Choosing the right window size for closed-form matting can be a tricky problem
depending on the resolution of the image and the fuzziness of the foreground object
(which may not be the same in all parts of the image). He et al. [192] considered
this issue, and showed how the linear system in Equation (2.41) could be efficiently
solved by using relatively large windows whose sizes depend on the local width of the
uncertain region U in the trimap. The advantage of using large windows is that many
distant pixels are related to each other, and the iterative methods typically used to
solve large systems like Equation (2.41) converge more quickly.
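$$\min_{F_i, B_i} \sum_{i=1}^{N} \left\| I_i - (\alpha_i F_i + (1 - \alpha_i) B_i) \right\|^2 + |\nabla_x \alpha_i| \left( \| \nabla_x F_i \|^2 + \| \nabla_x B_i \|^2 \right) + |\nabla_y \alpha_i| \left( \| \nabla_y F_i \|^2 + \| \nabla_y B_i \|^2 \right) \tag{2.42}$$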
Figure 2.14. (a) An original image and (b) the eight eigenvectors corresponding to the smallest
eigenvalues of its matting Laplacian.
smallest eigenvalues of an input image. We can see that these eigenvector images
tend to be locally constant in large regions of the image and seem to follow the con-
tours of the foreground object. Any single eigenvector is generally unsuitable as a
matte, because mattes should be mostly binary (i.e., solid white in the foreground
and solid black in the background). On the other hand, since any linear combination
of null vectors is also a null vector, we can try to find combinations that are as binary
as possible in the hopes of creating “pieces” useful for matting.
Levin et al. [272] subsequently proposed an algorithm based on this natural idea
called spectral matting. We begin by computing the matting Laplacian L and its
eigenvectors E = [e1, ..., eK] corresponding to the K smallest eigenvalues (since the
matrix is positive semidefinite, none of the eigenvalues are negative). Each ei thus
roughly satisfies eiᵀ L ei = 0 and thus roughly minimizes Equation (2.30), despite
being a poor matte. We then try to find K linear combinations of these eigenvectors
called matting components that are as binary as possible by solving the constrained
optimization problem
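$$\min_{y^k \in \mathbb{R}^K,\; k = 1, \ldots, K} \; \sum_{k=1}^{K} \sum_{i=1}^{N} |\alpha_i^k|^\rho + |1 - \alpha_i^k|^\rho \tag{2.43}$$
$$\text{s.t.} \quad \alpha^k = E y^k \tag{2.44}$$
$$\sum_{k=1}^{K} \alpha_i^k = 1, \quad i = 1, \ldots, N \tag{2.45}$$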
That is, Equation (2.44) says that each matting component must be a linear com-
bination of the eigenvectors, while Equation (2.45) says that the matting components
must sum to 1 at each pixel. Figure 2.15 illustrates the score function in Equation (2.43)
with ρ = 0.9; we can see that it’s lowest when α is either 0 or 1.
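The per-pixel score inside Equation (2.43) is simple enough to evaluate by hand. The short sketch below (the function name `binary_cost` is an illustrative choice) shows that the cost is 1 at the binary values and rises to about 1.07 at α = 0.5 for ρ = 0.9.

```python
def binary_cost(alpha, rho=0.9):
    """Per-pixel score |alpha|^rho + |1 - alpha|^rho from Equation (2.43)."""
    return abs(alpha) ** rho + abs(1 - alpha) ** rho

# The cost is smallest at alpha = 0 and alpha = 1 and largest at alpha = 0.5.
for a in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"alpha={a:3.1f}  cost={binary_cost(a):.4f}")
```

Because ρ < 1 makes this function concave between 0 and 1, the optimization in Equations (2.43)–(2.45) is non-convex, which is consistent with the text's description of the result as merely "as binary as possible."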
The result of applying this process to the eigenvectors in Figure 2.14 is illustrated
in Figure 2.16a. At this point, the user can simply view a set of matting components
and select the ones that combine to create the desired foreground (this step takes the
place of the conventional trimap or scribbles). For example, selecting the highlighted
components in Figure 2.16a results in the good initial matte in Figure 2.16b. User
scribbles can be used to further refine the matte by forcing certain components to
contribute to the foreground or the background.
Figure 2.15. The cost function in Equation (2.43) as a function of α, with ρ = 0.9. (Plotted curve: α^0.9 + (1 − α)^0.9; vertical axis: cost.)
Figure 2.16. (a) The eight nearly binary matting components computed using spectral matting
for the image in Figure 2.14a. (b) The four selected matting components are summed to give an
estimate of the full matte.
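$$\alpha_i = a^\top I_i + b \tag{2.46}$$
$$\arg\min_{a, b} \sum_{i \in w_j} \left( \alpha_i - (a^\top I_i + b) \right)^2 \tag{2.47}$$
$$\arg\min_{a, b} \left\| \alpha^j - X_j \begin{bmatrix} a \\ b \end{bmatrix} \right\|^2 + \varepsilon \left( \| a \|^2 + b^2 \right) \tag{2.48}$$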
where α^j collects all of the α values in wj into a vector, and Xj is a W × 4 matrix
containing image colors in the window. As we have seen, the solution to Equation (2.48) is
$$\begin{bmatrix} a^* \\ b^* \end{bmatrix} = (X_j^\top X_j + \varepsilon I_{4\times 4})^{-1} X_j^\top \alpha^j \tag{2.49}$$
which, plugging back into Equation (2.46), gives a mutual relationship between the α
at the center of the window and all the α’s in the window by way of the colors in Xi :
That is, Equation (2.50) says that the α in the center of the window can be linearly
predicted by its neighbors in the window; the term multiplying α i can be thought of
as a 1 × W vector of linear coefficients. If we compute this vector for every window,
we get a large, sparse linear system mutually relating all the α’s in the entire image;
that is,
α = F α (2.51)
where as before, α is an N × 1 vector of all the α’s. Just like in closed-form matting,
we want to determine α’s that satisfy this relationship while also satisfying user con-
straints specified by foreground and background scribbles. This leads to the natural
optimization problem
$$\alpha_i = a^\top \Phi(I_i) + b \tag{2.53}$$
where Φ is a nonlinear map from three color dimensions to a larger number of features
(say, p) and a becomes a p × 1 vector. The Ii and Xi entries in Equation (2.50) are
replaced by kernel functions between image colors (e.g., Gaussian kernels) that reflect
the relationship in high-dimensional space.
2.5 Markov Random Fields for Matting
Many matting algorithms use the basic structure of a Markov Random Field (MRF)
to measure the quality of an alpha matte, based on two premises: (1) the estimated
alpha, foreground, and background values should agree with the matting equation,
and (2) alpha values at adjacent pixels should be similar. These assumptions result in
an energy function of the form
$$E(\alpha) = \sum_{i \in V} E_{\text{data}}(\alpha_i) + \sum_{(i,j) \in E} E_{\text{smoothness}}(\alpha_i, \alpha_j) \tag{2.54}$$
Here, V is the set of pixels in the image and E is the set of all adjacent pixels (for
example, 4-neighbors). This formulation is also known as a Gibbs energy. We want to
minimize E to find an optimal α given the user-specified information (i.e., scribbles
or trimap).
If we knew the foreground and background pixel values at i, then a natural choice
for the data energy Edata is a function like the one we used for the data term in
Bayesian matting (2.11):
$$E_{\text{data}}(\alpha_i) = 1 - \exp\left( -\frac{1}{\sigma_d^2} \left\| I_i - (\alpha_i F_i + (1 - \alpha_i) B_i) \right\|_2^2 \right) \tag{2.55}$$
Note that we negated the term from Equation (2.11) since we want the Edata term
to be small when the fit to the data is good. Along the same lines, a natural choice for
the smoothness energy Esmoothness is
$$E_{\text{smoothness}}(\alpha_i, \alpha_j) = 1 - \exp\left( -\frac{1}{\sigma_s^2} (\alpha_i - \alpha_j)^2 \right) \tag{2.56}$$
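To make the energy concrete, the following sketch evaluates Equation (2.54) for a candidate matte using the data and smoothness terms of Equations (2.55) and (2.56) with 4-neighbor adjacency. It assumes that F and B are known at every pixel, which holds only in this toy setting; the function name `mrf_energy` and the values of `sigma_d` and `sigma_s` are illustrative.

```python
import numpy as np

def mrf_energy(alpha, I, F, B, sigma_d=0.1, sigma_s=0.2):
    """Gibbs energy of Equation (2.54) with 4-neighbor adjacency.

    alpha: (H, W) matte; I, F, B: (H, W, 3) color images.
    """
    a = alpha[..., None]
    resid = np.sum((I - (a * F + (1 - a) * B)) ** 2, axis=-1)
    e_data = 1.0 - np.exp(-resid / sigma_d**2)               # Equation (2.55)
    dx = alpha[:, 1:] - alpha[:, :-1]                        # horizontal neighbors
    dy = alpha[1:, :] - alpha[:-1, :]                        # vertical neighbors
    e_smooth = (np.sum(1.0 - np.exp(-dx**2 / sigma_s**2))    # Equation (2.56)
                + np.sum(1.0 - np.exp(-dy**2 / sigma_s**2)))
    return e_data.sum() + e_smooth

# Toy image: constant foreground and background, true alpha = 0.5 everywhere.
H, W = 4, 5
F = np.full((H, W, 3), 0.9)
B = np.full((H, W, 3), 0.1)
alpha_true = np.full((H, W), 0.5)
I = 0.5 * F + 0.5 * B
rng = np.random.default_rng(0)
alpha_noisy = np.clip(alpha_true + 0.2 * rng.standard_normal((H, W)), 0, 1)
E_true = mrf_energy(alpha_true, I, F, B)
E_noisy = mrf_energy(alpha_noisy, I, F, B)
```

The true matte has essentially zero energy, while the perturbed matte is penalized by both terms.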
Recall that in Bayesian matting (Section 2.3), samples of the known F and B pixels
in the local neighborhood of an unknown pixel were used to build Gaussian mixture
models for the foreground and background. Instead, Wang and Cohen proposed
a non-parametric approach similar to a kernel density estimate. The basic idea is
to determine a set of candidate foreground {F1 , . . . , FC } and background {B1 , . . . , BC }
samples for each pixel, which could be either original scribbles or estimates filled in
on a previous iteration. Then we compute a likelihood that pixel i has α value αk as
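$$L_k(i) = \frac{1}{C^2} \sum_{m=1}^{C} \sum_{n=1}^{C} w_m^F \, w_n^B \, \exp\left( -\frac{1}{\sigma_d^2} \left\| I_i - (\alpha_k F_m + (1 - \alpha_k) B_n) \right\|_2^2 \right) \tag{2.57}$$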
Here, the weights w of each foreground and background sample are related to the
spatial distance from the sample to the pixel under consideration and the uncertainty
of the sample, and the covariance σd2 is related to the variances of the foreground and
background samples. Then the final formula for the Edata term is
$$E_{\text{data}}(\alpha_i^k) = 1 - \frac{L_k(i)}{\sum_{k'=1}^{K} L_{k'}(i)} \tag{2.58}$$
After minimizing Equation (2.54), we obtain α values but not estimates of F and
B. We discussed one method for getting such estimates in Section 2.4.4. Wang and
Cohen proposed a slightly similar approach based on the foreground samples {Fm }
and background samples {Bn } generated at pixel i during the creation of the data
energy. The idea is simply to estimate the foreground and background values as:
That is, we select the pair of foreground and background samples that gives the best
fit to the matting equation for the given αi and Ii . The uncertainty of the pixel is
updated based on the weights w of the selected pair {Fi∗ , Bi∗ }.
Guan et al. [182] proposed an algorithm called easy matting that uses the same
MRF model with a few differences. They create Edata and Esmoothness using log likeli-
hoods instead of the exponential forms in Equations (2.55)–(2.56). The smoothness
term Esmoothness is also modulated by the image gradient; that is,
$$E_{\text{smoothness}}(\alpha_i, \alpha_j) = \frac{(\alpha_i - \alpha_j)^2}{\| I_i - I_j \|} \tag{2.60}$$
The balance between Edata and Esmoothness is updated dynamically, so that the
smoothness term is weighted less as the iterations proceed. Finally, instead of using
belief propagation to solve for the matte, the minimization of Equation (2.54) with
respect to the user scribble constraints is posed as a variational problem that can be
solved directly as a linear system.
Figure 2.17. Random-walk matting methods are based on estimating the probability that a random
walk starting at a pixel in the unknown region (black pixel) ends up in the foreground region.
The illustrated instance of the random walk ends up in the foreground (white pixel).
and the set of undirected edges E represents connections between pixels (typically,
4-neighbor adjacency). Each edge eij ∈ E is associated with a nonnegative weight wij .
As discussed later, different random-walk-based algorithms use different formu-
lations for the weights wij , but the common intuition is that wij should be large
for pixels that are “similar” and near zero for pixels that are dissimilar. As in the
algorithms shown earlier, the user provides prior information about foreground and
background regions in the form of a trimap or scribbles. Random walk algorithms
estimate αi at a pixel i in the unknown region as the probability that a random walker
starting at i and choosing edges according to the weights wij will first encounter a
foreground pixel rather than a background pixel, as illustrated in Figure 2.17.
While this approach lacks the mathematical model for how intensities and α’s
are related through the matting equation that underlies Bayesian and closed-form
matting, it turns out to work well in practice and be computationally efficient. It
additionally matches our intuition: if there exists a path containing similar intensities
from an unknown pixel i to the labeled foreground region F, while paths from
i to the background region B need to cross dissimilar pixels, pixel i is more likely
to be foreground. However, we should note that the random-walk algorithm isn't
evaluating the shortest or most likely path; it's evaluating the probability over all
possible paths the random walker may take. It may seem that this probability is
intractable to estimate; however, Grady showed how it could be computed using a
similar linear system.
Let the degree di of node i be the sum of edge weights coming into it, that is:
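$$d_i = \sum_{j \mid e_{ij} \in E} w_{ij} \tag{2.61}$$
The graph Laplacian matrix L is then defined entrywise by
$$L_{ij} = \begin{cases} d_i & \text{if } i = j \\ -w_{ij} & \text{if } e_{ij} \in E \\ 0 & \text{otherwise} \end{cases} \tag{2.62}$$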
Since some pixels in the image have been labeled by the trimap or a scribble, we
can re-index the pixels into a known set K and an unknown set U and partition the
graph Laplacian as
$$L = \begin{bmatrix} L_K & R \\ R^\top & L_U \end{bmatrix} \tag{2.63}$$
where the LK block corresponds to the set of known pixels and the LU block to the
unknown set.
Grady showed that the desired random walker probabilities described earlier
correspond to minimizing the functional
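$$\alpha^\top L \alpha = \begin{bmatrix} \alpha_k^\top & \alpha_u^\top \end{bmatrix} \begin{bmatrix} L_K & R \\ R^\top & L_U \end{bmatrix} \begin{bmatrix} \alpha_k \\ \alpha_u \end{bmatrix} \tag{2.64}$$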
using results from combinatorial graph theory. Taking the gradient of Equation (2.64)
with respect to the unknown values α u and setting it equal to 0 leads to the linear
system
$$L_U \alpha_u = -R^\top \alpha_k \tag{2.65}$$
This is generally an extremely sparse system; for example, if 4-neighbors are used
for adjacency there are only five nonzero elements per row. As a bonus, all elements
of the solution of Equation (2.65) are guaranteed to be in the range [0,1] by the
maximum modulus principle (i.e., the interpolated harmonic function must take its
minimum and maximum values on its boundary, which are 0 and 1 respectively from
the trimap/scribbles).
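The linear system in Equation (2.65) can be illustrated on a one-dimensional chain of pixels. The sketch below is a toy example under stated assumptions: the graph is a single scanline with neighbor edges only, the edge weight exp(−β(Ii − Ij)²) with β = 900 is an assumed intensity-based form used for illustration, the matrix is dense, and the function name `random_walk_alpha` is hypothetical.

```python
import numpy as np

def random_walk_alpha(intensity, known, beta=900.0):
    """Solve Equation (2.65) on a 1-D chain of pixels.

    intensity: (N,) array of gray values in [0, 1].
    known:     dict {pixel index: alpha value (0 or 1)} from scribbles.
    Edge weights use w_ij = exp(-beta * (I_i - I_j)^2), an assumed form.
    """
    N = len(intensity)
    L = np.zeros((N, N))
    for i in range(N - 1):
        w = np.exp(-beta * (intensity[i] - intensity[i + 1]) ** 2)
        L[i, i] += w                   # degrees on the diagonal
        L[i + 1, i + 1] += w
        L[i, i + 1] -= w               # negative weights off the diagonal
        L[i + 1, i] -= w
    k = np.array(sorted(known))
    u = np.array([i for i in range(N) if i not in known])
    alpha_k = np.array([known[i] for i in k], dtype=float)
    L_U = L[np.ix_(u, u)]
    R_T = L[np.ix_(u, k)]              # the block R^T in Equation (2.63)
    alpha = np.zeros(N)
    alpha[k] = alpha_k
    alpha[u] = np.linalg.solve(L_U, -R_T @ alpha_k)
    return alpha

# A step edge between pixels 3 and 4; only the two end pixels are scribbled.
I = np.array([0.9, 0.9, 0.88, 0.9, 0.2, 0.22, 0.2, 0.2])
alpha = random_walk_alpha(I, known={0: 1.0, 7: 0.0})
```

Because the weight across the step edge is nearly zero, the walker almost never crosses it, and the solved α values stay close to 1 on the left of the edge and close to 0 on the right.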
The key issue is thus how to choose the weights for random-walk matting. Grady
[176] originally proposed to simply use
with β = 900 assuming the images are normalized so ‖Ii − Ij‖² ∈ [0, 1], and later
proposed a more general weight
$$w_{ij} = \exp\left( -\beta \, (I_i - I_j)^\top Q^\top Q \, (I_i - I_j) \right) \tag{2.67}$$
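$$w_{ij} = \sum_{k \mid (i,j) \in w_k} \frac{1}{W} \left( 1 + (I_i - \mu_k)^\top \left( \Sigma_k + \frac{\varepsilon}{W} I_{3\times 3} \right)^{-1} (I_j - \mu_k) \right) \tag{2.68}$$
where µk and Σk are the mean and covariance matrix of the colors in the window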
wk centered around pixel k. These are exactly the values of the matting affinity in
Equation (2.38).
$$\hat{\alpha}_i = \frac{(I_i - B_i) \cdot (F_i - B_i)}{\| F_i - B_i \|^2} \tag{2.69}$$
We can also compute a confidence ci for how much we trust the estimate α̂i based
on several factors. First, the quality of fit based on the matting equation should
be high. Also, Wang and Cohen argued that the selected foreground and back-
ground samples should be widely spread in color space, so that the denominator
of Equation (2.69) is not close to zero (this could result in a sensitive estimate of α).
This results in what they called the distance ratio
The distance ratio, combined with terms that measure how similar the foreground
and background samples are to Ii , is used to form a confidence ci that measures how
certain we are of the α̂i estimate, and these quantities are combined to produce two
weights wF (i) and wB (i) for connecting each pixel to the foreground and background
terminals.
Figure 2.18. The sampling strategy in robust matting spreads the potential foreground
and background samples along the boundaries of the known regions, compared to the
nearest-neighbors approach from Bayesian matting (Figure 2.10).
with
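$$M = \begin{bmatrix} L + \operatorname{diag}(w_F) + \operatorname{diag}(w_B) & -w_F & -w_B \\ -w_F^\top & \mathbf{1}^\top w_F & 0 \\ -w_B^\top & 0 & \mathbf{1}^\top w_B \end{bmatrix} \tag{2.72}$$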
where L is the standard matting Laplacian and wF and wB are N ×1 vectors of terminal
weights. Expanding Equation (2.71) results in the equivalent objective function
$$\alpha^\top L \alpha + (\alpha - 1_{N\times 1})^\top W_F (\alpha - 1_{N\times 1}) + \alpha^\top W_B \alpha \tag{2.73}$$
where WF and WB are N × N diagonal matrices with the vectors wF and wB on the
diagonals, respectively. This objective function is quadratic in α and thus results in a
slightly modified linear system from the one used in closed-form matting.
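To make the modified linear system explicit, here is a short worked derivation. It assumes only that L is symmetric and uses the fact that a diagonal matrix W_F applied to the all-ones vector returns its diagonal, w_F:

```latex
\begin{aligned}
\nabla_\alpha \left[ \alpha^\top L \alpha
  + (\alpha - \mathbf{1})^\top W_F (\alpha - \mathbf{1})
  + \alpha^\top W_B \alpha \right]
  &= 2 L \alpha + 2 W_F (\alpha - \mathbf{1}) + 2 W_B \alpha = 0 \\
\Longrightarrow \quad (L + W_F + W_B)\, \alpha &= W_F \mathbf{1} = w_F
\end{aligned}
```

Pixels with a large foreground terminal weight are therefore pulled toward α = 1, pixels with a large background terminal weight are pulled toward α = 0, and the matting Laplacian propagates these preferences to their neighbors.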
Rhemann et al. [389] suggested a modification to the objective function that more
explicitly involves the estimates α̂i and confidences ci :
where α̂ is an N × 1 vector that acts as a prior estimate of α at every pixel in the matte,
λ is a tunable parameter, and D is a diagonal matrix with the confidences ci on the
diagonal. In this way, it’s clear that when the confidence in the α̂i estimate is high,
the objective function puts a higher weight on the prior that αi = α̂i , and when the
confidence is low, the usual neighborhood constraints from the matting Laplacian
have a stronger effect.²
Robust matting was later refined into an algorithm called soft scissors [529] that
solves the matting problem incrementally based on real-time user input. That is, a
local trimap is generated on the fly as a user paints a wide stroke near the boundary
of a foreground object. The pixels on either edge of the stroke are used to build local
foreground and background models, and the stroke automatically adjusts its width
based on the local image properties. The pixels interior to the stroke are treated as
unknown and their α’s are estimated with the robust matting algorithm. Since this
region is relatively small, the drawing and matte estimation can proceed at interactive
rates.
Rhemann et al. [391] also extended robust matting by incorporating a sparsity
prior on α that presumes the observed α is created from an underlying sharp-edged
(nearly binary) matte with a constant point-spread function (PSF) induced by the
camera. The underlying matte and PSF are iteratively estimated and used to bias the
matting result to be less blurry. They later extended their technique to allow the PSF
to spatially vary [390].
² Rhemann et al. [389] also defined the α̂i estimates and confidence terms slightly differently from
robust matting, and generated the foreground samples based on a geodesic-distance approach
instead of a Euclidean-distance one.
Gastal and Oliveira [163] proposed an objective function of the same form
as Equation (2.74), with yet another approach toward computing the fore-
ground/background samples, α̂i estimates, and confidences. The key observation
is that nearby pixels are likely to have very similar F , B, and α values, and thus that
the sets of foreground and background samples considered for nearby pixels are likely
to have many common elements. Much unnecessary computation can be avoided
by creating disjoint sample sets for each pair of adjacent pixels and then asking adja-
cent pixels to share their choices for the best samples to come up with estimates
for Fi and Bi . This method for computing α̂i is extremely efficient, and the α̂i ’s are
already quite good even without minimizing Equation (2.74), potentially leading to a
real-time video matting algorithm.
$$L_F(I) = \frac{f_F(I)}{f_F(I) + f_B(I)} \tag{2.76}$$
In this case, the weight is small if the two pixels have similar foreground likeli-
hoods (or equivalently, similar background likelihoods). The weighted shortest paths
between an unknown pixel i and both the foreground and background scribbles are
computed using a fast marching algorithm [565]; let these distances be DF (i) and
DB (i). Then Bai and Sapiro proposed to estimate α as
$$\alpha_i = \frac{D_F(i)^{-r} L_F(I_i)}{D_F(i)^{-r} L_F(I_i) + D_B(i)^{-r} (1 - L_F(I_i))} \tag{2.77}$$
Finally, we mention Poisson matting [478], one form of gradient-based image editing.
We will discuss similar methods more extensively in Chapter 3. The user begins by
specifying a trimap. We first take the spatial gradient of the matting equation on both
sides:
$$\nabla I = (F - B) \nabla \alpha + \alpha \nabla F + (1 - \alpha) \nabla B \tag{2.78}$$
This gradient is typically taken in the intensity channel of the image. If the foreground
and background are relatively smooth compared to α, then the first term dominates
the other two and we can make the approximation
$$\nabla \alpha \approx \frac{1}{F - B} \nabla I \tag{2.79}$$
That is, the matte gradient is proportional to the image gradient. Interpreted as
a continuous problem, this gives a differential equation for α inside the unknown
region U with boundary conditions on ∂U given by the known values of α in the
foreground and background regions. That is, we want to minimize
$$\iint_{(x,y)\in U} \left\| \nabla\alpha(x,y) - \frac{\nabla I(x,y)}{F(x,y) - B(x,y)} \right\|^2 dx\,dy \tag{2.80}$$
As we’ll discuss in Section 3.2, minimizing Equation (2.80) turns out to be the same
as solving the Poisson equation with the same boundary conditions, i.e.,
$$\nabla^2\alpha = \operatorname{div}\left(\frac{\nabla I}{F - B}\right) \tag{2.82}$$
The Poisson equation can be solved quickly and uniquely, if we know F − B at
the pixel. In practice, this quantity is estimated using the nearest labeled foreground
and background pixels and smoothed before solving the equation. After α has been
computed, F − B can be refined using pixels that have been estimated to have very
high and very low α, and the process iterated.
This process works reasonably well when the foreground and background are both
smooth, justifying the approximation in Equation (2.79). If the matte fails in a region
where the foreground and/or background image has locally strong gradients, then
the user can try to apply further constraints and relax Equation (2.79) in just this
subregion.
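To make the Poisson step concrete, here is a one-dimensional analogue, offered as a sketch rather than as the algorithm of [478]. It discretizes the Poisson equation on a scanline whose two end samples have known α (the boundary conditions) and solves the resulting tridiagonal system directly. The function name and the choice of averaging F − B onto the edges between samples are assumptions made for illustration; a real implementation would work on the 2D unknown region U with a sparse solver.

```python
import numpy as np

def poisson_matte_1d(intensity, f_minus_b, alpha_left, alpha_right):
    """Solve alpha'' = d/dx (I' / (F - B)) on a scanline with fixed end values."""
    I = np.asarray(intensity, dtype=float)
    fb = np.broadcast_to(np.asarray(f_minus_b, dtype=float), I.shape)
    n = I.size  # requires n >= 3 so that at least one sample is unknown
    # Guidance field on the edges between samples: (I[k+1] - I[k]) / (F - B).
    g = np.diff(I) / (0.5 * (fb[:-1] + fb[1:]))
    # Divergence of the guidance field at the interior samples.
    rhs = np.diff(g)
    # Discrete Laplacian on the interior samples (Dirichlet boundaries).
    m = n - 2
    A = -2.0 * np.eye(m) + np.eye(m, k=1) + np.eye(m, k=-1)
    rhs[0] -= alpha_left
    rhs[-1] -= alpha_right
    alpha = np.empty(n)
    alpha[0], alpha[-1] = alpha_left, alpha_right
    alpha[1:-1] = np.linalg.solve(A, rhs)
    return alpha

# Synthetic check: constant F and B, so the approximation (2.79) is exact.
true_alpha = np.linspace(0.0, 1.0, 8) ** 2
F, B = 200.0, 50.0
I = true_alpha * F + (1.0 - true_alpha) * B
alpha_est = poisson_matte_1d(I, F - B, alpha_left=0.0, alpha_right=1.0)
```

With constant F and B the recovered matte matches the true one. When F or B has strong gradients of its own, the neglected terms of Equation (2.78) leak into the estimate, which is the failure case described above.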
It’s important to understand the relationship between the matting problem and
image segmentation. The key difference is that the goal of segmentation is to decom-
pose an image into disjoint pieces that fit together to form a whole. In traditional
segmentation, the edges of the pieces are hard, not fuzzy, and a segmentation can be
defined by an integer label for every pixel in the image. In the case where only two
pieces are desired, that is, foreground and background, we can label the pieces by 1
and 0 respectively and think of a segmentation as a coarse matting problem with no
fractional α values. These hard-edged pieces are unlikely to be acceptable for generat-
ing visual effects, but several researchers have proposed methods for turning a hard
segmentation into a “soft” segmentation or matte. The best known of these
methods, GrabCut, is a highly competitive user-guided matting algorithm.
Boykov and colleagues showed that the globally optimal minimum cut could
quickly be computed in low-order polynomial time [60], leading to an explosion
of interest in graph-cut methods in the computer vision community. Appendix A.3
gives more details on the basic algorithm. GPU-based [515] and multi-core [296]
algorithms have been proposed to further accelerate finding the minimum cut.
As with scribble-based matting, the user designates certain pixels to belong to the
foreground F and others to the background B. For a labeled foreground pixel i, the
weight on edge (i, B) is set to 0 and the weight on edge (i, F) is set to infinity (or a very
large number) to force the minimum cut to assign i to the foreground. The reverse
is true for labeled background pixels. The scribbles also serve to generate weights
for connecting the rest of the nodes to the terminals. Boykov and Jolly originally
Figure 2.19. (a) The configuration of nodes and edges for graph-cut-based segmentation. Each
pixel is connected to its neighbors as well as to two special foreground and background terminals.
(b) A cut (dotted line) removes edges so that there is no path from the foreground terminal to
the background terminal.
Figure 2.20. (a) An original image with foreground/background scribbles. (b) A hard segmen-
tation produced with graph cuts.
For example, if fB (Ii ) is very low, then wi,F will be very high, making it much more
likely that the edge between i and B is cut. The inter-node weights are computed
using a simple similarity measure
$$w_{ij} = \frac{1}{\mathrm{dist}(i,j)} \exp\left(-\frac{\|I_i - I_j\|^2}{2\sigma^2}\right) \tag{2.85}$$
Blake et al. [49] showed how the parameter σ could be estimated based on the local
contrast of an image sample. Figure 2.20 illustrates a segmentation of an image from
scribbles with this original graph-cut formulation. If the segmentation is incorrect in
a subregion, new foreground/background scribbles can be added and the solution
quickly updated without recomputing the minimum cut from scratch.
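The inter-node weights of Equation (2.85) are straightforward to compute for a 4-connected pixel grid, where dist(i, j) = 1 for every neighboring pair. The sketch below is one possible vectorized form; the function name is hypothetical, and σ is simply passed in as a parameter rather than estimated from local contrast as in [49].

```python
import numpy as np

def neighbor_weights(image, sigma):
    """Weights of Equation (2.85) between 4-connected neighbors (dist = 1).

    Returns (w_h, w_v): weights on horizontal edges, shape (H, W-1),
    and on vertical edges, shape (H-1, W).
    """
    img = np.asarray(image, dtype=float)
    if img.ndim == 2:            # grayscale: add a singleton channel axis
        img = img[..., None]
    # Squared color differences between adjacent pixels.
    dh = np.sum((img[:, 1:] - img[:, :-1]) ** 2, axis=-1)
    dv = np.sum((img[1:, :] - img[:-1, :]) ** 2, axis=-1)
    w_h = np.exp(-dh / (2.0 * sigma ** 2))
    w_v = np.exp(-dv / (2.0 * sigma ** 2))
    return w_h, w_v

# A tiny image with a strong vertical edge between the second and third columns.
demo = np.array([[10, 10, 200],
                 [10, 10, 200]])
w_h, w_v = neighbor_weights(demo, sigma=10.0)
```

Edges that cross the intensity discontinuity receive weights near zero, so they are cheap to cut, while edges inside the flat regions keep weights near one.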
Finding the minimum cut is actually the same as minimizing a Gibbs energy of the
form of Equation (2.54) when α is restricted to be binary (i.e., 0 for background and
1 for foreground). The edge weights between pixels and the foreground/background
terminals make up the data energy term Edata and the inter-node weights make up
the smoothness energy Esmoothness . That is,
$$E_{\text{smoothness}}(\alpha_i, \alpha_j) = |\alpha_i - \alpha_j| \cdot \frac{1}{\mathrm{dist}(i,j)} \exp\left(-\frac{\|I_i - I_j\|^2}{2\sigma^2}\right) \tag{2.87}$$
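The equivalence between the minimum cut and the minimum-energy binary labeling can be checked by brute force on a very small problem. The sketch below enumerates every binary labeling of a six-pixel 1D signal, enforces the scribbles as hard constraints, and scores each labeling with the smoothness energy of Equation (2.87). It is only an illustration: the data term for unscribbled pixels is omitted (an assumption that keeps the example short), the function names are hypothetical, and exhaustive search stands in for a real max-flow/min-cut solver.

```python
import itertools
import numpy as np

def smoothness_energy(labels, intensities, sigma):
    """Sum of Equation (2.87) over adjacent pairs of a 1D signal (dist = 1)."""
    a = np.asarray(labels, dtype=float)
    I = np.asarray(intensities, dtype=float)
    w = np.exp(-np.diff(I) ** 2 / (2.0 * sigma ** 2))
    return float(np.sum(np.abs(np.diff(a)) * w))

def best_binary_labeling(intensities, fg_seeds, bg_seeds, sigma):
    """Exhaustively find the lowest-energy labeling that respects the scribbles."""
    n = len(intensities)
    best, best_energy = None, np.inf
    for labels in itertools.product((0, 1), repeat=n):
        # Hard constraints: scribbled pixels keep their labels
        # (the role of the infinite-weight terminal edges).
        if any(labels[i] != 1 for i in fg_seeds):
            continue
        if any(labels[i] != 0 for i in bg_seeds):
            continue
        energy = smoothness_energy(labels, intensities, sigma)
        if energy < best_energy:
            best, best_energy = labels, energy
    return best, best_energy

signal = [10, 12, 11, 200, 198, 205]
labels, energy = best_binary_labeling(signal, fg_seeds=[0], bg_seeds=[5], sigma=10.0)
```

The winning labeling switches from foreground to background exactly at the large intensity jump, which is where the cheapest edge lies. Exhaustive search costs 2^n evaluations and is usable only for toy inputs, whereas the graph-cut algorithms cited above reach the same optimum in polynomial time.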
Li et al. [280] proposed an algorithm called lazy snapping that speeds up the
graph-cut segmentation algorithm by operating on superpixels instead of pixels.
That is, the image pixels are clustered into small, roughly constant color regions
using the watershed algorithm [514]. These regions then become the nodes of the
graph-cut problem, since it’s assumed that all pixels within a superpixel have the
same label. Since there are typically about ten to twenty times fewer nodes and edges
in the superpixel problem, the cut can be computed at interactive rates. Liu et al. [296]
proposed an interactive algorithm called paint selection that uses an efficient multi-
core graph cut algorithm to progressively hard segment an image as the user drags
the mouse around an object boundary.
2.8.2 GrabCut
Rother et al. [405] were the first to extend graph-cut segmentation to the matting
problem. The basic idea of their GrabCut algorithm is to first compute a hard seg-
mentation using graph cuts (Figure 2.21a), and then to dilate the border around the
hard edge to effectively create a trimap (Figure 2.21b). Inside the unknown region of
the trimap, an α profile that transitions smoothly from 0 to 1 is fit (Figure 2.21c).
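The step from a hard segmentation to a trimap (Figure 2.21a-b) can be sketched with a plain binary dilation. In the code below, pixels that lie within a fixed radius of both the foreground and the background are marked unknown. The square structuring element, the radius parameter, and the 0 / 0.5 / 1 encoding are illustrative assumptions; the code does not reproduce the contour-based border construction of [405].

```python
import numpy as np

def dilate(mask, radius):
    """Binary dilation with a (2*radius + 1)-square structuring element."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    h, w = mask.shape
    # OR together every shifted copy of the mask inside the window.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def trimap_from_segmentation(fg_mask, radius=1):
    """Return a trimap: 1 = foreground, 0 = background, 0.5 = unknown band."""
    fg = np.asarray(fg_mask, dtype=bool)
    # Pixels close to both regions straddle the hard edge.
    unknown = dilate(fg, radius) & dilate(~fg, radius)
    trimap = np.where(fg, 1.0, 0.0)
    trimap[unknown] = 0.5
    return trimap

# Hard segmentation: the right half of a 5 x 6 image is foreground.
seg = np.zeros((5, 6), dtype=bool)
seg[:, 3:] = True
trimap = trimap_from_segmentation(seg, radius=1)
```

A larger radius widens the unknown band, which gives the subsequent α profile fit more room at the cost of more pixels to estimate.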
The process begins with user input in the form of a simple bounding box around
the foreground (i.e., a garbage matte). Everything outside the box is assumed to be
background with α = 0, and everything inside the box is assumed to be unknown, with
an initial estimate of α = 1. As in Bayesian matting, initial Gaussian mixture models
are fit to the foreground and background intensities inside and outside the box. The
GrabCut algorithm iterates three steps until the binary α labels have converged:
Figure 2.21. (a) A hard segmentation. (b) A trimap is created by dilating the foreground-
background boundary. (c) Parameters of a smooth transition of α between 0 and 1 are fit inside
profiles of the unknown region (dotted lines).
So far, this chapter has focused on the problem of natural image matting from a
single image. Of course, for visual effects, we must compute mattes that last several
seconds, resulting in hundreds of frames (Figure 2.22). The result is sometimes called
a traveling matte. Certainly, any of the methods outlined earlier may be applied
frame by frame, but the process would be extremely time-consuming, and it would be
impractical to expect a user to provide a trimap or scribbles in each frame. Furthermore,
there is no guarantee that the results from adjacent frames will vary smoothly,
which could lead to visually unacceptable “jitter.” In this section, we overview tech-
niques for video matting, which exploit the temporal coherence of the input video
and desired output mattes.
Just as single-image matting is related to image segmentation, video matting is
related to the well-known problem of visual tracking. The goal of visual tracking is to
estimate the location of one or more objects in a video sequence, preferably ensuring
that the estimated locations vary smoothly and that tracking is not lost in the presence
of occlusions or object crossings. However, despite the huge amount of research and
substantial advances in the field of tracking, most such methods are not immediately
applicable to video matting for the same reason that many segmentation results are
not immediately applicable to single-image matting. That is, the output of a typical
tracking algorithm is a bounding rectangle or ellipse for each object, which is much
too coarse to use for high-quality matting and composition. Even if the estimated
foreground pixels in each frame form a relatively tight fit around the object, we still
Figure 2.22. Video matting requires many similar matting problems to be solved over a large
number of frames.
have the same problems as hard segmentation in the presence of wispy or semi-
transparent foreground objects. Nonetheless, many video matting algorithms begin
with the extraction of temporally consistent, hard-edged foreground pieces in each
frame of video.
Generally, video matting algorithms depend on the optical flow estimated from
the image sequence, which is defined as the dense correspondence field correspond-
ing to the apparent motion of brightness patterns. That is, we compute a vector at
pixel (x, y) at time t of the video sequence that points at the apparent location of that
pixel at time t + 1. This vector field can then be used to propagate the matte estimated
from time t to time t + 1. Section 5.3 discusses the optical flow problem in detail.
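As a rough sketch of this propagation step, the function below pushes each α value from frame t along its flow vector to the nearest pixel in frame t + 1. The array layout (flow[..., 0] holding the x displacement and flow[..., 1] the y displacement), the nearest-pixel rounding, and the choice to leave unreached pixels at zero are all simplifying assumptions; a practical system would interpolate and fill holes.

```python
import numpy as np

def propagate_matte(alpha_t, flow):
    """Forward-splat the matte at time t to time t + 1 using optical flow."""
    alpha_t = np.asarray(alpha_t, dtype=float)
    flow = np.asarray(flow, dtype=float)
    h, w = alpha_t.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Destination of each pixel, rounded to the nearest pixel center.
    xt = np.rint(xs + flow[..., 0]).astype(int)
    yt = np.rint(ys + flow[..., 1]).astype(int)
    inside = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)
    alpha_next = np.zeros_like(alpha_t)
    # Pixels that no vector lands on stay 0; if several vectors land on
    # the same pixel, one of them overwrites the others.
    alpha_next[yt[inside], xt[inside]] = alpha_t[inside]
    return alpha_next

# A one-pixel-wide foreground stripe moving one pixel to the right.
alpha0 = np.zeros((3, 4))
alpha0[:, 1] = 1.0
flow01 = np.zeros((3, 4, 2))
flow01[..., 0] = 1.0
alpha1 = propagate_matte(alpha0, flow01)
```

Errors in the flow field are copied directly into the propagated matte, which is one reason the methods discussed below re-estimate or refine the matte in each frame instead of relying on propagation alone.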
Layered motion techniques represented an early approach to the video matting
problem. For example, Wang and Adelson [528] proposed to cluster the pixels of a
video sequence into multiple layers by fitting multiple affine motions to its optical
flow field, while Ayer and Sawhney [23] proposed an expectation-maximization algo-
rithm to estimate such affine motions based on the change in pixels’ appearance and
a minimum-description-length formulation for finding the number of layers. Ke and
Kanade [234] observed that if the layers arise from planar patches in the scene, the
corresponding affine transformations lie in a low-dimensional subspace, which acts
as a strong constraint for robust layer extraction.
Several video matting methods are somewhat direct extensions of single-image
matting algorithms to video, incorporating a temporal consistency prior to produce
smoothly varying, non-jittery α mattes. For example, Chuang et al. [96] built upon
Bayesian matting by combining it with optical flow. That is, the trimap at time t is
estimated by “flowing” user-generated trimaps from keyframes on either side using
the estimated optical flow fields. The trimaps are modified to ensure the foreground
and background regions are reliable before being input to the standard Bayesian
matting algorithm. If the background is roughly planar, projective transformations
can be estimated as the camera moves to build a background mosaic that acts as
a clean plate, which significantly helps the speed and quality of pulling the matte.
Wexler et al. [544] and Apostoloff and Fitzgibbon [16] proposed a related Bayesian
approach, using a similar mosaicing method to obtain the background before esti-
mating the matte, and modeling the prior distribution for α with a beta distribution
as mentioned in Section 2.3.2. They also incorporated a spatiotemporal consistency
prior on α, using learned relationships between the gradients of α and the original
image. The observation was similar to the basic assumption of Poisson matting: that
the matte gradient is roughly proportional to the image gradient.
Another family of approaches is based on extending the graph-cut methods of
Section 2.8 to hard foreground/background segmentation in video. These approaches
can be viewed as methods for rotoscoping, or manually outlining contours of fore-
ground objects in each of many frames of film. Agarwala et al. [8] proposed a
well-known method for semi-automatic rotoscoping based on joint optimization
of contours over a full video sequence, using manually traced keyframes and incre-
mental user edits as hard constraints and image edges as soft constraints. While in
this work, contours were represented as splines, graph-cut algorithms would allow
the segmentation in each frame to be much more detailed, that is, an arbitrary binary
matte. The human-assisted motion annotation algorithm of Liu et al. [288] discussed
in Section 5.3.6 also can be viewed as an interactive rotoscoping tool.
Li et al. [279] proposed a natural generalization of the Lazy Snapping work from
Section 2.8.1 to video. The Gibbs energy formulation is similar to the methods in
Section 2.8, but the nodes in the graph (here, image superpixels) are connected both in
space and time, with inter-frame edge weights estimated similarly to intra-frame edge
weights. Criminisi et al. [107] also posed video segmentation as a conditional random
field energy minimized with graph cuts, but added an explicit learned prior on the
foreground likelihood at a pixel based on its label in the previous two frames. Non-
binary α values in these techniques are typically obtained independently per frame by
applying a “border matting” algorithm similar to GrabCut (Figure 2.21b-c). Wang et al.
[530] also proposed a graph-cut-based video segmentation method, but extended
the superpixel formation, user stroking, and border matting algorithms to operate
natively in the space-time “video volume” formed by stacking the frames at each
time instant. Finally, Bai et al. [26] proposed to propagate and update local classifiers
applied at points distributed around the foreground boundary of the previous frame
to generate constraints for the graph cut at the current frame. This was followed by
a space-time version of robust matting (Section 2.6.1) that rewards consistency with
the α values from the previous frame.
video sequence from a fixed camera with fixed background and a moving foreground
object, so that a “clean plate” background image without the shadow can be created.
Finlayson et al. [141] proposed methods for removing shadows from images (e.g., the
unwanted shadow of a photographer) based on finding the edges of the shadow, esti-
mating shadow-free illumination-invariant images, and solving a Poisson equation.
Wu et al. [555] addressed a similar problem of shadow removal using a generalized
trimap (actually with four regions) in which the user specifies definitely-shadowed,
definitely-unshadowed, unknown, and shadow/object boundary regions. The algo-
rithm minimizes a Gibbs-like energy function built from the statistics of the regions.
Regardless of how a shadow is extracted, when composited into a new image it must
deform realistically with respect to the new background; some 3D information about
the scene is necessarily required for high-quality results (see Chapter 8).
Hillman [199] observed that in natural images, the subject is often illuminated
from behind, causing a highlight of bright pixels around the foreground boundary.
Like shadow pixels, these highlight pixels do not obey the matting equation’s assump-
tion and could be estimated by assuming a mixture of three colors (foreground,
background, highlight) rather than two.
Most of the algorithms in this chapter make the underlying assumption that the
foreground object is opaque, and that fractional α values arise from sub-pixel-sized
fine features combined with blur from the camera’s optics or motion. In this context,
it makes the most sense to interpret α as a measure of coverage by the foreground.
However, most of the methods in this chapter will fail in the presence of “optically
active” objects that are transparent, reflective, or refractive, such as a glass of water
(Figure 2.24). In this case, even though a pixel may be squarely in the foreground,
its color may arise from a distorted surface on the background. Pulling a coverage-
based matte of the foreground and compositing it on a new background will look
awful, since the foreground should be expected to distort the new background and
contain no elements of the old background. To address this issue, Zongker et al. [582]
proposed environment matting, a system that not only captures a coverage-based
matte of an optically active object but also captures a description of the way the
object reflects and refracts light. The method requires the object to be imaged in
the presence of different lighting patterns from multiple directions using a special
acquisition stage. The method was refined by Chuang et al. [98] and extended to
work in real-world environments by Wexler et al. [545].
Figure 2.24. It’s challenging to pull mattes of foreground objects that are transparent, reflective,
or refractive; environment matting algorithms were designed for this purpose. (a) Image with
foreground, (b) clean plate.
Figure 2.25. Two images of the same scene taken (a) with a flash and (b) without a flash. (c) The
difference of the flash and no-flash images results in a mostly-foreground image.
subtracting the without-flash image from the flash image, a “flash-only” image is
created that contains mostly foreground pixels and that can be used to generate a
good trimap. An extension of Bayesian matting is then used to generate the matte. In
a sense, flash matting reverses Smith and Blinn’s triangulation assumption: instead
of observing the same foreground in front of two different-color backgrounds, we
observe two different-color foregrounds in front of the same background. In practice,
it may be difficult to ensure that the foreground and background remain exactly the
same and that the flash doesn’t create strong shadows.
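The flash-only image and a rough trimap derived from it can be sketched in a few lines for grayscale inputs. The two thresholds below are hypothetical values chosen for the toy data, not values from the flash matting literature, and the matte itself would still come from the Bayesian matting extension mentioned above.

```python
import numpy as np

def flash_only_image(flash, no_flash):
    """Per-pixel brightness added by the flash, clipped at zero."""
    diff = np.asarray(flash, dtype=float) - np.asarray(no_flash, dtype=float)
    return np.clip(diff, 0.0, None)

def rough_trimap(flash_only, lo=10.0, hi=60.0):
    """0 = background, 1 = foreground, 0.5 = unknown (lo and hi are hypothetical)."""
    f = np.asarray(flash_only, dtype=float)
    trimap = np.full(f.shape, 0.5)
    trimap[f <= lo] = 0.0   # barely brightened by the flash: distant background
    trimap[f >= hi] = 1.0   # strongly brightened: nearby foreground
    return trimap

# Three pixels: near subject, ambiguous edge, distant background.
flash = np.array([[200.0, 120.0, 52.0]])
no_flash = np.array([[100.0, 90.0, 50.0]])
trimap = rough_trimap(flash_only_image(flash, no_flash))
```

The sketch relies on the flash falling off quickly with distance, so that only nearby surfaces brighten appreciably. For color images the difference could be reduced over channels before thresholding.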
If an estimate of the depth at each pixel is available in addition to a color image, it
forms a valuable prior on whether pixels should be classified as foreground or back-
ground, and can help disambiguate situations where the foreground and background
have nearly the same color. We’ll discuss many methods for directly acquiring or indi-
rectly estimating scene depth in Chapter 8. In the context of this chapter, methods
for estimating depth to aid in creating digital mattes include using a color-filtered
aperture to slightly offset the color channels [28] or a time-of-flight sensor to directly
acquire depth [580].
Nick Apostoloff, senior software engineer, Paul Lambert, compositing supervisor, and
Blake Sloan, software engineer from Digital Domain in Venice, California, and David
Geoghegan, Flame operator from LOOK Effects in Los Angeles, California, discuss the
process of image and video matting in feature filmmaking.
RJR: How does academic work in matting compare with how it’s done in the visual
effects world?
things have to be very robust and give very predictable results almost all the time, so
they’d much rather have something that requires a little bit of user interaction but
gets you eighty to ninety percent of the way there all the time. In academia, it’s often
all about minimizing user interaction, giving you fantastic results for a subset of the
images you’d experience in reality. You look at many of these matting papers and
you see very similar types of test images. Thankfully academic research is coming
around to the point of view that interactive user input is considered to be good, such
as adding incremental foreground and background scribbles to refine a matte. My
experience was that academic papers sometimes used very complicated models that
tend to work very well for some datasets and extremely poorly on other datasets,
whereas the film industry tends to split the difference and go for something that
works reliably most of the time.
Another big issue with the film industry is that you’re always working on video.
A lot of these single-image techniques work really well given user input, but getting
something to work well on video is incredibly hard because you need that temporal
consistency. On the other hand, most algorithms that have temporal consistency
built into them are amazingly slow and will only work on very small, very short image
sequences. It’s hard to get the best of both worlds. As far as I understand, everything’s
very manual in the film industry when it comes to things like matting. People will
still roto (rotoscope) out stuff by hand if they don’t have a blue screen shot instead of
using some of the more advanced computer vision techniques. That’s disappointing
since I’d like to see some of the progress that’s made on the academic side come back
to the film industry — but getting that reliability is hard.
RJR: Can you describe how an artist does blue-screen or green-screen matting in
practice?
Lambert: Say I’m assigned a shot that has a green-screen background and I have to
pull a key to put over a particular new background. My keying program will have sev-
eral built-in algorithms to try: Ultimatte, Primatte, Knockout, channel differencing,
and so on. Each of these corresponds to a rule for processing the RGB colors of each
pixel to produce a matte. For example, you may subtract weighted combinations of
the red and blue from the green channel to make the alpha, changing the weights
of the red and green channels depending on the subject matter. Often as part of the
process you click around on the green screen to find a background color that will
produce a good matte. From experience, you know that if you pick a brighter color,
you’ll pick up a certain amount of foreground detail, and if you go with a darker color,
you’ll pick up another kind of detail.
I invented one algorithm called the IBK keyer that’s part of a software pack-
age called Nuke. It generates a background model on a per-pixel basis, so rather
than just feeding the keyer one background green color you actually feed it a
whole varying-green image. In essence, you’re expanding the green-screen col-
ors into the foreground to make a clean green-screen plate that has a nice color
gradient.
When I was first trying to work all of this stuff out, I was obsessed with finding a
perfect way to do keying, but over time you realize that you never pull a key without
Figure 2.26. (a) Blue-screen and green-screen matting is pervasive in modern visual effects.
This shot from Source Code illustrates a matting problem with a wide range of non-binary alpha
values due to blowing, wispy hair. (b) This shot from Iron Man 2 illustrates a difficult natural
image matting problem. For example, the globe in the foreground contains many intricate, thin
structures that must be outlined to be able to composite it convincingly onto a different back-
ground. Source Code courtesy of Summit Entertainment, LLC. Iron Man 2 images appear courtesy
of Marvel Studios, TM & ©2010 Marvel and Subs. www.marvel.com.
knowing what the final background will be. You can make temporary mattes for
people to use in between, but for a fully finished professional look to an edge in
a composite, you have to create the matte with the background in mind. The best
case is when the background has a similar luminance to the original plate. If I’m
compositing the foreground onto a darker background, I know there are going to be
certain problems around the edges. If I’m compositing onto a brighter background,
like a flaring light or bright explosion, I know the background will show through the
foreground where the alphas aren’t exactly 1, and that I’m going to have to do extra
work to actually get the matte. So from just a single image you’re never going to be
able to pull a perfect key.
It’s also important for the overall pipeline to do things procedurally as opposed to
painting alphas in by hand, since if something else changes in the composite, your
paintstrokes may not be valid anymore. When you know you’re the very final step
and it’s got to go out because the client’s waiting to take the shot away on a hard
drive, then yes. If it’s the first comp and I know there are going to be fifty iterations of
the background with camera moves and changes in color correction, then I’m going
to do it procedurally because in the long run it’ll save me time.
RJR: What about the workflow when you need to pull a matte on a non green-screen,
natural background?
Lambert: A really good example was a movie set during World War II called Flags of
Our Fathers. The production made a conscious decision that they weren’t going to put
green or blue screens anywhere. We took on the show knowing that it was going to be
a bunch of roto — people running around, motion blur, and so on. If it’s a locked-off
shot, that is, the camera isn’t moving, you may be able to get some of the matte with
a difference matte, if you have a clean plate — but that will only get you to a certain
point. The only way to really extract it correctly is to roto it — actually draw Bezier
splines around the foreground. For example, if I had to roto a person off of a natural
background, I’d be drawing separate shapes for the person’s head, their arm, their
body, and so on for a bunch of keyframes spread across the sequence. I might put
keyframes every three frames — so go three frames, adjust the splines, three frames,
adjust, play it back and see if it’s actually matching, and then apply extra keyframes
just to make it a bit better. Then you can generate the alpha by adding a softer edge to
the roto curves or looking at their direction and speed of motion, and you may need to
animate that softer edge on a per-frame basis. It’s a very time-consuming process! You
would hope that if the production knows in advance that they’ll need to extract a per-
son with flowing hair from a plate that they’ll put a green screen behind them, since
to extract them otherwise it’ll be very labor-intensive and cost half a million dollars.
Scribble-based kinds of approaches tend to work really well on a still frame, but
it’s when you go to video imagery where it’s changing frame to frame and you have
film grain structure to it, you find that those kind of algorithms jump around and
flicker.
Sloan: Very often they’ll end up having to use a clean plate that is either generated
or captured on set. But sometimes there will be people whose job it is to matte every
wisp of hair that has gotten outside of the blue screen area or for whatever reason
can’t be extracted very well. You typically go to a roto artist and they bring everything
they have to bear on the problem — which is typically patience and skill at drawing
those outlines. I have a feeling that whenever people say “we used our proprietary
software to solve this tough matting problem,” that means a bunch of hard-working
roto artists! Sometimes you end up having to make entirely computer-generated
versions of natural foreground elements that for some reason you didn’t get properly
in camera and you have to generate from scratch instead of trying to matte.
The natural matting problem is a huge issue for 3D conversion of movies filmed
with a single camera. They basically have to rotoscope everything, pulling out all the
foreground objects and assigning them different depths for the two eyes. Colorization
of old black-and-white movies is a similar problem. Often the heart of 3D conversion
algorithms is a planar tracker that allows you to interactively create and push forward
very precise roto shapes on each frame. The software package called mocha is one
very popular example of that kind of planar tracker. The output of it is something
similar to a trimap that the artist can use to synthesize motion blur or some kind of
fall-off to get the final alpha. The holy grail for 3D conversion is “decompositing” — to
take any foreground object, however nebulous or wispy or fragmented, and extract
only it from the scene.
Again, there are some academic matting algorithms that work wonders on still
photographs, but the minute you have a video with wispy hair blowing in front of
a forest, you can get a solution but it’s not natural. It’s critical that the alphas have
that spatial softness and temporal continuity that you need for believability — you
can’t have the big thick line around Godzilla that everyone tolerated in the ‘60s! A
third-party vendor may come out with a software plug-in for natural image matting,
but unless it really nails every problem you throw at it, it’s not going to become a
standard.
Matting for Hollywood movies was pioneered by Petro Vlahos and his son, Paul Vla-
hos, who patented many techniques related to blue-screen and green-screen matting
and compositing from the late 1960s to the early 2000s. They won several Oscars
for their contributions and founded the Ultimatte Corporation, which produces an
industry-standard product for blue- and green-screen matte extraction.
Originally, mattes were strips of monochrome film that were exposed (i.e.,
transparent) in regions corresponding to the foreground and opaque (i.e., black)
elsewhere — an analogue of the alpha channel discussed in this chapter. For early
special effects, different elements and mattes were laboriously segmented, aligned
and sandwiched together to produce a final composite for such films as Mary Pop-
pins, Superman, and the original Clash of the Titans. For more on the early history of
matting in film, see the book by Rickitt [393].
While Chuang et al. were the first to put the matting problem in the Bayesian
context, Ruzon and Tomasi [412] had previously proposed a related algorithm that
involved mixtures of isotropic Gaussian distributions (i.e., diagonal Σi) to model the
foreground and background. This is viewed as one of the first principled natural
matting algorithms from the computer vision/graphics community.
Singaraju et al. [454] showed that when the foreground or background intensities
of a window around a pixel are less general than the color line assumption (i.e.,
either or both is of constant color), then the closed-form matting equations permit
more degrees of freedom than necessary. They showed how to analyze the rank of
image patches to create an adaptive matting Laplacian that outperforms closed-form
matting in these situations.
We note that while most of the algorithms described here represented colors
using the usual RGB values, and measured similarity using Euclidean distance in
this space, some authors have recommended using a different color space that
better reflects when two colors are perceptually similar. For example, Ruzon and
Tomasi [412], Bai and Sapiro [25], and others proposed to use the CIE Lab color
space for matting operations, while Grady [178] used the linear transform of RGB
defined by the locality-preserving projections algorithm [193]. For Poisson matting,
Sun et al. [478] recommended a linear transform of RGB that is computed to mini-
mize the variance between the background samples. Another possibility would be to
use a higher-dimensional set of filter responses (e.g., a Gabor filter bank applied to
luminance as in [377]) which might be more sensitive to local texture in an image.
Note: in the homework problems we often assume that image color channels and
intensities are in the range [0, 255] rather than [0, 1] for more direct compatibility
with image manipulation tools like GIMP and Matlab.
2.1 Take or find a photograph for which hard segmentation of the foreground
is likely to fail for high-quality compositing.
2.2 Suppose it was observed that the RGB values at pixel i were Ii =
[200, 100, 40]⊤. Determine Bi and αi that are consistent with the hypotheses
that Fi = pure red, pure green, and pure blue, respectively. None of your α’s
should be 0 or 1.
2.3 Suppose we obtain the clean plate B given in Figure 2.27, and observe the
given image I . Determine two different values for the images F and α that are
consistent with the matting equation: one that conforms to human intuition
and one that is mathematically correct but is perceptually unusual. Assume
that the intensity of the middle circle is 128.
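$$\mu_F = \begin{bmatrix} 150 \\ 150 \\ 150 \end{bmatrix}, \qquad \Sigma_F = \begin{bmatrix} 20 & 5 & 5 \\ 5 & 30 & 8 \\ 5 & 8 & 25 \end{bmatrix} \tag{2.88}$$

$$\mu_B = \begin{bmatrix} 50 \\ 50 \\ 200 \end{bmatrix}, \qquad \Sigma_B = \begin{bmatrix} 5 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 15 \end{bmatrix} \tag{2.89}$$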
If the observed pixel color is [120, 125, 170]⊤, compute F, B, and α by alternating
Equation (2.16) and Equation (2.17), assuming σd = 2. Repeat the
experiment with σd = 10 and interpret the difference.
2.10 Continuing Problem 2.9, suppose the foreground is modeled with a mix-
ture of two Gaussian distributions: the one from Equation (2.88) and one
given by
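\[ \mu_{F_2} = \begin{bmatrix} 130 \\ 150 \\ 180 \end{bmatrix}, \quad \Sigma_{F_2} = \begin{bmatrix} 10 & 0 & 0 \\ 0 & 10 & 0 \\ 0 & 0 & 10 \end{bmatrix} \tag{2.90} \]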
Suppose we observed Ii = [100, 140, 110] and had estimated that αi = 0.5.
Estimate the optimal samples (Fi∗ , Bi∗ ) according to Equation (2.59). How
would the answer change if we instead used the distance ratio criterion in
Equation (2.70) from robust matting?
2.22 Show that minimizing the random walk objective function in
Equation (2.64) leads to the linear system in Equation (2.65).
2.23 Show that Equation (2.71) and Equation (2.73) have the same minimum
with respect to α (that is, the objective functions differ by a constant term).
2.24 A typical sigmoid function for border matting resembles Figure 2.21c and
can be parameterized using the profile
\[ p(w) = \frac{1}{1 + e^{-a(w - b)}} \tag{2.94} \]
where w ranges from 0 to 1. If we observe the samples {(wi , pi ), i = 1, . . . , s},
what would be good initial estimates for the values of a and b?
3 Image Compositing and Editing
wide array of image retouching that occurs in the world of commercial photogra-
phy — to the extent that it is now very difficult to judge whether a digital image
represents the true recording of an actual real-world scene.
We first discuss classical methods for blending image intensities across a hard
boundary or seam (Section 3.1). We then introduce gradient-domain blending
methods based on solving the Poisson equation, which typically produce a more
natural-looking transition (Section 3.2). We also consider an alternate approach
based on graph cuts: instead of fractionally weighting the source and target contribu-
tions near the seam, we try to find a seam such that a hard transition is imperceptible
(Section 3.3). We then address the problem of image inpainting — filling in “holes”
specified by the user with realistic texture (Section 3.4). We next introduce the con-
cept of image retargeting — changing the size, aspect ratio, or composition of an
image (Section 3.5). Finally, we discuss extensions of the various methods from still
images to video (Section 3.6).
As we mentioned earlier, high-quality mattes are essential for most visual effects
compositing problems, but these take substantial effort to create. It would be much
easier for the user to roughly outline the object to be extracted from one image and
placed in another, without interactively struggling to create a good matte. In this and
the following sections, we investigate methods for pasting a hard-edged foreground
region into a new background image, letting the algorithms do the work of creating a
pleasing (and hopefully imperceptible) blend.
Mathematically, we pose the problem as follows. We are given a source image
S(x, y), a binary mask image M (x, y) ∈ {0, 1} specifying the general outline of an object
or region in the source image, and a target image T (x, y). Our goal is to create a con-
vincing composite I (x, y) in which the source region is rendered on top of the target
image with minimal visible artifacts. This is sometimes called the “over” operation
for compositing two images [373]. We assume the two images are already aligned, so
that (x, y) corresponds to the same location in all the images.
Clearly, the matting equation (2.1) from the previous chapter encapsulates the
simplest way to composite the two images:
\[ I(x, y) = M(x, y)\, S(x, y) + (1 - M(x, y))\, T(x, y) \tag{3.1} \]
where the binary image M plays the role of the α matte. If we were to directly superim-
pose the source region on the target image using Equation (3.1) without any blending,
we would see a visible boundary — also known as a seam or matte line — between
the source and target (Figure 3.1).
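As a minimal sketch of this hard-edged composite, assume Equation (3.1) is the matting equation with M in place of α, that is, I = M S + (1 − M) T at every pixel. The function name and the grayscale list-of-rows image layout are illustrative assumptions; plain lists are used only to keep the example dependency-free.

```python
def hard_composite(source, target, mask):
    """Per-pixel I = M*S + (1 - M)*T for same-sized grayscale images.

    Each image is a list of rows; mask entries are 0 (target) or 1 (source).
    """
    return [
        [m * s + (1 - m) * t for s, t, m in zip(s_row, t_row, m_row)]
        for s_row, t_row, m_row in zip(source, target, mask)
    ]


S = [[200, 200, 200], [200, 200, 200]]
T = [[50, 50, 50], [50, 50, 50]]
M = [[0, 1, 1], [0, 0, 1]]
composite = hard_composite(S, T, M)
```

Because the mask is binary, every output pixel is copied from exactly one image, which is why any brightness mismatch shows up as a seam along the mask edge.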
In the early days of visual effects, this type of compositing was often used to insert
a special effects shot (e.g., a model spaceship) into a live-action plate. Similar tech-
niques are used to combine two videos of the same actor from multiple camera passes
3.1 Compositing Hard-Edged Pieces
Figure 3.1. The compositing problem with hard-edged pieces. (a) Source, target, and mask
images. (b) In the composite image, regions from the source and target images are separated
by a seam. We want to make the transition between source and target as imperceptible as
possible.
for “twinning” effects, such as in Friends, Back to the Future II, or Moon. For exam-
ple, an actor is filmed interacting with him/herself twice: once on the left side of the
screen and once on the right. In early versions of this effect, the seam between the
two shots was either very visible (e.g., a line down the middle of the screen) or hidden
by an obvious foreground object (e.g., a fence or tree). In this case, the problem is to
fuse two images I1 and I2 along a given seam, where neither image is naturally the
foreground or background and both are of equal importance.
Why were seams so visible using the simple technique of Equation (3.1)? Even
if the camera was locked down with identical location and exposure to take both
the source and target shots, lighting conditions between shots are extremely dif-
ficult to match exactly, and the human visual system is extremely sensitive to
the presence of edges, especially in constant-intensity, low-frequency regions (see
Figure 3.3a). The situation only becomes worse if one image is taken at a differ-
ent time or under different conditions than the other (for example, an actor shot
on a studio set is to be composited into an outdoor scene). Much of this chapter
is about the problem of hiding seams — both by choosing clever, non-straight-line
paths for the seams to take, and by more intelligently blending intensities across
the seam.
Figure 3.2. The composite contains a weighted average of source and target pixels across the
transition region. (Plot: the contribution fraction of the target and of the source, between 0 and 1,
across the target side, the transition region, and the source side.)
Figure 3.3. Possible compositing strategies illustrating source weights (top) and composites
(bottom). (a) A hard seam produces a visible, distracting edge. (b) A narrow, linearly-weighted
transition region still creates a visible seam. (c) A wider, linearly-weighted transition region can
result in low-detail regions around the boundary where the two images are averaged, resulting
in a diffuse “halo.”
from Section 2.8.2, since we’re effectively creating a non-binary alpha matte for the
source based on a hard foreground segmentation.
However, deciding on the width of the transition region (that is, the region in which
pixels are a mix between source and target) is difficult. If the region is too narrow, the
seam will still be visible, but if the region is too wide, the averaging in the transition
region will remove details, as illustrated in Figure 3.3b-c.
\[ K = [-0.05,\ 0.25,\ 0.6,\ 0.25,\ -0.05]^\top\, [-0.05,\ 0.25,\ 0.6,\ 0.25,\ -0.05] \tag{3.3} \]
For compositing, we’re interested in the edges that are significant at every scale,
which can be obtained by taking the difference of Gaussians at each scale:
\[ L_i = G_i - (K * G_i), \quad i = 0, \ldots, N - 1 \tag{3.4} \]
The images Li form what is called a Laplacian pyramid, since the shape of the two-
dimensional Laplacian operator (also known as the “Mexican hat” function) is similar
to a difference of Gaussians at different scales (we’ll discuss this property more in
Section 4.1.4). As illustrated in the bottom row of Figure 3.4, each image in the Lapla-
cian pyramid can be viewed as a bandpass image at a different scale. The smallest
image LN in the pyramid is defined to be a small, highly blurred version of the original
image, given by GN , while the other images contain edges prevalent at different image
scales (for example, L0 contains the finest-detail edges). Therefore, we can write the
original image as the sum of the images of the pyramid:
\[ I = \sum_{i=0}^{N} L_i \uparrow \tag{3.5} \]
Figure 3.4. The top row illustrates the Gaussian pyramid of an image I = G0 , generated by
filtering with the matrix in Equation (3.3) and downsampling by 2 at each step. Each image is
a smaller, blurrier version of its predecessor. The bottom row illustrates the Laplacian pyramid
for the image I , generated by successive differencing of the images in the top row according to
Equation (3.4). Each Laplacian image contains the edges at successively coarser scales.
where ↑ indicates the images have been upsampled and interpolated to the original
image resolution before summing them.
To compose a source image S onto a target image T using Burt and Adelson’s
approach, we first compute the Laplacian pyramids L^S and L^T for both images. We
also assume we have a binary mask M specifying the desired boundary, so that pixels
inside S have M = 1 and pixels inside T have M = 0, and compute a Gaussian pyramid
G for this mask. Then we compute a Laplacian pyramid {L^I} for the composite image
as follows:
\[ L_i^I(x, y) = G_i(x, y)\, L_i^S(x, y) + (1 - G_i(x, y))\, L_i^T(x, y), \quad i = 0, \ldots, N \tag{3.6} \]
We sum the Laplacian components according to Equation (3.5) to get the new image.
Effectively, the transition region is wider at lower spatial frequencies and narrower
at high spatial frequencies, producing a more natural transition between the source
and target. Figure 3.5 illustrates the process for the same images as in Figure 3.3; note
the higher quality of the composite and the relative lack of artifacts.
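The blending rule of Equation (3.6) can be sketched on a one-dimensional scanline. This is a simplified illustration, not Burt and Adelson's implementation: it assumes a "stack" variant with no downsampling (so the upsampling in Equation (3.5) is the identity), it uses the one-dimensional kernel from Equation (3.3), and it clamps indices at the borders. All function names are hypothetical.

```python
KERNEL = [-0.05, 0.25, 0.6, 0.25, -0.05]  # 1-D factor of K in Equation (3.3)


def blur(signal):
    """Convolve with KERNEL, clamping indices at the two ends."""
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(KERNEL):
            j = min(max(i + k - 2, 0), n - 1)
            acc += weight * signal[j]
        out.append(acc)
    return out


def gaussian_stack(signal, levels):
    """Return [G_0, ..., G_N], each a blurred copy of the previous level."""
    stack = [[float(v) for v in signal]]
    for _ in range(levels):
        stack.append(blur(stack[-1]))
    return stack


def laplacian_stack(signal, levels):
    """L_i = G_i - K*G_i for i < N (Equation 3.4), and L_N = G_N."""
    g = gaussian_stack(signal, levels)
    lap = [[a - b for a, b in zip(g[i], g[i + 1])] for i in range(levels)]
    lap.append(g[levels])
    return lap


def blend(source, target, mask, levels=4):
    """Combine the bands with Equation (3.6) and sum them (Equation 3.5)."""
    ls = laplacian_stack(source, levels)
    lt = laplacian_stack(target, levels)
    gm = gaussian_stack(mask, levels)
    out = [0.0] * len(source)
    for i in range(levels + 1):
        for x in range(len(source)):
            out[x] += gm[i][x] * ls[i][x] + (1.0 - gm[i][x]) * lt[i][x]
    return out


mask = [0] * 16 + [1] * 16
blended = blend([100.0] * 32, [0.0] * 32, mask)
```

Since the mask is blurred more at coarser levels, the low-frequency band is mixed over a wider region than the fine detail, which is the behavior described above. A real implementation would downsample between levels so that the blur grows geometrically.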
The general approach of a multiresolution filter-bank decomposition applies to
other operators besides the Laplacian. For example, a steerable pyramid [453] fur-
ther decomposes each bandpass image into the sum of orientation bands, which
can be used to selectively enhance or de-emphasize components at different ori-
entations. Another important alternative is a discrete wavelet transform (e.g.,
[277, 278]), which also represents images at different scales and can be computed very
efficiently.
Figure 3.5. Laplacian Image Compositing. (a) The target image. (b) The source image, indicating
the boundary of the compositing region. (c) Several levels of the Laplacian pyramid for the target
image. (d) Several levels of the Laplacian pyramid for the source image. (e) Several levels of the
Gaussian pyramid for the compositing mask. (f) The combination of the source and target at each
level according to Equation (3.6). (g) The final composite.
However, none of these pyramid-style methods are well suited to the situation
when the source and target colors are not already well matched, as we’ll see in the
next section.
3.2 Poisson Image Editing
Figure 3.6. Terminology for Poisson image editing. (Diagram labels: Ω, ∂Ω, source image.)
Figure 3.7. (a) The target image and (b) the source image, indicating the region to be com-
posited. (c) Laplacian pyramid blending fails when the source and target regions’ colors differ
by too much.
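then the calculus of variations implies that the I(x, y) that solves Equation (3.7) is a
solution of the Euler-Lagrange equation:
\[ \frac{\partial F}{\partial I} - \frac{d}{dx} \frac{\partial F}{\partial I_x} - \frac{d}{dy} \frac{\partial F}{\partial I_y} = 0 \quad \text{in } \Omega \tag{3.9} \]
Plugging Equation (3.8) into Equation (3.9) yields¹
\[ 2 \left( \frac{\partial^2 I}{\partial x^2} - \frac{\partial^2 S}{\partial x^2} \right) + 2 \left( \frac{\partial^2 I}{\partial y^2} - \frac{\partial^2 S}{\partial y^2} \right) = 0 \quad \text{in } \Omega \tag{3.10} \]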
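or more simply,
\[ \nabla^2 I(x, y) = \nabla^2 S(x, y) \quad \text{in } \Omega \tag{3.11} \]
\[ \text{s.t. } I(x, y) = T(x, y) \quad \text{on } \partial\Omega \tag{3.12} \]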
Figure 3.8. Discrete sets required for solving the Poisson equation using digital images. A small
image region is shown. The lightly shaded squares comprise Ω; the darker-shaded squares
comprise ∂Ω.
where we have used the common notation ∇²I = ∂²I/∂x² + ∂²I/∂y² for the Laplacian
operator.
An equation of the form (3.11) (with a generic right-hand side) is called a Poisson
equation, and a constraint of the form (3.12) is called a Dirichlet boundary condition.
If the right-hand side of Equation (3.11) is zero, it is called a Laplace equation². If the
constraint instead specifies the derivative of I normal to the boundary, it is called a
Neumann boundary condition.
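\[ \nabla^2 I(x, y) = \operatorname{div} \begin{bmatrix} S_x(x, y) \\ S_y(x, y) \end{bmatrix} = \frac{\partial S_x}{\partial x} + \frac{\partial S_y}{\partial y} \quad \text{in } \Omega \tag{3.13} \]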
2 The Laplace equation is the steady-state form of the heat equation (also called the diffusion equation).
3 For a readable refresher on vector calculus and derivatives, see the book by Schey [427].
2. N(p) ⊄ Ω. In this case (such as pixel B in Figure 3.8), the pixel is on the edge
of the source region, and the estimate of the Laplacian includes pixels from
the target that are specified by the boundary condition:
\[ \sum_{q \in N(p) \cap \Omega} I(q) + \sum_{q \in N(p) \cap \partial\Omega} T(q) - 4 I(x, y) \tag{3.15} \]
Typically, the region Ω is well inside the target image (i.e., surrounded by a
healthy border of target pixels). However, if Ω runs all the way to the image bor-
der, Equation (3.14) and Equation (3.15) need to be modified to avoid querying pixel
values outside the image. For example, if the upper left-hand corner (1, 1) ∈ Ω, we
would modify Equation (3.14) to
Collecting together all the equations for each p ∈ Ω results in a large, sparse linear
system. There are as many unknowns as pixels in Ω, but at most five nonzero elements
per row, with a regular structure on where these elements occur.⁴
Solving the Poisson equation for the example images in Figure 3.7 results in
the improved composite in Figure 3.9. As with the Laplacian pyramid, the Poisson
equation was applied to each color channel independently. We can see that the over-
all colors of the target image merge naturally into the source region, while keeping
the sharp detail of the source region intact.
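The discrete problem can be sketched as follows, assuming the per-pixel equation is that the four-neighbor Laplacian of I at each p in Ω equals the four-neighbor Laplacian of S, with I fixed to T on neighbors outside Ω. Instead of assembling and solving the sparse linear system directly, this sketch swaps in plain Gauss-Seidel iteration, which is slower but short. The function name, the grayscale list-of-rows layout, and the iteration count are illustrative assumptions.

```python
def poisson_composite(source, target, mask, iterations=500):
    """Gauss-Seidel solve of the discrete Poisson equation inside the mask.

    Images are same-sized lists of rows (grayscale). mask[y][x] == 1 marks
    Omega; mask pixels on the outermost image border are ignored, so Omega
    is assumed to lie strictly inside the image.
    """
    height, width = len(target), len(target[0])
    result = [[float(v) for v in row] for row in target]
    omega = [(y, x) for y in range(1, height - 1)
             for x in range(1, width - 1) if mask[y][x]]
    for y, x in omega:  # start from the hard paste of the source
        result[y][x] = float(source[y][x])
    for _ in range(iterations):
        for y, x in omega:
            nbrs = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            # Discrete Laplacian of the source at p (the right-hand side).
            lap_s = sum(source[j][i] for j, i in nbrs) - 4 * source[y][x]
            # Neighbors outside Omega still hold target values, which
            # imposes the Dirichlet boundary condition.
            result[y][x] = (sum(result[j][i] for j, i in nbrs) - lap_s) / 4.0
    return result


T = [[x * x + y for x in range(8)] for y in range(8)]
S = [[v + 50 for v in row] for row in T]  # same detail, uniformly brighter
M = [[1 if 2 <= y <= 5 and 2 <= x <= 5 else 0 for x in range(8)]
     for y in range(8)]
composite = poisson_composite(S, T, M)
```

In this toy case the source differs from the target only by a constant offset, so the solver removes the offset entirely and the composite matches the target. That is the color-matching behavior described above in its simplest form.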
We can obtain a slightly different interpretation of Equations (3.11)–(3.12) by
defining E(x, y) = I (x, y) − S(x, y) and rearranging:
\[ \nabla^2 E(x, y) = 0 \quad \text{in } \Omega \tag{3.17} \]
\[ \text{s.t. } E(x, y) = T(x, y) - S(x, y) \quad \text{on } \partial\Omega \tag{3.18} \]
That is, E(x, y) is a “correction” that we add to the source pixels to get the final image
pixels. We can think of E(x, y) as a smooth membrane that interpolates the samples
of the difference between the target and source pixels around the boundary of Ω.
Now Equation (3.17) is a Laplace equation, which implies that the solution E(x, y) is
a harmonic function. Once we compute E(x, y), we recover I (x, y) = S(x, y) + E(x, y).
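The rearrangement can be written out in two lines, taking Equations (3.11)–(3.12) to be the Poisson equation ∇²I = ∇²S in Ω with the Dirichlet condition I = T on ∂Ω (an assumption consistent with the definitions above) and using the linearity of the Laplacian:

```latex
\begin{align*}
\nabla^2 E &= \nabla^2 (I - S) = \nabla^2 I - \nabla^2 S = 0 && \text{in } \Omega, \\
E &= I - S = T - S && \text{on } \partial\Omega .
\end{align*}
```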
4 In fact, the same kinds of systems occurred when we considered the matting problem in Sections
2.4 and 2.6.
Figure 3.10. (a) The region Ω includes some key features of the target image. (b) Poisson image
compositing without modification creates unacceptable visual artifacts; the mountain’s color is
smudged into the source region. (c) Using mixed gradients to preserve the target edges in Ω is
a big improvement.
We’ve assumed that the pixels from the source image entirely overwrite whatever
pixels used to be in the same place in the target image. However, in some cases, it
may be appropriate for the original target pixels to “show through.” For example, we
may want to maintain some of the texture of the target image, or give the sense that
the source pixels are slightly transparent. In this case, we could use a guidance vector
field given by a mixture of the source and target gradients, such as:
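\[ \begin{bmatrix} S_x(x, y) \\ S_y(x, y) \end{bmatrix} = \begin{cases} \nabla T(x, y) & \text{if } \|\nabla T(x, y)\| > \|\nabla S(x, y)\| \\ \nabla S(x, y) & \text{otherwise} \end{cases} \tag{3.19} \]
This would preserve whatever gradients were stronger inside Ω. This is an example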
of a non-conservative vector field, so we must use Equation (3.13), not Equation (3.11)
(though the numerical implementation is basically the same). Figure 3.10 illustrates
an example.
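A minimal sketch of the selection rule in Equation (3.19), assuming forward differences for the gradient and grayscale images stored as lists of rows; the helper names are hypothetical. The selected vectors would then feed the divergence on the right-hand side of Equation (3.13).

```python
import math


def forward_gradient(image, y, x):
    """Forward-difference gradient (d/dx, d/dy) at pixel (x, y)."""
    return (image[y][x + 1] - image[y][x], image[y + 1][x] - image[y][x])


def mixed_guidance(source, target, y, x):
    """Pick the target gradient if it is strictly stronger, else the source's."""
    grad_s = forward_gradient(source, y, x)
    grad_t = forward_gradient(target, y, x)
    if math.hypot(*grad_t) > math.hypot(*grad_s):
        return grad_t
    return grad_s


flat = [[0] * 4 for _ in range(4)]
edge = [[100 if x >= 2 else 0 for x in range(4)] for _ in range(4)]
ramp = [[10 * y for _ in range(4)] for y in range(4)]
guidance_at_edge = mixed_guidance(flat, edge, 1, 1)
```

Where the source is flat and the target has an edge, the target gradient is kept, which is what lets target structure show through the pasted region.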
\[ \lambda_i(p) = \frac{w_i(p)}{\sum_{j=1}^{N} w_j(p)}, \quad i = 1, \ldots, N \tag{3.20} \]
where
\[ w_i(p) = \frac{\tan(\theta_{i-1}(p)/2) + \tan(\theta_i(p)/2)}{\|p_i - p\|}, \quad i = 1, \ldots, N \tag{3.21} \]
and θi (p) is the angle formed by pi , p, and pi+1 , as illustrated in Figure 3.11. (For the
purposes of Equation (3.21), p0 = pN .)
Then we can obtain an estimate of the harmonic function E(p) at any point in Ω
as a simple weighted combination of the boundary conditions:
\[ E(p) = \sum_{i=1}^{N} \lambda_i(p)\, (T(p_i) - S(p_i)) \tag{3.22} \]
As previously, we recover I(p) = E(p) + S(p). Note that we can directly evaluate
I(p) for any point in Ω by computing Equation (3.22), which is extremely efficient
and highly parallelizable. While the values of E do not precisely solve the Laplace
[Figure 3.11: diagram labeling the boundary points p_{i−1}, p_i, p_{i+1} and the angles θ_{i−1}, θ_i at p.]
Figure 3.12. Contours for drag-and-drop pasting. The outer contour ∂Ω was roughly drawn by
the user. The inner region Ω_obj is produced by the GrabCut algorithm. The drag-and-drop pasting
algorithm estimates an intermediate contour ∂Ω̂.
equation, Farbman et al. showed that the differences in typical compositing prob-
lems are imperceptible, and the fast formulation allows Poisson-type problems to be
solved in real time, allowing virtually instant compositing.
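The weights of Equations (3.20)–(3.21) and the interpolation of Equation (3.22) can be sketched as below. The sketch assumes the boundary is an ordered list of 2-D points forming a closed polygon, that p lies strictly inside it and not on a vertex, and that angles are wrapped cyclically; the function names are hypothetical.

```python
import math


def mean_value_weights(p, boundary):
    """Normalized weights lambda_i(p) for an ordered closed boundary."""
    n = len(boundary)

    def angle(a, b):
        # Signed angle at p between the rays toward a and toward b.
        ax, ay = a[0] - p[0], a[1] - p[1]
        bx, by = b[0] - p[0], b[1] - p[1]
        return math.atan2(ax * by - ay * bx, ax * bx + ay * by)

    # theta[i] is the angle formed by p_i, p, p_{i+1} (indices wrap around).
    theta = [angle(boundary[i], boundary[(i + 1) % n]) for i in range(n)]
    weights = []
    for i in range(n):
        dist = math.hypot(boundary[i][0] - p[0], boundary[i][1] - p[1])
        # theta[i - 1] wraps to the last angle when i == 0.
        weights.append((math.tan(theta[i - 1] / 2) + math.tan(theta[i] / 2)) / dist)
    total = sum(weights)
    return [w / total for w in weights]


def membrane_value(p, boundary, boundary_diffs):
    """E(p) as the weighted sum of T(p_i) - S(p_i), as in Equation (3.22)."""
    lam = mean_value_weights(p, boundary)
    return sum(l * d for l, d in zip(lam, boundary_diffs))


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
diffs = [2 * x + 3 * y + 1 for x, y in square]  # a linear boundary difference
value = membrane_value((1, 2), square, diffs)
```

Each interior point is evaluated independently of the others, which is what makes this formulation easy to parallelize. Mean-value weights reproduce linear functions exactly, so in this example the interpolated value equals 2·1 + 3·2 + 1 up to rounding.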
Jia et al. [219] proposed a variation to Poisson image editing called drag-and-drop
pasting, making the key observation that Poisson composites may have unappeal-
ing visual artifacts due to a user’s poor choice of boundary ∂Ω. Since the user is not
expected to provide a highly conformal outline for the source region, there is typically
some space between the user-provided boundary ∂Ω and the actual object boundary,
so that an intermediate contour ∂Ω̂ can be found to produce a more visually pleas-
ing composite. The idea is to optimize ∂Ω̂. Jia et al. observed that the less variation
along the boundary provided to the Laplace equation in Equations (3.17)–(3.18), the
smoother (lower-energy) the correcting membrane E would be. They proposed to
estimate ∂Ω̂ as a contour between the user-specified contour ∂Ω and the contour
∂Ω_obj produced by the GrabCut algorithm from Section 2.8.2 (Figure 3.12). The opti-
mal contour ∂Ω̂ is estimated by alternating the following two steps, starting from the
initial estimate ∂Ω̂ = ∂Ω:
1. Compute c as the average color of T(x, y) − S(x, y) on ∂Ω̂.
2. Find the ∂Ω̂ satisfying Ω_obj ⊂ Ω̂ ⊂ Ω that minimizes the average value
of ‖T(x, y) − S(x, y) − c‖² on ∂Ω̂. This contour can be computed using a
shortest-closed-path algorithm based on dynamic programming.
Jia et al. also proposed to incorporate the object’s alpha matte into the composit-
ing process by modifying the guidance vector field (Sx (x, y), Sy (x, y)), which further
mitigates visible differences between the source and target.
Lalonde et al. [261] observed that in many visual effects compositing situations, the
user may not care about pasting a particular source image into a scene, but instead
wants to insert an object from a certain class (e.g., the job is to populate a clean
plate of a street with cars and pedestrians). In this case, they proposed to leverage a
library of thousands of “clip art” foreground objects, automatically providing the user
with choices that fit well with the target scene in terms of estimated lighting, camera
orientation, and resolution. These objects are then composited into the target image
with an approach very similar to drag-and-drop pasting.
While we focused specifically on the compositing problem in this section, similar
gradient-domain techniques based on the Poisson equation have been applied in
several other areas of computer vision and graphics, including the removal of visible
seams in panorama construction [6], high dynamic range compression [136], and
locally changing color, illumination, and detail [364]. More generally, researchers
have proposed optimization frameworks that operate directly on image pixels and
their gradients for similar effects, such as Bhat et al.’s GradientShop [43].
3.3 Graph-Cut Compositing
Poisson image editing works very well when the source and target images are rela-
tively simple and smooth near the desired boundary. However, if the source or target
is highly textured, it may be difficult to manually guess a good boundary around
the source object that will be harmonious with the texture at the desired region in
the target image. Drag-and-drop pasting offers one approach to automatically esti-
mating a good boundary for gradient-domain compositing, but there may be no
low-energy contours in a highly textured image. As another consideration, the colors
of a gradient-domain composite may seem unnatural, since the original colors of the
source image inside the target region are not preserved.
An alternative is to not blend the images across a boundary at all, but instead
to select a region of the source image that can be directly copied to its position
in the target image in the most unobtrusive way possible. The idea is to hide the
compositing boundary in places where either the source and target are very similar,
or there is enough texture in the target to obscure the presence of a discontinuity.
This can be naturally viewed as a labeling problem: given a measure of the quality
of a boundary and certain constraints, which pixels should come from the source
and which from the target? This is quite similar to the graph-cut-based segmentation
problem from Section 2.8.2.
Suppose the user has aligned a source image S with a target image T , as illustrated
in Figure 3.13. The user designates a set of pixels S that definitely must come from
the source, and another set T that definitely must come from the target. These con-
straints are analogous to those of the trimap and scribbles in the matting problem of
Figure 3.13. Seam-based compositing. (a) The source image with a constrained set S. (b) The
target image with a constrained set T. (c) The final composite contains some pixels from the
source image (striped region) and some from the target image (white region), separated by a
seam.
Figure 3.14. Graph formation for seam-based compositing. (a) Nodes in the set S (black dots)
are attached to S with infinite weight and have no links to T; conversely, nodes in the set T (white
dots) are attached to T with infinite weight and have no links to S. Gray nodes are uncommitted
and are not directly connected to the terminals. (b) The minimum-cost cut separates the source
and target terminals and defines a seam in the composite image.
Chapter 2. The remaining pixels comprise a region where the source/target bound-
ary, or seam, is allowed to pass. We build a graph over the potential boundary pixels,
creating an edge between each pair of 4-neighbors. An easy choice for the weight w_ij
assigned to edge e_ij is:
\[ w_{ij} = \|S(i) - T(i)\| + \|S(j) - T(j)\| \tag{3.23} \]
That is, the cost is low if the source and target pixels have similar colors on either
side of a potential seam. We also create a pair of terminal nodes S and T, and create
edges (i, S) and (j, T) with infinite weight for every pixel i in the source-constrained set
and every pixel j in the target-constrained set. The optimal seam
is then defined as the minimum-cost cut that separates S from T, as illustrated in
Figure 3.14. Just like in the previous chapter, we use graph cuts [59] to solve the
problem. Figure 3.15 illustrates an example of graph-cut-based compositing.
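The seam search can be sketched on a single scanline. The sketch assumes the edge weight is the color mismatch at both endpoints, w_ij = |S(i) − T(i)| + |S(j) − T(j)| for scalar pixels, and it swaps in a small Edmonds-Karp max-flow routine in place of the specialized graph-cut solver cited above. All names are hypothetical, and the two constrained sets must be disjoint.

```python
from collections import deque

INF = float("inf")


def seam_weight(source, target, i, j):
    """Color mismatch at both endpoints of the edge (i, j)."""
    return abs(source[i] - target[i]) + abs(source[j] - target[j])


def source_side_of_min_cut(n_nodes, edges, source_set, target_set):
    """Edmonds-Karp max-flow; returns the pixels left on the source side."""
    src, sink = n_nodes, n_nodes + 1
    cap = [dict() for _ in range(n_nodes + 2)]

    def add(u, v, c):  # undirected edge: capacity c in both directions
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v][u] = cap[v].get(u, 0) + c

    for (i, j), w in edges.items():
        add(i, j, w)
    for i in source_set:
        add(src, i, INF)
    for j in target_set:
        add(j, sink, INF)

    while True:
        parent = {src: None}
        queue = deque([src])
        while queue and sink not in parent:  # breadth-first search
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:  # no augmenting path: the cut is found
            return set(parent) - {src}
        bottleneck, v = INF, sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:  # push flow along the path
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u


S = [100] * 8
T = [20, 20, 20, 95, 95, 70, 70, 70]
edges = {(i, i + 1): seam_weight(S, T, i, i + 1) for i in range(7)}
from_source = source_side_of_min_cut(8, edges, {0, 1}, {6, 7})
composite = [S[i] if i in from_source else T[i] for i in range(8)]
```

The cut lands between pixels 3 and 4, where the two scanlines nearly agree, so the seam is placed where the jump between source and target is smallest. For a real image the same routine would take the 4-neighbor grid edges instead of a chain.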
This approach to compositing was first proposed by Efros and Freeman [128] and
Kwatra et al. [259], although their primary interest was synthesizing realistic texture
(see Section 3.8). To bias the seam to go through higher-spatial-frequency regions
where it will be less noticeable, Kwatra et al. suggested modifying Equation (3.23) to:
where d is the direction of the edge (i.e., the vector pointing from pixel i to pixel j).
This way, the denominator will be small and the weight will be large if the seam passes
through a low-frequency (i.e., small-gradient) region.
We can generalize the graph-cut approach to deal with compositing multiple over-
lapping sources at the same time. For example, we may have several very similar
pictures of a family portrait, and want to create a composite that contains the best
view (e.g., eyes open, smiling) of each person. In this case, no single image acts as
the target; instead we begin with several registered source images S1 , . . . , SK and want
to create a composite I (x, y) where each pixel comes from one of the source images.
We want to bias contiguous chunks to come from each image, and as before, to hide
the seams in perceptually unimportant regions. Using graph cuts to make such a
Figure 3.15. An example of graph-cut compositing. (a) Target image. (b) Source image, with user
strokes overlaid to indicate regions that must be included. (c) Graph-cut composite. (d) Region
labels used to form the composite (black pixels are from (a), white pixels are from (b)).
multi-image photomontage was suggested by Agarwala et al. [7], building off Kwatra
et al.’s framework.
The basic idea is to minimize a Gibbs energy of the form:
\[ E(L) = \sum_{i \in V} E_{\text{data}}(L(i)) + \sum_{(i,j) \in E} E_{\text{smoothness}}(L(i), L(j)) \tag{3.25} \]
Here, V is the set of pixels in the output image, E is the set of all adjacent pixels
(for example, 4-neighbors), and L is a labeling; that is, an assignment in {1, . . . , K } to
each pixel i. For the multi-image compositing problem, the user paints initial strokes
in each source image, signifying that pixels stroked in image Sk must have label k in
the final composite. Natural forms of the two energy terms are:
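\[ E_{\text{data}}(L(i) = k) = \begin{cases} 0 & \text{if pixel } i \text{ is stroked in } S_k \\ \infty & \text{if pixel } i \text{ is stroked in some image } S_j \neq S_k \\ 0 & \text{otherwise} \end{cases} \tag{3.26} \]
\[ E_{\text{smoothness}}(L(i) = k, L(j) = l) = \|S_k(i) - S_l(i)\| + \|S_k(j) - S_l(j)\| \tag{3.27} \]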
Figure 3.16. A multi-image photomontage created with α-expansion. (a) The original images,
color-coded as red, green, and blue. Each image has some unsatisfactory facial expressions. (b) A
user scribbles on faces and body parts to keep from each source image, resulting in the labeling
map at right. Note that in several cases, a person’s head is taken from one source image and
their body from another. (c) The final composite.
We can modify Equation (3.27) similarly to Equation (3.24) to bias seams to lie along
existing image edges.
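To make the energy of Equations (3.25)–(3.27) concrete, the following sketch evaluates it for a candidate labeling. It assumes scalar (grayscale) pixels indexed by a single integer, labels numbered from 0, and strokes given as a dictionary from pixel index to required label; these names and conventions are illustrative only. It scores a labeling and does not perform the minimization.

```python
INF = float("inf")


def data_term(pixel, label, strokes):
    """Zero unless the pixel was stroked in a different image (Equation 3.26)."""
    required = strokes.get(pixel)
    return 0.0 if required is None or required == label else INF


def smoothness_term(i, j, k, l, images):
    """Mismatch between images k and l at both pixels (Equation 3.27)."""
    return (abs(images[k][i] - images[l][i])
            + abs(images[k][j] - images[l][j]))


def gibbs_energy(labels, images, strokes, edges):
    """Sum of data and smoothness terms over pixels and neighbor pairs."""
    energy = sum(data_term(i, labels[i], strokes) for i in range(len(labels)))
    energy += sum(smoothness_term(i, j, labels[i], labels[j], images)
                  for i, j in edges)
    return energy


images = [[10, 10, 10, 10], [10, 12, 30, 30]]
strokes = {0: 0, 3: 1}  # pixel 0 must use image 0, pixel 3 must use image 1
edges = [(0, 1), (1, 2), (2, 3)]
seam_in_middle = gibbs_energy([0, 0, 1, 1], images, strokes, edges)
seam_at_left = gibbs_energy([0, 1, 1, 1], images, strokes, edges)
```

Neighboring pixels with the same label contribute nothing, so only label changes are penalized. Here the labeling that switches images where they nearly agree scores lower than the one that switches across a large mismatch.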
Unfortunately, the graph-cut algorithm cannot directly minimize a function like
Equation (3.25) where we have more than two possible labels per pixel (i.e., more than
two terminal nodes). Instead, we use an extension called α-expansion [61]⁵, which
briefly works as follows. We cyclically iterate over the possible labels k = 1, . . . , K . For
a given label, we form an auxiliary minimum-cut problem in which each node will
either be associated with the label it had in the previous iteration, or the new label
k, which corresponds to a two-terminal graph (i.e., k or not-k). That is, at each step,
the region of pixels labeled k is allowed to expand by solving a new minimum-cut
problem. The algorithm stops after a cycle through all the labels fails to decrease the
cost function. While we aren’t guaranteed to find a global minimum of Equation (3.25)
as in the two-label case, the cost of the α-expansion result will be within a factor of 2
of the global minimum. Appendix A.3 gives details on the algorithm.
Figure 3.16 illustrates a result of fusing multiple images into a composite image
with this technique. Agarwala et al. [7] discussed how simple modifications to the
energies in Equations (3.26)–(3.27) could be used for other visual effects, such as
creating clean plates, constructing panoramas from multiple images without seams
or artifacts from moving objects, and interactively relighting an object using source
images taken under different lighting conditions. They also mentioned how to com-
bine the approach with Poisson compositing to improve results for dissimilar source
images. Rother et al. [404] extended the multi-image compositing approach to cre-
ate collages of very different source images (for example, to summarize in a single
composite the people and places contained in many different images of a vacation).
Johnson et al. [225] described how realistic composite “photographs” could be easily
created with graph-cut techniques based on a user’s rough placement of meaningful
regions on a blank canvas (e.g., “sky” above “building” above “water”) and a library
of labeled images.
3.4 Image Inpainting
While compositing different elements into a final scene is the backbone of visual
effects production, it is often necessary to remove objects from a scene, such as the
wires suspending a stunt performer or a gantry supporting special effects equipment.
In this case, we specify a region in a source image and want to replace it with pixels
that realistically “fill the hole.” There should be no blurring or visual artifacts that
would lead a viewer to believe the resulting image was manipulated. This hole-filling
problem is known as image inpainting. The term was introduced to the computer
vision community by Bertalmio et al. [40] to describe the process of removing thin
artifacts like scratches in an old photograph or an unwanted caption on a digital
image. However, the term has grown to include the general filling-in problem, no
matter how large the hole. This problem is also sometimes called image completion.
The best-case scenario is when we have a clean plate of the scene from exactly
the same camera perspective and under the same lighting conditions (similar to the
setup for difference matting). In this case, we can treat inpainting as a compositing
problem in which the clean plate plays the role of the source image. Even if an actual
clean plate is unavailable, visual effects artists can often synthesize a good approxi-
mation by stealing pixels from the unoccluded background in different frames, or by
5 There is no relationship between this α and α mattes from the previous chapter.
“warping in” background texture from images taken from different perspectives (see
Section 3.7). In this section, we focus on what can be done when a clean plate or an
approximation to one is unavailable, and we only have the pixels in the current frame
as the basis for inpainting.
Most inpainting algorithms iteratively fill in the target region Ω from its boundary
inward; in this context ∂Ω is sometimes called the fill front. We initialize the process
with
\[ I_0(x, y) = \begin{cases} I(x, y) & (x, y) \notin \Omega \\ \text{black} & (x, y) \in \Omega \end{cases} \tag{3.28} \]
Then at each step we produce a new image In+1 (x, y) based on In (x, y), until the
hole is filled in.
Here, we discuss two basic approaches to the problem: a partial-differential-
equation-based approach better suited to inpainting thin holes, and a texture-
synthesis approach better suited to inpainting large missing regions.
Figure 3.17. (a) The image and inpainting region. (b) Propagating image information normal to
the region boundary can produce unwanted artifacts. (c) Instead, we propagate along isophote
directions to maintain visual continuity.
Since the gradient at a pixel indicates the direction of greatest change, we can
locally compute the isophote direction as the unit vector perpendicular to the
gradient; that is,
\[ \nabla^\perp I(x, y) = \operatorname{unit} \begin{bmatrix} I(x, y) - I(x, y + 1) \\ I(x + 1, y) - I(x, y) \end{bmatrix} \tag{3.29} \]
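Equation (3.29) translates directly into a few lines of code. The sketch below assumes a grayscale image stored as a list of rows indexed as image[y][x], and it returns a zero vector in flat regions where the direction is undefined (a choice made here, not one stated in the text); the function name is hypothetical.

```python
import math


def isophote_direction(image, x, y):
    """Unit vector perpendicular to the forward-difference gradient at (x, y)."""
    grad_x = image[y][x + 1] - image[y][x]
    grad_y = image[y + 1][x] - image[y][x]
    perp_x, perp_y = -grad_y, grad_x  # rotate the gradient by 90 degrees
    norm = math.hypot(perp_x, perp_y)
    if norm == 0:
        return (0.0, 0.0)
    return (perp_x / norm, perp_y / norm)


vertical_edge = [[10 * x for x in range(4)] for _ in range(4)]
direction = isophote_direction(vertical_edge, 1, 1)
```

For an image whose intensity changes only with x, the gradient points along x and the isophote direction points along y, that is, along the edge.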
Putting all this together mathematically, the pixel intensities that fill the hole
should satisfy the partial differential equation (PDE):
\[ \nabla(\nabla^2 I) \cdot \nabla^\perp I = 0 \tag{3.30} \]
That is, the change in the Laplacian ∇(∇²I) should be zero in the direction of
the isophote ∇⊥I. Ideally, we could solve Equation (3.30) by creating an image that
changes as a function of time according to the following PDE:
\[ \frac{\partial I}{\partial t} = \nabla(\nabla^2 I) \cdot \nabla^\perp I \tag{3.31} \]
Figure 3.18. (a) The original image. (b) The inpainting mask. (c) After 4,000 iterations of PDE-
based inpainting, the wire locations are still perceptible as blurry regions. (d) After 10,000
iterations of PDE-based inpainting, the wires have disappeared.
Figure 3.19. A failure case of PDE-based inpainting. (a) The original image. (b) The inpainting
mask. (c) The result after 20,000 iterations of PDE-based inpainting. (d) The result after 200,000
iterations of PDE-based inpainting. The inpainting region is unacceptably blurry. (e) Poisson
compositing with a guidance vector field of 0 inside the inpainting mask, giving a similar result.
inside Ω. This would result in a second-order PDE (i.e., a Laplace equation with
Dirichlet boundary conditions) as opposed to the third-order PDE of Equation (3.30).6
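The second-order alternative can be illustrated with a small relaxation solver: each hole pixel is repeatedly replaced by the mean of its four neighbors while the known pixels stay fixed as boundary values. This is only a sketch under stated assumptions: the function name `laplace_fill`, the Gauss-Seidel sweep order, the fixed sweep count, and the requirement that the hole not touch the image border are illustrative choices, not part of the original text.

```python
def laplace_fill(I, hole, sweeps=500):
    """Fill hole pixels by solving the discrete Laplace equation.

    I     : list of rows of floats (grayscale image).
    hole  : set of (x, y) index pairs to fill; assumed not to touch the
            image border, so every hole pixel has four neighbors.
    sweeps: number of Gauss-Seidel sweeps (an illustrative choice).
    Known pixels act as fixed (Dirichlet) boundary values.
    """
    out = [row[:] for row in I]
    for (x, y) in hole:
        out[x][y] = 0.0          # "black" initialization, as in Equation (3.28)
    for _ in range(sweeps):
        for (x, y) in hole:
            out[x][y] = 0.25 * (out[x - 1][y] + out[x + 1][y] +
                                out[x][y - 1] + out[x][y + 1])
    return out

# A linear ramp is harmonic, so the fill should reproduce it inside the hole.
ramp = [[float(y) for y in range(6)] for _ in range(6)]
hole = {(x, y) for x in (2, 3) for y in (2, 3)}
filled = laplace_fill(ramp, hole)
```

Because the solution of the Laplace equation is as smooth as possible, this kind of fill reproduces smooth shading well but cannot recreate texture, in line with the blurriness discussed around Figure 3.19.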
PDE-based inpainting techniques are a reasonable choice for certain visual effects
scenarios, such as painting out thin wires holding up a stunt performer. However,
a major drawback of PDE-based techniques is that the interior of the inpainted
region is inevitably smoother and blurrier than its surroundings, leading to unac-
ceptable visual artifacts when a large hole is located inside a textured region, as
illustrated in Figure 3.19. The patch-based methods we discuss next do not have this
shortcoming.
6 Since inpainting can take many iterations to converge, using the Poisson approach is also likely to
be much faster.
Figure 3.20. One cycle of patch-based inpainting. (a) The patch of pixels Ψ_p̂ centered around
the pixel on the fill front with the highest priority is selected. The patch of pixels Ψ_q̂ in the
source region that best matches Ψ_p̂ is determined. (b) Pixels are copied from Ψ_q̂ to shrink the target region.
Figure 3.21. (a) The confidence term for patch-based inpainting. The term is high where ∂Ω is
convex (e.g., point A), low where ∂Ω is concave (e.g., point B), and near 0.5 where ∂Ω is straight.
(b) The data term for patch-based inpainting. The term is relatively high when a strong edge is
nearly perpendicular to ∂Ω, lower when a strong edge is nearly parallel to ∂Ω, and very small in
nearly-constant-intensity regions. Point p illustrates the vectors used to compute the data term.
$$C(p) = \frac{1}{W^2} \sum_{q \in \Psi_p} C(q) \tag{3.33}$$
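A minimal sketch of the confidence term in Equation (3.33) follows. It assumes the confidence map starts at 1 for known pixels and 0 inside the target region, which is consistent with the values near 0.5 along straight boundaries in Figure 3.21a; the function name `confidence` and the `C[x][y]` indexing are illustrative.

```python
def confidence(C, p, W):
    """Mean of the confidence map C over the W x W patch centered at p.

    C is a list of rows indexed C[x][y]; W is assumed odd, and the patch is
    assumed to lie entirely inside the image.
    """
    x, y = p
    h = W // 2
    total = sum(C[u][v]
                for u in range(x - h, x + h + 1)
                for v in range(y - h, y + h + 1))
    return total / (W * W)

# Known pixels (value 1) fill the left columns; the target region (value 0)
# fills the right columns, giving a straight vertical boundary.
C = [[1.0, 1.0, 1.0, 0.0, 0.0] for _ in range(5)]
c_front = confidence(C, (2, 3), 3)      # patch straddling the boundary
c_interior = confidence(C, (2, 1), 3)   # patch made entirely of known pixels
```

With a small patch the value on a straight boundary is coarsely quantized; it approaches 0.5 as W grows, since roughly half of the patch is then known.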
The data term follows reasoning similar to that of the PDE-based method; we want
to propagate intensities into the target region along isophote directions, starting
with strong edges that should continue into Ω. We prefer strong edges that hit the
boundary ∂Ω head-on (i.e., at a right angle) as opposed to a strong edge tangent to the
boundary. Thus the data term for an image with intensities in [0, 1] is computed as:
where n(p) is a unit vector orthogonal to ∂Ω at p and ∇⊥I(p) is the unit vector
perpendicular to the gradient defined in Equation (3.29). Thus, the data term at a pixel
increases with the strength of the image gradient and with its alignment to the
tangent of the fill front. Figure 3.21b illustrates the vectors in Equation (3.34) and some
values of the data term for an example I and Ω.
After we compute the pixel p̂ on the fill front with the highest priority, we form
a patch around it and find its best match in the source region, defined simply as
the patch with the minimum Euclidean distance (measured in color space). The
distance is only computed over the region of the patch containing known pixels,
i.e., Ψ_p̂ ∩ (I − Ω). The corresponding pixel colors from the resulting “exemplar” patch
are simply pasted into the target region Ψ_p̂ ∩ Ω.
Finally, the confidence values for the newly copied pixels are all assigned to be
C(p̂), so that as we work toward the interior of the target region, the confidences get
lower and lower. The algorithm stops when all the target pixels have been filled in.
Figure 3.22. Results of patch-based inpainting. (a) The original image. (b) The inpainting mask.
(c) The final inpainted image. The bottom row illustrates the result after (d) 200, (e) 800, and
(f) 2,000 iterations. Note that strong linear structures are propagated through the inpainting
region first.
Figure 3.22 illustrates an example result showing several intermediate steps; we can
see that the algorithm tends to propagate strong linear structures through the target
region first, leaving large flat regions until the end. While the result has several regions
that look unusual (e.g., along the left crossbar and on the foreground post), several
regions are very convincing (e.g., the bridges, trees, and foreground terrain). PDE-
based inpainting cannot achieve a result with this level of realistic texture.
Drori et al. [126] proposed a similar inpainting method at about the same
time, which used a coarse-to-fine approach and Laplacian-pyramid-based blend-
ing instead of direct copying of pixel regions. They also allowed the target patches
to adaptively change size and the source patches to change in scale and orientation,
permitting a wider range of possibilities at the cost of speed.
Sun et al. [481] noted that while patch-based methods try to continue strong linear
features into the target region, these methods cannot guarantee that salient features
like straight lines or junctions will correctly “meet up” in its interior. They proposed a
belief-propagation-based algorithm to first complete salient structures drawn in the
target region by the user, afterward filling in the remaining pixels with a patch-based
algorithm and Poisson blending. This approach produces good results compared to
Criminisi’s algorithm when the target region should contain continuations of struc-
tures to which the human eye is sensitive, such as horizon lines or fences. Komodakis
and Tziritas [249] took the belief-propagation approach a step further by applying it
to the entire target region instead of just the linear structures specified by the user.
While this method has the advantage of posing the problem globally instead of greed-
ily selecting the best patch at each iteration, the main difficulty is managing the large
space of patch labels at each target position.
Finally, Hays and Efros [191] noted that the source region need not be confined
to the pixels in the image outside the target region. Instead, they proposed to use
millions of photographs gathered from the Internet to suggest plausible completions
for a large target region, filling it with a large chunk from one of these images. These
plausible alternatives can vary widely, and the user is allowed to pick the best one for
his or her purpose. The main problem is devising a highly parallelizable algorithm to
efficiently search the huge database of images for good matches. Once each match-
ing image is aligned to the target image, a seam-based algorithm similar to that of
Section 3.3 is used to create the composite.
Finally, we discuss image retargeting techniques for changing the size and aspect
ratio of an image. While many retargeting algorithms were designed for making an
image smaller (for example, to fit on a mobile device), these types of techniques
could also be used to reshape high-resolution images and video — for example, to
change a movie from the theater aspect ratio of 2.39:1 to the HDTV aspect ratio of
16:9. Common techniques for retargeting an image to a new aspect ratio include
cropping, scaling, and letterboxing, but as Figure 3.23 illustrates, these methods can
remove, distort, or shrink image features in perceptually distracting ways. However,
there has recently been an explosion of interest in the computer vision and graphics
communities in content-aware image retargeting. These methods resize an image
Figure 3.23. Resizing an original image (a). (b) Cropping can remove important details from the
image (e.g., the second cow.) (c) Scaling changes the aspect ratio, causing unnatural distortions.
(d) Letterboxing makes all the elements of the image small, losing detail.
3.5. I m a g e R e t a r g e t i n g a n d R e c o m p o s i t i n g 81
by moving and removing pixels in such a way that the aspect ratios of important sub-
objects (such as faces) are preserved, while omitting information in areas that are less
perceptually important (such as large flat regions). Image retargeting techniques can
also be used for realistic, automatic image recompositing or reshuffling — that is,
the spatial rearrangement of elements of a scene.
Figure 3.24. (a) An original image, with ROI given by the center box. (b) The ROI is uniformly
scaled to avoid distortion, while the eight rectangular regions surrounding it are warped with
piecewise linear transformations. (c) Final result.
Figure 3.25. (a) An original image. (b) The saliency map obtained using Itti et al.’s algorithm.
7 s_q can be determined as a function of a given V and V′ and eliminated from the equation; see
Problem 3.19.
Figure 3.26. (a) An original image. (b) Its importance map. (c) The optimized scale-and-stretch
grid. Note that unimportant rectangles can be squished, while important rectangles retain their
original aspect ratio. (d) The final result of stretching the image to be about fifty percent wider.
seams in an image. Unlike the seams in Section 3.3 that can have an arbitrary non-
self-intersecting shape, here we consider only connected paths of pixels that start at
the top edge of the image and end at the bottom edge, passing through exactly one
pixel per row (or similarly, paths that go from the left to right passing through one
pixel per column), as illustrated in Figure 3.27.
Clearly, removing a top-to-bottom seam and pushing the remaining pixels
together will reduce the image width by one column; similarly, removing a left-to-
right seam reduces the image height by one row. Avidan and Shamir [21] made a
simple observation: to reduce the size of an image, we should first remove those
seams that pass through perceptually uninteresting regions of the image. The notion
of “interesting” is encapsulated in a seam energy, the simplest of which is the sum of
image gradients along a given seam s:
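$$E(s) = \sum_{(x,y) \in s} \left( \left| \frac{\partial I}{\partial x}(x, y) \right| + \left| \frac{\partial I}{\partial y}(x, y) \right| \right) = \sum_{(x,y) \in s} e_1(x, y) \tag{3.36}$$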
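The energy in Equation (3.36) can be minimized over connected top-to-bottom seams with dynamic programming. The sketch below is illustrative rather than the authors' implementation: it assumes a grayscale image stored as a list of rows indexed `I[x][y]` with x the row and y the column, approximates the derivatives by forward differences, and handles the last row and column by reusing the previous pixel. The function names are hypothetical.

```python
def e1(I):
    """Backward energy |dI/dx| + |dI/dy| using forward differences.

    I is a list of rows indexed I[x][y]; differences at the last row or
    column reuse the previous pixel (an illustrative boundary choice).
    """
    rows, cols = len(I), len(I[0])
    E = [[0.0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            x2 = x + 1 if x + 1 < rows else x - 1
            y2 = y + 1 if y + 1 < cols else y - 1
            E[x][y] = abs(I[x2][y] - I[x][y]) + abs(I[x][y2] - I[x][y])
    return E

def min_vertical_seam(E):
    """Return the column index of the lowest-energy seam in each row."""
    rows, cols = len(E), len(E[0])
    M = [E[0][:]]                       # cumulative minimum energy
    for x in range(1, rows):
        prev = M[-1]
        M.append([E[x][y] + min(prev[max(y - 1, 0):min(y + 2, cols)])
                  for y in range(cols)])
    # Backtrack from the cheapest pixel in the last row.
    y = min(range(cols), key=lambda c: M[-1][c])
    seam = [y]
    for x in range(rows - 1, 0, -1):
        lo, hi = max(y - 1, 0), min(y + 2, cols)
        y = min(range(lo, hi), key=lambda c: M[x - 1][c])
        seam.append(y)
    seam.reverse()
    return seam

def remove_vertical_seam(I, seam):
    """Delete one pixel per row, reducing the width by one column."""
    return [row[:y] + row[y + 1:] for row, y in zip(I, seam)]

# Flat region on the left, busy region on the right.
I = [[5.0, 5.0, 5.0, 1.0, 9.0],
     [5.0, 5.0, 5.0, 8.0, 2.0],
     [5.0, 5.0, 5.0, 3.0, 7.0]]
seam = min_vertical_seam(e1(I))
narrower = remove_vertical_seam(I, seam)
```

The seam found in this example stays inside the constant region, so removing it leaves the detailed columns on the right untouched.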
Figure 3.28. Reducing an image’s width by successively removing the lowest-energy vertical
seams. (a) The original image. (b) The lowest-energy seams. (c) The result of seam removal.
Figure 3.29. Increasing an image’s width by adding pixels at the lowest-energy vertical seams.
(a) The original image. (b) The lowest-energy seams. (c) The result of seam expansion.
Figure 3.30. Inpainting with seam carving. (a) The original image. (b) An inpainted image of the
same size created by removing, then adding vertical seams. Can you find the two books that have
been removed, and identify other books that have been compressed/expanded to compensate?
Rubinstein et al. [408] proposed two key algorithmic refinements to the original
seam carving algorithm. First, they showed how the problem of finding the optimal
seam could be naturally posed as computing the minimum cut on a graph, which
can be generalized more easily than dynamic programming. As usual, the vertices
of the graph correspond to image pixels. However, unlike the graph-cut methods in
Section 3.3 and the previous chapter, we create a directed graph in which each pair of
4-neighbors can be connected by two arcs going in different directions. Figure 3.31
illustrates the graph setup and arc weights that correspond to seam carving. For a
vertical seam, we attach all the pixels on the left edge of the image to one terminal S,
and all the pixels on the right edge to another terminal T. We seek a cut of the graph
that separates the terminals; the cost of such a cut is the sum of the weights of the
Figure 3.31. (a) Graph-cut formulation of seam carving. (b) Arc weights for a subset of the
graph (the dotted region in (a)) corresponding to finding the minimal-cost vertical seam. Here,
e1 represents the sum of gradients in the summand of Equation (3.36).
directed arcs from S to T. After computing the cut, we remove the pixel immediately to the left of
the cut in each row. The special configuration of infinite-weight arcs ensures that
the cut forms a connected path of pixels that only intersects one pixel per row (see
Problem 3.21).
Rubinstein et al. also observed that the original formulation of the seam energy
in Equation (3.36) ignores energy that may be inserted into the image by removal
of the seam, since new edges are created when previously non-neighboring pixels
are pushed together. Instead, they proposed to measure the energy of a seam as the
energy introduced by removing the seam. They called this the forward energy of the
seam to distinguish it from the backward energy in Equation (3.36). As illustrated in
Figure 3.32, there are three possibilities for new edges introduced for a vertical seam
depending on its direction at pixel (x, y).
These three cases correspond to the forward energy cost function terms
$$C(x, y) = \begin{cases} C_{LR}(x, y) + C_{LU}(x, y) & \text{Case 1} \\ C_{LR}(x, y) & \text{Case 2} \\ C_{LR}(x, y) + C_{RU}(x, y) & \text{Case 3} \end{cases} \tag{3.37}$$
where
$$\begin{aligned} C_{LR}(x, y) &= |I(x, y+1) - I(x, y-1)| \\ C_{LU}(x, y) &= |I(x-1, y) - I(x, y-1)| \\ C_{RU}(x, y) &= |I(x-1, y) - I(x, y+1)| \end{aligned} \tag{3.38}$$
which can again be minimized using dynamic programming. To minimize the for-
ward energy using graph cuts instead, we modify the subgraph in Figure 3.31b to
have the weights in Figure 3.33a. Figure 3.33b-d illustrates an example of reducing
an image’s size using both the backward and forward energies, showing that using
Figure 3.32. Pixel configurations introduced by seam removal. The shaded seam pixels in the
top row are removed to produce the new pixel arrangements in the bottom row. New neighbors
are indicated by bold lines in the bottom row. The bold lines in the top row represent the edges
cut in a graph-cut formulation of the problem.
the forward energy avoids introducing visual artifacts. For this reason, the forward
energy is usually preferred in implementations of seam carving.
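A dynamic-programming sketch of forward-energy seam selection is given below. It assumes that Cases 1, 2, and 3 of Equation (3.37) correspond to a seam arriving at (x, y) from the upper left, from directly above, and from the upper right respectively, that the image is grayscale and indexed `I[x][y]` with x the row, and that border pixels reuse their nearest neighbor. These conventions and the name `forward_energy_seam` are illustrative assumptions.

```python
def forward_energy_seam(I):
    """Lowest forward-energy vertical seam, one column index per row."""
    rows, cols = len(I), len(I[0])

    def px(x, y):
        # Clamp indices so border pixels reuse their nearest neighbor
        # (an illustrative boundary choice).
        return I[min(max(x, 0), rows - 1)][min(max(y, 0), cols - 1)]

    M = [[0.0] * cols for _ in range(rows)]   # cumulative forward energy
    back = [[0] * cols for _ in range(rows)]  # predecessor column per pixel
    for x in range(rows):
        for y in range(cols):
            c_lr = abs(px(x, y + 1) - px(x, y - 1))
            if x == 0:
                M[x][y] = c_lr
                continue
            c_lu = abs(px(x - 1, y) - px(x, y - 1))
            c_ru = abs(px(x - 1, y) - px(x, y + 1))
            # Candidate predecessors in row x - 1, stored as (cost, column).
            options = [(M[x - 1][y] + c_lr, y)]                         # Case 2
            if y > 0:
                options.append((M[x - 1][y - 1] + c_lr + c_lu, y - 1))  # Case 1
            if y < cols - 1:
                options.append((M[x - 1][y + 1] + c_lr + c_ru, y + 1))  # Case 3
            M[x][y], back[x][y] = min(options)
    # Backtrack from the cheapest pixel in the last row.
    y = min(range(cols), key=lambda c: M[-1][c])
    seam = [y]
    for x in range(rows - 1, 0, -1):
        y = back[x][y]
        seam.append(y)
    seam.reverse()
    return seam

# Flat region on the left, busy region on the right.
I = [[5.0, 5.0, 5.0, 1.0, 9.0],
     [5.0, 5.0, 5.0, 8.0, 2.0],
     [5.0, 5.0, 5.0, 3.0, 7.0]]
seam = forward_energy_seam(I)
```

Each cost here measures the new edges that would appear once the seam pixel is removed, so a seam through the flat region has zero cost because the pixels brought together are identical.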
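$$D(I, I') = \frac{1}{N} \sum_{\Omega \subset I} \min_{\Omega' \subset I'} d(\Omega, \Omega') + \frac{1}{N'} \sum_{\Omega' \subset I'} \min_{\Omega \subset I} d(\Omega', \Omega) \tag{3.40}$$
Here, each Ω is a patch in I and each Ω′ is a patch in I′; these patches are effectively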
generated at multiple scales, as described next. We assume N patches are created in I
Figure 3.33. (a) The arc weights corresponding to forward energy in graph-cut-based seam
carving, corresponding to finding the minimal-cost vertical seam. (b) An original image. (c) Seam
carving with backward energy, showing visual artifacts (e.g., the right leg, the sword, the vertical
white bar). (d) Seam carving with forward energy results in a more acceptable image with fewer
introduced artifacts.
and N′ patches are created in I′. The function d in the summands is the average
sum-of-squared-distance between the colors of corresponding pixels in a pair of
same-sized patches. The first term in Equation (3.40) captures the notion of completeness,
and the second term captures the notion of coherence. The measure D(I, I′) is low
when both properties are well satisfied.
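To make the two terms concrete, here is a brute-force, single-scale sketch of the measure in Equation (3.40) for small grayscale images. It is an illustration only: the text's patches are effectively multi-scale and compared in color, whereas this sketch uses one patch size and scalar intensities, and the names `patches`, `d`, and `bidirectional_distance` are hypothetical.

```python
def patches(I, W):
    """All W x W patches of a grayscale image, each flattened to a tuple."""
    rows, cols = len(I), len(I[0])
    return [tuple(I[x + u][y + v] for u in range(W) for v in range(W))
            for x in range(rows - W + 1) for y in range(cols - W + 1)]

def d(a, b):
    """Average squared difference between corresponding pixels."""
    return sum((s - t) ** 2 for s, t in zip(a, b)) / len(a)

def bidirectional_distance(I, I_prime, W=2):
    """Completeness plus coherence, following the two terms of Equation (3.40)."""
    P, Q = patches(I, W), patches(I_prime, W)
    # Completeness: every patch of I should have a close match in I'.
    completeness = sum(min(d(p, q) for q in Q) for p in P) / len(P)
    # Coherence: every patch of I' should have a close match in I.
    coherence = sum(min(d(q, p) for p in P) for q in Q) / len(Q)
    return completeness + coherence

# Striped image, a narrower version of it, and a flat image.
stripes = [[float(y % 2) for y in range(6)] for _ in range(4)]
narrow = [row[:4] for row in stripes]
flat = [[0.0] * 4 for _ in range(4)]
d_narrow = bidirectional_distance(stripes, narrow)
d_flat = bidirectional_distance(stripes, flat)
```

The narrower striped image contains every patch of the original and nothing else, so it is both complete and coherent, while the flat image scores poorly on both terms. The exhaustive search over all patch pairs is what makes the measure costly on real images.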
If we specify the desired dimensions of the retargeted image, then our goal is to
find the image I′ that minimizes D(I, I′). We use an iterative update rule for this
estimation problem, as follows.
Let’s consider a pixel j ∈ I′ and think about how it contributes to the cost function in
Equation (3.40), as illustrated in Figure 3.35. Suppose we only consider W × W patches
in Equation (3.40). That means that j will be a member of W² patches {Ω′_1, …, Ω′_{W²}} in
I′, which have the best matches {Ω_1, …, Ω_{W²}} in I. Let the pixels in I corresponding to
pixel j in each of these patches be i_1, …, i_{W²}. Thus, the second term in Equation (3.40)
corresponding to pixel j is:
$$\frac{1}{N'} \sum_{k=1}^{W^2} \| I(i_k) - I'(j) \|^2 \tag{3.41}$$
where the norm in Equation (3.41) is the Euclidean distance in color space.
Figure 3.34. An original image I and a proposed retargeted image I′. Most patches from I are
present in I′, indicating a high degree of completeness. Conversely, I′ doesn’t contain patches
that don’t appear in I, indicating a high degree of coherence. The retargeted image can remain
complete and coherent while removing large chunks of repetitive texture, which is difficult to do
with seam carving.
On the other hand, the contribution of j to the first term in Equation (3.40) is harder
to determine, since we don’t know how many patches from I (if any) include j in their
best-match patches. If we assume that N_j patches in I include j in their best-match
patch in I′, and let the pixels in I corresponding to pixel j in each of these patches be
i_1, …, i_{N_j}, then the first term in Equation (3.40) corresponding to pixel j is:
$$\frac{1}{N} \sum_{k=1}^{N_j} \| I(i_k) - I'(j) \|^2 \tag{3.42}$$
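As a hedged aside on how an update for the color at pixel j can follow from these two contributions, the derivation below uses notation introduced only for this sketch, not the author's: a_1, …, a_{W²} are the colors I(i_k) in the coherence sum of Equation (3.41), b_1, …, b_{N_j} are the colors in the completeness sum of Equation (3.42), N and N′ are the numbers of patches in the original and retargeted images, and c is the unknown color of pixel j in the retargeted image. The combined contribution is quadratic in c, so setting its gradient to zero gives a weighted average of the contributing colors, consistent with the description in the caption of Figure 3.35.

```latex
\begin{aligned}
E_j(c) &= \frac{1}{N'}\sum_{k=1}^{W^2}\|a_k - c\|^2 + \frac{1}{N}\sum_{k=1}^{N_j}\|b_k - c\|^2 \\
\nabla_c E_j(c) &= -\frac{2}{N'}\sum_{k=1}^{W^2}(a_k - c) - \frac{2}{N}\sum_{k=1}^{N_j}(b_k - c) = 0 \\
\Longrightarrow\quad c &= \frac{\dfrac{1}{N'}\sum_{k=1}^{W^2} a_k + \dfrac{1}{N}\sum_{k=1}^{N_j} b_k}{\dfrac{W^2}{N'} + \dfrac{N_j}{N}}
\end{aligned}
```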
This suggests a simple iterative algorithm, repeated until the retargeted image stops
changing:
1. For each patch in I′, find the most similar patch in I to find the pixel colors
for Equation (3.41).
2. For each patch in I, find the most similar patch in I′ to find the pixel colors
for Equation (3.42).
3. Update the image I′ using Equation (3.43).
A scaled version of the original image is a good choice for initializing the process if
the retargeted dimensions are not too different from the original dimensions; otherwise,
the desired dimension can be reached using several steps in which the relative
Figure 3.35. An original image I and a proposed retargeted image I′. We consider the
contribution of the black pixel j in I′ to the bidirectional similarity cost function. (a) For the coherence
term, we consider the W² patches in I′ that contain the black pixel j, and find their best matches
in I. (b) For the completeness term, we consider the N_j patches in I whose best matches in I′
contain the black pixel j, where N_j changes from pixel to pixel. The pixel color at j in the
retargeted image is iteratively updated as the weighted sum of colors at the gray pixels in I involved
in both terms.
change in dimension is not too large. Applying the algorithm with a coarse-to-fine
approach using a Gaussian pyramid effectively creates multi-scale patches if W is the
same at every level, and produces good approximations for successively finer scales.
Weights can also be applied at each pixel, for example using the saliency measure
from Section 3.5.1, to bias the algorithm to preserve important regions (e.g., faces),
or to place zero weights on pixels to be removed from the image for inpainting.
Bidirectional similarity seems to do a better job than seam carving at creating extremely
small versions of images (a task that is not really of interest for visual effects); for
example, it allows repetitive textures like windows on a building to automatically be
condensed. More relevant here, the approach is exciting for its ability to easily
recompose or reshuffle images. For example, the user can roughly cut image features
out of an image, rearrange them on the target image I′, and fix these pixels’ intensities.
Then the iterative algorithm is applied to find the remaining pixels’ intensities so
that the resulting image is as complete and coherent as possible. An example of this
approach is illustrated in Figure 3.36. Similarly, a realistic expansion of an original
image can be created by fixing the position of the original on a larger canvas and
optimizing the coherence term in Equation (3.41) only over the unknown border regions.
Figure 3.36. Reshuffling an image using bidirectional similarity. (a) An original image. (b) Con-
straints fixing certain pixels (note that the trays in the center are in new positions). (c) The
automatically created recomposition.
The result effectively synthesizes realistic texture around the original, similar to the
patch-based inpainting methods in Section 3.4.2. If two separate images are used to
provide patches for I, the retargeted image I′ will resemble a seamless montage of
the inputs, similar to Section 3.3.
However, this algorithm has one serious drawback: searching for the minimum-cost
patches in Equation (3.40) is expensive in both computation time and memory.
The algorithm can be substantially accelerated (by factors of 20 to 100) using an
approximate nearest-neighbor algorithm proposed by Barnes et al. called PatchMatch [30].
The approximate algorithm is based on random sampling and exploits the coherence of
natural images to find good approximate matches. Although presented here in the context
of bidirectional similarity, it could apply to any of the block-matching algorithms
discussed throughout this book. Barnes et al. performed
several experiments on real and synthetic data to show that PatchMatch substan-
tially outperformed other algorithms conventionally used for approximate nearest
neighbors (such as [19]), to the extent that image reshuffling can be performed at
interactive rates. By restricting the search areas for the nearest neighbors, Patch-
Match also allows new effects that were difficult to obtain with the other methods in
this section, such as preserving long straight lines.
Cho et al. [93] proposed an algorithm for retargeting and reshuffling called
the patch transform at the same time as Simakov et al.’s bidirectional similarity
approach; the two methods share a similar philosophy and produce similar-looking
results. However, the patch transform uses loopy belief propagation as its optimiza-
tion engine, and explicitly constrains the output image to be composed of a disjoint
set of patches from the original image, each of which is only used once.
In the extreme case where we only consider patches of size 1×1 pixel, a retargeted
image can be thought of as a collection of pixels from the input image. That is, I′(x, y)
is defined by a label (δx, δy) such that I′(x, y) = I(x + δx, y + δy). From this perspective,
retargeting the image can be thought of as a label assignment problem. Pritch
et al. [376] proposed an algorithm called shift-map editing based on this concept. If
L is a labeling, that is, an assignment (δx, δy) at each pixel of the retargeted image,
then a cost function over labelings is defined in the same way as Equation (3.25); that
is, we create a data term Edata (L(i)) that encapsulates retargeting constraints, and a
smoothness term Esmoothness (L(i), L(j)) that encapsulates neighborhood constraints
for each pair of 4-neighbors. For example, to perform inpainting, the data term at
pixel (x, y) with label (δx, δy) would be ∞ if (x + δx, y + δy) was inside the target region
Ω. To perform reshuffling, the data term at pixel (x, y) with label (δx, δy) would be 0
if the user wanted to force I′(x, y) = I(x + δx, y + δy) and infinite for any other shift.
The smoothness term is a typical penalty for label disagreements between neighbors
based on color and gradient mismatches, similar to the discussion in Section 3.3.
As before, α-expansion is used to solve the labeling problem, using a coarse-to-fine
algorithm to make the problem computationally tractable.
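The labeling view can be made concrete with a short sketch that builds a retargeted image from a given shift map and evaluates the inpainting data term described above. Only the bookkeeping is shown; the α-expansion optimization that chooses the labels is not. The function names, the `I[x][y]` indexing with x as the row, and the zero cost for labels that avoid the target region are illustrative assumptions.

```python
INF = float("inf")

def apply_shift_map(I, L):
    """Build I'(x, y) = I(x + dx, y + dy) from a labeling L of (dx, dy) pairs.

    L has the dimensions of the retargeted image; every shifted coordinate
    is assumed to land inside I.
    """
    return [[I[x + dx][y + dy] for y, (dx, dy) in enumerate(row)]
            for x, row in enumerate(L)]

def inpainting_data_term(x, y, label, target):
    """Infinite cost if the label would copy a pixel from the target region."""
    dx, dy = label
    return INF if (x + dx, y + dy) in target else 0.0

# Drop the middle column of a 2 x 3 image by shifting the last output column.
I = [[10, 20, 30],
     [40, 50, 60]]
L = [[(0, 0), (0, 1)],
     [(0, 0), (0, 1)]]
I_prime = apply_shift_map(I, L)
```

Neighboring output pixels that carry the same label copy a contiguous piece of the input, which is why the smoothness term only needs to penalize places where the labels of adjacent pixels disagree.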
Wang et al. [527] were the first to extend the Poisson gradient-domain image edit-
ing approach to video; the 3D Poisson equation and its discrete approximation are
straightforward generalizations of what we discussed in Section 3.2.1. This allows
dynamic elements to be composited into a video sequence, such as flickering flames
or a lake with waves and ripples. Since the size of the linear system to be solved is
much larger, fast numerical methods to solve the Poisson equation are critical, as
discussed in Section 3.2.2.
Many algorithms for video inpainting approach the problem based on the layered
motion model mentioned in Section 2.9. That is, the video is separated into a static
3.6. Video Recompositing, Inpainting, and Retargeting 93
background layer and one or more moving foreground layers. There are generally
three possibilities for a pixel (x, y, t) in the region to be inpainted:
1. The pixel lies on a moving foreground object and the desired result is a static
background pixel exposed in a different frame. In this case, the known back-
ground pixel can simply be pasted from this other frame after compensating
for the layer motion (e.g., [243]).
2. The pixel lies on a static background object, so the desired background is static
and unknown. In this case, patch-based inpainting can be generalized to fill
in a plausible background.
3. The pixel should represent the continuation of some object that moves in and
out of the hole as the video progresses. For example, we may want to remove
a tree that people walk behind, generating plausible results of people walking
through an empty field. If such regions are known to contain objects that move
in a predictable, cyclical way, object-based video inpainting algorithms can
exploit the consistency. For example, Venkatesh et al. [512] used background
subtraction to generate a set of moving object templates that could be pasted
into a video hole, and Jia et al. [220] explicitly estimated which positions in an
object’s motion cycle are missing at each time instant.
Wexler et al. [546] presented a general algorithm for video inpainting that was the
predecessor of the bidirectional similarity approach in Section 3.5.3 and requires no
object detection or segmentation. The method basically maximizes a coherence term
of the form
$$C(V, V') = \prod_{\Omega' \subset V'} \max_{\Omega \subset V} \exp\left(-\lambda\, d(\Omega, \Omega')^2\right) \tag{3.44}$$
Figure 3.37. (a) Generalizing optimized scale-and-stretch to video encourages temporal coher-
ence between the quads in each frame. (b) Generalizing seam carving to video results in a
spatio-temporal seam that separates the left half of each image from the right half.
and introduced new constraints based on user keyframing of important objects and
structures.
Wang et al. [534, 535] generalized the optimized scale-and-stretch algorithm of
[536] to video. In [534], all the frames are first aligned to compensate for camera
motion. Moving objects are detected and tracked across video frames, and con-
straints are imposed to ensure that the quads on these objects are resized consistently.
The single-frame salience map is also replaced by a moving average across several
neighboring frames. Overall, the approach avoids artifacts that would be created by
retargeting each frame independently. In [535], the constraints on separating camera
and object motion were relaxed, replaced by a simpler optical-flow-based method for
determining a critical region in each frame that will not be removed. A cost function is
proposed that penalizes the deviation of each quad transformation from a similarity
transform and encourages temporal coherence in the quad transformations based
on the estimated optical flow, as sketched in Figure 3.37a.
Rubinstein et al. also showed how seam carving could be applied to resize video;
the generalizations of the graphs in Figures 3.31b and 3.33a to (x, y, t) pixels are
straightforward. In this case, a cut on the graph defines a spatio-temporal seam
that cuts through the video volume, as illustrated in Figure 3.37b. Alternately, we
could make the video shorter in the temporal direction by removing seams “parallel”
to the (x, y) plane. Since the number of vertices in the graph can be very large, a
coarse-to-fine strategy for computing the cut may be required (e.g., [299]).
Paul Lambert, compositing supervisor, and Blake Sloan, software engineer from Dig-
ital Domain in Venice, California, Shankar Chatterjee, software developer at Cinesite
3.7. I n d u s t r y P e r s p e c t i v e s 95
in Hollywood, and Gabriel Sanchez, 2D supervisor from LOOK Effects in Los Angeles,
discuss various aspects of video editing and compositing.
RJR: In what kinds of situations do pieces of two video sequences need to be seamlessly
merged, and how do you handle them?
Lambert: That kind of thing is all about sleight of hand. The easiest thing is if there’s an
element in a shot that passes behind something like a telegraph pole. That pole is the
natural place to draw the dividing line between the two pieces of footage, especially
if the element isn’t really where the viewer’s expected to look. If there’s no pole, it
makes sense to pick a flat area with not a lot of detail to draw the seam.
Suppose you’ve got a stunt, like a guy’s on a motorbike doing a big jump and
there’s a big explosion behind him, but the stuntman obviously doesn’t look like our
hero character since you don’t want to put your A-list actor in danger. Nowadays you
could probably do it all in CG, but years ago you could try to match a take of the
actor’s head onto the real footage as well as possible. But you’ve got this explosion
going on, with light coming from behind, so ideally you’d want to shoot the actor with
roughly the same color light. You’d have had to draw a roto spline and fake it in, but
if done poorly it can look very fake since the lighting is very tough to match.
For both TRON: Legacy and The Curious Case of Benjamin Button, we did a lot of
splicing a CG head onto a real actor’s body. We’d always roto along the actual collar
line and just replace the neck and head. If you have a shot where that neckline’s
going into shadow, into the noise floor of the camera, you’ll never be able to find it
and you just make it up. You take the curve and just imagine what would happen.
With TRON, the roto of the collar was easier, since the neckline on the character’s
bodysuit was basically a rigid piece of plastic, so you knew it always had a certain
shape. On Benjamin Button, it was much harder since the character wore many types
of different, natural clothing — in some scenes he was wearing a flimsy shirt. In some
sequences where the body double was moving around a lot, we actually had to roto
the collar and warp it around. Those shots took the longest to do in roto, paint, and
compositing — sometimes we’d have to make up the whole inside of the shirt because
the body double was wearing a kind of blue hoodie that reflected color back into the
shirt in the original plate. Plus once we put the CG head back in, there might be a gap,
so sometimes we’d have to clone that piece of texture and track or warp it in; that was
all done by hand. It was very intensive work.
what I’m going to do. You’re trying to find out, am I going to have to blend across the
seam or am I going to use a hard line? Is there architecture in there that’s going to
help me hide what I need to do?
If it’s a still background, there’s no movement, and it’s slightly out of focus, using a
fuzzy or a soft matte line will usually help. I can use the width of blur along a vertical
surface in the background as a rule of thumb for how much to blur the seam. If the
background’s really sharp, then you try to follow strong edges and cut through empty
regions with the dividing line. In the sports bar example, I may follow the edge of
a table, arbitrarily go through the blank wall since it won’t matter, then follow the
edge of a TV since I don’t want to split the TV in half, and so on. You’ll get a kind of
jigsaw-puzzle pattern that will be less detectable than a straight line; you purposely
want to go back and forth, grabbing a table from the A side and a monitor from the
B side. But even when the background is sharp, you always want some amount of
blend, even if it’s just a pixel and a half.
If the cameras are moving, then you need to track or keyframe that line. A lot of it
depends on how well the two videos are registered. If they’re both moving quickly and
at the same speed, you may be able to get away with a super-wide, soft line, because
something’s always moving across the screen and the blend is twenty-five percent
of the image, you’re not going to see where it’s transferring from one side to the
other. Now, if there are very detailed objects in the background, say you see a specific
car go by, you may have to roto out that car frame by frame and use it only from
the B side.
This kind of thing happens in TV as well as movies. Sometimes it’s easy — say an
actor misses his cue and walks into a room too soon. You can actually use the original
footage as its own clean plate, and basically “slip” him — instead of having him come
in on frame 1 you have him come in on frame 25, because the information’s there
to create a clean back plate. You hear a lot of “we’ll just fix it in post.” There are so
many of those things that people don’t know about — things that were not meant to
obviously be an effects shot, but where effects were done.
RJR: How do you deal with inpainting problems, where you need to synthesize realistic
texture inside a hole?
Lambert: The most common approach is to track and warp a piece of existing texture
into the hole. These days, every shot that comes into Digital Domain gets tracked, so
we know the camera motion in 3D space, and we often survey the set or environment
to obtain accurate 3D geometry (see Chapters 6 and 8). Then the compositors have
tools where we can project a given image of the scene onto this 3D geometry. To fill
in a hole in one image, we can figure out which 3D surface lies behind the object
we want to replace, project that surface onto a region in a different image where the
background is visible, and fill in the pixels we couldn’t see with a piece of texture
that looks correct and moves in the right way. We do that kind of thing all the time
for set extensions. It’s like a 2D compositing tool that actually uses 3D under the
hood. Achieving spatial and temporal consistency is still difficult, though; it’s very
noticeable if there’s a slight camera shake and your texture all of a sudden swims in
the opposite direction of what it’s supposed to be doing.
Chatterjee: At Cinesite, I was able to apply some of the texture synthesis research from
my PhD for inpainting and wire removal problems. At that time, in the mid-nineties, I
was using Markov Random Field approaches to resynthesize texture; the artist would
outline the areas of defects, and I would run my texture synthesis program to fill in
the pixels. A common example was painting out marker balls that had been placed
around a set for camera tracking.
We used to use a wire removal tool that was similar to PDE-based inpainting, but
it didn’t work very well; you could see that something had been done. They used
to call it the “glass rod” effect because that’s what the filled-in wire region kind of
looked like. Later I moved on to Efros et al.’s technique, and then to Criminisi et al.’s
algorithm.
The same kinds of tools apply to general problems of image restoration. For exam-
ple, a frame may contain random dirt, which is very hard to automatically detect. It
was also common to see scratches on frames that came from something inside the
projector periodically or continuously contacting the film. It may show up as a long,
unbroken vertical scratch, or intermittently recur in some parts of some frames. For
historical films — for example I helped on the restoration of To Kill a Mockingbird —
you don’t want to touch a pixel unless it’s absolutely essential. For very thin scratches
you can use something simple like a median filter, but the artifacts you introduce can
be very subtle. If you freeze a frame and look at the one before and the one after you
won’t notice a problem, but if you watch the moving footage, you’ll be very suspicious
about that area — something must have happened there but you can’t put your finger
on it — some subtle motion that evokes critical analysis in the viewer.
For the recent movie Hereafter, there was a problem with a “hair in the gate” in one
scene — a hair got stuck in the film scanner and was there throughout the sequence.
We used patch-based inpainting, motion compensated to pull from other frames, to
remove the hair. In another scene outside the San Francisco airport, a dark cloudy
scene with a taxicab, there was actually a fairly large scratch on the film — about forty
pixels at its widest point. The artist created a matte painting for one frame and the
software was able to track that scratch throughout the whole sequence and inpaint it
from the artist’s background.
RJR: How is the aspect ratio change between a widescreen cinema release and a DVD
or Blu-Ray handled?
Sloan: You might hope that there’s an automatic solution, but there are several rea-
sons why it’s problematic. At any step in production before the final shot is done,
you run the risk of things running into each other in a way that you don’t expect.
So you can’t do content-aware retargeting before all the content is in the scene. The
more important issue is that a well-composed frame is like a visual art form. Yes,
an automatic algorithm could do a nice job of preserving the content and making
sure everyone’s heads were in frame when it got cropped down, but every frame
is kind of like a painting. It has a composition and it has spatial relationships that
are probably best preserved by a human, making the decision during actual film-
ing about what part’s going to be in the frame and what part’s going to be out.
Skilled cinematographers do that all the time; there’s this idea of an importance gra-
dient that kind of flows out from a central rectangle that’s literally marked in their
camera viewfinder. They’ll see the part of the frame corresponding to the aspect
ratio of the movie in the theater, as well as a TV-safe area. Productions are now
designed to be targeted to different types of media without any intervention, to
avoid the noticeable pan-and-scans you used to see when movies were shown on
TV in the eighties. It’s also much better now that the 16:9 HDTV aspect ratio is closer
to the movie aspect ratios than standard-definition video was, so letterboxing is less
obnoxious.
Figure 3.38. (a) In this Impossible is Nothing commercial for adidas, a young Muhammad Ali from
archival footage fights his daughter Laila Ali in a seamless composite. (b,c) Image inpainting is
used to remove blue-painted stand-in objects in these sequences from Transformers: Dark of the
Moon. (d) To remove larger objects, like these green-suited fighters from The Mummy: Tomb
of the Dragon Emperor, more extensive manual work is usually required. adidas, the 3-Stripes
mark and the Impossible is Nothing mark are registered trademarks of the adidas Group used
with permission. The name, image, and likeness of Muhammad Ali are provided courtesy of
Muhammad Ali Enterprises LLC. Transformers: Dark of the Moon ©2011 Paramount Pictures.
All Rights Reserved. The Mummy: Tomb of the Dragon Emperor courtesy of Universal Studios
Licensing LLC.
Lambert: For the movie we’re working on now, Jack the Giant Killer, they’re film-
ing it using the RED Epic camera, with 5K resolution. We’re going to be taking a
portion out of this huge frame, which is exciting. We can see the regions corre-
sponding to the 2.39 cinema release, the proposed 16:9 TV release, and we have
extra information on the sides of the plates that will help in the stereo conver-
sion. We’re using the extra information from the top and bottom as well, because
there are giants in this movie, and when they hit the ground, the camera shakes,
and we’ll be able to pull in information from pixels that are out of frame instead of
making it up.
This is the first time we’ve had this kind of oversize plate. There are so many pixels
that for the TV release they may not even use the 16:9 rectangle but blow up a piece
of the actual 2.39 cinema release. With a normal camera, you can get away with a
blow-up up to about fifteen to twenty percent. On this show, they’re shooting at 5K,
5120×2700, but actually rendering the show at half that resolution. They’re actually
reframing some of the shots, knowing that they can zoom into the shot because they
have this exceptional resolution. I’ve got a plate that just came in with 170 percent
blow-up, which I’ve never seen before, but because you have the extra resolution you
can do that.
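As a rough arithmetic sketch of the framing regions described above, the snippet below computes the largest centered 2.39:1 and 16:9 rectangles that fit inside a 5120×2700 plate. The function name and the assumption that both extractions are centered are illustrative only; the interview does not say how the production actually positions them.

```python
def largest_centered_crop(plate_w, plate_h, aspect):
    """Return (x0, y0, w, h) of the largest centered crop with the given aspect ratio."""
    if plate_w / plate_h > aspect:
        # Plate is wider than the target: keep the full height, trim the sides.
        h = plate_h
        w = round(plate_h * aspect)
    else:
        # Plate is taller than the target: keep the full width, trim top and bottom.
        w = plate_w
        h = round(plate_w / aspect)
    return ((plate_w - w) // 2, (plate_h - h) // 2, w, h)


PLATE_W, PLATE_H = 5120, 2700  # plate size quoted in the interview
cinema = largest_centered_crop(PLATE_W, PLATE_H, 2.39)   # hypothetical centered cinema region
tv = largest_centered_crop(PLATE_W, PLATE_H, 16 / 9)     # hypothetical centered TV region
print("2.39 region:", cinema)
print("16:9 region:", tv)
```

Under these assumptions the cinema region uses the full plate width and leaves spare rows above and below, while the 16:9 region uses the full height and leaves spare columns on the sides, which matches the description of extra information being available around the release framing.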
David Geoghegan, Flame operator at LOOK Effects in Los Angeles, California, dis-
cusses the compositing involved in a pair of shots from Captain America: The First
Avenger.
Geoghegan: The setup here is that one of Captain America’s team runs out of the for-
est to a road. He lies down in the middle of the road as an armored truck approaches,
and just as it rolls over him, he places a bomb on its underside. So here, the A side is
just a stationary camera in the middle of the road that the truck drives over, and the
B side is a moving camera that follows the guy as he runs out of the forest, lies down
in the empty road, and raises his hands up to place the bomb. Since the B side is
moving, we had to get a really good camera track, which wasn’t so bad because they
put lots of markers on trees in the forest — but then you have to go back and paint
out all the markers.
The key problem is getting the truck plate to really “live” in the other plate. Luckily,
they shot both plates in the same location, but the color is very different in the two
shots. I did a really tight roto of the truck from the A side so I could pull it completely
off. Then before putting it into the B side, I had to color-correct it, “slip” or alter the
timing, and flop it — that is, reverse it left to right. I had to recreate the shadowing
under the truck to marry the two shots together. It’s stuff like contact shadows that
really sell the effect. If the blacks in the underside of that truck aren’t exactly the same
as the actor’s, when the truck shadows and his shadows are in the same space, they
won’t match, and it’ll be a dead giveaway. It won’t look like that truck is really rolling
over that guy. They even shot stills of the underside of that truck as it rolls over, since
they wanted to see more detail; we had to track those underside stills into the plate,
just by hand.
Since the guy actually hangs onto the bomb throughout the whole B side, the bomb
itself was CG past a certain point and it also had to be painted out. This again is all
done essentially with roto shapes hand-drawn frame by frame, which is difficult since
there’s a lot of motion blur as he’s pulling the bomb back to his chest.
In the next shot you see the truck drive away and explode after a few moments.
That explosion was another A and B situation — putting them together and painting,
making truck parts fly apart. The A plate was basically the truck driving off safely, not
much happening, and the B plate was a practical explosion — the two shots are totally
different. The B shot looks like it’s nighttime, but it’s not; they just cranked down the
exposure so the explosion would be exposed right. Even though it’s literally only a
few frames, there’s just a lot of hand working to make it look good to the human eye.
There are very few things that can really be automated; software gets you so far, and
then it comes down to your ingenuity of actually connecting the A and B sides and
making them live in the same plate.
Wang and Cohen [533] studied the problem of simultaneous matting and composit-
ing; that is, instead of treating the process as two separate steps, the matte of the
foreground region is optimized to minimize visual artifacts in the resulting compos-
ite. The benefit is that the foreground matte need not be highly accurate in regions
where the foreground and new background are similar. The computation of the matte
is similar to the methods in Section 2.6.
Early work on image editing by manipulating gradients was proposed by Elder and
colleagues (e.g., [130]). Later, McCann and Pollard [318] proposed tools for directly
“painting” in the gradient domain for image manipulation and art creation. Orzan
et al. [357] showed how smooth-shaded vector graphics artwork could be created
by specifying the locations of curved edges and the colors on either side. Gradient-
domain techniques can also be used to combine or remove artifacts in flash/no-
flash pairs [10]. Sunkavalli et al. [483] observed that the results of gradient-domain
compositing may exhibit a mismatch in texture between the source and target, even
if the colors are consistent. For example, the source patch may be sharp and clear
while the target image might have visible film grain and blurring. They proposed a
wavelet-pyramid-based compositing framework that matches the histograms of both
the image intensities and the estimated noise in the source and target images, to
propagate the target texture into the pasted source region. They called this approach
“image harmonization.”
Lo et al. [298] proposed a compositing method for 3D stereoscopic movies, an area
of increasing interest to the visual effects community (see Chapter 5). An unusual
application of compositing was proposed by Chu et al. [95], whose goal was to create
realistic “camouflage images” containing hidden elements, as in a child’s picture
book. Agarwala et al. [9] extended the ideas in Section 3.3 to build a “panoramic video
texture,” a seamlessly looping moving image created from a video shot by a single
panning camera. Rav-Acha et al. [384] showed how to create a similar effect, as well
as nonlinear temporal edits of a video, such as the manipulation of a race to create a
different winner.
While the topic is outside the scope of this chapter, the patch-based approach
to inpainting in Section 3.4.2 is an application of texture synthesis, the problem of
creating a large chunk of realistic, natural texture from a small example. Major work
in this area includes that of Wei and Levoy [539], Ashikhmin [20], Efros and Freeman
[128], and Hertzmann et al. [197]. We also note that inpainting can be generalized to
apply to non-image-based scenarios, such as filling in holes in a depth image or 3D
triangle mesh [221, 124].
An early hybrid approach to retargeting was proposed by Setlur et al. [437], who
used an importance map to compute ROIs, removed these from the image, and
inpainted the holes to create an “empty” background image. This image is uniformly
resized to the desired dimensions, and the ROIs pasted back onto the new background
in roughly the same spatial relationship.
Krähenbühl et al. [255] made an interesting observation that naïve retargeting can
introduce aliasing into the resulting image, manifesting as blurring of sharp edges,
and proposed cost function terms to preserve the original image gradients, as well as
a low-pass filter to limit the spatial frequencies in the retargeted image. Mansfield et
al. [313] showed that when a user-supplied depth map was available, a seam-carving-
based approach could be used to create retargeted images that respected depth
ordering and could even contain realistically overlapping foreground objects. Cheng
et al. [90] described reshuffling applications specifically in the context of scenes with
many similar repeated elements.
Rav-Acha et al. [383] proposed an interesting approach to video editing of a fore-
ground object that undergoes pose changes (e.g., a person’s rotating head). They
estimated a non-photorealistic 2D texture map for the 3D object and its mapping
to each image in the video sequence. The user performs editing/compositing oper-
ations directly on the texture map, which is then warped to produce a retextured
video.
This chapter should provide fairly convincing evidence that today it is very difficult
to tell if a digital image resulted from an untouched photograph of a real scene or if it
has been manipulated. This is great news for visual effects in movies, but somewhat
unsettling for photographs in other spheres of life that we expect to be trustworthy
(e.g., newspaper photographs of historic events, photographic evidence in trials). In
fact, a new field of digital forensics has arisen to detect such tampering. Farid pro-
vided an excellent general overview [134] and technical survey [133] of techniques for
detecting whether a digital image has been manipulated. Such techniques include
the detection of telltale regularities in pixel correlations, or inconsistencies in JPEG
quantization/blocking, camera transfer function and/or noise, lighting directions,
and perspective effects. Lalonde et al. [261, 260] also noted the importance of match-
ing lighting, camera orientation, resolution, and other cues to make a composite
image look convincing.
3.1 Sketch the hard composite implied by the images S, T , and M in Figure 3.39.
S T M
Figure 3.39. Source, target, and mask images for a compositing problem.
3.2 Consider the corresponding rows of a source and target image given by
Figure 3.40. We want to use a weighted transition region between columns
50 and 70 so that pixels to the left of column 50 come entirely from the
source, pixels to the right of column 70 come entirely from the target, and
pixels in the transition region are a linearly weighted blend between the two
images. Sketch the row of the composite, indicating important values on
the x and y axes.
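[Figure 3.40: plots of the source row and the target row, with column ticks at 50, 60, 70, and 100 and intensity ticks at 50, 200, and 250.]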
where I is the original image, and Ki is a low-pass filter whose spatial extent
increases with i.
3.4 Prove that Equation (3.5) is true — that is, that the original image can be
obtained as the sum of the upsampled images of the Laplacian pyramid.
Use induction; that is, show that
is true for i = 0, . . . , N − 1.
3.5 Suppose we use a Laplacian pyramid to blend two images along a vertical
boundary using the 5 × 5 kernel in Equation (3.3). If we use five levels of the
pyramid, how wide is the transition band at the lowest level (with respect to
the original image resolution)?
3.6 Show how Equations (3.11)–(3.12) result from Equation (3.7) based on the
Euler-Lagrange equation.
3.7 Prove that when (Sx , Sy ) is a conservative vector field, Equation (3.13) is the
same as Equation (3.11).
3.8 Suppose the region for a Poisson compositing problem is given by
Figure 3.41, with the pixels labeled from 1 to 30 as shown.
 1  2  3  4  5  6
 7  8  9 10 11 12
13 14 15 16 17 18
19 20 21 22 23 24
25 26 27 28 29 30
Figure 3.41. An example region for a Poisson compositing problem.
[Figure: region Ω with corner points p1 = (100,100), p2 = (100,50), p3 = (150,50), and p4 = (150,100).]
[Figure panels: "Initial labels" and "Final labels", with regions labeled 1, 2, and 3.]
Figure 3.44. Initial and final states for a series of α-expansion steps.
3.15 Consider the images and inpainting regions in Figure 3.45. Explain why,
using PDE-based inpainting, the inpainted region will be identical in both
cases. (For more investigation of this issue, see [273].)
3.16 Determine the white point in Figure 3.21 with the highest inpainting
priority, given the labeled confidence and data values.
3.17 Determine a target region such that every pixel on the fill front for patch-
based inpainting has the same confidence term.
3.18 Construct an image and inpainting region for patch-based inpainting that
contains both a point with a very high confidence term and a very low data
term, and a point with very high data term and a very low confidence term.
Figure 3.45. Two images with inpainting regions Ω.
3.19 Consider the distortion energy for one quad q in optimized scale-and-stretch
resizing, given by
$$\sum_{v_i, v_j \in V_q} \left\| (v_i' - v_j') - s_q (v_i - v_j) \right\|^2 \tag{3.48}$$
Show that the optimal scaling factor $s_q$ that minimizes Equation (3.48) is
given by:
$$s_q = \frac{\sum_{v_i, v_j \in V_q} (v_i' - v_j')^\top (v_i - v_j)}{\sum_{v_i, v_j \in V_q} \| v_i - v_j \|^2} \tag{3.49}$$
3.20 Consider the image and seams in Figure 3.46. Note that the region on the
left side of the image is darker than the region on the right side. Order the
seams in increasing energy (assuming we use Equation (3.36)).
[Figure 3.46: an image with five candidate seams labeled A, B, C, D, and E.]
3.21 Show why the arc weights in Figure 3.31b result in a minimal cut that (a)
forms a connected path of pixels that (b) only intersects one pixel per row.
3.22 Show why the three cases in Figure 3.32 and the costs in Equation (3.37)
imply the arc weights in Figure 3.33a.
3.23 Verify that the bidirectional similarity update rule in Equation (3.43)
is obtained from differentiating the sum of Equation (3.41) and
Equation (3.42).
3.24 Construct an example pair of images such that the number of patches $N_j$ in
the bidirectional similarity completeness term is equal to 0 for some pixel,
and much larger than $W^2$ for some other pixel.
4 Features and Matching
In many visual effects applications, we need to relate images taken from different
perspectives or at different times. For example, we often want to track a point on a
set as a camera moves around during a shot so that a digital creature can be later
inserted at that location. In fact, finding and tracking many such points is critical for
algorithms that automatically estimate the 3D path of a camera as it moves around a
scene, a problem called matchmoving that is the subject of Chapter 6. However, not
every point in the scene is a good choice for tracking, since many points look alike. In
this chapter, we describe the process of automatically detecting regions of an image
that can be reliably located in other images of the same scene; we call these special
regions features. Once the features in a given image have been found, we also discuss
the problems of describing, matching, and tracking them in different images of the
same scene.
In addition to its core use in matchmoving, feature detection is also important
for certain algorithms that estimate dense correspondence between images and video
sequences (Chapter 5), as well as for both marker-based and markerless motion capture
(Chapter 7). Outside the domain of visual effects, feature matching and tracking
are commonly used for stitching images together to create panoramas [72], localizing
mobile robots [432], and quickly finding objects [456] or places [424] in video
databases.
Feature tracking is a subset of the more general problem of visual tracking from
computer vision. However, there are some big differences to keep in mind. Visual
tracking algorithms are usually designed to follow a particular meaningful object
such as a person or car throughout a video sequence. On the other hand, features are
automatically extracted from an image based purely on mathematical considerations,
and usually look like individually uninteresting blobs or corners. Precise localization
of features is critical for subsequent applications like matchmoving, while a general
visual tracker may use a crude box or ellipse (e.g., [103]) to outline the region of
interest. It’s also common for general visual trackers to maintain a probabilistic rep-
resentation of an object’s state, for example using a Kalman filter (e.g., [58]), while
this approach is fairly uncommon in feature tracking. Finally, a major area of interest
in feature matching is the wide-baseline case in which the images under considera-
tion were taken from cameras that were physically far apart, whereas visual tracking
generally assumes the camera moves only slightly between images.
While we generally use the term features throughout this chapter to denote image
regions of interest, several other terms are often used to describe the same concept,
including interest points, keypoints, and tie points. We generally use the word
matching when discussing an arbitrary pair of images of the same scene, and use
the word tracking when the images come from a video sequence.
This chapter is split into two main sections. We first discuss the key problem of
feature detection — that is, deciding which image regions are sufficiently distinctive
(Section 4.1). We then discuss the problem of feature description — that is, decid-
ing how to represent the image information inside each region for later matching
(Section 4.2). We briefly describe evaluation techniques that help in deciding on a
good detector/descriptor combination (Section 4.3) as well as extensions to color
images (Section 4.4).
In this chapter, we generally assume that the problem is to detect and match fea-
ture points in a set of natural images, e.g., acquired from a camera on location. This
is frequently the situation for matchmoving with a freely moving camera, as we’ll
discuss in Chapter 6. When we have more control over the environment — for exam-
ple, a soundstage set — it’s common to introduce artificial tracking markers (e.g.,
gaffer-tape crosses on the surfaces of a blue- or green-screen set) that are relatively
straightforward to detect and track. We discuss the problem of designing distinctive
artificial tracking markers in Section 4.5.
Initially, we’ll assume that a feature is a square block of pixels centered at a certain
location in an image. Our first goal is to mathematically characterize what makes a
good feature. Intuitively, we want to select a block that is highly distinctive, so that in
a different image of the same scene, we can find a unique match. Put another way,
we want the detection to be repeatable — that is, given a different image of the same
scene, the feature is distinctive enough that we can find it again in the correct location.
Figure 4.1 illustrates several feature candidates in an example image. Candidate A
is a poor choice of feature, since this nearly-constant-intensity patch is almost iden-
tical to other nearly-constant-intensity patches in the image. Candidate B is a better
feature, since the strong edge passing through it makes it more distinctive. However,
there are still several blocks in the image that are almost identical to Candidate B,
which can be obtained by sliding the block along the edge; this ambiguity is called
the aperture problem. Candidates C and D are good choices for features; the image
intensities at Candidate C form a corner and those at Candidate D form a blob; both
blocks are locally unique. That is, each block does not resemble any other block in
its local neighborhood. In the following sections, we formalize this intuition that a
feature should have locally distinctive intensities, and discuss detectors or interest
operators that automatically find such features. In this section, we’ll assume that
the images under consideration are grayscale, and will discuss extensions to color in
Section 4.4.
diagonal directions. If the difference is high in all directions, the block is a good can-
didate to be a feature. Harris and Stephens [186] are widely credited with extending
this idea to create what has become known as the Harris corner detector,1 which
can be derived as follows.
Let w(x, y) be a binary indicator function that equals 1 for pixels (x, y) inside the
block under consideration and 0 otherwise. Then consider the function E(u, v) that
corresponds to the sum of squared differences obtained by a small shift of the block
in the direction of the vector (u, v):
$$E(u, v) = \sum_{(x,y)} w(x, y)\,\bigl(I(x + u, y + v) - I(x, y)\bigr)^2 \tag{4.1}$$
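$$
\begin{aligned}
E(u, v) &= \sum_{(x,y)} w(x, y) \left( I(x, y) + u \frac{\partial I}{\partial x}(x, y) + v \frac{\partial I}{\partial y}(x, y) - I(x, y) \right)^2 \\
&= \sum_{(x,y)} w(x, y) \left( u \frac{\partial I}{\partial x}(x, y) + v \frac{\partial I}{\partial y}(x, y) \right)^2 \\
&= \sum_{(x,y)} w(x, y) \left( u^2 \left( \frac{\partial I}{\partial x}(x, y) \right)^2 + 2uv \frac{\partial I}{\partial x}(x, y) \frac{\partial I}{\partial y}(x, y) + v^2 \left( \frac{\partial I}{\partial y}(x, y) \right)^2 \right) \\
&= \begin{bmatrix} u & v \end{bmatrix}
\begin{bmatrix}
\sum_{(x,y)} w(x, y) \left( \frac{\partial I}{\partial x}(x, y) \right)^2 & \sum_{(x,y)} w(x, y) \frac{\partial I}{\partial x}(x, y) \frac{\partial I}{\partial y}(x, y) \\
\sum_{(x,y)} w(x, y) \frac{\partial I}{\partial x}(x, y) \frac{\partial I}{\partial y}(x, y) & \sum_{(x,y)} w(x, y) \left( \frac{\partial I}{\partial y}(x, y) \right)^2
\end{bmatrix}
\begin{bmatrix} u \\ v \end{bmatrix}
\end{aligned}
\tag{4.2}
$$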
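The relationship between the direct sum of squared differences in Equation (4.1) and the quadratic form in Equation (4.2) can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names, the image layout `img[y][x]`, and the use of halved central differences for the gradients are assumptions made for illustration. A linear intensity ramp is used because the first-order Taylor expansion is exact for it; for a general image the two quantities agree only approximately, and only for small shifts.

```python
def central_gradients(img, x, y):
    """Central-difference estimates of (dI/dx, dI/dy) at pixel (x, y); img is indexed img[y][x]."""
    ix = (img[y][x + 1] - img[y][x - 1]) / 2.0
    iy = (img[y + 1][x] - img[y - 1][x]) / 2.0
    return ix, iy


def ssd_shift_error(img, block, u, v):
    """Direct sum of squared differences for an integer shift (u, v).
    `block` lists the (x, y) pixels where the indicator w(x, y) equals 1."""
    return sum((img[y + v][x + u] - img[y][x]) ** 2 for x, y in block)


def harris_matrix(img, block):
    """Entries (a, b, c) of the symmetric 2x2 matrix [[a, b], [b, c]] of gradient products."""
    a = b = c = 0.0
    for x, y in block:
        ix, iy = central_gradients(img, x, y)
        a += ix * ix
        b += ix * iy
        c += iy * iy
    return a, b, c


def quadratic_form(h, u, v):
    """Evaluate [u v] H [u v]^T for H given as (a, b, c)."""
    a, b, c = h
    return a * u * u + 2 * b * u * v + c * v * v


# Synthetic 9x9 linear ramp I(x, y) = 2x + 3y.
ramp = [[2.0 * x + 3.0 * y for x in range(9)] for y in range(9)]
block = [(x, y) for y in range(3, 6) for x in range(3, 6)]  # 3x3 block centered at (4, 4)

exact = ssd_shift_error(ramp, block, 1, 1)
approx = quadratic_form(harris_matrix(ramp, block), 1, 1)
print(exact, approx)
```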
1 While Harris’s name is now attached to the idea, other authors proposed very similar approaches
earlier, notably Förstner [150].
Figure 4.2. Top row: Candidate feature blocks from Figure 4.1. Middle row: Harris matrix
eigenvalues and Harris quality measure C with k = 0.04. Bottom row: Error surfaces E(u, v)
around block center.
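The symmetric positive semidefinite matrix in Equation (4.2) is called the Harris
matrix:2
$$
H = \begin{bmatrix}
\sum_{(x,y)} w(x, y) \left( \frac{\partial I}{\partial x}(x, y) \right)^2 & \sum_{(x,y)} w(x, y) \frac{\partial I}{\partial x}(x, y) \frac{\partial I}{\partial y}(x, y) \\
\sum_{(x,y)} w(x, y) \frac{\partial I}{\partial x}(x, y) \frac{\partial I}{\partial y}(x, y) & \sum_{(x,y)} w(x, y) \left( \frac{\partial I}{\partial y}(x, y) \right)^2
\end{bmatrix}
\tag{4.3}
$$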
The eigenvalues and eigenvectors of the Harris matrix can be analyzed to assess
the cornerness of a block. Let these be (λ1 , λ2 ) and (e1 , e2 ) respectively, with λ1 ≥ λ2 .3
Consider the following cases, illustrated in Figure 4.2:
1. The block is nearly-constant-intensity. In this case, $\frac{\partial I}{\partial x}$ and $\frac{\partial I}{\partial y}$ will both be
nearly zero for all pixels in the block. The surface E(u, v) will be nearly flat and
thus λ1 ≈ λ2 ≈ 0.
2. The block straddles a linear edge. In this case, $\frac{\partial I}{\partial x}$ and $\frac{\partial I}{\partial y}$ will both be nearly
zero for pixels far from the edge, and the gradient
$\begin{bmatrix} \frac{\partial I}{\partial x} & \frac{\partial I}{\partial y} \end{bmatrix}^\top$ will be perpendicular
to the edge direction for pixels near the edge. Thus, λ1 will be a non-negligible
positive value, with e1 normal to the edge direction, while λ2 ≈ 0 with e2 along
the edge direction. The surface E(u, v) will resemble a trough in the direction
of the edge.
3. The block contains a corner or blob. In this case, the surface E(u, v) will resem-
ble a bowl, since any (u, v) motion generates a block that looks different than
the one in the center. Both λ1 and λ2 will be positive.
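The three cases above can be reproduced on small synthetic blocks. The sketch below is an illustration only: it assumes a binary window, halved central differences for the gradients, and images stored as nested lists indexed `img[y][x]`; the helper names are hypothetical. It uses the closed-form eigenvalues of a symmetric 2×2 matrix rather than a general eigensolver.

```python
import math


def harris_matrix(img, x0, y0, half):
    """Sum gradient products over the (2*half+1) x (2*half+1) block centered at (x0, y0).
    Returns (a, b, c) for the symmetric matrix [[a, b], [b, c]]."""
    a = b = c = 0.0
    for y in range(y0 - half, y0 + half + 1):
        for x in range(x0 - half, x0 + half + 1):
            ix = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy = (img[y + 1][x] - img[y - 1][x]) / 2.0
            a += ix * ix
            b += ix * iy
            c += iy * iy
    return a, b, c


def eigenvalues(a, b, c):
    """Eigenvalues (lam1 >= lam2) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean + radius, mean - radius


N = 11
flat = [[50.0] * N for _ in range(N)]                                   # constant intensity
edge = [[100.0 if x >= 5 else 0.0 for x in range(N)] for y in range(N)]  # vertical step edge
corner = [[100.0 if (x >= 5 and y >= 5) else 0.0 for x in range(N)] for y in range(N)]

for name, img in [("flat", flat), ("edge", edge), ("corner", corner)]:
    lam1, lam2 = eigenvalues(*harris_matrix(img, 5, 5, 2))
    print(f"{name:6s} lam1 = {lam1:9.1f}  lam2 = {lam2:9.1f}")
```

On these inputs the flat block gives two zero eigenvalues, the edge block gives one large and one zero eigenvalue, and the corner block gives two clearly positive eigenvalues.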
2 This is also sometimes called the second moment matrix and is related to the image’s local
autocorrelation.
3 Note that both eigenvalues are real and non-negative since the matrix is positive semidefinite.
Consequently, we look for blocks where both eigenvalues are sufficiently large. However,
to avoid explicitly computing the eigenvalues, Harris and Stephens proposed
the quality measure
$$C = \det(H) - k\,\operatorname{trace}(H)^2 = \lambda_1 \lambda_2 - k\,(\lambda_1 + \lambda_2)^2 \tag{4.4}$$
where k is a tunable parameter (frequently set to around 0.04; the lower the value
of k, the more sensitive the detector). When both eigenvalues are large, C will be a
large positive number, while C will be near zero if one eigenvalue is small. Figure 4.2
illustrates the candidate blocks from Figure 4.1, along with the corresponding error
surfaces E(u, v), eigenvalues of H , and quality measures C. We can see that both
eigenvalues are large for Candidates C and D, with correspondingly high quality
measures, while the quality measures for Candidates A and B are very low.
To detect features in an image, we simply evaluate the quality measure at each
block in the image, and select feature points where the quality measure is above
a minimum threshold. The resulting points are called Harris corners. Figure 4.3
illustrates Harris corners detected in an example image; we can see that most of
the features lie on actual image corners and other distinctive features, while few
features are found in flat regions or along edges. Since the test only depends on the
eigenvalues and not the direction of the eigenvectors, the detected feature locations
are approximately rotation-invariant (meaning that we would detect roughly the
same apparent features if the image were rotated).4
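The detection procedure just described (evaluate the quality measure at every block, then threshold) can be sketched as follows. This is a minimal illustration, assuming the usual Harris and Stephens form of the measure, C = det(H) − k·trace(H)², a binary 3×3 window, and halved central-difference gradients. The function names and the synthetic test image are hypothetical, and the one-percent threshold simply mirrors the setting quoted for Figure 4.3.

```python
def harris_response_map(img, half=1, k=0.04):
    """Return {(x, y): C} with C = det(H) - k * trace(H)^2 for every block that fits in img.
    img is a list of rows, indexed img[y][x]."""
    height, width = len(img), len(img[0])
    margin = half + 1  # one extra pixel is needed for the central differences
    response = {}
    for y0 in range(margin, height - margin):
        for x0 in range(margin, width - margin):
            a = b = c = 0.0
            for y in range(y0 - half, y0 + half + 1):
                for x in range(x0 - half, x0 + half + 1):
                    ix = (img[y][x + 1] - img[y][x - 1]) / 2.0
                    iy = (img[y + 1][x] - img[y - 1][x]) / 2.0
                    a += ix * ix
                    b += ix * iy
                    c += iy * iy
            response[(x0, y0)] = (a * c - b * b) - k * (a + c) ** 2
    return response


def detect_corners(response, fraction=0.01):
    """Keep locations whose response exceeds `fraction` of the maximum response."""
    threshold = fraction * max(response.values())
    return sorted(p for p, value in response.items() if value > threshold)


# Synthetic 16x16 image: a bright square (columns and rows 5..10) on a dark background.
img = [[100.0 if 5 <= x <= 10 and 5 <= y <= 10 else 0.0 for x in range(16)]
       for y in range(16)]
response = harris_response_map(img)
corners = detect_corners(response)
print(corners)
```

In this toy image every detection falls in a small cluster around one of the four corners of the square, the response in the flat background is zero, and the response at the middle of an edge is negative, so neither passes the threshold. The clusters of detections around each corner are what non-maximal suppression is meant to thin out.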
Figure 4.3. Harris corners detected in an image, using 15×15 windows and a threshold of one
percent of the maximum quality measure value. Non-maximal suppression is applied to avoid
generating many responses for the same feature.
4 The proper term is actually rotation-covariant. That is, the feature locations detected in a rotated
image will be approximately the same as the rotations of the locations in the original image.
However, the term “invariant” is often misused to mean “covariant” in the context of feature
detection.
112 Chapter 4. Features and Matching
We usually apply non-maximal suppression to the results of Harris corner detection,
since the Harris quality measure will be high for many pixels in the neighborhood
of a corner. That is, to avoid multiple detections for the same underlying corner, we
only retain Harris corners whose quality measure is larger than that of all the points
in their N × N pixel neighborhood for some user-selected N.
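The detection procedure described above (quality measure, threshold, non-maximal suppression) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it takes the quality measure to have the common Harris form C = det(H) − k·trace(H)², uses the central differences of Equation (4.5) and a binary (box) window, and the names `box_sum`, `harris_response`, and `harris_corners` are hypothetical.

```python
import numpy as np

def box_sum(a, r):
    """Sum of `a` over a (2r+1) x (2r+1) window around each pixel (zero padding)."""
    n = 2 * r + 1
    p = np.pad(a, r, mode="constant")
    s = np.zeros((p.shape[0] + 1, p.shape[1] + 1))
    s[1:, 1:] = p.cumsum(axis=0).cumsum(axis=1)      # integral image
    return s[n:, n:] - s[:-n, n:] - s[n:, :-n] + s[:-n, :-n]

def harris_response(img, r=2, k=0.04):
    """Quality measure at every pixel (assumed form: det(H) - k * trace(H)^2)."""
    img = np.asarray(img, dtype=float)
    ix = np.zeros_like(img)
    iy = np.zeros_like(img)
    ix[:, 1:-1] = img[:, 2:] - img[:, :-2]           # Equation (4.5), x along columns
    iy[1:-1, :] = img[2:, :] - img[:-2, :]           # Equation (4.5), y along rows
    hxx, hxy, hyy = box_sum(ix * ix, r), box_sum(ix * iy, r), box_sum(iy * iy, r)
    return hxx * hyy - hxy ** 2 - k * (hxx + hyy) ** 2

def harris_corners(img, r=2, k=0.04, rel_thresh=0.01, n=7):
    """(row, col) positions that pass the threshold and are maximal in an n x n block."""
    c = harris_response(img, r, k)
    h = n // 2
    padded = np.pad(c, h, mode="constant", constant_values=-np.inf)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (n, n))
    is_local_max = c >= windows.max(axis=(2, 3))
    return np.argwhere(is_local_max & (c > rel_thresh * c.max()))

# Usage: a bright square on a dark background has corners but no texture elsewhere.
img = np.zeros((40, 40))
img[10:30, 10:30] = 1.0
print(harris_corners(img))   # detections cluster at the corners of the square
```

The relative threshold mirrors the "one percent of the maximum" setting quoted in the caption of Figure 4.3; the window radius and N are arbitrary choices for the toy image.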
$$\frac{\partial I}{\partial x}(x, y) = I(x + 1, y) - I(x - 1, y), \qquad \frac{\partial I}{\partial y}(x, y) = I(x, y + 1) - I(x, y - 1) \tag{4.5}$$
An alternative that ties in more closely with techniques discussed in the rest of
the chapter is to approximate the gradients by convolving the image with the derivatives
of a Gaussian function:
$$\frac{\partial I}{\partial x}(x, y) = I(x, y) * \frac{\partial G(x, y, \sigma_D)}{\partial x}, \qquad \frac{\partial I}{\partial y}(x, y) = I(x, y) * \frac{\partial G(x, y, \sigma_D)}{\partial y} \tag{4.6}$$
That is, we smooth the image to remove high frequencies before taking the deriva-
tive. Also, to make the response as a function of window location smoother, we
can replace the binary function w(x, y) in Equation (4.3) with a radially symmet-
ric function that weights pixels in the center of the window more strongly, such as a
Gaussian:
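$$\begin{aligned} w(x, y) &= \frac{1}{2\pi\sigma_I^2} \exp\left(-\frac{1}{2\sigma_I^2}\left((x - x_0)^2 + (y - y_0)^2\right)\right) && (4.8) \\ &= G(x - x_0, y - y_0, \sigma_I) && (4.9) \end{aligned}$$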
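As a rough illustration of Equations (4.6) and (4.8), the following NumPy sketch computes derivative-of-Gaussian gradients with separable 1-D kernels and builds a Gaussian weighting window. The kernel truncation at about 3σ and the helper names (`gaussian_1d`, `separable_conv`, `gaussian_gradients`, `gaussian_window`) are assumptions made for this example.

```python
import numpy as np

def gaussian_1d(sigma):
    """Sampled 1-D Gaussian and its derivative, truncated at about 3 sigma."""
    r = int(np.ceil(3 * sigma))
    x = np.arange(-r, r + 1, dtype=float)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    g /= g.sum()                       # normalize so smoothing preserves the mean
    dg = -x / sigma ** 2 * g           # derivative of the Gaussian
    return g, dg

def separable_conv(img, k_x, k_y):
    """Convolve rows with k_x (x direction), then columns with k_y (y direction)."""
    tmp = np.apply_along_axis(np.convolve, 1, img, k_x, mode="same")
    return np.apply_along_axis(np.convolve, 0, tmp, k_y, mode="same")

def gaussian_gradients(img, sigma_d):
    """Equation (4.6): convolve the image with the derivatives of a Gaussian."""
    img = np.asarray(img, dtype=float)
    g, dg = gaussian_1d(sigma_d)
    ix = separable_conv(img, dg, g)    # I * dG/dx
    iy = separable_conv(img, g, dg)    # I * dG/dy
    return ix, iy

def gaussian_window(r, sigma_i):
    """Equation (4.8) sampled on a (2r+1) x (2r+1) grid centered on (x0, y0)."""
    y, x = np.mgrid[-r:r + 1, -r:r + 1].astype(float)
    return np.exp(-(x ** 2 + y ** 2) / (2 * sigma_i ** 2)) / (2 * np.pi * sigma_i ** 2)

# Usage: a horizontal intensity ramp has unit slope in x and zero slope in y.
ramp = np.tile(np.arange(32.0), (32, 1))
ix, iy = gaussian_gradients(ramp, sigma_d=1.5)
print(ix[16, 16], iy[16, 16])
```

Because `np.convolve(..., mode="same")` pads with zeros, values within a kernel radius of the image border are unreliable; a production implementation would choose a border policy explicitly.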
where w(x, y) is 1 for pixels inside W and 0 otherwise. Using the same type of Taylor
series approximation as we did in Equation (4.2) and setting the derivative equal to
zero yields the linear system
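$$\begin{bmatrix} \sum_{(x,y)} w(x, y) \left(\frac{\partial I}{\partial x}(x, y, t)\right)^2 & \sum_{(x,y)} w(x, y)\, \frac{\partial I}{\partial x}(x, y, t)\, \frac{\partial I}{\partial y}(x, y, t) \\ \sum_{(x,y)} w(x, y)\, \frac{\partial I}{\partial x}(x, y, t)\, \frac{\partial I}{\partial y}(x, y, t) & \sum_{(x,y)} w(x, y) \left(\frac{\partial I}{\partial y}(x, y, t)\right)^2 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = - \begin{bmatrix} \sum_{(x,y)} w(x, y)\, \frac{\partial I}{\partial x}(x, y, t)\, \frac{\partial I}{\partial t}(x, y, t) \\ \sum_{(x,y)} w(x, y)\, \frac{\partial I}{\partial y}(x, y, t)\, \frac{\partial I}{\partial t}(x, y, t) \end{bmatrix} \tag{4.12}$$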
We can see that the square matrix in Equation (4.12) is exactly the Harris matrix
H of Equation (4.3). Shi, Tomasi, and Kanade argued that for the linear system to be
well conditioned — that is, for the feature to be reliably trackable — both eigenvalues
of H should be sufficiently large, suggesting the criterion
$$\min(\lambda_1, \lambda_2) > \tau \tag{4.13}$$
where λ1 and λ2 are the eigenvalues of H and τ is a user-defined threshold. Features discovered in this way are quite similar
to Harris corners, and are sometimes called KLT corners since they form the basis
for the well-known KLT (Kanade-Lucas-Tomasi) tracker [307, 492].
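A single translational tracking step based on the linear system in Equation (4.12) might be sketched as follows. This is an illustrative, single-iteration version, not the full KLT tracker: it uses a binary window, `np.gradient` (central differences with the 1/2 factor) for the spatial derivatives, a simple frame difference for the temporal derivative, and it assumes the trackability test is a minimum-eigenvalue threshold. The function name `klt_translation` and its parameters are hypothetical.

```python
import numpy as np

def klt_translation(i0, i1, x0, y0, r, tau=1e-6):
    """Estimate the shift (u, v) of the (2r+1)^2 block centered at (x0, y0).

    Returns None when the smaller eigenvalue of H is not above tau, i.e. when
    the linear system is too poorly conditioned for reliable tracking.
    """
    i0 = np.asarray(i0, dtype=float)
    i1 = np.asarray(i1, dtype=float)
    iy, ix = np.gradient(i0)           # spatial derivatives (axis 0 = y, axis 1 = x)
    it = i1 - i0                       # temporal derivative for a unit time step
    win = (slice(y0 - r, y0 + r + 1), slice(x0 - r, x0 + r + 1))
    gx, gy, gt = ix[win].ravel(), iy[win].ravel(), it[win].ravel()
    H = np.array([[gx @ gx, gx @ gy],
                  [gx @ gy, gy @ gy]])  # left-hand matrix of Equation (4.12)
    b = np.array([gx @ gt, gy @ gt])    # right-hand side (before negation)
    if np.linalg.eigvalsh(H).min() <= tau:
        return None
    return np.linalg.solve(H, -b)

# Usage: shift a smooth synthetic pattern by (u, v) = (0.3, -0.2) pixels.
yy, xx = np.mgrid[0:41, 0:41].astype(float)
pattern = lambda x, y: np.sin(0.2 * x) + np.cos(0.15 * y)
frame0 = pattern(xx, yy)
frame1 = pattern(xx - 0.3, yy + 0.2)
print(klt_translation(frame0, frame1, x0=20, y0=20, r=15))
```

In practice the step is iterated (re-warping the block by the current estimate) and run coarse-to-fine, since the Taylor approximation only holds for small displacements.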
Shi and Tomasi extended their model for the motion of a feature from a translation
to an affine transformation, to account for the deformation of features that
typically occurs over long sequences (see also Section 4.1.5). That is, the scene patch
corresponding to a square feature block in the first image will eventually project
to a non-square area as the camera and scene objects move, so Equation (4.10) is
modified to
where the parameters a, b, c, d allow the feature square to deform into a paral-
lelogram. The corresponding tracker is again obtained using a Taylor expansion.
We will discuss more advanced methods for affine-invariant feature detection in
Section 4.1.5.
When the feature dissimilarity (e.g., the error in Equation (4.11)) gets too large,
the feature is no longer reliable and should not be tracked. When many features
are simultaneously matched or tracked, outlier rejection techniques can be used to
dispose of bad features [494], and the underlying epipolar geometry provides a strong
constraint on where the matches can occur [576]. We will discuss the latter issue
further in Chapter 5. Wu et al. [554] noted that the KLT tracker could be improved by
processing frames both forward and backward in time, instead of always matching
the current frame to the previous one.
Jin et al. [222] extended Shi and Tomasi’s affine tracker to account for local photo-
metric changes in the image — that is, instead of assuming that the pixel intensities
in the transformed block remain the same from frame to frame, we allow for a scale
and shift:
When training images of the target feature under different illuminations are available,
more advanced photometric models can be obtained [184].
4.1.2 Harris-Laplace
A major drawback of Harris corners is that they are only extracted for a fixed, user-
defined block size. While setting this block size to a small value (e.g., 7 × 7 pixels)
enables the extraction of many fine-detail corners, it would also be useful to extract
features that take up a relatively larger portion of the image. That is, we would like to
detect features at different spatial scales. Detecting features in scale space is a critical
aspect of most modern feature detectors. We first describe how Harris corners can be
extracted at multiple scales, and in the following sections introduce new criteria not
based on the Harris matrix. Lindeberg [285, 286] pioneered the use of scale space for
image feature detection, providing much of the theoretical basis for subsequent work.
The key concept of scale space is the convolution of an image with a Gaussian
function:5
L(x, y, σD ) = G(x, y, σD ) ∗ I (x, y) (4.16)
where σD takes on a sequence of increasing values, typically a geometrically increasing
sequence of scales {σ0, kσ0, k²σ0, ...}. As σD increases, the image gets blurrier,
since the Gaussian acts as a low-pass filter. The idea is similar to the Gaussian pyramid
discussed in Section 3.1.2, except that the output image is not downsampled
after convolution.
Revisiting Section 4.1.1.1, we can rewrite the Harris matrix evaluated at a point
(x, y) using derivation scale σD and integration scale σI as:
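$$H(x, y, \sigma_D, \sigma_I) = G(x, y, \sigma_I) * \begin{bmatrix} \left(\frac{\partial L(x, y, \sigma_D)}{\partial x}\right)^2 & \frac{\partial L(x, y, \sigma_D)}{\partial x} \frac{\partial L(x, y, \sigma_D)}{\partial y} \\ \frac{\partial L(x, y, \sigma_D)}{\partial x} \frac{\partial L(x, y, \sigma_D)}{\partial y} & \left(\frac{\partial L(x, y, \sigma_D)}{\partial y}\right)^2 \end{bmatrix} \tag{4.17}$$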
Note that
$$\frac{\partial L(x, y, \sigma_D)}{\partial x} = \frac{\partial G(x, y, \sigma_D)}{\partial x} * I(x, y) \tag{4.18}$$
since differentiation and convolution are commutative, which implies that we can
take the derivative and smooth the image in either order.
If we compute features at different scales and look at the eigenvalues of the Harris
matrix to decide which features are the most significant, we don’t want larger features
to outweigh smaller features just because they’re computed over a larger domain. We
would like to scale-normalize the Harris matrix so that feature quality can be directly
compared across scales. Furthermore, it’s sometimes desirable to detect and match
features at different scales — for example, to match a large square of pixels from a
zoomed-in shot with a small square of pixels from a wider angle shot. In this case, we
should make sure that the derivatives we compute in creating the Harris matrix are
scale-invariant; that is, that we compute the same matrix regardless of the image
resolution.
5 The use of L(x, y, σ) to denote this Gaussian-blurred image is conventional notation and shouldn’t
be confused with the Laplacian pyramid images from Chapter 3.
4.1. Feature Detectors 115
We can determine the correct scale normalization as follows [127]. Suppose that
we have two versions of the same image: a high-resolution one I(x, y) and a
low-resolution one I′(x′, y′). The coordinates of the two images are related by x = kx′ and
y = ky′, where k is a scale factor greater than 1. If we consider a block centered at
(x′, y′) with scales (σD, σI) in the low-resolution image, it will correspond to the block
centered at (kx′, ky′) with scales (kσD, kσI) in the high-resolution image. From the
chain rule, we can also compute that the image gradients at corresponding points
satisfy ∇I′ = k∇I. Substituting everything into Equation (4.17), we have that
$$H(x, y, k\sigma_D, k\sigma_I) = \frac{1}{k^2} H'(x', y', \sigma_D, \sigma_I) \tag{4.19}$$
where H and H′ are the scale-dependent Harris matrices computed for the high- and
low-resolution images, respectively. This implies that if we compute a scale space
where each derivation scale and integration scale is a multiple of a base scale σ0, i.e.,
at {σ0, kσ0, k²σ0, ...}, then we should compute the scale-normalized Harris matrix as
That is, in order to directly compare the response from the Harris matrix at different
scales, we must multiply Equation (4.17) by the compensation term k². Now we can
apply the Harris criterion with the same threshold at every scale, that is:
1. Create the scale space of the image for a fixed set of scales σD ∈
{σ0, kσ0, k²σ0, ...}, with σI = aσD. Typical values are σ0 = 1.5, k ∈ [1.2, 1.4],
and a ∈ [1.0, 2.0] (see [325, 327]).
2. For each scale, compute the scale-normalized Harris matrix in Equation (4.20)
and find all local maxima of the Harris function in Equation (4.4) that are above
a certain threshold.
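The two steps above can be sketched as follows. This is a hedged illustration, not the book's implementation: it assumes the scale normalization amounts to multiplying the matrix of Equation (4.17) by σD² (which matches the k² compensation per scale step up to a constant factor), assumes the Harris function has the form det − k·trace², and uses hypothetical helper names (`blur`, `normalized_harris_matrix`, `local_maxima`, `multiscale_harris`).

```python
import numpy as np

def blur(img, sigma):
    """Separable Gaussian blur, kernel truncated at about 3 sigma, edge padding."""
    r = int(np.ceil(3 * sigma))
    x = np.arange(-r, r + 1, dtype=float)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    g /= g.sum()
    p = np.pad(img, r, mode="edge")
    p = np.apply_along_axis(np.convolve, 1, p, g, mode="valid")
    return np.apply_along_axis(np.convolve, 0, p, g, mode="valid")

def normalized_harris_matrix(img, sigma_d, sigma_i):
    """Entries (hxx, hxy, hyy) of Equation (4.17), multiplied by sigma_d^2 (assumed)."""
    L = blur(np.asarray(img, dtype=float), sigma_d)
    Ly, Lx = np.gradient(L)
    s2 = sigma_d ** 2
    return (s2 * blur(Lx * Lx, sigma_i),
            s2 * blur(Lx * Ly, sigma_i),
            s2 * blur(Ly * Ly, sigma_i))

def local_maxima(c):
    """Boolean mask of pixels that are maximal in their 3 x 3 neighborhood."""
    p = np.pad(c, 1, mode="constant", constant_values=-np.inf)
    win = np.lib.stride_tricks.sliding_window_view(p, (3, 3))
    return c >= win.max(axis=(2, 3))

def multiscale_harris(img, sigma0=1.5, k=1.3, a=1.5, n_scales=4,
                      thresh=1e-4, k_harris=0.04):
    """Return (x, y, sigma_d) triples using one threshold for every scale."""
    features = []
    for n in range(n_scales):
        sigma_d = sigma0 * k ** n
        hxx, hxy, hyy = normalized_harris_matrix(img, sigma_d, a * sigma_d)
        c = hxx * hyy - hxy ** 2 - k_harris * (hxx + hyy) ** 2
        ys, xs = np.nonzero(local_maxima(c) & (c > thresh))
        features += [(int(x), int(y), sigma_d) for x, y in zip(xs, ys)]
    return features

# Usage on a toy image containing one bright square.
square = np.zeros((48, 48))
square[16:32, 16:32] = 1.0
print(len(multiscale_harris(square, thresh=1e-6)))
```

The point of the normalization is visible in `multiscale_harris`: the same `thresh` is applied at every scale without rescaling.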
Figure 4.4 illustrates the idea; if we fix the (x, y) value specified by the dot in each
image and plot the normalized Laplacian as a function of σ, we see that the function
assumes a maximum at the same apparent scale in each case (visualized as the radius
of the circle in the top row). We will discuss the normalized Laplacian further in the
next section.
Figure 4.4. Selecting the characteristic scale of a feature using the normalized Laplacian. Top
row: original images with manually-selected center locations (white dots). Bottom row: the
normalized Laplacian NL(σ) as a function of scale σ. The characteristic scale σ that maximizes the normalized
Laplacian is used as the radius of the corresponding circle in the top row. The ratio between the
two characteristic scales is 2.64, which is almost the same as the actual underlying scale factor
of 2.59 relating the images.
Mikolajczyk and Schmid [325] adopted this approach to compute what they called
Harris-Laplace features. We use the same two steps as previously shown to detect
Harris corners at each scale, and add the additional step6
3. For each detected feature (say at scale k^n σ0), retain it only if its normalized
Laplacian is above a certain threshold, and it forms a local maximum in the
scale dimension, that is:
$$NL(x, y, k^n \sigma_0) > NL(x, y, k^{n-1} \sigma_0) \quad \text{and} \quad NL(x, y, k^n \sigma_0) > NL(x, y, k^{n+1} \sigma_0) \tag{4.22}$$
(Note that it’s still possible to generate multiple detections at the same (x, y)
location with different characteristic scales, but the scales will be somewhat
separated.)
6 A slight modification that gives higher localization accuracy but has more computational cost was
described in [327].
Figure 4.5. Harris-Laplace features detected in a pair of images of the same scene. The radius of
each circle indicates the characteristic scale of the feature located at that circle’s center. (Fairly
aggressive non-maximal suppression was used so that the features don’t overwhelm the image.
In practice, a much larger number of features is detected.)
Figure 4.5 illustrates a pair of images and a subset of their detected Harris-Laplace
features, using circles to indicate each feature’s scale. We can see that the features
center on distinctive regions of the image and that the detected scales are natural.
More important, many of the same scene locations are detected at the same apparent
scales, indicating the promise of Harris-Laplace features for automatic matching. This
critical property is called scale covariance.7
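The scale-selection rule of Equation (4.22) can be illustrated with a small NumPy sketch. It assumes the normalized Laplacian is σ²·|∂²L/∂x² + ∂²L/∂y²| (consistent with the trace expression given later in the chapter), approximates the Laplacian with second differences of the blurred image, and uses hypothetical names (`blur`, `normalized_laplacian`, `characteristic_scales`).

```python
import numpy as np

def blur(img, sigma):
    """Separable Gaussian blur, kernel truncated at about 3 sigma, edge padding."""
    r = int(np.ceil(3 * sigma))
    x = np.arange(-r, r + 1, dtype=float)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    g /= g.sum()
    p = np.pad(img, r, mode="edge")
    p = np.apply_along_axis(np.convolve, 1, p, g, mode="valid")
    return np.apply_along_axis(np.convolve, 0, p, g, mode="valid")

def normalized_laplacian(img, sigma):
    """sigma^2 * |Lxx + Lyy| at every interior pixel (assumed definition)."""
    L = blur(np.asarray(img, dtype=float), sigma)
    lap = np.zeros_like(L)
    lap[1:-1, 1:-1] = (L[1:-1, 2:] + L[1:-1, :-2] + L[2:, 1:-1] + L[:-2, 1:-1]
                       - 4.0 * L[1:-1, 1:-1])
    return sigma ** 2 * np.abs(lap)

def characteristic_scales(img, x, y, sigma0=1.5, k=1.2, n_scales=12, thresh=0.0):
    """Scales k^n * sigma0 at which NL(x, y, .) is a local maximum, Equation (4.22)."""
    sigmas = sigma0 * k ** np.arange(n_scales)
    nl = np.array([normalized_laplacian(img, s)[y, x] for s in sigmas])
    return [(float(sigmas[n]), float(nl[n])) for n in range(1, n_scales - 1)
            if nl[n] > thresh and nl[n] > nl[n - 1] and nl[n] > nl[n + 1]]

# Usage: for a Gaussian blob of standard deviation 4, the selected scale
# should land near 4 (up to the spacing of the sampled scales).
yy, xx = np.mgrid[0:81, 0:81].astype(float)
blob = np.exp(-((xx - 40) ** 2 + (yy - 40) ** 2) / (2 * 4.0 ** 2))
print(characteristic_scales(blob, 40, 40))
```

Recomputing the full blurred image for every scale is wasteful when only one pixel is queried; a real detector builds the scale space once and reuses it for all candidate features.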
As before, L(x, y, σD ) is the Gaussian-filtered image at the specified scale. Note that
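$$\operatorname{trace} \hat{S}(x, y, \sigma_D) = \sigma_D^2 \left( \frac{\partial^2 L(x, y, \sigma_D)}{\partial x^2} + \frac{\partial^2 L(x, y, \sigma_D)}{\partial y^2} \right) \tag{4.24}$$
$$\det \hat{S}(x, y, \sigma_D) = \sigma_D^4 \left( \frac{\partial^2 L(x, y, \sigma_D)}{\partial x^2} \frac{\partial^2 L(x, y, \sigma_D)}{\partial y^2} - \left( \frac{\partial^2 L(x, y, \sigma_D)}{\partial x \partial y} \right)^2 \right) \tag{4.25}$$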
In particular, we can see that the absolute value of the Hessian’s trace is the same as
the normalized Laplacian in Equation (4.21).
Figure 4.6. (a) An original image. (b) Harris-Laplace features. (c) Laplacian-of-Gaussian features
(i.e., obtained with the trace of the Hessian). Note that the detector responds to both blobs and
edges. (d) Hessian-Laplace features (i.e., obtained with the determinant of the Hessian). The
detector does not respond to edges.
We can substitute the Hessian matrix and either its trace or determinant in Step 2
on p. 115 to obtain a feature detector that responds strongly to blobs, as illustrated in
the bottom row of Figure 4.6. Using the trace, that is, the Laplacian, has a pleas-
ing symmetry in that we can view the detector as selecting local maxima of the
same function in both the spatial and scale dimensions. These features are called
Laplacian-of-Gaussian or LoG features, since we’re computing the Laplacian of a
Gaussian-smoothed image at a given scale. That is, to detect LoG features, we com-
pute the quantity in Equation (4.24) at every (x, y, σD ), and find points where this
function of three parameters is locally maximal.
Figures 4.6c-d illustrate that the determinant of the Hessian does a better job than
the trace for rejecting long, thin structures and finding well-proportioned blobs. This
approach (using the determinant of the Hessian for detection and its trace for scale
selection) produces what are called Hessian-Laplace features. One can also require
that the trace and determinant of the Hessian are simultaneously maximized [326].
All of these features are scale-covariant.
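The difference between the trace and determinant responses can be checked numerically. The sketch below evaluates Equations (4.24) and (4.25) with finite differences on a Gaussian-blurred image and compares an ideal straight edge with a blob; the helper names `blur` and `normalized_hessian` are hypothetical, and the repeated use of `np.gradient` is just one convenient way to approximate second derivatives.

```python
import numpy as np

def blur(img, sigma):
    """Separable Gaussian blur, kernel truncated at about 3 sigma, edge padding."""
    r = int(np.ceil(3 * sigma))
    x = np.arange(-r, r + 1, dtype=float)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    g /= g.sum()
    p = np.pad(img, r, mode="edge")
    p = np.apply_along_axis(np.convolve, 1, p, g, mode="valid")
    return np.apply_along_axis(np.convolve, 0, p, g, mode="valid")

def normalized_hessian(img, sigma_d):
    """Scale-normalized trace (4.24) and determinant (4.25) at every pixel."""
    L = blur(np.asarray(img, dtype=float), sigma_d)
    Ly, Lx = np.gradient(L)            # first derivatives (axis 0 = y, axis 1 = x)
    Lxy, Lxx = np.gradient(Lx)         # derivatives of Lx along y and along x
    Lyy, _ = np.gradient(Ly)           # derivative of Ly along y
    trace = sigma_d ** 2 * (Lxx + Lyy)
    det = sigma_d ** 4 * (Lxx * Lyy - Lxy ** 2)
    return trace, det

# A vertical step edge: the trace responds next to the edge, the determinant does not.
edge = np.zeros((64, 64))
edge[:, 32:] = 1.0
tr_e, det_e = normalized_hessian(edge, 2.0)
print(tr_e[32, 34], det_e[32, 34])

# A Gaussian blob: both the trace and the determinant respond at its center.
yy, xx = np.mgrid[0:64, 0:64].astype(float)
blob = np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / (2 * 4.0 ** 2))
tr_b, det_b = normalized_hessian(blob, 2.0)
print(tr_b[32, 32], det_b[32, 32])
```

Along a straight edge the second derivative in the edge direction vanishes, so the product of the two principal curvatures (the determinant) is zero even though their sum (the trace) is not.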
Bay et al. [33] noted that the discrete Gaussian filters used in the computation of the
scale-normalized Hessian could be approximated by extremely simple box filters, as
illustrated in Figure 4.7. Since box filters only involve simple sums and differences of
pixels, they can be applied very quickly compared to filters with floating-point coeffi-
cients. If integral images [516] are used for the computation, the speed of applying the
box filter is independent of the filter size. Bay et al. proposed to use the box filters in
an approximation of the Hessian’s determinant, resulting in what they called the Fast
Hessian detector. We will discuss additional fast feature detectors in Section 4.1.6.
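The claim that box-filter cost is independent of filter size follows from the integral-image identity, which the following sketch illustrates. The function names `integral_image` and `box_sum` are hypothetical, and the three-lobe combination at the end only gestures at the 1, −2, 1 pattern of a second-derivative box filter; it is not the exact Fast Hessian filter layout.

```python
import numpy as np

def integral_image(img):
    """ii[r, c] holds the sum of img[:r, :c]; the extra zero row/column simplifies lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] using four lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

# Usage: any box sum costs four array reads, whatever the box size.
img = np.arange(100, dtype=float).reshape(10, 10)
ii = integral_image(img)
print(box_sum(ii, 2, 3, 5, 4), img[2:7, 3:7].sum())

# Three side-by-side boxes weighted 1, -2, 1 (an illustrative layout only).
lobes = [box_sum(ii, 2, c, 5, 2) for c in (1, 3, 5)]
print(lobes[0] - 2 * lobes[1] + lobes[2])
```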
Figure 4.7. (a) Example discrete 9 × 9 Gaussian derivative filters used for computing
the Hessian, with σ = 1.2. The top filter is ∂²L(x, y, σ)/∂x² and the bottom filter is
∂²L(x, y, σ)/∂x∂y. Light values are positive, black values are negative, and gray values are
near zero. (b) Efficient box-filter approximations of the filters at left. Gray values
are 0. (Box weights shown in (b): 1, −2, 1 for the top filter; 1, −1, −1, 1 for the bottom filter.)
4.1.4 Difference-of-Gaussians
Lowe [306] made the important observation that the Laplacian-of-Gaussian detector
could be approximated by a Difference-of-Gaussians or DoG detector. Why is this
the case? From the definition of the Gaussian function in Equation (4.7), we can
show that
$$\frac{\partial G}{\partial \sigma} = \sigma \nabla^2 G \tag{4.26}$$
If we assume that we generate Gaussian-smoothed images where adjacent scales
differ by a factor of k, then we can approximate
$$\frac{\partial G}{\partial \sigma} \approx \frac{G(x, y, k\sigma) - G(x, y, \sigma)}{k\sigma - \sigma} \tag{4.27}$$
That is, the difference of the Gaussians at adjacent scales is a good approximation
to the scale-normalized Laplacian, which we used to construct LoG features in the
previous section. This is highly advantageous, since to compute scale-space features
we had to create these Gaussian-smoothed versions of the image anyway. Figure 4.8
compares a difference of Gaussians with the Laplacian of Gaussian, showing that the
DoG is a good approximation to the LoG.
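Combining Equations (4.26) and (4.27) gives G(x, y, kσ) − G(x, y, σ) ≈ (k − 1)σ²∇²G, which can be checked numerically. The sketch below samples both sides on a grid for σ = 1 and k = 1.2 (the values used in Figure 4.8) and reports their cosine similarity; the function names and the choice of similarity measure are assumptions made for this illustration.

```python
import numpy as np

def gaussian_2d(x, y, sigma):
    """Unit-volume 2-D Gaussian."""
    return np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)

def laplacian_of_gaussian(x, y, sigma):
    """Closed-form Laplacian of the 2-D Gaussian."""
    r2 = x ** 2 + y ** 2
    return (r2 - 2 * sigma ** 2) / sigma ** 4 * gaussian_2d(x, y, sigma)

def dog_log_similarity(sigma=1.0, k=1.2, half_width=5.0, step=0.1):
    """Cosine similarity between the DoG and the scaled LoG on a sampled grid."""
    n = int(round(2 * half_width / step)) + 1
    axis = np.linspace(-half_width, half_width, n)
    x, y = np.meshgrid(axis, axis)
    dog = gaussian_2d(x, y, k * sigma) - gaussian_2d(x, y, sigma)
    log = (k - 1) * sigma ** 2 * laplacian_of_gaussian(x, y, sigma)
    return float((dog * log).sum() / np.sqrt((dog ** 2).sum() * (log ** 2).sum()))

print(dog_log_similarity())   # close to 1 when the two surfaces have the same shape
```

The approximation tightens as k approaches 1, since Equation (4.27) is a finite-difference estimate of the derivative with respect to σ.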
Therefore, the key quantity for DoG feature detection is the difference of adjacent
Gaussians applied to the original image:
Figure 4.8. (a) A Gaussian with σ = 1. (b) A Gaussian with σ = 1.2. (c) The Laplacian of Gaussian
with σ = 1. (d) The difference of the Gaussians in (b) and (a), normalized to have the same
maximum value as (c). We can see that the DoG is a good approximation to the LoG.
As with LoG detection, we seek maxima in both the spatial and scale dimensions to
detect features.
Lowe proposed a rearrangement of scale space to make the detection of local
maxima more computationally efficient. We define an octave of a Gaussian scale
space as a pair of images whose smoothing factors σ differ by a factor of two.8 Lowe’s
idea was to define the constant factor k characterizing the scale space as k = 2^(1/s), so
that each set of s + 1 images results in an octave. The image beginning each octave
is downsampled by a factor of 2 in both directions, but the sizes of the sequence
of Gaussian filters applied at each octave remain the same. In this way, we avoid
applying very large filters to original-sized images, and instead apply same-sized
filters to increasingly smaller images. If we just took the first image in each octave, we
would have a Gaussian pyramid, as discussed in Section 3.1.2. Figure 4.9 illustrates
the idea. Lowe suggested using s = 3 intervals (that is, four images per octave), with a
base scale of σ0 = 1.6 applied to the first image in each octave (after downsampling).
8 While the word “octave” might imply a factor of eight is involved, the term originally comes from
music theory (i.e., there are eight notes in a major scale, which corresponds to a doubling of
frequency).
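The octave structure can be sketched as below. This is a simplified reading of the description above, with several assumptions: the input image is treated as unblurred, each image in an octave is produced by adding blur to the octave's first image (using the fact that Gaussian variances add), the next octave starts from every second pixel of the image at scale 2σ0, and the extra images needed to detect extrema at the ends of an octave are omitted. The names `blur` and `build_dog_octaves` are hypothetical.

```python
import numpy as np

def blur(img, sigma):
    """Separable Gaussian blur, kernel truncated at about 3 sigma, edge padding."""
    r = int(np.ceil(3 * sigma))
    x = np.arange(-r, r + 1, dtype=float)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    g /= g.sum()
    p = np.pad(img, r, mode="edge")
    p = np.apply_along_axis(np.convolve, 1, p, g, mode="valid")
    return np.apply_along_axis(np.convolve, 0, p, g, mode="valid")

def build_dog_octaves(img, n_octaves=3, s=3, sigma0=1.6):
    """Return a list of (gaussians, dogs) per octave, with k = 2 ** (1 / s)."""
    k = 2.0 ** (1.0 / s)
    base = blur(np.asarray(img, dtype=float), sigma0)     # scale sigma0
    octaves = []
    for _ in range(n_octaves):
        gaussians = [base]
        for i in range(1, s + 1):
            # Extra blur that takes scale sigma0 to scale sigma0 * k**i.
            extra = sigma0 * np.sqrt(k ** (2 * i) - 1.0)
            gaussians.append(blur(base, extra))
        dogs = [g1 - g0 for g0, g1 in zip(gaussians[:-1], gaussians[1:])]
        octaves.append((gaussians, dogs))
        # The last image has scale 2 * sigma0; subsampling by 2 gives scale
        # sigma0 relative to the new, smaller pixel grid.
        base = gaussians[-1][::2, ::2]
    return octaves

# Usage: s + 1 = 4 Gaussian images and 3 DoG images per octave.
rng = np.random.default_rng(0)
for gaussians, dogs in build_dog_octaves(rng.random((64, 64))):
    print(gaussians[0].shape, len(gaussians), len(dogs))
```

Note that the same four filter sizes are reused in every octave, which is the computational saving the paragraph describes.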
Figure 4.9. Lowe’s octave structure for computing the scale-space DoG representation of an
image. In this example, s = 3. The white images represent Gaussians with the same sequence
of scales applied to octaves of images of different resolutions. The images at each octave are
half the size of the ones above. The gray images represent the differences of adjacent Gaussian-
filtered images. Features are detected as extrema in both the spatial and scale dimensions of
the DoG function. For example, the response at the black pixel must be larger or smaller than all
of its white neighbors.
1. Instead of finding a single characteristic scale for each feature, local extrema
(i.e., both maxima and minima) of the DoG function in Equation (4.29) with
respect to both space and scale are computed. That is, the DoG value must
be larger or smaller than all twenty-six of its neighbors (see Figure 4.9). To
compute the extrema at the “ends” of each octave of equally sized images,
we compute an extra pair of images such that the first and last images
of adjacent octaves represent the same scale. This way we don’t have to
directly compare images of different sizes. (These extra images aren’t shown in
Figure 4.9.)
2. After detecting the DoG extrema, we further localize each keypoint’s position
in space and scale by fitting a quadratic function to the twenty-seven values
of D(x, y, σ) at and around the detected point (xi, yi, σi). This function is
given by:
$$Q(x, y, \sigma) = D(x_i, y_i, \sigma_i) + g^\top \begin{bmatrix} x \\ y \\ \sigma \end{bmatrix} + \frac{1}{2} \begin{bmatrix} x & y & \sigma \end{bmatrix} \Lambda \begin{bmatrix} x \\ y \\ \sigma \end{bmatrix} \tag{4.30}$$
Figure 4.10. (a) A sampling of extrema of the multi-scale DoG function for the image in
Figure 4.6a prior to feature rejection. The size of the circle illustrates its detected scale. (b)
After feature rejection, low-contrast and edge-like features are removed. 110 of 777 original DoG
extrema passed the rejection tests.
given by
$$\begin{bmatrix} \hat{x}_i \\ \hat{y}_i \\ \hat{\sigma}_i \end{bmatrix} = - \left. \Lambda^{-1} g \, \right|_{(x_i, y_i, \sigma_i)} \tag{4.31}$$
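The twenty-six-neighbor extremum test and the quadratic refinement can be sketched as follows. Two caveats: this sketch estimates the gradient and Hessian with finite differences on the 3 × 3 × 3 neighborhood (the Taylor-expansion approach popularized by Lowe) rather than a least-squares fit to all twenty-seven values, and the refined scale coordinate is expressed in units of the scale index, not σ. The array layout `D[scale, y, x]` and the function names are assumptions.

```python
import numpy as np

def is_extremum(D, s, y, x):
    """True if D[s, y, x] is strictly above or strictly below all 26 neighbors."""
    cube = D[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
    center = cube[1, 1, 1]
    others = np.delete(cube.ravel(), 13)          # index 13 is the center
    return bool(np.all(center > others) or np.all(center < others))

def refine_extremum(D, s, y, x):
    """Offset (dx, dy, dscale) of the fitted quadratic's extremum, and its value."""
    c = D[s, y, x]
    g = 0.5 * np.array([D[s, y, x + 1] - D[s, y, x - 1],
                        D[s, y + 1, x] - D[s, y - 1, x],
                        D[s + 1, y, x] - D[s - 1, y, x]])
    dxx = D[s, y, x + 1] - 2 * c + D[s, y, x - 1]
    dyy = D[s, y + 1, x] - 2 * c + D[s, y - 1, x]
    dss = D[s + 1, y, x] - 2 * c + D[s - 1, y, x]
    dxy = 0.25 * (D[s, y + 1, x + 1] - D[s, y + 1, x - 1]
                  - D[s, y - 1, x + 1] + D[s, y - 1, x - 1])
    dxs = 0.25 * (D[s + 1, y, x + 1] - D[s + 1, y, x - 1]
                  - D[s - 1, y, x + 1] + D[s - 1, y, x - 1])
    dys = 0.25 * (D[s + 1, y + 1, x] - D[s + 1, y - 1, x]
                  - D[s - 1, y + 1, x] + D[s - 1, y - 1, x])
    hess = np.array([[dxx, dxy, dxs],
                     [dxy, dyy, dys],
                     [dxs, dys, dss]])
    offset = -np.linalg.solve(hess, g)            # minus inverse Hessian times gradient
    value = c + 0.5 * g @ offset                  # quadratic evaluated at the offset
    return offset, value

# Usage: an exactly quadratic D whose true peak is off the sampling grid.
s, y, x = np.mgrid[0:3, 0:3, 0:3].astype(float)
D = -((x - 1.2) ** 2 + (y - 0.9) ** 2 + (s - 1.1) ** 2) + 0.3 * (x - 1.2) * (y - 0.9)
print(is_extremum(D, 1, 1, 1), refine_extremum(D, 1, 1, 1))
```

If any component of the offset exceeds about half a sample, implementations commonly move to the neighboring sample and repeat the fit; that loop is omitted here.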
Figure 4.10 illustrates DoG features detected at multiple scales using this tech-
nique. We can see that, like the LoG and Hessian-Laplace detectors, the DoG feature
detector qualitatively detects blobs. As with these other feature detectors, each fea-
ture carries with it an associated scale that is used in the subsequent description and
matching processes. Each feature that survives the rejection tests is also assigned
a dominant orientation; we will discuss this process further in Section 4.2.3. Fea-
tures detected in this way are sometimes called SIFT features, where SIFT stands for
Scale-Invariant Feature Transform.
Figure 4.11. (a) Two images of the same scene taken from different perspectives as a result of
substantial camera motion. A fixed-size square is centered on corresponding locations in both
images. (b) A close-up of the two squares shows that the patches of pixels are quite different.
Figure 4.12. (a) Hessian-Laplace (green circles) and Hessian-Affine (yellow ellipses) features
detected in a pair of wide-baseline images. We can see that despite the rotation and skew, many
features are repeated and the elliptical regions are roughly covariant with the transformation. (b)
Affine-adapted circular regions created from the feature at the upper right corner of the lantern.
We can see that the circular regions are almost identical (in general up to a rotation).
pixels than a square centered at the same point in the right-hand image. If we were
to detect such a feature using a scale-invariant detector (e.g., Hessian-Laplace) and
draw a circle corresponding to the scale, the circles would contain different sets of
pixels. This means that descriptors based on these neighborhoods would use different
information, resulting in suboptimal — and perhaps incorrect — matches.
To combat this problem, we require an affine-invariant way of detecting features.
The basic idea is illustrated in Figure 4.12; after detecting a good spatial location for
a feature, we also estimate an elliptical neighborhood that we expect to be reliably
detectable in an image of the same scene patch from a different perspective. Math-
ematically, we want this ellipse to be covariant with an affine transformation of the
image. That is, if E(I ) is an elliptical region produced by a detector in an image I and
T is an affine transformation, we want
$$E(T(I)) = T(E(I)) \tag{4.32}$$
Before creating the feature descriptor, we can warp the detected affine-covariant
ellipse to a circle, and be confident that the circle produced by the same feature in
a different image contains the same set of pixels (up to a rotation). The fundamen-
tal theory of affine-invariant regions was first proposed by Lindeberg and Gårding
[287] and applied by other researchers including Baumberg [32], Schaffalitzky and
Zisserman [423], and Mikolajczyk and Schmid [327].
We can find affine-invariant elliptical regions around feature points with a
straightforward iterative procedure called affine adaptation:
1. Detect the feature point position and its characteristic scale (e.g., using Harris-
Laplace or Hessian-Laplace).
2. Compute the local second-moment matrix H at the given scale (i.e., the
scale-normalized Harris matrix in Equation (4.20)). Scale H so it has unit
determinant.9
3. Compute the Cholesky factorization H = CCᵀ, where C is a lower-triangular
matrix with non-negative diagonal elements.10 C is sometimes called the
matrix square root of H.
4. Warp the image structure around the feature point using the linear transfor-
mation C. That is, the new image coordinates are related to the old ones by
Inew (xnew ) = Iold (Cxold ).
5. Compute the local second-moment matrix H for the new image and scale H
so it has unit determinant.
6. If H is sufficiently close to the identity (i.e., its eigenvalues are nearly equal),
stop. Otherwise, go to Step 3.
We obtain the desired ellipse by mapping the unit circle back into the original
image coordinates by inverting the chain of scalings and matrix square roots. Linde-
berg and Gårding showed that under an ideal affine transformation of the image and
using Harris-Laplace features, the process will indeed produce covariant elliptical
regions. If we consider the same feature before and after an affine transformation,
the circular regions resulting from the affine adaptation process will be identical up
to a rotation. We will describe one way to account for this rotation when describing
the feature in Section 4.2.4.
Mikolajczyk and Schmid [327] proposed to simultaneously detect feature point
locations and corresponding affine-invariant regions using an iterative algorithm,
resulting in Harris-Affine features. The basic idea is the same as the algorithm
described earlier, except that within each iteration the location and characteristic
scale of the feature point are re-estimated. They proposed the ratio of the smaller
eigenvalue to the larger one in Step 6 as a measure of the local isotropy of the region,
stopping when the ratio was above a threshold (e.g., 0.95). On the other hand, fea-
tures can be rejected when the ratio of eigenvalues in the original image is too low,
indicating a highly elliptical (i.e., edge-like) region. If we use the Hessian matrix in
place of the second-moment matrix during the feature detection stage, we obtain
Hessian-Affine features.
9 This normalization was suggested by Baumberg [32]; Lindeberg and Gårding [287] instead
recommended dividing H by its smallest eigenvalue.
10 The factorization exists since H is symmetric with non-negative eigenvalues, but it is not unique
unless the eigenvalues are both positive.
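The matrix operations inside the affine adaptation loop (unit-determinant scaling, Cholesky factorization, and the eigenvalue-ratio stopping test) can be sketched in isolation. The image warping and the re-estimation of the second-moment matrix are deliberately left out, and the helper names are hypothetical; `np.linalg.cholesky` requires a positive-definite input, so a degenerate H would need separate handling.

```python
import numpy as np

def normalize_unit_det(H):
    """Scale a 2 x 2 second-moment matrix so that its determinant is 1."""
    return H / np.sqrt(np.linalg.det(H))

def isotropy(H):
    """Ratio of the smaller eigenvalue to the larger one (1 means isotropic)."""
    lam = np.linalg.eigvalsh(H)        # eigenvalues in ascending order
    return float(lam[0] / lam[1])

def adaptation_matrices(H, stop_ratio=0.95):
    """Return (H_normalized, C, converged) for one pass through Steps 2, 3, and 6."""
    Hn = normalize_unit_det(np.asarray(H, dtype=float))
    C = np.linalg.cholesky(Hn)         # lower triangular, Hn = C @ C.T
    return Hn, C, isotropy(Hn) >= stop_ratio

# Usage with an anisotropic second-moment matrix.
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])
Hn, C, converged = adaptation_matrices(H)
print(np.linalg.det(Hn), isotropy(Hn), converged)
print(np.allclose(C @ C.T, Hn))
```

The 0.95 default mirrors the example threshold quoted above for the eigenvalue ratio.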
Figure 4.13. (a) An original image, (b) A zoomed-in image illustrating the candidate pixel
(labeled p) and the circle of pixels (labeled 1-16) used for the FAST corner test. In this exam-
ple, the twelve-pixel clockwise arc from 11 to 6 is substantially darker than the center pixel,
resulting in a FAST corner.
detector. On the average, fewer than three intensity comparisons need to be made
to determine if a candidate pixel is a FAST corner. Both the original detector and the
extended version were found to produce repeatable features while running at least
an order of magnitude faster than the Harris and DoG detectors, enabling frame-rate
corner detection. A later extension that investigated a neighborhood of forty-eight
pixels around the candidate pixel improved the repeatability even more [402].
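For illustration, a direct (unoptimized) version of the sixteen-pixel circle test shown in Figure 4.13 might look like the sketch below. It assumes the usual segment-test formulation: a pixel is a corner when at least n contiguous circle pixels are all brighter than the center by more than a threshold t, or all darker by more than t. The values t = 20 and n = 12, the circle offsets, and the function names are assumptions; the learned decision tree that makes the real detector fast is not reproduced.

```python
import numpy as np

# Offsets (dx, dy) of the 16 pixels on a radius-3 circle, in circular order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def longest_circular_run(flags):
    """Length of the longest run of True values, allowing wrap-around."""
    flags = [bool(f) for f in flags]
    if all(flags):
        return len(flags)
    best = run = 0
    for f in flags * 2:                # doubling the list handles wrap-around
        run = run + 1 if f else 0
        best = max(best, run)
    return best

def is_fast_corner(img, x, y, t=20, n=12):
    """Segment test at pixel (x, y); the pixel must be at least 3 pixels from the border."""
    p = float(img[y, x])
    ring = np.array([float(img[y + dy, x + dx]) for dx, dy in CIRCLE])
    darker = longest_circular_run(ring < p - t)
    brighter = longest_circular_run(ring > p + t)
    return max(darker, brighter) >= n

# Usage: the tip of a narrow bright wedge is a corner; a straight edge is not.
wedge = np.full((9, 9), 50.0)
for dy in range(5):
    wedge[4 + dy, 4 + dy:] = 200.0
edge = np.full((9, 9), 50.0)
edge[:, 4:] = 200.0
print(is_fast_corner(wedge, 4, 4), is_fast_corner(edge, 4, 4))
```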
$$M(i) = \frac{|\Omega_{i+1} - \Omega_{i-1}|}{|\Omega_i|} \tag{4.33}$$
Figure 4.14. Examples of dark extremal regions obtained by thresholding an (a) original image.
(b)-(d) illustrate dark extremal regions (connected components) obtained by highlighting pix-
els below the intensity thresholds 20, 125, and 200, respectively. Bright extremal regions are
obtained by complementing these binary images.
4.2. Feature Descriptors 127
Figure 4.15. (a) The area of the dark extremal region at the location of the dot in Figure 4.14a
as a function of increasing intensity threshold. (b) The measure in Equation (4.33) as a function
of intensity threshold for this extremal region. The function is minimized at intensity level 99,
roughly corresponding to the center of the nonzero plateau in (a). (c) The corresponding dark
MSER.
illustrate the area of Ωi and value of Equation (4.33) as a function of intensity threshold
i for the component centered at the dot in Figure 4.14a. Figure 4.15c illustrates
the maximally stable extremal region corresponding to the minimizer of M (i); we
can see that the region corresponds to a dark, irregularly shaped blob in the original
image that has high contrast with the lighter background.
All the extremal regions in an image can be quickly generated using an efficient
algorithm for computing watersheds [514], and the test for extracting the maximally
stable extremal regions is fast. The overall algorithm can extract MSERs at video frame
rates. Matas et al. [314] also showed that MSERs are affine-covariant, as well as
invariant to affine changes in the overall image intensity.
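The stability measure of Equation (4.33) can be illustrated for a single seed pixel with a brute-force sketch. It grows the dark extremal region containing the seed at every threshold by flood fill, which is far slower than the watershed-style algorithm mentioned above but easy to follow. It assumes 4-connectivity and 8-bit intensities, treats a region at threshold i as the connected pixels with intensity below i, and uses the fact that these regions are nested, so the area of the set difference is the difference of the areas. The function names are hypothetical.

```python
import numpy as np
from collections import deque

def component_area(img, seed, thresh):
    """Area of the connected region of pixels < thresh that contains `seed`."""
    h, w = img.shape
    sy, sx = seed
    if img[sy, sx] >= thresh:
        return 0
    seen = np.zeros((h, w), dtype=bool)
    seen[sy, sx] = True
    queue = deque([(sy, sx)])
    area = 0
    while queue:
        y, x = queue.popleft()
        area += 1
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not seen[ny, nx] and img[ny, nx] < thresh:
                seen[ny, nx] = True
                queue.append((ny, nx))
    return area

def stability(img, seed):
    """Return (area, M): region area per threshold and the measure of Equation (4.33)."""
    area = {i: component_area(img, seed, i) for i in range(0, 257)}
    M = {i: (area[i + 1] - area[i - 1]) / area[i]
         for i in range(1, 256) if area[i] > 0}
    return area, M

# Usage: a dark rectangle on a bright background is stable over a wide
# range of thresholds, so M(i) is zero across that range.
img = np.full((20, 20), 200)
img[5:10, 5:12] = 10
area, M = stability(img, seed=(7, 8))
print(area[50], M[50], min(M, key=M.get))
```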
Once a feature’s location (and perhaps some additional information such as its scale
or support region) has been determined, the next step is to describe the feature with
a vector of numbers called a descriptor. To enable high-quality feature matching —
especially among images taken from widely separated cameras — the descriptor must
be designed so that features arising from the same 3D location in different views
of the same scene result in very similar descriptor vectors. That is, we desire that
D(f ) ≈ D(fˆ ), where D is an algorithm to create a descriptor, and f and fˆ are detected
features in two different images, such that fˆ = Tf for some geometric and photometric
transformation T of the first image. Thus, while we want feature detection to be
covariant to geometric transformations, we want feature description to be invariant
to them. We must also specify a criterion for matching two descriptor vectors from
different images to form correspondences.
The easiest descriptor is simply a vector containing the intensity values from a
fixed-size block of pixels centered at the feature’s location. Two such vectors can
be compared simply by computing the sum of squared differences (SSD) between
corresponding elements. When the change between two images is small (with respect
Figure 4.16. Estimating the dominant orientation of a feature. (a) A detected feature point with
its scale. (b) Gradient angles of the scale-blurred image for points in the feature’s neighborhood.
Vector lengths are weighted by their gradient magnitude and the centered Gaussian function. (c)
The gradient orientation histogram peaks at about π/3 radians. (d) The estimated orientation of
the feature is indicated by the short line inside the circle, which agrees well with the perceptual
gradient. A descriptor support region larger than the characteristic scale can be built using this
orientation, indicated by the square.
[328]). This ensures that we use the same number of support region pixels when
constructing a descriptor, regardless of its size. The final step is usually to normalize
the patch to compensate for affine illumination changes. That is, we compute the
mean µ and standard deviation s of the pixels in the support region, and rescale the
intensities of the patch by
$$I_{\text{new}} = \frac{I_{\text{old}} - \mu}{s} \tag{4.34}$$
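A short sketch shows why the rescaling in Equation (4.34) compensates for affine illumination changes: replacing the patch I by aI + b with a > 0 leaves the normalized patch unchanged. The function name and the small `eps` guard against constant patches are assumptions.

```python
import numpy as np

def normalize_patch(patch, eps=1e-12):
    """Equation (4.34): subtract the mean and divide by the standard deviation."""
    patch = np.asarray(patch, dtype=float)
    mu, s = patch.mean(), patch.std()
    return (patch - mu) / max(s, eps)

# Usage: a gain of 3 and an offset of 7 do not change the normalized patch.
rng = np.random.default_rng(0)
patch = rng.random((8, 8))
print(np.allclose(normalize_patch(3.0 * patch + 7.0), normalize_patch(patch)))
```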
In certain cases, a training set of ground-truth matches may be available that can
be used to estimate the n × n covariance matrix Σ that statistically relates each pair
of descriptor values. In this case, we can define the Mahalanobis distance as
$$D_{\text{mahal}}(a, b) = \left( (a - b)^\top \Sigma^{-1} (a - b) \right)^{1/2} \tag{4.37}$$
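Equation (4.37) translates directly into code. The sketch below solves a linear system instead of forming the inverse covariance explicitly, which is a common numerical preference rather than something the text prescribes; the function name is hypothetical.

```python
import numpy as np

def mahalanobis(a, b, cov):
    """Equation (4.37) for descriptor vectors a, b and covariance matrix cov."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

# Usage: with the identity covariance this reduces to the Euclidean distance;
# a large variance along one axis shrinks differences along that axis.
print(mahalanobis([3.0, 4.0], [0.0, 0.0], np.eye(2)))
print(mahalanobis([2.0, 0.0], [0.0, 0.0], np.diag([4.0, 1.0])))
```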
Given a distance function D between two descriptors, matches are typically gen-
erated using the method of nearest neighbors. That is, the match to a descriptor a
from a set of candidates B is computed as
$$b^* = \arg\min_{b \in B} D(a, b) \tag{4.38}$$
Typically (a, b∗ ) is accepted as a match only if the distance D(a, b∗ ) is below a user-
specified threshold. After all, not every feature in one image is expected to appear
in another, and incorrect feature matches can severely affect the performance of
subsequent algorithms that depend on matches (e.g., matchmoving in Chapter 6).
Lowe [306] proposed a rule for more precise descriptor matching based on both
D(a, b∗ ) and D(a, b∗∗ ), where b∗∗ is the descriptor with the second closest distance
to a. The rule is to accept (a, b∗ ) as a match if D(a, b∗ )/D(a, b∗∗ ) is below a user-
specified threshold (e.g., 0.8). This criterion is sometimes called the nearest neighbor
distance ratio. The goal is to prevent situations where a has several very similar
matches in B, making it impossible to choose an unambiguously best match. For
example, the corner of a window on the wall of a building may make an excellent fea-
ture on its own, but the wall may have several nearly identical windows that prevent
the corner from being matched correctly in a second image.
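Nearest-neighbor matching with the distance ratio test can be sketched as follows, assuming Euclidean distance between descriptors stored as rows of two arrays. The function name, the optional absolute distance cap, and the brute-force search are choices made for this example; large descriptor sets would normally use an approximate nearest-neighbor index.

```python
import numpy as np

def match_ratio_test(A, B, ratio=0.8, max_dist=np.inf):
    """Return (i, j) pairs where B[j] is the accepted match for A[i].

    B must contain at least two descriptors so that a second-best match exists.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    matches = []
    for i, a in enumerate(A):
        d = np.linalg.norm(B - a, axis=1)       # distance to every candidate
        order = np.argsort(d)
        best, second = order[0], order[1]
        if d[best] <= max_dist and d[best] < ratio * d[second]:
            matches.append((i, int(best)))
    return matches

# Usage: the first descriptor has one clear match; the second has two equally
# good candidates (like repeated windows on a wall) and is rejected.
A = [[0.0, 0.0], [10.0, 10.0]]
B = [[0.1, 0.0], [5.0, 5.0], [10.0, 10.5], [10.5, 10.0]]
print(match_ratio_test(A, B))
```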
Another less-often used matching criterion is the normalized cross-correlation
given by
$$\text{NCC}(a, b) = \frac{1}{s_a s_b} \sum_{i=1}^{n} (a_i - \mu_a)(b_i - \mu_b) \tag{4.39}$$
where µa and sa are the mean and standard deviation of the elements of a. Two
vectors that match well should have a high NCC, near 1. The NCC is most often used
for matching raw blocks of intensities as opposed to derived descriptors. If the blocks
are already normalized for affine illumination changes as described in the previous
section, then the NCC is simply the dot product between the two descriptor vectors
(i.e., the cosine of the angle between the vectors). While the NCC is not a distance
function, the same concepts of best and second-best matches can apply to detecting
and filtering potential matches. The NCC can also be computed very efficiently using
the Fast Fourier Transform due to its similarity to convolution.
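A direct implementation of the normalized cross-correlation is sketched below. One assumption is made explicit: so that a perfect match scores exactly 1, the normalizers are taken as the Euclidean norms of the mean-subtracted vectors (that is, the standard deviations scaled by √n), which makes the result the cosine of the angle between the centered vectors. The function name is hypothetical, and the FFT-based evaluation over many offsets is not shown.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equal-length vectors or blocks."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    da, db = a - a.mean(), b - b.mean()
    # Assumed normalization: norms of the centered vectors, giving values in [-1, 1].
    return float(da @ db / (np.linalg.norm(da) * np.linalg.norm(db)))

# Usage: NCC ignores gain and offset changes but is sensitive to contrast reversal.
rng = np.random.default_rng(0)
block = rng.random((7, 7))
print(ncc(block, 2.0 * block + 3.0), ncc(block, -block))
```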
Finally, we note that a match to a given feature in one image cannot occur at
an arbitrary location in a second image; the match location is constrained by the
epipolar geometry. That is, the match location in the second image must lie along
the epipolar line corresponding to the feature in the first image. We can iteratively
estimate the epipolar geometry and reject feature matches that pass the above tests
but are inconsistent with the epipolar lines (e.g., [576]). We discuss epipolar geometry
in detail in Chapter 5.
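As a rough sketch of the consistency test, suppose the epipolar geometry is summarized by a 3×3 fundamental matrix F (introduced in Chapter 5), with the convention that a point x1 in the first image maps to the line F x1 in the second. The function name, the pixel threshold a user would apply to the result, and the example matrix below are assumptions for illustration.

```python
import numpy as np

def epipolar_distance(F, x1, x2):
    """Distance in pixels from x2 (image 2) to the epipolar line of x1 (image 1)."""
    p1 = np.array([x1[0], x1[1], 1.0])
    p2 = np.array([x2[0], x2[1], 1.0])
    line = F @ p1                      # coefficients (a, b, c) of a*x + b*y + c = 0
    return abs(p2 @ line) / np.hypot(line[0], line[1])

# Assumed example: a rectified image pair, for which each point corresponds to
# the horizontal line with the same y coordinate in the other image.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
d_consistent = epipolar_distance(F, (100.0, 50.0), (80.0, 50.0))
d_inconsistent = epipolar_distance(F, (100.0, 50.0), (80.0, 62.0))
```

A candidate match would be rejected when this distance exceeds a small tolerance; here the second candidate lies twelve pixels off its epipolar line.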
Figure 4.17. Constructing the SIFT descriptor. (a) An original detected feature, with charac-
teristic scale and dominant orientation. The descriptor is computed from the pixels inside the
indicated square, where the small arrow indicates the top edge. (b) The rotated and resampled
square of pixels, with the 4×4 grid overlaid. The eight lines inside each square indicate the size
of the histogram bin for each corresponding orientation. The SIFT descriptor is the concatenation
of these 8×4×4 = 128 gradient magnitudes into a vector, which is then normalized.
Figure 4.18. Correspondences obtained using the nearest neighbor distance ratio to match SIFT
descriptors computed for images taken of the same scene under different imaging conditions.
Most of the matches are correct.
length. The overall normalization makes the descriptor invariant to affine illumi-
nation changes, and the zeroing-out step prevents large gradients (which may be
erroneous or viewpoint dependent) from dominating the descriptor.
Figure 4.18 illustrates the result of using the nearest neighbor distance ratio to
match the SIFT descriptors between two images taken from different perspectives
with different resolutions and illumination conditions. We can see that many of the
correspondences are correct in terms of the matches’ location, scale, and orientation;
the incorrect correspondences can be removed with an outlier rejection rule based
on the images’ epipolar geometry, as discussed in Section 5.4.
4.2.3.1 GLOH
Mikolajczyk and Schmid [328] proposed a SIFT variant called GLOH (for Gradient
Location and Orientation Histogram) that they claimed outperformed Lowe’s original
formulation. Instead of using a square grid as the domain for orientation histograms,
a log-polar grid is created, as illustrated in Figure 4.19. The center region (of radius 6) is
undivided, while the two outer rings (of radii 11 and 15) are each divided into eight regions.
The gradients in each grid region are quantized into sixteen angles (as opposed to
eight for SIFT). Thus, the raw descriptor is 17×16 = 272-dimensional. Based on a
training set of image patches, principal component analysis (PCA) is applied to reduce
the dimensionality of the descriptor to 128.
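The bookkeeping behind the 272-dimensional raw descriptor can be sketched as follows. The radii and bin counts come from the description above; the function name `gloh_bin`, the sector numbering, and the hard (non-interpolated) bin assignment are assumptions made to keep the example short.

```python
import math

def gloh_bin(dx, dy, angle, radii=(6.0, 11.0, 15.0), n_sectors=8, n_angles=16):
    """Return (region, orientation_bin) for a gradient sample, or None if outside the grid.

    (dx, dy) is the sample offset from the feature center and angle is the
    gradient direction in radians. Region 0 is the undivided center; regions
    1-8 and 9-16 are the sectors of the inner and outer rings.
    """
    r = math.hypot(dx, dy)
    if r > radii[2]:
        return None
    two_pi = 2.0 * math.pi
    sector = int((math.atan2(dy, dx) % two_pi) / two_pi * n_sectors) % n_sectors
    if r <= radii[0]:
        region = 0
    elif r <= radii[1]:
        region = 1 + sector
    else:
        region = 1 + n_sectors + sector
    orientation_bin = int((angle % two_pi) / two_pi * n_angles) % n_angles
    return region, orientation_bin

n_regions = 1 + 2 * 8               # center plus two rings of eight sectors
raw_dimension = n_regions * 16      # 17 regions times 16 orientations
```

Accumulating gradient magnitudes into these (region, orientation) cells gives the raw vector to which PCA is then applied.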
4.2.3.2 DAISY
Winder and Brown [548] and Tola et al. [490] suggested a generalized histogram-based
descriptor called DAISY.11 Instead of specifying “hard” support regions in which to
compute histograms, we specify the centers of “soft” support regions and a Gaussian
function at each support region, as illustrated in Figure 4.20. Like GLOH, the support
regions are arranged radially around a center point. Like the dominant orientation
estimation algorithm in SIFT, the Gaussian function at each center point specifies a
weighting function for the gradients in the neighborhood, so that points further from
the center contribute less to the orientation histogram.
11 The name DAISY comes from the petal-like appearance of the descriptor support regions.
intensities have been normalized as described in Section 4.2.1. The dimension of the
descriptor is the number of intensity bins times the number of rings. Since there are
no angular subdivisions of the rings, the descriptor is rotation-invariant.
These complex filters are similar to the derivatives of a Gaussian. For example, for
m + n ≤ 6, we obtain a descriptor of length 15. Again, this approach avoids the need
to explicitly estimate the dominant gradient orientation.
4.2.5.2 SURF
Bay et al. [33] proposed a simplified descriptor inspired by SIFT called SURF (for
Speeded-Up Robust Features). As in SIFT, the oriented square at a feature’s detected
scale is split into a 4×4 square grid. However, instead of computing gradient orien-
tation histograms in each subregion, Haar wavelet responses at twenty-five points
in each subregion are computed. The sums of the original and absolute responses
in the x and y directions are computed in each subregion, yielding a 4×4×4 =
64-dimensional descriptor. Since Haar wavelets are basically box filters, the SURF
descriptor can be computed very quickly.
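The structure of the 64-dimensional vector can be illustrated with a short sketch. It assumes the Haar wavelet responses have already been sampled on the oriented square as two 20×20 arrays (a 4×4 grid of subregions with 5×5 = 25 points each), and it omits any weighting or normalization that a full implementation may apply; the function name is hypothetical.

```python
import numpy as np

def surf_like_descriptor(dx, dy):
    """Build a 64-vector from 20x20 arrays of x and y Haar wavelet responses."""
    assert dx.shape == (20, 20) and dy.shape == (20, 20)
    desc = []
    for i in range(4):
        for j in range(4):
            # 5x5 block of responses belonging to subregion (i, j).
            bx = dx[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            by = dy[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            # Sums of the original and absolute responses in x and y.
            desc.extend([bx.sum(), by.sum(), np.abs(bx).sum(), np.abs(by).sum()])
    return np.array(desc)

rng = np.random.default_rng(0)
d = surf_like_descriptor(rng.standard_normal((20, 20)), rng.standard_normal((20, 20)))
```

Each subregion contributes four numbers, so the sixteen subregions give 4×4×4 = 64 entries.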
4.2.5.3 PCA-SIFT
The SIFT descriptor is extremely popular, and is most frequently used as Lowe origi-
nally described it, that is, a 128-dimensional vector. However, it’s natural to question
whether all 128 dimensions of the descriptor are necessary and should receive equal
weight in matching. Ke and Sukthankar [235] partially addressed this question using
a dimensionality reduction step based on principal component analysis (PCA). The
technique is generally known as PCA-SIFT, but this is a misnomer; the principal
component analysis is not performed directly on SIFT descriptor vectors, but on
the raw gradients of a scale- and rotation-normalized patch. More precisely, they
collected a large number of DoG keypoints and constructed 41 × 41 patches at the
estimated scale and orientation of each keypoint. The x and y gradients at the interior
pixels of each patch were collected into a 39 × 39 × 2 = 3042-dimensional vector,
and PCA was applied to determine a much smaller number of basis vectors (e.g.,
twenty or thirty-six). Thus, the high-dimensional vector of gradients for a candidate
feature is represented by a low-dimensional descriptor given by its projection onto
the learned basis vectors. Nearest-neighbor matching was then carried out on these
lower-dimensional descriptor vectors.
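The projection step can be sketched with a standard PCA computed by singular value decomposition. This is not Ke and Sukthankar's code: the function names are hypothetical, and random numbers stand in for the gradient vectors that would be collected from real training patches.

```python
import numpy as np

def learn_pca_basis(X, k):
    """Return (mean, basis), where basis has k orthonormal columns.

    X is (num_patches, dim): one flattened gradient vector per training patch.
    """
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T

def project(x, mean, basis):
    """Low-dimensional descriptor: coordinates of x in the learned basis."""
    return (x - mean) @ basis

dim = 39 * 39 * 2                            # 3042, as in the text
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, dim))    # placeholder for real training gradients
mean, basis = learn_pca_basis(X_train, k=36)
descriptor = project(rng.standard_normal(dim), mean, basis)
```

Nearest-neighbor matching then operates on the 36-element vectors instead of the 3042-element gradient vectors.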
4.3 Evaluating Detectors and Descriptors
Given the wide range of available detectors and descriptors discussed so far, choosing
an appropriate combination for a given problem seems a daunting task. Luckily, sev-
eral research groups have undertaken thorough analyses of detectors and descriptors
both separately and together. We briefly summarize several findings here.
The main figure of merit for evaluating a feature detector is its repeatability. That
is, a feature detected in one image should also be detected in the correct location
in an image of the same scene under different imaging conditions. These different
conditions could include changes in camera settings (such as focal length, blur, or
image compression), camera viewpoint (inducing affine or more complex changes in
pixel patterns), or the environment (such as global or local illumination).
Studies of detector repeatability [431, 329] first focused on images of a planar sur-
face, like a graffiti-covered wall, taken from different perspectives. Since all points in
such an image pair are related by the same projective transformation (see Section 5.1),
we can immediately determine the correct location in the second image of any feature
point detected in the first image, and check to see if any features were detected near
this location. Figure 4.21a illustrates the idea. The repeatability score of a detector
for a given image pair is then
\[
RS = \frac{R}{\min(N_1, N_2)} \tag{4.41}
\]
where N1 and N2 are the numbers of features detected in each image respectively,
and R is the number of repeated detections. An ideal detector has a repeatability of 1.
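For a planar scene, the repeatability score can be computed directly once the projective transformation is known. The sketch below is a simplified illustration: the neighborhood radius `eps`, the one-directional counting, and the function name are assumptions, and it does not restrict the count to the region visible in both images.

```python
import numpy as np

def repeatability(pts1, pts2, H, eps=1.5):
    """Repeatability score RS = R / min(N1, N2) for two point sets related by H.

    pts1 is (N1, 2) and pts2 is (N2, 2); H is the 3x3 projective transformation
    taking image 1 coordinates to image 2 coordinates.
    """
    ones = np.ones((len(pts1), 1))
    mapped = np.hstack([pts1, ones]) @ H.T
    mapped = mapped[:, :2] / mapped[:, 2:3]           # back to pixel coordinates
    # Distance from every mapped point to every detection in image 2.
    d = np.linalg.norm(mapped[:, None, :] - pts2[None, :, :], axis=2)
    repeated = int(np.sum(d.min(axis=1) <= eps))
    return repeated / min(len(pts1), len(pts2))

H = np.array([[1.0, 0.0, 10.0], [0.0, 1.0, -5.0], [0.0, 0.0, 1.0]])   # pure translation
pts1 = np.array([[0.0, 0.0], [20.0, 20.0], [40.0, 10.0]])
pts2 = np.array([[10.0, -5.0], [30.5, 15.0], [90.0, 90.0]])
rs = repeatability(pts1, pts2, H)
```

In this toy case two of the three features are found again near their predicted locations, so the score is 2/3.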
If the feature detector also produces an estimated support region (e.g., an affine-
covariant ellipse), these regions should also transform appropriately under different
imaging conditions, and should be taken into account when determining if a detection
is repeated. Figure 4.21b illustrates this more stringent test. A detection is
considered repeated if the area of intersection of the two regions is sufficiently large
compared to the area of their union (e.g., above sixty percent).
Figure 4.21. Evaluating feature detector repeatability. (a) A feature is detected at location p in
the left image. If two images are related by a projective transformation, we can easily determine
the location p′ corresponding to the same position in the right image. If a feature q is detected
within a small neighborhood of p′ in the second image, we say the detection is repeated. (b) If
a feature detector produces a support region, we can compare the regions for a more stringent
test of repeatability. A detection is repeated if the ratio of the intersection of the two regions
to their union is sufficiently large.
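The overlap test is easy to prototype by rasterizing the two regions. The sketch below assumes both ellipses are already expressed in the coordinates of the second image, and it approximates areas by counting pixels; the helper names and the example ellipses are illustrative.

```python
import numpy as np

def ellipse_mask(shape, center, axes, angle):
    """Boolean mask of an ellipse with semi-axes `axes`, rotated by `angle` radians."""
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    x, y = xx - center[0], yy - center[1]
    c, s = np.cos(angle), np.sin(angle)
    u, v = c * x + s * y, -s * x + c * y       # coordinates in the ellipse frame
    return (u / axes[0]) ** 2 + (v / axes[1]) ** 2 <= 1.0

def overlap_ratio(mask_a, mask_b):
    """Area of intersection divided by area of union."""
    union = np.logical_or(mask_a, mask_b).sum()
    return np.logical_and(mask_a, mask_b).sum() / union if union else 0.0

a = ellipse_mask((200, 200), center=(100, 100), axes=(40, 20), angle=0.0)
b = ellipse_mask((200, 200), center=(104, 100), axes=(40, 20), angle=0.1)
is_repeated = overlap_ratio(a, b) > 0.6        # sixty percent threshold from the text
```

A slightly shifted and rotated copy of the region, as in this example, still passes the test, while a region detected elsewhere in the image would not.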
Mikolajczyk et al. [329] surveyed the affine-covariant feature detectors discussed
in Section 4.1, and tested them with respect to viewpoint and scale change, blurring,
JPEG compression, and illumination changes on a varied set of images. Their gen-
eral conclusions were that the Hessian-Affine and MSER detectors had the highest
repeatability under the various conditions, followed by the Harris-Affine detector.
In general, Hessian-Affine and Harris-Affine produced a larger number of detected
pairs than the other algorithms. They then used the SIFT descriptor as the basis for
matching features from each detector, computing a matching score as
\[
MS = \frac{M}{\min(N_1, N_2)} \tag{4.42}
\]
where the correct matches and true correspondences are determined from the
repeatability score and region overlap measure defined previously. A good descriptor
should have high precision — that is, few false matches — and high recall — that is,
few matches that are present in the detector results but poorly represented by the
descriptor. Their general conclusions, independent of the detector used, were that
the GLOH and SIFT descriptors had the best performance. Shape contexts and PCA-
SIFT also performed well. This study also confirmed the usefulness of the nearest
neighbor distance ratio for matching SIFT descriptors.
Moreels and Perona [336] undertook a similar controlled evaluation of detec-
tor/descriptor combinations, for the specific problem of matching features in
close-up images of 3D objects with respect to viewpoint and lighting changes. They
found that Hessian-Affine and DoG detectors with SIFT descriptors had consis-
tently high performance for viewpoint changes. MSER and shape contexts, which
performed well on planar scenes in [328], were found to have only average perfor-
mance for matching 3D objects. The Harris-Affine detector with the SIFT descriptor
was found to perform the best for lighting and scale changes. In general, far fewer
matches were found when comparing images of 3D objects versus images of planar
scenes.
It’s somewhat surprising that almost all research on detectors and descriptors
assumes a grayscale image as input. Since distinctive color regions provide obvious
cues for feature detection that are often lost or subdued by conversion to grayscale, it’s
worthwhile to investigate detectors and descriptors that preserve color information
throughout the process. Here we mention a few such algorithms.
In terms of color detectors, Kenney et al. [237] extended the Harris matrix from
Section 4.1.1 using several basic axioms for properties of a good corner detector.
They showed how to generalize the Harris matrix in Equation (4.3) and Shi-Tomasi
criterion in Equation (4.13) to images with multidimensional range (such as color
images) and/or domain (such as 3D medical images). Unnikrishnan and Hebert [505]
introduced a generalization of scale space for feature detection in color images. They
proposed a family of functions involving first and second derivatives of the three color
channels at a given scale, so that the output of the function at each scale is roughly
invariant to linear illumination changes. Similar to Harris-Laplace, feature points
are detected at local extrema in scale space. Forssén [148] described an extension of
MSERs to color. The one-dimensional watershed algorithm to compute the extremal
regions is replaced with an agglomerative clustering scheme to produce the nested
set of connected components.
In terms of color descriptors, early work focused on invariant-based techniques.
For example, grayscale moment invariants were generalized to color images by
Mindru et al. [330]. That is, we compute the generalized moments
\[
M_{mn}^{rgb} = \iint x^m y^n \, (I_R(x, y))^r \, (I_G(x, y))^g \, (I_B(x, y))^b \, dx \, dy \tag{4.44}
\]
where the integers m, n ≥ 0 specify the order of the moment, and the integers r, g , b ≥ 0
specify the degree of the moment for each color channel. Mindru et al. showed various
ratios of sums and products of the color moments to be invariant to both geometric
(e.g., affine) and photometric transformations of an image, which can be concate-
nated to form a descriptor. Montesinos et al. [333] extended first-order grayscale
differential invariants to form a color descriptor.
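A discrete version of the generalized color moment is straightforward to compute. In this sketch the integral is replaced by a sum over pixels and the coordinate origin is placed at the top-left pixel of the patch; both choices, and the function name, are assumptions for illustration. The invariant ratios of Mindru et al. are not reproduced here.

```python
import numpy as np

def color_moment(img, m, n, r, g, b):
    """Discrete generalized color moment of an (H, W, 3) RGB patch.

    m, n give the order of the moment; r, g, b give the degree for each channel.
    """
    img = np.asarray(img, dtype=float)
    h, w, _ = img.shape
    y, x = np.mgrid[:h, :w]
    R, G, B = img[..., 0], img[..., 1], img[..., 2]
    return float(np.sum(x ** m * y ** n * R ** r * G ** g * B ** b))

patch = np.ones((4, 5, 3))
patch[..., 0] = 2.0                             # constant red channel of 2
area = color_moment(patch, 0, 0, 0, 0, 0)       # number of pixels
red_sum = color_moment(patch, 0, 0, 1, 0, 0)    # total red intensity
x_red = color_moment(patch, 1, 0, 1, 0, 0)      # red-weighted sum of x coordinates
```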
The popularity of SIFT led to various attempts to “colorize” the SIFT descriptor.
An obvious approach is simply to concatenate the SIFT descriptors for a feature point
location computed in three color channels (e.g., RGB or HSV) to create a 3×128 =
384-dimensional descriptor, possibly reducing the dimensionality with PCA. van de
Weijer and Schmid [509] augmented the standard 128-dimensional SIFT descrip-
tor computed on luminance values with an additional 222 dimensions representing
color measurements, including histograms of weighted hue values and photometric
invariants. More straightforwardly, Abdel-Hakim and Farag [1] proposed to apply the
usual SIFT detector/descriptor to a color invariant image obtained as the ratio of two
linear functions of RGB. Both Burghouts and Geusebroek [77] and van de Sande et
al. [508] recently presented surveys and evaluations of color descriptors, generally
concluding that augmenting the SIFT descriptor with color information improved
repeatability results over using luminance alone.
Even if a grayscale feature detector/descriptor scheme is dictated by the appli-
cation, it may still be possible to improve performance by modifying the input to
the detector. For example, Gooch et al. [174] proposed an algorithm for processing
a color image into a one-channel image that differs from the traditional luminance
image. Instead, adjacent pixel differences in the new image are optimized to match a
function of color differences in the original image as well as possible. It may also be
possible to apply an algorithm like that of Collins et al. [101] to adaptively choose the
most discriminating one-dimensional space over the course of object tracking. For
example, the green channel may be most discriminating for tracking a certain feature
on an actor in a natural environment, but a combination of red and blue channels
may be better as the actor crosses into a green-screen background.
Figure 4.22. Various artificial markers. (a) A QR code. (b) ARToolKit markers. English or
Japanese characters are frequently used for the interior pattern. (c) ARTag fiducial markers.
The interior pattern is a 6 × 6 pattern designed for robust detection and identification.
be detectable and decodable, which is not useful in visual effects situations where
we want to detect a large number of markers. They are also not designed for accurate
localization.
ARToolKit markers [231], designed for augmented reality applications, are square
patches with a thick black border and an arbitrary binary interior (Figure 4.22b). A
marker is detected in an image by thresholding, finding the lines defining the black
border, rectifying the pattern to be a square, and identifying the interior pattern
using cross-correlation with a library of known templates. As with QR codes, robust
detection of these patterns is only possible when they take up a relatively large fraction
of a camera’s field of view, and since the interior patterns are arbitrary, the approach
is not optimized for detectability.
Fiala [140] introduced ARTag markers, designed to be robustly detectable at various
scales in an image in the presence of perspective distortion, lighting changes,
and mild occlusion, making them a good choice for visual effects applications. Like
ARToolKit markers, ARTags are delimited by a black border and contain a binary
interior pattern; however, the interior patterns are grids of 6×6 binary squares
designed to minimize inter-marker confusion and provide resistance to decoding
errors (Figure 4.22c). Markers are found by connecting detected edges into quadrilat-
erals, extracting the thirty-six bits in the interior of each pattern, and applying forward
error detection and correction to extract a valid ten-bit marker ID. The markers are
not rotationally symmetric, so four orientations of the marker must be tested; on the
other hand, detection produces a useful orientation estimate. A similar approach was
taken by Wagner and Schmalstieg [524] to create ARToolKitPlus markers optimized
for detection by mobile devices.
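The need to test four orientations can be illustrated with a toy decoder. This sketch is not the ARTag algorithm: the checksum and error-correction stage is replaced by a caller-supplied predicate `is_valid`, and the validity rule used in the example is hypothetical.

```python
import numpy as np

def candidate_bitstrings(bits):
    """Return the four 36-bit readings of a 6x6 grid, one per 90-degree rotation."""
    grid = np.asarray(bits, dtype=np.uint8).reshape(6, 6)
    return [np.rot90(grid, k).flatten() for k in range(4)]

def find_orientation(bits, is_valid):
    """Return (k, reading) for the first rotation accepted by `is_valid`, else None.

    `is_valid` stands in for the checksum and error-correction stage, which is
    not reproduced here.
    """
    for k, reading in enumerate(candidate_bitstrings(bits)):
        if is_valid(reading):
            return k, reading
    return None

# Hypothetical validity rule, for illustration only: the first row must be all ones.
reference = np.zeros((6, 6), dtype=np.uint8)
reference[0, :] = 1
observed = np.rot90(reference, -1)     # the same marker seen rotated by 90 degrees
result = find_orientation(observed, lambda r: bool(r[:6].all()))
```

The index of the accepted rotation is the by-product mentioned above: it tells us how the marker is oriented in the image.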
4.6 Industry Perspectives
Doug Roble, creative director of software, and Som Shankar, integration supervisor
from Digital Domain in Venice, California, discuss the role of features on a movie set.
RJR: When you need to track natural features in plates, what detection and description
strategies do you use?
Roble: When we do automatic feature detection and tracking we tend to use things
like Harris corners. These are important first steps for matchmoving algorithms.
However, we’re often trying to specifically follow a point an artist has selected in an
image. For example, the artist may know that a CGI character is going to be dancing
on top of a cabinet in the scene. The rest of the points in the scene don’t matter; it’s
that cabinet that counts. The artist will specifically choose points that are close to
where the effect is going to happen, which may or may not show up in an automatic
corner detector. They might be near a gradual slope or a pattern.
Figure 4.23. (a) In blue- or green-screen environments, artificial features are often introduced
to aid camera tracking, such as the red tape marks in this frame from Thor. (b) In natural envi-
ronments, automatically extracted image features such as Harris corners play the same role, as
in this sequence from Transformers: Revenge of the Fallen. Scale and rotation invariance may be
necessary to compensate for camera rotation and zoom. Thor images appear courtesy of Marvel
Studios, TM & ©2011 Marvel and Subs. www.marvel.com. Transformers: Revenge of the Fallen
©2009 DW Studios L.L.C. and Paramount Pictures Corporation. All Rights Reserved.
The artist can also draw a pattern or outline around that point and say, “track
this stuff.” That outline doesn’t need to be a square or rectangle — for example, a
rectangle might contain background pixels that you don’t want to deal with. It could
be an arbitrary shape, which avoids the need to do some sort of automatic outlier
rejection on pixels you don’t care about.
To follow the pattern through a sequence, we use a template-based method based
on cross-correlation in three-channel color space. Of course, there are lots of prob-
lems with that. The object may rotate or slide around. The pattern tracker that we use
solves for not just translation and rotation, but also skew, perspective, and lighting
changes using an affine or nonlinear transformation as necessary. I definitely started
with Shi and Tomasi’s Good Features to Track, because that’s the classic approach,
but these days my approach is more related to a paper by J. P. Lewis that does a good
job of addressing lighting changes [276]. Of course, motion blur and occlusions are
always a real pain to deal with.
RJR: SIFT is incredibly popular in the computer vision community; what about in
visual effects?
Roble: Right now, we don’t really use SIFT all that much here. It isn’t necessary for
the kind of pattern tracking through frames of a shot that we deal with all the time.
Also, the SIFT descriptor contains a much different kind of information than artists
are used to.
One idea I wanted to investigate using SIFT was as a pre-process before doing
corner matching. The problem is occlusion: the actors on the sets walk in front of the
features all the time. With a corner-matching algorithm, or something that’s purely
template-based, once a feature gets occluded, by the time it becomes visible again,
the camera’s often moved significantly, and it comes in as a whole new track. SIFT
seems like it could potentially help hook those tracks back together across the gap.
Shankar: Every set is different; it’s totally free-form. This is a good example of the
pragmatic side of filmmaking. It would be awesome to be able to place regularly
spaced markers with coded patterns on the set and write software to automatically
recognize their unique IDs, but the reality is you have to get in and out fast without
interfering with the crew or the actors.
When we’re on a stage, the camera crew alone can take up half the afternoon and
you’ve got to throw up what markers you can in a few minutes. You’ll see little squares
of gaffer’s tape on corners here and there because sometimes that’s all you can get.
We often use markers that consist of big orange triangles, which contrast well with
blue screens. An advantage of these is that by looking at their edges they help us a
bit with estimating motion blur from moving footage — but tracking through motion
blur is always a hard problem.
If it’s a locked-off, stationary camera we just need a couple of good markers that
stay out of people’s hair. If we have a moving camera I have to think about where to
place these points so that we get good parallax with good visibility, where the actors
aren’t going to walk.
There’s a tradeoff between our data integration and roto/paint departments since
every marker we put up on a set, they eventually have to paint out later in production.
It can be very difficult — for example if you have an actor’s hair in front of a marker on
a blue screen, it’s murder to get rid of it. However, especially with a moving camera
in front of a blue screen, we need to make sure we have enough markers to give us an
accurate 3D camera track — that’s the priority.
We also have what we call our “tepee”s. These are simple assemblies made of
tubes and balls that look like an upside-down three-sided pyramid on top of another
three-sided pyramid. We carry the pieces around in a bag and can quickly assemble
and disassemble them on a set to quickly get objects in the scene that have known
dimensions and geometry. For smaller-scale scenes we have the “cubee,” which is
just a rigid metal frame with balls on the vertices.
Roble: A lot of times, people not in the filmmaking business will say, “Why don’t you
try this cool new technique?” The environment on set is extraordinarily stressful. The
producers are there watching how much time everything is taking; they basically have
this little stopwatch that counts in units of money. The guys in the data integration
group are really good about running into the scene, throwing up just enough markers
to get what they need, and then running out before the director notices. A lot of
careful setup is a rarity. Most of it is get what you can and go.
Roble: For AI: Artificial Intelligence, Steven Spielberg wanted to be able to just walk
around the set, film it with a handheld camera, and see the result in real time on
a kind of virtual set. They mounted a couple of little cameras on top of the camera
pointed upward, and put a whole bunch of checkerboard patterns on the ceiling.
The upward-facing cameras could then find the checkerboards no matter where he
was on the set, and since they knew exactly where the checkerboards were in 3D
they could track the camera position pretty accurately in real time. Industrial Light
and Magic uses similar black-and-white patterns for their iMocap system, which lets
them track the heads and bodies of performers on set who will later be replaced by
CGI characters.
intensity extrema and trace many rays outward until a photometric measure reaches
an extremum along each ray. An ellipse is fit to the resulting points, producing an
affine-invariant region. Mikolajczyk et al. [329] found the performance of the two
detectors to be reasonable, but noted that their computational cost was quite high
compared to the Harris/Hessian-Laplace and MSER detectors. Kadir and Brady [229]
also proposed an affine-invariant detector based on the idea that good features should
be detected at patches whose intensity distributions have high entropy. However, the
algorithm is extremely slow, and Mikolajczyk et al. [329] found its performance not
to compare with the algorithms discussed here.
FAST corners were predated by an early approach to fast low-level corner detec-
tion called SUSAN proposed by Smith and Brady [461]. A disc is centered around a
candidate point and the area of pixels in the disc with intensities similar to the center
pixel is computed. A corner is detected if the area is a local minimum and below
half the disc area. Another approach by Trajković and Hedley [497] uses the same
concept of a circle of pixels around a candidate point, assuming that for some pair
of diametrically opposite points, the intensities must substantially differ from the
center point.
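The SUSAN idea can be sketched for a single candidate pixel as follows. The disc radius, the brightness threshold `t`, and the function name `usan_area` are assumptions for illustration, and the search for local minima of the area over the image is omitted.

```python
import numpy as np

def usan_area(img, cx, cy, radius=3, t=0.1):
    """Fraction of disc pixels whose intensity is within t of the center pixel."""
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disc = xx ** 2 + yy ** 2 <= radius ** 2
    patch = img[cy - radius:cy + radius + 1, cx - radius:cx + radius + 1]
    similar = np.abs(patch - img[cy, cx]) <= t
    return np.logical_and(similar, disc).sum() / disc.sum()

# Synthetic image: a bright square whose top-left corner is at (x, y) = (10, 10).
img = np.zeros((30, 30))
img[10:, 10:] = 1.0
at_corner = usan_area(img, 10, 10)    # smallest area: below half the disc
on_edge = usan_area(img, 20, 10)      # on a straight edge: more than half the disc
interior = usan_area(img, 20, 20)     # uniform region: the whole disc
```

Only the corner pixel yields an area below half of the disc, which is the condition described above.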
Tell and Carlsson proposed an affine-invariant descriptor based on the line of
intensities connecting pairs of detected features [487]. However, choosing appropri-
ate feature pairs and obtaining a sufficient number of matches for a given application
can be problematic. Forssén and Lowe [149] proposed a descriptor for MSERs based
on the region’s shape alone, since MSER shapes can be quite distinctive. The descrip-
tor is based on applying the SIFT descriptor to the binary patch corresponding to the
MSER.
Lepetit and Fua [270] proposed a feature recognition algorithm based on random-
ized trees that assumes that several registered training images of the object to be
detected are available. The idea is to build a library of the expected appearances of
each feature from many different synthetic viewpoints, and then to build a classifier
that determines the feature (if any) to which a new pixel patch corresponds. While the
training phase requires some computational effort, the recognition algorithm is fast,
since it only requires the traversal of a precomputed tree. Thus, feature descriptors
are not explicitly formed and compared. The approach was later extended to non-
hierarchical structures called ferns [358]. Stavens and Thrun [466] similarly noted
that if a feature matching problem is known to arise from a certain domain (e.g.,
tracking shots of buildings), a machine learning algorithm could be used to tune the
parameters of a detector/descriptor algorithm to obtain the best performance on
domain-specific training data. These kinds of learning algorithms are worth investi-
gation, with the understanding that performance may suffer if input from a different
domain is used.
The popularity of SIFT and its validation as a high-performance descriptor has
led to a variety of extensions (for example, the color versions in Section 4.4). One
area of particular interest is the acceleration of the algorithm, since in its original
form descriptor computation and matching was fairly slow, especially compared
to template-based cross-correlation. One approach is to leverage the processing
power of GPUs (e.g., [455]). Alternatively, other groups have stripped out features
of SIFT to make the basic idea viable on a resource-constrained platform like a
mobile phone (e.g., [523]). In general, any approach that requires the matching
4.8 Homework Problems
4.1 Show that the Harris matrix for any positive set of weights must be positive
semidefinite. That is, show that $b^\top H b \ge 0$ for any $b \in \mathbb{R}^2$.
4.2 Consider the N × N patch in Figure 4.24a, where the slanted line passes
through the center of the patch at an angle of θ° from the positive x axis,
and the intensity is 1 above the line and 0 below the line. Estimate the
eigenvectors and eigenvalues of the Harris matrix for the pixel in the center
of the patch (assuming w(x, y) is an ideal box filter encompassing the whole
patch).
Figure 4.24. (a) A binary N × N patch. (b) A binary image, with two potential feature locations.
4.3 Consider the image in Figure 4.24b. Will the Harris measure using an ideal
box filter give a higher response at point A (centered on a corner) or at point
B (further inside a corner)? Think carefully about the gradients in the dotted
regions.
4.4 Write the Harris measure C in Equation (4.4) as a function of the eigenvalues
of H .
4.5 Show that if one eigenvalue of the Harris matrix is 0 and the other is very
large, the Harris measure C is negative.
4.6 Explain why the Gaussian derivative filters in Equation (4.6) act as gradient
operators on the original image.
4.7 Show that minimizing Equation (4.11) leads to Equation (4.12).
4.8 Determine the generalization of Equation (4.12) that corresponds to the
affine deformation model of Equation (4.14).
4.9 Sketch a simple example of an image location that would fail the Harris
corner test at a small scale but pass it at a larger scale.
4.10 Use Equation (4.21) and a real image of your choice to duplicate the result in
Figure 4.4. That is, create a zoomed version of the original image, and deter-
mine the characteristic scale of the same blob-like point in both images.
You should verify that the ratio of characteristic scales is approximately the
same as the zoom factor.
4.11 Verify with a simple real image that the Laplacian-of-Gaussian detector
responds strongly to edges, while the Hessian-Laplace detector does not.
4.12 Speculate about the form of a simple detector based on box filters (similar to
the Fast Hessian detector in Figure 4.7) that approximates the normalized
Laplacian in Equation (4.21).
4.13 Prove that the Gaussian function in Equation (4.7) satisfies the diffusion
equation in Equation (4.26).
4.14 Show why the refined keypoint location in Equation (4.31) follows from the
quadratic fit in Equation (4.30).
4.15 If C is the matrix square root of a positive semidefinite matrix H , show how
the eigenvalues/eigenvectors of C and H are related.
4.16 The FAST Corner detector, which requires three of the four pixels labeled 1,
5, 9 and 13 in Figure 4.13 to be brighter or darker than the center pixel, can
miss some good corner candidates. For example, show that a strong corner
can exist if only two of the four pixels are significantly brighter or darker
than the center pixel.
4.17 Explain the shape of the curve in Figure 4.15a — notably the low plateau,
sharp increase, and subsequent slow increase.
4.18 Describe how the computation of MSERs is related to the watershed
algorithm from image processing.
4.19 Explain why MSER detection is invariant to an affine change in intensity.
4.20 Show how the normalized cross-correlation between two real vectors can
be computed using the Discrete Fourier Transform. (Fast Fourier Transform
algorithms make this approach computationally efficient when searching
for a template patch across a large image region.)
4.21 In Lowe’s definition of the SIFT descriptor, the gradient at each sample
location contributes to the surrounding spatial and orientation bins using
trilinear interpolation. This means the descriptor will change smoothly as
its center and orientation are varied. For example, the point with gradient
indicated in Figure 4.25a will contribute to the eight labeled histogram bins
in Figure 4.25b, with most of the weight in Bin 3. If the point lies two-
thirds of the way along the line segment connecting the center of the lower
left bin and the upper right bin, and the angle of the gradient from the
positive vertical axis is π/16, compute the weights w1 , . . . w8 that represent
the contribution of the point to each orientation bin.
Figure 4.25. The point with the indicated gradient contributes to eight bins in the SIFT
descriptor.
4.22 Show that the descriptors given by the total intensity $\sum_{(x,y)} L(x,y)$, sum
of squared gradient magnitudes $\sum_{(x,y)} \left[ \left(\frac{\partial L(x,y)}{\partial x}\right)^2 + \left(\frac{\partial L(x,y)}{\partial y}\right)^2 \right]$, and sum of
Laplacians $\sum_{(x,y)} \left[ \frac{\partial^2 L(x,y)}{\partial x^2} + \frac{\partial^2 L(x,y)}{\partial y^2} \right]$ are invariant to rotation around the
central point of the descriptor. Assume that the image’s range and domain are
continuous (so the sums become integrals) and that the aggregations are
taken over equivalent circular regions.
4.23 Show that the directional derivative of a Gaussian function G (given by
Equation (4.7)) at a point (x, y) in an arbitrary unit direction v is a linear
combination of its x and y derivatives at (x, y). This is a simple example of a
steerable filter [153].
4.24 The performance of a descriptor can be measured by plotting a curve of
(1-precision) versus recall as the descriptor distance defining a match is
varied. Sketch two such curves for two hypothetical descriptor algorithms,
and discuss which one represents a better algorithm and why.
4.25 ARTag markers are based on the theory of error correcting codes. They
include ten bits to define the marker ID, sixteen bits for a checksum, and
ten bits for error correction, leading to thirty-six bits arranged in a 6 × 6
square. Show that e = 10 bits are required to correct c = 2 errors in the
initial n = 26 bits, using the formula
$$ e \approx c \log_2 n \tag{4.45} $$
5 Dense Correspondence and Its Applications
The assumption that two images are related by a simple parametric transformation
is extremely common in computer vision. For example, if we denote a pair of images
and their coordinate systems by I1(x, y) and I2(x′, y′), the two are related by an affine
transformation if
1 For an affine transformation to exactly represent the motion of all pixels in images acquired using
perspective projection (see Section 6.2), the image planes must both be parallel to each other and
to the direction of camera motion. Furthermore, if the translational motion is nonzero, the scene
must be a planar surface parallel to the image planes. Nonetheless, the affine assumption is often
made when the scene is far from the camera and the rotation between viewpoints is very small.
Figure 5.1. An original image (a) and the results of various affine and projective transformations.
(b) A similarity transformation (i.e., translation, rotation, and uniform scale). (c) A vertical shear.
(d) A general projective transformation. While (b) and (c) can be written as affine transformations,
(d) cannot.
The symbol ∼ in Equation (5.3) means that the two vectors are equivalent up to
a scalar multiple; that is, to obtain actual pixel coordinates on the left side of
Equation (5.3), we need to divide the vector on the right side of Equation (5.3) by
its third element.
Hartley and Zisserman [188] described how to obtain an initial estimate of the
parameters of a projective transformation given a set of feature matches using the
normalized direct linear transform, or DLT. The steps are as follows:
1. The input is two sets of features {(x1, y1), . . . , (xn, yn)} in the first image plane
and {(x′1, y′1), . . . , (x′n, y′n)} in the second image plane. We normalize each set of
feature matches to have zero mean and average distance from the origin √2.
This can be accomplished by a pair of similarity transformations, represented
as 3 × 3 matrices T and T′ applied to the homogeneous coordinates of the
points.
2. Construct a 2n × 9 matrix A, where each feature match generates two rows of
A, that is, the 2 × 9 matrix
$$
A_i = \begin{bmatrix}
0 & 0 & 0 & x_i & y_i & 1 & -y'_i x_i & -y'_i y_i & -y'_i \\
x_i & y_i & 1 & 0 & 0 & 0 & -x'_i x_i & -x'_i y_i & -x'_i
\end{bmatrix} \tag{5.4}
$$
$$
A_i \, [h_{11} \; h_{12} \; h_{13} \; h_{21} \; h_{22} \; h_{23} \; h_{31} \; h_{32} \; h_{33}]^\top = 0 \tag{5.5}
$$
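The normalization and the construction of A can be sketched in a few lines of NumPy. The remaining steps of the normalized DLT are not reproduced in the text above, so the sketch assumes the standard construction: take the right singular vector of A with the smallest singular value as the normalized estimate, then undo the normalization with H = T′⁻¹ Ĥ T. The function names (`normalizing_similarity`, `dlt_homography`) are hypothetical, and this is an illustration rather than a production implementation.

```python
import numpy as np

def normalizing_similarity(pts):
    """3x3 similarity mapping pts to zero mean and average distance sqrt(2)."""
    pts = np.asarray(pts, dtype=float)
    centroid = pts.mean(axis=0)
    mean_dist = np.linalg.norm(pts - centroid, axis=1).mean()
    s = np.sqrt(2.0) / mean_dist
    return np.array([[s, 0.0, -s * centroid[0]],
                     [0.0, s, -s * centroid[1]],
                     [0.0, 0.0, 1.0]])

def to_homogeneous(pts):
    pts = np.asarray(pts, dtype=float)
    return np.hstack([pts, np.ones((len(pts), 1))])

def dlt_homography(pts1, pts2):
    """Estimate H (unit Frobenius norm) with pts2 ~ H pts1 from n >= 4 matches."""
    T1 = normalizing_similarity(pts1)      # T in the text
    T2 = normalizing_similarity(pts2)      # T' in the text
    p1 = to_homogeneous(pts1) @ T1.T       # normalized (x, y, 1)
    p2 = to_homogeneous(pts2) @ T2.T       # normalized (x', y', 1)
    rows = []
    for (x, y, _), (xp, yp, _) in zip(p1, p2):
        # The two rows of A_i from Equation (5.4).
        rows.append([0, 0, 0, x, y, 1, -yp * x, -yp * y, -yp])
        rows.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y, -xp])
    A = np.array(rows)
    # Assumed step: the null vector of A is the last right singular vector.
    _, _, Vt = np.linalg.svd(A)
    H_norm = Vt[-1].reshape(3, 3)
    # Assumed step: undo the two normalizing similarities.
    H = np.linalg.inv(T2) @ H_norm @ T1
    return H / np.linalg.norm(H)
```

Because H is only defined up to scale, comparing two estimates requires fixing the scale first, for example by dividing each by its lower-right element when that element is not close to zero.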
While the DLT is easy to implement, the resulting estimate does not minimize a
symmetric, geometrically natural error. Under the assumptions that the measure-
ment errors in each feature location are independently, identically distributed (i.i.d.)
with a zero-mean Gaussian pdf, we can obtain a maximum likelihood estimate for
the projective transformation by minimizing
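$$
\sum_{i=1}^{n} \left\| \begin{bmatrix} x_i \\ y_i \end{bmatrix} - \begin{bmatrix} \hat{x}_i \\ \hat{y}_i \end{bmatrix} \right\|_2^2 + \left\| \begin{bmatrix} x'_i \\ y'_i \end{bmatrix} - \begin{bmatrix} \hat{x}'_i \\ \hat{y}'_i \end{bmatrix} \right\|_2^2 \tag{5.6}
$$
over the nine elements of H and {(x̂1, ŷ1), . . . , (x̂n, ŷn)}, which are estimates of the
true feature locations exactly consistent with the projective transformation. Each
(x̂′i, ŷ′i) is the transformation of the corresponding (x̂i, ŷi) by H. This cost function
is nonlinear and can be minimized by the Levenberg-Marquardt algorithm
(Appendix A.4).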
When the feature matches have errors not well modeled by an i.i.d. Gaussian
distribution — for example in the case of outliers caused by incorrect matches —
the RANSAC algorithm [142] should be used to obtain a robust estimate of H by
repeatedly sampling sets of four correspondences, computing a candidate H , and
selecting the estimate with the largest number of inliers (see Problem 5.18). Hartley
and Zisserman [188] give a detailed description of these and further methods for
estimating projective transformations. We will not focus heavily on parametric cor-
respondence estimation here, since correspondence in most real-world scenes is not
well modeled by a single, simple transformation.
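The RANSAC loop described above can be sketched as follows. This is a minimal illustration under several assumptions that are not from the text: candidates are scored by one-sided transfer error in the second image, the inlier threshold is 2 pixels, a fixed number of iterations is used, and the inner fit is the basic (unnormalized) DLT to keep the block short. All function names are hypothetical.

```python
import numpy as np

def fit_homography(p1, p2):
    """Basic DLT for n >= 4 matches; returns a 3x3 H with p2 ~ H p1."""
    rows = []
    for (x, y), (xp, yp) in zip(p1, p2):
        rows.append([0, 0, 0, x, y, 1, -yp * x, -yp * y, -yp])
        rows.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y, -xp])
    _, _, Vt = np.linalg.svd(np.array(rows, dtype=float))
    return Vt[-1].reshape(3, 3)

def transfer_error(H, p1, p2):
    """Distance in image 2 between H applied to p1 and the measured p2."""
    q = np.hstack([p1, np.ones((len(p1), 1))]) @ H.T
    with np.errstate(divide="ignore", invalid="ignore"):
        proj = q[:, :2] / q[:, 2:3]
    err = np.linalg.norm(proj - p2, axis=1)
    return np.where(np.isfinite(err), err, np.inf)

def ransac_homography(p1, p2, n_iters=500, threshold=2.0, seed=0):
    """Return (H, inlier_mask) for the candidate with the most inliers."""
    rng = np.random.default_rng(seed)
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    best_inliers = np.zeros(len(p1), dtype=bool)
    for _ in range(n_iters):
        # Sample a minimal set of four correspondences.
        sample = rng.choice(len(p1), size=4, replace=False)
        H = fit_homography(p1[sample], p2[sample])
        inliers = transfer_error(H, p1, p2) < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 4:
        raise ValueError("RANSAC did not find a consensus set")
    # Refit on the whole consensus set.
    H = fit_homography(p1[best_inliers], p2[best_inliers])
    return H, best_inliers
```

In practice the consensus estimate would typically be refined further, for instance by minimizing Equation (5.6) over the inliers.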
A different way to think about the problem of obtaining dense correspondence from
sparse feature matches is to view the matches as samples of a continuous deformation
field defined over the whole image plane. That is, we seek a continuous function f (x, y)
defined over the first image plane so that f(xi, yi) = (x′i, y′i), i = 1, . . . , n.² This problem
can be viewed as scattered data interpolation since the (xi , yi ) are sparsely, unevenly
distributed in the first image plane (as opposed to being regularly distributed in a grid
pattern, in which case we could use a standard method like bilinear interpolation).
The scattered data interpolation problem is sketched in Figure 5.2.
For all scattered data interpolation methods, it may be necessary for a user to
manually add additional feature matches to further constrain the deformation field
in areas where the estimated correspondence seems unnatural. We discuss this issue
further in Section 5.7.
2 The motion vector (u, v) at a point (x, y) is thus (u, v) = f (x, y) − (x, y).
Figure 5.2. In scattered data interpolation, we’re given a set of feature matches (numbered
circles) unevenly distributed in each image. The goal is to generate a dense correspondence for
every point in the first image plane. In this example, the match for the black point in the left
image is estimated to be the striped point in the right image.
The weights on the radial basis functions and the affine coefficients can be easily
computed by solving a linear system:
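$$
\begin{bmatrix}
0 & \phi(r_{12}) & \cdots & \phi(r_{1n}) & x_1 & y_1 & 1 \\
\phi(r_{21}) & 0 & \cdots & \phi(r_{2n}) & x_2 & y_2 & 1 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots \\
\phi(r_{n1}) & \phi(r_{n2}) & \cdots & 0 & x_n & y_n & 1 \\
x_1 & x_2 & \cdots & x_n & 0 & 0 & 0 \\
y_1 & y_2 & \cdots & y_n & 0 & 0 & 0 \\
1 & 1 & \cdots & 1 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix}
w_{11} & w_{21} \\
w_{12} & w_{22} \\
\vdots & \vdots \\
w_{1n} & w_{2n} \\
a_{11} & a_{21} \\
a_{12} & a_{22} \\
b_1 & b_2
\end{bmatrix}
=
\begin{bmatrix}
x'_1 & y'_1 \\
x'_2 & y'_2 \\
\vdots & \vdots \\
x'_n & y'_n \\
0 & 0 \\
0 & 0 \\
0 & 0
\end{bmatrix} \tag{5.9}
$$
where
$$
r_{ij} = \left\| \begin{bmatrix} x_i \\ y_i \end{bmatrix} - \begin{bmatrix} x_j \\ y_j \end{bmatrix} \right\|_2 \tag{5.10}
$$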
3 So named because they correspond to the shape a thin metal plate would take if constrained at
certain points.
Figure 5.3. An example of thin-plate spline interpolation between two images of the same
scene. (a) Image 1 with feature locations. (b) Image 2 with feature locations. (c) Image 1 warped
to Image 2’s coordinates using an estimated thin-plate spline deformation. The images look virtu-
ally identical, except for minor color changes (e.g., the hubcaps). Rectilinear grid lines (d) on the
coordinate system of Image 1 are transformed into non-rectilinear grid lines (e) in the coordinate
system of Image 2.
The thin-plate spline results in a smooth deformation field from the coordinates
of one image plane to the coordinates of the second, as illustrated in Figure 5.3. It also
has the appealing property of being covariant to rigid transformations of the input
data. However, the interpolating function at (x, y) explicitly depends on all the point
correspondences, so adding a new correspondence requires the deformation field to
be recomputed everywhere (see also Section 8.4.3).
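The linear system in Equation (5.9) is small enough to set up and solve directly. The following sketch assumes the usual thin-plate kernel φ(r) = r² log r with φ(0) = 0, since the definition of φ is not reproduced above; the function names are hypothetical. The returned parameter array stacks the weights and affine coefficients in the same order as the unknown matrix in Equation (5.9).

```python
import numpy as np

def tps_kernel(r):
    """phi(r) = r^2 log r with phi(0) = 0 (assumed thin-plate kernel)."""
    r = np.asarray(r, dtype=float)
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = r[nz] ** 2 * np.log(r[nz])
    return out

def fit_tps(src, dst):
    """Solve Equation (5.9); returns an (n + 3) x 2 array [w; a; b]."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)
    # Pairwise distances r_ij from Equation (5.10).
    r = np.linalg.norm(src[:, None, :] - src[None, :, :], axis=2)
    P = np.hstack([src, np.ones((n, 1))])
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = tps_kernel(r)
    L[:n, n:] = P
    L[n:, :n] = P.T
    rhs = np.zeros((n + 3, 2))
    rhs[:n] = dst
    return np.linalg.solve(L, rhs)

def apply_tps(params, src, query):
    """Evaluate the interpolant f at each row of query."""
    src = np.asarray(src, dtype=float)
    query = np.atleast_2d(np.asarray(query, dtype=float))
    n = len(src)
    r = np.linalg.norm(query[:, None, :] - src[None, :, :], axis=2)
    P = np.hstack([query, np.ones((len(query), 1))])
    # Radial basis part plus affine part.
    return tps_kernel(r) @ params[:n] + P @ params[n:]
```

Note that `apply_tps` touches every feature location for every query point, which is the global dependence mentioned above: adding one correspondence changes the system and therefore the whole field.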
4 Actually, there are two surfaces here; the first satisfies fx(xi, yi) = x′i and the second satisfies
fy(xi, yi) = y′i.
points in its neighborhood. This means that new control points can be added in one
place without changing the interpolation far away. Thus,
$$
f(x, y) = \sum_{k=0}^{3} \sum_{l=0}^{3} w_{kl}(x, y) \, \psi_{kl}(x, y) \tag{5.12}
$$
for appropriate basis functions ψ (usually cubic polynomials), and k and l that define
which basis functions are active at pixel (x, y).
Lee et al. [267] described the details of computing the B-spline basis functions
and weights for scattered data interpolation, and proposed a method for adaptively
varying the resolution of the control point lattice to avoid a very large number of
basis functions. Depending on the algorithm settings, the B-spline can either exactly
interpolate the feature matches, or merely approximate the matches to a desired
tolerance (which generally allows a coarser lattice). Further details about the B-spline
interpolation process can be found in the book by Farin [135].
5.2.3 Diffeomorphisms
Joshi and Miller [228] noted that the thin-plate spline approach is not guaranteed
to produce a diffeomorphism⁵ in cases where the deformation underlying the feature
matches is extreme. Figure 5.4a-b illustrates a simple example of the problem;
the deformation field corresponding to the thin-plate spline causes the grid lines in
the center to intersect themselves. Instead, the deformation field can be computed
as the solution of an ordinary differential equation. The idea is to “flow” the first image
I1(x, y) to the second image I2(x, y) over a time interval t ∈ [0, 1]. This flow can be
represented by an instantaneous velocity field at each point in time, (u(x, y, t), v(x, y, t)),
and a mapping S(x, y, t) that specifies the flowed location of each
point (x, y) at time t. At t = 0 we have the image I1 and at t = 1 we have the image I2.
Figure 5.4. (a) Two feature correspondences in a pair of images. The black dot moves to the
left while the white dot moves to the right. (b) A thin-plate spline interpolant results in a non-
diffeomorphic deformation field — that is, the grid lines self-intersect. (c) Forcing the mapping
to be a diffeomorphism avoids this self-intersection.
The velocity field and the mapping are thus related by the differential equation
$$
\frac{\partial S(x, y, t)}{\partial t} = \begin{bmatrix} u(x, y, t) \\ v(x, y, t) \end{bmatrix}, \quad t \in [0, 1] \tag{5.13}
$$
with the initial condition that S(x, y, 0) = (x, y). The problem is to minimize a function
characterizing the smoothness of the velocity field, subject to the constraint that each
feature location (xi, yi) at t = 0 ends up at (x′i, y′i) at t = 1. Joshi and Miller specifically
proposed to solve the problem
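$$
\begin{aligned}
\min_{(u(x,y,t),\, v(x,y,t))} \quad & \int_{t=0}^{1} \iint (-\nabla^2 u + cu)^2 + (-\nabla^2 v + cv)^2 \; dx \, dy \, dt \\
\text{s.t.} \quad & x'_i = x_i + \int_{t=0}^{1} u(S(x, y, t), t) \, dt \\
& y'_i = y_i + \int_{t=0}^{1} v(S(x, y, t), t) \, dt
\end{aligned} \tag{5.14}
$$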
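To make the role of the mapping S concrete, the following sketch integrates Equation (5.13) numerically for a given velocity field using forward Euler steps, evaluating the velocity at the current flowed position as in the constraints of Equation (5.14). It does not solve the minimization problem itself; the velocity field is simply supplied by the caller. The function `flow_points` and the callable `velocity(x, y, t)` are hypothetical names introduced for illustration.

```python
import numpy as np

def flow_points(points, velocity, n_steps=1000):
    """Forward-Euler integration of dS/dt = (u, v) from t = 0 to t = 1.

    velocity(x, y, t) is a hypothetical callable returning arrays (u, v)
    evaluated at the current (flowed) positions.
    """
    S = np.array(points, dtype=float)     # initial condition S(x, y, 0) = (x, y)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        u, v = velocity(S[:, 0], S[:, 1], k * dt)
        S = S + dt * np.stack([u, v], axis=1)
    return S

# Example: a rotational velocity field carries each point along a circular
# arc, so nearby trajectories never cross.
omega = np.pi / 2
start = np.array([[1.0, 0.0], [0.0, 2.0]])
end = flow_points(start, lambda x, y, t: (-omega * y, omega * x))
```

For a smooth velocity field, trajectories that start at different points cannot meet, which is the intuition behind why the flowed mapping avoids the self-intersections of Figure 5.4b.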
where α is a user-specified parameter (e.g., α = 1). From Equation (5.15) we can see
that the weight on each feature (xi , yi ) decreases as (x, y) gets further away from it,
and goes to infinity when (x, y) coincides with one of the feature points (which forces
the function to exactly interpolate the feature matches). The rigid transformation at
each point can actually be computed in closed form and the deformation field can
be computed over large image domains interactively. However, as with thin-plate
splines, the computed dense correspondence mapping can be non-diffeomorphic if
the deformation is large, and it depends explicitly on all the input feature matches.
One problem with scattered-data-interpolation techniques is that it’s difficult to
control what happens between the features (e.g., to keep straight lines straight).
Also, the computed deformation fields are independent of the underlying image
intensities, although using this intensity information is clearly critical to estimate
a high-quality deformation. Finally, significant user interaction may be required to
obtain a set of feature matches that results in a high-quality deformation; matched
features that come directly from an automated method like those discussed in the
previous chapter will probably not be sufficient. We now turn to optical flow algo-
rithms, which have the natural interpretation of moving pixels from the first image
toward the second. Every pixel intensity in the image plays a role in the deformation.
I (x + u, y + v, t + 1) = I (x, y, t) (5.16)
Without any other information, this is the only part of the optical flow field that can
be obtained. This makes sense because the problem is inherently underconstrained:
we want to estimate two unknowns (u and v) at each pixel, but only have one equation
(5.16). Therefore, we must make additional assumptions to resolve the remaining
degree of freedom at each pixel.
The most natural assumption is that the optical flow field varies smoothly across
the image; that is, neighboring pixels should have similar flow vectors. Horn and
Schunck phrased this constraint by requiring that the gradient magnitude of the flow
field, namely the quantity
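$$
\left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 + \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2 \tag{5.20}
$$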
should be small. Overall, this leads to an energy function to be minimized over the
flow fields u(x, y) and v(x, y):
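$$
\begin{aligned}
E_{HS}(u, v) &= \sum_{x,y} \left[ \left( \frac{\partial I}{\partial x} u + \frac{\partial I}{\partial y} v + \frac{\partial I}{\partial t} \right)^2 + \lambda \left( \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 + \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2 \right) \right] \\
&= E_{\text{data}}(u, v) + \lambda E_{\text{smoothness}}(u, v)
\end{aligned} \tag{5.21}
$$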
where λ is a parameter that specifies the influence of the smoothness term, also
known as a regularization parameter. The larger the value of λ, the smoother the
optical flow field. A large weight on the regularizer (which depends on the domain and
range of I ) is often used to enforce a smooth flow field. As in Section 3.2.1, minimizing
a function like Equation (5.21) can be accomplished by solving the Euler-Lagrange
equations, which in this case are:
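$$
\begin{aligned}
\lambda \nabla^2 u &= \left(\frac{\partial I}{\partial x}\right)^2 u + \frac{\partial I}{\partial x} \frac{\partial I}{\partial y} v + \frac{\partial I}{\partial x} \frac{\partial I}{\partial t} \\
\lambda \nabla^2 v &= \frac{\partial I}{\partial x} \frac{\partial I}{\partial y} u + \left(\frac{\partial I}{\partial y}\right)^2 v + \frac{\partial I}{\partial y} \frac{\partial I}{\partial t}
\end{aligned} \tag{5.22}
$$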
evaluated simultaneously for all the pixels in the image. Computationally, the solu-
tion proceeds in a similar way to the method described in Section 3.2.1. The partial
derivatives of the spatiotemporal function I are approximated using finite differences
between the two given images. That is, at pixel (x, y),
$$
\begin{aligned}
\frac{\partial I}{\partial x} \approx \frac{1}{4} \Big[ & I_1(x+1, y) - I_1(x, y) + I_1(x+1, y+1) - I_1(x, y+1) \\
& + I_2(x+1, y) - I_2(x, y) + I_2(x+1, y+1) - I_2(x, y+1) \Big] \\
\frac{\partial I}{\partial y} \approx \frac{1}{4} \Big[ & I_1(x, y+1) - I_1(x, y) + I_1(x+1, y+1) - I_1(x+1, y) \\
& + I_2(x, y+1) - I_2(x, y) + I_2(x+1, y+1) - I_2(x+1, y) \Big] \\
\frac{\partial I}{\partial t} \approx \frac{1}{4} \Big[ & I_2(x, y) - I_1(x, y) + I_2(x+1, y) - I_1(x+1, y) \\
& + I_2(x, y+1) - I_1(x, y+1) + I_2(x+1, y+1) - I_1(x+1, y+1) \Big]
\end{aligned} \tag{5.23}
$$
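The finite differences in Equation (5.23) and a simple iterative solution of Equation (5.22) can be sketched as follows. The iteration is an assumption, since the solver details live in Section 3.2.1: the Laplacian is approximated as ∇²u ≈ 4(ū − u), where ū is the average of the four neighbors, which turns Equation (5.22) into the pointwise update used below. Images are stored as arrays indexed [y, x], and the function names are hypothetical.

```python
import numpy as np

def hs_derivatives(I1, I2):
    """Finite differences of Equation (5.23); output is (H-1) x (W-1)."""
    I1 = np.asarray(I1, dtype=float)
    I2 = np.asarray(I2, dtype=float)

    def dx(I):   # I(x+1, y) - I(x, y) + I(x+1, y+1) - I(x, y+1)
        return (I[:-1, 1:] - I[:-1, :-1]) + (I[1:, 1:] - I[1:, :-1])

    def dy(I):   # I(x, y+1) - I(x, y) + I(x+1, y+1) - I(x+1, y)
        return (I[1:, :-1] - I[:-1, :-1]) + (I[1:, 1:] - I[:-1, 1:])

    Ix = 0.25 * (dx(I1) + dx(I2))
    Iy = 0.25 * (dy(I1) + dy(I2))
    D = I2 - I1
    It = 0.25 * (D[:-1, :-1] + D[:-1, 1:] + D[1:, :-1] + D[1:, 1:])
    return Ix, Iy, It

def neighbor_average(f):
    """Average of the four neighbors, replicating values at the border."""
    p = np.pad(f, 1, mode="edge")
    return 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:])

def horn_schunck(I1, I2, lam=1.0, n_iters=200):
    """Jacobi-style iterations for Equation (5.22) (assumed discretization)."""
    Ix, Iy, It = hs_derivatives(I1, I2)
    u = np.zeros_like(Ix)
    v = np.zeros_like(Ix)
    denom = 4.0 * lam + Ix ** 2 + Iy ** 2
    for _ in range(n_iters):
        u_bar, v_bar = neighbor_average(u), neighbor_average(v)
        resid = (Ix * u_bar + Iy * v_bar + It) / denom
        u = u_bar - Ix * resid
        v = v_bar - Iy * resid
    return u, v
```

On an image that is a pure intensity ramp, this iteration converges to the flow component along the image gradient only, which illustrates why the smoothness term is needed to resolve the remaining degree of freedom.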
Figure 5.5. Optical flow computed using a hierarchical implementation of the Horn-Schunck
algorithm. (a) Image 1 (top) and Image 2 (bottom). (b) Optical flow field overlaid on Image 1,
indicated by arrows (only a sampling of arrows are illustrated for visibility). We can see that the
flow field accurately captures the motion introduced by camera rotation, even in flat regions.
where w(x, y) is a window function centered at (x0 , y0 ) (for example, a box filter, or a
Gaussian with a given scale σ ). Effectively, we’re assuming that all the pixels in the
window have the same motion vector. The minimizer corresponds to the solution of
the linear system
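$$
\begin{bmatrix}
\sum_{(x,y)} w(x,y) \left( \frac{\partial I}{\partial x}(x,y) \right)^2 & \sum_{(x,y)} w(x,y) \frac{\partial I}{\partial x}(x,y) \frac{\partial I}{\partial y}(x,y) \\
\sum_{(x,y)} w(x,y) \frac{\partial I}{\partial x}(x,y) \frac{\partial I}{\partial y}(x,y) & \sum_{(x,y)} w(x,y) \left( \frac{\partial I}{\partial y}(x,y) \right)^2
\end{bmatrix}
\begin{bmatrix} u \\ v \end{bmatrix}
= -
\begin{bmatrix}
\sum_{(x,y)} w(x,y) \frac{\partial I}{\partial x}(x,y) \frac{\partial I}{\partial t}(x,y) \\
\sum_{(x,y)} w(x,y) \frac{\partial I}{\partial y}(x,y) \frac{\partial I}{\partial t}(x,y)
\end{bmatrix} \tag{5.25}
$$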
In the last chapter, we used the eigenvalues of the matrix on the left-hand side
of Equation (5.25) — the Harris matrix — as the basis for detecting good feature
locations. Windows where one of the eigenvalues is small correspond to flat or edge-
like regions that suffer from the aperture problem and for which a reliable optical
flow vector can’t be estimated. Thus, the only way to use Equation (5.25) to obtain
dense correspondence across the whole image is to make the support of the window
large enough (for example, to make σ large if we’re using a Gaussian) to ensure
both eigenvalues are far from zero. Using large windows may not do a very good job
of estimating flow, since as we observed in Figure 4.11, large square windows are
unlikely to remain square in the presence of camera motion.
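For a single window, Equation (5.25) is just a 2 × 2 linear solve, and the eigenvalue check described above can be applied before solving. The sketch below assumes the derivative samples inside the window have already been computed; the function name and the eigenvalue threshold `min_eig` are hypothetical choices for illustration.

```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It, w=None, min_eig=1e-6):
    """Solve Equation (5.25) for one window.

    Ix, Iy, It hold derivative samples inside the window and w the window
    weights (uniform if omitted). Returns (flow, eigenvalues); flow is None
    when the smaller eigenvalue falls below min_eig (aperture problem).
    """
    Ix, Iy, It = (np.asarray(a, dtype=float).ravel() for a in (Ix, Iy, It))
    w = np.ones_like(Ix) if w is None else np.asarray(w, dtype=float).ravel()
    # Left-hand matrix of Equation (5.25), i.e., the Harris matrix.
    G = np.array([[np.sum(w * Ix * Ix), np.sum(w * Ix * Iy)],
                  [np.sum(w * Ix * Iy), np.sum(w * Iy * Iy)]])
    b = -np.array([np.sum(w * Ix * It), np.sum(w * Iy * It)])
    eigvals = np.linalg.eigvalsh(G)      # ascending order
    if eigvals[0] < min_eig:
        return None, eigvals
    return np.linalg.solve(G, b), eigvals
```

A window whose gradients all point in the same direction (an edge) produces one near-zero eigenvalue, and the function declines to return a flow vector rather than reporting an unreliable one.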
On the other hand, the Lucas-Kanade algorithm is local, in that the flow vector
can be computed at each pixel independently, while the Horn-Schunck algorithm
is global, since all the flow vectors depend on each other through the differential
equations (5.22). This makes the Lucas-Kanade problem computationally easier to
solve. Since the linear system in Equation (5.25) also resulted from the assumption
that the flow vectors (u, v) are small, a pyramidal implementation can be used in the
same way as in the previous section to handle large motions [55].
∇I (x + u, y + v, t + 1) = ∇I (x, y, t) (5.26)
where ∇ is the spatial gradient. This allows some local variation to illumination
changes; consequently, Brox et al. [74] proposed a modified data term reflecting
both brightness and gradient constancy:
$$
E_{\text{data}}(u, v) = \sum_{x,y} \left[ \left( I_2(x+u, y+v) - I_1(x, y) \right)^2 + \gamma \left\| \nabla I_2(x+u, y+v) - \nabla I_1(x, y) \right\|_2^2 \right] \tag{5.27}
$$
where γ weights the contribution of the terms (typically γ is around 100). Note that
Equation (5.27) directly expresses the deviation from the constancy assumptions,
instead of using the Taylor approximation in Equation (5.17) that is only valid for small
u and v. Xu et al. [558] claimed that it was better to use either the brightness constancy
or gradient constancy assumption at each pixel (but not both), and introduced a
binary switch variable for this purpose. An alternate approach is to explicitly model
an affine change in brightness at each pixel, as proposed by Negahdaripour [346] (in
which case these parameters also need to be estimated and regularized).
Bruhn et al. [76] proposed to replace the Horn-Schunck data term with one
inspired by the Lucas-Kanade algorithm, that is,
$$
E_{\text{data}}(u, v) = \sum_{(x,y)} w(x, y) \left( I_2(x+u, y+v) - I_1(x, y) \right)^2 \tag{5.28}
$$
where w(x, y) is a Gaussian with scale σ centered at the point (x0 , y0 ) at which the flow
is computed. The scale σ is usually in the range of one to three pixels. This approach
combines the advantages of Lucas-Kanade’s local spatial smoothing, which makes
the data term robust to noise, with Horn-Schunck’s global regularization, which
makes the flow fields smooth and dense. Bruhn et al. also extended this approach
to spatiotemporal smoothing when more than two images are available.
Sun et al. [476] collected ground-truth optical flow fields to learn a more accurate
model of how real images deviate from the brightness constancy assumption. The
learned distribution of I2 (x + u, y + v) − I1 (x, y) can be used to build a probabilistic
data term (e.g., approximating the distribution by a mixture of Gaussians). The same
approach can be used to learn distributions of filter responses applied to the flow
field (such as the gradient proposed by Brox et al. above).
$$
D(g) = \frac{1}{\|g\|_2^2 + 2\beta^2} \left( g^{\perp} g^{\perp\top} + \beta^2 I_{2 \times 2} \right) \tag{5.30}
$$
Figure 5.6. Neighborhoods corresponding to the anisotropic diffusion tensor with β = 0.25 at
different points in an image. The major and minor axes of each ellipse are aligned with the
tensor’s eigenvectors and weighted by the corresponding eigenvalues. Flat neighborhoods result
in nearly circular smoothing regions while edge-like neighborhoods result in ellipses that indicate
smoothing along, but not across, image edges.
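The behavior described in the caption can be checked numerically from Equation (5.30). In this sketch, g is taken to be the image gradient at a pixel and g⊥ the same vector rotated by 90 degrees; both readings are assumptions, since the surrounding definitions are not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def diffusion_tensor(g, beta=0.25):
    """Anisotropic diffusion tensor D(g) of Equation (5.30) for a 2-vector g."""
    g = np.asarray(g, dtype=float)
    g_perp = np.array([-g[1], g[0]])     # g rotated by 90 degrees (assumed)
    return (np.outer(g_perp, g_perp) + beta ** 2 * np.eye(2)) / (g @ g + 2 * beta ** 2)

# Flat region: no gradient, so the tensor is isotropic (a circle).
D_flat = diffusion_tensor([0.0, 0.0])

# Strong horizontal gradient (a vertical edge): smoothing acts almost
# entirely along the edge direction (0, 1), not across it.
D_edge = diffusion_tensor([10.0, 0.0])
eigvals, eigvecs = np.linalg.eigh(D_edge)
```

The trace of D(g) equals 1 for every g, so the tensor redistributes a fixed amount of smoothing between the directions along and across the gradient.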
where ρ(z) is a general penalty function. When the error in z is normally distributed,
then the optimal penalty function (i.e., the one that results in the maximum likelihood
estimate of the parameters) is ρ(z) = z², which gives the original Horn-Schunck
data term. However, if the values of z are expected to contain outliers (i.e., the distri-
bution of z is heavy tailed), then we want to choose ρ to reduce the weight of these
outliers. Table 5.1 defines and illustrates several robust penalty functions with this
property commonly used for optical flow, including the Lorentzian, Charbonnier,
and Generalized Charbonnier functions.
An appealing property satisfied by the Lorentzian penalty function is that the
derivative of ρ(z) is redescending; that is, it initially increases in magnitude from 0
as |z| grows, but drops back to 0 for large |z|. On the other hand, the Charbonnier
penalty function is nearly equal to |z|, but unlike |z| is differentiable at 0. That is, it is
a differentiable approximation to the L1 norm.
Table 5.1: Robust penalty functions typically used for optical flow. β is an
adjustable parameter, and ε is a very small number (e.g., 0.001) so that the
Charbonnier penalty function is nearly equal to |z|.

Penalty function           Definition
Lorentzian                 $\rho(z) = \log\left(1 + \frac{1}{2}\left(\frac{z}{\beta}\right)^2\right)$
Charbonnier                $\rho(z) = \sqrt{z^2 + \varepsilon^2}$
Generalized Charbonnier    $\rho(z) = (z^2 + \varepsilon^2)^{\beta}$

[Plots of ρ(z) versus z for −2 ≤ z ≤ 2 accompany each penalty function in the original table.]
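The three penalty functions in Table 5.1 are one-liners in NumPy. The sketch below also includes the derivative of the Lorentzian, derived here by differentiating the table entry, to show the redescending behavior numerically. The default parameter values are illustrative assumptions rather than recommendations from the text.

```python
import numpy as np

def lorentzian(z, beta=1.0):
    """rho(z) = log(1 + (1/2) (z / beta)^2)."""
    return np.log1p(0.5 * (z / beta) ** 2)

def charbonnier(z, eps=1e-3):
    """rho(z) = sqrt(z^2 + eps^2), a differentiable approximation to |z|."""
    return np.sqrt(z ** 2 + eps ** 2)

def generalized_charbonnier(z, beta=0.5, eps=1e-3):
    """rho(z) = (z^2 + eps^2)^beta; beta = 0.5 gives the Charbonnier penalty."""
    return (z ** 2 + eps ** 2) ** beta

def lorentzian_derivative(z, beta=1.0):
    """d rho / dz = z / (beta^2 + z^2 / 2); peaks at |z| = sqrt(2) beta."""
    return z / (beta ** 2 + 0.5 * z ** 2)

z = np.linspace(-2.0, 2.0, 9)
quadratic = z ** 2                      # the non-robust Horn-Schunck penalty
robust = lorentzian(z)                  # grows only logarithmically in |z|
influence = lorentzian_derivative(z)    # redescends toward 0 for large |z|
```

Comparing `quadratic` with `robust` at the ends of the range shows how much less a large residual contributes under the robust penalty.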
In the same way, we can robustify the smoothness term in optical flow. In this
case, we need to robustly compare a vector at each pixel (i.e., the four optical flow
gradients) to that of its neighbors. A typical robust smoothness term has the form
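$$
E_{\text{smoothness}}(u, v) = \rho\left( \left\| \left( \frac{\partial u}{\partial x}, \frac{\partial u}{\partial y}, \frac{\partial v}{\partial x}, \frac{\partial v}{\partial y} \right) \right\|_2 \right) \tag{5.32}
$$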
where ρ is again one of the robust functions in Table 5.1 (it need not be the same
penalty function used for the data term). In particular, when ρ is the (approximate) L1
penalty function, the optical flow method is said to use total variation regularization
(e.g., [74, 538]). An alternative approach is to apply a different ρ to each of the four
gradient terms and sum the results.
5.3.3.4 Occlusions
The techniques discussed so far implicitly assume that every pixel in I2 corresponds
to some pixel in I1 . However, this assumption is violated at occlusions — regions
visible in one image but not the other. Occlusions are generally caused by objects
close to a stationary camera that move in the interval between taking the images,
or by changes in perspective due to a moving camera. In either case, background
pixels formerly visible in I1 will be hidden behind objects in I2 , and background pixels
Figure 5.7. Occlusions in optical flow occur when either the camera or objects in the scene
move between taking the images. In this sketch, occlusions are introduced due to a difference
in camera perspective. Point A in the left image is occluded in the right image, and Point B in the
right image is occluded in the left image.
Figure 5.8. Cross-checking for detecting occlusions in optical flow. If the flow vector at the
white pixel in the left image points to the black pixel, then the flow vector at the black pixel
in the second image should point back at the white pixel. That is, the gray and white pixels
in image I1 should be the same. Otherwise, the correspondence is inconsistent and the pixel is
probably occluded in one of the images.
$$
(u, v)_{\text{fwd}}(x, y) = -(u, v)_{\text{bwd}}\big( I_2(x + u_{\text{fwd}}(x, y), \; y + v_{\text{fwd}}(x, y)) \big) \tag{5.33}
$$
That is, as illustrated in Figure 5.8, the optical flow vector at the location given by
the forward flow in the second image should point back to the original pixel in the first
image. If the two vectors are not opposites, then the pixel is likely to be occluded in
one of the images. The “direction” of the occlusion can be determined by examining
the value of the cost function for each of the flows.
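The cross-checking test can be sketched as a simple array operation. The sketch assumes the forward and backward flow fields are stored as H × W × 2 arrays of (u, v), looks up the backward flow at the nearest pixel to the forward-flowed location, and accepts a pixel when the two vectors cancel to within a tolerance. The function name and the tolerance of one pixel are hypothetical choices.

```python
import numpy as np

def cross_check(flow_fwd, flow_bwd, tol=1.0):
    """Boolean mask of pixels whose forward and backward flows are consistent.

    flow_fwd maps I1 to I2 and flow_bwd maps I2 to I1; both are H x W x 2
    arrays holding (u, v) and indexed [y, x].
    """
    H, W = flow_fwd.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    # Location in I2 reached by following the forward flow (nearest pixel).
    xt = np.rint(xs + flow_fwd[..., 0]).astype(int)
    yt = np.rint(ys + flow_fwd[..., 1]).astype(int)
    inside = (xt >= 0) & (xt < W) & (yt >= 0) & (yt < H)
    back = flow_bwd[np.clip(yt, 0, H - 1), np.clip(xt, 0, W - 1)]
    # For a consistent match the two vectors should be opposites.
    mismatch = np.linalg.norm(flow_fwd + back, axis=2)
    return inside & (mismatch <= tol)
```

Pixels where the mask is False are candidates for occlusion; pixels whose forward flow leaves the image are also rejected, since no backward vector is available for them.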
The robust cost functions discussed in the previous section can partially mitigate
the occlusion problem, but several proposed algorithms (e.g., [556]) explicitly detect
7 Belhumeur [37] notes that “half-occluded” regions is more accurate, since the problematic pixels
are visible in one image but not the other.
and deal with occlusions while estimating the flow. We will return to the role of
occlusions in dense correspondence in Section 19.
be smoothed away and not detectable at the coarser levels of the pyramid where the
small (u, v) assumption is valid. Thus, the correct motion never gets propagated to
finer levels of the pyramid.
Techniques for large-displacement optical flow often adapt ideas from the invari-
ant descriptor literature discussed in the last chapter to match regions that are far
apart. For example, Brox et al. [73] proposed an algorithm that begins with the seg-
mentation of each image into roughly constant-texture patches, each of which is
described by a SIFT-like descriptor. The descriptors are matched to obtain several
nearest-neighbor correspondences, and optical flow is estimated for each patch pair.
Finally, an additional term is added to a robust Horn-Schunck-type cost function that
biases the flow at (x, y) to be close to one of the (ui , vi ) candidates at (x, y) obtained
from the descriptor matching.
Liu et al. [289] proposed another option called “SIFT Flow” that replaces the bright-
ness constancy assumption with a SIFT-descriptor constancy assumption. That is,
the optical flow data term is based on S2 (x + u, y + v) − S1 (x, y), where S1 (x, y) and
S2 (x, y) are the SIFT descriptor vectors computed at (x, y) in I1 and I2 , respectively. The
descriptors are computed densely at each pixel of the images instead of at sparsely
detected feature locations. However, the method is really designed to match between
different scenes, and the resulting flow fields typically appear blocky compared to
the method of Brox et al.
Figure 5.9. (a) One frame of an original video sequence. (b) Layers created semi-automatically
based on the human-assisted motion estimation algorithm of Liu et al. [288]. (c) The u components
of an optical flow field semiautomatically generated using the layers in (b). The optical flow field
is very smooth within the layers.
We now consider a special case of the optical flow problem in which the two images
I1 and I2 are taken at the same time instant by two cameras in different positions.⁸
That is, any differences in the positions of corresponding points are entirely due to the
underlying camera motion. In this special case, we don’t need to search across the
entire image to estimate the motion vector (u, v). In fact, there is only one degree of
freedom for the possible location of a point in I2 corresponding to a fixed point in I1 .
This important constraint is described by the epipolar geometry relating the image
pair. In this section, we introduce the epipolar geometry and the constraints it puts
on feature matches. In the next section, we return to the optical flow problem in this
special case, which is called stereo correspondence.
Figure 5.10 illustrates the situation, showing the image planes of the two cameras,
a scene point P, and its projection (x, y) on the first image plane. Any point
on the ray extending from the optical center of the first camera through P will project
to the same position (x, y). This line, when viewed from the second camera, generates
the epipolar constraint mentioned earlier. That is, the correspondence (x′, y′) must
occur somewhere along this line (called an epipolar line). Similarly, any point in the
second image must have its correspondence on an epipolar line in the first image.
Of course, if we don’t know the positions and orientations of the cameras that took
the images, it seems like we can’t determine the location of the epipolar line for a
Figure 5.10. When two images are taken at exactly the same time, the correspondence of a
point (x, y) in the first image (corresponding to the 3D point P in the scene) must occur along a
special line in the second image (the epipolar line corresponding to (x, y)).
8 Or equivalently, the same camera changes position while looking at a static scene. We assume the
cameras use the perspective projection model (discussed further in Chapter 6).
Figure 5.11. (a) A family of planes that tilts around the line connecting the two camera centers.
Every point in the scene has to lie on one of these planes. (b) Fixing one plane creates a pair of
conjugate epipolar lines in the two images. All of the epipolar lines in one image intersect at the
epipole, the projection of the camera center of the other image.
given (x, y). However, the epipolar lines are even more constrained than Figure 5.10
suggests. Figure 5.11a illustrates a family of planes that tilts along a common “axis” —
the line connecting the two camera centers. Every point in the scene has to lie on one
of these planes. If we fix a plane Π, as illustrated in Figure 5.11b, it will intersect the
image planes in two lines, ℓ1 and ℓ2. These two lines are exactly the epipolar lines
mentioned previously, and now we can see that they come in conjugate pairs. That
is, if a point on ℓ1 in I1 has a correspondence in I2, it must lie on ℓ2, and vice versa.⁹
Thus, in each image we have a one-dimensional family of epipolar lines. The epipolar
lines in each image all intersect at a special point called the epipole, which we can
see from Figure 5.11b is the projection of the other camera’s center.¹⁰
In stereo, we reduce the optical flow problem to a one-dimensional correspon-
dence problem along each pair of epipolar lines. The next subsections discuss the
fundamental matrix, which mathematically encapsulates the epipolar geometry and
can be estimated from feature matches in a similar manner to how we estimated a
projective transformation in Section 5.1. It’s conventional to rectify the images before
applying a stereo algorithm, which means that we transform each image with a pro-
jective transformation so that the epipolar lines coincide with rows of the resulting
pair of images. Section 5.4.3 discusses that process.
9 Not all points on ℓ1 may have a correspondence in I2 due to occlusions; see Figure 5.16.
10 In many real situations, the epipole is not visible in the captured image.
The equation of the epipolar line in I2 for a fixed (x, y) ∈ I1 is easily obtained from
Equation (5.34):
$$
\begin{bmatrix} x' & y' & 1 \end{bmatrix} F \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = 0 \tag{5.35}
$$
That is, the coefficients on x′, y′, and 1 are given by the three values of $F \, [x, y, 1]^\top$.
The equations for epipolar lines in I1 can be similarly obtained by fixing (x′, y′) in
Equation (5.34).
The fundamental matrix is only defined up to scale, since any scalar multiple of
F also satisfies Equation (5.34). Furthermore, the 3 × 3 fundamental matrix only has
rank 2; that is, it has one zero eigenvalue. We can see why this is true from a geometric
argument. As mentioned earlier, all the epipolar lines in I1 intersect at the epipole
e = (xe , ye ). Therefore, for any (x', y') ∈ I2 , e lies on the corresponding epipolar line;
that is,
$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}^{\top} F \begin{bmatrix} x_e \\ y_e \\ 1 \end{bmatrix} = 0 \tag{5.36}$$
holds for every (x', y'). This means that
$$F \begin{bmatrix} x_e \\ y_e \\ 1 \end{bmatrix} = 0 \tag{5.37}$$
That is, $[x_e, y_e, 1]^{\top}$ is an eigenvector of F with eigenvalue 0. Similarly, $[x'_e, y'_e, 1]^{\top}$ is an
eigenvector of $F^{\top}$ with eigenvalue 0. Therefore, the epipoles in both images can easily
be obtained from the fundamental matrix by extracting these eigenvectors.
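As a minimal sketch of this observation, the epipoles can be read off as the null vectors of F and of its transpose. The example below uses the singular value decomposition rather than an eigendecomposition, which is a numerically convenient substitute; the function names are illustrative, and we assume both epipoles are finite so that the last homogeneous coordinate can be scaled to 1.

```python
import numpy as np

def skew(v):
    """Return the 3x3 skew-symmetric matrix [v]_x, so that skew(v) @ w == cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def epipoles(F):
    """Return (e, e_prime) with F @ e = 0 and F.T @ e_prime = 0, last entry scaled to 1."""
    U, _, Vt = np.linalg.svd(F)
    e = Vt[-1]           # right null vector: epipole in the first image
    e_prime = U[:, -1]   # left null vector: epipole in the second image
    return e / e[2], e_prime / e_prime[2]

# Hypothetical test matrix built as F = [e']_x M for a random M.
rng = np.random.default_rng(0)
true_e_prime = np.array([300.0, 120.0, 1.0])
F = skew(true_e_prime) @ rng.normal(size=(3, 3))
e, e_prime = epipoles(F)
```

Here `e_prime` should agree with `true_e_prime`, and `F @ e` should vanish up to rounding error.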
A useful way of representing F is a factorization based on the epipole in the second
image:11
$$F = \begin{bmatrix} x'_e \\ y'_e \\ 1 \end{bmatrix}_{\times} M \tag{5.38}$$
In this form, F is clearly rank-2 since the skew-symmetric matrix [e']× is rank-2.
The fundamental matrix is not defined for an image pair that shares the same cam-
era center; recall our assumption was that the two cameras are in different positions.
In this case, the images are related by a stronger constraint: a projective transfor-
mation that directly specifies each pair of corresponding points, as discussed in
Section 5.1. The same type of relationship holds when the scene contains only a
single plane. In general, the fundamental matrix for an image pair taken by a pair of
separated cameras of a real-world scene is defined and unique (up to scale).12
$$\begin{bmatrix} x'_i x_i & x'_i y_i & x'_i & y'_i x_i & y'_i y_i & y'_i & x_i & y_i & 1 \end{bmatrix} \begin{bmatrix} f_{11} & f_{12} & f_{13} & f_{21} & f_{22} & f_{23} & f_{31} & f_{32} & f_{33} \end{bmatrix}^{\top} = 0 \tag{5.40}$$
Collecting the linear equations for each point yields an n × 9 linear system
Af = 0, which can be solved similarly to the method we discussed for a projective
transformation. The basic algorithm, also called the normalized eight-point
algorithm,13 is:
1. The input is two sets of features $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ in the first image plane
and $\{(x'_1, y'_1), \ldots, (x'_n, y'_n)\}$ in the second image plane. Normalize each set of
feature matches to have zero mean and average distance from the origin √2.
This can be accomplished by a pair of similarity transformations, represented
as 3 × 3 matrices T and T' applied to the homogeneous coordinates of the
points.
2. Construct the n × 9 matrix A, where each feature match generates a row given
by Equation (5.40).
3. Compute the singular value decomposition of A, $A = U D V^{\top}$. D will be a 9 × 9
diagonal matrix with positive entries that decrease from upper left to lower
right. Let f be the last column of V (a 9 × 1 vector).
4. Reshape f into a 3 × 3 matrix F̂ , filling in the elements from left to right in each
row.
5. Compute the singular value decomposition of F̂ , $\hat{F} = U D V^{\top}$. D will be a 3 × 3
diagonal matrix with positive entries that decrease from upper left to lower
right. Set the lower right (3,3) entry of D equal to zero to create a new diagonal
matrix D̂ and replace F̂ with $U \hat{D} V^{\top}$.
6. Recover the final fundamental matrix estimate as $F = T'^{\top} \hat{F}\, T$.
The new Step 5 is required since the previous steps don’t guarantee that the esti-
mated F is rank two (a requirement for a fundamental matrix). Step 5 has the effect
of replacing the F̂ from Step 4 with the nearest rank-2 matrix.14
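The six steps above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a production implementation: the function names are our own, the inputs are assumed to be n × 2 arrays of matched pixel coordinates with n ≥ 8, and no outlier rejection is attempted.

```python
import numpy as np

def normalize_points(pts):
    """Step 1: similarity transform giving zero mean and average distance sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - mean, axis=1))
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def eight_point(pts1, pts2):
    """Estimate F such that [x', y', 1] F [x, y, 1]^T = 0 for each match."""
    p1, T1 = normalize_points(np.asarray(pts1, dtype=float))
    p2, T2 = normalize_points(np.asarray(pts2, dtype=float))
    x, y = p1[:, 0], p1[:, 1]
    xp, yp = p2[:, 0], p2[:, 1]
    # Step 2: one row per match, ordered as in Equation (5.40).
    A = np.column_stack([xp * x, xp * y, xp, yp * x, yp * y, yp,
                         x, y, np.ones(len(x))])
    # Steps 3-4: last right singular vector, reshaped row by row.
    _, _, Vt = np.linalg.svd(A)
    F_hat = Vt[-1].reshape(3, 3)
    # Step 5: enforce rank 2 by zeroing the smallest singular value.
    U, D, Vt = np.linalg.svd(F_hat)
    D[2] = 0.0
    F_hat = U @ np.diag(D) @ Vt
    # Step 6: undo the normalization.
    F = T2.T @ F_hat @ T1
    return F / np.linalg.norm(F)   # F is only defined up to scale
```

The final division by the Frobenius norm is a convenience that fixes the arbitrary scale; any nonzero multiple of the result is an equally valid estimate.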
Figure 5.12 illustrates the result of estimating the fundamental matrix (and thus,
the epipolar geometry) for a real image pair using the normalized eight-point algo-
rithm. We can see that the resulting epipolar lines are well estimated — that is,
the original feature matches and other corresponding points clearly lie along the
estimated conjugate epipolar lines.
12 For details on degenerate scene configurations where the fundamental matrix is not uniquely
defined, see Maybank [317].
13 So named since at least eight points are required to obtain a unique solution.
14 In the sense of minimizing the Frobenius, or sum-of-squares, norm.
Figure 5.12. An example of estimating the epipolar geometry using the normalized eight-point
algorithm. (a) and (b) Images of the same scene from different perspectives, with feature matches
overlaid. (c) and (d) The green lines are epipolar lines computed from the estimated fundamental
matrix. We can see that corresponding points lie on conjugate epipolar lines (for example, the
corners of the roof and the brown line down the right side of the building). Note that the epipoles
(where the epipolar lines in each image intersect) are not visible in this example.
Hartley and Zisserman [188] describe further extensions of the eight-point algo-
rithm and discuss how the estimate of the fundamental matrix can be improved (e.g.,
to a maximum likelihood estimate under the assumptions that the measurement
errors in each feature location are Gaussian). Nonlinear minimization is required to
solve the problem and RANSAC can be used to detect and remove outliers in the data
to obtain a robust estimate.
So far, we have avoided detailed discussion of the 3D configuration of the cameras,
focusing on purely 2D considerations of the relationship between correspondences.
When we discuss matchmoving in Chapter 6, these 3D relationships will be made
more explicit. In particular, when each camera’s location, orientation, and internal
configuration are known, the fundamental matrix can be computed directly (see
Section 6.4.1). Conversely, estimating the fundamental matrix is often involved in the
early process of matchmoving.
Figure 5.13. Prior to rectification, epipolar lines are slanted, which complicates computing dense
correspondence. After rectification by appropriate projective transformations, epipolar lines
coincide with matching image rows (scanlines), making the correspondence search easier.
once at the beginning of the process, when the rectifying projective transformations
are applied, easing the search for dense correspondence.
Figure 5.13 illustrates the idea of rectification. In a rectified image pair, the epipolar
lines are parallel; thus, the epipoles are said to be “at infinity,” represented by the
homogeneous coordinate [1, 0, 0], which informally corresponds to the “point” [∞, 0]
infinitely far away on the x axis. The fundamental matrix for a rectified image pair is
given by
$$F^{*} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & -1 & 0 \end{bmatrix} \tag{5.41}$$
Plugging F* into Equation (5.34) gives the simple constraint that for a correspondence
in the rectified pair of images, y' = y, corresponding to our definition of
rectification.
There are many choices for selecting a pair of rectifying projective transformations
(H1 , H2 ) for a given image pair.15 For example, horizontally stretching a rectified pair
by an arbitrary amount will still produce rectified images. The main consideration is
that the rectified images should not be too distorted.
Here, we describe the rectification method proposed by Hartley [187], which uses a
set of feature matches $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ ∈ I1 and $\{(x'_1, y'_1), \ldots, (x'_n, y'_n)\}$ ∈ I2 . The idea
is to estimate a projective transformation H2 for the second image that moves the
epipole to the homogeneous coordinate [1, 0, 0] while resembling a rigid transfor-
mation as much as possible. Then a matching rectifying transformation is determined
for the first image that minimizes the distance between the matches in the rectified
images. In more detail, we apply the following algorithm:
1. Estimate the fundamental matrix F from the feature matches using the
algorithm in the previous section.
2. Factor F in the form F = [e']× M , where $e' = [x'_e, y'_e, 1]^{\top}$ is the homogeneous
coordinate of the epipole in the second image.
3. Choose a location (x0 , y0 ) in the second image — for example, the center of the
image — and determine a 3 × 3 homogeneous translation matrix T that moves
(x0 , y0 ) to the origin.
4. Determine a 3×3 homogeneous rotation matrix R that moves the epipole onto
the x-axis; let its new location be (x ∗ , 0).
15 In fact, there are seven degrees of freedom in the sixteen parameters of H1 and H2 .
5. Compute H2 as
$$H_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -1/x^{*} & 0 & 1 \end{bmatrix} R\,T \tag{5.42}$$
The first matrix in Equation (5.42) moves the epipole to infinity along the
x-axis, while the overall transformation resembles a rigid motion in the
neighborhood of (x0 , y0 ).
6. Apply the projective transformation H2 M (where M was determined in Step
2) to the features in I1 and the projective transformation H2 to the features
in I2 to get a transformed set of feature matches $\{(\hat{x}_1, \hat{y}_1), \ldots, (\hat{x}_n, \hat{y}_n)\}$ and
$\{(\hat{x}'_1, \hat{y}'_1), \ldots, (\hat{x}'_n, \hat{y}'_n)\}$ respectively.
7. At this point, the two images are rectified, but applying H2 M to I1 may result
in an unacceptably distorted image. The next step is to find a horizontal shear
and translation that bring the feature matches as close together as possible.
We compute this transformation by minimizing the function
$$\sum_{i=1}^{n} \left(a\hat{x}_i + b\hat{y}_i + c - \hat{x}'_i\right)^2 \tag{5.43}$$
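Steps 3 through 5 and the least-squares problem in Step 7 can be sketched as follows. The function names are hypothetical, the epipole is assumed to be finite and distinct from the chosen center (x0, y0), and the point arrays passed to `fit_shear` are assumed to already hold the transformed matches from Step 6.

```python
import numpy as np

def rectify_second_image(e_prime, center):
    """Build H2 = G R T, which sends the homogeneous epipole e' to a multiple of [1, 0, 0]."""
    x0, y0 = center
    T = np.array([[1.0, 0.0, -x0],
                  [0.0, 1.0, -y0],
                  [0.0, 0.0, 1.0]])            # Step 3: move (x0, y0) to the origin
    ex, ey, _ = T @ (e_prime / e_prime[2])
    x_star = np.hypot(ex, ey)                  # distance of the translated epipole from the origin
    c, s = ex / x_star, ey / x_star
    R = np.array([[c, s, 0.0],
                  [-s, c, 0.0],
                  [0.0, 0.0, 1.0]])            # Step 4: rotate the epipole onto (x*, 0)
    G = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [-1.0 / x_star, 0.0, 1.0]])  # Step 5: send (x*, 0) to infinity
    return G @ R @ T

def fit_shear(pts1_hat, pts2_hat):
    """Least-squares (a, b, c) minimizing the sum in Equation (5.43)."""
    A = np.column_stack([pts1_hat[:, 0], pts1_hat[:, 1], np.ones(len(pts1_hat))])
    (a, b, c), *_ = np.linalg.lstsq(A, pts2_hat[:, 0], rcond=None)
    return a, b, c
```

Because the third row of the first matrix in Equation (5.42) only perturbs the homogeneous scale by −x/x*, points near the origin (that is, near the original (x0, y0)) are almost unaffected by it, which is why H2 resembles a rigid motion there.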
Figure 5.14 illustrates the result of applying Hartley’s rectification algorithm to the
real images from Figure 5.12. We can see that the new epipolar lines are horizontal
and aligned, and that the inevitable warping of the two images is not too severe.
An alternate approach that does not require an initial estimate and factorization of
the fundamental matrix was proposed by Isgrò and Trucco [213]. Seitz and Dyer [434]
also proposed a rectification method particularly well suited to the view morphing
application discussed in Section 5.8.
Figure 5.14. Rectifying the two images from Figure 5.12 results in horizontal and aligned
epipolar lines.
5.5 STEREO CORRESPONDENCE
Suppose that the two images I1 and I2 for which we want to compute dense corre-
spondence are taken at the same time instant by two cameras in different positions.
Thus, we can estimate the epipolar geometry and rectify the image pair, as described
in the previous section. In this case, the dense correspondence problem reduces to
the stereo correspondence problem, one of the most well-studied problems in com-
puter vision. We can think of stereo correspondence as a special case of optical flow,
with a few key differences:
16 We conventionally assume I1 is to the left of I2 with respect to the scene. Therefore, a point in I1
should appear to be farther to the right than its matching position in I2 (see Figure 5.11b), and
we define the disparity to be the positive number x − x'. Note that we’re assuming a Cartesian
coordinate system for pixels (i.e., the x axis is horizontal and the y axis is vertical).
stereo algorithms significantly outperform the current top optical flow algorithms,
when the conditions allow either type of method to be applied. The improvement is
generally attributable to the facts that (1) stereo is an easier problem (since we only
need to search for correspondences along conjugate epipolar lines instead of any-
where in the images) and (2) the discrete nature of the disparity values enables sharp
discontinuities to be distinguished more accurately, and allows powerful discrete
global optimization methods to be applied.
We begin by briefly mentioning several early methods for estimating stereo
correspondence, which are generally only locally optimal. These methods have been
superseded by global optimization methods based on techniques like graph cuts
and belief propagation, which consistently rank highly in quantitative benchmarks
for stereo [425]. For the rest of the section we assume that I1 and I2 have already been
rectified.
where I1min (x, y) and I1max (x, y) are the minimum and maximum of the set
$$\left\{ \tfrac{1}{2}\big(I_1(x-1, y) + I_1(x, y)\big),\; I_1(x, y),\; \tfrac{1}{2}\big(I_1(x, y) + I_1(x+1, y)\big) \right\} \tag{5.47}$$
and I2min (x, y) and I2max (x, y) are similarly defined. Effectively, we’re linearly interpolating
between pixels in the row of I1 to determine the best match to a pixel in I2 and vice
versa.
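To make the interpolation idea concrete, the following sketch evaluates the measure for one pixel pair on a pair of scanlines. The interval bounds follow Equation (5.47); the way the bounds are combined is the commonly stated form of the Birchfield-Tomasi dissimilarity, which we assume matches the definition used here. The function names and the replicated border handling are our own choices.

```python
import numpy as np

def interval_bounds(row):
    """Per-pixel min and max of the three-element set in Equation (5.47)."""
    row = np.asarray(row, dtype=float)
    padded = np.pad(row, 1, mode="edge")      # replicate the border pixels
    left = 0.5 * (padded[:-2] + row)          # halfway toward the left neighbor
    right = 0.5 * (row + padded[2:])          # halfway toward the right neighbor
    stack = np.stack([left, row, right])
    return stack.min(axis=0), stack.max(axis=0)

def birchfield_tomasi(row1, row2, x1, x2):
    """Sampling-insensitive dissimilarity between row1[x1] and row2[x2]."""
    row1 = np.asarray(row1, dtype=float)
    row2 = np.asarray(row2, dtype=float)
    lo1, hi1 = interval_bounds(row1)
    lo2, hi2 = interval_bounds(row2)
    # Distance from each pixel to the interpolated intensity interval of the other.
    d12 = max(0.0, row1[x1] - hi2[x2], lo2[x2] - row1[x1])
    d21 = max(0.0, row2[x2] - hi1[x1], lo1[x1] - row2[x2])
    return float(min(d12, d21))

# A linear ramp sampled with a half-pixel offset: the plain absolute
# difference at x = 1 is 5, but the sampling-insensitive measure is 0.
ramp1 = [10, 20, 30, 40]
ramp2 = [15, 25, 35, 45]
cost = birchfield_tomasi(ramp1, ramp2, 1, 1)
```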
17 Birchfield and Tomasi actually proposed to use the measure only between pairs of pixels (i.e., 1 × 1
“windows”), but other researchers have aggregated the measure into larger blocks.
That is, we sum the Hamming distances between corresponding bit strings over the
window to arrive at the final cost. Hirschmüller and Scharstein [200] investigated
a large set of proposed stereo matching costs and concluded that methods based
on the census transform performed extremely well and were robust to photometric
differences between images.
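A small sketch shows why a census-based cost tolerates photometric differences. We assume the usual form of the transform, in which each pixel is replaced by a bit string recording whether each neighbor is darker than the center; the bit ordering, the 3 × 3 neighborhood, the border handling, and the function names are illustrative choices. Disparity follows the convention x' = x − d.

```python
import numpy as np

def census_transform(img, radius=1):
    """Return an (H, W, K) boolean array; bit k is True where neighbor k is darker than the center."""
    img = np.asarray(img, dtype=float)
    H, W = img.shape
    padded = np.pad(img, radius, mode="edge")
    bits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = padded[radius + dy: radius + dy + H,
                              radius + dx: radius + dx + W]
            bits.append(neighbor < img)
    return np.stack(bits, axis=-1)

def census_cost(c1, c2, x, y, d, half=1):
    """Sum of Hamming distances over a window centered at (x, y) in I1 and (x - d, y) in I2.

    The caller is responsible for keeping both windows inside the image.
    """
    w1 = c1[y - half: y + half + 1, x - half: x + half + 1]
    w2 = c2[y - half: y + half + 1, x - d - half: x - d + half + 1]
    return int(np.count_nonzero(w1 != w2))

# The second image is shifted by two pixels and has a different gain and offset.
rng = np.random.default_rng(0)
I1 = rng.integers(0, 255, size=(10, 12))
I2 = 2 * np.roll(I1, -2, axis=1) + 10
c1, c2 = census_transform(I1), census_transform(I2)
cost_at_true_disparity = census_cost(c1, c2, x=5, y=4, d=2)
```

Since the bit strings depend only on the ordering of intensities, any monotonically increasing change in brightness leaves them, and hence the cost at the true disparity, unchanged.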
Block-matching stereo methods in which each pixel independently determines its
disparity — known as winner-take-all approaches — are clearly suboptimal com-
pared to methods that simultaneously determine all the disparities according to
some global criterion. Using overlapping blocks implicitly encourages some degree
of coherence between disparities at neighboring pixels (or smoothness, in the ter-
minology of optical flow). However, disparity maps determined in this way typically
exhibit artifacts both within and across scanlines and don’t produce high-quality
dense correspondence. These artifacts include poor performance in flat, constant-
intensity image regions where local methods have no way to determine the correct
match due to the aperture problem, as well as near object boundaries where blocks
overlap regions with significantly different disparities. In addition, multiple matches
may occur between different pixels in I1 and the same pixel in I2 .
A better idea is an algorithm that enforces global optimality of the estimated dis-
parity along each pair of scanlines. One of the earliest approaches, proposed by Ohta
and Kanade [352], used dynamic programming to find the globally optimal cor-
respondence between a scanline pair, using detected edges in each scanline as a
guide to build a piecewise-linear disparity map. Figure 5.16 illustrates the idea. The
disparities d(x, y) for an entire row (i.e., fixed y) are selected to minimize
$$\sum_{x=1}^{N} C(x, y, d(x, y)) \tag{5.49}$$
Figure 5.16. Scanline matching using dynamic programming, as described by Ohta and Kanade
[352]. (a) and (b) Two rectified images of the same scene from different perspectives, with an
overlaid scanline. (c) The matching path for the dynamic programming problem must go through
the nodes generated by pairs of edges in each scanline. The path must proceed from the lower
left corner to the upper right corner without doubling back on itself; occlusions can be modeled
as horizontal or vertical lines.
where C could be any of the cost functions discussed earlier. Appendix A.1 describes
how to set up and solve the dynamic programming problem. Limiting the maximum
allowable disparity can greatly reduce the complexity of finding the solution.
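The following sketch illustrates the flavor of scanline dynamic programming. It is a simplified pixel-based variant with a constant occlusion penalty, not Ohta and Kanade's edge-based formulation; the absolute intensity difference stands in for C, and the function name, the penalty `occ`, and the use of −1 to mark occluded pixels are all assumptions made for illustration.

```python
import numpy as np

def scanline_dp(row1, row2, occ=10.0, d_max=None):
    """Match two rectified scanlines; return per-pixel disparities for row1 and the total cost."""
    r1 = np.asarray(row1, dtype=float)
    r2 = np.asarray(row2, dtype=float)
    n, m = len(r1), len(r2)
    D = np.full((n + 1, m + 1), np.inf)
    move = np.zeros((n + 1, m + 1), dtype=int)  # 0: match, 1: skip pixel in row1, 2: skip pixel in row2
    D[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best, arg = np.inf, 1
            if i > 0 and j > 0:
                d = i - j                        # disparity x - x' of this candidate match
                if d >= 0 and (d_max is None or d <= d_max):
                    cost = D[i - 1, j - 1] + abs(r1[i - 1] - r2[j - 1])
                    if cost < best:
                        best, arg = cost, 0
            if i > 0 and D[i - 1, j] + occ < best:
                best, arg = D[i - 1, j] + occ, 1
            if j > 0 and D[i, j - 1] + occ < best:
                best, arg = D[i, j - 1] + occ, 2
            D[i, j], move[i, j] = best, arg
    # Backtrack from the upper right corner to recover the matching path.
    disparity = np.full(n, -1)
    i, j = n, m
    while i > 0 or j > 0:
        if move[i, j] == 0:
            disparity[i - 1] = i - j
            i, j = i - 1, j - 1
        elif move[i, j] == 1:
            i -= 1
        else:
            j -= 1
    return disparity, D[n, m]
```

Passing a value for `d_max` prunes most of the table, which is the complexity reduction mentioned above; the double loop is written for clarity, not speed.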
As illustrated in Figure 5.16, occlusions in either image can be modeled as horizon-
tal or vertical segments in the dynamic programming graph. However, note that this
formulation requires monotonicity — that is, the property that corresponding points
appear in the same order along matching scanlines. This assumption is not always
justified for real images, as characterized by the double nail illusion illustrated in
Figure 5.17.
Extensions to the basic dynamic programming approach involved determining
the best cost function C in Equation (5.49), and more importantly, determining
how to assign reasonable costs to occluded regions. For example, Belhumeur [37]
Figure 5.17. The double nail illusion. Point A (on the foreground object) and point B (on the
background wall) appear in different order in the left and right images. This phenomenon typically
occurs with thin vertical objects close to the cameras.
built a Bayesian model for the disparity estimation problem from first principles,
with respect to multiple occluding objects that may have sharp interior depth edges
(creases). This led to an explicit prior distribution on disparities that could be
incorporated into a dynamic programming problem for each scanline.
Dynamic programming approaches overcome many of the limitations of winner-
take-all techniques; they can deal naturally with low-contrast regions and avoid the
multiple-match problem. However, a major issue is enforcing the consistency of
disparities across neighboring scanlines, since dynamic programming only applies
to matching two one-dimensional signals. While many authors (including [352, 37])
proposed ways of enforcing consistent disparities across scanlines, these methods
inevitably produce undesirable “streaky” disparity maps that come from inconsistent
estimates between adjacent rows. The global methods in the next sections operate
directly in the 2D image plane, and neither suffer from these artifacts nor require the
monotonicity assumption.
in Appendix A.3. As in optical flow, we minimize the sum of a data term and a smooth-
ness term; however, as in Section 3.3, the energy to be minimized is of the form:
$$E(L) = \sum_{i \in V} E_{\text{data}}(L(i)) + \sum_{(i,j) \in E} E_{\text{smoothness}}(L(i), L(j)) \tag{5.50}$$
Here, V is the set of pixels in I1 , E is the set of all adjacent pixels (for example,
4-neighbors), and L is a labeling; that is, an assignment in {0, . . . , dmax } to each pixel
i, where dmax is the maximum allowable value of the disparity. The data term can
be formed from any of the matching cost functions defined in the previous section;
for example, at a vertex i corresponding to the location (x, y), we could choose the
Birchfield-Tomasi measure
Here, it makes sense for the window size to be a single pixel, since the smoothness
term handles consistency between pixels.
The critical issue is choosing a smoothness term that accurately handles
discontinuities. The easiest choice, the quadratic function
$$E_{\text{smoothness}}(L(i), L(j)) = (L(i) - L(j))^2$$
encourages the disparity map to be smooth everywhere but does a poor job at object
boundaries. Another option is to use a robust cost function ρ, as we discussed in
Section 5.3.3.3. However, in stereo, it’s more common to use one of the simpler
functions given in Table 5.2.
Each of these functions avoids over-penalizing sharp changes in disparity between
neighboring pixels, while generally favoring disparity maps in which regions have
similar labels. In particular, the Potts model encourages regions to have con-
stant disparity, a more extreme case of smoothness. The intensity-adaptive Potts
model incorporates contextual information about the image intensities; that is, we
impose a higher penalty if the intensity levels are similar but the disparity labels are
different.
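As a rough illustration, here are commonly used forms of these smoothness functions. The constants, the intensity threshold `tau`, and the exact parameterization are assumptions for this sketch and may differ from the entries of Table 5.2.

```python
def potts(a, b, K=1.0):
    """Constant penalty whenever the two labels differ."""
    return 0.0 if a == b else K

def truncated_linear(a, b, K=2.0):
    """Absolute label difference, capped at K."""
    return min(abs(a - b), K)

def truncated_quadratic(a, b, K=4.0):
    """Squared label difference, capped at K."""
    return min((a - b) ** 2, K)

def adaptive_potts(a, b, intensity_i, intensity_j, K_similar=2.0, K_different=1.0, tau=8.0):
    """Potts penalty that is larger when the two pixel intensities are similar."""
    if a == b:
        return 0.0
    return K_similar if abs(intensity_i - intensity_j) < tau else K_different

# The truncated quadratic violates the triangle inequality:
# the direct jump 0 -> 2 costs 4, but 0 -> 1 -> 2 costs only 1 + 1 = 2.
direct = truncated_quadratic(0, 2)
via_middle = truncated_quadratic(0, 1) + truncated_quadratic(1, 2)
```

All four functions are bounded, so a large jump in disparity across an object boundary costs no more than a moderate one.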
Figure 5.18. An example stereo result using graph cuts. (a), (b) An original stereo image pair.
(This is the “Tsukuba” dataset frequently used for benchmarking stereo algorithms, originally
created and ground-truthed by Nakamura et al. [344].) (c) The ground-truth disparity map, in
which the disparity labels {0, 1, . . . , 14} are mapped from black to white. Objects closer to the cam-
eras (e.g., the lamp) have higher disparities. (d) The stereo result using graph-cut optimization
as described by Boykov et al. [61].
18 Technically, the truncated quadratic can’t be used in the α-expansion algorithm since it’s not a
metric.
approach based on α-expansion that both enforces uniqueness and properly handles
occlusions. However, the set of vertices is quite different from that shown earlier.
Instead, each vertex in the graph is a viable correspondence: a match {(x, y), (x', y')}
such that x − x' ∈ {0, 1, . . . , dmax } and y = y'. An edge connects two vertices if they have
the same disparity, with a small weight if the two pixel pairs have similar intensities
and a large weight otherwise. This encourages pairs of adjacent pixels with simi-
lar intensities to have similar disparities. Edges with infinite weight are also defined
between vertices that contain the same pixel with different disparities, so that cut-
ting such an edge would violate the uniqueness property. Finally, edges are defined
connecting each vertex to two terminals, so that cutting such an edge implies that the
pixel is occluded in one image or the other. They constructed α-expansion steps on
a series of such graphs, so that the final labeling approximately minimizes a sum of
data, smoothness, and occlusion penalty terms, while maintaining uniqueness and
allowing some pixels to remain unmatched.
While uniqueness seems like a desirable property, it can be problematic in the
case of horizontally slanted surfaces, as noted by Sun et al. [480]. That is, a surface
may extend across many pixels in one image but only a few pixels in the other image
due to foreshortening. They proposed a more general visibility constraint that non-
occluded pixels must have at least one match in the other image, while occluded
pixels must have no matches; this required the estimation of an additional occlusion
map (see more in the next section).
where d ∈ {0, 1, . . . , dmax }. Effectively, this message conveys vertex i’s assess-
ment that vertex j should have disparity d.
3. Repeat Step 2 for T iterations.
4. Form the final belief vector at each i ∈ V and d ∈ {0, 1, . . . , dmax } as:
$$b_i(d) = E_{\text{data}}(L(i) = d) + \sum_{j \mid (i,j) \in E} m_{ji}^{T}(d) \tag{5.54}$$
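The message-passing loop and the belief computation of Equation (5.54) can be sketched for a generic graph as follows. The message update used here is the standard min-sum rule, which we assume corresponds to Step 2; the final labeling by minimum belief, the synchronous schedule, the message normalization, and all names are likewise assumptions of this sketch.

```python
import numpy as np

def min_sum_bp(data_cost, edges, V, T=10):
    """Min-sum belief propagation.

    data_cost: (n, L) array of E_data values; edges: undirected (i, j) pairs;
    V: (L, L) array with V[a, b] = E_smoothness(a, b); T: number of iterations.
    """
    n, L = data_cost.shape
    directed = list(edges) + [(j, i) for i, j in edges]
    msgs = {e: np.zeros(L) for e in directed}
    neighbors = {i: [] for i in range(n)}
    for i, j in directed:
        neighbors[j].append(i)
    for _ in range(T):
        new_msgs = {}
        for i, j in directed:
            # Cost of each label at i, using all incoming messages except the one from j.
            h = data_cost[i] + sum((msgs[(k, i)] for k in neighbors[i] if k != j),
                                   np.zeros(L))
            m = np.min(h[:, None] + V, axis=0)   # minimize over the label at i
            new_msgs[(i, j)] = m - m.min()       # normalization keeps values bounded
        msgs = new_msgs
    # Beliefs as in Equation (5.54), up to a per-pixel constant from the normalization.
    beliefs = np.array([data_cost[i] + sum((msgs[(k, i)] for k in neighbors[i]),
                                           np.zeros(L))
                        for i in range(n)])
    return beliefs.argmin(axis=1), beliefs
```

On a graph without loops (for example, a single scanline treated as a chain), this procedure recovers the exact minimizer once T is at least the length of the chain; on the 4-connected image grid it is only approximate.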
That is, we only compute disparities for unoccluded pixels, and add a constant
penalty Eocclusion for each occluded pixel. In practice, the computation is simplified
by allowing the values in the occlusion map to range continuously between 0 and
1, resulting in a soft combination of the data and occlusion terms in the cost func-
tion. Sun et al. also proposed a binary discontinuity map defined over the edges,
which explicitly encodes the presence or absence of a discontinuity. Similar to how
Equation (5.55) differs from Equation (5.50), the smoothness term is decomposed
into a term for edges along continuous surfaces and a penalty for discontinuities.
Sun et al. [480] later refined the approach to a symmetric stereo model, where con-
sistent disparities, occlusions, and discontinuities for the left and right images are
computed simultaneously (instead of just computing these quantities for the left
image). Xu and Jia [557] took a similar approach, using a data term inspired by robust
matting ([532], Section 2.6.1).
where a, b, and c can be estimated with robust methods [486]. In this case, we can add
a segmentation-based regularization term to the data term of a stereo cost function,
such as
$$\sum_{i \in V} E_{\text{segment}}\big(L(i) - (a_i x(i) + b_i y(i) + c_i)\big) \tag{5.56}$$
where ai , bi , and ci are the estimated plane parameters for the segment containing
pixel i, and Esegment (x) could be based on one of the robust cost functions in Tables 5.1
and 5.2. Of course, now we must perform an extra step of segmenting the image
into roughly constant-intensity pieces, which is commonly solved using the mean-
shift algorithm [102]. If necessary, initial estimates of the disparity map within each
segment can be obtained by an algorithm like Lucas-Kanade. High-performing stereo
algorithms that use a segmentation approach include Sun et al. [480], Klaus et al. [242],
Wang and Zheng [537], and Yang et al. [563]. Bleyer et al. [50] extended the approach
to incorporate a term based on minimum description length to penalize the number
of segments and to allow higher-order disparity surfaces such as B-splines.
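As a simple illustration of the segment-based term, the sketch below fits a disparity plane to the pixels of one segment and evaluates a truncated penalty on the deviation from it. Ordinary least squares is used here in place of the robust estimators cited above, and the truncated absolute difference is just one possible choice for Esegment; the function names and the threshold are hypothetical.

```python
import numpy as np

def fit_disparity_plane(xs, ys, ds):
    """Least-squares plane d = a x + b y + c through the pixels of one segment."""
    A = np.column_stack([xs, ys, np.ones(len(xs))])
    (a, b, c), *_ = np.linalg.lstsq(A, ds, rcond=None)
    return a, b, c

def segment_penalty(label, x, y, plane, K=3.0):
    """Truncated absolute deviation of a disparity label from its segment's plane."""
    a, b, c = plane
    return min(abs(label - (a * x + b * y + c)), K)

# Hypothetical segment whose initial disparities lie exactly on a slanted plane.
xs = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0])
ys = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
ds = 0.5 * xs - 0.25 * ys + 4.0
plane = fit_disparity_plane(xs, ys, ds)
```

In practice a robust fit matters because the initial disparities inside a segment usually contain outliers near its boundary.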
Yang et al. [563] also noted that quadratic interpolation could be used to enhance
the quantized disparity estimates from a stereo algorithm, recovering a sub-pixel
disparity image. This step would likely be critical for obtaining good results for the
applications of dense correspondence we discuss in the next three sections.
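One standard way to carry out such a refinement is to fit a parabola through the matching cost at the winning integer disparity and its two neighbors, then take the parabola's minimizer. The sketch below assumes that form; the details of the procedure used by Yang et al. may differ, and the function name is our own.

```python
import numpy as np

def subpixel_disparity(costs, d):
    """Refine the integer disparity d using a parabola through costs[d-1], costs[d], costs[d+1]."""
    if d <= 0 or d >= len(costs) - 1:
        return float(d)                 # no neighbor on one side: keep the integer value
    c_minus, c_0, c_plus = costs[d - 1], costs[d], costs[d + 1]
    curvature = c_minus - 2.0 * c_0 + c_plus
    if curvature <= 0:
        return float(d)                 # flat or concave fit: no well-defined minimum
    return d + 0.5 * (c_minus - c_plus) / curvature

# Costs sampled from a parabola whose true minimum lies at 3.3.
costs = (np.arange(8) - 3.3) ** 2
d_int = int(np.argmin(costs))
d_sub = subpixel_disparity(costs, d_int)
```

When the cost really is quadratic near its minimum, as in this example, the refinement recovers the true minimizer; otherwise it is an approximation whose correction is at most half a pixel in magnitude whenever d is a local minimum of the sampled costs.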
We can extend the two-frame dense correspondence problem in several ways. One
possibility is to consider simultaneous correspondences between additional synchro-
nized cameras at different locations in the scene; this problem is called multi-view
stereo and will be discussed in detail in Section 8.3. Another possibility is to extend the
dense correspondence problem to video sequences, generalizing optical flow in the
case of a single camera and stereo in the case of a rigidly mounted pair of cameras.
These cases can generally be handled by adding a temporal regularization term to the
cost function that encourages the flow values (u, v) or disparity labels L to be similar
to those of the previous frame. For example, for stereo video this term might look like
$$\sum_{i \in V} E_{\text{temporal}}\big(L^{t}(i) - L^{t-1}(i)\big) \tag{5.57}$$
where the superscript t indicates the time index in the video. Alternately (or in
addition), the flow field/disparity map from the previous frame can be used as an
initial estimate for the current frame. Sawhney et al. [421] described an algorithm for
high-resolution dense correspondence for stereo video in this vein.
A third situation arises when we consider a pair of video cameras that move
through a scene at different times and different velocities. This is related to the visual
effects problem of motion control — that is, the synchronization of multiple cam-
era passes over the same scene. Motion control for an effects-quality shot typically
requires a computer-controlled rig that moves through an environment along a pre-
programmed path with extremely high precision. In this way, an environment can be
set up multiple times so that different elements can be independently filmed (e.g.,
separate passes for live action, dangerous elements, lighting, models and miniatures,20
and so on). The different elements are then matted and composited together
to create the final shot, as discussed in Chapters 2 and 3.
While motion control rigs are highly precise, they are also extremely large and
expensive. In this section, we generalize optical flow techniques to achieve a motion-
control effect in situations where it would be infeasible to use a professional rig.
We call this problem video matching. Formally, we consider two video sequences
I1 (x, y, t) and I2 (x, y, t). The problem is to estimate a flow field and time offset at each
frame according to a generalized brightness constancy assumption:
I2 (x + u, y + v, t + δ) = I1 (x, y, t) (5.58)
Caspi and Irani [84] addressed a simplified version of the video matching problem
for a rigidly mounted pair of cameras, assuming the two video sequences were related
by a spatial projective transformation and a constant temporal offset. Sand and Teller
[420] addressed the more general video matching problem, illustrated in Figure 5.19;
we summarize the approach here.
Consider a candidate pair of images, one from each video sequence. We estimate
a set of feature matches between the pair, for example, by detecting and matching
Harris corners. Each match is assigned a confidence based on the similarity of the
local neighborhoods around the pair of feature locations. These confidence-weighted
feature matches are used to build a dense correspondence field between the image
pair, using a locally weighted regression to estimate a smooth optical flow field. The
input and the output of the algorithm are the same as the scattered data interpolation
methods described in Section 5.2; like these methods, the resulting dense correspon-
dence field cannot represent discontinuities. By comparing the actual second image
with the warped first image predicted by the motion field, we identify regions of
Figure 5.19. The video matching problem. We begin with two video sequences (a) and (b) that
follow roughly the same trajectory in space (curved line). The arrows indicate viewpoints at
equally spaced points in time, showing that the video sequences have different fields of view and
velocities. After video matching in (c), frames of the second sequence are aligned both spatially
and temporally with the first sequence, for subsequent use in applications like compositing.
20 For models and miniatures, the camera motion and software must compensate for the different
scale with respect to live action. For more information on motion control, see Rickitt [393].
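$$\lambda \sum_{(x_i,y_i) \in I_1(t)} \; \sum_{(x_j,y_j) \in I_1(t)} \left( \left\| \begin{bmatrix} x_i \\ y_i \end{bmatrix} - \begin{bmatrix} x_j \\ y_j \end{bmatrix} \right\|_2 - \left\| \begin{bmatrix} x_i + u(x_i, y_i) \\ y_i + v(x_i, y_i) \end{bmatrix} - \begin{bmatrix} x_j + u(x_j, y_j) \\ y_j + v(x_j, y_j) \end{bmatrix} \right\|_2 \right)^{2} + \sum_{(x,y) \in I_1(t)} \left\| \begin{bmatrix} u(x, y) \\ v(x, y) \end{bmatrix} \right\|_2^{2} \tag{5.59}$$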
The first term in Equation (5.59) is based on the parallax between a pair of
matches — that is, the difference between their distance in the first image and their
distance in the second image. The parallax is invariant to image rotation and transla-
tion. Here, we compute the average parallax over all correspondences introduced by
the optical flow field, as a measure of the introduced image distortion. The second
term in Equation (5.59) is the average optical flow vector magnitude, which is small
when the overlap between the two images is large. Sand and Teller used λ = 5 to
emphasize the importance of parallax.
Now we can use the pairwise matching cost between pairs of frames in the first and
second video sequences to build a set of frame-to-frame correspondences. The user
initializes the process by selecting the first pair of corresponding frames (I1 (t0 ), I2 (t0 +
δ0 )). Then we iteratively determine the frame-to-frame correspondences with the
following procedure:
1. Set k = 0.
2. Set the initial guess for the offset δk+1 as a weighted average of the five previous
offsets, where the weight decreases as we move back in time.
3. Compute the matching cost between I1 (tk+1 ) and the set of frames I2 (tk +δk+1 ),
I2 (tk + δk+1 + 1), I2 (tk + δk+1 − 1), I2 (tk + δk+1 + 5), and I2 (tk + δk+1 − 5).
4. Fit a quadratic function to these costs as illustrated in Figure 5.20. Determine
the minimizer δ ∗ of the quadratic function.
5. If δ ∗ = δk+1 — that is, the minimizer stays in the same place — set k = k + 1
and go to Step 2. Otherwise, set δk+1 = δ ∗ and go to Step 3.
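Steps 3 through 5 amount to a small iterative search, sketched below. Here `cost_fn` is a hypothetical callable that returns the frame-to-frame matching cost for a given integer offset; rounding the minimizer to the nearest frame, returning the current offset when the fit is not convex, and the iteration cap are assumptions of this sketch rather than details taken from the text.

```python
import numpy as np

def refine_offset(cost_fn, delta):
    """Fit a quadratic to the cost at delta and at 1 and 5 frames on either side."""
    offsets = delta + np.array([-5, -1, 0, 1, 5])
    costs = np.array([cost_fn(int(o)) for o in offsets], dtype=float)
    a, b, _ = np.polyfit(offsets, costs, 2)     # least-squares quadratic through 5 samples
    if a <= 0:
        return delta                            # no interior minimum: keep the current guess
    return int(round(-b / (2.0 * a)))           # minimizer, rounded to a whole frame

def track_offset(cost_fn, delta, max_iter=20):
    """Repeat the quadratic fit until the minimizer stops moving (Step 5)."""
    for _ in range(max_iter):
        new_delta = refine_offset(cost_fn, delta)
        if new_delta == delta:
            break
        delta = new_delta
    return delta

# Toy cost with its minimum at an offset of 12 frames, starting from a guess of 7.
best_offset = track_offset(lambda o: (o - 12) ** 2 + 3.0, 7)
```

Sampling at both ±1 and ±5 frames gives the fit local detail near the current guess as well as enough reach to move several frames in one iteration.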
Once the two videos are spatially and temporally synchronized, we can apply many
of the algorithms from Chapters 2 and 3. For example, if one sequence contains live
action and the other contains an empty background, we have a strong prior estimate
of the alignment required for video matting and inpainting. We can also film one
person on the left side of a moving shot, and the same person on the right side of a
similar shot, compositing the two videos along a seam to create a “twinning” effect.
Alternately, we can replace a stand-in from a live-action plate with a computer-
generated character composited over a clean plate.
Figure 5.20. Fitting a quadratic function to frame-to-frame matching costs for video matching.
The white dot is the current estimate of the matching frame’s position in the second video. We
evaluate the frame-to-frame matching cost at this estimate, and 1 and 5 frames on either side
(gray dots). By fitting a quadratic function to the costs, we obtain a new estimate for the matching
frame at the minimizer of the quadratic (striped dot).
5.7 MORPHING
One of the most compelling visual effects created using dense correspondence
between an image pair is morphing, also known as image metamorphosis. Morph-
ing uses an estimated dense correspondence field to create a smooth transformation
from one image into another, and was used to great effect in films like Terminator
2, Indiana Jones and the Last Crusade, and the video for Michael Jackson’s Black or
White.
Unlike our assumptions for optical flow and stereo, the two images in the morphing
problem typically contain different objects (for example, two different faces). Since
these images significantly violate the brightness constancy assumption, the dense
correspondence is typically estimated from a hand-selected set of feature matches,
using methods from Section 5.2. Correspondence fields for morphing applications
generally don’t take into account occlusions or discontinuities, instead resembling a
deformed “rubber sheet.” That is, we require each point in the first image to have a
correspondence in the second image and vice versa.
We begin with two images, I1 and I2 , and two dense correspondence fields (u, v)fwd
from I1 to I2 and (u, v)bwd from I2 to I1 .21 The morphing problem is to construct a
sequence of intermediate images {Mt , t ∈ [0, Δt, 2Δt, . . . , 1]} so that M0 = I1 , M1 = I2 ,
and the intermediate images create a natural transformation from one image to the
other.
A naïve solution to morphing is to simply cross-dissolve between the two images,
that is, letting
Mt (x, y) = (1 − t)I1 (x, y) + tI2 (x, y) (5.60)
However, as we can see from Figure 5.21, this approach generates poor results,
since corresponding structures in the two images are not aligned. In particular,
intermediate images don’t look like they contain a single object of the same type
as either of the two original images.
21 A consistent (u, v)bwd can be constructed from a given (u, v)fwd if we only computed the flow in
one direction. However, it is sometimes useful to allow the two fields to be inconsistent (as is the
case for the field morphing algorithm discussed shortly).
Figure 5.21. Simply cross-dissolving between two original images (far left and right) produces
an unrealistic transition between them (intermediate images).
Figure 5.22. The process of image morphing using dense correspondence fields. Intermediate
images Î1t and Î2t are created by applying forward and backward optical flow fields to the original
images I1 and I2 , respectively. The morph image Mt is created as a weighted average between
the colors of Î1t and Î2t . In this sketch, t ≈ 1/3, so the morph image Mt is closer to I1 both in structure
and color.
The key to an effective morph is to warp each image toward the other before
performing the cross-dissolve; in this way, the image structures are aligned in each
image of the morphing sequence. The process is illustrated in Figure 5.22. Specifically,
we apply the following basic algorithm to create each image Mt .
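1. Compute intermediate images Î1t and Î2t by warping I1 and I2 a fraction of the
way along their estimated flow fields:
Î1t (x + t ufwd (x, y), y + t vfwd (x, y)) = I1 (x, y)
Î2t (x + (1 − t) ubwd (x, y), y + (1 − t) vbwd (x, y)) = I2 (x, y) (5.61)
2. Cross-dissolve between the intermediate images:
Mt (x, y) = (1 − t)Î1t (x, y) + tÎ2t (x, y) (5.62)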
Figure 5.23. A real example of image morphing using dense correspondence fields. Top row:
intermediate images Î1t ; the original image I1 corresponds to the leftmost image Î10 . Middle row:
intermediate images Î2t ; the original image I2 corresponds to the rightmost image Î21 . Bottom row:
final morph sequence created by cross-dissolving between corresponding images in the top and
middle rows. The transformation is much more compelling than the simple cross-dissolve in
Figure 5.21.
Figure 5.23 illustrates an example with real images. We can see that each intermediate
image Î1t is the result of warping the pixels of I1 the fraction t of the way to
their corresponding locations in I2 . Thus, Î11 (the rightmost image in the first row of
Figure 5.23) contains the pixel intensities of I1 at the locations in I2 . A similar argument
holds for the intermediate images Î2t in the second row of Figure 5.23. We can
see that the cross-dissolve between corresponding images in the first and second
rows of Figure 5.23 yields a realistic morph between both the intensities and image
structures.
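The warp-then-cross-dissolve procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a production implementation: the flow fields are given as per-pixel arrays `(u, v)`, I1 is pushed the fraction t along the forward flow and I2 the fraction 1 − t along the backward flow, and pixels are splatted to the nearest destination pixel, so holes are left at zero and folds are resolved arbitrarily. The function names are hypothetical.

```python
import numpy as np

def forward_warp(img, u, v, frac):
    """Move each pixel of `img` the fraction `frac` along its flow vector (u, v).

    Nearest-pixel splatting: destinations that receive no source pixel stay
    at zero (holes), and later writes overwrite earlier ones (folds).
    """
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    xd = np.rint(xs + frac * u).astype(int)
    yd = np.rint(ys + frac * v).astype(int)
    ok = (xd >= 0) & (xd < w) & (yd >= 0) & (yd < h)
    out = np.zeros_like(img, dtype=float)
    out[yd[ok], xd[ok]] = img[ys[ok], xs[ok]]
    return out

def morph(i1, i2, fwd, bwd, t):
    """Morph image M_t from I1, I2 and flow fields fwd = (u, v), bwd = (u, v)."""
    i1_t = forward_warp(i1, fwd[0], fwd[1], t)          # I1 warped toward I2
    i2_t = forward_warp(i2, bwd[0], bwd[1], 1.0 - t)    # I2 warped toward I1
    return (1.0 - t) * i1_t + t * i2_t                  # cross-dissolve
```

A practical system would typically fill holes and use a smoother resampling scheme; the sketch only shows how the two warps and the blend fit together.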
Morphing algorithms principally differ in their methods for obtaining the dense
correspondence fields between the image pair. Early methods used a compatible
quadrilateral mesh defined over the images (similar to the optimized-scale-and-
stretch grid in Section 3.5.1). Correspondences between points inside mesh quads
can be obtained by bilinear interpolation, or more generally using B-splines as dis-
cussed earlier in this chapter. The difficulty with this approach comes in attempting
to control the mesh to conform well to important image features, resulting in large
regions with either too many or too few mesh vertices. An alternate approach is to use
one of the scattered data interpolation techniques from Section 5.2; for example, Lee
et al. investigated both thin-plate splines [268] and adaptive, nonuniform B-spline
interpolating surfaces [269] to define the correspondences for morphing. A unique
aspect of the latter approach was the use of “snakes” [230] to automatically snap the
user-specified points to image features.
One of the most popular approaches to estimating dense correspondence for mor-
phing is the field morphing technique proposed by Beier and Neely [35]. Unlike the
methods in Section 5.2, the correspondence is interpolated from a set of several
corresponding user-drawn line segments on the two images. This allows the ani-
mator to have more control over the morph, since the method guarantees that the
correspondence between each pair of segments will be maintained in each mor-
phed image — something that a spline interpolation of feature matches cannot
guarantee. For example, to morph between two faces, an animator would draw
matching lines along the edges of the head, the eyebrows, the lips, and so on,
which is more intuitive than trying to establish feature matches in smooth, flat
regions.
Figure 5.24. (a) Field morphing for a single line pair. αi is the projection’s relative distance from
pi to qi . βi is the signed distance to the line segment. (b) When we have more than one line
segment pair, each pair generates an estimate for the location x′.
Here, di (x) is the non-negative distance from the point x to line segment i in I1 ,
and a, b, c are user-defined constants. If a is 0, the length of a line segment has no
effect on the weight computation; if a is 1 or larger, longer lines have more weight. b
is a small number that ensures Equation (5.64) is defined even for points exactly on
a line segment. c determines how quickly the influence of a line segment decreases
with distance and is usually in the range [1.0, 2.0].
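As a rough sketch of how these quantities combine, the following code maps a single point x in I1 to its estimated location in I2. It follows the published Beier and Neely formulation as I understand it, with the constants a, b, c playing the roles described above; the exact weight expression (length^a / (b + distance))^c and the function names are assumptions, since the corresponding equations are not reproduced in this text.

```python
import numpy as np

def perp(v):
    """Rotate a 2D vector by 90 degrees."""
    return np.array([-v[1], v[0]])

def field_morph_point(x, lines1, lines2, a=1.0, b=1e-3, c=2.0):
    """Map point x in I1 to I2 using corresponding line segments.

    lines1, lines2: sequences of (p, q) endpoint pairs in I1 and I2.
    """
    x = np.asarray(x, dtype=float)
    num = np.zeros(2)
    den = 0.0
    for (p, q), (p2, q2) in zip(lines1, lines2):
        p, q, p2, q2 = (np.asarray(z, dtype=float) for z in (p, q, p2, q2))
        d1, d2 = q - p, q2 - p2
        len1, len2 = np.linalg.norm(d1), np.linalg.norm(d2)
        alpha = np.dot(x - p, d1) / len1**2        # relative position along segment
        beta = np.dot(x - p, perp(d1)) / len1      # signed distance to the line
        x_i = p2 + alpha * d2 + beta * perp(d2) / len2   # this pair's estimate
        # Non-negative distance from x to the segment (not the infinite line).
        if alpha < 0:
            dist = np.linalg.norm(x - p)
        elif alpha > 1:
            dist = np.linalg.norm(x - q)
        else:
            dist = abs(beta)
        w = (len1**a / (b + dist)) ** c
        num += w * (x_i - x)
        den += w
    # Weighted average of the per-segment displacements.
    return x + num / den
```

With a single line pair the result is exactly that pair's estimate; with several pairs, segments close to x dominate the weighted average.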
Figure 5.25. (a) User-defined line segment correspondences for the field morphing algorithm.
(b) Line segment positions and image warps corresponding to t = 0.5. (c) The final morphed
image is the average of the images in (b).
To apply field morphing, we first define corresponding line segments for an image
pair, and then interpolate corresponding line segment endpoints to determine the
intermediate line segment positions in the warped image for each t. We then warp
each source image from the original line segments to the intermediate line segments,
and cross-dissolve between every pair of warped images. Figure 5.25 illustrates an
example of user-defined line segment matches between an image pair and a halfway
image created with the induced dense correspondence.
This algorithm is somewhat heuristic, so care must be taken to define “good” pairs
of line segments to make a compelling transformation. For example, line segments
should not cross, and a segment should span the same apparent region in both images
to prevent artifacts that arise from the underlying correspondence field folding back
on itself (i.e., being non-diffeomorphic). Typically line segments are added and mod-
ified in an iterative process until an acceptable morph is created. Schaefer et al. [422]
showed how the as-rigid-as-possible deformation algorithm for interpolating scat-
tered data discussed in Section 5.2.4 can be modified to operate on correspondences
between line segments, resulting in more natural deformation fields than Beier and
Neely’s original algorithm.
Lee et al. [269] noted that the cross-dissolve step in Equation (5.62) can be gen-
eralized to allow the two images to warp toward each other at different rates; for
example, we could replace t in Equation (5.62) with any function f (t) that monoton-
ically increases from 0 to 1. Even further, we can allow this function to vary spatially,
so that different regions of the image move and dissolve at different rates. Wolberg
[551] gave a good overview of morphing algorithms and further extensions.
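A small sketch of this generalization follows. The choice of smoothstep for f(t) and the per-pixel exponent map `gamma` are illustrative assumptions, not the specific functions used by Lee et al.; any function that increases monotonically from 0 to 1 would serve.

```python
import numpy as np

def smoothstep(t):
    """One possible f(t): increases monotonically from 0 to 1 on [0, 1]."""
    return t * t * (3.0 - 2.0 * t)

def dissolve(i1_t, i2_t, t, f=smoothstep, gamma=None):
    """Blend two already-warped grayscale images with weight f(t).

    gamma: optional positive per-pixel array; the weight becomes f(t) ** gamma,
    so regions with gamma > 1 dissolve later and regions with gamma < 1 earlier.
    """
    w = f(t)
    if gamma is not None:
        w = w ** np.asarray(gamma, dtype=float)
    return (1.0 - w) * i1_t + w * i2_t
```

The same idea applies to the warping fraction: the argument passed to the warp can be replaced by its own rate function, independent of the one used for the dissolve.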
5.8 VIEW SYNTHESIS
Figure 5.26. Directly morphing between two very different views of the same object (left
and right images) can result in unrealistic intermediate images, even though the supplied
correspondence fields contain no errors.
Figure 5.27. (a) The camera configuration for view interpolation. The white image planes rep-
resent the original images and the striped image plane represents the synthesized image. The
image planes are all parallel to each other and to the baseline. (b) The outlined upper right
and lower left images are synthesized from the upper left and lower right images using view
interpolation, and are physically consistent with the underlying scene.
Figure 5.28. Folds and holes can be introduced in the synthesized view based on changes in
visibility. Points A and B create a fold, since they both map to the same point in the synthesized
view. We select the intensity from A since it has a larger disparity. Point C creates a hole, since
it is visible from the synthesized viewpoint but neither source viewpoint.
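The fold and hole behavior described in Figure 5.28 can be illustrated on a single rectified scanline. This is a simplified sketch under assumptions not stated in the text: only one source image is warped, a pixel at column x with disparity d moves to column x − s·d in a view located the fraction s of the way along the baseline, and positions are rounded to the nearest pixel.

```python
import numpy as np

def interpolate_scanline(row, disp, s):
    """Forward-warp one scanline to a virtual view at baseline fraction s.

    Folds are resolved in favor of the larger disparity (the closer point);
    pixels that receive no source value are left as NaN (holes).
    """
    w = len(row)
    out = np.full(w, np.nan)
    best = np.full(w, -np.inf)      # largest disparity written so far
    for x in range(w):
        xs = int(round(x - s * disp[x]))
        if 0 <= xs < w and disp[x] > best[xs]:
            out[xs] = row[x]
            best[xs] = disp[x]
    return out
```

In a full view interpolation both source images contribute, so many holes left by one image can be filled from the other; holes that remain in both must be filled by some form of interpolation or inpainting.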
Seitz and Dyer [434] generalized view interpolation to allow physically consistent
view synthesis under the relaxed constraint that the source and synthesized camera
centers are collinear, as illustrated in Figure 5.29a. They observed that image planes
with arbitrary orientations could be made parallel to each other and to the line con-
necting the camera centers by applying rectifying projective transformations (called
prewarps), which we discussed in Section 5.4.3. After rectification, the conditions for
view interpolation are met, as illustrated in Figure 5.29b.
Figure 5.29. (a) The camera configuration for view morphing. The white image planes represent
the original images and the striped image plane represents the synthesized image. The
only requirement is that the virtual camera center lies on the line connecting the source camera
centers. (b) Applying appropriate rectifying projective transformations (prewarps) to the
source images allows view interpolation to be applied to the intermediate gray image planes. A
postwarping projective transformation is used to rotate the virtual camera to the final striped
image.
Figure 5.30. An example of view morphing. The original source images at the upper left and
upper right are rectified by prewarping projective transformations to the images at the lower left
and lower right. These rectified images are interpolated using view interpolation to produce the
new synthetic image at the lower center, which is postwarped to produce the synthetic view at
the upper center. In this case, the four corners of the top face of the box were used to guide the
postwarp.
We now apply view interpolation to the rectified images and apply a final projective
transformation (called a postwarp) to the synthesized view to effectively rotate
the virtual camera. Seitz and Dyer called this algorithm view morphing. Figure 5.30
illustrates an example of view morphing; we can see that the prewarps and postwarps
enable view synthesis in situations where the source images are quite different
from each other and from the synthesized view. One way to select the postwarping
projective transformation is by linearly interpolating the vertices of a user-specified
quadrilateral in each of the source images, since four feature matches define a
projective transformation.
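The postwarp selection just described can be sketched with a standard direct linear transformation (DLT) estimate. The function names are hypothetical, and `quad_rect` (the quadrilateral's vertices in the interpolated rectified view) is assumed to be available from the view interpolation step.

```python
import numpy as np

def homography_from_points(src, dst):
    """DLT estimate of the 3x3 H with dst ~ H @ src, from four or more matches."""
    rows = []
    for (x, y), (xp, yp) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y, -xp])
        rows.append([0, 0, 0, x, y, 1, -yp * x, -yp * y, -yp])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]   # assumes h[2, 2] is nonzero

def postwarp_homography(quad_rect, quad1, quad2, t):
    """Postwarp taking quad_rect to the linear blend of the source quadrilaterals."""
    target = (1.0 - t) * np.asarray(quad1, float) + t * np.asarray(quad2, float)
    return homography_from_points(quad_rect, target)
```

Applying the returned matrix to homogeneous pixel coordinates of the interpolated rectified image, and dividing by the third coordinate, gives the positions in the final synthesized view.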
Seitz and Dyer observed that the method could also be used to generalize mor-
phing algorithms from the last section; that is, we generate a family of intermediate
views M (t) in which M (0) corresponds to the first source image and M (1) to the
second source image. When we finely sample t, we obtain an effect of smoothly mov-
ing the camera from the position where the first image was taken to the position
where the second image was taken. If the images are taken of the same object at the
same time, the effect is similar to the “bullet time” effect from The Matrix; objects
appear frozen in time as the camera moves around them. If the images are actually of
different objects from different perspectives, the morphing result seems to simulta-
neously transform one object into the other while moving the virtual camera. Radke
et al. [380] generalized view morphing to create an interpolated virtual video from
two source videos, allowing independent temporal manipulation effects and virtual
camera motion.
Mahajan et al. [310] noted that the warping and cross-dissolving steps required
in morphing and view synthesis tend to blur synthetic views due to repeated pixel
resampling and averaging operations (this blurriness is visible in the rectified and
synthetic views in Figure 5.30). Instead, inspired by some of the methods we discussed
in Chapter 3, they proposed to synthesize sharper virtual images so that each pixel in a
synthetic view is taken from exactly one of the source images. Thus, as we morph from
one image to the next, the problem is determining a transition point for each pixel
that specifies which source image should be used for that pixel, and which location
in that source image generates the pixel’s intensity. The algorithm involves a graph-
cut optimization for determining the transition points and Poisson reconstruction to
generate the final synthetic view from its gradients.
5.9 INDUSTRY PERSPECTIVES
Marty Ryan, senior software engineer at Rhythm & Hues Studios in El Segundo, CA
and Shankar Chatterjee, software developer at Cinesite in Hollywood, CA, discuss the
role of optical flow in visual effects.
RJR: What are the most common applications of optical flow in visual effects?
Chatterjee: We also commonly use optical flow for retiming — one example is Thir-
teen Days, a movie about the Cuban missile crisis. In the very last frames there was a
slowdown based on my optical flow algorithm. Optical flow was also used for retim-
ing in the movie Shanghai Noon; they wanted to show Jackie Chan’s movement in
slightly slower than real time, to get the full impact.
Figure 5.31. (a) In this scene from Transformers: Dark of the Moon, parametric transformations
are used to register several high-resolution tiles from a rotating camera into a large background
plate. (b) Determining accurate stereo correspondence is critical for inserting visual effects into
stereo films natively shot in 3D such as Transformers: Dark of the Moon. (c) Optical flow is fre-
quently used to retime scenes, such as this fast-moving sequence from Transformers: Revenge of
the Fallen. Transformers: Dark of the Moon ©2011 Paramount Pictures. All Rights Reserved. Trans-
formers: Revenge of the Fallen ©2009 DW Studios L.L.C. and Paramount Pictures Corporation. All
Rights Reserved.
Before that we used optical flow for a movie called What Dreams May Come. We
extracted optical flow vectors and used them to “modulate” several scenes. In one
scene, the main character slides down a hill into a lake, and the optical flow motion
vectors from the clean plate were used to re-render the scene as if he was smearing
the surface of a brightly colored oil painting. A related application of optical flow
was in a kids’ movie called Clockstoppers. We estimated the motion vectors from the
scene, and used them to create and enhance the motion blur on CGI objects like cars
inserted into the scene.
In another movie called Practical Magic we used it in a shot where a man’s ghost
comes up out of the ground. The director wanted mud stuck to him, but when they
shot it there was no actual mud. We painted the mud in one keyframe and used the
optical flow vectors to track the mud onto him in the rest of the frames. At some point
the ghost vanishes into smoke; we used optical flow there as well. We recorded the
lighting of a match, reversed the video, and estimated the motion vectors as the full-
grown flame shrank to a single spark. We applied those motion vectors to the pixels
on the ghost character to create the dissolving effect. We used a similar approach
to create the “bamfing” effect of the Nightcrawler character in X2: X-Men United, by
shooting smoke against a black background, computing the optical flow vectors, and
applying them to the character.
Ryan: We were inspired by Black and Anandan’s earlier optical flow work, but
more recently we’ve also implemented and had good success with Bruhn et al.’s
optical flow algorithms. Adding the gradient consistency assumption really made it
successful.
RJR: How does academic research on optical flow differ from the requirements of
feature films?
Ryan: There are many reasons why movie frames violate the usual assumptions of
optical flow. The biggest difference is the size of the frames. Academic optical flow
sequences tend to be a few hundred pixels wide, but we need to track HD plates that
are at least 1080p — it’s a whole different world. There’s often motion blur, which
affects heavy movement. Directors love anamorphic lens flare. With a lot of effects-
heavy shows, it’s very common that action sequences are shot at night, like the big
fight sequence that we worked on from The Incredible Hulk, to convey a moody, dark
feeling. You have very intense bright lights and a lot of murky blacks. There are a lot
of tricky things to deal with!
RJR: What are the different ways stereo movies are created?
Beier: There are two kinds of stereo movies: those that are shot natively in stereo
with two cameras like Avatar, and movies that are shot “flat” with a single camera
and converted to 3D like Alice in Wonderland. Transformers: Dark of the Moon is
unique in that it’s half of each.
The main approach to 3D conversion is based on rotoscoping tools. Regions are
segmented into layers, each of which is assigned a depth. They may also add a little
bit of shape to each layer — for example, on a face, they might extrude the nose out
a little bit. Then the “slivers” of background that were occluded in each eye need
to be inpainted and the entire scene is re-rendered. The whole process is extremely
labor-intensive. It’s a place where computer vision really can pay off. An automatic
algorithm doesn’t have to solve the whole problem completely; I don’t think it should,
since there’s a lot of artistry involved in creating a good stereo composition.
RJR: What does a stereo rig for a feature film look like?
Beier: It depends on the movie, but typically you have two cameras, such as Sony
F35’s, mounted at roughly right angles on a rig, viewing the scene through a beam-
splitter. Since the cameras and lenses are big and heavy, unfortunately the rig isn’t
very rigid; if you rotate it slightly, it puts a tremendous amount of torque on the rig
and the cameras go out of alignment. There’s no time on a busy set to recalibrate the
rig, so it has to be fixed in post-production to bring each frame pair back into a perfect
stereo alignment. There’s a very expensive device on set doing real-time analysis of
the videos that will alert you if it sees substantial vertical or rotational disparity, but
it doesn’t really fix the problem.
The critical parameter for stereo filming is called the interocular distance, the
equivalent physical separation between the cameras, which is often really small, like
three centimeters (it would be more accurate to call this the interaxial distance). This
is related to the distance between the viewer’s eyes by similar triangles. That is, the
ratio between the camera interocular distance and the physical distance to a subject
should be the same as the ratio between the viewer’s interocular distance and the
movie screen. It’s important to get this parameter right, because you can’t change it
in postproduction without messing up the parallax. The only thing you can do after
the fact is apply a corner-pin, or projective transformation, to rotate the camera’s
view. This is important so that when the viewer focuses on far-away points in the
background, they don’t go walleyed (that is, the gazes of each eye diverge), which is
very uncomfortable.
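As a rough illustration of the similar-triangles relation described above, the camera separation can be computed from the viewing geometry. The function name and the numbers in the example (a 6.5 cm eye separation, a viewer 10 m from the screen, a subject 5 m from the cameras) are hypothetical and are not taken from the interview.

```python
def camera_interaxial(viewer_interocular, screen_distance, subject_distance):
    """Camera separation whose ratio to the subject distance equals the ratio
    of the viewer's eye separation to the screen distance (all in meters)."""
    return viewer_interocular * subject_distance / screen_distance

# Hypothetical example: 6.5 cm eyes, screen 10 m away, subject 5 m away.
separation_m = camera_interaxial(0.065, 10.0, 5.0)
```

For these assumed numbers the separation comes out to a little over three centimeters, the same order of magnitude as the figure quoted above.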
When your eyes see something that’s wrong in stereo, it sort of “buzzes,” an
uncomfortable feeling. I used it to my advantage in one shot in Transformers: Dark
of the Moon, and I think it’s an artistic tool people can use. There’s a shot where the
camera’s dollying in front of the headlights of a car and the lens flare’s very different
in the two eyes. The result is that it sort of hurts to look right into the headlights,
which makes sense!
Beier: In our terminology, objects that appear to lie on the plane of the movie screen
have zero disparity. We say that objects “beyond” the screen have positive disparities,
and objects “in front of” the screen have negative disparities. For a 2K image, we allow
a horizontal disparity range from about negative forty pixels to about positive twenty
pixels. The average person can resolve disparity differences of 0.2 pixels or so; that’s
the minimum stereo difference you can recognize. So our total stereo budget is about
60×5 = 300 depth levels.
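The arithmetic behind this stereo budget is simply the disparity range divided by the smallest resolvable disparity difference. The sketch below restates it; the function name is hypothetical.

```python
def depth_levels(min_disparity, max_disparity, resolution):
    """Number of distinguishable depth levels in a disparity range (pixels)."""
    return round((max_disparity - min_disparity) / resolution)

print(depth_levels(-40, 20, 0.2))  # 300
```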
Sometimes you can have an object with ridiculous negative disparity. That’s okay
if it’s going too fast for you to try to focus on it. For example, if it’s just particles going
past your head, you can sort of feel them going past your head, but you don’t try to
focus on them and it doesn’t become a problem. If it was a big ball going past your
head, that would hurt, because you’d try to focus on it and track it as it goes by.
For effects shots with big robots and spaceships, almost everything in the scene
has positive disparity. This makes sense; if an object’s in front of the screen, first, it
hurts to look at, and second, it necessarily looks smaller because it’s in the room with
us, so it can’t be bigger than that. If the object is beyond the screen, it can be as big
as you want it to be. In a shot in space, the stars are your interocular distance apart,
and appear to be at infinity.
RJR: How does the interocular distance change over the course of a shot?
Beier: There have always been people on the set called focus pullers. While the
director of photography or cameraman’s looking through the camera, the focus puller
is a second person who either has a hand on the camera or a remote control to set the
focus. Focus pulling is a big job and good focus pullers are well compensated. These
days, there’s another person called the convergence puller who has a similar remote
control and can adjust the convergence or interocular during the shot.
It turns out that this is very important in scenes where the camera is moving. For
example, in one shot in Transformers: Dark of the Moon we’re on the moon while
robots are emerging from its surface. We start out a long way away from everything,
and the interocular here is probably half a meter, since these robots are about ten
feet tall. Once we get close to them, the interocular has gone down to maybe one
centimeter. If we didn’t do that, then these robots would be so in your face that they’d
be hugely separated in negative disparity space out in front of the camera. Then, we
start to widen the interocular again to give some depth to the scene, otherwise it
would look totally flat. If you don’t change the interocular throughout a scene like
this, then either it always looks flat or it hurts in some places, and you don’t get a
sense of stereo throughout the shot.
RJR: Do you ever need to touch up the disparity field after it’s been acquired by the
stereo rig?
Beier: We only do that to fix problems, like if the rigs were badly misaligned. We
would never take a live-action element and move it further back. In a few cases, we’ll
move a CG element to line up with an actor’s eyeline so he seems to be looking at
it. It would be nice to be able to move stereo elements around after the fact — for
example, to bring people in the background forward a bit so they don’t all seem to
be the same depth — but that’s really hard to do since it requires stereo rotoscoping
and inpainting. That’s one advantage of 3D conversion.
5.10 NOTES AND EXTENSIONS
We generally assumed in this chapter that the dense correspondence field between
a pair of images reflects an underlying physical reality — that is, that each pair of
corresponding points arises from some single point in the scene. However, dense
correspondence doesn’t have to be physically meaningful to be useful. For example,
in some view synthesis applications, all that really matters is whether the synthesized
image is plausible, not whether it’s physically consistent with the underlying scene.
This is especially true in applications like video coding, where we often just want a
good prediction of what an intermediate image will look like.
We focused on dense correspondence in a generic sense, meaning that we assumed
no knowledge about the contents of the images. When we know the images come from
a certain class, then it would be advisable to use class-specific detection algorithms
to obtain better correspondence. For example, if we know the images are close-
up views of faces (e.g., for a morphing application), we could apply a customized
active appearance model (e.g., [180]) to immediately obtain a meaningful dense
correspondence map between them, matching eyes, noses, mouths, and so on.
Outside of visual effects, the medical imaging community is extremely interested
in algorithms for deformable image registration, which can be viewed as a type of
optical flow problem. Generally, the goal is to warp one image to the coordinate system
of a second, for example, to compare disease progression in images of the
same patient over time, or to compare similar images of different patients to create
an “atlas.” The algorithm proposed by Joshi and Miller [228] in Section 5.2.3 is one
example of this application. Holden [201] gives a review of deformable image registra-
tion techniques for medical image analysis. Medical image registration is often posed
using the framework of fluid flow; that is, the image pixels are treated as a viscous fluid
that deforms subject to the rules of continuum mechanics. The deformation field is
usually obtained by solving a partial differential equation, and the process may be
very slow to converge (see, e.g., [94]). A popular method that resembles an iterative,
multiscale optical flow algorithm was proposed by Thirion [488]. However, like the
methods in Section 5.2, methods for medical image registration are not designed to
handle occlusions or discontinuities in the flow fields, and could generally benefit
from the modern advances in optical flow and stereo correspondence discussed in
the rest of the chapter.
Recent research in both optical flow and stereo correspondence has substantially
benefited from high-quality ground-truthed data sets and benchmarks hosted at
Middlebury College. Reports by Baker et al. [27] (on optical flow) and Scharstein
and Szeliski [425] (on stereo) detail the data generation and testing methodol-
ogy that is now used as a worldwide benchmark for comparing new flow and
stereo algorithms. The high-quality datasets and constantly-updated benchmarks
for hundreds of algorithms are available at http://vision.middlebury.edu/flow/ and
http://vision.middlebury.edu/stereo/.
Barron et al. [31] proposed an earlier benchmark for optical flow, now superseded
by the Middlebury database but responsible for driving much earlier research in
the field. Sun et al. [475] investigated the effects of several simple refinements to the
classical Horn-Schunck algorithm (such as the choice of robust penalty function, the
type of image interpolation, and the use of a median filter in the warping step) to arrive
at a set of recommendations and best practices that result in simple, competitive
optical flow algorithms. Brown et al. [70] gave a good survey of advances in stereo
algorithms, although many of the competitive algorithms discussed in Section 5.5
were proposed after this publication. While we didn’t mention it in Section 5.3, the
graph-cut stereo methods in Section 5.5.2 can be easily extended to estimate discrete
optical flow. In this case, the label at each pixel is multi-valued (i.e., a flow vector
(u, v) instead of a disparity d).
Szeliski et al. [485] surveyed methods to efficiently minimize cost functions of
the form of Equation (5.50), including the α-expansion and loopy belief propagation
methods discussed in Section 5.5, as well as a promising variant of belief propagation
called tree-reweighted message passing [525]. They also provided efficient reference
implementations for the various algorithms at http://vision.middlebury.edu/MRF/.
Felzenszwalb and Zabih [139] recently gave a good survey of dynamic programming
and graph-based techniques, with common applications to computer vision.
So far, our discussion of dense correspondence has been motivated almost entirely
as a 2D-to-2D matching problem. However, the remainder of the book focuses on 3D
considerations, and we will see that disparity is often interpreted as an inverse depth
map. That is, objects that are closer to the pair of cameras have high disparities and
far-away objects have low disparities.22 Recovering the actual depths from a disparity
map requires further physical knowledge about the camera setup, which we discuss in
the next chapter. A high-quality optical flow field or disparity map can greatly improve
the results of other visual effects algorithms, such as matting, and is at the heart of
algorithms for converting monocular films to stereo in post-production. Conversely,
an independent estimate of depth (e.g., sparse measurements from a low-resolution
range sensor) can definitely improve the results of a dense correspondence algorithm
(see Yang et al. [564]).
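To make the inverse relationship concrete, the sketch below uses the standard relation for a rectified pair of identical, parallel pinhole cameras, depth = focal length × baseline / disparity. This camera model and the parameter names are assumptions made here for illustration; the full treatment of camera geometry is deferred to the next chapter.

```python
import numpy as np

def depth_from_disparity(disp, focal_px, baseline):
    """Depth for a rectified, parallel stereo pair (assumed camera model).

    disp: disparities in pixels; focal_px: focal length in pixels;
    baseline: camera separation (depth is returned in the same units).
    Zero or negative disparities are mapped to infinite depth.
    """
    disp = np.asarray(disp, dtype=float)
    depth = np.full(disp.shape, np.inf)
    pos = disp > 0
    depth[pos] = focal_px * baseline / disp[pos]
    return depth
```

Halving the disparity doubles the estimated depth, which is why a fixed disparity error matters much more for far-away objects than for nearby ones.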
22 As a simple experiment, hold a finger close to your face and alternate winking your eyes. The
relative position of your finger changes substantially compared to the position of a fixed point in
the background. If your finger is centered in front of your nose, you will also observe the double
nail illusion — that is, a violation of the monotonicity constraint.
5.11 Suppose that we assume the optical flow field (u, v) is constant within a
window centered at a pixel. Show that summing the Horn-Schunck Euler-
Lagrange equations (5.22) over this window is equivalent to solving the
Lucas-Kanade equations (5.25) using a box filter corresponding to the
window.
5.12 Show that replacing the anisotropic diffusion tensor D in Equation (5.29)
with the identity matrix gives the original Horn-Schunck smoothness term.
5.13 Interpret the anisotropic diffusion tensor D(g ) in Equation (5.30) when (a)
the magnitude of g is 0 and (b) the magnitude of g is much larger than β.
5.14 Show that the Lorentzian robust cost function in Table 5.1 is redescending.
Is the generalized Charbonnier cost function redescending for any positive
value of β?
5.15 Explain why there are seven degrees of freedom in the nine entries of the
fundamental matrix.
5.16 Show why the fundamental matrix constraint in Equation (5.34) leads to the
row of the linear system given by Equation (5.40).
5.17 Suppose the fundamental matrix F relates any pair of correspondences (x, y)
in I1 and (x′, y′) in I2 . Determine the fundamental matrix F̂ that relates the
correspondences in the two images after they have been transformed by
similarity transformations T and T′, respectively.
5.18 The basis of the RANSAC method [142] for estimating a projective transfor-
mation or fundamental matrix in the presence of outliers is to repeatedly
sample sets of points that are minimally sufficient to estimate the parame-
ters, until we are tolerably certain that at least one sample set contains no
outliers.
a) Let the (independent) probability of any point being an outlier be ε, and
suppose we want to determine the number of trials N such that, with
probability greater than P, at least one random sampling of k points
contains no outliers. Show that
$$N = \frac{\log(1 - P)}{\log\left(1 - (1 - \varepsilon)^k\right)} \tag{5.66}$$
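To get a feel for Equation (5.66), the short sketch below evaluates it numerically. The function name and the rounding up to the next integer are our own choices for illustration; the inputs are assumed to satisfy 0 < P < 1 and 0 ≤ ε < 1.

```python
import math

def ransac_trials(p_success, outlier_ratio, sample_size):
    """Smallest integer N that satisfies Equation (5.66).

    p_success     -- desired probability P of drawing one outlier-free sample
    outlier_ratio -- probability epsilon that any single point is an outlier
    sample_size   -- number k of points in a minimal sample
    """
    if not (0.0 < p_success < 1.0 and 0.0 <= outlier_ratio < 1.0):
        raise ValueError("need 0 < P < 1 and 0 <= epsilon < 1")
    p_clean = (1.0 - outlier_ratio) ** sample_size  # one sample has no outliers
    if p_clean == 1.0:
        return 1  # no outliers at all: a single sample is enough
    return math.ceil(math.log(1.0 - p_success) / math.log(1.0 - p_clean))

# Half the points are outliers and we want 99% confidence.
print(ransac_trials(0.99, 0.5, 4))  # minimal sample of 4 points
print(ransac_trials(0.99, 0.5, 8))  # minimal sample of 8 points
```

The number of trials grows quickly with the sample size k, which is one reason minimal sample sets are preferred.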
Figure 5.32. The square appears to move five pixels to the right between the left and right
images.
5.22 Which generally has a larger disparity: a small object close to the camera, or
a large object far from the camera?
5.23 Consider the rows of pixels A, B, and C in Figure 5.33.
a) Compute the Birchfield-Tomasi measure between rows A and B at the
highlighted pixels, assuming a 1 × 1 window.
b) Compute the Birchfield-Tomasi measure between rows A and C at the
highlighted pixels, assuming a 1 × 1 window.
c) Interpret the results. Why do they show that the Birchfield-Tomasi
measure is insensitive to differences in sampling of up to half a pixel?
Figure 5.33. Three rows of pixels.
5.26 Explain the effects of increasing K in the Potts model, and β in the intensity-
adaptive Potts model.
5.27 Provide a counterexample to show that the truncated quadratic function is
not a metric (in particular, that it does not satisfy the triangle inequality).
5.28 The grid graph structure common to many computer vision problems
(including stereo) is bipartite. That is, we can partition the vertices V into
disjoint sets V1 and V2 such that V = V1 ∪ V2 and each edge in E connects a
vertex in V1 to a vertex in V2 .
a) Determine the sets V1 and V2 when E is the usual set of all 4-neighbors.
b) Show how the belief propagation algorithm in this case can be sped up
by a factor of two, since only half the messages need to be computed in
each iteration (see [138]).
5.29 Given a segmented region of roughly constant-intensity pixels, determine
the linear least-squares problem to estimate the disparity plane parameters
ai , bi , ci in Equation (5.56).
5.30 Compute the parallax (in the sense of Equation (5.59)) for the pair of
correspondences {(−2, 5), (−9, 7)} and {(−1, 4), (−7, 6)}.
5.31 Explicitly derive the position of the striped dot in Figure 5.20. That is, show
how the quadratic’s parameters are obtained using a linear-least-squares
problem and determine its minimizer.
5.32 Explicitly determine the affine transformation between image planes
induced by a single line segment correspondence {(p, q), (p′, q′)} in field
morphing.
5.33 Show that field morphing produces a different dense correspondence field
when the order of the input images is switched. That is, create a simple coun-
terexample using two pairs of control lines in which the forward mapping
is not the inverse of the backward mapping.
5.34 Describe how to modify the cross-dissolve equation (5.62) so that the top
half of an image morphs more quickly to its destination than the bottom
half.
5.35 A simple way to show that morphing is not physically consistent is to con-
sider an image of a planar object; thus the dense correspondence is defined
by a projective transformation H . Show that the weighted average of corre-
spondences in the first and second images is not a projective transformation
of the first image plane, and thus that intermediate images are not physically
consistent.
5.36 Determine a two-camera configuration in which view morphing cannot be
applied (i.e., when does the rectification process fail?)
6 Matchmoving
1 Scientists who study geodesy, or the measurement of different aspects of the earth.
6.1 Feature Tracking for Matchmoving
The first step in matchmoving is the detection and tracking of features throughout
the image sequence to be processed. Remember that features are regions of an image
that can be reliably located in other images of the same environment, and that feature
matches ideally result from the image projections of the same 3D scene point.
We can apply any of the detection and matching methods from Chapter 4. If
the images are generated by video frames close together in time and space (which
is usually the case), we typically select single-scale Harris corners (since a square of
pixels is likely to remain a square in an image taken a fraction of a second later) and use
a fast matching algorithm like the KLT tracker described in Section 4.1.1.2 and [442].
If the images are taken further apart (which often occurs in hierarchical methods for
long sequences, discussed in Section 6.5.1.2), then an invariant detector/descriptor
like SIFT will give more reliable matches (but will also be slower). On a closed set
or green-screen environment, artificial markers (e.g., gaffer-tape crosses or more
advanced markers as discussed in Section 4.5) can be added to the scene to introduce
reliably trackable features. Figure 6.1 illustrates tracked features in these various
scenarios.
High-quality matchmoving relies on the assumption that matched image fea-
tures correspond to the same 3D scene point. Therefore, estimating the fundamental
matrix (Section 5.4.2) and automatically removing matches inconsistent with the
underlying epipolar geometry is very important. Regardless of how the features are
obtained, they should generally be visually inspected and edited for correctness, since
a bad feature track can throw off the rest of the matchmoving process. Furthermore,
it’s necessary to remove features that may appear mathematically acceptable (i.e.,
low tracking error and consistent with the epipolar geometry) but nonetheless can
cause problems for matchmoving. For example, Figure 6.2 shows a mathematically
reasonable feature match introduced by a coincidence of perspective. A “corner” has
been detected at the visual intersection of two different surfaces: one on the
foreground building and one on the background building. As the camera moves, so
does the apparent “corner,” but it does not correspond to a fixed 3D point.
These kinds of false corners need to be removed before matchmoving.
Figure 6.1. Tracking features for matchmoving. Top row: When the images are close together,
single-scale Harris corners can be easily detected and tracked. Middle row: When the images
are further apart, the SIFT detector/descriptor can be used for wider baseline matching. Bottom
row: In a green-screen environment, gaffer-tape crosses and artificial markers can be placed on
surfaces to aid feature detection and tracking.
Figure 6.2. A false corner introduced by a coincidence of perspective between two surfaces:
one in the foreground and one in the background. From the positions of the chimney and the
corner of the background wall in the two images, we can see that the white feature points don’t
correspond to the same 3D location, even though the pixel neighborhoods are almost identical.
In addition, an implicit assumption about the input video in matchmoving is
that we’re estimating the camera motion with respect to a stationary background.
Even though a feature detector may find high-quality matches on foreground objects,
these should not be used for matchmoving if they undergo independent motion
compared to the larger environment. For example, features on pedestrians and cars
moving down a street should be removed (even if they generate high-quality tracks),
leaving only features on the stationary road and buildings. Such situations can also
be detected using a robust estimate of the epipolar geometry, but some manual effort
may be required to ensure that the right group of matches is used (for example, a large
moving foreground object might generate many feature tracks and be mistaken for
the “scene”). Finally, features only visible in a few frames (short tracks) are generally
not very useful for matchmoving and can be deleted.
While the previous paragraphs addressed the removal of bad or confusing features,
we can also add new features to the image sequence in several ways. First, the feature
detector should always be generating new matches — especially as new parts of the
scene come into view at the edges of the image as the camera moves. In natural,
6.2 Camera Parameters and Image Formation
[Figure 6.3: panels (a) and (b), each labeled with the image plane, the camera center, and the scene.]
Figure 6.3. The pinhole model of perspective projection. (a) Light rays pass through the cam-
era center and impinge on the image plane (a piece of film or a CCD). (b) For mathematical
convenience, we model the image plane as lying between the camera center and the scene.
from the image plane to the camera center is called the focal length, denoted by f .
From Figure 6.3a we can see that the film/CCD image is “upside down” with respect
to the scene because of the way light rays hit the image plane. In computer vision,
we usually make the mathematically convenient assumption that the image plane
actually lies between the camera center and the scene. As we can see from Figure 6.3b,
this results in the same image, but one that is already “right-side up.”
[Figure 6.4: panels (a) and (b), showing the camera coordinate axes X, Y, Z, a scene point (Xc, Yc, Zc), and the image plane at distance f with image coordinates (x̃, ỹ).]
Figure 6.4. (a) The camera coordinate system and the image coordinate system. (b) Side view
of (a), showing that the projections of scene points onto the image plane can be computed by
considering similar triangles.
Also, the origin of the physical image plane is in its center, while we usually index
the pixels of an image from the upper left-hand corner. Finally, the physical sensor
elements of a CCD may not actually be square, while in digital image processing we
assume square pixels.2 For these reasons, we need to transform the physical image
coordinates in Equation (6.1) to determine the pixel coordinates in the digital images
we actually obtain, namely:
$$x = \frac{\tilde{x}}{d_x} + x_0, \qquad y = \frac{\tilde{y}}{d_y} + y_0 \tag{6.2}$$
Here, dx and dy are the width and height of a pixel in the physical units used to
measure the world (e.g., meters). The quantity dx /dy is called the aspect ratio of a
pixel. The point (x0 , y0 ) is the location, in pixel units, corresponding to the ray from the
camera center that is perpendicular to the image plane, called the principal point.
The principal point is usually very near the center of the image, but it may not be
exactly centered due to camera imperfections.
The parameters in Equation (6.1) and Equation (6.2) can be neatly encapsulated
by the camera calibration matrix K :
$$K = \begin{bmatrix} \alpha_x & 0 & x_0 \\ 0 & \alpha_y & y_0 \\ 0 & 0 & 1 \end{bmatrix} \tag{6.3}$$
where αx = f /dx and αy = f /dy represent the focal length in units of x and y pixels.
The four parameters (αx , αy , x0 , y0 ) are called the internal or intrinsic parameters of
the camera, since they define the operation of the camera independently of where it
is placed in an environment.3
We can see that the camera calibration matrix K relates the camera coordinates of
a scene point (Xc , Yc , Zc ) to the homogeneous coordinates of the corresponding pixel
(x, y) via the simple equation
$$\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \sim K \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} \tag{6.4}$$
The symbol ∼ in Equation (6.4) means that the two vectors are equivalent up to
a scalar multiple; that is, to obtain actual pixel coordinates on the left side of
Equation (6.4), we need to divide the vector on the right side of Equation (6.4)
by its third element (corresponding to the perspective projection operation in
Equation (6.1)). For example, suppose a camera described by K = diag(10, 10, 1)
2 Most digital cameras today use physically square sensor elements, but this was not always the case.
3 It is theoretically possible that the sensor elements are not physically rectangles but actually par-
allelograms, leading to a fifth internal parameter called the skew that appears in the (1,2) element
of K . We assume that the skew of all cameras considered here is exactly 0, which is realistic for
virtually all modern cameras.
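The projection in Equation (6.4) can be sketched in a few lines of NumPy. The function names are ours, and the scene point (2, 4, 5) is a hypothetical value chosen only for illustration; it is not taken from the text.

```python
import numpy as np

def calibration_matrix(f, dx, dy, x0, y0):
    """Build K as in Equation (6.3), with alpha_x = f/dx and alpha_y = f/dy."""
    return np.array([[f / dx, 0.0, x0],
                     [0.0, f / dy, y0],
                     [0.0, 0.0, 1.0]])

def project_camera_point(K, point_c):
    """Equation (6.4): multiply by K, then divide by the third element."""
    h = K @ np.asarray(point_c, dtype=float)
    return h[:2] / h[2]

# A camera with K = diag(10, 10, 1) viewing a hypothetical point (2, 4, 5)
# given in camera coordinates.
K = np.diag([10.0, 10.0, 1.0])
print(project_camera_point(K, [2.0, 4.0, 5.0]))  # pixel (4, 8)
```

Here K times the point gives (20, 40, 5), and dividing by the third element gives the pixel (4, 8).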
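$$\begin{bmatrix} \tilde{x}_{\text{dist}} \\ \tilde{y}_{\text{dist}} \end{bmatrix} = \left(1 + \kappa_1 (\tilde{x}^2 + \tilde{y}^2) + \kappa_2 (\tilde{x}^2 + \tilde{y}^2)^2\right) \begin{bmatrix} \tilde{x} \\ \tilde{y} \end{bmatrix} \tag{6.5}$$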
where the κ’s are coefficients that control the amount of distortion.4 Then the
affine transformation in Equation (6.2) is applied to the distorted parameters. That
is, the observed (distorted) pixel coordinates (xdist , ydist ) are related to the correct
Figure 6.5. An example of lens distortion. (a) An ideal image. (b) Barrel distortion observed using
a camera with a wide-angle lens. Note how the straight edges of the checkerboard, monitor, desk,
and light are bowed outward as a function of distance from the image center.
4 We can add coefficients on higher powers of $(\tilde{x}^2 + \tilde{y}^2)$ for a more accurate model, but one or two
terms are often sufficient for a high-quality camera. This formulation also assumes the center of
distortion is the principal point, which is usually sufficient.
This suggests a simple method for estimating the lens distortion coefficients if
we know the other internal parameters. We obtain a set of ideal points (x, y) and
corresponding observed points (xdist , ydist ); each one generates two equations in the
two unknown κ’s:
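$$\begin{bmatrix} (x - x_0)(\tilde{x}^2 + \tilde{y}^2) & (x - x_0)(\tilde{x}^2 + \tilde{y}^2)^2 \\ (y - y_0)(\tilde{x}^2 + \tilde{y}^2) & (y - y_0)(\tilde{x}^2 + \tilde{y}^2)^2 \end{bmatrix} \begin{bmatrix} \kappa_1 \\ \kappa_2 \end{bmatrix} = \begin{bmatrix} x_{\text{dist}} - x \\ y_{\text{dist}} - y \end{bmatrix} \tag{6.7}$$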
When we have a large number of ideal points and corresponding distorted obser-
vations, the simultaneous equations given by Equation (6.7) can be solved as a linear
least-squares problem. The lens distortion parameters for the image in Figure 6.5
were estimated to be κ1 = −0.39, κ2 = 0.18.
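The least-squares estimate can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce the numbers above: the function names, the synthetic grid of ideal points, the principal point (320, 240), and the pixel size 1/500 are all hypothetical, and the coefficients reported for Figure 6.5 are reused only as test inputs.

```python
import numpy as np

def distort(ideal_px, x0, y0, dx, dy, k1, k2):
    """Forward model: undo Eq. (6.2), apply Eq. (6.5), then reapply Eq. (6.2)."""
    ideal_px = np.asarray(ideal_px, dtype=float)
    off = ideal_px - [x0, y0]                            # (x - x0, y - y0)
    r2 = (off[:, 0] * dx) ** 2 + (off[:, 1] * dy) ** 2   # x~^2 + y~^2
    scale = 1.0 + k1 * r2 + k2 * r2 ** 2
    return off * scale[:, None] + [x0, y0]

def estimate_distortion(ideal_px, observed_px, x0, y0, dx, dy):
    """Stack Eq. (6.7) over all points and solve for (k1, k2) by least squares."""
    ideal_px = np.asarray(ideal_px, dtype=float)
    observed_px = np.asarray(observed_px, dtype=float)
    off = ideal_px - [x0, y0]
    r2 = (off[:, 0] * dx) ** 2 + (off[:, 1] * dy) ** 2
    r2 = np.repeat(r2, 2)        # one copy for the x row, one for the y row
    o = off.reshape(-1)          # x1 - x0, y1 - y0, x2 - x0, y2 - y0, ...
    A = np.column_stack([o * r2, o * r2 ** 2])
    b = (observed_px - ideal_px).reshape(-1)
    kappa, *_ = np.linalg.lstsq(A, b, rcond=None)
    return kappa

# Synthetic check: distort a grid of ideal points, then recover the coefficients.
xs, ys = np.meshgrid(np.linspace(0, 640, 5), np.linspace(0, 480, 5))
ideal = np.column_stack([xs.ravel(), ys.ravel()])
observed = distort(ideal, 320, 240, 1 / 500, 1 / 500, -0.39, 0.18)
print(estimate_distortion(ideal, observed, 320, 240, 1 / 500, 1 / 500))
```

With noise-free synthetic data the recovered coefficients match the ones used to generate the observations; with real measurements the least-squares solution averages out the noise.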
In Section 6.3.2 we’ll discuss methods for estimating the other internal parameters
for a camera. In the remainder of the chapter, we assume that the images have already
been compensated for any lens distortion, so that the image formation process is well
modeled by Equation (6.4).
$$P = K\,[R \mid t] \tag{6.9}$$
Here, the | notation denotes the horizontal concatenation of matrices that have the
same number of rows. The camera matrix P completely determines how a cam-
era obtains its image of a point in the world coordinate system. It operates on the
homogeneous coordinate of a 3D point in the world coordinate system by simple
multiplication:
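$$\begin{aligned} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} &\sim K \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} \\ &= K \left( R \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + t \right) \\ &= P \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \end{aligned} \tag{6.10}$$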
From Equation (6.10), we can see that the camera matrix P is a homogeneous quantity,
since any multiple of P will produce the same projection relationship. Thus, there
are at most eleven degrees of freedom in the twelve entries of P (in fact, using our
standard assumptions there are ten degrees of freedom: four for the internal param-
eters and six for the external parameters). We commonly abbreviate the relationship
in Equation (6.10) as
x ∼ PX (6.11)
where the 3 × 1 vector x and the 4 × 1 vector X are the homogeneous coordinates of
corresponding image and world points, respectively.
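As a quick illustration of Equations (6.9) and (6.11), the sketch below builds a camera matrix and projects a world point. The function names and all numerical values (K, R, t, and the world point) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def camera_matrix(K, R, t):
    """Equation (6.9): P = K [R | t], a 3x4 matrix."""
    return K @ np.hstack([R, np.reshape(t, (3, 1))])

def project(P, X):
    """Equation (6.11): x ~ P X, followed by division by the third element."""
    h = P @ np.append(np.asarray(X, dtype=float), 1.0)
    return h[:2] / h[2]

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.array([[0.0, -1.0, 0.0],      # 90-degree rotation about the Z axis
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.0, 0.0, 5.0])

P = camera_matrix(K, R, t)
X = [1.0, 2.0, 5.0]
print(project(P, X))        # pixel (220, 290)
print(project(3.0 * P, X))  # same pixel: P is only defined up to scale
```

The second call shows the homogeneity of P directly: scaling the matrix leaves the projected pixel unchanged.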
Therefore, matchmoving is equivalent to determining the camera matrix P corre-
sponding to each given image. The following sections discuss different circumstances
of this general estimation problem.
6.3 Single-Camera Calibration
In this section, we discuss the estimation of the parameters of a single camera that is
fixed in position. We’ll consider two scenarios. In the first, the camera takes a single
image of an environment that contains 3D landmarks with known locations in world
coordinates. For example, these landmark points may result from an accurate survey
of a movie set acquired using a range sensor (see Section 8.1). In this case, we’re given
both the known 3D world coordinates and corresponding 2D image coordinates, and
it’s straightforward to estimate the camera matrix using a process called resectioning.
In Section 6.3.1 we describe resectioning and show how the internal and external
parameters of the camera can be recovered from the camera matrix.
The second problem we consider is the estimation of the unknown internal param-
eters of a camera from several images of a plane — for example, a checkerboard
pattern with known dimensions that is shown to the stationary camera in several
6.3.1 Resectioning
Let’s assume we’re given a set of 3D points with known world coordinates
{(X1 , Y1 , Z1 ), . . . , (Xn , Yn , Zn )} and their corresponding pixel locations in an image
{(x1 , y1 ), . . . , (xn , yn )}. As mentioned earlier, these known 3D locations typically arise
from some type of external survey of the environment. The corresponding 2D loca-
tions may be hand-picked, or possibly automatically located (e.g., the points could
correspond to the centers of unique ARTag markers [140] affixed to the walls of a set).
We assume that the 3D points don’t all lie on the same plane and that six or more
correspondences are available.
The relationship in Equation (6.11) between a 3D point (Xi , Yi , Zi ), the corre-
sponding 2D pixel (xi , yi ), and the twelve elements of the camera matrix P is:
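$$\begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix} \sim \begin{bmatrix} P_{11} & P_{12} & P_{13} & P_{14} \\ P_{21} & P_{22} & P_{23} & P_{24} \\ P_{31} & P_{32} & P_{33} & P_{34} \end{bmatrix} \begin{bmatrix} X_i \\ Y_i \\ Z_i \\ 1 \end{bmatrix} \tag{6.12}$$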
That is, the vectors on the left-hand side and right-hand side of Equation (6.12)
are scalar multiples of each other; thus, their cross-product is zero. This observation
leads to two linearly independent equations in the elements of P (see Problem 6.5):
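$$A_i \, [P_{11} \; P_{12} \; P_{13} \; P_{14} \; P_{21} \; P_{22} \; P_{23} \; P_{24} \; P_{31} \; P_{32} \; P_{33} \; P_{34}]^\top = 0 \tag{6.13}$$
where
$$A_i = \begin{bmatrix} 0 & 0 & 0 & 0 & X_i & Y_i & Z_i & 1 & -y_i X_i & -y_i Y_i & -y_i Z_i & -y_i \\ X_i & Y_i & Z_i & 1 & 0 & 0 & 0 & 0 & -x_i X_i & -x_i Y_i & -x_i Z_i & -x_i \end{bmatrix} \tag{6.14}$$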
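The rows in Equation (6.14) can be stacked over all correspondences and solved as a homogeneous linear system. The sketch below assumes one standard way of doing this, taking the right singular vector associated with the smallest singular value; the function name is ours, and the coordinate normalization that is usually recommended for noisy data is omitted for brevity.

```python
import numpy as np

def resection(world_pts, image_pts):
    """Estimate the 3x4 camera matrix P (up to scale) from n >= 6 matches.

    world_pts -- n x 3 array of known 3D points (not all on one plane)
    image_pts -- n x 2 array of the corresponding pixel locations
    """
    rows = []
    for (X, Y, Z), (x, y) in zip(np.asarray(world_pts, dtype=float),
                                 np.asarray(image_pts, dtype=float)):
        Xh = [X, Y, Z, 1.0]
        # The two rows of A_i from Equation (6.14).
        rows.append([0.0] * 4 + Xh + [-y * c for c in Xh])
        rows.append(Xh + [0.0] * 4 + [-x * c for c in Xh])
    A = np.array(rows)
    # The unit vector minimizing ||A p|| is the last right singular vector.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)
```

Because P is only determined up to scale, a recovered matrix should be compared with a reference only after both have been normalized, for example by dividing each by its (3,4) element when that element is nonzero.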
5 However, we slightly abuse this terminology in this chapter, using “calibration” to refer to the
process of estimating both internal and external parameters.
These can be easily extracted by considering Equation (6.9). If we denote M as the 3×3
matrix on the left side of P, we can see that M = KR; that is, an upper triangular matrix
with positive diagonal elements multiplied by a rotation matrix. From linear algebra,
we know this factorization is unique and can be easily computed, for example, using
the Gram-Schmidt process [173].6 Once we have factored M = KR, we obtain t by
multiplying $K^{-1}$ by the last column of P.
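A minimal sketch of this factorization is given below. Instead of running Gram-Schmidt by hand, it obtains the RQ decomposition from NumPy's QR routine applied to a row-reversed, transposed copy of M; this substitution, the function name, and the final normalization of K so that its (3,3) element equals 1 are our own choices.

```python
import numpy as np

def decompose_camera_matrix(P):
    """Split P ~ K [R | t] into K (upper triangular), R (rotation), and t."""
    P = np.asarray(P, dtype=float)
    if np.linalg.det(P[:, :3]) < 0:
        P = -P                           # P is homogeneous, so its sign is free
    M, p4 = P[:, :3], P[:, 3]
    J = np.fliplr(np.eye(3))             # exchange matrix: J = J^T = J^{-1}
    Q, U = np.linalg.qr((J @ M).T)       # (J M)^T = Q U
    K = J @ U.T @ J                      # upper triangular factor
    R = J @ Q.T                          # orthogonal factor, so M = K R
    D = np.diag(np.sign(np.diag(K)))     # make the diagonal of K positive
    K, R = K @ D, D @ R                  # D D = I, so the product K R is unchanged
    t = np.linalg.solve(K, p4)           # t = K^{-1} times the last column of P
    return K / K[2, 2], R, t
```

Flipping the sign of P when det(M) is negative ensures that the orthogonal factor has determinant +1, that is, that it is a proper rotation.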
Here, we’ve denoted the columns of the rotation matrix as r1 , r2 , r3 in Equation (6.16),
and denoted H in Equation (6.18) to be the 3 × 3 matrix
H ∼ K [r1 r2 t] (6.19)
6 The key linear algebra concept is the RQ decomposition, which is a little confusing in this context,
since the R stands for an upper (right) triangular matrix and the Q stands for an orthogonal (rotation)
matrix.
Figure 6.6. Example images used for plane-based internal parameter estimation, with corre-
sponding features at the square corners automatically detected and matched.
In this form, we can see that a point on the planar surface, specified by its world
coordinates (X, Y), is related to the image coordinate (x, y) through a projective
transformation H. This stands to reason, since in Section 5.1
we noted that images of a planar surface are related by projective transformations.
For any given position of the camera, we can estimate the projective transforma-
tion relating the world planar surface to the image plane by extracting features in each
picture of the plane and matching them to world coordinates on the planar surface.7
The actual physical coordinates of the world points aren’t important as long as the rel-
ative distances between them are correct. For example, for a checkerboard of squares,
we can define the corners of the upper left square to be (0, 0), (0, 1), (1, 0), (1, 1),
and so forth.
At this point, we’ve estimated a projective transformation Hi for every view of the
planar calibration pattern. Let’s see how these projective transformations will help
us estimate the internal parameters. Rearranging Equation (6.19), we have:
$$\begin{aligned} [\,r_1^i \;\; r_2^i \;\; t^i\,] &= \lambda_i K^{-1} H_i \\ &= \lambda_i K^{-1} [\,h_1^i \;\; h_2^i \;\; h_3^i\,] \end{aligned} \tag{6.20}$$
where we’ve denoted the columns of Hi as $h_1^i, h_2^i, h_3^i$. We also introduced a scale factor
λi to account for the ∼ operation. The parameters $r_1^i, r_2^i, t^i$ are columns of the rotation
matrix and the translation vector corresponding to the $i$th position of the camera.
Recall that the camera calibration matrix K is fixed for all views.
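We know that in a rotation matrix, each column vector is unit norm and that the
columns are orthogonal. That is:
$$(r_1^i)^\top r_2^i = 0, \qquad (r_1^i)^\top r_1^i = (r_2^i)^\top r_2^i = 1 \tag{6.21}$$
From Equation (6.20), these constraints turn into constraints on the columns of Hi :
$$(h_1^i)^\top (K K^\top)^{-1} h_2^i = 0, \qquad (h_1^i)^\top (K K^\top)^{-1} h_1^i = (h_2^i)^\top (K K^\top)^{-1} h_2^i \tag{6.22}$$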
We see that Equation (6.22) directly relates the projective transformation for each
view to the internal parameters of the camera. If we define the special 3 × 3 symmetric
matrix
$$\omega = (K K^\top)^{-1} \tag{6.23}$$
we can see that Equation (6.22) is linear in the elements of ω.8 Since we know the
form of the camera calibration matrix in terms of the internal parameters from
Equation (6.3), we can verify that
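$$\omega = \begin{bmatrix} \dfrac{1}{\alpha_x^2} & 0 & -\dfrac{x_0}{\alpha_x^2} \\ 0 & \dfrac{1}{\alpha_y^2} & -\dfrac{y_0}{\alpha_y^2} \\ -\dfrac{x_0}{\alpha_x^2} & -\dfrac{y_0}{\alpha_y^2} & \dfrac{x_0^2}{\alpha_x^2} + \dfrac{y_0^2}{\alpha_y^2} + 1 \end{bmatrix} \tag{6.24}$$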
That is, there are five unique parameters (ω11 , ω13 , ω22 , ω23 , ω33 ) in ω, and each
projective transformation Hi puts two linear constraints on them via Equation (6.22).
Thus, we need at least three images of the plane in different positions to estimate ω.9
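Since these linear equations are of the form
$$A_i \, [\omega_{11} \; \omega_{13} \; \omega_{22} \; \omega_{23} \; \omega_{33}]^\top = 0 \tag{6.25}$$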
we can again use the Direct Linear Transform to estimate the values of ω up to a scalar
multiple. Finally, we can recover the actual internal parameters by taking ratios of
elements of ω, namely:10
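$$\begin{aligned} x_0 &= -\frac{\omega_{13}}{\omega_{11}} \\ y_0 &= -\frac{\omega_{23}}{\omega_{22}} \\ \alpha_y &= \left( \frac{\omega_{11}\omega_{22}\omega_{33} - \omega_{22}\omega_{13}^2 - \omega_{11}\omega_{23}^2}{\omega_{11}\omega_{22}^2} \right)^{1/2} \\ \alpha_x &= \alpha_y \left( \frac{\omega_{22}}{\omega_{11}} \right)^{1/2} \end{aligned} \tag{6.26}$$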
This algorithm was proposed by Zhang [575] and independently by Sturm and
Maybank [473], who also analyzed planar configurations where the method fails.
The solution obtained from the linear estimate can be used as the starting point for
subsequent nonlinear estimation of the camera parameters with respect to image
reprojection error (see Section 6.5.3). Once the camera calibration matrix K is determined,
we can recover the parameters $r_1^i, r_2^i, r_3^i, t^i$ corresponding to the $i$th position
of the camera (see Problem 6.9). If the images of the plane suffer from lens distortion,
we can alternate the estimation of the internal parameters using Zhang’s algorithm
with the estimation of the lens distortion parameters κ using Equation (6.7).
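The linear stage of this plane-based method can be sketched as follows, assuming the homographies Hi have already been estimated from at least three views and that lens distortion is negligible. The function names are ours. The column rescaling before the SVD is a numerical convenience we add because the constraint coefficients differ by several orders of magnitude; it is not part of the derivation above.

```python
import numpy as np

def _constraint_row(a, b):
    """Coefficients of (w11, w13, w22, w23, w33) in a^T omega b, with w12 = 0."""
    return np.array([a[0] * b[0],
                     a[0] * b[2] + a[2] * b[0],
                     a[1] * b[1],
                     a[1] * b[2] + a[2] * b[1],
                     a[2] * b[2]])

def intrinsics_from_homographies(homographies):
    """Recover K from plane-to-image homographies via Eqs. (6.22)-(6.26)."""
    rows = []
    for H in homographies:
        H = np.asarray(H, dtype=float)
        h1, h2 = H[:, 0], H[:, 1]
        rows.append(_constraint_row(h1, h2))          # h1^T omega h2 = 0
        rows.append(_constraint_row(h1, h1)
                    - _constraint_row(h2, h2))        # equal "norms" of h1, h2
    A = np.array(rows)
    col_scale = np.linalg.norm(A, axis=0)             # balance the columns
    _, _, Vt = np.linalg.svd(A / col_scale)
    w11, w13, w22, w23, w33 = Vt[-1] / col_scale      # omega up to scale
    # Equation (6.26): ratios of elements, so the unknown scale cancels.
    x0 = -w13 / w11
    y0 = -w23 / w22
    alpha_y = np.sqrt((w11 * w22 * w33 - w22 * w13 ** 2 - w11 * w23 ** 2)
                      / (w11 * w22 ** 2))
    alpha_x = alpha_y * np.sqrt(w22 / w11)
    return np.array([[alpha_x, 0.0, x0],
                     [0.0, alpha_y, y0],
                     [0.0, 0.0, 1.0]])
```

Each homography may carry an arbitrary nonzero scale, since both constraints in Equation (6.22) are homogeneous in the columns of Hi.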
Figure 6.7 shows an example camera calibration result using this approach with
nine images of a checkerboard (four of these images are shown in Figure 6.6). In this
example, the focal length was computed as 552 pixels (corresponding to 3.1 mm for
this camera), the principal point was found to be at the center of the image, and the
pixels were found to be square (dy /dx = 1).
8 For reasons we won’t go into here, ω is also called the image of the absolute conic. See Section 6.8.
9 These positions must be non-coplanar to avoid linear dependence of the equations.
10 These equations can be viewed as the explicit solution of the Cholesky decomposition $\omega^{-1} = K K^\top$.
[Figure 6.7: two 3D plots, panels (a) and (b); the surviving labels are the title “Extrinsic parameters (world-centered)”, the camera axes Oc, Xc, Yc, Zc, and the world axes Xworld, Yworld, Zworld.]
Figure 6.7. Results of camera calibration. (a) The estimated positions of the calibration pattern
for the views in Figure 6.6, assuming the camera is in fixed position. (b) Alternately, we can
think of the plane being in fixed position and show the estimated positions and orientations of
the cameras. The units of the coordinate systems are mm.
6.4 Stereo Rig Calibration
Next, we discuss the calibration of two rigidly mounted cameras, also known as a
stereo rig. That is, the cameras maintain the same relative orientation and position
with respect to each other by being secured together in a fixed housing. Such cameras
have recently become popular for filming 3D movies, as described in Section 5.9.
The key estimation problem is the determination of the internal parameters of each
camera and the relative rotation matrix and translation vector relating the pair. Once
we know this information, we can determine the 3D location of a point from its 2D
projections in each image using triangulation.
As we discussed in Section 5.4.1, the relationship between correspondences for a
pair of cameras is entirely encapsulated by the fundamental matrix, which defines
the epipolar geometry. We can immediately see a problem based on counting degrees
of freedom: we have fourteen degrees of freedom for the cameras (four for each
camera calibration matrix and six for the relative rotation and translation), but the
fundamental matrix only has seven degrees of freedom. Even if we assume that both
cameras have exactly the same (unknown) calibration matrix K , we still have extra
degrees of freedom. Therefore, while the fundamental matrix for an image pair is
unique (up to scale), there are many (substantially) different camera configurations
that result in the same fundamental matrix. As a result, we will inevitably face ambiguities
in the camera matrices unless we obtain additional information about the
cameras or the environment.
Now that we’ve defined the process of image formation, we can determine the
fundamental matrix F relating the two cameras in terms of the camera parameters
K, K′, R, and t. Since the two camera centers are at $[0, 0, 0]^\top$ and $-R^\top t$, respectively,
we can compute the epipoles by directly applying P and P′ to these points:11
$$e \sim K R^\top t, \qquad e' \sim K' t \tag{6.28}$$
We can show (see Problem 6.10) that the fundamental matrix for the image pair is
given by
$$F = [K' t]_\times K' R K^{-1} \tag{6.29}$$
where we used the notation [·]× defined in Equation (5.39). This proves a claim we
made earlier in Equation (5.38), stating that F = [e′]× M for some rank-3 matrix M.
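Equation (6.29) is easy to check numerically. The sketch below assumes the camera pair has the form P = K[I | 0] and P′ = K′[R | t], which is consistent with the camera centers quoted above; the function names and all numerical values are hypothetical.

```python
import numpy as np

def skew(v):
    """The 3x3 matrix [v]_x such that skew(v) @ w equals np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_from_cameras(K1, K2, R, t):
    """Equation (6.29): F = [K' t]_x K' R K^{-1}."""
    return skew(K2 @ t) @ K2 @ R @ np.linalg.inv(K1)

K1 = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
K2 = np.array([[450.0, 0.0, 300.0], [0.0, 460.0, 250.0], [0.0, 0.0, 1.0]])
c, s = np.cos(0.1), np.sin(0.1)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
t = np.array([-1.0, 0.2, 0.1])

F = fundamental_from_cameras(K1, K2, R, t)
F /= np.linalg.norm(F)                 # F is only defined up to scale

X = np.array([0.3, -0.2, 4.0])         # a scene point in the first camera's frame
x = K1 @ X                             # homogeneous pixel in the first image
x_prime = K2 @ (R @ X + t)             # homogeneous pixel in the second image
print(x_prime @ F @ x)                 # epipolar constraint: zero up to round-off
```

The epipoles in Equation (6.28) can be checked the same way: F applied to e, and e′ applied to F from the left, should both vanish up to round-off.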
When calibrating a stereo rig from feature matches, we first robustly estimate
the fundamental matrix (Section 5.4.2), and then extract consistent camera matrices
from F .12 Unfortunately, this is where we run into the projective ambiguity referred
to earlier. That is, consider a projective transformation of the world coordinate sys-
tem given by a 4 × 4 non-singular matrix H . If we consider a camera matrix P and
homogeneous world point X, from Equation (6.11) we have the projection x ∼ PX.
Now consider an alternate camera matrix $\hat{P} = PH$ and world point $\hat{X} = H^{-1}X$; since
$$\hat{P}\hat{X} = P H H^{-1} X = P X \sim x \tag{6.30}$$
we get the same projected point on the image. Thus, from image correspondences
alone, we have no way to determine whether our estimates of the camera matrices
11 We removed the negative sign from the expression for e in Equation (6.28) since ∼ accounts for
this scalar multiple.
12 Zhang [574] and Fitzgibbon [144] described methods for simultaneously estimating lens distortion
coefficients and the fundamental matrix from a set of feature matches between an image pair.
are off from the truth by an arbitrary 3D projective transformation.13 Some of these
degrees of freedom are relatively harmless; for example, six of them account for an
arbitrary rigid motion of the world coordinate system (which can be removed by
fixing the coordinate system of the first camera as in Equation (6.27)). Another degree
of freedom corresponds to an unknown scale factor of the world; for example, an
image of a given object will look exactly the same as an image of an object that is
twice as large and twice as far away. This uncertainty can be resolved if we know the
physical length of some line segment in the scene that appears in one of the images
(e.g., the height of a table or wall).
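The ambiguity in Equation (6.30) can be seen directly in a small numerical experiment. The camera, the world point, and the 4 × 4 matrix H below are arbitrary hypothetical values; any non-singular H would serve.

```python
import numpy as np

def project(P, X_h):
    """Project a homogeneous world point and divide by the third element."""
    h = P @ X_h
    return h[:2] / h[2]

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P = K @ np.hstack([np.eye(3), [[0.0], [0.0], [5.0]]])
X = np.array([1.0, 2.0, 3.0, 1.0])

# An arbitrary non-singular 3D projective transformation (hypothetical values).
H = np.array([[1.0, 0.2, 0.0, 0.5],
              [0.0, 1.0, 0.3, -1.0],
              [0.1, 0.0, 1.0, 2.0],
              [0.05, -0.02, 0.01, 1.0]])

P_hat = P @ H                    # alternate camera matrix
X_hat = np.linalg.solve(H, X)    # alternate world point H^{-1} X

print(project(P, X))             # pixel from the original camera and point
print(project(P_hat, X_hat))     # the same pixel from the transformed pair
```

Both pairs produce the same pixel, so the image alone cannot distinguish between them. The scale ambiguity mentioned above is the special case in which H is a diagonal matrix that scales the first three coordinates equally.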
However, the class of 3D projective transformations also includes generaliza-
tions of the shear and nonlinear distortions corresponding to the last two images
in Figure 5.1. These distortions can have a serious effect on the structure of the scene
implied by a pair of camera matrices, as illustrated in Figure 6.8c–d. In particular, a
general 3D projective transformation can make the underlying true structure almost
unrecognizable, since angles and ratios of lengths are no longer preserved.
Without any further information about the cameras or the environment, this is
the best we can do for determining the camera matrices from feature matches in a
single image pair. A useful general formula for two camera matrices consistent with
a given F is
$$P = [\,I \mid 0\,], \qquad P' = [\,[e']_\times F + e' v^\top \mid \lambda e'\,] \tag{6.31}$$
where v is any 3 × 1 vector and λ is any nonzero scalar. This form shows that if we fix
P to the canonical form in Equation (6.31), 4 degrees of projective ambiguity remain.
Beardsley et al. [34] recommended choosing v so that the left-hand matrix $[e']_\times F + e' v^\top$
is as close to a rotation matrix as possible, resulting in a “quasi-Euclidean”
reconstruction.
Figure 6.8. Projective ambiguities inherent in the calibration of a stereo rig from feature matches
alone. These four scene configurations all differ by a 3D projective transformation and hence
can all produce the same image pair. (b) is a similarity transformation (rotation, translation,
and scale) of (a), (c) is a 3D shear, and (d) is a general 3D projective transformation. With-
out further information about the cameras or the environment, there is no way to resolve the
ambiguity.
13 Again, we can apply a counting argument: eleven degrees of freedom in each camera matrix
(allowing for nonzero skew) minus seven degrees of freedom for the fundamental matrix leaves
fifteen degrees of freedom in the 3D projective transformation H .
$$F = [t]_\times R \tag{6.33}$$
Only one of the four candidates is physically possible (i.e., corresponds to scene
points that are in front of both image planes in the stereo rig). This can be tested by
triangulating one of the feature matches — that is, projecting lines from the camera
centers through the corresponding image locations and finding their intersection in
three-dimensional space. In practice, the feature matches are noisy, so the two rays
will probably not actually intersect. In this case, we estimate the point of intersection
as the midpoint of the shortest line segment connecting the two rays, as illustrated in
Figure 6.9.14
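The midpoint rule can be sketched as a small least-squares problem. The code below assumes calibrated cameras of the form P = K[I | 0] and P′ = K′[R | t]; the function names are ours, and each ray is treated as an infinite line, so the sketch does not check that the result lies in front of both cameras.

```python
import numpy as np

def midpoint_of_rays(c1, d1, c2, d2):
    """Midpoint of the shortest segment joining c1 + s*d1 and c2 + u*d2."""
    c1, d1, c2, d2 = (np.asarray(v, dtype=float) for v in (c1, d1, c2, d2))
    # Choose s and u to minimize ||(c1 + s*d1) - (c2 + u*d2)||.
    A = np.column_stack([d1, -d2])
    (s, u), *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)
    return 0.5 * ((c1 + s * d1) + (c2 + u * d2))

def triangulate_midpoint(K1, K2, R, t, x1, x2):
    """Triangulate pixel match (x1, x2) for P = K1 [I | 0], P' = K2 [R | t]."""
    d1 = np.linalg.solve(K1, np.append(x1, 1.0))          # ray of camera 1
    d2 = R.T @ np.linalg.solve(K2, np.append(x2, 1.0))    # ray of camera 2
    return midpoint_of_rays(np.zeros(3), d1, -R.T @ t, d2)

# Two skew lines: the x axis, and a line parallel to the y axis at height 1.
print(midpoint_of_rays([0, 0, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]))
```

For the two skew lines in the example, the closest points are (0, 0, 0) and (0, 0, 1), so the estimate is (0, 0, 0.5). When the rays do intersect, the midpoint coincides with the intersection.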
When using noisy feature matches to estimate the relative rotation and translation
between a pair of calibrated cameras within a RANSAC approach, the minimal five-
point algorithm of Nistér [350] should be used instead of sequentially estimating and
factoring F .
14 This method only makes sense when we have calibrated cameras; the midpoint of the segment
has no meaning in a projective coordinate frame. In the projective case, the triangulation method
described by Hartley and Sturm [189] should be applied. This method is based on minimizing the
error between a feature match and the closest pair of conjugate epipolar lines, and involves finding
the roots of a sixth-degree polynomial. See also Section 7.2.
Figure 6.9. Once the camera matrices are estimated, the 3D point corresponding to a given
feature match can be estimated by triangulation. The point is chosen as the midpoint of the
shortest line segment connecting the two rays.
6.5 Image Sequence Calibration
We now have the ingredients to discuss the main topic of the chapter, the problem
of estimating the varying internal and external parameters of a camera as it moves
freely through an environment. This problem of image sequence calibration is the
core of the matchmoving process required for virtually any visual effects shot that
composites 3D computer-generated elements into footage from a moving camera.
In computer vision, the process is also called structure from motion, since we’re
estimating the coordinates of 3D points corresponding to image features (i.e., struc-
ture) as the camera position is varied (i.e., motion). We assume the camera is always
translating between frames, so that any frame pair is related by a fundamental matrix.
Just as in the stereo case, image sequence calibration contains an inherent pro-
jective ambiguity analogous to Equation (6.30). Therefore, the first step is usually to
estimate a projective reconstruction of the cameras and scene points that matches
the image feature locations as well as possible. This projective reconstruction is
then “upgraded” to what is called a Euclidean or metric reconstruction that dif-
fers from the true configuration by an unknown similarity transformation. Again,
Euclidean reconstruction is not possible without some additional assumptions about
the camera calibration matrices or the structure of the environment — but in practical
scenarios these assumptions are usually easy to make. Once we have a good estimate
of a Euclidean reconstruction, all of the camera parameter and scene point estimates
are refined using a non-linear estimation step called bundle adjustment to make
the reconstruction match the image features as well as possible. Finally, we discuss
several practical issues for image sequence calibration, which becomes difficult for
very long sequences.
xij ∼ Pi Xj (6.36)
Given the image projections {χij , xij } as input, we want to determine the
unknown camera matrices {P1 , . . . , Pm } and scene points {X1 , . . . , Xn }. Since we want
to find camera matrices and scene points that reproduce the projections we observed,
a natural approach is to minimize the sum of squared distances
$$\sum_{i=1}^{m} \sum_{j=1}^{n} \chi_{ij}\, d(x_{ij}, P_i X_j)^2 \qquad (6.37)$$
where d is the Euclidean distance between two points on the image plane (i.e., after
we convert from homogeneous to unhomogeneous coordinates).
This minimization problem is generally called bundle adjustment, for the reason
illustrated in Figure 6.10. That is, we are adjusting the “bundles” of rays emanating
from each camera to the scene points in order to bring the estimated projections onto
the image planes as close as possible to the observed feature locations. The quantity
in Equation (6.37) is also called the reprojection error.
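As a concrete illustration of Equation (6.37), the following sketch evaluates the reprojection error for a stack of cameras and points. The array layout (cameras as an m × 3 × 4 array, homogeneous points as an n × 4 array, and a 0/1 visibility array for χ) is an assumption made for this example.

```python
import numpy as np

def reprojection_error(P, X, x, chi):
    """Sum of squared image-plane distances, as in Equation (6.37).

    P:   (m, 3, 4) camera matrices
    X:   (n, 4)    homogeneous scene points
    x:   (m, n, 2) observed feature locations (inhomogeneous)
    chi: (m, n)    1 if point j is observed in image i, else 0
    """
    proj = np.einsum('ikl,jl->ijk', P, X)        # (m, n, 3) homogeneous projections
    xhat = proj[..., :2] / proj[..., 2:3]        # divide out the third coordinate
    sq_dist = np.sum((x - xhat) ** 2, axis=-1)   # d(x_ij, P_i X_j)^2
    visible = np.asarray(chi).astype(bool)
    return float(np.sum(sq_dist[visible]))       # unobserved pairs contribute nothing
```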
We will discuss the numerical solution to this nonlinear problem in Section 6.5.3.
However, the first consideration is determining a good initial estimate of the unknown
variables so that the bundle adjustment process starting from this initial guess con-
verges to a reasonable answer. We discuss two methods: one based on a factorization
approach for all the cameras at once, and one built on sequentially estimating camera
matrices, exploiting the knowledge that the images come from a sequence.
15 This approach was inspired by a classic algorithm by Tomasi and Kanade [493], who showed how
factorization applied to a simpler method of image formation, orthographic projection.
Figure 6.10. Bundle adjustment for projective reconstruction. We want to adjust the “bundles”
of rays emanating from each camera to the scene points in order to bring the estimated projections
onto the image plane as close as possible to the observed feature locations. The reprojection error
corresponds to the sum of squared distances between every observation and the intersection of
the corresponding line and image plane.
Here, the λij are the unknown scalar multiples such that the ∼ in Equation (6.36) is
an equality: λij xij = Pi Xj . These are called the projective depths, and the matrix of
feature locations on the left-hand side of Equation (6.38) is called the measurement
matrix.
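Written out from the definitions in this paragraph, the factorization has the block structure below. This is a reconstruction from the relation λij xij = Pi Xj, not a quotation of the typeset equation, so the layout should be read as an illustration.

```latex
\begin{bmatrix}
\lambda_{11} x_{11} & \lambda_{12} x_{12} & \cdots & \lambda_{1n} x_{1n} \\
\lambda_{21} x_{21} & \lambda_{22} x_{22} & \cdots & \lambda_{2n} x_{2n} \\
\vdots & \vdots & & \vdots \\
\lambda_{m1} x_{m1} & \lambda_{m2} x_{m2} & \cdots & \lambda_{mn} x_{mn}
\end{bmatrix}
=
\begin{bmatrix} P_1 \\ P_2 \\ \vdots \\ P_m \end{bmatrix}
\begin{bmatrix} X_1 & X_2 & \cdots & X_n \end{bmatrix}
```

Each entry on the left is a 3 × 1 block, so the left-hand side is 3m × n, while the two factors on the right are 3m × 4 and 4 × n.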
Since Equation (6.38) expresses the 3m × n measurement matrix as the product of
a 3m × 4 matrix containing the cameras {Pi } and a 4 × n matrix containing the scene
points {Xj }, we can see this large matrix has rank at most four. This suggests a natural
factorization algorithm based on the SVD. That is, given a guess for the projective
depths, we form the measurement matrix on the left-hand side of Equation (6.38)
(call it M ) and determine the SVD M = UDV⊤. U is 3m × n, V is n × n, and D is an
n × n diagonal matrix of singular values, which from this reasoning should ideally
have only four nonzero elements. Therefore, we define D4 as the left-hand n × 4
matrix of D. We estimate the 3m × 4 matrix of cameras on the right-hand side of
Equation (6.38) as UD4 and the 4 × n matrix of scene points on the right-hand side of
Equation (6.38) as the first four rows of V⊤.
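The SVD step can be sketched as follows. This is a minimal illustration, assuming the measurement matrix M has already been formed from a guess of the projective depths and has at least four rows and four columns; the function name is hypothetical.

```python
import numpy as np

def factorize_rank4(M):
    """Split a 3m x n measurement matrix into cameras (3m x 4) and points (4 x n)."""
    # Thin SVD: M = U diag(d) Vt, with singular values sorted in decreasing order.
    U, d, Vt = np.linalg.svd(M, full_matrices=False)
    cameras = U[:, :4] * d[:4]    # U D_4: scale the first four columns of U
    points = Vt[:4, :]            # first four rows of V transpose
    return cameras, points
```

If M is exactly rank four, the product of the two factors reproduces M. The factors themselves generally differ from the cameras and points that generated M by an unknown 4 × 4 matrix, which is the projective ambiguity discussed later in this section.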
As for previous algorithms, normalization of the data prior to applying the algo-
rithm is critical to get good results if the data is noisy. A simple approach is to first
normalize the feature locations in each image in the usual way (i.e., apply a similarity
transformation to each image plane so that the features have zero mean and average
distance from the origin of √2). Next, we rescale each row of M to have unit norm,
and then rescale each column of M to have unit norm. This process can be iterated
until the measurement matrix stops changing significantly.
To start the process, we need reasonable initial estimates of the projective depths.
One possibility is simply to initialize λij = 1 for all i and j.16 Then we apply the
factorization algorithm to obtain a candidate collection of P’s and X’s, and compute
the homogeneous reprojections x̂ij . A new estimate of λij is obtained as the third
element of x̂ij . We then iterate the factorization algorithm until the reprojection error
stops changing significantly.
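The iteration just described can be sketched as below. This is a simplified illustration under stated assumptions: every point is visible in every image, the feature locations are supplied in homogeneous form with third coordinate 1, and the row and column rescaling of M is omitted for brevity. The function name, the iteration cap, and the tolerance are our own choices.

```python
import numpy as np

def projective_factorization(x, n_iter=50, tol=1e-10):
    """Alternate rank-4 factorization and projective-depth updates.

    x: (m, n, 3) homogeneous feature locations with third coordinate 1.
    Returns cameras (m, 3, 4), points (4, n), and the mean reprojection distance.
    """
    m, n, _ = x.shape
    lam = np.ones((m, n))                  # initialize every projective depth to 1
    prev_err = np.inf
    for _ in range(n_iter):
        # Stack lambda_ij * x_ij into the 3m x n measurement matrix.
        M = (lam[..., None] * x).transpose(0, 2, 1).reshape(3 * m, n)
        U, d, Vt = np.linalg.svd(M, full_matrices=False)
        P = (U[:, :4] * d[:4]).reshape(m, 3, 4)
        X = Vt[:4, :]
        xhat = np.einsum('ikl,lj->ijk', P, X)   # homogeneous reprojections
        lam = xhat[..., 2]                      # new depths: third element of xhat
        diff = x[..., :2] - xhat[..., :2] / xhat[..., 2:3]
        err = float(np.mean(np.linalg.norm(diff, axis=-1)))
        if abs(prev_err - err) < tol:           # reprojection error stopped changing
            break
        prev_err = err
    return P, X, err
```

On noisy data the normalization steps described above matter, and a practical implementation would add them inside the loop.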
A key problem with the approach is that in practice, all of the 3D points are unlikely
to be seen in all of the images. In this case, we eliminate cameras and points until
we have a “nucleus” of 3D points that are seen in all of the images from a subset
of the cameras. When the factorization algorithm has converged, we can use the
resectioning algorithm described in Section 6.3.1 to estimate new camera matrices
that see some of the 3D nucleus points based on several of their feature locations.
We also use the triangulation algorithm described in Section 6.4.2 to estimate new
3D scene point positions based on feature matches in the camera subset.17 These
processes are sketched in Figure 6.11.
Therefore, the overall projective factorization algorithm is:
Figure 6.11. Interpolating projective camera matrices and scene structure using resectioning
and triangulation. (a) If the projective cameras and scene points given by white planes/circles are
known, the shaded camera matrix can be computed by resectioning, since the world coordinates
and image projections are both known. (b) If the projective cameras and scene points given by
white planes/circles are known, the shaded points can be computed by triangulation, since the
camera matrices and image projections are both known.
16 Sturm and Triggs also described a more complicated approach to initializing the projective depths
based on the estimated fundamental matrices between image pairs.
17 Since this is only a projective reconstruction, Hartley’s algorithm [189] should be used.
7. Let the 3m × 4 matrix of cameras on the right-hand side of Equation (6.38)
be UD4 and the 4 × n matrix of points on the right-hand side of Equation (6.38)
be the first four rows of V⊤, where D4 is the left-hand n × 4 matrix of D.
8. Compute the reprojection x̂ij for each camera and point.
9. If the average reprojection error has converged, stop. Otherwise, let λij be the
third element of x̂ij and go to Step 4.
10. Un-normalize the camera matrices and world coordinates.
11. Resection and triangulate the non-nucleus cameras and scene points.
$$\sum_{i=1}^{m} \sum_{j=1}^{n} \chi_{ij}\, \| \lambda_{ij} x_{ij} - P_i X_j \|^2 \qquad (6.39)$$
is linear in the elements of P if the X’s are known and vice versa. This suggests a
natural algorithm of alternating the estimation of one set of quantities while the
other is fixed, and has the advantage that not all points must be seen in all images.
Along the same lines, Hung and Tang [210] proposed to cycle through updating the
cameras, scene points, and inverse projective depths; fixing two of the quantities and
estimating the third is a linear least-squares problem. Both algorithms were shown
to provably converge.
Clearly, minimizing Equation (6.37) or Equation (6.39) can only result in a recon-
struction of the cameras and world points up to a 3D projective transformation, by the
same argument as in Equation (6.30). That is, we can replace all the camera matrices
and world points by
$$\hat{P}_i = P_i H, \qquad \hat{X}_j = H^{-1} X_j$$
for any 4 × 4 non-singular matrix H , and still obtain the same measurement matrix.
This means that at the end of projective reconstruction, we will most likely have a very
strange set of camera matrices and 3D points (e.g., resembling Figure 6.8d), which are
not immediately useful. Section 6.5.2 addresses the estimation of an H that upgrades
the projective reconstruction to a Euclidean one (e.g., resembling Figure 6.8b).
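The ambiguity is easy to check numerically. The sketch below, with hypothetical helper names, applies an arbitrary non-singular H to a camera and the inverse of H to the scene points, and confirms that the image projections do not change.

```python
import numpy as np

def project(P, X):
    """Project homogeneous points X (4 x n) with camera P (3 x 4) to 2 x n pixels."""
    x = P @ X
    return x[:2] / x[2]

def apply_projective_change(P, X, H):
    """Return the transformed pair (P H, H^{-1} X) for a non-singular 4 x 4 H."""
    return P @ H, np.linalg.solve(H, X)

# Illustrative values only; any non-singular H gives the same conclusion.
P = np.array([[800.0, 0.0, 320.0, 10.0],
              [0.0, 800.0, 240.0, 20.0],
              [0.0, 0.0, 1.0, 0.5]])
X = np.array([[0.0, 1.0, -2.0],
              [0.0, -1.0, 0.5],
              [4.0, 5.0, 6.0],
              [1.0, 1.0, 1.0]])
H = np.array([[1.0, 0.0, 0.0, 0.5],
              [0.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.1, 0.0, 0.2, 1.0]])

P_new, X_new = apply_projective_change(P, X, H)
same_projections = np.allclose(project(P, X), project(P_new, X_new))
```

The transformed points X_new no longer resemble the original scene, even though every image measurement is reproduced, which is why the reconstruction must be upgraded before it is useful.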
Figure 6.12. (a) Sequential updating of cameras uses overlapping pairs of images to
successively estimate projective camera matrices. (b) Hierarchical updating uses a subset of keyframes:
images chosen to give wider baselines. Intermediate cameras and scene points can be
estimated using resectioning and triangulation. In both cases, triples of images can be used instead,
leveraging the trifocal tensor.
images in the sequence. Beardsley et al. [34] described this process in detail, taking
into account the problem of maintaining good estimates of the 3D structure used
for resectioning as the sequence gets longer and feature matches enter and leave the
images.
Avidan and Shashua [22] and Fitzgibbon and Zisserman [145] described methods
that “thread together” triples of images to estimate the next camera matrix instead of
using pairs of images, so that all the cameras are represented in a common projective
frame. These methods are based on the trifocal tensor, a 3 × 3 × 3 matrix that relates
feature correspondences in image triples similarly to how the fundamental matrix
relates feature correspondences in image pairs. Methods based on triples of images
are often preferred since the trifocal constraint is stronger, making it easier to reject
outlier feature matches; also, each triple overlaps the previous one by two images,
adding robustness to the solution. Fitzgibbon and Zisserman also described how to
enforce a constraint if an image sequence is known to be closed — that is, the first
and last camera matrices are the same.
When successive images are very close together spatially (which is not unusual),
the decomposition in Equation (6.31) is unstable; that is, the fundamental matrix
may be poorly estimated since t is so small in Equation (6.29). In these cases, a
global projective transformation may better express the relationship between the
two views (since the motion is nearly pure rotation). Torr et al. [495] discussed this
problem of degeneracy in calibrating image sequences, and proposed methods for
“surviving” these situations when they are encountered in practice. The key idea is
to incorporate a robust model selection criterion at each frame that decides whether
the relationship between an image pair is better modeled by a fundamental matrix or
a projective transformation. The same problem occurs when the scene in an image is
primarily comprised of a single, dominant plane; Pollefeys et al. [369] extended Torr
et al.’s approach to operate on image triples in this situation.
Alternately, we can take keyframes from the sequence that are spatially far enough
apart to enable robust estimation of F , but not so far apart that feature matching
This makes sense; we’ve removed seven degrees of freedom (a similarity transfor-
mation) to fix the first camera, leaving eight degrees of freedom: the unknown five
entries of K1 and the 3 × 1 vector c, which is related to the 3D projective distortion
of the environment.18 Since K1 is nonsingular, we can define a vector $v = -K_1^{-\top} c$,
so that
$$H = \begin{bmatrix} K_1 & 0 \\ -v^\top K_1 & 1 \end{bmatrix} \qquad (6.46)$$
Now, if we denote
Pi = [Ai | ai ] (6.47)
and recall the definition of $\omega_i = (K_i K_i^\top)^{-1}$ from Equation (6.23), it’s straightforward
to show (see Problem 6.18) that
$$\omega_i^{-1} \sim P_i Q P_i^\top \qquad (6.48)$$
If we denote the rows of Pi as Pi1 , Pi2 , Pi3 and expand Equation (6.48) in this special
situation, we obtain four linear equations in the five unknowns of Q, corresponding
18 It is also related to what is called the plane at infinity for the projective reconstruction.
19 Q is also known as the absolute dual quadric [499].
Therefore, we need at least three views (i.e., eight equations) to obtain a solu-
tion. If we have many views, then we have an overdetermined linear system that
can be solved using the same Direct Linear Transform approach we’ve used for pre-
vious problems. We need to make sure that the Q we obtain is rank-3 (using the
same approach we used to make sure the estimated fundamental matrix was rank-2
in Section 5.4.2). Once we obtain Q, then the relationships in Equation (6.46) and
Equation (6.48) allow us to recover the elements of H , and thus obtain all the cam-
eras in a Euclidean frame via Equation (6.41).20 Figure 6.13 illustrates a sketch of
upgrading a scene containing a wireframe house from a projective reconstruction
to a Euclidean reconstruction. That is, we are illustrating the estimated points Xj
before and after applying the 3D projective transformation H −1 . We can see that
qualitatively, the projective reconstruction is not useful while the Euclidean one is.
The algorithm just discussed was proposed by Pollefeys et al. [366] and is
widely used to solve the self-calibration problem, though other methods exist (see
Section 6.8). If we know less about the cameras’ varying internal parameters (for
example, only that the skew is zero), then we can still apply constraints based on
Equation (6.48); however, the algorithms are not typically linear. For example, we
can directly minimize a nonlinear function of the unknown 3 × 1 vector v and the
Figure 6.13. Upgrading a projective reconstruction of a set of points in (a) to a Euclidean recon-
struction in (b). Even though the upgrade is obtained by analyzing the projective camera matrices,
it’s easier to visualize the effects of the upgrade by looking at the reconstructed scene points.
20 However, since we are typically solving an overconstrained problem with noisy input camera
matrices, the solution we obtain is only approximate and the resulting Ki may not exactly be in the
required form. We’ll take care of this during bundle adjustment in the next section.
$$\sum_{i=1}^{m} \left\| \frac{K_i K_i^\top}{\| K_i K_i^\top \|_F} - \frac{P_i Q P_i^\top}{\| P_i Q P_i^\top \|_F} \right\|_F^2 \qquad (6.55)$$
prior to self-calibration, where w and h are the width and height of the images. They
also recommended weighting Equations (6.51)–(6.54) to reflect reasonable estimates
about the cameras’ unknown internal parameters. For example, we are usually much
more certain that ωi (1, 2) = 0 (i.e., the cameras have zero skew) than we are that the
principal points (ωi (1, 3), ωi (2, 3)) = (0, 0), so Equation (6.52) should have much more
weight in the linear system than Equations (6.53)–(6.54).
We should keep in mind that self-calibration from an image sequence fails in
certain critical configurations, enumerated by Sturm [471, 472]. Unfortunately, some
of these critical configurations are not unusual when it comes to video shot for visual
effects. For the variable focal length case described earlier, critical configurations
include a camera that translates but does not rotate, as well as a camera that moves
along an elliptical path, pointing straight ahead (e.g., a camera pointing straight out
of a car’s windshield as it takes a curve in the road).
As we discuss next, we always follow up a Euclidean reconstruction with the non-
linear process of bundle adjustment over all internal and external parameters, so
it’s not always necessary to obtain a highly accurate Euclidean reconstruction at
this stage.
$$\sum_{i=1}^{m} \sum_{j=1}^{n} \chi_{ij}\, d(x_{ij}, P_i X_j)^2 \qquad (6.57)$$
over all of the unknown camera matrices {P1 , . . . , Pm } and 3D world points {X1 , . . . , Xn },
where d is the Euclidean distance between two points on the image plane (i.e., after
we convert from homogeneous to unhomogeneous coordinates).21
6.5.3.1 Parameterization
A critical aspect is the minimal parameterization of the camera matrices to have
exactly the number of degrees of freedom we know they should have. For example,
in the case of a zero-skew camera, we can explicitly represent each camera matrix Pi
using ten numbers as follows.
We parameterize the rotation matrix Ri in terms of a 3 × 1 vector $r^i$ using the
Rodrigues formula:22
$$R_i = \cos\|r^i\|\, I_{3\times 3} + \operatorname{sinc}\|r^i\|\, [r^i]_\times + \frac{1 - \cos\|r^i\|}{\|r^i\|^2}\, r^i (r^i)^\top \qquad (6.58)$$
Here, the direction of $r^i$ gives the axis about which the world coordinate system is
rotated, and the magnitude of $r^i$ gives the angle of rotation. This is also known as the
axis-angle parameterization of a rotation matrix.
Then, each Pi is minimally parameterized by ten numbers
$(\alpha_x^i, \alpha_y^i, x_0^i, y_0^i, r_1^i, r_2^i, r_3^i, t_1^i, t_2^i, t_3^i)$, using:
$$P_i = K_i [R_i \mid t_i] = \begin{bmatrix} \alpha_x^i & 0 & x_0^i \\ 0 & \alpha_y^i & y_0^i \\ 0 & 0 & 1 \end{bmatrix} \left[ \cos\|r^i\|\, I_{3\times 3} + \operatorname{sinc}\|r^i\|\, [r^i]_\times + \frac{1 - \cos\|r^i\|}{\|r^i\|^2}\, r^i (r^i)^\top \;\middle|\; \begin{matrix} t_1^i \\ t_2^i \\ t_3^i \end{matrix} \right] \qquad (6.59)$$
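The ten-parameter camera can be assembled directly from these formulas. The sketch below is an illustration with hypothetical function names; it uses the unnormalized sinc (sin x)/x written out explicitly, since NumPy's built-in np.sinc is the normalized variant, and it falls back to a first-order approximation when the rotation angle is essentially zero.

```python
import numpy as np

def skew(r):
    """Cross-product matrix [r]_x, so that skew(r) @ v equals np.cross(r, v)."""
    return np.array([[0.0, -r[2], r[1]],
                     [r[2], 0.0, -r[0]],
                     [-r[1], r[0], 0.0]])

def rodrigues(r):
    """Rotation matrix from an axis-angle vector r (axis = r/|r|, angle = |r|)."""
    r = np.asarray(r, dtype=float)
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3) + skew(r)     # first-order approximation near zero rotation
    return (np.cos(theta) * np.eye(3)
            + (np.sin(theta) / theta) * skew(r)
            + ((1.0 - np.cos(theta)) / theta**2) * np.outer(r, r))

def camera_from_params(p):
    """Build P = K [R | t] from (ax, ay, x0, y0, r1, r2, r3, t1, t2, t3)."""
    ax, ay, x0, y0 = p[0:4]
    K = np.array([[ax, 0.0, x0],
                  [0.0, ay, y0],
                  [0.0, 0.0, 1.0]])    # zero-skew calibration matrix
    R = rodrigues(p[4:7])
    t = np.asarray(p[7:10], dtype=float).reshape(3, 1)
    return K @ np.hstack([R, t])
```

Because every choice of the ten numbers yields a valid zero-skew camera, an optimizer can update them freely without any side constraints on the rotation.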
21 It’s also possible to use a distance function that incorporates the underlying uncertainty about
each feature measurement; this is called the Mahalanobis distance.
22 Here we use the sinc function, sinc x = (sin x)/x.
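where F is the function we want to minimize. Let’s expand the cost function F (θ ) in
a Taylor series approximation about some point $\theta^t$:
$$F(\theta) \approx F(\theta^t) + \frac{\partial F}{\partial \theta}(\theta^t)^\top (\theta - \theta^t) + \frac{1}{2} (\theta - \theta^t)^\top \frac{\partial^2 F}{\partial \theta^2}(\theta^t)\, (\theta - \theta^t) \qquad (6.61)$$
The minimizer $\theta^*$ of this function is given by setting the gradient of Equation (6.61)
to zero:
$$\theta^* = \theta^t - \left[ \frac{\partial^2 F}{\partial \theta^2}(\theta^t) \right]^{-1} \frac{\partial F}{\partial \theta}(\theta^t) \qquad (6.62)$$
This suggests an iterative process for minimizing F : we start with a good estimate
of the minimizer (obtained at the end of Euclidean reconstruction), form the
quadratic approximation in Equation (6.61) around this point, and iterate
$$\theta^{t+1} = \theta^t - \left[ \frac{\partial^2 F}{\partial \theta^2}(\theta^t) \right]^{-1} \frac{\partial F}{\partial \theta}(\theta^t) \qquad (6.63)$$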
$$J(\theta^t) = \frac{\partial f}{\partial \theta}(\theta^t) \qquad (6.65)$$
That is, the (j, k)th element of J is the partial derivative of the jth reprojection x̂j with
respect to the kth parameter θk . We will return to the structure of this Jacobian matrix
in a moment, since it has critical implications for designing a fast bundle adjustment
algorithm.
The other important quantity in Equation (6.63) is the matrix of second partial
derivatives, also called the Hessian. This matrix is impractical (and generally unnecessary)
to compute exactly, and optimization algorithms differ in how to approximate
it. The Levenberg-Marquardt algorithm commonly used for bundle adjustment uses
the approximation
$$\frac{\partial^2 F}{\partial \theta^2}(\theta^t) \approx J(\theta^t)^\top J(\theta^t) + \lambda_t I \qquad (6.66)$$
where λt is a tuning parameter that varies with each iteration and I is an appropri-
ately sized identity matrix. The reasoning behind this approximation is described in
Appendix A.4.
Therefore, at each Levenberg-Marquardt iteration corresponding to Equation (6.63),
we must solve a linear system of the form
where δ t is the increment we add to θ t to obtain θ t+1 . These are also known as the
normal equations for the problem.
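A dense, unoptimized version of this iteration can be sketched as follows. The sketch assumes the standard least-squares setting in which the cost is the squared norm of x − f(θ), so that each step solves (J⊤J + λI) δ = J⊤(x − f(θ)); that right-hand side and the simple rule for adapting λ are common conventions assumed here rather than details taken from the text.

```python
import numpy as np

def levenberg_marquardt(f, jac, x, theta0, lam=1e-3, n_iter=100):
    """Minimize ||x - f(theta)||^2 with damped Gauss-Newton steps.

    f:   maps a parameter vector to predicted measurements
    jac: maps a parameter vector to the Jacobian of f
    """
    theta = np.asarray(theta0, dtype=float)
    cost = np.sum((x - f(theta)) ** 2)
    for _ in range(n_iter):
        J = jac(theta)
        r = x - f(theta)
        # Damped normal equations: (J^T J + lam I) delta = J^T r.
        A = J.T @ J + lam * np.eye(theta.size)
        delta = np.linalg.solve(A, J.T @ r)
        new_cost = np.sum((x - f(theta + delta)) ** 2)
        if new_cost < cost:
            # Step accepted: move, and trust the quadratic model more.
            theta, cost, lam = theta + delta, new_cost, lam / 10.0
        else:
            # Step rejected: increase damping (closer to gradient descent).
            lam *= 10.0
        if np.linalg.norm(delta) < 1e-12:
            break
    return theta
```

For example, fitting the two parameters of a·exp(b·t) to noiseless samples recovers them from a rough starting guess:

```python
t = np.linspace(0.0, 1.0, 20)
f = lambda th: th[0] * np.exp(th[1] * t)
jac = lambda th: np.column_stack([np.exp(th[1] * t), th[0] * t * np.exp(th[1] * t)])
x = f(np.array([2.0, -1.0]))
theta = levenberg_marquardt(f, jac, x, [1.0, 0.0])
```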
If we treated Equation (6.67) as a generic linear system, we would waste a lot
of computation since the Jacobian matrix J has many zero elements. That is, each
reprojection x̂j only depends on one camera matrix and one scene point, so all of
the derivatives ∂fj /∂θk will be zero for θk that don’t involve the corresponding camera or
point. Thus, while there may be hundreds of camera parameters, thousands of scene
points, and tens of thousands of feature matches in a realistic bundle adjustment
problem, the matrix J is very sparse. The matrix J⊤J in Equation (6.67) will also be
sparse (but less so). Figure 6.14 shows the structure of J and J⊤J for a simple
problem to illustrate the sparsity pattern.
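The block-level sparsity pattern can be generated from the visibility indicators alone. In the sketch below, each camera and each scene point is treated as a single block column and each observation as a single block row; the function names are hypothetical.

```python
import numpy as np

def jacobian_block_pattern(chi):
    """Boolean block pattern of J: one row per observation, columns = cameras then points."""
    chi = np.asarray(chi).astype(bool)
    m, n = chi.shape
    observations = np.argwhere(chi)            # (i, j) pairs, ordered by camera then point
    pattern = np.zeros((len(observations), m + n), dtype=bool)
    for row, (i, j) in enumerate(observations):
        pattern[row, i] = True                 # depends on camera i
        pattern[row, m + j] = True             # depends on scene point j
    return pattern

def normal_matrix_block_pattern(pattern):
    """Boolean block pattern of J^T J implied by the pattern of J."""
    counts = pattern.T.astype(int) @ pattern.astype(int)
    return counts > 0

# Three cameras observing four points, all visible.
J_pattern = jacobian_block_pattern(np.ones((3, 4)))
JTJ_pattern = normal_matrix_block_pattern(J_pattern)
```

For this example the camera-camera and point-point parts of JTJ_pattern are diagonal, while the camera-point part is completely filled; removing observations from chi empties the corresponding camera-point entries.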
Figure 6.14. Suppose we have a bundle adjustment problem in which three cameras observe
four points. (a) The structure of the Jacobian J is indicated by dark blocks for nonzero elements
and white (empty) blocks for zero elements. The rows index feature observations (x11 through x34) while the
columns index camera and scene parameters (P1, P2, P3, X1, X2, X3, X4). (b) The structure of J⊤J for the same problem.
Both matrices are even sparser when not every feature is seen by every camera (which is
typical in a real matchmoving problem; see Figure 6.16).
We can exploit this sparsity pattern to more efficiently solve Equation (6.67). From
Figure 6.14b, we can see that Equation (6.67) can be written in terms of submatrices
involving the camera parameters and the scene points. That is, it has the form
$$\begin{bmatrix} J_{PP} & J_{PX} \\ J_{PX}^\top & J_{XX} \end{bmatrix} \begin{bmatrix} \delta_P \\ \delta_X \end{bmatrix} = \begin{bmatrix} b_P \\ b_X \end{bmatrix} \qquad (6.68)$$
From Figure 6.14b, we know that JPP and JXX are both block-diagonal matrices; that
is, each dark square in JPP is a 10 × 10 matrix involving second derivatives with respect
to the parameters of a single camera, while each dark square in JXX is a 3 × 3 matrix
involving second derivatives with respect to the parameters of a single scene point.
Since a bundle adjustment problem usually involves many more scene points than
cameras, we assume that JXX is larger than JPP .
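Now we apply a trick based on the Schur complement of JXX ; we multiply both
sides of Equation (6.68) by
$$\begin{bmatrix} I & -J_{PX} J_{XX}^{-1} \\ 0 & I \end{bmatrix} \qquad (6.69)$$
where I and 0 are appropriately sized identity and zero matrices, respectively. The
result is:
$$\begin{bmatrix} J_{PP} - J_{PX} J_{XX}^{-1} J_{PX}^\top & 0 \\ J_{PX}^\top & J_{XX} \end{bmatrix} \begin{bmatrix} \delta_P \\ \delta_X \end{bmatrix} = \begin{bmatrix} b_P - J_{PX} J_{XX}^{-1} b_X \\ b_X \end{bmatrix} \qquad (6.70)$$
The top half of Equation (6.70) is
$$\left( J_{PP} - J_{PX} J_{XX}^{-1} J_{PX}^\top \right) \delta_P = b_P - J_{PX} J_{XX}^{-1} b_X \qquad (6.71)$$
which is a relatively small, easily solved linear system for the camera update δP . Once
we have obtained δP , we plug it into the bottom half of Equation (6.70) to obtain:
$$J_{XX}\, \delta_X = b_X - J_{PX}^\top \delta_P \qquad (6.72)$$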
which is also easily solved since JXX is block diagonal with small blocks.
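The two-stage solve can be sketched with dense arrays as follows. This is an illustration only: it assumes JXX is supplied as a stack of 3 × 3 diagonal blocks (one per scene point), it stores JPX densely, and the function name is hypothetical. A production implementation would also keep JPX in sparse form.

```python
import numpy as np

def solve_schur(Jpp, Jpx, Jxx_blocks, bP, bX):
    """Solve the block system of Equation (6.68) via the Schur complement of JXX.

    Jpp:        (p, p)     camera block
    Jpx:        (p, 3n)    camera-point block
    Jxx_blocks: (n, 3, 3)  diagonal blocks of JXX, one per scene point
    bP, bX:     right-hand sides of length p and 3n
    """
    n = len(Jxx_blocks)
    inv_blocks = np.linalg.inv(Jxx_blocks)          # invert each 3 x 3 block separately
    # Apply JXX^{-1} blockwise to JPX^T and to bX.
    JpxT = Jpx.T.reshape(n, 3, -1)
    Jxx_inv_JpxT = np.einsum('nab,nbp->nap', inv_blocks, JpxT).reshape(3 * n, -1)
    Jxx_inv_bX = np.einsum('nab,nb->na', inv_blocks, bX.reshape(n, 3)).reshape(-1)
    # Reduced camera system: the top half of the block-triangular form.
    S = Jpp - Jpx @ Jxx_inv_JpxT
    dP = np.linalg.solve(S, bP - Jpx @ Jxx_inv_bX)
    # Back-substitution for the points, again one 3 x 3 block at a time.
    rhs = (bX - Jpx.T @ dP).reshape(n, 3)
    dX = np.einsum('nab,nb->na', inv_blocks, rhs).reshape(-1)
    return dP, dX
```

The only linear system that is solved as a whole has the size of the camera parameters; everything involving the scene points reduces to independent 3 × 3 inversions.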
This approach is the basis of the sparse Levenberg-Marquardt bundle adjustment
algorithm proposed by Lourakis and Argyros [305], which is widely used. However,
the same authors [303] also observed that a sparse implementation of Powell’s dog-
leg algorithm might be an even more efficient optimization algorithm for bundle
adjustment. Recently, Agarwal et al. [4] advocated the use of a preconditioned con-
jugate gradient algorithm for efficiently solving Equation (6.67), instead of using the
Schur complement approach. This method is especially useful for the extensions in
Section 6.6.2, where the matrix on the left-hand side of Equation (6.71) may be diffi-
cult to construct and not sparse. Steedly et al. [468] also discussed related issues for
solving Equation (6.71), as well as methods for efficient incremental bundle adjust-
ment as new information becomes available [467]. Triggs et al. [500] discuss further
practical optimization issues for bundle adjustment.
Figure 6.15. Example images from a video sequence obtained by moving a handheld
camera around a building.
To obtain the input for matchmoving, 19,780 unique features were automatically
detected and tracked over the course of the camera motion. Since the camera is
moving only slightly between frames, the tracker uses single-scale corners, which are
automatically checked to make sure the matches are consistent with the projections
of underlying 3D scene points. Figure 6.16 illustrates the presence of a random subset
of 200 of the features in each frame; we can see that features constantly enter and leave
the camera field of view, and that no single feature lasts very long. On average,
each feature track had a duration of 12 frames, and the maximum duration was 122
frames. An average of 616 tracked features appeared in each frame. Since Figure 6.16
corresponds to the sparsity pattern of the matrix JPX in Equation (6.68), we can see
that in practice, the matrix J⊤J is much sparser than Figure 6.14b suggests.
Visualizing the estimated camera path with respect to the estimated 3D points is
critical for determining whether a matchmoving solution makes sense. Figure 6.17
illustrates a camera tracking result for the video sequence, in which the cameras
are represented as red dots and the scene points are represented as blue dots. Each
camera’s principal axis is indicated by a red line. The camera track was obtained
using a combination of the projective reconstruction, metric reconstruction, and
sequential updating algorithms described in the previous sections. Even though the
reconstructed scene is only sparsely sampled (i.e., we only obtain 3D estimates of
points corresponding to tracked features), we get a strong sense of the environment
and can be confident that the camera positions are well estimated.
Figure 6.16. A random sample of 200 of the 19,780 features tracked over a 383-frame sequence
(axes: feature tracks versus frames).
A dark box appears in the (i, j)th position if feature i was detected in frame j. We can see that
features have various lifetimes and constantly enter and leave the field of view of the moving
camera.
Figure 6.17. The camera tracking result obtained for the video sequence. (a) Top view, illus-
trating the camera track and rough geometry of the building. The ground plane was manually
inserted based on the recovered scene points. (b) Side view, including the image and viewpoint
for a selected camera.
The quality of camera tracking can also be verified by adding test objects into the
reconstructed 3D coordinate system. For example, Figure 6.18 illustrates the same
frames as Figure 6.15, with added synthetic geometric solids aligned to surfaces in the
scene.23 This is an extremely simple example of adding computer-generated imagery
to a real video sequence using matchmoving. In a visual effects company, a match-
move expert will typically fit a large number of planar surfaces to the scene to aid 3D
animators and compositors further down the pipeline.
23 Note that sometimes the objects “show through” physical surfaces in the scene in this simple
picture. The added synthetic objects are simply rendered on top of the original images using the
estimated camera perspectives. There is also no attempt to match the lighting of the scene, which
is essential for realism.
6.6 Extensions of Matchmoving
Figure 6.18. Test objects rendered into the scene are used to verify the quality of a matchmove
result. The new objects appear to “stick to” the scene at the correct locations. Feature points
tracked through the sequence to obtain the matchmove result are also illustrated as yellow dots.
Matchmoving is typically solved off-line (i.e., well after the data is collected) and
applied to a sequence of closely spaced images. Here, we briefly discuss extensions
in which each constraint is relaxed. First, we address real-time matchmoving, which
could be used for adding computer-generated 3D elements to live video from a mov-
ing camera. Another application is live pre-visualization of how real video elements
interact with computer-generated ones (e.g., augmented reality). The second exten-
sion is to structure from motion on large, unordered image datasets that don’t come
from a video sequence, such as Internet photo collections. We discuss new tools for
camera localization by exploiting such large collections.
assumed to lie on planar surfaces. The world coordinate system and its scale are
initialized by placing a calibration pattern with known dimensions in front of the
camera before it begins to move. The unknown depths of features in the environ-
ment are estimated with greater accuracy as the camera views them from different
positions. A smoothness prior that the camera moves with constant velocity and
angular velocity is imposed to make the camera parameter estimation robust to video
segments that contain few features. We discuss probabilistic methods for state esti-
mation in more detail in the context of motion capture in Chapter 7. The book by
Thrun et al. [489] is an excellent reference for probabilistic robot localization, though
it does not emphasize vision-based SLAM.
While these techniques are promising, a key consideration is drift — that is, the
accumulation of errors as the sequence gets longer and longer. It’s more likely that
production-quality real-time matchmoving is accomplished with a special-purpose
hardware system, as discussed further in Section 6.8.
Doug Roble, creative director of software, and Som Shankar, integration supervisor of
Digital Domain, in Venice, California, discuss the role of matchmoving on a movie set.
In 1999, Doug Roble won a Scientific and Technical Achievement Academy Award for
writing Digital Domain’s in-house 3D tracking/scene reconstruction program, called
“Track.”
Shankar: To undistort real movie camera lenses on feature films, we shoot a known
square grid with each camera setup and lens, and process those grids to create the
mapping between distorted and ideal coordinates. Sometimes we don’t even use a
parametric model for the lens; we just use the nonparametric mapping obtained from
the grid.
Part of the reason is that movie lenses do weird things. Even though two of the
same lenses from the same manufacturer are supposed to have exactly the same
distortion, there are tiny anomalies between those lenses because they’re still pieces
of glass. Anamorphic lenses, which stretch a widescreen image vertically to cover
the entire recorded film frame, are even more complicated since they’re oval. Even
the lenses on new high-end digital cameras behave very interestingly; the images are
huge but the quality falls off toward the edges since you’re not really meant to see
the image through those parts of the lenses.
For final movie frames, we never undistort the original images. In our 3D com-
positing pipeline, we create all of our effects over the “flat,” undistorted version of
the plate, and then at the very end use the estimated lens distortion for that shot to
re-distort our 3D elements to match the original plate.
Shankar: We’re often invited to a built set that will be filmed from different angles
for a whole sequence of shots. Our team goes to the set with a Leica total station (see
Chapter 8) — it’s the same kind of device you see surveyors using on roadsides and
construction sites. We survey a sparse set of important locations in the room, such as
corners of objects, and record their 3D coordinates with this very precise device. We
also take a lot of photographs of the set, and then using our in-house software, Track,
we line that survey up to those photographs. It’s also becoming much more common
to scan entire sets with LiDAR, which gives us a much denser sampling of 3D points
(see Section 8.1).
6.7 Industry Perspectives
Figure 6.19. (a,b) Matchmoving is the basic tool for inserting realistic visual effects into back-
ground plates. In this example from Transformers: Dark of the Moon, spaceships fly around the
building and land on its right side. (c,d) Matchmoving is also commonly used for set extension.
In this example from A Beautiful Mind, only a small piece of the set was built on a soundstage,
and the rest of the building was digitally generated. Transformers: Dark of the Moon ©2011
Paramount Pictures. All Rights Reserved. A Beautiful Mind ©2001 Universal Studios and DW
Studios L.L.C. All Rights Reserved. Courtesy of Universal Studios Licensing LLC.
Many companies do pure photogrammetry for set reconstruction — they just take
a lot of pictures, and use structure from motion tools to determine 3D points in the
scene. You can get a lot done that way, but spatial accuracy is often a problem.
RJR: How do you use resectioning and structure from motion to get the camera track
and 3D positions of 2D features in the images?
Roble: An artist starts with the on-set survey and the plates for a given shot. They
then manually connect up the 3D points that were surveyed with the corresponding
2D points on the image. We initialize the resectioning problem with the Direct Linear
Transform, which gets us really close. Then we formulate a nonlinear cost function
based on the weighted reprojection error in the 2D image plane and use gradient
descent to find the optimal camera parameters. Using weights on each 2D-3D corre-
spondence gives the artist a little bit more control about where they want the error to
end up. That’s the essence of Track.
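As a rough illustration of the two ingredients described above, the following Python sketch estimates a camera matrix with the Direct Linear Transform and evaluates a weighted reprojection error. It is not the Track software: the function names, the NumPy implementation, and the omission of coordinate normalization and of the nonlinear refinement step are all simplifying assumptions.

```python
import numpy as np

def dlt_resection(X, x):
    """Estimate a 3x4 camera matrix (up to scale) from n >= 6 correspondences.

    X: (n, 3) surveyed 3D points; x: (n, 2) matching image points.
    Hypothetical helper; no coordinate normalization is applied.
    """
    n = X.shape[0]
    Xh = np.hstack([X, np.ones((n, 1))])          # homogeneous 3D points
    A = np.zeros((2 * n, 12))
    for k in range(n):
        u, v = x[k]
        A[2 * k, 0:4] = Xh[k]                     # p1.X - u * p3.X = 0
        A[2 * k, 8:12] = -u * Xh[k]
        A[2 * k + 1, 4:8] = Xh[k]                 # p2.X - v * p3.X = 0
        A[2 * k + 1, 8:12] = -v * Xh[k]
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)                   # right singular vector of the smallest singular value

def weighted_reprojection_error(P, X, x, w):
    """Sum over correspondences of w_k times the squared 2D reprojection error."""
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])
    proj = Xh @ P.T
    proj = proj[:, :2] / proj[:, 2:3]             # perspective division
    return np.sum(w * np.sum((proj - x) ** 2, axis=1))
```

In a full solver, the DLT result would serve only as the starting point, and a cost like `weighted_reprojection_error` would be handed to a nonlinear optimizer. Raising an entry of `w` makes the corresponding survey point pull harder on the solution.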
When we don’t have an accurate 3D survey, it’s basically an iterative structure
from motion problem. The artist chooses corresponding points in the images. We
can initially estimate the fundamental matrix for each pair of images and then back
the camera matrices out. Then we go back to that gradient descent algorithm where
we’re solving for the rotation and translation parameters of the camera and the depths
of the feature points at the same time.
As with all structure from motion algorithms, it works a lot better if you know
the focal length beforehand. Then you’re using the essential matrix instead of the
fundamental matrix and you don’t have that projective ambiguity. Without a good
knowledge of the focal length, the solve can sometimes drift a little bit and the artist
may need to put in more constraints, for example that two 3D lines are known to
come together at a right angle.
You can never really tell what the camera is going to be doing. Sometimes the
director will say, “This is a nodal move — just camera rotation — so you don’t have
to worry about the camera track.” Then we’ll look at it and find out that actually,
it’s not quite nodal, the camera’s moving just a little bit, and that needs to be solved
for in order to get CGI elements to stick to the scene. We’ve found that off-the-shelf
software can have difficulty in some of these almost nodal situations, but they aren’t
a problem for Track as long as we have enough precision in our 3D set survey and 2D
point locations.
Another important issue is that, in a feature film, the cameras are often moving
like crazy, and points that you’re tracking will move in and out of the frame. Since our
artists aren’t selecting and tracking a ton of points in each frame, when a point leaves
the frame, you’re basically releasing a constraint, and the camera track will do a wiggle
because that constraint is gone. In those cases, the artist will often “hallucinate” the
position of the 2D feature point after it moves out of the frame, and lower the weight
on that point gradually to prevent that wiggle.
Most smaller visual effects houses use commercial software such as boujou to do
camera tracking. Michael Capton, 3D supervisor at LOOK Effects in Los Angeles, CA,
discusses matchmoving in this context.
Capton: Matchmoving has always been a visual effects problem, anytime you have
CG objects. In the early days they shot a lot more plates that were just static, so you
didn’t have to worry about camera tracking as long as you got the perspective right.
When I first started in the industry, they didn’t have these automatic camera track-
ing programs. You had to do it by hand, starting with the background plate, and just
manually put the camera in by eye, trying to match the motion of different features.
What now takes a simple program an hour to do would take weeks.
I remember an old commercial I worked on to advertise an SUV. The idea was
to show the SUV in place of a lifeboat on a cruise ship, to get across the idea that
it was so safe. They shot the Queen Mary with a fairly long lens, using a passing
shot to make it look like it was traveling through the ocean. That was a really long
shot, like 900 frames. It took three or four weeks to do a solid matchmove to put
that SUV into the plate and make it not move or shake. I was lucky enough to find
a book about the ship that had actual drawings of the rigging that held the lifeboats
in place, which helped a lot in figuring out the real shape, size, and dimensions of
things.
It was very tedious; you’d position the camera in space in the first frame, rough
in the shot and go twenty frames more, and then try to position the camera in space
again, and do that over the entire shot. Then you’d go back and fill in the moments in
between and the moments in between until the track was good enough. Depending
on how bad the camera move is, you might be doing it frame by frame through the
whole thing. Often you’d get to a point doing it by eye and then notice that it wasn’t
working — you were slowly falling off to one side or the camera was doing something
it shouldn’t have, and you had to throw that out and start again. It was terrible!
Nowadays, you’d be able to run that kind of shot through a piece of commercial
software and it would give you, if not an entirely solid track, something good enough
that you could tweak a little bit by hand to get the final result. On the other hand,
that early tedious experience is beneficial to me, since when the commercial software
doesn’t work or I need to improve its result, I have that background of being trained
to do it by eye. Some other people have only ever known commercial software, and
if it doesn’t track something they’re like, “Well, I can’t do anything.” No, you can try
and do it by eye!
RJR: Can you describe how an artist interacts with a matchmoving program to track a
difficult shot?
Capton: These days, it’s great that there are a variety of commercial software packages
for camera tracking. There’s boujou, PFTrack, SynthEyes, and several more. I’ve found
that if one program doesn’t work on a shot, sometimes I’ll take it into another program
and for whatever reason that one is able to track it. Another common trick on a difficult
shot is reversing it in time and feeding it back into the program — tracking the shot
backward. Over the years, you get a sense of the best way to approach a shot. For
example, if the shot starts out with a whip pan and ends on something more stable,
you know your program may fail since it won’t know what feature points to pick until
it gets to the very end, so you reverse the shot to start off with solid points. You get
accustomed to what the program likes and dislikes and learn to work with it.
It’s definitely a million times better than doing it by hand, but as far as the per-
centage of shots that track well right out of the gate, without any user interaction, it’s
surprisingly inconsistent. Some shots I think oh, this’ll be easy, like a day of my time
with an hour of tracking, and it turns out I’ll have to tweak the software for a day and
a half just to track the shot. And some shots, I think, wow, this is going to be really
hard and I send it through the autotracker and it’s like click, I’m done. A lot of it has
to do with the amount of parallax going on in the shot, how deep the distance is —
sometimes if the shot is really shallow the software gets confused.
RJR: Once you get the camera track, are the 3D point locations that were simultaneously
estimated useful?
Capton: Sometimes. We worked on the visual effects for the last season of Lost.
There were a couple of shots in the final episode where a cliffside was supposed to be
crumbling and falling away. Based on the camera track we had a good idea of the 3D
locations of a sparse set of points on the cliffside. Those enabled us to build a simple
3D model, so we could create small rocks that appeared to bounce off the nooks and
crannies of that cliff all the way down. We probably could have done that by eye, but
having a sparse point cloud to base it on definitely helped, instead of having to guess
how far away the cliff was from the camera. When the autotracker gives you a decent
point cloud, it’s almost like a poor man’s LiDAR scan.
As another example, say we tracked a room and need to create an artificial semi-
reflective object in the middle of it. We could use the 3D points from the autotracker
to build really simple geometry to mimic the room, project the plate back onto the
geometry, and use that to create reflections onto the object. It’s kind of a cheat to cre-
ate a sense of the environment when we don’t have any other reference information
about the scene.
to the sequence to introduce parallax and add stability to the solution. The match-
mover also must incorporate on-set measurements whenever they are available, such
as the height of the camera or 3D locations of surveyed points, and be able to assess
what information should be collected while a shot is being acquired to simplify the
camera tracking solution. The book by Dobbert [122] is an excellent reference on the
practical aspects of matchmoving for the visual effects industry.
There are now several software packages for production-quality camera match-
moving based on the algorithms discussed in this chapter. These include boujou (sold
by Vicon), PFTrack (sold by The Pixel Farm), Matchmover (sold by Autodesk), Syn-
thEyes (sold by Andersson Technologies), and the freeware packages Voodoo (created
at the University of Hannover) and Bundler (created at the University of Washington).
It is possible to relax the assumption that the scene observed by the cameras is
rigid (i.e., static) in all views. For example, Bregler et al. [66] extended a factorization
approach to allow the observed scene points to be an unknown linear combination
of unknown basis shapes. The camera pose, basis shapes, and linear coefficients are
obtained by successively factoring a measurement matrix similar to Equation (6.38),
assuming the camera is orthographic. Torresani et al. [496] proposed a probabilistic
approach to the same problem, assuming that shapes are drawn from a Gaussian
probability distribution with unknown parameters. These methods are only appro-
priate when a single deformable object (e.g., a face) dominates the scene. We will
address similar issues in more detail in the next chapter.
We mentioned the concepts of image-based video stabilization and re-
cinematography in Section 22. If we apply matchmoving to an image sequence, we
can then smooth the camera path in 3D to remove translational and rotational jit-
ter [296], re-rendering the sequence to make it more pleasing. Camera localization
techniques can also help automate rephotography, the attempt to exactly duplicate
the vantage point of a historical photo in modern day [24], as well as algorithms to
automatically infer the order in which historical photos were taken [428].
While we focused exclusively on image-based methods for camera tracking in
this chapter, the same problem can also be solved with high precision by several
additional means. Welch and Foxlin [542] give a good survey of many different real-
time camera tracking systems. These systems can be based on mechanical sensing
(e.g., using potentiometers or shaft encoders), inertial sensing (e.g., gyroscopes and
accelerometers), acoustic sensing (e.g., ultrasonic ranging), magnetic sensing (com-
mon in head-mounted displays), and optical sensing (e.g., active lighting using visible
or infrared LEDs). We will discuss several of these technologies in the context of
motion capture in the next chapter.
6.1 Find the values of dx and dy in Equation (6.2) for a consumer digital camera.
6.2 Consider a 4096 × 2160 digital image taken using a camera with principal
point at the center (i.e., x0 = 2048, y0 = 1080) and pixels that are physically
6.7 µm square. Suppose the lens distortion parameters for the camera are
$\kappa_1 = 10^{-3}$, $\kappa_2 = 0$. What is the observed (distorted) position of a pixel whose
ideal projection is at (x, y) = (3000, 200)?
These correspond to two cameras whose centers are on the world plane
Z = 0 and whose image planes are parallel to this plane (rectified images
are a special case of this situation).
a) If $X = [X, Y, Z, 1]^\top$ is the homogeneous coordinate of an arbitrary scene
point and $x$ and $x'$ are the homogeneous coordinates of its projections
in the resulting images, show that for any fixed value of $s \in [0, 1]$,
\[ (1 - s)\,x + s\,x' \sim P_s X \tag{6.75} \]
6.7 Determine the 2 × 5 linear system for the elements of ω implied by one
planar projective transformation (i.e., determine Ai in Equation (6.25) as a
function of the elements of Hi ).
6.8 Show how Equation (6.25) can be simplified (that is, the estimation can be
taken over fewer parameters) if either:
a) the aspect ratio αy /αx is known, or
b) the principal point is known.
6.9 Show how the rotation and translation parameters $r_1^i$, $r_2^i$, $r_3^i$, $t^i$ in
Equation (6.20) corresponding to the camera position for each view of the
stationary plane can be obtained once the camera is calibrated (i.e., $K$ and
$H_i$ are known). What is a possible problem with this technique when dealing
with noisy data?
6.10 In this problem, we’ll derive the form of the fundamental matrix given in
Equation (6.29). Remember that the equation of the epipolar line in the
second image for a fixed (x, y) in the first image is given by Equation (5.34):
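\[
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}^{\top} F \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = 0 \tag{6.77}
\]
\[
F = [K' t]_{\times} K' R K^{-1} \tag{6.79}
\]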
b) Show that the image projections are unchanged by this similarity trans-
formation — that is, that x ∼ PX ∼ P̂ X̂. (Note: this is just a special case
of Equation (6.30)).
6.12 Assuming a calibrated stereo rig in the canonical form of Equation (6.32),
determine the solution to the triangulation problem for a correspondence
between the points $(x, y)$ in the first image and $(x', y')$ in the second
image implied by Figure 6.9. Hint: the 3D location of the midpoint can
be computed as the solution to a simple linear least-squares problem.
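6.13 Determine $P$ and $P'$ in the canonical form of Equation (6.31) if the
fundamental matrix for an image pair is given by
\[
F = \begin{bmatrix}
1.7699 \times 10^{-6} & 1.0889 \times 10^{-5} & -1.6599 \times 10^{-2} \\
-3.3788 \times 10^{-6} & 7.1503 \times 10^{-10} & 8.0432 \times 10^{-4} \\
1.4372 \times 10^{-2} & -2.9790 \times 10^{-3} & 1
\end{bmatrix} \tag{6.81}
\]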
and continue onward through the sequence. What’s the problem with this
approach? (Again, this is an issue that can be addressed with the trifocal
tensor; see, e.g., [145].)
6.17 Verify that combining Equations (6.42)–(6.44) results in the form of H given
by Equation (6.45).
6.18 Verify the steps of Equation (6.48) that relate $\omega_i$ to $Q$. Hint: first show that
$(A_i - a_i v^\top) K_1 = K_i R_i$. Then note that $\omega_i^{-1} = (K_i R_i)(K_i R_i)^\top$ since $R_i$ is a rotation
matrix.
6.19 Explicitly determine the first row of the 4 × 5 linear system for the elements
of Q in terms of the elements of Pi corresponding to Equation (6.51).
6.20 Show that the matrix Q in Equation (6.48) is related to the projective-to-
Euclidean upgrade matrix H by
\[
Q = H \begin{bmatrix} I_{3\times 3} & 0_{3\times 1} \\ 0_{3\times 1}^{\top} & 0 \end{bmatrix} H^{\top}
\]
\[ R_i r_i = r_i \tag{6.83} \]
\[ 2\cos(\|r_i\|) = \operatorname{trace}(R) - 1 \tag{6.84} \]
\[ 2\,\mathrm{sinc}(\|r_i\|)\, r_i = (R_{32} - R_{23},\; R_{13} - R_{31},\; R_{21} - R_{12})^{\top} \tag{6.85} \]
1 The end-to-end process from suiting up a performer to animating a character is sometimes called
performance capture. The term motion capture generally refers to the specific technology of 3D
marker acquisition, independent of its subsequent use.
256 Chapter 7. Motion Capture
Figure 7.1. The motion capture problem. (a) A performer wearing a suit of markers is imaged by
several calibrated cameras. (b) The markers are detected and triangulated to determine their 3D
positions. (c) The markers are fit to a skeleton that can help to drive (d) a 3D animated character.
Today, there are two primary types of production-quality motion capture technol-
ogy. The first approach is magnetic: the performer wears a suit instrumented with
small receivers that can accurately determine their three-dimensional position and
orientation with respect to an external electromagnetic field. The second approach is
optical: in this case, the performer’s suit is fitted with special markers whose three-
dimensional position is inferred by an array of surrounding cameras. While magnetic
motion capture systems are relatively inexpensive and each receiver is always “visi-
ble,” they are sensitive to metal in their environment (commonly found in soundstage
walls and computer monitors), which can degrade the tracking output. Earlier mag-
netic systems also required wires and cables snaking around the performer and a
physical attachment to a computer, which can impede natural motion.2
In this chapter, we focus exclusively on optical motion capture, exemplified by the
industry-standard systems produced by Vicon Motion Systems. The performance
to be recorded takes place in a large room containing a defined capture volume,
a space several meters in each dimension. The capture volume is surrounded by
between six and fifty cameras, each of which is circled by a strobe light of infrared
LEDs, as illustrated in Figure 7.2. These LEDs strongly illuminate the capture volume
with light not visible to the human eye. The cameras are all temporally synchronized
with each other and with the strobe lights.
Figure 7.2. (a) A typical motion capture volume. (b) A motion capture camera surrounded by
infrared LEDs.
2 Other possibilities for motion capture include inertial systems based on gyroscopes and accelerometers,
or exoskeletons that directly measure joint angles. Neither is commonly used to collect data
for animation and visual effects.
Figure 7.3. (a) A calibration device used to precisely spatially calibrate cameras for motion cap-
ture. (b) An image of the calibration device (circled) from one of the infrared cameras, showing
the highly retro-reflective markers. The other bright spots in the image are the infrared strobe
lights from different cameras in the capture volume.
The camera system is precisely spatially calibrated before each capture session
using a special device, as illustrated in Figure 7.3a. This device is usually a rigid
wand with several markers at measured intervals along its length. As the wand is
moved through the capture volume and observed by the surrounding cameras, it
generates feature matches across all the images. As illustrated in Figure 7.3b, finding
and tracking the markers on the device is an easy image processing problem since
under infrared light they appear as bright dots on a dark background. Therefore, in
each camera at each point in time, we can uniquely identify several image feature
points (which can be disambiguated due to the uneven spacing of the markers).
Collecting all the device observations from each camera together gives a set of image
feature matches $\{(x_{ij}, y_{ij})\}$, where i ranges over the cameras and j ranges over the
unknown 3D device marker locations. This collection of feature matches provides the
input for a multicamera calibration problem that can be solved exactly as described in
Section 6.5. The known physical distance between the markers is used to resolve the
scale ambiguity that remains after Euclidean reconstruction. The reconstruction errors
obtained with this controlled procedure are very low: less than a millimeter.
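The scale-fixing step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular capture system: the function names are hypothetical, two wand markers are assumed to have been reconstructed in every frame in arbitrary units, and the median is one reasonable choice for suppressing occasional bad frames.

```python
import numpy as np

def metric_scale(marker_a, marker_b, known_distance):
    """Scale factor mapping reconstruction units to physical units.

    marker_a, marker_b: (F, 3) reconstructed positions of two wand markers
    over F frames; known_distance: their measured physical separation.
    """
    measured = np.linalg.norm(marker_a - marker_b, axis=1)
    return known_distance / np.median(measured)

def apply_scale(points, scale):
    """Rescale 3D points (or camera centers) by a common factor."""
    return scale * np.asarray(points, dtype=float)
```

The same factor would be applied to every reconstructed point and every camera center, since a uniform scaling of the whole scene leaves all image projections unchanged.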
The performer wears a tightly fitting body suit with spherical markers carefully
attached near joints of his or her body. The markers range from five to thirty mm in
diameter, and are retro-reflective, meaning that they strongly reflect light rays along
the vector pointing directly back at the light source.3 Therefore, the markers are easy
to distinguish in each camera’s infrared image, since they appear to be extremely
bright, as illustrated in Figure 7.4. The retro-reflectivity is also important since each
camera has little time to gather light from the scene due to the very high frame rate
of motion capture systems (e.g., 120 Hz) required to capture fast motion.4
3 This phenomenon is similar to a bicycle reflector or a cat’s eye, and is typified by 3M’s Scotchlite
material.
4 An alternate optical approach gaining in popularity is the use of active-lighting markers, such
as small red LEDs that encode a unique identifier by blinking at high speeds, as in the systems
produced by PhaseSpace, Inc.
Figure 7.4. (a) A visible-spectrum image of markers on a performer. (b) An aligned infrared
image, showing that the retro-reflective markers appear as extremely bright dots on a dark
background.
Figure 7.5. An example of marker placement for motion capture. This configuration uses forty-
five markers.
The performer’s motion capture suit is typically outfitted with between thirty and
fifty markers. Motion capture technicians use principles from biomechanics to care-
fully and repeatably place markers on the performer’s joints to give the most useful
information about the motion of his or her underlying skeleton. For this reason,
the motion capture suit must be tight-fitting, and the markers placed so that they
don’t slide around the joints. In some cases, markers are directly attached to the
skin. Figure 7.5 illustrates a standard configuration for marker placement, designed
so that groups of markers work together to define the position and rotation of joints.
For example, the four markers on the front and back waist form a quadrilateral that
defines the motion of the pelvis. We’ll discuss the relationship between the mark-
ers and the underlying skeleton in more detail in Section 7.4. It’s also common to
add additional redundant markers to help the capture system cope with occlusions,
or to easily distinguish one performer from another. Menache [323] gives a detailed
discussion of marker placement and its biomechanical motivation.
The first problem is to determine the three-dimensional locations of the markers from
their projections in the cameras’ images. This may seem difficult since all the markers
in a typical motion capture setup look exactly the same. However, since the camera
array has been precisely calibrated, we can compute the epipolar geometry between
any pair of cameras (see Section 6.4.1), as well as higher-order image relationships
like the trifocal tensor. This means that correct correspondences are generally easy to
obtain, since there are only tens of markers visible in each image and it’s unlikely that
incorrect matches will be consistent with the epipolar geometry between multiple
pairs of images.
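To make the epipolar test concrete, here is a small Python sketch that builds the fundamental matrix from two known camera matrices and measures how far a candidate marker in the second image lies from the epipolar line of a marker in the first. The construction $F = [e']_\times P' P^{+}$ is the standard one for known projection matrices; the function names and any acceptance threshold are assumptions made for illustration.

```python
import numpy as np

def skew(v):
    """3x3 cross-product matrix, so that skew(v) @ u equals np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_from_cameras(P1, P2):
    """Fundamental matrix for two 3x4 camera matrices."""
    _, _, Vt = np.linalg.svd(P1)
    C1 = Vt[-1]                       # homogeneous center of camera 1 (P1 @ C1 = 0)
    e2 = P2 @ C1                      # epipole in image 2
    return skew(e2) @ P2 @ np.linalg.pinv(P1)

def epipolar_distance(F, x1, x2):
    """Distance in pixels from x2 (image 2) to the epipolar line of x1 (image 1)."""
    line = F @ np.append(x1, 1.0)
    return abs(line @ np.append(x2, 1.0)) / np.hypot(line[0], line[1])
```

A candidate pair would be kept only if this distance is below a small threshold, and requiring agreement across several camera pairs removes most of the remaining ambiguity.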
Therefore, the problem of 3D marker estimation is one of triangulation, as dis-
cussed in Section 6.4.2. More precisely, let’s assume that a marker is observed at image
coordinates (xi , yi ) in the image from camera i, and that we have M total images of
the marker. M is usually less than the total number of cameras in the system due to
the substantial self-occlusion of the human body (for example, the sternum marker
will not be visible to a camera looking at the back of a performer). Of course, M must
be at least two to obtain a triangulation; we deal with the case of missing markers
later in this section.
A good initial guess for the marker location is the point in 3D that minimizes the
sum of squared distances to each of the M rays from the camera centers through the
observed image coordinates, as illustrated in Figure 7.6.
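The ray for camera i can be expressed as $\tilde{C}_i + \lambda V_i$, where $V_i$ is the 3D unit vector
pointing from the camera center $C_i$ to the 3D point given by $(x_i, y_i)$ on the image
plane, and $\tilde{C}_i$ is computed as $\tilde{C}_i = C_i - (C_i^\top V_i) V_i$.5 Then the point X that minimizes
the distance
\[
\min_{\lambda_i} \sum_{i=1}^{M} \left\| X - (\tilde{C}_i + \lambda_i V_i) \right\|^2 \tag{7.1}
\]
is given by
\[
X = \left( I_{3\times 3} - \frac{1}{M} \sum_{i=1}^{M} V_i V_i^\top \right)^{-1} \left( \frac{1}{M} \sum_{i=1}^{M} \tilde{C}_i \right) \tag{7.2}
\]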
This solution can be refined by using it as an initial point for the nonlinear min-
imization of the sum of squared reprojection errors of X onto the image planes, as
described by Andersson and Betsis [14].
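The closed-form initial guess translates directly into code. The sketch below is one possible NumPy implementation, with hypothetical names: it takes the camera centers and ray directions, forms the point on each ray that is perpendicular to the ray direction, and solves the 3 × 3 linear system. The nonlinear refinement of the reprojection error is not shown.

```python
import numpy as np

def triangulate_rays(centers, directions):
    """Point minimizing the sum of squared distances to M rays.

    centers: (M, 3) camera centers; directions: (M, 3) ray directions
    (normalized internally). Requires that the rays are not all parallel.
    """
    V = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    # Point on each ray orthogonal to its direction: C - (C . V) V
    C_perp = centers - np.sum(centers * V, axis=1, keepdims=True) * V
    M = V.shape[0]
    A = np.eye(3) - (V.T @ V) / M     # I - (1/M) * sum of V_i V_i^T
    b = C_perp.mean(axis=0)           # (1/M) * sum of the perpendicular points
    return np.linalg.solve(A, b)
```

With noise-free rays that all pass through one point, the function returns that point; with noisy rays it returns the least-squares compromise between them.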
5 $\tilde{C}_i$ is not the camera center, but a different point on the ray chosen so that $\tilde{C}_i^\top V_i = 0$.
Figure 7.6. Triangulation for motion capture. A good initial guess is the 3D point that minimizes
the sum of squared distances to the M rays from each camera.
If the number of cameras that see a certain marker is too small (e.g., due to body
self-occlusions) or the images of the markers are of low quality (e.g., due to very fast
motion resulting in marker blur), then some of the markers’ 3D locations may be
noisy, or in the worst case, entirely missing. Raw motion capture data often must
be semiautomatically processed after acquisition to ensure that each marker has a
complete 3D trajectory. The most straightforward approach is to treat the triangu-
lated positions of a particular marker j as samples of a three-dimensional time series,
Xj (t). We can then apply all the tools of one-dimensional signal processing to the X ,
Y and Z samples. For example, a B-spline can be fit through the 3D sample loca-
tions and used to estimate a missing marker, as illustrated in Figure 7.7a.6 Fitting
smooth curves to partial marker trajectories and extrapolating can also help deter-
mine whether broken trajectories caused by a long string of missing markers should
be merged, as illustrated in Figure 7.7b.
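As a simple stand-in for the B-spline fit mentioned above, the following Python sketch fills each missing frame by fitting a low-order polynomial to the nearest valid samples of each coordinate. The local polynomial fit is a swapped-in technique chosen to keep the example dependency-free; the function name, the NaN convention for missing markers, and the default neighborhood size are assumptions.

```python
import numpy as np

def fill_missing(t, X, n_neighbors=6, degree=3):
    """Fill missing samples of one marker trajectory.

    t: (T,) sample times; X: (T, 3) positions with NaN rows where the
    marker is missing. Each coordinate is treated as a 1D signal.
    """
    X = X.copy()
    valid = ~np.isnan(X).any(axis=1)
    t_valid, X_valid = t[valid], X[valid]
    for i in np.flatnonzero(~valid):
        # Indices of the valid samples closest in time to the missing one
        nearest = np.argsort(np.abs(t_valid - t[i]))[:n_neighbors]
        for d in range(3):
            coeffs = np.polyfit(t_valid[nearest] - t[i], X_valid[nearest, d], degree)
            X[i, d] = coeffs[-1]      # polynomial evaluated at time offset 0
    return X
```

Long gaps are a poor fit for any such local model, which is why extended dropouts are usually handled by the trajectory-merging or learning-based approaches instead.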
Alternately, Liu and McMillan [295] proposed to leverage a large set of training
data and approach filling in missing markers as a learning problem. First, a database
of K 3D marker sets is collected; each set is represented as a complete 3N × 1 vector
corresponding to the N observed 3D marker locations {Xj = (Xj , Yj , Zj ), j = 1, . . . , N }.
Principal component analysis (PCA) is applied to this collection of vectors to build
a linear model
X = Uβ + µ (7.3)
6 Any other form of scattered data interpolation can be applied, as described in Section 5.2.
Figure 7.7. (a) Interpolation of known marker positions (white dots) can help estimate missing
marker positions (gray dots). (b) Extrapolation of partial trajectories can help determine when
broken trajectories should be merged.
Solving the top half of Equation (7.4) for β using the known information gives
and plugging this back into the bottom half of Equation (7.4) gives the unknown
marker locations as
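The two-step procedure just described (solve for β from the known markers, then predict the unknown ones) can be sketched as follows. This is an illustration under assumed notation, not a reproduction of the equations of Liu and McMillan: the rows of U and µ are split by a boolean mask `known`, β is found by least squares, and the function names are hypothetical.

```python
import numpy as np

def fit_pca(train, n_components):
    """PCA basis U (3N x k) and mean mu (3N,) from a (K, 3N) training matrix."""
    mu = train.mean(axis=0)
    _, _, Vt = np.linalg.svd(train - mu, full_matrices=False)
    return Vt[:n_components].T, mu

def fill_with_pca(x, known, U, mu):
    """Estimate the unknown entries of x from the known ones via X = U beta + mu.

    x: (3N,) marker vector (unknown entries may hold anything, e.g. NaN);
    known: (3N,) boolean mask of trusted entries.
    """
    # Least-squares solve of the known rows for the PCA coefficients beta
    beta, *_ = np.linalg.lstsq(U[known], x[known] - mu[known], rcond=None)
    filled = x.copy()
    # Predict the unknown rows from the same coefficients
    filled[~known] = U[~known] @ beta + mu[~known]
    return filled
```

The number of known coordinates must be at least the number of PCA components for β to be determined, which is one reason a compact basis is attractive here.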
Since a global linear model like Equation (7.3) will probably do a poor job of
representing the underlying nonlinear relationships in motion capture data, we can
replace Equation (7.6) with a weighted combination of linear models, each learned
locally on a small segment of simple behavior. Since missing markers are likely only
weakly correlated with faraway known markers (e.g., an ankle marker will not help
much to estimate a missing wrist marker), better results may be obtained by learning
a PCA model only over the missing marker and the markers nearby on the body.
However, this approach works best when the markers are spaced very densely on the
body (e.g., as described in the experiment by Park and Hodgins [360]). Lou and Chai
[302] proposed to incorporate time lags of the vectors into the PCA, as well as a robust
estimator to reduce the influence of outliers.
If a marker trajectory contains high-frequency jitter, it can be smoothed, but we
must be careful not to remove the nuanced motions that motivate motion capture
in the first place; these details are inherently high-frequency. Sharp motions like
punches or kicks also contain important high-frequency components whose removal
would make the captured motion useless. We should also be aware that a too-low-
dimensional PCA basis may abstract away subtle performance details that project
onto the higher-order modes.
7.3 Forward Kinematics and Pose Parameterization
While a motion capture system returns the 3D trajectories of each marker, this infor-
mation is rarely directly useful to an animator. Instead, we prefer to estimate the
pose of a character, as described by the joint angles of an articulated skeleton of the
human body, as pictured in Figure 7.8. The skeleton is made of rigid elements (i.e.,
bones) connected by joints. Each joint is classified as spherical, meaning that it has
three degrees of rotational freedom (such as the ball-and-socket joints of the hip and
shoulder), or revolute, meaning that it has one degree of rotational freedom (such as
the hinge joints of the knee and elbow).7 Together, the skeleton and classification of
its joints form a kinematic model of the body. A kinematic model is generally much
more useful to animators, since the skeleton and joints can be mapped onto an animated
character. Joints with zero degrees of freedom at the end of each kinematic
chain (i.e., the head, hands, and balls of the feet) are called end effectors, a term from
robotics.
Figure 7.8. A kinematic model of the human body. Spherical joints are indicated as gray circles
and revolute joints are indicated as white circles. The root of the kinematic model is shown as
the large gray circle. End effectors are shown as black circles.
7 These are in fact highly simplified approximations to the underlying biomechanics of the skeleton.
For example, the human shoulder complex is actually composed of three joints, some of which slide
as well as rotate, and the wrist and ankle have more degrees of freedom than a simple hinge.
We also need to specify the absolute orientation and position in world coordinates
of the root of the body, usually chosen as the center of gravity near the pelvis (the large
gray circle in Figure 7.8). In total, a kinematic model of the body typically has between
thirty and fifty degrees of freedom (i.e., independent parameters), depending on
the level of detail. The bone lengths of the skeleton may also be treated as degrees
of freedom to be estimated, though they are often estimated once prior to motion
capture and treated as constant throughout the session.
We use forward kinematics to determine the 3D coordinates of a point on the
skeleton given the joint angles of the kinematic model; we can think of this as a
change of coordinates. For example, consider the simple model of the human arm
in Figure 7.9, which has a spherical joint at the shoulder and a revolute joint at the
elbow, specified by rotation matrices Rs and Re respectively. Suppose the length of
the upper arm is given by lu and the length of the forearm is given by lf . We assume
that the shoulder is fixed in place at the world origin and has a coordinate system
aligned with the world coordinate system.
We compute the 3D position of the wrist joint Xw given values of Rs , Re , lu , and
lf by composing two transformations: one to determine the elbow’s position based
on the shoulder location/rotation and upper arm length, and one to determine the
wrist’s position based on the elbow location/rotation and forearm length:
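\[
X_w = R_s \left( R_e \begin{bmatrix} l_f \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} l_u \\ 0 \\ 0 \end{bmatrix} \right) \tag{7.7}
\]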
That is, if all the angles are 0, the arm points straight along the world x-axis and the
wrist is located at (lu + lf , 0, 0). If the elbow angle is 0, the arm points straight along
the axis specified by the first column of Rs .
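A short Python sketch of the two-link arm makes these special cases easy to check. It reads Equation (7.7) as the shoulder rotation applied to the sum of the rotated forearm vector and the upper-arm vector; the function names and the choice of the z-axis as the elbow hinge axis are assumptions for illustration.

```python
import numpy as np

def rot_z(theta):
    """Rotation by theta radians about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])

def wrist_position(R_s, R_e, l_u, l_f):
    """Wrist position for a shoulder fixed at the world origin."""
    forearm = R_e @ np.array([l_f, 0.0, 0.0])     # forearm in the elbow frame
    upper_arm = np.array([l_u, 0.0, 0.0])         # elbow offset in the shoulder frame
    return R_s @ (forearm + upper_arm)
```

For instance, with identity rotations the wrist lands at (l_u + l_f, 0, 0), and bending only the elbow by 90 degrees about z moves it to (l_u, l_f, 0).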
Forward kinematics for the full kinematic model of the body are similar; the world
coordinates of a point on the skeleton can be determined by following the kinematic
chain from the root along the bones to the given point. We simply apply the general
formula for a kinematic chain,
\[
\begin{bmatrix} R_1^K & O_1^K \\ 0 & 1 \end{bmatrix}
= \begin{bmatrix} R_1^2 & O_1^2 \\ 0 & 1 \end{bmatrix} \cdots \begin{bmatrix} R_{K-1}^K & O_{K-1}^K \\ 0 & 1 \end{bmatrix} \tag{7.8}
\]
where $R_i^j$ is the rotation matrix specifying the orientation of the j th coordinate frame
with respect to the i th coordinate frame, and $O_i^j$ gives the coordinates of the j th joint
in the i th coordinate frame.
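The chain formula amounts to multiplying 4 × 4 homogeneous transforms from the root outward. The following Python sketch shows the idea with hypothetical helper names; each link is given as a rotation and an offset expressed in its parent's frame, which is how the formula above is read here.

```python
import numpy as np

def homogeneous(R, O):
    """4x4 transform from a 3x3 rotation R and a 3-vector offset O."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = O
    return T

def chain(transforms):
    """Compose link transforms ordered from the root to the last joint."""
    T = np.eye(4)
    for T_link in transforms:
        T = T @ T_link
    return T

def to_world(transforms, point_local):
    """World coordinates of a point given in the last joint's frame."""
    return (chain(transforms) @ np.append(point_local, 1.0))[:3]
```

A two-link arm is the special case of two transforms: the shoulder rotation with zero offset, followed by the elbow rotation offset by the upper-arm length along x.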
A key issue for working with motion capture data is the parameterization of the
rotation matrix at each joint. Using three Euler angles (i.e., rotations about the x, y,
and z axes) is a poor choice since they are difficult to naturally interpolate and suffer
from “gimbal lock,” a singularity (loss of a degree of freedom) that results when one
of the angles is near a critical value. Instead, pose is typically parameterized using
quaternions or twists.
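Gimbal lock is easy to reproduce numerically. The sketch below assumes one particular Euler convention (rotation about x, then y, then z, composed as $R_z R_y R_x$); other conventions lock at analogous critical angles. When the middle angle is 90 degrees, adding the same amount to the first and last angles leaves the rotation unchanged, so one degree of freedom has been lost.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def euler_zyx(yaw, pitch, roll):
    """Rotation matrix for the assumed convention R = Rz(yaw) Ry(pitch) Rx(roll)."""
    return rot_z(yaw) @ rot_y(pitch) @ rot_x(roll)

# At pitch = 90 degrees, shifting yaw and roll together changes nothing.
locked_a = euler_zyx(0.3, np.pi / 2, 0.5)
locked_b = euler_zyx(0.5, np.pi / 2, 0.7)
# Away from the critical angle, the same shift gives a different rotation.
free_a = euler_zyx(0.3, 0.4, 0.5)
free_b = euler_zyx(0.5, 0.4, 0.7)
```

Here `locked_a` and `locked_b` agree to numerical precision, while `free_a` and `free_b` do not.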
A quaternion represents a rotation with a unit vector in $\mathbb{R}^4$, and is closely related
to the axis-angle parameterization discussed in Section 22. In particular, the unit
quaternion given by
\[
q = \left( \cos\frac{\theta}{2},\; v_1 \sin\frac{\theta}{2},\; v_2 \sin\frac{\theta}{2},\; v_3 \sin\frac{\theta}{2} \right) \tag{7.9}
\]
w = ρ ×v (7.10)
Figure 7.10. Any rigid motion can be expressed as a rotation around some axis followed by a translation
along the same axis. (Figure labels: origin of world coordinate system, screw axis, coordinate system
after rigid motion.)
The notation exp in Equations (7.11)–(7.12) denotes the exponential map of a matrix,
which can be computed by a matrix Taylor series (see Problem 7.6).8 In this case,
the required exponential map in Equation (7.12) is given by the Rodrigues formula
in Equation (6.58) and is simply the matrix rotation corresponding to the axis-angle
parameters. Twists were introduced to the animation community by Bregler et al. [67],
and are commonly used in robotics applications [342].
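To make the link between the matrix Taylor series and the closed-form rotation concrete, the sketch below compares the two numerically. It uses the standard form of the Rodrigues formula, R = I + sin(θ) K + (1 − cos θ) K², where K is the cross-product matrix of the unit axis; we assume this agrees with Equation (6.58), which is not reproduced here. The function names and the test axis are hypothetical.

```python
import math

def skew(v):
    """Cross-product (skew-symmetric) matrix of a 3-vector."""
    x, y, z = v
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def identity():
    return [[float(i == j) for j in range(3)] for i in range(3)]

def expm_taylor(m, terms=30):
    """Matrix exponential from the truncated series sum_n m^n / n!."""
    result = identity()
    power = identity()
    for n in range(1, terms):
        power = matmul(power, m)
        power = [[x / n for x in row] for row in power]  # now m^n / n!
        result = [[result[i][j] + power[i][j] for j in range(3)]
                  for i in range(3)]
    return result

def rodrigues(axis, theta):
    """Rotation matrix I + sin(theta) K + (1 - cos(theta)) K^2 for a unit axis."""
    K = skew(axis)
    K2 = matmul(K, K)
    I = identity()
    return [[I[i][j] + math.sin(theta) * K[i][j]
             + (1.0 - math.cos(theta)) * K2[i][j]
             for j in range(3)] for i in range(3)]

axis = [1.0 / math.sqrt(3.0)] * 3   # unit axis (assumed example)
theta = 0.8                         # rotation angle in radians
K_theta = [[theta * x for x in row] for row in skew(axis)]

R_series = expm_taylor(K_theta)
R_closed = rodrigues(axis, theta)
max_diff = max(abs(R_series[i][j] - R_closed[i][j])
               for i in range(3) for j in range(3))
print(max_diff)
```

The two matrices agree to within floating-point error, which is why the closed form is preferred in practice over summing the series.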
Conversion between quaternions, twists, axis-angle representations, and rotation
matrices is straightforward (see [526], and Problems 6.21, 7.3, and 7.5). Therefore, we
can assume that a kinematic model for the human body is generally represented using
six degrees of freedom for the body’s root, and some number of “angles” (parame-
terized by either quaternions or twists) to represent the joints. In total, the number
of degrees of freedom in typical parameterizations of the human kinematic model
ranges from thirty to fifty.
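As an illustration of how simple these conversions are, the following sketch maps an axis-angle pair to the unit quaternion of Equation (7.9), converts the quaternion to a rotation matrix using the standard formula, and recovers the axis and angle again. The function names are hypothetical, and the scalar-first ordering (w, x, y, z) follows Equation (7.9).

```python
import math

def axis_angle_to_quaternion(axis, theta):
    """Equation (7.9): q = (cos(theta/2), v sin(theta/2)) for a unit axis v."""
    half = 0.5 * theta
    s = math.sin(half)
    return [math.cos(half), axis[0] * s, axis[1] * s, axis[2] * s]

def quaternion_to_axis_angle(q):
    """Recover (axis, theta) from a unit quaternion.

    The axis is arbitrary when theta is zero, so a default is returned.
    """
    w, x, y, z = q
    norm_v = math.sqrt(x * x + y * y + z * z)
    if norm_v < 1e-12:
        return [1.0, 0.0, 0.0], 0.0
    theta = 2.0 * math.atan2(norm_v, w)
    return [x / norm_v, y / norm_v, z / norm_v], theta

def quaternion_to_matrix(q):
    """Standard rotation matrix of a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

# A 90-degree rotation about the z-axis.
q = axis_angle_to_quaternion([0.0, 0.0, 1.0], math.pi / 2)
R = quaternion_to_matrix(q)
axis_back, theta_back = quaternion_to_axis_angle(q)
print(q, R, axis_back, theta_back)
```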
In the next section, we address the critical problem of the change of coordinates
from the Euclidean domain of 3D marker locations to the hierarchy of joints of a
kinematic model. This coordinate transformation is more difficult than the forward
transformations given earlier; instead we must solve an inverse kinematics problem.
In motion capture, we’re faced with the problem of determining the underlying
parameters of a kinematic model from observations of points on (or near) the
skeleton. Going forward, we’ll denote the kinematic model parameters by a vector θ,
and the observed skeleton points by a vector r. Let’s compactly denote the forward
kinematic relationship in Equation (7.8) by
r = f (θ ) (7.13)
We’d like to determine θ from a set of measured values of r; that is, to invert
Equation (7.13). Therefore, such problems are termed inverse kinematics. Unfor-
tunately, this inversion is problematic for several reasons. First, the relationship in
Equation (7.13) is highly nonlinear, involving products of trigonometric functions of
the parameters. Second, in some applications, the relationship in Equation (7.13) is
many to one; that is, there are more values of θ than of r. In such cases, the inverse
kinematics problem is underdetermined and has many feasible solutions.9
In this section, we describe several basic methods for solving the inverse kine-
matics problem. Algorithms for inverse kinematics were developed in the robotics
community many years before their application to motion capture and animation;
for example, see Chiaverini et al. [92]. The techniques based on dynamical systems
that we discuss in Section 7.7.1 can also be viewed as methods for inverse kinematics.
There’s an offset between the markers and the skeleton that we can’t ignore (see the
left side of Figure 7.8). While some markers can be placed on a performer’s suit/skin
fairly close to a joint, at places where the motion capture technician can easily feel
a bone, other markers are further from the underlying kinematic joint (e.g., those
on the shoulders and spine). This relationship can still be taken into account by
Equation (7.13); however, we need a model for the relationship between the markers’
position on the surface of the skin and the underlying bones. We discuss this issue
more in Section 7.4.4.
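\[ \frac{dr(t)}{dt} = \frac{\partial f(\theta)}{\partial \theta}(t)\, \frac{d\theta}{dt}(t) = J(t)\, \frac{d\theta}{dt}(t) \tag{7.15} \]
\[ J(t) = \frac{\partial f(\theta)}{\partial \theta}(t) \tag{7.16} \]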
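The velocity relationship in Equations (7.15)–(7.16) can be checked numerically. The sketch below is a minimal example, assuming a planar two-link arm as the forward kinematics function f and a central-difference approximation of the Jacobian; the arm, its lengths, and the function names are hypothetical stand-ins for a full body model.

```python
import math

L_U, L_F = 0.30, 0.25  # illustrative limb lengths (assumed values)

def f(theta):
    """Planar two-link forward kinematics: wrist position from two joint angles."""
    t1, t2 = theta
    return [L_U * math.cos(t1) + L_F * math.cos(t1 + t2),
            L_U * math.sin(t1) + L_F * math.sin(t1 + t2)]

def numerical_jacobian(func, theta, eps=1e-6):
    """Central-difference estimate of J[i][k] = d f_i / d theta_k."""
    base = func(theta)
    J = [[0.0] * len(theta) for _ in base]
    for k in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[k] += eps
        minus[k] -= eps
        f_plus, f_minus = func(plus), func(minus)
        for i in range(len(base)):
            J[i][k] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J

theta = [0.4, 0.9]        # current joint angles (radians)
theta_dot = [0.2, -0.1]   # joint velocities (radians per second)

# Equation (7.15): marker velocity predicted by the Jacobian.
J = numerical_jacobian(f, theta)
r_dot = [sum(J[i][k] * theta_dot[k] for k in range(2)) for i in range(2)]

# Direct estimate of dr/dt from two nearby poses, for comparison.
dt = 1e-6
r_now = f(theta)
r_next = f([theta[k] + dt * theta_dot[k] for k in range(2)])
r_dot_fd = [(r_next[i] - r_now[i]) / dt for i in range(2)]
print(r_dot, r_dot_fd)
```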
9 For example, if the wrist and shoulder positions are fixed, there is still one degree of freedom for
the elbow’s position — it can rotate in a circle.
268 Chapter 7. Motion Capture
10 Recall that we also defined a Jacobian in the context of matchmoving (Section 6.5.3.2), although
the Jacobian in that case had N P.
7.4. I n v e r s e K i n e m a t i c s 269
was interactively dragged. However, it’s important to note that if λ is nonzero, the
estimated joint angles will not correspond exactly with the observed marker locations;
that is, Equation (7.14) will not be satisfied exactly.
\[ C_{\mathrm{allpos}}(\theta(t)) = \sum_{j=1}^{N} \left\| f_j(\theta(t)) - X_j(t) \right\|^2 \tag{7.25} \]
where fj (θ(t)) is the forward kinematics model that determines the position of marker
j given the joint parameters.
We might also want to specify vector constraints — for example, that the vector
between two observed markers should be parallel to a certain limb in the kinematic
model. A corresponding cost function would look like
\[ C_{\mathrm{vec}}(\theta(t)) = 1 - \frac{X_j(t) - X_i(t)}{\left\| X_j(t) - X_i(t) \right\|} \cdot f_{ij}(\theta(t)) \tag{7.26} \]
where fij (θ (t)) is a unit vector specifying the direction of the limb relating markers i
and j.
A nonlinear cost function can be formed as the weighted sum of such terms
and minimized using an algorithm such as the Broyden-Fletcher-Goldfarb-Shanno
(BFGS) method [351]. This was one of the earliest inverse kinematics approaches for
motion capture [51].
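A small worked example may help fix ideas. The sketch below minimizes the position cost of Equation (7.25) for a planar two-link arm with markers at the elbow and wrist. To stay dependency-free it swaps in plain gradient descent with a numerical gradient in place of BFGS, so it illustrates the cost function rather than the optimizer; the arm model, step size, iteration count, and function names are all assumptions.

```python
import math

L_U, L_F = 0.30, 0.25  # illustrative limb lengths (assumed values)

def markers(theta):
    """Forward kinematics f_j(theta): elbow and wrist positions of a planar arm."""
    t1, t2 = theta
    elbow = (L_U * math.cos(t1), L_U * math.sin(t1))
    wrist = (elbow[0] + L_F * math.cos(t1 + t2),
             elbow[1] + L_F * math.sin(t1 + t2))
    return [elbow, wrist]

def position_cost(theta, observed):
    """Sum over markers of squared distance, as in Equation (7.25)."""
    return sum((fx - ox) ** 2 + (fy - oy) ** 2
               for (fx, fy), (ox, oy) in zip(markers(theta), observed))

def numerical_gradient(cost, theta, eps=1e-6):
    """Central-difference gradient of a scalar cost."""
    grad = []
    for k in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[k] += eps
        minus[k] -= eps
        grad.append((cost(plus) - cost(minus)) / (2 * eps))
    return grad

# Synthetic "observed" markers generated from a known pose.
theta_true = [0.4, 0.9]
observed = markers(theta_true)

def cost(theta):
    return position_cost(theta, observed)

theta = [0.0, 0.0]            # initial guess: arm straight along x
initial_cost = cost(theta)
for _ in range(2000):         # fixed-step gradient descent (not BFGS)
    g = numerical_gradient(cost, theta)
    theta = [t - 0.5 * gk for t, gk in zip(theta, g)]
final_cost = cost(theta)
print(theta, initial_cost, final_cost)
```

A quasi-Newton method such as BFGS would typically need far fewer iterations; additional terms such as Equation (7.26) would simply be added to `cost` with their own weights.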
In this framework, we can also impose hard constraints on θ(t) — for example,
to enforce that a joint not exceed the limits of feasible human motion. The expected
ranges of motion for performers of different ages and genders can be determined
from biomechanical studies (e.g., [465]). If we directly parameterize the vector θ
using angles, such limits can be expressed as linear constraints of the form
The BFGS algorithm can handle such linear constraints, but if we used the quater-
nion or twist parameterizations, the constraints would be nonlinear, taking the more
general form
gi (θ(t)) ≤ bi (7.28)
Problems with nonlinear constraints are generally much harder to solve than those
with linear constraints, especially at interactive rates. An alternative is to include the
constraints as weighted penalty terms in an unconstrained cost function, e.g., by
adding cost function terms of the form
As with Equation (7.23), the result of minimizing a sum of weighted cost func-
tion terms is not guaranteed to strictly satisfy any of the constraints given in each
term. That is, the marker positions will generally not agree exactly with the forward
kinematics model, and the desired range of motion constraints may not be exactly
satisfied. Tolani et al. [491] proposed to mitigate this problem using a hybrid analyti-
cal/numerical approach that was able to solve for most of the degrees of freedom of
a simplified kinematic chain in closed form.
Figure 7.11 illustrates an example result of optimization-based inverse kinematics,
rendering raw motion capture markers alongside the estimated skeletal pose from the
same perspective.
In addition to simple, uncoupled limits on the range of motion of each joint that
are independent at each point in time, we can incorporate dependencies on joint
limits, as discussed by Herda et al. [195]. For example, the range of motion of the
knee joint varies depending on the position and orientation of the hip joint. Such
dependencies can be characterized by analyzing motion capture training data to
determine feasible combinations of joint angles. We can also incorporate either hard
or soft dynamical constraints on the velocities and accelerations of joints, based on
biomechanical analysis of how quickly humans can move.
Figure 7.11. An example inverse kinematics result. (a) The original motion capture markers,
which form the constraints for an optimization-based inverse kinematics problem. Lines are
drawn between markers on the same body part to give a sense of the pose. (b) The estimated
skeletal pose from the same perspective. We can see that the markers are sometimes quite far
from the skeleton.
Finally, we can take an alternate approach of fitting the kinematic model to motion
capture data using a physics-based cost function (e.g., see [584]). That is, we assume
that the kinematic model is subjected to forces (e.g., from springs that relate the
markers to the skeleton, and from friction with the ground) and its pose corresponds
to achieving the most “comfortable” position.
11 The PCA should only be applied to the relative joint angles, and not the position and orientation
of the root.
Figure 7.12. (a) Attaching tapered cylinders to the skeleton of a kinematic model results in a
solid model of the body. (b) The solid model in a given pose can be used to predict which markers
will be occluded from a given viewpoint. A marker will not be visible if either its normal points
away from the camera (e.g., point A) or a body solid lies between it and the camera (e.g., point B).
from a boxer is used to fit the online behavior of a gymnast). Finally, if our goal is to
capture the subtle mannerisms of a skilled performer, a model-based approach may
too strongly bias the result toward the library of precaptured motion, smoothing out
the subtlety that makes the individual performance unique.
12 This can be viewed as a crude skinning of the skeleton, that is, a mapping from points on a
skeleton to the surface of a 3D character model [29]. We mention more sophisticated body models
in Section 7.7 and Chapter 8.
7.5. M o t i o n E d i t i n g 273
online motion capture, the identities of missing markers can thus be predicted by the
body model and filled in with reasonable guesses from inverse kinematics until the
markers are reacquired.
A unique consideration for processing motion capture data is the preservation of
foot contact with the ground, or footplants. Filling in occluded foot markers using
the methods described so far can produce a perceptually distracting phenomenon
called footskate, in which the feet of the resulting kinematic model do not appear
to be firmly planted on the ground (or worse, appear to penetrate or hover above
the ground in a physically impossible way). Footskate can even result from using
inverse kinematics to fit complete motion capture data, since the kinematic model
is a simplified version of how the human body actually works. Kovar et al. [254]
proposed an algorithm for removing footskate artifacts by allowing small changes
to the leg bone lengths in the skeleton. Footplant locations are semiautomatically
identified, and an analytic inverse kinematics algorithm is applied to determine the
skeletal model most similar to the original data that still satisfies the constraints. This
can be viewed as a type of motion editing, which we discuss in the next section.
The goal of motion capture for visual effects is usually to precisely record a per-
former’s action. However, it’s often necessary to modify the recorded motion in a
way that preserves the personality of the performance but achieves a space-time goal
for animation. We call these motion editing problems. For example, we may need
to stitch together multiple motions from the same performer captured at different
times, such as stringing together separately recorded fighting moves. This is a prob-
lem of motion blending or motion interpolation. We may instead need to extend or
alter the path of a performer’s walk, since the motion capture volume may not match
the environment an animated character must traverse. This is a problem of motion
path editing.
In this section, we assume that the raw motion capture data has been transformed
into a time-varying vector of joint angles by means of an inverse kinematics algo-
rithm, as described in the previous section. That is, a given motion capture clip is
represented as {θ (t), t = 1, . . . , T }, where T is the number of frames in the clip.
At the simplest level, we can treat each of the time-varying parameters θ i (t) as
a one-dimensional signal, and apply any one-dimensional signal processing tech-
nique to it, such as filtering. For example, Witkin and Popović [550] discussed motion
warping using functions of the form
Another simple application is to change the frame rate of the motion by fitting splines
through the samples of the joint angles and resampling.
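The frame-rate change can be sketched as follows. The text does not say which spline is used, so this example assumes a Catmull-Rom spline, which passes exactly through the original samples; the joint-angle track is made up, and the function names are hypothetical.

```python
import math

def catmull_rom(p0, p1, p2, p3, u):
    """Cubic Catmull-Rom segment between p1 (u = 0) and p2 (u = 1)."""
    return 0.5 * (2 * p1 + (p2 - p0) * u
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * u ** 2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * u ** 3)

def resample(samples, factor):
    """Resample a joint-angle signal at `factor` times its original frame rate."""
    last = len(samples) - 1
    n_out = int(round(last * factor)) + 1
    out = []
    for k in range(n_out):
        s = min(k / factor, float(last))        # time in original frame units
        i = min(int(math.floor(s)), last - 1)   # index of the spline segment
        u = s - i                               # position within the segment
        # Clamp neighbor indices at the two ends of the clip.
        p0 = samples[max(i - 1, 0)]
        p1, p2 = samples[i], samples[i + 1]
        p3 = samples[min(i + 2, last)]
        out.append(catmull_rom(p0, p1, p2, p3, u))
    return out

# A made-up joint-angle track theta_i(t) with 20 frames.
angles = [math.sin(0.3 * t) for t in range(20)]
doubled = resample(angles, 2.0)   # twice the original frame rate
print(len(angles), len(doubled))
```

Every second output sample coincides with an original frame, and the samples in between are filled in by the spline.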
Bruderlin and Williams [75] applied multiresolution filtering to motion capture
signals using a one-dimensional Laplacian pyramid (see Section 3.1.2). This allows
the modification of frequency bands for individual joints to alter the corresponding
motion. For example, amplifying the middle and high frequencies exaggerates the
recorded motion, making it seem more cartoonish.
Figure 7.13. Motion interpolation for stitching sequences together. Dots indicate foot contact
with the ground. In this example, a walking motion is blended into a running motion to produce
a jog over the transition.
Figure 7.14. Dynamic time warping for estimating correspondence between two motion
sequences. (Axes: frames in sequence 1, from 1 to T; frames in sequence 2, from 1 to T′.)
contacts the ground, to ensure that the footplants of the two motions are matched, or
more weight to some joints over others (e.g., the shoulder’s orientation may be more
important than the wrist’s). We could also use a surface-to-surface distance function
applied to skinned skeletons after the roots have been aligned [252].
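The dynamic time warping correspondence illustrated in Figure 7.14 can be sketched with the textbook dynamic programming recursion. This is a generic formulation, not necessarily the exact variant used in the cited work; the per-joint weights, the toy pose sequences, and the function names are assumptions made for illustration.

```python
def weighted_pose_distance(pose_a, pose_b, weights=(1.0, 0.5)):
    """Weighted squared difference between two poses (lists of joint values)."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, pose_a, pose_b))

def dtw(seq1, seq2, dist):
    """Return (total cost, warping path) aligning seq1 to seq2."""
    n, m = len(seq1), len(seq2)
    INF = float("inf")
    # D[i][j] = cost of the best alignment of the first i and first j frames.
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(seq1[i - 1], seq2[j - 1])
            D[i][j] = step + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    # Backtrack from the last pair of frames to the first.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [(D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
    path.reverse()
    return D[n][m], path

# Sequence 2 is sequence 1 with some frames held for longer.
seq1 = [[float(t), 0.5 * t] for t in (0, 1, 2, 3)]
seq2 = [[float(t), 0.5 * t] for t in (0, 0, 1, 2, 2, 3)]
total_cost, path = dtw(seq1, seq2, weighted_pose_distance)
print(total_cost, path)
```

Each pair in `path` matches a frame of the first sequence to a frame of the second; repeated indices correspond to horizontal or vertical steps in Figure 7.14.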
Once we have a temporal correspondence between frames over the interval to be
blended, we can interpolate between the two sequences, using a weighted average
between corresponding frames.13 We require a weight function wi that monotonically
increases from 0 at i = 1 to 1 at i = L, so that the motion transitions smoothly from
the first motion to the second over the interval.
If we assume that the non-root elements of θ and θ′ are parameterized using unit
quaternions, then the appropriate interpolation between unit quaternions q and q′
is given by spherical linear interpolation or slerp, defined by
\[ \mathrm{slerp}(q, q', w_i) = \frac{\sin((1 - w_i)\phi)}{\sin\phi}\, q + \frac{\sin(w_i \phi)}{\sin\phi}\, q' \]
where q · q′ = cos φ; that is, φ is the angle between the two quaternions on the 3D unit
sphere [444].
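A minimal sketch of the blend follows, using the standard slerp formula with weights sin((1 − w)φ)/sin φ and sin(wφ)/sin φ. Two details are assumptions beyond the text: the sign flip that keeps the interpolation on the shorter arc (common practice, since q and −q represent the same rotation), and the smoothstep weight function, which is just one monotonic choice among many.

```python
import math

def slerp(q, q_prime, w):
    """Spherical linear interpolation between unit quaternions, w in [0, 1]."""
    dot = sum(a * b for a, b in zip(q, q_prime))
    if dot < 0.0:
        # q and -q encode the same rotation; flip to take the shorter arc.
        q_prime = [-b for b in q_prime]
        dot = -dot
    phi = math.acos(min(1.0, dot))
    if phi < 1e-8:
        return list(q)  # quaternions nearly identical; avoid dividing by zero
    a = math.sin((1.0 - w) * phi) / math.sin(phi)
    b = math.sin(w * phi) / math.sin(phi)
    return [a * x + b * y for x, y in zip(q, q_prime)]

def smoothstep_weights(L):
    """Weights w_i rising monotonically from 0 at i = 1 to 1 at i = L."""
    weights = []
    for i in range(1, L + 1):
        u = (i - 1) / (L - 1)
        weights.append(3 * u ** 2 - 2 * u ** 3)
    return weights

# Blend from the identity rotation to a 90-degree rotation about z.
q1 = [1.0, 0.0, 0.0, 0.0]
q2 = [math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)]

halfway = slerp(q1, q2, 0.5)          # expected: a 45-degree rotation about z
weights = smoothstep_weights(9)
blend = [slerp(q1, q2, w) for w in weights]
print(halfway, weights)
```

Unlike a plain weighted average of quaternion components, every interpolated quaternion here stays on the unit sphere, so no renormalization is needed.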
We also need to know how to place and orient the root at each transition frame.
First, we align the second motion with the first as desired (e.g., so that the two motion
paths are roughly aligned) by applying the same rigid transformation to the root posi-
tion and orientation of all the frames of the second motion. Then for each transition
frame, the root orientations can be determined given the weight wi using spherical
linear interpolation in the way shown earlier. The root position can be interpolated
along a predefined path (e.g., a straight line, a curved line given by the user [361], or
a curved line given by linearly interpolating between the starting and ending velocity
vectors [396]). Figure 7.15 illustrates the idea.
13 This process of warping and cross-dissolving between motions is analogous to the morphing
problem we discussed in Section 5.7.
Figure 7.15. Interpolating the root position and orientation for a transition between two
motions, for different values of the weight w. In this case, the intermediate positions are
created with linear interpolation and the intermediate orientations are created with spherical
linear interpolation. (The figure shows Motion 1 and Motion 2 with intermediate poses at
w = 0, 0.25, 0.5, 0.75, and 1.)
The interpolated motion created by this process is aligned to the time indices of
the first motion; we can speed up or slow down the motion in the transition region
as desired. A natural choice is to remap the interpolated motion to have duration
$\frac{1}{2}(T + T')$.
Figure 7.16 illustrates an example of a motion blend using this technique, transi-
tioning from a normal walking motion to a sneaking motion. The result is perceptibly
worse without dynamic time warping. For example, in the third frame of Figure 7.16c,
both feet are well off the ground, and in the fourth frame the figure is leaning for-
ward on both toes. The figure seems to stutter and glide across the ground, and never
makes a satisfactory transition to the crouching posture.
Kovar and Gleicher [252] extended this approach by enforcing that the dynamic
time warping path doesn’t take too many consecutive horizontal or vertical steps
(i.e., mapping the same frame from one sequence onto many frames of the other).
They then fit a smooth, strictly increasing spline to the dynamic time warping path,
which they called a registration curve. They also generalized the approach to allow
the blending of more than two motion sequences. Instead of applying dynamic time
warping, Rose et al. [395] modeled the correspondence as a piecewise-linear path
defined by annotated keyframe correspondences (e.g., matching points in a walk
cycle).
The monotonic function defining the weights wi could be a simple linear transi-
tion, or a function with non-constant slope, e.g., arising from radial basis functions
[395], B-splines [266], or the desire for differentiability [254]. In general, when we want
to blend between more than two motions, any of the scattered data interpolation
techniques from Section 5.2 can be applied.
Figure 7.16. Motion interpolation between a walking motion and a sneaking motion. (a) The
original walking motion. (b) The original sneaking motion. (c) Interpolation without dynamic
time warping. (d) Interpolation with dynamic time warping.
Figure 7.17. An example motion graph for three sequences (Sequence 1, Sequence 2, and
Sequence 3). The vertices (large dots) are time indices into each subsequence. The edges
(directed arrows) indicate viable transitions within or across subsequences. The horizontal
edges already exist within the original motion capture data, while the remaining edges must
be synthesized by motion interpolation. The thicker edges indicate an example walk on the
graph that generates a natural motion.
Figure 7.18. Minima of the pose distance function c(θ(t), θ′(t′)) can be used to identify vertices
and edges of the motion graph (white dots). (Axes: frames in sequence 1, frames in sequence 2.)
As illustrated in Figure 7.17, the resulting motion graph can create cycles within
the same sequence (for example, between similar footplants in several steps of a
walking sequence) as well as create transitions across different sequences. Kovar
et al. [254] also described heuristics to prune vertices from the motion graph that are
not very well connected such as dead ends (e.g., the last vertex in the second row of
Figure 7.17).
After estimating the motion graph for a motion capture database, we can estimate
the transitions along edges that best satisfy a higher-level constraint. For example, we
may want to create a long sequence in which a character driven by motion capture
data travels along a given path the user has traced on the ground. This amounts to
estimating a walk along the graph defined by a sequence of edges W = (e1 , e2 , . . . , eL )
that minimizes — or at least has a small value of — a goodness-of-fit function F (W )
defined by the user constraints. Kovar et al. [254] described an efficient branch-and-
bound technique for finding a graph walk satisfying user-specified path constraints.
7.6. F a c i a l M o t i o n C a p t u r e 279
Similar approaches to creating and applying motion graphs were described by Lee
et al. [265] and Arikan and Forsyth [17].
We can also impose constraints based on motion type, for example, forcing the
character to run through a given region by only allowing samples from running
motion capture sequences. For this purpose, it may be useful to automatically cluster
and annotate a large database of motion capture sequences with descriptions of the
performance (e.g., see [18, 253]). Finally, in addition to space-time constraints, we
can impose a dynamics-based model for F (W ), such as evaluating the total power
consumption of a character’s muscles [371] or the realism of the recovery from a
sharp impact [583].
Marker-based motion capture is primarily used to record the full body of a performer.
However, it can also be used to focus on a performer’s face, for later use in driving the
expressions of an animated character. The technology and methods for marker acqui-
sition are basically the same as for full-body motion capture, except that the cameras
are closer to the subject and the markers are smaller (i.e., 2–5 mm in diameter). Self-
occlusions and marker loss are also less problematic since the facial performance is
generally captured head-on by a smaller set of inward-facing cameras. Figure 7.19
illustrates a typical facial motion capture setup.
Facial markers aren’t usually related to an underlying skeletal model as in full-
body motion capture. Instead, facial markers are commonly related to a taxonomy
of expressions called the Facial Action Coding System (FACS), developed by Ekman
et al. [129]. FACS decomposes an expression into “action units” related to the activity
of facial muscles, which an animator can control to create a character’s expression.
Figure 7.19. A sample facial motion capture setup. (a) The camera configuration. (b) The marker
configuration.
Sifakis et al. [448] related facial markers to a highly detailed anatomical model
of the head that included bones, muscle, and soft tissue, using a nonlinear opti-
mization similar to the methods in Section 7.4.2. Alternately, the facial markers can
be directly related to the vertices of a dense 3D mesh of the head’s surface (e.g.,
[54]), acquired using laser scanning or structured light (both discussed in detail in
Chapter 8).
One of the earliest facial motion capture tests was described by Williams [547],
who taped dots of retro-reflective Scotchlite material to a performer’s face and used
the dots’ 2D positions to animate a 3D head model obtained using a laser scanner.
Guenter et al.’s seminal work [183] described a homemade motion capture framework
using 182 fluorescent dots glued to a performer’s face that were imaged under ultra-
violet illumination. The triangulated 3D dot positions were used to move the vertices
of a 3D head mesh obtained using a laser scanner. Lin and Ouhyoung [284] described
a unique approach that uses a single video of a scene containing the performer and a
pair of mirrors, effectively giving three views of the markers from different perspec-
tives. In several recent films (e.g., TRON: Legacy, Avatar, and Rise of the Planet of the
Apes), actors performed on set wearing facial markers whose motion was recorded by
a rigid rig of head-mounted cameras, in essence carrying miniature motion-capture
studios along with them (see Section 7.8).
On the other hand, marker-based technology is only part of the process of facial
capture for visual effects today. In particular, the non-marker-based MOVA Con-
tour system is extremely popular and is used to construct highly detailed facial
meshes and animation rigs for actors prior to on-set motion capture. With this
system, phosphorescent makeup is applied to the performer’s entire face. Under
normal lighting, the makeup is invisible, but under fluorescent lighting, the makeup
glows green and has a mottled texture that generates dense, evenly spaced visual
features in the resulting images. The performer is filmed from the front by many
cameras, and dense, accurate 3D geometry is computed using multi-view stereo tech-
niques, discussed in Section 8.3. This technology was notably used in The Curious
Case of Benjamin Button. In related approaches, Furukawa and Ponce [159] painted
a subject’s face with a visible mottled pattern, and Bickel et al. [44] augmented
facial markers with visible paint around a performer’s forehead and eyes to track
wrinkles.
Facial capture techniques that require no markers or makeup are also a major
research focus in the computer vision and graphics communities. Bradley et al. [63]
described a system in which the performer’s head is surrounded by seven pairs of
high-resolution stereo cameras zoomed in to use pores, blemishes, and hair follicles
as trackable features. The performer is lit by a bright array of LED lights to provide
uniform illumination. The 3D stereo reconstructions (i.e., stereo correspondence
followed by triangulation) are merged to create a texture-mapped mesh, and opti-
cal flow is used to propagate dense correspondence of the face images throughout
each camera’s video sequence. This can be viewed as a multi-view stereo algorithm,
discussed in detail in Section 8.3. Another major approach is the projection of struc-
tured light patterns onto a performer’s face, which introduces artificial texture used
for multi-view stereo correspondence. This approach is typified by the work of Zhang
7.7. M a r k e r l e s s M o t i o n C a p t u r e 281
et al. [570] and Ma et al. [309]. We’ll discuss structured light approaches in detail in
Section 8.2.
If the goal is simply to record the general shape and pose of the performer’s face at
each instant, then a lower-resolution approach such as fitting an active appearance
model to single-camera video [316] is more appropriate than full motion capture.
individual nuances of joint motion (especially around wrists and feet, or in cases
of rapid motion) that are essential for an animator [168].15 The techniques in this
section lead into the more general algorithms for 3D data acquisition discussed in
Chapter 8.
These are usually simplified using the Markov property and the assumption that
the current observation only depends on the current state to
We therefore take a Bayesian approach, searching for the maximum (or multiple
modes) of a posterior probability distribution
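\[ \begin{aligned} p(\theta(t) \mid r(1), \ldots, r(t)) &\propto p(r(t) \mid \theta(t))\, p(\theta(t) \mid r(1), \ldots, r(t-1)) \\ &\propto p(r(t) \mid \theta(t)) \int p(\theta(t) \mid \theta(t-1))\, p(\theta(t-1) \mid r(1), \ldots, r(t-1))\, d\theta(t-1) \end{aligned} \tag{7.36} \]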
Therefore, we can recursively update the posterior density based on its previous
estimate and our models for the state transition and observation likelihoods. Mark-
erless motion capture approaches differ in how the observation r(t) is extracted from
the current image and related to the state, how the various probability densities are
represented, and how the posterior is used to obtain the current state estimate.
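The recursion in Equation (7.36) is easiest to see when the state takes only a few discrete values, so that the integral becomes a sum. The sketch below is a toy illustration with three hypothetical pose classes; the transition probabilities and likelihood values are made-up numbers, not taken from any real system.

```python
def bayes_update(posterior_prev, transition, likelihood):
    """One step of the recursive update in Equation (7.36) for discrete states.

    posterior_prev[i] = p(theta(t-1) = i | r(1), ..., r(t-1))
    transition[i][j]  = p(theta(t) = j | theta(t-1) = i)
    likelihood[j]     = p(r(t) | theta(t) = j)
    """
    n = len(posterior_prev)
    # Prediction: sum over the previous state (the integral in Equation (7.36)).
    predicted = [sum(transition[i][j] * posterior_prev[i] for i in range(n))
                 for j in range(n)]
    # Correction: weight by the observation likelihood, then normalize.
    unnormalized = [likelihood[j] * predicted[j] for j in range(n)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Three hypothetical pose classes, e.g. "standing", "walking", "crouching".
prior = [1.0 / 3.0] * 3
transition = [[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]]
likelihood = [0.1, 0.2, 0.7]   # the current image favors the third class

posterior_1 = bayes_update(prior, transition, likelihood)
posterior_2 = bayes_update(posterior_1, transition, likelihood)
print(posterior_1, posterior_2)
```

Seeing the same evidence twice sharpens the posterior around the third class, which is the recursive behavior the text describes.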
15 To be fair, many algorithms in this section aren’t designed for highly accurate motion capture but
for robust human detection, pose estimation, and tracking in video sequences, where the results
are sufficient.
When the probability densities in Equation (7.35) are modeled using Gaussian
distributions, the computation of the posterior reduces to the Kalman filter, a well-
known signal processing algorithm [165]. However, in the motion capture problem,
both densities are poorly modeled by Gaussians (in particular, they are multimodal)
and a more appropriate approach is particle filtering [212]. In particle filtering, the
posterior density is represented as a set of samples {sk } of the distribution, each with
a probability {πk }. This allows us to easily extract a single estimate of the current
state (either by selecting the sample with the highest probability or by computing a
weighted average of the samples based on their probabilities) or to retain multiple
hypotheses about the current state (given by the top modes of the sample set).
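One iteration of a basic particle filter can be sketched as follows. This is a generic bootstrap filter on a one-dimensional state, assumed here only to keep the example short; a real pose tracker would use the high-dimensional θ(t) and an image-based likelihood. The Gaussian motion and observation models, their standard deviations, and the function names are all illustrative assumptions.

```python
import math
import random

def gaussian(x, mean, sigma):
    """Gaussian probability density."""
    return (math.exp(-0.5 * ((x - mean) / sigma) ** 2)
            / (sigma * math.sqrt(2.0 * math.pi)))

def particle_filter_step(samples, observation, rng,
                         motion_sigma=0.2, obs_sigma=0.5):
    """Propagate samples s_k, compute probabilities pi_k, estimate, and resample."""
    # Prediction: draw each sample from the state transition model.
    predicted = [s + rng.gauss(0.0, motion_sigma) for s in samples]
    # Weighting: pi_k is proportional to the observation likelihood.
    weights = [gaussian(observation, s, obs_sigma) for s in predicted]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Two ways to extract a single estimate from the sample set.
    mean_estimate = sum(p * s for p, s in zip(probs, predicted))
    best_estimate = predicted[probs.index(max(probs))]
    # Resampling: draw a new, equally weighted set according to pi_k.
    resampled = rng.choices(predicted, weights=probs, k=len(predicted))
    return resampled, mean_estimate, best_estimate

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.uniform(-5.0, 5.0) for _ in range(2000)]
samples, mean_estimate, best_estimate = particle_filter_step(samples, 2.0, rng)
print(mean_estimate, best_estimate)
```

Both estimates land near the observed value of 2.0, and the resampled set concentrates there, ready for the next frame.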
However, since the state space for human pose is very large (that is, the vector
θ(t) is usually at least thirty-dimensional), a standard particle filter would require
an intractable number of samples to accurately represent the posterior density.
Deutscher and Reid [119] proposed a modified particle filter for pose estimation
that borrows ideas from simulated annealing and genetic algorithms to successively
refine the estimate of the posterior with a viable number of samples. An alternate
approach proposed by Sminchisescu and Triggs [457] focuses the samples in regions
with high uncertainty.
Another way to deal with the large state space is to reduce its dimensionality.
For example, a specific action such as walking has fewer degrees of freedom than
a generic pose, which can be revealed by analyzing a training dataset using prin-
cipal component analysis [447] or a more sophisticated latent variable model (see
Section 7.4.3).
Modeling the state transition likelihood p(θ (t) | θ(t − 1)) in Equation (7.36) is
similar to the methods discussed in Section 7.4.3. For example, we can use single-
frame and dynamical constraints based on biomechanical training data, in addition
to incorporating character- or activity-specific learned models. In the rest of this
section, we briefly overview typical features for markerless motion capture, which
are used to form the observation likelihood p(r(t) | θ(t)) in Equation (7.36).
difficult to individually segment, and the right knee has a bulge of fabric far from the
actual joint. Image edges can help separate an image of a human into body parts for
a bottom-up segmentation, but edges can also be confounding (e.g., a striped shirt).
We also face kinematic ambiguities. For example, in Figure 7.20 the left arm is
foreshortened and the position of the left elbow joint is unclear; the right wrist and
hand are completely obscured. Even with high-resolution cameras, it’s also difficult
to resolve rotations of the arm bones around their axes; for example, the orientation
of the left hand in Figure 7.20 is difficult to guess. Sminchisescu and Triggs [457] esti-
mated that up to a third of the underlying degrees of freedom in a kinematic model
are usually not observable from a given viewpoint due to self-occlusions and rota-
tional ambiguities. The poses of hands and feet are especially difficult to determine,
which is why markerless systems frequently don’t include degrees of freedom for the
wrist and ankle joints.
Many markerless algorithms discard the original image entirely in favor of a silhou-
ette of the human, estimated using background subtraction (i.e., matting; see Chapter
2). This exacerbates the problems illustrated in Figure 7.20, and introduces new ones.
In Figure 7.21a-b, we can see that it’s impossible to disambiguate the right limbs of
the body from the left limbs by looking at the silhouette, leading to major ambiguities
in interpretation. It’s also difficult to resolve depth ambiguities (e.g., whether a fore-
shortened arm is pointed toward the camera or away from it). More generally, we can
see that two very different poses can have similar silhouettes. Consequently, small
changes in a silhouette can correspond to large changes in pose (e.g., an arm at the
side (Figure 7.21a) versus an arm pointing outward (Figure 7.21c)). Therefore, several
silhouettes from different perspectives are required to obtain a highly accurate pose
estimate.
Figure 7.22 illustrates that some uncertainties can be mitigated if we also use edge
information inside the silhouette. For example, the location of an arm crossed in
front of the body might be better estimated if the boundary between the forearm and
torso can be found with an edge detector.
where λ1 , λ2 are weights and Ds and De are distance functions between binary sil-
houette images and edge maps, respectively. A natural choice for Ds is the Hamming
distance — that is, the number of pixels that have different labels in the two binary
images. A common variant called the chamfer distance, suggested in this context by
Gavrila and Davis [164], penalizes pixels in one image that are further from the sil-
houette in the other image more severely. We can also construct a weighted average
of the pixels seen in one silhouette but not the other [449]. Ren et al. [388] proposed
to learn an effective Ds based on a database of labeled poses; the distance function
was composed of computationally efficient rectangular binary filters. We could also
create Ds based on the correspondence between estimated matching points on the
silhouettes.
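The difference between the Hamming and chamfer distances shows up clearly on a toy example. The sketch below compares a small binary silhouette against shifted copies of itself. The chamfer distance is implemented in one common one-directional form (summing, over foreground pixels of the first image, the distance to the nearest foreground pixel of the second) by brute force rather than with a distance transform; the images and function names are assumptions for illustration.

```python
import math

def make_silhouette(col_start, rows=4, cols=6):
    """Binary image with a 2x2 foreground block whose left column is col_start."""
    return [[1 if 1 <= r <= 2 and col_start <= c <= col_start + 1 else 0
             for c in range(cols)] for r in range(rows)]

def hamming_distance(a, b):
    """Number of pixels whose labels differ between two binary images."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))

def foreground(img):
    """Coordinates of all foreground pixels."""
    return [(r, c) for r, row in enumerate(img) for c, v in enumerate(row) if v]

def chamfer_distance(a, b):
    """Sum over foreground pixels of a of the distance to the nearest one in b."""
    pts_b = foreground(b)
    return sum(min(math.hypot(ra - rb, ca - cb) for rb, cb in pts_b)
               for ra, ca in foreground(a))

observed = make_silhouette(1)
shifted = {k: make_silhouette(1 + k) for k in (1, 2, 3)}  # shift by k pixels

hamming = {k: hamming_distance(observed, s) for k, s in shifted.items()}
chamfer = {k: chamfer_distance(observed, s) for k, s in shifted.items()}
print(hamming, chamfer)
```

Once the two blocks no longer overlap, the Hamming distance stops growing, while the chamfer distance keeps increasing with the shift. This is the sense in which it penalizes more distant pixels more severely, and it gives an optimizer a useful signal even for poor initial poses.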
The edge distance function De can be defined similarly to Ds ; for example, we
can count the number of edge pixels in Êi (t) that are observed in Ei (t) [119]. We can
see from Figure 7.23c that there are likely to be many edges in the real image not in
the model, but most of the model edges should appear in the image if the model is
correctly posed. The edge pixels can also be weighted by their gradient magnitude
and orientation as a measure of importance [236].
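The one-sided nature of this comparison can be sketched as follows. We assume (as our own illustrative choice, not the formulation of [119]) that De is the fraction of predicted model edge pixels that have no observed edge at the same location, so that extra edges in the real image are not penalized.

```python
def edge_distance(model_edges, image_edges):
    """Fraction of predicted (model) edge pixels with no observed edge
    at the same pixel. Extra image edges do not affect the result.
    Assumes the model edge map has at least one edge pixel."""
    model = [(r, c) for r, row in enumerate(model_edges)
             for c, v in enumerate(row) if v]
    missing = sum(1 for r, c in model if not image_edges[r][c])
    return missing / len(model)


# Hypothetical edge maps: the image has clutter edges the model lacks.
model_edges = [[0, 1, 0, 0],
               [0, 1, 0, 0],
               [0, 1, 0, 0],
               [0, 1, 0, 0]]
image_edges = [[1, 1, 0, 1],
               [0, 1, 0, 1],
               [0, 1, 1, 1],
               [0, 0, 0, 1]]

print(edge_distance(model_edges, image_edges))  # 0.25
```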
Figure 7.23. Silhouettes and edge maps for observed images (top) are compared to predicted
silhouettes and edge maps corresponding to a candidate model position (bottom).
If we have a strong appearance model for the performer — for example, a model for
the expected color of each body part [447] — this information can also be incorporated
into p(r(t) | θ (t)). Shaheen et al. [438] compared the performance of marker-
less motion capture algorithms as the choices of image features and optimization
approaches were varied.
Instead of explicitly specifying a generative model from a pose to image features,
Agarwal and Triggs [3] used nonlinear regression on training data to directly predict
pose as a function of an image silhouette. Sigal et al. [450] used belief propagation on
a graphical model of body part relationships to estimate pose from an observation
likelihood model.
where V (t) is the visual hull at time t and V̂ (t) is the solid model correspond-
ing to pose θ(t). Dv is a distance function defined over sets of voxels in 3D. Mikić
et al. [324] described an early approach in which ellipsoids representing limbs were
fit to the visual hull and their centroids and endpoints were used to fit a kinematic
model parameterized with twists. Cheung et al. [91] described a similar hierarchical
Figure 7.24. (a) The solid human model should be tangent to the 3D region created by back-
projecting each observed silhouette. A distance function can be constructed as the sum of
distances (short, thick lines) between rays from the camera center through points on the sil-
houette and the closest points on the 3D model. (b) An example backprojected silhouette region
for a real image.
approach of fitting body parts of the kinematic model to the visual hull. Kehl and Van
Gool [236] computed Dv as a weighted sum of squared distances between points on
the model and the closest points on the visual hull. Corazza et al. [105] used a variant
of the Iterative Closest Points (ICP) algorithm (see Section 8.4.2), commonly used for
registering 3D point sets, to compute Dv . Vlasic et al. [520] took a slightly different
approach of first fitting a kinematic skeleton directly inside the visual hull in each
frame, and then refining a high-quality mesh model V̂ (t) based on the skeletal poses
and a skinned model.
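A closest-point distance of the kind described above can be sketched in a few lines. This is only an illustration under our own assumptions (point lists in place of voxel sets, a brute-force nearest-neighbor search, and hypothetical names); it is not the implementation of any of the cited papers.

```python
def closest_point_cost(model_points, hull_points, weights=None):
    """Weighted sum of squared distances from each model point to its
    closest point on the visual hull (brute-force search)."""
    if weights is None:
        weights = [1.0] * len(model_points)
    total = 0.0
    for w, p in zip(weights, model_points):
        d2 = min(sum((pi - qi) ** 2 for pi, qi in zip(p, q)) for q in hull_points)
        total += w * d2
    return total


# Hypothetical data: a model that matches the hull, and one offset by 0.5 in z.
hull_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted_model = [(x, y, z + 0.5) for x, y, z in hull_points]

print(closest_point_cost(hull_points, hull_points))    # 0.0
print(closest_point_cost(shifted_model, hull_points))  # 0.75
```

An ICP-style algorithm alternates between this closest-point assignment and an update of the pose parameters that reduces the cost.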
Figure 7.25. Four cameras observe the striped object. The visual hull is the shaded region
formed by intersecting the backprojected silhouettes (gray lines on the image planes). The visual
hull always contains the actual object and is generally larger than it.
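To make the silhouette intersection concrete, the following minimal sketch carves a visual hull on a 2D grid. It assumes two idealized orthographic views (one along the rows, one along the columns) rather than calibrated perspective cameras, and the function names are illustrative only.

```python
def silhouettes(obj):
    """Project a 2D occupancy grid along its two axes (orthographic views)."""
    row_sil = [any(row) for row in obj]        # camera looking along the rows
    col_sil = [any(col) for col in zip(*obj)]  # camera looking along the columns
    return row_sil, col_sil


def visual_hull(row_sil, col_sil):
    """Keep each cell whose projections fall inside both silhouettes."""
    return [[int(r and c) for c in col_sil] for r in row_sil]


# An L-shaped object: the empty upper-right corner is a concavity.
obj = [[1, 0, 0],
       [1, 0, 0],
       [1, 1, 1]]
hull = visual_hull(*silhouettes(obj))
object_cells = sum(map(sum, obj))
hull_cells = sum(map(sum, hull))

print(object_cells, hull_cells)  # 5 9
```

Every object cell survives the carving, but the concavity is filled in; no additional silhouette view can remove it, since every line through the concave corner region also passes through the object.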
Figure 7.26. (a) Images of a mannequin acquired by seven calibrated cameras, and the corre-
sponding automatically extracted silhouettes. (b) The resulting visual hull from a frontal and side
view. The coarse 3D approximation is reasonable, but more cameras would be required to carve
away the extraneous voxels.
the context of motion capture, Ganapathi et al. [162] fit a full-body skinned kinematic
model to a stream of monocular depth images from a time-of-flight sensor in real
time. Since the sensor observations are directly comparable to the model surface, the
observation likelihood is relatively straightforward and is based on the noise model
for the sensor. However, the proposed inference algorithm produced errors with
respect to conventional motion capture markers that were still fairly high.
A more familiar consumer technology is the Kinect sensor introduced by Microsoft
in 2010 as a new game-controlling interface for the Xbox 360, which uses an infrared
structured-light-based sensor to produce a stream of monocular depth images. As
described by Shotton et al. [445], hundreds of thousands of training images (both
motion-captured and synthetic) were used to build a finely tuned classifier that maps
each depth image to a set of candidate joint locations. The offline learning process is
incredibly computationally intensive, but the resulting online classifier is extremely
fast and can be hard coded into the device. The system is impressive for its ability to
robustly succeed across a wide range of body types and environmental conditions in
real time, though the goal is general pose estimation and not highly accurate motion
capture (the depth sensor only has an accuracy of a few centimeters).
7.8 Industry Perspectives
Senior software engineers Nick Apostoloff and Geoff Wedig from Digital Domain in
Venice, California discuss the role of body and facial motion capture in visual effects.
Digital Domain is particularly well known for creating photo-realistic digital doubles
using facial motion capture, as in the movies The Curious Case of Benjamin Button
and TRON: Legacy.
RJR: The popular perception is that motion capture directly drives the performance of
an animated character. Can you comment on how accurate this perception is?
Apostoloff: For early all-digital characters, like Gollum in the Lord of the Rings trilogy,
mocap was completely used as reference material for the animators; it didn’t directly
drive the character at all. Today, we’re getting much closer to applying mocap directly
to animation. No company ever shows you how far you actually get, but I think we’re
at the point where, toward the end of a production, you get seventy to eighty percent
of the character motion from mocap. In some cases, the animator’s just touching up
data you get back from the mocap. In other cases, there’s lots of art direction that
happens after the actual shoot. It’s common that the actor will do something the
director’s not happy with looking at the data after the capture session, so they’ll have
to reanimate a lot of that. You might only use timing information from the mocap —
for example, making sure the jaw moves at the right time — but they’re going to
animate a lot on top of that. There are certain things like eyes that they do from
scratch all the time, just because eyelines change whenever a person moves around
in a scene. We don’t even bother capturing that here at the moment.
It also comes back to the complexity of the animation rig and the mapping from the
motion capture data onto the animation controls. For TRON: Legacy, both the rig and
the motion capture were modeled as two different linear systems, so that mapping
Figure 7.27. Motion capture was used extensively to help create the lifelike apes in Rise of
the Planet of the Apes. In this example, actor Andy Serkis performs on a full-scale set of the
Golden Gate Bridge, wearing a state-of-the-art body suit instrumented with infrared LED markers
that were imaged by forty motion capture cameras. He also wears a facial motion-capture rig
comprised of a head-mounted camera and green facial marker dots. The arm prosthetics allow
Serkis to more accurately simulate an ape’s proportions and gait. Rise of the Planet of the Apes
©2011 Twentieth Century Fox. All rights reserved.
was very easy. But the actual solution for the final geometry might not have looked as
good as what you get from complex rigs that contain a lot of nonlinear functions like
skin sliding. The more complex the animation rig, the better the final result looks, but
it makes the job of mapping the mocap data onto the rig harder.
Wedig: In terms of facial motion capture, for the most part we were never aiming
for a 100 percent solution of directly mapping mocap data onto a final animated
character. Our goal is to produce a tool that gets rid of the boring work and the things
that annoyed the animators on previous shows. The animator spends so long on a
shot, and probably eighty to ninety percent of that time is just getting to the point
where it’s fun, where the animator is working on all the subtleties, emotional cues,
and micro-expressions of the character. That’s where we want our animators focused;
we don’t want them focused on how the jaw moves or whether the character’s making
a pucker with their lips.
RJR: Can you comment on the amount of cleanup that mocap data requires these days?
What are the common types of errors and ways to fix them?
Apostoloff: In terms of body mocap, it’s fairly straightforward. The capture system
often works at about 120 Hz, so you have a lot of temporal information in there that
you can use to clean up missing markers.
We often use what we call a “rigid fill” to fill in gaps in marker trajectories. That is,
if a marker appears in one frame but not the next, we solve a least-squares problem to
align nearby known markers in both frames with a rigid transformation. Then we can
fill in the missing marker in the current frame by applying the rigid transformation
to its position in the previous frame. People will also use splines to bridge gaps in
trajectories or to remove noise from individual trajectories.
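The rigid fill described above can be sketched as a small least-squares alignment. For brevity this illustration works in 2D, where the optimal rotation has a closed form; real marker data is 3D, where the rotation is usually obtained from an SVD-based solution. The function names and sample coordinates are hypothetical.

```python
from math import atan2, cos, sin


def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst."""
    n = len(src)
    sx, sy = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    dx, dy = sum(q[0] for q in dst) / n, sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy          # centered source point
        bx, by = qx - dx, qy - dy          # centered destination point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = atan2(cross, dot)              # optimal rotation angle
    c, s = cos(theta), sin(theta)
    # Translation that maps the rotated source centroid onto the destination centroid.
    return theta, (dx - (c * sx - s * sy), dy - (s * sx + c * sy))


def rigid_fill(prev_missing, prev_neighbors, curr_neighbors):
    """Predict a missing marker by moving its previous position with the
    rigid transformation fitted to nearby markers seen in both frames."""
    theta, (tx, ty) = fit_rigid_2d(prev_neighbors, curr_neighbors)
    c, s = cos(theta), sin(theta)
    x, y = prev_missing
    return (c * x - s * y + tx, s * x + c * y + ty)


# Neighbors rotate by 90 degrees and translate by (2, 3) between frames.
prev_neighbors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
curr_neighbors = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0)]
filled = rigid_fill((1.0, 1.0), prev_neighbors, curr_neighbors)
print(filled)  # approximately (1.0, 4.0)
```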
For facial motion capture, we’ve developed several more advanced algorithms.
Prior to going into production, we put the actors into a separate capture volume and
have them go over a set of facial poses that try to cover the entire gamut of what they
can do in terms of facial deformation. These include a mix of visemes, facial action
coding units, and emotional ranges. These facial poses are recorded using the MOVA
Contour system. We remove the head motions and use those to build an
actor-specific statistical model of what their face does when it deforms, which we put into a
Bayesian framework to estimate missing markers when they occur. The prior term is
built from the captured training data, and the data likelihood term involves what you
get back from the mocap. We use simple Gaussian models for the face and it works
well, but we’d need a different approach for the body since there’s more articulated
motion.
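The interview does not spell out the exact model, but one standard way to combine a Gaussian prior learned from training poses with a Gaussian measurement model is sketched below. All symbols are our own assumptions: x_o and x_m stack the observed and missing marker coordinates, (μ, Σ) are the prior mean and covariance estimated from the training capture, and σ² is an assumed isotropic measurement noise variance.

```latex
x = \begin{bmatrix} x_o \\ x_m \end{bmatrix}
  \sim \mathcal{N}\!\left(
    \begin{bmatrix} \mu_o \\ \mu_m \end{bmatrix},
    \begin{bmatrix} \Sigma_{oo} & \Sigma_{om} \\ \Sigma_{mo} & \Sigma_{mm} \end{bmatrix}
  \right),
\qquad
y_o = x_o + n, \quad n \sim \mathcal{N}(0, \sigma^2 I)

\hat{x}_m = \mathbb{E}[x_m \mid y_o]
          = \mu_m + \Sigma_{mo} \left( \Sigma_{oo} + \sigma^2 I \right)^{-1} (y_o - \mu_o)
```

Under these assumptions the estimate of the missing markers is the prior mean corrected by how far the observed markers deviate from their own mean, weighted by the learned correlations between the two groups.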
RJR: What kinds of problems arise when mapping mocap data onto an animated
character with a different body size and shape?
Apostoloff: A big problem is scale differences between the actors and characters. If
you have a constant scale change among all of your actors and the characters — say
you’ve got actors playing giants and you’re going to scale all of them four times —
they usually behave well in the same environment. But say you have one actor who’s
being scaled by a factor of 1.2 to make their character bigger and you have another
actor who’s being scaled by a factor of 0.8 to make their character smaller. You can
record them together on the motion capture stage, for example approaching each
other and shaking hands. But when you put them into the virtual environment and
scale everything, they no longer meet up and connect at the same point. This is a
huge issue in mocap.
In particular, you may record an actor on the motion capture stage walking
across the room and interacting with different props. If you scale an actor’s mocap
data down to make a smaller character, they don’t actually make it to the other
side of the room! You have to introduce extra gait or strides somewhere for them
to make it across. That’s a fairly big problem and the process for fixing it is
usually quite manual. Often you capture a lot of generic background motion of
actors so you can then insert these bits into scenes to fix them. I don’t think peo-
ple often use things like automatic gait generation to fix these kinds of issues,
since that involves a different simulation pipeline that’s difficult to insert into a
production.
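The arithmetic behind both problems is easy to reproduce. The sketch below assumes, purely for illustration, that each performer's data is scaled uniformly about a common stage origin; the positions, room length, and function name are hypothetical.

```python
from math import hypot


def scale_about_origin(point, s):
    """Uniformly scale a stage-space point about the stage origin."""
    return tuple(s * c for c in point)


# Both hands meet at this (floor position, height) point on the capture stage.
handshake = (2.0, 1.0)
hand_big = scale_about_origin(handshake, 1.2)    # character scaled up
hand_small = scale_about_origin(handshake, 0.8)  # character scaled down
gap = hypot(hand_big[0] - hand_small[0], hand_big[1] - hand_small[1])

# A performer walks the full length of a 6 m room; the character is scaled by 0.8.
room_length = 6.0
reached = scale_about_origin((room_length, 0.0), 0.8)[0]
shortfall = room_length - reached

print(round(gap, 3), round(shortfall, 3))  # 0.894 1.2
```

With these numbers the two hands end up almost 0.9 m apart, and the smaller character stops 1.2 m short of the far wall, which is the distance that extra strides would have to cover.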
RJR: Can you describe your system for facial motion capture?
Wedig: We have a head-mounted camera system that uses four grayscale cameras
on carbon fiber rods that sit around the jaw line. You get very good pictures of the
mouth and jaw area. If we were to put the cameras higher up, they would interact
with the actor’s eye line, which some actors find very distracting. By putting the
cameras in a flat sort of curve around the bottom of the head we can be sure that
any part of the face is seen by at least two cameras. We use the images from the four
cameras to individually track dots on the actor’s face, which are placed in a specific
pseudo-random pattern developed over several years. Initially the dots are applied
by hand, and afterward we use a plastic mask with holes drilled in it to try to get some
consistency across shots. We can’t put the dots very close to the actor’s lips because
they get messed up when they eat or wipe their face.
Matching the marker set on a given shot with our canonical marker set is still a
problem we have to write a lot of code to address. Another big issue is stabilization.
No matter how good the carbon fiber rods are, they’re going to bounce. They bounce
when the actor breathes, when they walk, or when they move their head. Filtering the
unwanted motion is very difficult.
Finally, we map the motion of the dots onto the actor-specific face model created
from training data described earlier.
RJR: Has markerless motion capture made any inroads into visual effects production?
Apostoloff: It’s used sometimes in the film industry, mostly for body double work.
You can get a very high-resolution dense mesh from that kind of setup. If you’re going
to have a character that looks very similar to your actor, and you want to capture all
the nuanced performance of your actor without doing a lot of animation on top of
that, it’s great.
For example, you may want some high-profile actors to appear to fall out of a
plane and perform some stunts, but you don’t actually want to throw your high-
profile actors out of the plane. Instead, you might take them to one of those free-fall,
vertical wind tunnel places and capture a bunch of footage from twenty different
cameras around them, reconstruct the volume around them, and insert it into a new
scene like a blue-screen kind of effect. That kind of voxel-carving effect was used for a
sequence in Quantum of Solace. While the reconstructed geometry may not be great,
you’d be amazed at how believable it looks if you’ve got great texture. While it wasn’t
for a feature film, the video game L.A. Noire used a kind of facial markerless mocap
for a digital body double effect. They captured high-resolution textured geometry of
people and just streamed it directly into the game to get some amazing results.
On the other hand, markerless mocap isn’t a viable technique for replacing tra-
ditional mocap on a large film production at this point in time. What you can get
from a traditional motion capture system like a Vicon is incredibly accurate, and the
retargeting to different characters works very well. It’s a seamless pipeline and there
are real-time systems that can do live pre-visualization of the animated characters
on set in the virtual production environment, which is very useful for the directors.
While markerless systems are incredibly convenient, because you don’t have to
dress somebody up in a ridiculous outfit, what you get back from such systems is
often quite jerky and noisy. Also, you’re usually restricted in the size of the volume
that you can film the actors in. If you want to shoot a scene with twenty different
actors doing a dance sequence you can’t do that with markerless mocap.
One of the biggest issues is that we need to have a system that we can use for
animation after we process the mocap data. If your character looks a lot different
from your actor, it’s not often clear how you would use the high-resolution mesh data
from markerless motion capture to animate a character. Taking traditional mocap
data and mapping it to a complicated network of animation controls is already a
really hard problem! We haven’t reached the point where we can map the data across
from motion capture into animation accurately enough to worry about that little bit
of information that we lose from not having a complete mesh.
human walking, climbing, dancing, and so on, even when fewer than ten markers were
used. This remarkable human ability to recognize human motion from very little
information suggests that the sparse input of marker motion is sufficient to capture
a recognizable, individual performance.
The book by Menache [323] gives an excellent first-hand account of the history
and filmmaking applications of motion capture, as well as practical advice for set-
ting up a motion capture session and processing the resulting data files. Moeslund
and colleagues [331, 332] and Poppe [372] surveyed and categorized the literature
on markerless motion capture. In particular, [332] gives an exhaustive taxonomy of
vision-based pose estimation through 2006.
Today, a large amount of motion capture data is freely available to researchers
and animators. In particular, the Carnegie Mellon University motion capture
database (http://mocap.cs.cmu.edu) is an excellent resource containing thousands
of sequences from more than 100 performers in motion capture suits. The captured
activities range widely, including walking, dancing, swordfighting, doing household
chores, and mimicking animals. The data is available in many formats, and useful
tools for interacting with it are provided. The newer HumanEva datasets, made
available by Brown University [449], have a smaller range of activities, but include multi-view
video sequences synchronized with motion capture trajectories from markers placed
on normal clothing. These datasets are valuable for developing markerless techniques
and have become benchmarks for the computer vision community.
In this chapter, we focused on motion capture technology using infrared lighting
and retro-reflective markers, but the same algorithms for marker triangulation and
processing apply to any markers (e.g., table tennis balls or ARTags), provided that
they can easily be detected and tracked in video. For example, the Imocap system
designed by Industrial Light and Magic uses white balls and binary patterns fastened
to a gray bodysuit to allow performances to be captured in natural environments
instead of on a motion capture stage. This technology was notably used in the Pirates
of the Caribbean and Iron Man movies. Also, facial motion capture markers are often
directly drawn or painted on the skin. Prototype technologies for new motion capture
sensors include tiny ultrasound microphones coupled with miniature gyroscopes
and accelerometers [519] and lightweight photosensitive tags that respond to coded
optical transmitters [382]. These new technologies carry the promise of accurate
motion capture in real-world outdoor environments instead of carefully controlled
indoor stages.
We assumed that the kinematic models for humans were known; however, it’s
also possible to learn these models and their relationship to motion capture markers
(e.g., [394, 241]). Ross et al. [399] even showed how the kinematic models for unusual
objects (e.g., giraffes or construction cranes) could be estimated from tracked 2D
features alone. Such methods might be useful in situations where we have no good
prior model for the kinematic skeleton (e.g., motion capture of an unusual animal
like a kangaroo).
Yan and Pollefeys [560] extended factorization techniques for structure from
motion to estimate an underlying articulated skeleton from tracked features. Indeed,
the calibration and triangulation problems in motion capture are closely related
to aspects of the matchmoving problem we discussed in Chapter 6, and famil-
iar techniques for structure from motion can be extended to markerless motion
capture to allow for moving cameras. For example, Liebowitz and Carlsson [281]
extended metric reconstruction in the case where the scene points lie on a dynamic
articulated skeleton whose bone lengths are known. Hasler et al. [190] automated the
synchronization of the cameras and improved the feature detection and body model.
Brand and Hertzmann [65] discussed how a database of performances of the same
action (e.g., walking) by different performers could be used to separate the style of
a motion from its content and estimate style parameters for each performer. This
approach enabled previously recorded activities to be rendered in the style of a dif-
ferent performer, as well as the generation of new styles not in the database. Liu
et al. [290] estimated biomechanical aspects of a performer’s style (e.g., relative pref-
erences for using joints and muscles), assuming that the recorded motion capture
data optimizes a personal physics-based cost function.
The more extensively component motions are edited, the greater the risk the
synthesized motion appears unnatural to human eyes. Ren et al. [387] designed a
classifier trained on a database of both natural and unnatural motions that could
predict whether a new motion had a natural appearance. The classifier is based on
a hierarchical decomposition of the body into limbs and joints, so that the source of
an unnatural motion can be automatically pinpointed. In addition to motion editing,
this approach could also be used to detect errors in raw motion capture data and to
determine markers and intervals that need to be fixed. Safonova and Hodgins [415]
proposed an algorithm that analyzed the physical correctness of a motion sequence,
taking into account linear and angular momentum and ground contact, which can
improve the appearance of interpolated motions.
Cooper et al. [104] applied an adaptive learning algorithm to direct the sequence of
actions that a motion capture performer should execute in order to efficiently build
a good library of clips for motion editing and synthesis. Kim et al. [239] discussed the
extension of motion editing to enable multiple characters to interact in a task (e.g.,
carrying objects in a relay).
The main goal of full-body motion capture is to record the geometric aspects of a
performance. However, many applications of facial motion capture require not only
the recording of facial geometry but also high-resolution facial appearance — for
example, to make an entirely convincing digital double. Alexander et al. [11] give an
interesting overview of the evolution of photorealistic actors in feature films. They
used the Light Stage at the University of Southern California to acquire the detailed
facial geometry and reflectance of a performer, producing an incredibly lifelike facial
animation rig.
We generally assumed that motion capture data is processed for production well
after it is acquired. However, in a live setting, we may need to use motion capture
data to drive an animated character in real time, which is sometimes called computer
puppetry. This process is now commonly used on motion capture stages to allow a
director to crudely visualize the mapping of an actor’s performance onto an animated
character in real time, notably for movies like Avatar.
The main problem is delivering fast, reliable inverse kinematics results to drive
a rigged character at interactive rates. Shin et al. [443] described one such algo-
rithm, which makes instantaneous choices about which end-effector motions are
most important to preserve in the inverse kinematics, and leverages analytic solu-
tions for speed. Chai and Hodgins [85] showed how a performer wearing only a
few markers attached to normal clothing could drive an animated character in real
time. The low-dimensional input is used to quickly search a database of high-quality
motion capture subsequences that are seamlessly strung together and played back
in real time.
While we only discussed motion capture for full bodies and faces, biomechanical
engineers often use markered systems to study hands (e.g., [64]) for understand-
ing dexterity and grasping. These can be augmented with force-feedback sensors to
study how fingers interact with objects they contact [257]. Park and Hodgins [360]
used about 350 markers finely spaced over a performer’s body to collect accurate
data about the motion of skin, muscle, and flesh (e.g., bulging, stretching, jiggling).
Feature films such as the Lord of the Rings trilogy have even incorporated motion cap-
ture data from horses; of course, this requires an entirely different kinematic model.
Rosenhahn et al. [398] studied markerless motion capture of athletes interacting with
machines (e.g., bicycles and snowboards), which makes the kinematic skeleton more
complicated (i.e., since the legs are now connected by the machine into a closed
chain).
7.10 Homework Problems
7.1 a) Show that if P, Ci , and Vi are fixed, then the inner term of Equation (7.1)
for one camera
$$P^\top P - 2 C_i^\top P + C_i^\top C_i - (V_i^\top P)^2 \tag{7.40}$$
$$\rho = \frac{(I_{3\times 3} - R^\top)\, t}{2(1 - \cos\psi)} \tag{7.41}$$
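$$\exp(M) = I_{3\times 3} + M + \frac{1}{2} M^2 + \frac{1}{6} M^3 + \cdots = \sum_{k=0}^{\infty} \frac{1}{k!} M^k \tag{7.42}$$
In Equation (7.12) we have the special case of $\exp[v]_\times$, where $[v]_\times$ is the
skew-symmetric matrix defined in Equation (5.39).
a) Show that $[v]_\times^2 = v v^\top - \|v\|^2 I_{3\times 3}$ and $[v]_\times^3 = -\|v\|^2 [v]_\times$.
b) By considering the first 6 terms of the exponential Taylor series,
conclude that
$$\exp[v]_\times = \cos\|v\|\, I_{3\times 3} + \operatorname{sinc}\|v\|\, [v]_\times + \frac{1 - \cos\|v\|}{\|v\|^2}\, v v^\top \tag{7.43}$$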
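Before attempting the derivation, it can be reassuring to check Equation (7.43) numerically against the series in Equation (7.42). The sketch below is only a sanity check, not a proof; it assumes the usual cross-product convention for the skew-symmetric matrix, a nonzero v, and an arbitrary test vector of our own choosing.

```python
from math import cos, sin, sqrt


def skew(v):
    """Skew-symmetric matrix [v]x such that [v]x w equals the cross product v x w."""
    x, y, z = v
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def exp_series(M, terms=30):
    """Truncated Taylor series of the matrix exponential, Equation (7.42)."""
    result = [[float(i == j) for j in range(3)] for i in range(3)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, M)]  # M^k / k!
        result = [[result[i][j] + term[i][j] for j in range(3)] for i in range(3)]
    return result


def exp_closed_form(v):
    """Closed form of Equation (7.43); assumes v is not the zero vector."""
    n = sqrt(sum(c * c for c in v))
    K = skew(v)
    sinc, a = sin(n) / n, (1.0 - cos(n)) / (n * n)
    return [[cos(n) * (i == j) + sinc * K[i][j] + a * v[i] * v[j]
             for j in range(3)] for i in range(3)]


v = (0.3, -0.5, 0.8)
A, B = exp_series(skew(v)), exp_closed_form(v)
max_error = max(abs(A[i][j] - B[i][j]) for i in range(3) for j in range(3))
print(max_error)  # on the order of floating-point round-off
```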
7.14 Provide sketches to show that (a) the visual hull cannot resolve concavities
and (b) the visual hull can be substantially larger than the actual object.
7.15 Sketch an eighth perspective of the human in Figure 7.26 that would carve
away the voxels protruding from the chest area.
8 Three-Dimensional Data Acquisition
light projected onto his or her face. An up-and-coming alternative from traditional
computer vision is the passive technique of multi-view stereo (MVS). Multi-view
stereo algorithms combine the natural images from a large set of calibrated cameras
with dense correspondence estimation to create a 3D dataset, typically represented as
a texture-mapped mesh or a set of colored voxels (Section 8.3). While MVS techniques
are about an order of magnitude less accurate than active lighting methods, they can
still produce convincing, high-resolution 3D data.
Finally, we discuss common algorithms required for registering 3D datasets, since
several scans from different locations may be required to see all sides of an object
to build a complete model (Section 8.4). As in the 2D case, we detect, describe, and
match features, and use these as the basis for automatically registering two scans of
the same scene from different perspectives. We then address the fusion of a large
number of scans into a single coordinate system and data representation.
8.1 Light Detection and Ranging (LiDAR)
We can think about a LiDAR scanner² as an advanced version of the “laser measuring
tape” that can be found in a hardware store. The basic principles are similar: a laser
pulse or beam is emitted from a device, reflects off a point in the scene, and returns to
the device. The time of flight of the pulse or the phase modulation of the beam is used
to recover the distance to the object, based on a computation involving the speed of
light. While the hardware store laser measuring tape requires the user to manually
orient the laser beam, a LiDAR scanner contains a motor and rapidly spinning mirror
that work together to sweep the laser spot across the scene in a grid pattern.
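For the pulse-based case, the computation is simply that the pulse covers the distance twice, so the distance is half the round-trip time multiplied by the speed of light. The sketch below illustrates the magnitudes involved; the function names and sample values are our own.

```python
C = 299_792_458.0  # speed of light in m/s


def tof_distance(round_trip_seconds):
    """Distance to the surface, given the round-trip time of a laser pulse."""
    return C * round_trip_seconds / 2.0


def round_trip_time(distance_m):
    """Round-trip time of a pulse for a surface at the given distance."""
    return 2.0 * distance_m / C


print(tof_distance(400e-9))    # about 59.96 m for a 400 ns round trip
print(round_trip_time(0.001))  # about 6.7e-12 s for a 1 mm change in distance
```

The second number shows why timing electronics are the limiting factor: resolving one millimeter of range corresponds to resolving a few picoseconds of round-trip time.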
Figure 8.1 depicts two 3D scanners based on the main methodologies for LiDAR
data acquisition. The first scanner, in Figure 8.1a, uses a time-of-flight-based sen-
sor and can measure distances of hundreds of meters, while the second scanner, in
Figure 8.1b, is a phase-based system with a maximum range of about eighty meters.
Despite the long distances involved, both types of scanners are accurate to within
a few millimeters. An added advantage is that the distance to each point is mea-
sured directly, as opposed to inferred using a vision-based method like multi-view
stereo. For these reasons, laser scanning is considered the gold standard for 3D data
acquisition. We’ll discuss the physical principles behind both scanners shortly.
As illustrated in Figure 8.2, LiDAR data is usually collected in a spherical coordinate
system. For every azimuth and elevation angle (θ, φ), the scanner returns a distance
d(θ, φ), measured in physical units like meters, to the first point in the scene
encountered along the specified ray.³ For given intervals of θ and φ, the d(θ, φ) values can
be interpreted as a range or depth image, which can be manipulated using standard
image processing algorithms.⁴ Well before their application to visual effects, LiDAR
scanners were mounted in airplanes and used to generate high-quality terrain maps
for military and geospatial applications.
² In military applications, the acronym LADAR (LAser Detection And Ranging) is often used instead.
³ Some LiDAR scanners report multiple distance returns per ray, which can occur due to transparent,
reflective, or quickly moving surfaces in the scene.
⁴ Scanners also frequently report the return intensity at each ray, which is related to the reflectance,
material properties, and orientation of the corresponding surface. This return intensity image can
also be processed like a normal digital image.
Figure 8.1. (a) A time-of-flight-based LiDAR scanner. (b) A phase-based LiDAR scanner.
Figure 8.2. [Diagram: a ray leaving the LiDAR scanner, labeled with the azimuth θ, elevation φ, and distance d.]
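Converting a range image into a 3D point cloud is a direct change of coordinates. The sketch below assumes one common convention (θ measured in the horizontal plane from the x-axis, φ measured upward from that plane, scanner at the origin); an actual scanner's axis conventions may differ and should be checked against its documentation.

```python
from math import cos, radians, sin


def to_cartesian(theta, phi, d):
    """Convert azimuth theta, elevation phi (both in radians), and
    distance d to a 3D point in the scanner's coordinate frame."""
    x = d * cos(phi) * cos(theta)
    y = d * cos(phi) * sin(theta)
    z = d * sin(phi)
    return (x, y, z)


def range_image_to_points(thetas, phis, depth):
    """depth[i][j] is the distance measured at elevation phis[i], azimuth thetas[j]."""
    return [to_cartesian(t, p, depth[i][j])
            for i, p in enumerate(phis) for j, t in enumerate(thetas)]


# Hypothetical 2 x 2 range image with every return at 5 m.
thetas = [radians(0.0), radians(90.0)]
phis = [radians(0.0), radians(30.0)]
points = range_image_to_points(thetas, phis, [[5.0, 5.0], [5.0, 5.0]])
print(points[0])  # (5.0, 0.0, 0.0)
```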
Figure 8.3 illustrates an example LiDAR scan of a large building. We can appreciate
the millimeter accuracy of the scan, even though the scanner was approximately
fifty meters away from the sixty-meter-wide building. Since the scanner’s laser can’t
penetrate solid objects, LiDAR scans have characteristic “shadows” of missing data
produced by occluding foreground objects, which can be seen in Figure 8.3a. These
shadows can be filled in with data from scans from different perspectives once they
have been registered to a common coordinate system, as described in Section 8.4.2.
While the 3D data acquired from a LiDAR scanner is generally of very high quality,
it’s important to note that some materials are problematic for laser-ranging technolo-
gies. Highly reflective surfaces such as glass generally result in missing or incorrect
distance measurements, since the laser beam can easily bounce off the object and
hit another surface in the scene. Depending on the type of glass, the laser beam may
also pass directly through it, returning the distance to an object on the other side.
Also, dark or highly absorptive surfaces may be difficult to scan since an insufficient
amount of light is reflected back to the scanner. Finally, surfaces nearly parallel to
the laser beam often produce poor-quality distance measurements, since most of
the light is reflected away from the scanner; these are called “grazing-angle” errors.
Figure 8.4 illustrates these characteristic problems with LiDAR scanning.⁵
Figure 8.3. (a) An example LiDAR scan of a large building. This scan contains approximately
600,000 points and the building face is approximately 60 m across. Note the characteristic “shadows”
on the building due to occlusions by the tree and truck in the foreground, and missing
returns in several of the windows. (b) A detail of the scan.
Figure 8.4. Characteristic problems with LiDAR scanning. (a) Missing data (thick gray lines)
due to occlusion (back side of circular object) and “shadows” (hidden area of rear wall). (b) False
or missing returns from rays directed at glass surfaces. (c) Grazing-angle errors from surfaces
nearly parallel to the incident ray.
Therefore, the best-case scenario for LiDAR scanning is a scene containing bright,
matte surfaces that are all roughly perpendicular to the laser beam. Man-made struc-
tures like buildings typically scan well (except for the windows, as can be seen in
Figure 8.3b). Accurately scanning a difficult object like a shiny car typically requires
it to be treated with an even coat of matte white spray-on powder beforehand.
5 There are many smaller issues as well. For example, the spread of the laser spot over a long distance
can affect the accuracy of the return, as can the interaction between the color of the laser and
the color of the scene surface. That is, a green laser may be very accurate for a faraway white surface
but give an inaccurate or missing return for a red surface at the same distance.
304 Chapter 8. Three-Dimensional Data Acquisition
Figure 8.5. A co-registered color camera can be used to texture-map the points in a LiDAR scan,
making it more understandable.
While LiDAR scanners are designed only to return distance measurements, many
systems include an integrated RGB camera whose pixels are precisely aligned with the
(θ , φ) rays. This allows the raw 3D points to be texture mapped with color for a more
pleasing and understandable rendering, as illustrated in Figure 8.5. Furthermore, the
color information can be extremely useful when registering multiple LiDAR scans, as
discussed in Section 8.4. If a LiDAR sensor doesn’t come with a co-registered camera,
a rigidly mounted auxiliary RGB camera can be calibrated against the LiDAR sensor
based on the resectioning algorithms discussed in Section 6.3.1 and an appropriate
calibration target.
Figure 8.6. The time of flight of a laser pulse can be used to infer the distance to an object in
the scene.
Figure 8.7. The phase shift of a sinusoidally-modulated continuous laser waveform can be used
to infer the distance to an object in the scene.
φ can take tens of minutes with a pulse-based system. That time is precious on a
busy movie set, and a long scan raises the likelihood of significant changes occurring
in the scene during the scan.
Combining Equation (8.1) with Equation (8.2) yields the key phase-based equation
$$ d = \frac{c\psi}{2\omega} \tag{8.3} $$
Since the phase difference can only be measured modulo 2π , this introduces a
constraint on the maximum range that can be measured before introducing range
ambiguity. That is, we require that 0 < ψ < 2π , which in practice imposes a maximum
range of forty to eighty meters, depending on the modulation frequency [42].6 On the
other hand, phase-based systems are quite a bit faster than pulse-based systems, a
great advantage when time is of the essence.
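As a rough numerical check of Equation (8.3), the sketch below converts a measured phase to a distance and computes the range at which the phase wraps around. It assumes that ω is the angular modulation frequency in radians per second (ω = 2πf); the two modulation frequencies are hypothetical values chosen only because they bracket the forty-to-eighty-meter range quoted above.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def phase_to_distance(psi, omega):
    """Equation (8.3): d = c * psi / (2 * omega), with omega in rad/s (assumed)."""
    return C * psi / (2.0 * omega)


def max_unambiguous_range(mod_freq_hz):
    """Distance at which psi reaches 2*pi; algebraically this is c / (2 f)."""
    omega = 2.0 * math.pi * mod_freq_hz
    return phase_to_distance(2.0 * math.pi, omega)


# Hypothetical modulation frequencies, for illustration only.
for f_mhz in (1.875, 3.75):
    d_max = max_unambiguous_range(f_mhz * 1e6)
    print(f"{f_mhz} MHz -> {d_max:.1f} m")
```

Under these assumptions, halving the modulation frequency doubles the unambiguous range.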
6 It’s possible to design algorithms to resolve this ambiguity using phase unwrapping techniques, if
we put constraints on the extent or spatial gradient of objects in the scene. Alternately, multiple
modulating frequencies can be used with the downside of increasing scanning time.
8.2. S t r u c t u r e d L i g h t S c a n n i n g 307
Figure 8.9. (a) A body scanner based on structured light. The performer stands on the middle
platform as four structured-light scanners move up and down the surrounding gantry. (b) A head
scanner based on structured light. The performer sits in the chair as the gantry moves around
his or her head.
Figure 8.10. The principle of structured light scanning. The object is illuminated with a plane
of light that intersects the surface of the object as a deformed stripe, which is observed by an
offset camera. The projector and camera are accurately calibrated so that 3D locations can be
recovered by triangulation.
once, enabling so-called one-shot scanning (Section 8.2.3). Finally, we discuss struc-
tured light systems targeted at capturing dynamic scenes in real time using fringe
projection (Section 8.2.4).
8.2.1 Calibration
The earliest range scanners used a visible-spectrum laser passing through a cylindri-
cal lens to cast a strong, bright stripe on the object to be scanned, and this technique
is still used today (e.g., for the scanners pictured in Figure 8.9). The key problem is to
determine the geometric relationship between each plane of laser light and the cam-
era’s image plane, as illustrated in Figure 8.11. As we’ll see, this process is very much
like calibrating two cameras, except that one of the cameras is replaced by a laser.
As we know from Sections 5.1 and 6.3.2, the relationship between the coordinates
of two 3D planes is given by a 2D projective transformation. We assume that the
plane of laser light has world coordinates given by (X , Y , Z ) with Z = 0, so that the
coordinates of the plane are given by (X , Y ) and we have
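$$
\begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} \sim
\begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
\tag{8.4}
$$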
where the image plane coordinates of the camera are given by (x, y). Here, we’ve
represented the 2D quantities as homogeneous coordinates and the projective
transformation as a 3 × 3 matrix H defined up to scale.
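To make the "defined up to scale" property concrete, here is a minimal sketch of applying Equation (8.4) to a single image point. The matrix H below is hypothetical (it does not come from a real calibration), and the function name is chosen for illustration.

```python
def apply_homography(H, x, y):
    """Map image point (x, y) to light-plane coordinates (X, Y) via Equation (8.4)."""
    Xh = H[0][0] * x + H[0][1] * y + H[0][2]
    Yh = H[1][0] * x + H[1][1] * y + H[1][2]
    W = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(W) < 1e-12:
        raise ValueError("point maps to infinity on the light plane")
    # Dividing by the third homogeneous coordinate removes the unknown scale.
    return Xh / W, Yh / W


# Hypothetical H, for illustration only.
H = [[0.002, 0.0, -0.64],
     [0.0, 0.002, -0.48],
     [0.0, 0.0005, 1.0]]

X, Y = apply_homography(H, 320.0, 240.0)

# Multiplying H by any nonzero constant gives the same (X, Y).
H_scaled = [[5.0 * h for h in row] for row in H]
X2, Y2 = apply_homography(H_scaled, 320.0, 240.0)
print(X, Y, X2, Y2)
```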
Figure 8.11. The image plane of the camera (with coordinates (x, y)) and the plane of laser
light (with coordinates (X , Y )). The two 2D coordinate systems are related by a projective
transformation.
Figure 8.12. An object with known 3D geometry can be used to calibrate a light plane with
respect to an image plane. The black 3D dots on different surfaces of the calibration object are
all coplanar since they lie on the laser plane.
There are several ways to obtain the correspondences between the image plane
and the light plane that we need to estimate the parameters of the projective transfor-
mation, all of which require a calibration object with known 3D geometry. Typically
we exploit the knowledge that a given image point lies on a known world plane [385]
or world line [86], or matches exactly with a known world point [211].
A straightforward approach uses a calibration object made up of cubes with
checkerboard faces of known dimensions, as illustrated in Figure 8.12. With the laser
turned off, the checkerboards supply the information required to calibrate the cam-
era using the plane-based method of Section 6.3.2 or the resectioning method of
Section 6.3.1. With the laser turned on, the light plane intersects multiple planes in
the image whose equations are known. By intersecting rays from the now calibrated
camera’s center with these planes, we obtain a set of co-planar points, to which we
can fit a 3D plane (at least three points are required on two different surfaces). Once
we have the equation of the plane in 3D, we can change coordinates to obtain the
projective transformation in Equation (8.4).7
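The geometric core of this procedure can be sketched as follows, assuming the camera center is at the origin of the world coordinate system and each known calibration plane is written as normal · X = offset. The planes, ray directions, and function names are hypothetical; a real calibration would use many more points and a least-squares plane fit rather than exactly three points.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect_ray_plane(direction, normal, offset):
    """Intersect the ray t * direction (camera center at the origin) with normal . X = offset."""
    denom = dot(normal, direction)
    if abs(denom) < 1e-12:
        raise ValueError("ray is parallel to the plane")
    t = offset / denom
    return tuple(t * d for d in direction)


def plane_through(p0, p1, p2):
    """Return (normal, offset) of the plane through three non-collinear points."""
    n = cross(sub(p1, p0), sub(p2, p0))
    return n, dot(n, p0)


# Two known faces of a hypothetical calibration object: Z = 2 and X = 1.
face_a = ((0.0, 0.0, 1.0), 2.0)
face_b = ((1.0, 0.0, 0.0), 1.0)

# Back-projected rays through three pixels lit by the laser stripe
# (two on face_a, one on face_b).
p0 = intersect_ray_plane((0.25, 0.1, 1.0), *face_a)
p1 = intersect_ray_plane((0.25, -0.1, 1.0), *face_a)
p2 = intersect_ray_plane((1.0, 0.0, 1.5), *face_b)

laser_normal, laser_offset = plane_through(p0, p1, p2)
print(laser_normal, laser_offset)
```

In this toy configuration the recovered light plane is X + Z = 2.5, up to a common scale factor on the normal and offset.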
Now we have a direct mapping given by Equation (8.4) between any point in the
image plane and 3D world coordinates, but only for a single position of the light
plane (i.e., a single stripe on the object). To scan the entire object, the laser stripe
projector and camera (which are rigidly mounted to each other) usually move along
a precisely controlled linear (e.g., Figure 8.9a) or circular (e.g., Figure 8.9b) path, such
that the world coordinate transformation between any two positions of the scanner
head is known. Alternately, the projector/camera rig stays in place while a precisely
computer-controlled stage translates the object through the light plane. Handheld
laser stripe scanners (e.g., the Polhemus FastSCAN) use a magnetic sensor to localize
the scanner head in space (using similar technology to magnetic motion capture
systems).
7 Or, we can simply leave the (X , Y , Z ) values in the same world coordinate system in which the
planes were defined.
Figure 8.13. (a) A 3D acquisition rig containing two calibrated cameras (left and right) and a
structured light projector (center). (b) The intersection of the laser stripe’s projection and an
epipolar line in the left image plane generates a unique correspondence in the right image plane,
which can be triangulated to determine a 3D point on the object’s surface.
8 Davis and Chen [113] describe how a single camera and a system of mirrors can be used to avoid
the need for a second camera.
Figure 8.14. Space-time analysis for better triangulation. The observed intensity at pixel x as
a laser stripe sweeps across the surface of an object is modeled as a Gaussian function (dotted
line, top right). The mean of the Gaussian gives an estimate of the time at which the laser stripe
is centered directly on the corresponding point on the 3D object.
Figure 8.15. Calibrating an LCD or DLP projector by observing the projections of a checkerboard
on a blank plane.
the image it projects, as illustrated in Figure 8.15. Thus, we can compute the internal
and external parameters of the projector using a clever approach: we project a static
image of a checkerboard or a grid onto an empty white plane that is moved around
the scene [240].
For each position of the plane, the camera views a skewed checkerboard, and we
compute the projective transformation Hi mapping the skewed camera image to the
rectilinear projected image. By collecting the projector-to-camera correspondences
for all the positions of the plane, we can easily estimate the fundamental matrix
and epipolar geometry relating the projector and camera (see Section 5.4.2). The
camera matrix for the projector Pproj is related to the fundamental matrix through
the relationships in Section 6.4 and the projective transformations {Hi } through the
relationships in Section 6.3.2. A method similar to the plane-based calibration of
Section 6.3.2 can then be applied to the set of {Hi } to estimate Pproj [381, 125].9 Once
we have these external and internal parameters, the triangulation process is the same
as in Section 6.4.
Figure 8.17. Projecting M binary patterns (inset at the lower right of each image) onto an object
allows 2^M on/off patterns to be generated, effectively coding the index of the light plane. In this
example, we use eight binary patterns, allowing 256 stripes to be uniquely indexed. The on/off
patterns at the indicated point (01010001) show that it corresponds to stripe 81 in the
finest-resolution pattern.
Figure 8.18. This slightly different set of binary patterns is based on Gray codes. The codewords
for each pair of adjacent stripes only differ in one bit.
It’s much more common to use a slightly different set of binary patterns illustrated
in Figure 8.18 based on Gray codes. The advantage is that the codewords for each
pair of adjacent stripes only differ in one bit, making it easier to accurately determine
where the stripe transition occurs.10
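The one-bit-difference property is easy to check numerically. The sketch below uses the standard reflected binary Gray code (n XOR n >> 1), which is an assumption about the exact code shown in Figure 8.18, and also decodes the plain binary codeword from Figure 8.17. Function names are chosen for illustration.

```python
def binary_to_gray(n):
    """Reflected binary Gray code of the stripe index n."""
    return n ^ (n >> 1)


def gray_to_binary(g):
    """Invert binary_to_gray by XOR-ing together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bits_to_index(bits):
    """Interpret on/off observations (most significant pattern first) as an integer."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


M = 8  # number of projected patterns
codes = [binary_to_gray(i) for i in range(2 ** M)]

# Number of differing bits between each pair of adjacent stripes.
adjacent_differences = [bin(codes[i] ^ codes[i + 1]).count("1")
                        for i in range(2 ** M - 1)]

print(bits_to_index([0, 1, 0, 1, 0, 0, 0, 1]))  # plain binary codeword from Figure 8.17
print(set(adjacent_differences))
```

With a Gray code, a misread bit at a stripe transition shifts the decoded index by at most one stripe, whereas with plain binary a single misread high-order bit can move the index by many stripes.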
Determining whether a scene point is illuminated or not illuminated at a
given instant is harder than it may sound. For example, a scene point with low
reflectance under bright illumination from the projector may have a compara-
ble intensity in the camera’s image to a scene point with high reflectance under
no projection illumination. Therefore, using a single global threshold to make the
illuminated/non-illuminated decision for all pixels in the image is a bad idea.
One possibility, as illustrated in Figure 8.19a, is to first acquire two calibration
images: one in which the projector is on for all pixels (i.e., an all-white pattern) and
one in which the scene is not illuminated by the projector at all (i.e., an all-black
pattern). The average of these calibration images gives a spatially adaptive threshold
10 Another way to think about this is that we’re using the same 2^M codewords, but just changing their
left-to-right order in the projected image.
Figure 8.19. (a) Two images in which the projector is fully on and totally off allow the determi-
nation of a per-pixel threshold. In this case the (on,off) intensities at the (darker) red pixel are
(137, 3) and at the (brighter) green pixel are (226, 18), leading to per-pixel thresholds of 70 and
122, respectively. (b) Alternately, each binary pattern and its inverse can be projected, and we
interpret a 1 if the pattern at a pixel is brighter than its inverse.
Figure 8.20. Hall-Holt and Rusinkiewicz’s proposed patterns for stripe boundary coding.
that we can use to determine the on/off state for each pixel in the subsequent binary
patterns.
Alternately, Scharstein and Szeliski [426] suggested projecting each binary pattern
followed by its inverse, as illustrated in Figure 8.19b. The codeword bit is assigned
as 1 if the pixel’s intensity is brighter in the original pattern compared to the inverse
pattern, and 0 in the opposite case. They claimed this approach was more reliable
than using all-on and all-off images; on the other hand, it requires projecting twice as
many patterns. Regardless of the approach, scanning objects that are shiny or contain
surfaces with very different reflectances can be difficult. As with LiDAR scanning, the
best-case scenario is a matte object with uniform reflectance.
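The two decision rules just described can be sketched in a few lines. The (on, off) calibration intensities are the ones quoted in the caption of Figure 8.19; the pattern and inverse-pattern intensities are hypothetical, and images are represented as nested lists purely to keep the example self-contained.

```python
def per_pixel_threshold(on_image, off_image):
    """Average the all-white and all-black calibration images pixel by pixel."""
    return [[(on + off) / 2.0 for on, off in zip(row_on, row_off)]
            for row_on, row_off in zip(on_image, off_image)]


def decode_bit_threshold(pattern_image, threshold):
    """Bit is 1 where the pattern image is brighter than the per-pixel threshold."""
    return [[1 if value > t else 0 for value, t in zip(row, row_t)]
            for row, row_t in zip(pattern_image, threshold)]


def decode_bit_inverse(pattern_image, inverse_image):
    """Bit is 1 where the pattern image is brighter than the inverse-pattern image."""
    return [[1 if value > inv else 0 for value, inv in zip(row, row_inv)]
            for row, row_inv in zip(pattern_image, inverse_image)]


# One-row "images" holding the darker and the brighter pixel of Figure 8.19a.
on_image = [[137, 226]]
off_image = [[3, 18]]
threshold = per_pixel_threshold(on_image, off_image)

# Hypothetical observation: both pixels read 100 under some binary pattern.
pattern_image = [[100, 100]]
bits = decode_bit_threshold(pattern_image, threshold)

# Hypothetical readings of the same pixels under the inverse pattern.
inverse_image = [[5, 210]]
bits_inverse = decode_bit_inverse(pattern_image, inverse_image)

print(threshold, bits, bits_inverse)
```

The same observed intensity of 100 decodes as "on" at the darker pixel and "off" at the brighter one, which no single global threshold could reproduce.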
Hall-Holt and Rusinkiewicz [185] advocated the use of stripe boundary codes.
That is, instead of trying to detect the center of each stripe to use in triangulation,
they proposed to detect the boundary between stripes, which can be more accurately
located. Thus the changing pattern of on/off illumination of the stripes on each side
of the boundary generates the codeword. In particular, they proposed the set of four
patterns illustrated in Figure 8.20; each pattern contains 111 stripes, and each of the
110 quadruples of on-off patterns across the stripe boundaries occurs only once.11
Clearly, we can reduce the number of projected patterns required to define a
codeword if we allow the patterns to have more than two states. One possibility is
to allow grayscale values in the projections, and another is to allow colored stripes.
In either case, if each stripe in a pattern can be in one of N states, then N^M unique
stripes can be coded with M patterns. Horn and Kiryati [204] explored the grayscale
approach, using Hilbert space-filling curves to produce well-separated codewords for
a user-specified number of patterns or stripes. However, the more gray-level states
are allowed, the more difficult it is to correctly resolve the state, especially in the
presence of nonuniform-reflectance objects.
11 Note that we can’t directly resolve a boundary when the illumination is constant across it (e.g.,
white-white). However, the sequence is designed so that the patterns both before and after such
an occurrence have a visible illumination change, which localizes the boundary.
Caspi et al. [83] studied the problem of colored stripes in more detail. They carefully
modeled the relationship between the projected color, the surface reflectance, and
the camera’s color response for each scene point, resulting in a set of color patterns
adapted to the environment to be scanned. The model can be written as
$$ C_{\text{cam}} = A \, D \, P(C_{\text{proj}}) + C_0 \tag{8.5} $$
where Ccam is the observed RGB color at a camera pixel, Cproj is the color instruction
given to the projector, and C0 is the observed color with no projector illumination.
The constant 3 × 3 matrix A defines the coupling, or cross-talk, between the color
channels of the projector and the camera, and the pixel-dependent diagonal matrix
D is related to the corresponding scene point’s reflectance. The projection operator
P accounts for the difference between the instruction given to the projector and the
actual projected color. All of the parameters of the model can be estimated prior to
scanning using a simple colorimetric calibration process. Given this model, the goal
is to choose scene-adapted color patterns that can be maximally discriminated by the
camera. The result is a generalized Gray code that uses a different number of levels
for each color channel and was shown to improve on binary Gray codes. In the next
section, we discuss color-stripe methods in more detail.
0001002003011012013021022023031032033111211312212313213322232333
is a de Bruijn sequence of order 3 over an alphabet of four symbols. We can verify that
every possible length-3 subsequence occurs exactly once.
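The verification can be done mechanically. The sketch below assumes the sequence is read cyclically, so that windows wrapping from the end back to the start are included; without wrap-around, a string of length 64 contains only 62 windows of length 3.

```python
from itertools import product

SEQ = "0001002003011012013021022023031032033111211312212313213322232333"


def cyclic_windows(seq, k):
    """All length-k windows of seq, including those that wrap around the end."""
    extended = seq + seq[:k - 1]
    return [extended[i:i + k] for i in range(len(seq))]


windows = cyclic_windows(SEQ, 3)
all_words = {"".join(w) for w in product("0123", repeat=3)}

no_repeats = len(windows) == len(set(windows))
covers_everything = set(windows) == all_words
print(len(SEQ), no_repeats, covers_everything)
```

For a stripe pattern, this uniqueness means that observing any three consecutive stripe colors is enough to identify their position in the projected pattern.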
Figure 8.21. Zhang et al.’s color stripe pattern created using a de Bruijn sequence of order 3
over five symbols.
Figure 8.22. (a) An image of an object illuminated using the stripe pattern in Figure 8.21.
(b) Recovering the correspondence between the projected and observed color patterns using
dynamic programming.
12 In this application, N = 5, not 8, since Zhang et al. did not allow adjacent stripes to be the same
color, and ruled out neighbors in which the red and green channels changed at the same time.
Figure 8.23. The three phase-shifted images used in the fringe projection algorithm.
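Huang et al. [206] described the basic approach of projecting three sinusoidal
images with frequency ω separated by 2π/3 in phase:
$$
\begin{aligned}
I_R(x, y) &= \tfrac{1}{2}\left(1 + \cos\left(\omega x - \tfrac{2\pi}{3}\right)\right) \\
I_G(x, y) &= \tfrac{1}{2}\left(1 + \cos\left(\omega x\right)\right) \\
I_B(x, y) &= \tfrac{1}{2}\left(1 + \cos\left(\omega x + \tfrac{2\pi}{3}\right)\right)
\end{aligned}
\tag{8.6}
$$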
The images have been scaled in intensity to take up the full [0, 1] range, and are
illustrated in Figure 8.23.
Huang et al. made the clever observation that these fringe patterns could be pro-
jected at extremely high speed (i.e., 240 frames per second) by modifying a single-chip
DLP projector. A DLP projector modulates the white light from a projector bulb into
grayscale intensities using a digital micromirror device (DMD), an array of tiny mir-
rors that rapidly flip back and forth. RGB colors are created at each pixel by placing a
rapidly spinning “color wheel” between the bulb and the DMD. If the color wheel is
removed, then sending a static RGB image to the projector results in moving grayscale
fringes projected at high speed onto an object. This trick has been adopted by many
researchers in the projector-camera community.
The DLP projector is synchronized with a high-speed digital camera. Therefore, a
sequence of three successive images captured by the camera will be given by
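$$
\begin{aligned}
I_1(x, y) &= A(x, y) + B(x, y) \cos\left(\psi(x, y) - \tfrac{2\pi}{3}\right) \\
I_2(x, y) &= A(x, y) + B(x, y) \cos\left(\psi(x, y)\right) \\
I_3(x, y) &= A(x, y) + B(x, y) \cos\left(\psi(x, y) + \tfrac{2\pi}{3}\right)
\end{aligned}
\tag{8.7}
$$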
where A(x, y) is the per-pixel average intensity of the three images, B(x, y) is the per-
pixel amplitude of the observed sinusoid, and ψ(x, y) is the observed phase map. We
can recover this phase map at each pixel by combining the three observed intensities:
$$ \psi(x, y) = \arctan\left( \sqrt{3} \, \frac{I_1(x, y) - I_3(x, y)}{2 I_2(x, y) - I_1(x, y) - I_3(x, y)} \right) \tag{8.8} $$
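The following sketch simulates the three intensities of Equation (8.7) at one pixel and recovers the phase with Equation (8.8). It substitutes the two-argument arctangent (atan2) for the plain arctangent, which is a choice made here rather than something stated in the text, so that the recovered phase covers the full interval (−π, π]. The values of A, B, and ψ are hypothetical.

```python
import math

SHIFT = 2.0 * math.pi / 3.0


def observe(a, b, psi):
    """Equation (8.7): three phase-shifted intensities at a single pixel."""
    return (a + b * math.cos(psi - SHIFT),
            a + b * math.cos(psi),
            a + b * math.cos(psi + SHIFT))


def wrapped_phase(i1, i2, i3):
    """Equation (8.8), evaluated with atan2 to keep the correct quadrant."""
    return math.atan2(math.sqrt(3.0) * (i1 - i3), 2.0 * i2 - i1 - i3)


# Hypothetical per-pixel offset, amplitude, and phase.
true_psi = 2.0
i1, i2, i3 = observe(a=0.5, b=0.3, psi=true_psi)

recovered_psi = wrapped_phase(i1, i2, i3)
recovered_a = (i1 + i2 + i3) / 3.0  # the three cosines sum to zero
print(recovered_psi, recovered_a)
```

The numerator equals √3 · (√3 B sin ψ) = 3B sin ψ and the denominator equals 3B cos ψ, so both A and B cancel out of the ratio. This is why the phase estimate is insensitive to per-pixel reflectance and ambient light.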
calculation in Equation (8.8) with one that depends simply on intensity ratios, and
Zhang and Yau [572] used two fringe images and a flat (projector-fully-on) image to
mitigate measurement errors and increase processing speed. Weise et al. [540] noted
that moving objects inevitably generate “ripple” artifacts in 3D since the assumption
that the same pixel location in all three images in Equation (8.7) corresponds to the
same scene point is incorrect. They proposed a method to estimate and compensate
for the underlying motion to remove the artifacts.
Phase unwrapping is a major challenge for fringe-projection methods, and there
is a vast literature on methods to solve the problem (e.g., see [166]). Luckily, in appli-
cations where real-time performance is required (e.g., real-time 3D measurement
of facial expressions), the surface generally changes sufficiently smoothly (except in
problematic regions like facial hair).
14 Algorithms that only produce a sparse, irregular set of 3D points (e.g., the recovered 3D points
produced by a matchmoving algorithm) are not considered to be multi-view stereo algorithms.
15 Datasets and continually-updated results are available at http://vision.middlebury.edu/mview/.
8.3. M u l t i - V i e w S t e r e o 321
accelerated research in stereo and optical flow. The two main evaluation datasets
are based on roughly constant-reflectance models approximately ten centimeters on
a side, captured from hundreds of viewpoints distributed on a hemisphere (see
Figure 8.27a). Strecha et al. [470] later contributed a benchmarking dataset for large-
scale multi-view stereo algorithms, using high-resolution images of buildings many
meters on a side.
It’s important to note that while modern multi-view stereo results are qualitatively
quite impressive, and quantitatively (i.e., sub-millimeter) accurate for small objects,
purely image-based techniques are not yet ready to replace LiDAR systems for highly
accurate, large-scale 3D data acquisition. For example, Strecha et al. [470] estimated
that for large outdoor scenes, only forty to sixty percent of the 3D points for a top MVS
algorithm applied to high-resolution images were within three standard deviations
of the noise level of a LiDAR scanner, while ten to thirty percent of the ground truth
measurements were missing or wildly inaccurate in the MVS result. For this reason,
multi-view stereo papers typically use LiDAR or structured light results as the ground
truth for their algorithm comparisons. Multi-view stereo algorithms can also be quite
computationally expensive and hence slow, another drawback compared to near-
real-time structured light systems.
16 This method, and multi-view stereo methods in general, perform best on Lambertian surfaces, as
opposed to specular or translucent ones. Of course, the same is true for LiDAR and structured light
methods.
Figure 8.24. A top view of one plane of voxels and three cameras with associated image planes.
Gray voxels are occupied, and white voxels are empty. A voxel is photo-consistent if its color
closely matches the color of all the pixels in the images in which it is visible. In this example, the
color of the striped voxel only needs to agree with the colors of the pixels in the left and middle
cameras, since it is occluded from the perspective of the right camera.
Kutulakos and Seitz [258] later proposed the seminal space carving approach,
which formalized the concept of the photo hull, the largest set of colored voxels
that is photo-consistent with the input images. The space carving algorithm provably
achieves the photo hull by iteratively removing voxels from an initial volume (e.g.,
the visual hull), either one by one or along a set of orthogonal sweeping planes. While
there are no restrictions on the camera configuration, many sweep iterations may be
required to reach the photo hull.
Kolmogorov and Zabih [247] proposed an algorithm that can handle the same
camera configurations as voxel coloring, but has the advantages that spatial coher-
ence between the source image pixels is enforced, and that irreversible incremental
decisions about voxel removal are avoided. They posed the problem as a labeling
problem over the pixels in all the source images, where the label corresponded to
a discretized depth from a reference camera. The problem of finding an optimal
labeling is posed as a graph-cut problem in which the data term is related to voxel
photo-consistency and the smoothness term encourages neighboring pixels to have
similar depths. An additional visibility term is required to encode the notion that
voxels are opaque. The multi-label graph-cut problem is solved using α-expansion.
Vogiatzis et al. [522] investigated a different graph-cut approach for more realistic
multi-view stereo problems. Instead of using multiple labels, they posed a two-label
problem in which each voxel was classified as part of the object or not.17 The smooth-
ness term in the problem is related to the photo-consistency of neighboring voxels,
while the data term encourages the volume to expand (to avoid incomplete recon-
structions). Voxels outside the visual hull can be automatically ruled out if image
silhouettes are available. Once the voxels are selected with either graph-cut approach,
they can be colored (e.g., using the average color of the corresponding pixels in the
images in which they are visible).
An interesting feature of volumetric approaches is that neither sparse nor dense
correspondences between the images are typically required, unlike some of the other
algorithms in this section. However, the general problem with volumetric approaches
is that their accuracy is limited by the size of the voxel grid. Even at tabletop scale, the
grid needs to be hundreds of voxels on a side to achieve sub-millimeter accuracy, and
the resulting space carving or graph-cut methods are computationally demanding,
both in terms of speed and required memory. Also, as with voxel carving, a reasonably
large number (tens to hundreds) of calibrated images may be required to get an
acceptable result.
In Equation (8.9), the Etexture term is based on the consistency of each surface point
with the projections in the resulting images, measured using the normalized cross-
correlation of windows around the projected correspondences. That is, for vectors of
image intensities u and v taken from square windows surrounding a correspondence
in a pair of images, we compute
$$ \mathrm{NCC}(u, v) = \frac{1}{s_u s_v} \sum_{i=1}^{n} (u_i - \mu_u)(v_i - \mu_v) \tag{8.10} $$
where µu and su are the mean and standard deviation of the elements of u.
The normalized cross-correlation is robust to affine changes in intensity between
the windows. The Esilhouette term forces the surface to project to the silhouettes in
the source images, and the Einternal term acts to smooth the surface by decreasing its
surface area. The overall energy function is minimized by evolving the vertices of the
triangle mesh in the negative direction of the gradient of Equation (8.9) according to
a partial differential equation.
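The robustness of normalized cross-correlation to affine intensity changes can be checked directly. In the sketch below the sum in Equation (8.10) is additionally divided by n and the standard deviations are computed with a 1/n factor, so that the score lies in [−1, 1]; this scaling convention is an assumption, since the text does not say how s_u is normalized. The window values are hypothetical.

```python
import math


def ncc(u, v):
    """Normalized cross-correlation of two equal-length intensity vectors."""
    n = len(u)
    mu_u = sum(u) / n
    mu_v = sum(v) / n
    s_u = math.sqrt(sum((x - mu_u) ** 2 for x in u) / n)
    s_v = math.sqrt(sum((x - mu_v) ** 2 for x in v) / n)
    if s_u == 0.0 or s_v == 0.0:
        raise ValueError("NCC is undefined for a constant window")
    total = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    return total / (n * s_u * s_v)


# A hypothetical 3 x 3 window, flattened into a vector.
u = [10.0, 20.0, 30.0, 25.0, 15.0, 40.0, 35.0, 5.0, 45.0]

# The same window after an affine intensity change (gain 2, offset 7).
v = [2.0 * x + 7.0 for x in u]

# The same window with inverted contrast.
w = [-x for x in u]

print(ncc(u, v), ncc(u, w))
```

A positive gain and any offset leave the score at 1 up to rounding, while inverted contrast gives −1.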
The algorithm by Pons et al. [370] represents the evolving surface using a level-set
function instead of a triangulated mesh. That is, the desired set of 3D points S is
implicitly defined as those points that satisfy f(S) = 0 for some function f : ℝ³ → ℝ.
The values of the function f on a voxel grid are iteratively updated until convergence
according to a partial differential equation, and S is subsequently extracted using
an isosurfacing algorithm (see more in Section 8.4.3). The theory of how to evolve f
when we really want to minimize a cost function on S is discussed in Sethian [436].
An important aspect of Pons et al.’s approach is their cost function, which is based
on reprojection, as illustrated in Figure 8.25. For a set of M source images {I1 , . . . , IM },
and a neighborhood relation on the images N, this reprojection function can be stated
simply as
$$ E(S) = \sum_{i=1}^{M} \sum_{j \in N(i)} E_{ij}(S) \tag{8.11} $$
where Eij represents the dissimilarity between image Ii and its reprojection on the
image plane of Ij , via the surface S. This approach contrasts with methods that simply
compute the normalized cross-correlation of square windows of pixels, not taking
into account the deformation of the patch induced by the shape of the surface. This
is important since a patch that projects to a square region in one image may project
to a skinny, lopsided region in another, or be partially occluded by another piece of
the surface.
Hiep et al. [198] addressed the challenges of applying surface deformation methods
to large-scale multi-view stereo problems, using the benchmark datasets of Strecha
et al. [470]. They matched and triangulated DoG and Harris features across the image
set, producing a dense set of 3D points that they turned into a mesh using Delaunay
triangulation. This mesh is then evolved and adapted using a regularized partial
differential equation.
Figure 8.25. A square patch in the left image is reprojected onto the right image via an estimated
3D surface, resulting in the non-square black-outlined area. A square patch in the right image
(dotted outline) centered at the projection would not correctly represent the correspondence
between the two images.
Surface deformation methods generally have a bias toward computing surfaces
with minimal surface area or bending energy, which can have the effect of smoothing
away sharp details. On the other hand, the continuity of the mesh/level-set enables
the 3D reconstruction to span flat, untextured regions on the underlying surface that
are challenging for the methods we discuss next. Thus, surface deformation results
typically don’t contain missing regions.
Figure 8.26. (a) To compute a 3D patch’s score in PMVS, we sample it on a regular grid in 3D,
project these samples to points in each image in which the patch might be visible, and compute
the normalized cross-correlation of blocks of intensities around the projected locations. (b) If a
cell of the coarse image grid (dark gray) has no corresponding patch, we hypothesize the center
c(p′) of a new patch p′ by intersecting the viewing ray through the cell with the plane corresponding
to a nearby patch p from an adjacent cell (light gray).
Figure 8.27. (a) Six of sixteen input images for a multi-view stereo algorithm. (b) Two views of
the 3D result of PMVS (after creating a triangle mesh from the estimated points).
The next step is to expand the patches generated for high-quality feature
matches into regions where no good features were found. This is accomplished by
finding a cell of the coarse grid in some image that has no corresponding patch but
has a neighbor cell with a well-estimated patch p. We simply create a new patch p′ for
the patchless cell, estimating c(p′) as the intersection of the viewing ray with the plane
containing the neighbor’s 3D patch, and initializing n(p′) = n(p) and V(p′) = V(p).
The process of refining the patch parameters then continues as shown previously. If
the fit is poor (e.g., p′ is not visible in enough images, or straddles a depth discontinuity
in the scene) the new patch is rejected.
The result of PMVS is a dense collection of small 3D patches with associated nor-
mals. This collection can optionally be turned into a triangulated mesh, for example
using the Poisson surface reconstruction method discussed in Section 21. The result-
ing 3D reconstructions can obtain sub-millimeter accuracy on tabletop-sized objects,
and centimeter accuracy on large-scale scans. Figure 8.27 illustrates an example result
on one of Seitz et al.’s benchmarking datasets.
One drawback of patch-based methods is that they may contain holes, especially
in places where texture information in the images is unreliable. This may require a
3D-inpainting-like method to fill in holes and obtain a complete model. The algo-
rithms can also be quite slow (e.g., hours of running time). Nonetheless, patch-based
methods are quite appealing due to their generality; a patch-based multi-view stereo
approach was used by Goesele et al. [171] to generate high-quality models of land-
marks and large-scale building interiors/exteriors solely using community photo
collections (e.g., by keyword searching for “Trevi Fountain” on Flickr).
Figure 8.28. A depth d(p) is evaluated for a pixel p in the reference view by considering the
normalized cross-correlation of windows around the projected image locations.
3D data acquired using LiDAR or structured light from a single point of view suffers
from the shadowing problem illustrated in Figure 8.3. That is, we only get a depth
estimate at a given pixel for the corresponding scene surface closest to the camera.
Therefore, we commonly move the scanner around the scene to acquire scans from
viewpoints that fill in the gaps and make the 3D model more complete.
In this section, we address two key problems associated with this process. The first
is how to align multiple 3D datasets into the same coordinate system. We take a similar
approach to the problem of 2D image alignment: features in each scan are detected,
matched, and used as the basis for estimating a parametric transformation between
each scan pair. However, in 3D we need different methods for feature detection and
registration, as we discuss in Sections 8.4.1 and 8.4.2.
Once we have a method for aligning scans, the second problem is how to cre-
ate a usable triangular mesh from the resulting collection of points. Algorithms for
these problems of multiscan fusion and meshing are overviewed in Section 8.4.3.
Throughout this section, we motivate the algorithms using data acquired from LiDAR
scanners, but the same methods apply to point clouds created from structured light
or multi-view stereo.
Figure 8.29. (a) 3D data is fundamentally represented as a point cloud. (b) The point cloud
inherits a mesh from the order of scanning.
Figure 8.29a, a close-up of the LiDAR data from Figure 8.3, illustrates the problem. Instead of having
a uniform grid of pixels with associated intensities, we have a nonuniform collection
of data points that all look the same.²⁰
However, the 3D point cloud is not totally unstructured; the way in which the data
is acquired usually imposes a mesh. For example, a LiDAR scan inherits a natural
triangulation based on connecting the measurements from adjacent (θ , φ) bins, as
illustrated in Figure 8.29b. Usually we apply a heuristic to ensure that the triangles
don’t span depth discontinuities; for example, we can remove mesh edges that are
longer than some multiple of the median edge length. Such a triangulation also allows
us to compute an estimate of the normal n(p) at each point p in the point cloud. The
easiest ways to compute the normal are to take the average normal of all the mesh
triangles that meet at the vertex, or to use the normal to a plane fit to the points in p’s
local neighborhood.
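The plane-fitting option for normal estimation can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function name `estimate_normal`, the neighborhood size `k`, and the use of the scanner position `viewpoint` to choose the normal's sign are all assumptions made for the example.

```python
import numpy as np

def estimate_normal(points, index, k=8, viewpoint=None):
    """Estimate the unit normal at points[index] by fitting a plane
    to its k nearest neighbors (hypothetical helper for illustration)."""
    p = points[index]
    # Brute-force nearest neighbors; a k-d tree would scale better.
    dists = np.linalg.norm(points - p, axis=1)
    neighbors = points[np.argsort(dists)[:k + 1]]  # includes p itself
    centered = neighbors - neighbors.mean(axis=0)
    # The plane normal is the direction of least variance, i.e., the right
    # singular vector associated with the smallest singular value.
    _, _, vt = np.linalg.svd(centered)
    n = vt[-1]
    # A plane fit leaves the sign ambiguous; here we assume the normal
    # should point back toward the scanner location, if one is given.
    if viewpoint is not None and np.dot(viewpoint - p, n) < 0:
        n = -n
    return n
```

For points sampled from the plane Z = 0 and a scanner above the plane, the function returns (0, 0, 1) up to rounding.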
The two most common methods for feature description in this type of point cloud
data are spin images and shape contexts. Both methods are based on computing
histograms of points lying within 3D bins in the neighborhood of a selected point,
but differ in the structure of the bins.
Spin images, proposed by Johnson and Hebert [224], consider a cylindrical vol-
ume centered around the selected point, with the cylinder’s axis aligned with the
point’s estimated normal, as illustrated in Figure 8.30a. The cylinder is partitioned
into uniformly spaced bins along the radial and normal directions, with a bin size
roughly equal to the distance between scan points. The number of bins is generally
chosen so that each model point falls in some bin. We then create an “image” h(i, j)
as the number of points falling in the (i, j)th bin, where i corresponds to the radial
direction and j to the normal direction. Only entries that have similar normals to the
center point contribute to each histogram bin, to avoid contributions from points on
the other side of the model. Examples of spin images at various points on an example
mesh are illustrated in Figure 8.30b-c.
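The construction just described can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `spin_image`, the normal-agreement threshold `max_angle_deg`, and the choice to center the bins along the normal direction at the selected point are ours, not details taken from Johnson and Hebert.

```python
import numpy as np

def spin_image(points, normals, p, n, bin_size, n_radial, n_normal,
               max_angle_deg=60.0):
    """Return the spin image h(i, j) at point p with unit normal n.
    i indexes the radial direction and j the normal direction."""
    d = points - p
    beta = d @ n  # signed height along the normal
    # Radial distance from the cylinder axis (clamped against rounding).
    alpha = np.sqrt(np.maximum(np.sum(d * d, axis=1) - beta ** 2, 0.0))
    # Only points whose normals roughly agree with n contribute
    # (the angular threshold is an assumed parameter).
    keep = normals @ n >= np.cos(np.radians(max_angle_deg))
    radial_edges = np.arange(n_radial + 1) * bin_size
    half_height = 0.5 * n_normal * bin_size
    normal_edges = np.linspace(-half_height, half_height, n_normal + 1)
    h, _, _ = np.histogram2d(alpha[keep], beta[keep],
                             bins=[radial_edges, normal_edges])
    return h
```

Because only the height along the normal and the distance from the axis enter the histogram, rotating the input points about the normal leaves the result unchanged.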
If we observe the same 3D object in a different orientation, the spin images at cor-
responding points will agree, making them an attractive basis for feature description.
20 If an RGB camera image is also available, feature detection and matching is more reliable, as we
discuss shortly. While the intensity-of-return image from the LiDAR scanner could theoretically
be used for feature detection, this is rare in practice.
8.4. Registering 3D Datasets 331
Figure 8.30. (a) The cylindrical bins used to create a spin image. (b) Several points on a
mesh. (c) The corresponding spin images. A darker value indicates that more points are in the
corresponding bin.
Note that the cylinder can “spin” around the normal vector while still generating the
same descriptor (hence the name), avoiding the need to estimate a coordinate orien-
tation on the tangent plane at the point. The similarity between two spin images can
be simply measured using either their normalized cross-correlation or their Euclidean
distance. Johnson and Hebert also recommended using principal component analy-
sis to reduce the dimensionality of spin images prior to comparison. Another option
is to use a multiresolution approach to construct a hierarchy of spin images at each
point with different bin sizes [121].
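The two similarity measures mentioned above can be written compactly. This is a small sketch with hypothetical function names (`ncc`, `euclidean`); it omits the principal component analysis and multiresolution steps.

```python
import numpy as np

def ncc(h1, h2):
    """Normalized cross-correlation between two spin images of equal shape."""
    a = h1.ravel() - h1.mean()
    b = h2.ravel() - h2.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # A constant image has no variation to correlate; return 0 in that case.
    return float(a @ b / denom) if denom > 0 else 0.0

def euclidean(h1, h2):
    """Euclidean distance between two spin images of equal shape."""
    return float(np.linalg.norm(h1 - h2))
```

One practical difference between the two: the normalized cross-correlation is unchanged if every bin count in one image is multiplied by a constant, as happens when one scan samples the surface more densely, whereas the Euclidean distance is not.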
Shape contexts were originally proposed by Belongie et al. [38] for 2D shapes
and extended to 3D point clouds by Frome et al. [155]. As illustrated in Figure 8.31,
a 3D shape context also creates a histogram using bins centered around the selected
point, but the bins are sections of a sphere. The partitions are uniformly spaced in the
azimuth angle and normal direction, and logarithmically spaced in the radial direc-
tion. Since the bins now have different volumes, larger bins and those with more
points are weighted less. As with spin images, the “up” direction of the sphere is
defined by the estimated normal at the selected point. Due to the difficulty in estab-
lishing a reliable orientation on the tangent plane, one 3D shape context is compared
to another by fixing one descriptor and evaluating its minimal Euclidean distance
over the descriptors generated by several possible rotations around the normal of the
other point.
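A rough sketch of a 3D shape context and its rotation-tolerant comparison follows. Several choices here are assumptions for illustration rather than details from Frome et al.: the function names, the default bin counts, the arbitrary tangent-plane axes, and the normalization by the cube root of each bin's volume. The additional downweighting of densely populated regions is omitted.

```python
import numpy as np

def shape_context_3d(points, p, n, r_min, r_max, n_r=5, n_az=8, n_el=4):
    """Histogram of neighbors of p in spherical bins whose pole is the unit normal n.
    Axes of the result: (radial, azimuth, elevation)."""
    d = points - p
    r = np.linalg.norm(d, axis=1)
    valid = r > 0  # drop p itself
    d, r = d[valid], r[valid]
    # Arbitrary tangent-plane axes u, v (the azimuth origin is not repeatable).
    helper = np.eye(3)[np.argmin(np.abs(n))]
    u = np.cross(n, helper)
    u = u / np.linalg.norm(u)
    v = np.cross(n, u)
    az = np.mod(np.arctan2(d @ v, d @ u), 2 * np.pi)
    el = np.arccos(np.clip(d @ n / r, -1.0, 1.0))  # angle from the normal
    r_edges = np.logspace(np.log10(r_min), np.log10(r_max), n_r + 1)
    az_edges = np.linspace(0.0, 2 * np.pi, n_az + 1)
    el_edges = np.linspace(0.0, np.pi, n_el + 1)
    h, _ = np.histogramdd(np.column_stack([r, az, el]),
                          bins=[r_edges, az_edges, el_edges])
    # Volume of each spherical bin, used to downweight the larger bins.
    vol = (((r_edges[1:] ** 3 - r_edges[:-1] ** 3) / 3.0)[:, None, None]
           * np.diff(az_edges)[None, :, None]
           * (np.cos(el_edges[:-1]) - np.cos(el_edges[1:]))[None, None, :])
    return h / np.cbrt(vol)

def shape_context_distance(h1, h2):
    """Minimal Euclidean distance over rotations of h2 about the normal,
    implemented as circular shifts along the azimuth axis."""
    return min(np.linalg.norm(h1 - np.roll(h2, s, axis=1))
               for s in range(h2.shape[1]))
```

Searching over circular shifts of the azimuth bins stands in for the "several possible rotations around the normal"; it only tests rotations that are multiples of the azimuth bin width.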
However, neither approach specifies a method for reliably, repeatably choosing
the 3D points around which the descriptors are based. In practice, a set of feature
points from one scan is chosen randomly and compared to all the descriptors from
the points in the other scan. While this approach works moderately well for small,
complete, uncluttered 3D models of single objects, it can lead to slow or poor-quality
matching for large, complex scenes.
As we mentioned earlier, LiDAR scanners are often augmented with RGB cameras
that can associate each 3D point with a color. Actually, the associated image is usually
higher resolution than the laser scan, so it’s more accurate to say that we know where
the image plane is in the scanner’s coordinate system, as illustrated in Figure 8.32.
This additional color and texture information allows us to leverage the techniques
described in Chapter 4 to create feature detectors and descriptors better suited to
large, complex scenes.
Figure 8.32. Frequently, a LiDAR scanner is augmented with an RGB camera image calibrated
to be in the same coordinate system.
One effective approach, as proposed by Smith et al. [460], is to detect DoG fea-
tures in the co-registered RGB images as described in Section 4.1.4. Next, each
detected feature location can be backprojected from the scanner’s perspective into
the scene, until the ray penetrates the scan mesh, as illustrated in Figure 8.33.
A square planar patch is constructed in 3D whose normal agrees with the nor-
mal at the backprojected point and whose orientation is defined with respect to
the dominant gradient of the image feature (Section 4.2.1). A 4 × 4 grid superim-
posed on this 3D patch is reprojected onto the image plane, and these non-square
bins are used to construct a SIFT descriptor (Section 4.2.3). The advantage of these
back-projected SIFT features is that they exploit both image and range information
in the feature detection and descriptor construction. For example, features that seem
appealing based on the image evidence alone can be ruled out if they straddle a depth
discontinuity in 3D.
Figure 8.33. Back-projected SIFT features for 3D data, as proposed by Smith et al. [460].
Smith et al. later observed that the full invariance of SIFT detection and descrip-
tion isn’t necessary for 3D data, since distance measurements produced by LiDAR
already have a physically meaningful scale. That is, even if we view the same object
from different perspectives in two scans, there is no ambiguity about the object’s
scale (unlike two images at different perspectives). This insight led to the develop-
ment of physical scale keypoints [459], which are computed at a predefined set of
physical scales in 3D. Unlike back-projected SIFT features, the keypoint detection and
description takes place directly on the 3D mesh, aided by the backprojected texture
from the co-registered images. An analogue of the LoG detection operator is applied
to the textured mesh, downweighting points whose normals disagree with the normal
of the point under consideration. A SIFT-like descriptor is computed on the tangent
plane to the detected feature, and descriptors are considered for matching only at
the same physical scale. The overall process eliminates many false matches and has
the additional benefit of allowing the correct detection and matching of features near
physical discontinuities.
Figure 8.34. The basic Iterative Closest Points (ICP) algorithm alternates between two steps to
align two 3D point sets P and Q. (a) For a fixed candidate transformation T, the closest point
q(p) ∈ Q to each transformed point T(p), p ∈ P, is determined. (b) A new rigid motion T is computed that
minimizes the sum of distances between the estimated correspondences.
In this section, we review the Iterative Closest Points (ICP) algorithm, the most
commonly used method for 3D scan registration. This fundamental approach to reg-
istration was discovered and described roughly simultaneously by several research
groups, including Besl and McKay [41], Chen and Medioni [89], and Zhang [573].
The basic idea is simple: given two unordered sets of 3D points P and Q to be registered
and an initial rigid motion T, we alternate between two steps, as illustrated in
Figure 8.34:
1. For the current T, find the closest point $q(p) \in Q$ to each transformed point $T(p)$, $p \in P$.
2. For these correspondences, compute the rigid motion T that minimizes the sum of distances
\[ E(T) = \sum_{p \in P} e(T(p), q(p)) \tag{8.15} \]
where e(p, q) is a suitable error function between pairs of 3D points. That is, we alternate
between fixing the transformation and estimating the correspondence, and vice
versa. The algorithm stops when E(T) falls below a user-specified threshold. Besl and
McKay proved that when e(p, q) in Equation (8.15) is the squared Euclidean distance
$\|p - q\|_2^2$, then the ICP algorithm converges monotonically to a local minimum of the
cost function E(T).
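The alternation can be sketched as a short loop. This is a minimal illustration with the squared Euclidean error, not a production implementation: the function names, the brute-force closest-point search, the use of the mean rather than the sum of squared distances as the stopping quantity, and the iteration cap `max_iters` are assumptions made for the example. The SVD-based helper `rigid_fit` is one standard way to solve the distance-minimization step in closed form.

```python
import numpy as np

def rigid_fit(P, Q):
    """Least-squares rigid motion (R, t) such that Q is approximately R P + t."""
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((Q - mu_q).T @ (P - mu_p))
    # Guard against reflections so that det(R) = +1.
    S = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    R = U @ S @ Vt
    return R, mu_q - R @ mu_p

def icp(P, Q, R=None, t=None, tol=1e-10, max_iters=50):
    """Align P (N x 3) to Q (M x 3), starting from an initial rigid motion (R, t)."""
    R = np.eye(3) if R is None else R
    t = np.zeros(3) if t is None else t
    err = np.inf
    for _ in range(max_iters):
        TP = P @ R.T + t
        # Step 1: closest point in Q to each transformed point (brute force).
        d2 = ((TP[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        # Step 2: rigid motion minimizing the squared distances for these pairs.
        R, t = rigid_fit(P, Q[nearest])
        err = np.mean(np.sum((P @ R.T + t - Q[nearest]) ** 2, axis=1))
        if err < tol:
            break
    return R, t, err
```

For real scans, the all-pairs distance matrix in Step 1 is too expensive; a spatial data structure such as a k-d tree is the usual replacement.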
We must address two additional issues. First, how can we obtain a good initial-
ization T before starting the ICP iterations? This is where the feature detection and
matching results from the previous section come in. Any three feature matches in
3D (e.g., obtained using spin images or shape contexts) define a rigid transformation
T . Furthermore, if co-registered RGB images are available, matching a single pair of
back-projected SIFT or physical scale keypoints between two different scans gives
an immediate estimate of the 3D rigid motion relating them, since each descriptor is
associated with a full coordinate frame.
The second issue is how to minimize the sum-of-distances function in
Equation (8.15). When e(p, q) in Equation (8.15) is the squared Euclidean distance,
we can apply a classic result by Umeyama [504]. Suppose the two ordered point sets
are denoted by $\{(p_i, q_i), i = 1, \ldots, N\}$. We define $\mu_p$ and $\mu_q$ to be the mean values of
the sets $\{p_i\}$ and $\{q_i\}$ respectively, and compute the 3 × 3 covariance matrix
\[ \Sigma = \frac{1}{N} \sum_{i=1}^{N} (q_i - \mu_q)(p_i - \mu_p)^\top \tag{8.16} \]
Let the singular value decomposition of $\Sigma$ be given by $U D V^\top$, where the entries of D
decrease along the diagonal. Then the rigid motion (R, t) that minimizes
\[ \frac{1}{N} \sum_{i=1}^{N} \| q_i - (R p_i + t) \|_2^2 \tag{8.17} \]
is given by
\[ R = U S V^\top \tag{8.18} \]
\[ t = \mu_q - R \mu_p \tag{8.19} \]
where $S = \mathrm{diag}(1, 1, \det(U)\det(V))$, which ensures that R is a proper rotation with determinant +1.
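Equations (8.16)–(8.19) translate almost line for line into code. The sketch below assumes S = diag(1, 1, det(U) det(V)), the usual guard that keeps R from being a reflection; the function name is hypothetical.

```python
import numpy as np

def rigid_motion_from_matches(P, Q):
    """Rigid motion (R, t) minimizing the mean of ||q_i - (R p_i + t)||^2
    for ordered matches given as the rows of P and Q (each N x 3)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    # Equation (8.16): 3 x 3 covariance of the centered point sets.
    Sigma = (Q - mu_q).T @ (P - mu_p) / len(P)
    # NumPy returns the singular values in decreasing order.
    U, D, Vt = np.linalg.svd(Sigma)
    # Assumed form of S: flip the last axis if U V^T would be a reflection.
    S = np.diag([1.0, 1.0, np.linalg.det(U) * np.linalg.det(Vt)])
    R = U @ S @ Vt           # Equation (8.18)
    t = mu_q - R @ mu_p      # Equation (8.19)
    return R, t
```

The same function serves for initialization: passing three non-collinear feature matches as the rows of P and Q yields the rigid motion they define.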
• Instead of using all the points from P in the closest-point and distance-
minimization steps, only use a subset of the points (e.g., chosen randomly
in space, at the locations of detected features, or so that the normal vectors’
angles are widely distributed).
• Choose the “closest” point q(p) not simply as the point that minimizes the
Euclidean distance, but as the closest point whose normal vector is within a
specified angle of the normal at T (p).
• Don’t allow points near the boundary of scans to participate in matching, to
prevent many points in one scan being matched to the same point in the other.
This is especially important when the scans represent partial views of a larger
scene.
• Instead of using the Euclidean distance in the distance-minimization step,
use the point-to-plane distance illustrated in Figure 8.35. The error function
in Equation (8.15) is
\[ e(T(p), q(p)) = \left( (T(p) - q(p))^\top \eta_{q(p)} \right)^2 \tag{8.20} \]
where $\eta_{q(p)}$ is the estimated unit normal vector at q(p). While this step has
been shown to provide much faster convergence in practice, it no longer permits
a simple closed-form solution to minimizing Equation (8.15), and the
convergence proof of Besl and McKay doesn’t hold.
Figure 8.35. The point-to-plane distance for ICP is the square of the length of the dotted line
for each pair of matching points. The gray line indicates a plane through each destination point
using its estimated normal.
• Instead of treating every pair of points equally in Equation (8.15), weight the
pairs differently. For example, Smith et al. [460] recommended weighting fea-
ture points in proportion to their quality of match and using a robust cost
function related to those discussed in Section 5.3.3.3 to downweight points
with large alignment errors.
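Two of the variants listed above, normal-compatible closest points and the point-to-plane error, can be sketched as small helpers that drop into a basic ICP loop. The function names and the default angular threshold are assumptions for illustration.

```python
import numpy as np

def closest_compatible(tp, tn, Q, Q_normals, max_angle_deg=45.0):
    """Index of the point in Q nearest to the transformed point tp among those
    whose unit normal is within max_angle_deg of tn; -1 if there is none."""
    ok = Q_normals @ tn >= np.cos(np.radians(max_angle_deg))
    if not ok.any():
        return -1
    d2 = np.sum((Q - tp) ** 2, axis=1)
    d2[~ok] = np.inf  # exclude points with incompatible normals
    return int(d2.argmin())

def point_to_plane_error(tp, q, q_normal):
    """Squared distance from tp to the plane through q with unit normal q_normal."""
    return float(np.dot(tp - q, q_normal) ** 2)
```

Note how the point-to-plane error ignores any offset parallel to the destination plane: a point at (1, 0, 2) measured against the plane Z = 0 through the origin has error 4, while its squared Euclidean distance to the origin is 5. This lets flat regions slide along each other during alignment.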
use of ICP for registering LiDAR data with a point cloud generated using multi-view
stereo on a video sequence.
Finally, we note that variations of ICP can handle the problem of registering a tex-
tured LiDAR scan to a camera image taken at a substantially different viewpoint. For
example, Yang et al. [561] proposed an algorithm that begins by applying 2D ICP to
the camera image and the scanner’s co-located RGB image, and then upgrades the
problem to a 2D-3D registration when the correspondences are no longer well mod-
eled by a projective transformation. We can think of this as a resectioning problem
(Section 6.3.1) in which the 2D-3D correspondences are iteratively discovered.
Figure 8.36. (a) Many redundant points and triangles exist where two registered 3D scans over-
lap. (b) Overlapping triangles from the edges of the black mesh are removed. (c) New points are
introduced at intersections between the black and gray meshes. Shaded parts of the black mesh
will be removed. (d) A new triangulation is formed.
Figure 8.37. (a) Signed distance functions to the triangle meshes (in red) for two range scans
taken from different perspectives. The figure represents a 2D slice through 3D space. Gray indicates
a distance near 0; black indicates a large negative distance and white a large positive
distance. The tan background indicates that the distance function isn’t computed because it’s
too far from the surface. (b) Corresponding weight functions. Black indicates zero weight while
white indicates a large weight. (c) The weighted sum of signed distance functions after fusion;
red indicates the VRIP isosurface.
have di (X) < 0, and points behind the mesh have di (X) > 0. The weight function is
roughly constant near the triangular mesh along scanner lines of sight, and falls off
quickly on either side of the mesh. It can also decrease as the angle between the
scanner line of sight and the surface normal grows, and increase with our confidence
in the measurement. The basic idea is that the weight expresses the neighborhood in which a
scanner data point plays a role in the fusion. Examples of these functions are illus-
trated in Figure 8.37a-b for one plane of voxels that intersects the range data. Since
the weights are only nonzero very close to the original 3D samples, we only need to
store the weights and distance functions at a relatively small fraction of voxels in the
volumetric grid; Curless and Levoy used a run-length-encoded volume to efficiently
store the necessary information.
We simply compute f (X) as a weighted sum of the component signed distance
functions:
\[ f(X) = \frac{\sum_{i=1}^{M} w_i(X)\, d_i(X)}{\sum_{i=1}^{M} w_i(X)} \tag{8.21} \]
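Equation (8.21) amounts to a per-voxel weighted average. The sketch below is a dense-array illustration only; the function name and the convention of marking unobserved voxels with NaN are assumptions, and it does not attempt the run-length-encoded storage used by Curless and Levoy.

```python
import numpy as np

def fuse_signed_distances(d, w, eps=1e-12):
    """Weighted average of M signed distance grids.
    d, w: arrays of shape (M, ...) holding d_i(X) and w_i(X) at each voxel.
    Voxels with (near-)zero total weight are returned as NaN (unobserved)."""
    d = np.asarray(d, dtype=float)
    w = np.asarray(w, dtype=float)
    w_sum = w.sum(axis=0)
    f = np.full(w_sum.shape, np.nan)
    seen = w_sum > eps
    f[seen] = (w * d).sum(axis=0)[seen] / w_sum[seen]
    return f
```

As a quick check along a single line of sight: if one scan places the surface at depth 0.50 with weight 1 and another places it at 0.54 with weight 3, then wherever both weights are nonzero the fused function crosses zero at (1 · 0.50 + 3 · 0.54)/4 = 0.53, so the more trusted scan pulls the isosurface toward its own estimate.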
Figure 8.38. 3D sample points (black dots) and normal constraints (gray dots) for constructing
an interpolating implicit function.
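\[ f(X) = \sum_{j=1}^{N} w_j \phi(r_j) + a^\top X + b \tag{8.22} \]
where $a \in \mathbb{R}^3$, $b \in \mathbb{R}$, and $r_j = \|X - X_j\|_2$. In 3D applications, we use the function
$\phi(r) = r$ or $\phi(r) = r^3$, both of which produce a smooth interpolation of the data. The
weights on the basis functions and the affine coefficients can be computed by solving
a linear system:
\[
\begin{bmatrix}
0 & \phi(r_{12}) & \cdots & \phi(r_{1N}) & X_1^\top & 1 \\
\phi(r_{21}) & 0 & \cdots & \phi(r_{2N}) & X_2^\top & 1 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
\phi(r_{N1}) & \phi(r_{N2}) & \cdots & 0 & X_N^\top & 1 \\
X_1 & X_2 & \cdots & X_N & 0 & 0 \\
1 & 1 & \cdots & 1 & 0 & 0
\end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \\ a \\ b \end{bmatrix}
=
\begin{bmatrix} f(X_1) \\ f(X_2) \\ \vdots \\ f(X_N) \\ 0 \\ 0 \end{bmatrix}
\tag{8.23}
\]
21 In practice, we may not need to provide two normal constraints for every range point, especially if
the normal estimate is not reliable at the point.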
21 In practice, we may not need to provide two normal constraints for every range point, especially if
the normal estimate is not reliable at the point.
Figure 8.39. The setup for Poisson surface reconstruction. 3D sample points (black dots) are
viewed as locations where the gradient is large and points inward.
where $r_{ij} = \|X_i - X_j\|_2$. Now we can compute f(X) at any 3D location we like, and
apply the same marching-cubes technique to obtain the isosurface. This approach
to 3D surface interpolation was proposed by Turk and O’Brien [502], though it can be
traced back to the thin-plate spline techniques of Bookstein [53] and earlier.
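For a small number of constraints, the linear system in Equation (8.23) can be formed and solved directly. The following sketch uses φ(r) = r³ by default; the function names `fit_rbf` and `eval_rbf` are hypothetical, and the dense solve is only practical for small N.

```python
import numpy as np

def fit_rbf(X, f, phi=lambda r: r ** 3):
    """Solve Equation (8.23) for the weights w (N,), a (3,), and scalar b,
    given constraint locations X (N x 3) and target values f (N,)."""
    N = len(X)
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    A = np.zeros((N + 4, N + 4))
    A[:N, :N] = phi(r)          # phi(r_ij), zero on the diagonal
    A[:N, N:N + 3] = X          # rows X_i^T
    A[:N, N + 3] = 1.0
    A[N:N + 3, :N] = X.T        # columns X_i
    A[N + 3, :N] = 1.0
    rhs = np.concatenate([f, np.zeros(4)])
    sol = np.linalg.solve(A, rhs)
    return sol[:N], sol[N:N + 3], sol[N + 3]

def eval_rbf(Xq, X, w, a, b, phi=lambda r: r ** 3):
    """Evaluate Equation (8.22) at query locations Xq (K x 3)."""
    r = np.linalg.norm(Xq[:, None, :] - X[None, :, :], axis=2)
    return phi(r) @ w + Xq @ a + b
```

A typical input stacks the range points, with target value 0, together with points offset a small distance along each estimated normal in both directions, with target values of opposite sign. Which side is taken as positive is a convention that the caller must fix; the code does not assume one.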
However, for scans with more than a few thousand data points, forming and solv-
ing the linear system in Equation (8.23) quickly becomes computationally intractable.
Carr et al. [82] showed how such techniques could be made feasible using fast multipole
methods, which use near-field and far-field approximations to compute the
radial basis functions efficiently. Such approaches also allow the specification of a
desired fitting accuracy, which is useful for merging multiple LiDAR scans that may
not overlap exactly after registration. They showed how merged LiDAR datasets con-
taining hundreds of thousands of points could be well approximated with a triangular
mesh in a matter of minutes.
On the other hand, radial basis function approaches may smooth over sharp fea-
tures in the data that we want to preserve, introduce pieces of surface far from the
original data points, and perform badly in the presence of outliers or poorly sampled
data. More recent approaches to surface reconstruction (e.g., [353, 245]) address these
problems.
Finally, we mention one of the most effective 3D data fusion techniques, Poisson
surface reconstruction, proposed by Kazhdan et al. [232]. Like the previous tech-
niques, we compute a function f (X) defined on R3 ; however, this function has a very
different interpretation, as sketched in Figure 8.39. We define f (X) = 0 for points
outside the surface to be reconstructed, and f (X) = 1 for points inside the surface to
be reconstructed. Therefore, the gradient of the function is identically zero, except
for points X on the surface, at which the gradient is very large (theoretically infinite).
The observed range data points {Xj , j = 1, . . . , n} are viewed as samples where the gra-
dient is known; that is, its norm is large and it points inward along the estimated
normal.
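The idea of recovering an inside/outside function from samples of its gradient is easiest to see in one dimension. The toy sketch below is not the algorithm of Kazhdan et al., which works on an octree in 3D; it only illustrates the principle. It finds the f whose finite-difference gradient best matches a given gradient field in the least-squares sense, and the resulting normal equations are a discrete Poisson equation. The function name, the grid layout, and pinning f to 0 at the first node (assumed to lie outside the surface) are assumptions for the example.

```python
import numpy as np

def poisson_1d(v, h):
    """Recover f at n grid nodes from gradient samples v on the n - 1 cell edges,
    with grid spacing h. Minimizes ||D f - v||^2 subject to f[0] = 0."""
    n = len(v) + 1
    # Forward-difference operator, shape (n - 1, n): row i is (f[i+1] - f[i]) / h.
    D = (np.eye(n, k=1) - np.eye(n))[:-1] / h
    A = D.T @ D   # discrete (negative) Laplacian
    b = D.T @ v   # discrete (negative) divergence of the gradient field
    # Pin f[0] = 0 to remove the unknown additive constant.
    f = np.zeros(n)
    f[1:] = np.linalg.solve(A[1:, 1:], b[1:])
    return f
```

For example, on 11 nodes with spacing 0.1, a gradient of +10 on the edge entering an object and −10 on the edge leaving it (both pointing inward) yields f = 1 at the nodes inside the object and f = 0 elsewhere.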
This problem, in which we have samples of the gradient of a function at
several points and want to reconstruct the function everywhere, naturally lends
itself to Poisson reconstruction techniques, as we described for image editing in
Section 3.2. The approach proposed by Kazhdan et al. has several advantages: it’s
relatively robust to sparse or noisy gradient samples, it generates surfaces that stick
closely to the original data without requiring normal constraints, and it allows the
use of a multiresolution (octree) data structure to represent the result instead of
8.5. Industry Perspectives 341
Gentle Giant Studios, in Burbank, California, provides large object, vehicle, and set
scanning for nearly every blockbuster Hollywood movie, and has scanned the faces
and bodies of thousands of actors and performers. Steve Chapman, Gentle Giant’s
vice president of technology, discusses the role of LiDAR, structured light, and multi-
view stereo in visual effects.
RJR: How has the use of LiDAR for large-scale scanning of movie sets changed over the
years?
RJR: What kinds of things will a movie production do with LiDAR data once you deliver
it to them?
Chapman: Early on, few movie crews knew what to do with the data. Now, everyone
from set designers, pre-visualization departments, set extension painters, camera
trackers, particle effects artists, and character placement animators is clamoring for
22 http://www.cs.jhu.edu/~misha/Code/PoissonRecon/
Figure 8.40. (a) An entire small town outdoor set created for Thor containing many buildings
was LiDAR-scanned from multiple locations. (b) A structured-light scan of a prop helmet for Thor.
(c) A phase-based LiDAR scan of an armored military vehicle for Fast Five. (d) A laser-stripe scan
of actor Darren Kendrick in the Sentry Frost Giant makeup and costume for Thor. Thor images
appear courtesy of Marvel Studios, TM & ©2011 Marvel and Subs. www.marvel.com. Fast Five
courtesy of Universal Studios Licensing LLC.
the data. For example, in the movie 2012, Sony constructed a five-story door of a huge
ship, but the entire remaining vessel had to be 3D modeled and composited to match
film of the actors standing on this door set. The visual effects designers at ImageWorks
knew that the LiDAR data we took of this model would precisely match the camera
footage. They used the LiDAR geometry for several purposes: to help with camera
tracking, to extend the set to cover the whole frame, and to add animated water and
atmospheric elements that realistically interacted with the real-world geometry of
the film set.
For the movie Zookeeper, we scanned an entire real zoo in Boston so that
ImageWorks animators could plant the feet of photorealistic digital animals that
interacted with the filmed surroundings. A giraffe might go under and around a
tree because the animators have the model of the real tree in 3D and the data of
the ground that it needs to step on — it interacts with the actors and environment
seamlessly.
Visual effects companies might also use LiDAR data to make a complete textured
digital duplicate of the same set. For example, in The Matrix Revolutions, there’s
an underground “rave” scene featuring football-field-sized gate mechanisms. They
only built and LiDAR scanned one gate, and used the resulting data as a template
to model and texture all the other gates around the digital set. This required the
processing of the raw LiDAR point cloud into a clean polygonal model that can be
easily manipulated in 3D animation software such as Maya.
RJR: How about laser stripe systems for scanning bodies and props?
Chapman: It’s interesting that the concept behind laser stripe scanning was used
in Renaissance times. An artist would make a maquette of a sculpture that was to
be carved from a huge piece of marble. Then they would lower that maquette into
milk or ink in order to study the waterline and analyze how the contour appeared,
before cutting the unforgiving stone. What’s happening now is that the waterline is
a laser stripe, and digital cameras and computers take the place of the artist’s eyes
and memory. Laser stripe scanners offer instant feedback so that a visual effects
supervisor leaves the set confident that the 3D model is complete and accurate,
and that the animation supervisor will have the expression and muscle movement
reference he or she will need to recreate an actor’s performance.
Laser-stripe technology is the oldest and simplest approach, and most of the
equipment we’ve used for decades has been based on it. The body scanner we use
(Figure 8.9a) uses four calibrated light stripes and cameras that move down the sides
of a person in unison; it takes about twenty seconds to do a complete pass. This
usually gets about ninety percent of the geometry, but there’s always going to be
something occluding the cameras that we must “get hands on” in a software pack-
age like Zbrush and touch up. With any scanning process, there seems to be a point
where you can easily capture a majority of the object or environment in a reason-
able amount of time. Then it becomes a tremendous effort to chase down the rest of
the shape that was hidden from the scanner behind occlusions. It takes some prac-
tice to know when to make the effort to move the scanner to get more viewpoints,
and when the time might be better spent moving on to the next set piece on the long
checklist of things to scan. The scanning service vendor needs to have both the equip-
ment to capture data as well as the talent to process whatever is finally obtained into
something suitable for use by the next person in the production pipeline. This might
mean anything from simple hole filling to recreating a fabric texture and underly-
ing surface curvature to make a seamless model. Since Gentle Giant started out as
a traditional clay sculpting studio, we have artists who have made the move to 3D
and can easily handle any digital cosmetic surgery needed when scanning organic
subjects.
A decade ago, it wasn’t commonplace to scan actors, but the production compa-
nies that make the movies have now realized how important this is to their vendors.
They might have a scene where the film production company has not yet hired a
visual effects vendor, even though filming is already occurring. The visual effects
supervisors know that the vendor is likely to need 3D reference several months later,
so they’ll ask us to go ahead and scan anything and everything on set. Maybe they’ll
need it and maybe they won’t, but it’s neither feasible to rebuild a set nor get an actor
who’s already filming another movie to come back and get in makeup and costume
for a texture shoot or 3D scan.
Once acquired, the scan data is often used by toy companies for action figure
design and by video game companies for movie tie-in games. Actors often voice
concern about the data being used to later animate them and make them “sell vacuum
cleaners” after they die, but they probably aren’t aware of the inevitability of purely
image-based 3D reconstruction algorithms that will make the whole issue moot. It’s
conceivable that at some point in the near future, 3D information can be extracted
from even the earliest motion picture footage.
Chapman: Structured light systems can also work well for head and body scanning,
but they’re riskier to use in practice compared to laser-stripe systems due to the
complexity of the underlying image processing. Often, an actor is only available for
a few short minutes and may even be called back to the set in the middle of a 3D scan.
As one A-list actor put it after a few seconds of delay in starting the body scanning
on the set of a blockbuster, “Let’s go, gentlemen!” As a result, many visual effects
supervisors will lean toward using laser-stripe systems for actor scanning simply
because they know they’re much less prone to failure compared to structured light
devices.
Trying to project structured light onto larger objects is extremely difficult since you
can’t get a projector bright enough and you can’t get the camera far enough away to
make the algorithms work. There are also limitations you run into with the vibration
of the mounting systems and the resolution of the cameras. We have a structured
light system that could theoretically scan a jumbo jet, but realistically that’s a job
much more suited to LiDAR.
We still use structured light for scanning our smallest subjects because it excels
at capturing fine details that laser “blooming” obliterates. For example, we used
structured light to capture a small pirate coin at sufficient detail for it to be projected
three stories high in the opening credits of a film. Real-time performance capture
scanning once required structured light solutions, but stereo matching has evolved
to replace that need, with the added benefit of not blinding the actor with a projector
pattern.
One common issue with structured light is that if you’re scanning a human subject
and they are moving, as inevitably happens, you see a sort of a “cheese grater” effect,
where some parts of the surface will be further out or in than other parts, and you see
bands going across the data. Systems that tie the projector timing into the camera’s
frame acquisition timing help alleviate, but not entirely eliminate, this artifact.
RJR: Have multi-view stereo techniques made an impact in 3D data acquisition for
visual effects yet?
Chapman: We’ve been exploring the photogrammetric solution for quite some time
and think that it’s trending toward the point that it’s likely going to replace most of the
other processes very soon. However, in movies there are a lot of black sets, costumes,
and shiny things for aesthetic reasons — Batman’s outfit or TRON ’s sets, for example.
If you try to use multi-view stereo to capture that you might end up with just the
edges of objects. We’ve attempted to aid the algorithms by projecting a pattern onto
the object so it’s sort of a mix of structured light and multi-view stereo. We often have
to quickly improvise with materials on set, like powdering a sarcophagus in order to
read the reflective gold, taping lines onto a shiny helicopter, or even kicking up some
dirt onto a black waxed pickup truck.
If we need to do something even grander than we could handle with time-of-flight
LiDAR, we’d likely use a photo modeling technique, but today the results are often
simply not yet good enough to deliver as-is to a visual effects company. We have to
do a lot of work to make it presentable. I used one such tool — PhotoSynth — on
a project where we needed to model the Statue of Liberty. We didn’t have the time
or money to do the job in person, and even if we did, getting permission to scan it
would have been very difficult for security reasons. We used PhotoSynth to get the
essential proportions of the statue, and discovered roughly where and how big things
needed to be, but we still needed a sculptor to go in and recreate the accurate likeness
underneath.
On a movie set, we definitely take as much video footage and supplemental pho-
tography as we can and catalog it for reference, since we never know when we’ll scan
something and find that somebody moved it the next day or even destroyed it. We
have terabytes of photos that were once intended solely for reference but that now
might be reprocessed through multi-view stereo software to derive new information.
Chapman: When we started doing LiDAR in the 1990s, we needed to place registration
spheres all over the set, similar to little magnetic pool balls. We would find the centers
of the spheres in the data and use them to do a three-point alignment, and then do a
best-fit registration automatically from that. It took a lot of time to climb around the
set and place these targets, which resulted in being able to take fewer scans.
Since then, commercial software has evolved so that we can quickly pick three
points in one scan, pick roughly the same points in another scan, and the software
will automatically register them. Currently we use custom software to greatly reduce
the data for registration purposes. We usually scan one pass at the farthest possible
distance from the scene to act as a key alignment pass, to which all of the other scan
passes will be aligned. We often devote a single LiDAR scanner solely to perform this
“master” scan while we use other scanners to do the remaining multiple viewpoints.
After registering all the scans, the data is divided into sub-objects such as lamp-
posts, trees, and cars, which are polygonized and reduced in point density to fit the
standard requirements of visual effects data processing pipelines. Some skill and
practice is needed to know what data will be needed to accurately recreate each
object while discarding redundant information.
8.6 Notes and Extensions

In the days before laser range finding, the main way to acquire highly accurate 3D data
was by means of a coordinate measuring machine (CMM), a bulky system in which
a user pressed a pen-like probe against the surface of an object to record (X, Y, Z)
locations.
Outside of visual effects, one of the most striking applications of 3D data acqui-
sition was Levoy et al.’s Digital Michelangelo Project [275]. Piecewise scans of
ten Michelangelo statues were painstakingly acquired using a custom laser stripe
scanner and registered into highly detailed and textured models for use in art
history and preservation. This project highlighted the many practical challenges
of scanning priceless objects on a tight timeline. LiDAR technology is frequently
used for cultural heritage applications in architecture (e.g., [13]) and archaeology
(e.g., http://cyark.org/). In construction applications, LiDAR is important for quality
assurance that an as-built building conforms to an original blueprint [207].
Finally, many of the autonomous vehicles in the recent DARPA Grand Challenges
(http://archive.darpa.mil/grandchallenge/) used LiDAR for real-time terrain
mapping.
Since the data in laser-stripe scanning is usually acquired as a temporal sequence
of stripes, keeping track of this sequence can help with filling in missing data or
fixing poor returns. That is, a human in the loop can fix or recreate a bad stripe by
interpolating the 3D contours acquired just before and after it. This approach was
taken for human body scanning in the Star Wars prequels and the early Harry Potter
movies.
An early classic paper by Bouguet and Perona [56] described a clever structured-
light-inspired system in which images of the shadow of a pencil moving across objects
on a desk acted as the “stripe” for producing 3D measurements. They obtained sur-
prisingly good, sub-millimeter-accuracy results for small objects with this simple
technique. Fisher et al. [143] presented a similar idea using a special striped wand
that also had to be visible in the camera image. Boyer and Kak [57] were among the
first researchers to propose a one-shot, color-stripe-based structured light technique
using an empirically derived pattern of red, green, blue, and white stripes. They used
a region-growing approach to expand the list of identified stripes from a set of reliable
seeds.
Salvi et al. [419, 418] gave excellent overviews of the state of the art in structured
light pattern design. They discussed several techniques not mentioned here, in par-
ticular the class of methods based on extending the idea of locally unique color
subsequences to two-dimensional patterns. For example, the scene can be projected
with a pattern of colored dots, such that each 3 × 3 neighborhood of dots does not
repeat anywhere in the pattern. Such patterns are called pseudorandom or M-arrays,
and a good example of their application was described by Morano et al. [334]. One
map can be estimated for each image by stereo matching based on reprojecting via
the model, similar to Figure 8.25. This approach can be viewed as an early multi-view
stereo algorithm in which the 3D points are constrained to lie on geometric primitives
interactively created by the user.
Two notable early multi-view stereo algorithms were proposed by Okutomi and
Kanade [354] and Collins [100]. Another approach to multi-view stereo not discussed
here is photometric stereo, in which the 3D shape of a shiny object (e.g., a ceramic
statue) is estimated by acquiring multiple images of it under different illumination
conditions (e.g., [196, 521]). The changing intensity patterns provide clues about
the normal vector at each surface point. Nehab et al. [347] observed that normals
estimated from triangulation-based scanners could be improved by combining the
data with the output of photometric stereo techniques.
Two exciting avenues of research have recently been enabled by the confluence
of commercial 3D scanning technology, ample processing power and storage, and
massive internet photography databases. In one direction, the thousands of images
resulting from a keyword search on Flickr or Google Images can be viewed as the
input to a large multi-view stereo problem. Snavely et al. [464] described how to cali-
brate the cameras underlying such a collection based on correspondence estimation
and structure from motion, and how to then apply multi-view stereo techniques to
obtain a dense 3D reconstruction of the scene. In contrast to conventional multi-view
stereo techniques, this type of approach simply discards entire images that are of low
quality or for which the camera calibration is uncertain; indeed, a key component of
these large-scale algorithms is the careful choice of image sets that are likely to be
productive.
Another exciting direction is city-scale scanning, using a vehicle equipped with
some combination of cameras, laser rangefinders, and GPS/inertial navigation units
to help with its localization. Pollefeys et al. [367] described an impressive system that
generates textured 3D mesh reconstructions in real time using a vehicle mounted with
eight cameras. They used an incremental depth map fusion algorithm to process tens
of thousands of video frames into a single, consistent, and detailed 3D model. Alter-
nately, Früh and Zakhor [158] designed a vehicle equipped with a camera and two
laser rangefinders. One rangefinder acquired vertical 3D strips of building facades,
which were registered using images from the camera and horizontal 3D data from the
second rangefinder. The 3D datasets estimated from the vehicle were then refined
(e.g., to remove drift) based on registration to an aerial map of the scanned area.
This system registered thousands of images and 3D strips to produce an accurate
textured model of streets around the Berkeley campus. Subsequent work addressed
the problem of inpainting façade geometry and texture in LiDAR “shadows” caused
by foreground occlusions [156]. For large holes, a patch-based inpainting approach
inspired by the techniques in Section 3.4.2 might be more appropriate (e.g., [124]).
Finally, we mention that once 3D data has been acquired by any of the means
discussed in this chapter, several image understanding techniques can be applied
to it.23 For example, Verma et al. [513] discussed how to detect and model buildings
from aerial LiDAR data, while Dick et al. [120] addressed how to fit architectural
primitives (walls, roofs, columns) to image sets taken at ground level. Golovinskiy
et al. [172] trained a system to recognize objects like lampposts, traffic lights, and
fire hydrants in large-scale LiDAR datasets, while Vasile and Marino [511] addressed
the detection of military ground vehicles from foliage-penetrating aerial LiDAR. Kim
et al. [238] discussed how salient regions could be automatically detected in
co-registered camera and laser scans of an outdoor scene. Huber et al. [208] proposed
an algorithm for part-based 3D object classification (e.g., types of vehicles) based on
spin images. More generally, Chen and Stamos [87] discussed how to segment range
images of urban scenes into planar and smooth pieces, which can be of later use in
registration and object detection.

23 “Image understanding” is used in a broad sense here; automatic analysis and understanding of 3D
data typically falls under the umbrella of computer vision, even if there weren’t any conventional
images actually involved in the data collection.
8.7 Homework Problems

8.1 A LiDAR scanner reports that a point at azimuth 60° and elevation 20° is
located 100 m away. Convert this measurement into an (X, Y, Z) Cartesian
coordinate system in which the scanner is located at (0, 0, 0).
8.2 Compute the time resolution required, in picoseconds, for a pulse-based
LiDAR’s receiver electronics if we want the system to have ±2 mm accuracy.
8.3 Compute the distance to a scene point for which 500 nanoseconds was
recorded for the time-of-flight of a LiDAR pulse.
8.4 Prove the relationship between the phase shift ψ and the time of flight t in
Equation (8.2).
8.5 Show that the restriction on the range ambiguity in a phase-based LiDAR
scanner, 0 < ψ < 2π, imposes a maximum range of πc/ω, where c is the
speed of light and ω is the frequency of the modulating sinusoid in radians per second.
8.6 A phase-based LiDAR with carrier frequency $1.3 \times 10^7$ radians/sec is used to
scan a scene. Compute the distance to a scene point for which a π/3 radian
phase shift was recorded.
8.7 Some flash LiDAR systems estimate the phase of an amplitude-modulated
signal using four samples. Suppose the transmitted signal is f (t) = cos(ωt)
and the received signal is g (t) = A cos(ωt + ψ) + B, where A is the attenuated
amplitude of the signal and B is a constant offset. Show that the phase shift
ψ can be recovered as
$$\psi = \arctan\left(\frac{g_3 - g_1}{g_0 - g_2}\right) \tag{8.24}$$
where $g_i = g\left(\frac{i\pi}{2\omega}\right)$.
8.8 We can interpret the triangulation process for a stripe-based structured
light sensor as the intersection of a 3D line (corresponding to the ray from
the camera center through the observed image coordinates (x, y) on the
stripe) with a 3D plane aX + bY + cZ + d = 0 (corresponding to the light
plane from the laser). Determine a closed-form formula for this line-plane
intersection as the solution of a 3 × 3 linear system. (Hint: write the line as
That is, the first $2^{k-1}$ codewords are the same as $G_{k-1}$ with a 0 prefix, and
the second $2^{k-1}$ codewords are the codewords of $G_{k-1}$ in reverse order with
a 1 prefix.
a) Construct $G_4$.
b) Prove that each of the $2^k$ possible binary codewords appears exactly
once in $G_k$.
c) Prove that each pair of adjacent entries of $G_k$ (i.e., $G_k(i)$ and $G_k(i+1)$)
differs in exactly one bit (including the cyclic pair $G_k(2^k)$ and $G_k(1)$).
8.10 How many patterns would be needed to resolve 600 unique vertical stripe
indices using patterns of red, green, blue, and white stripes?
8.11 In Figure 8.19 we assumed the on/off decision was based on the grayscale
intensity of the two images at a given pixel. Generalize the decision criterion
in the case where a color and its RGB complement are projected onto a
colorful scene surface.
8.12 Construct a stripe boundary code similar to Figure 8.20 of two patterns
containing thirteen stripes each, such that (a) each of the twelve pairs of
on-off transitions occurs exactly once, (b) no stripe is continuously on or
off for more than two time units, and (c) at least one stripe changes at every
time step.
8.13 Determine a de Bruijn sequence of order 3 over an alphabet of three
symbols.
8.14 Verify the three-image phase recovery Equation (8.8).
8.15 This problem is meant to suggest some of the issues that can occur when a
scene is changing as it is being scanned. Consider the scenario illustrated in
Figure 8.41. A fixed-stripe laser scanner is mounted on a vehicle, pointed at
a right angle to the direction of motion. Suppose the scanner vehicle moves
forward at a constant rate of ten meters per second, and the laser acquires a
vertical stripe of range data every 0.25 seconds. We assume that the scanner
knows where to correctly place the range samples it acquires in 3D space
(e.g., using an inertial measurement unit).
a) The scanner vehicle passes a 3m long pickup truck with the profile
sketched in Figure 8.41 traveling at nine meters per second. How many
stripes from the scanner will hit the truck, and where will the resulting
range samples be in 3D? What will be the apparent length and direction
of the truck?
b) What if the scanner vehicle is passed by a truck traveling at twelve
meters per second?
c) What if both the scanner vehicle and the truck travel at ten meters per
second?
Note that these phenomena can be viewed as a type of spatial aliasing.
[Figure 8.41 diagram; labels: scanner vehicle (10 mps), truck (3 m).]
Figure 8.41. A laser stripe scanner mounted on a vehicle (white dot) moves forward at 10 mps
while the truck moves forward at a different speed.
8.16 Following Figure 8.25, sketch examples in which a patch that projects to a
square region in one image
a) projects to a skinny region in another image
b) projects to a wide region in another image
c) is partially occluded in another image by another piece of the scene.
8.17 Show that the normalized cross-correlation is unchanged if the values of
one vector in Equation (8.10) are subjected to an affine transformation
$\hat{u}_i = a u_i + b$, $i = 1, \ldots, n$.
8.18 Provide a sketch to show why the image pairs (i, j) and (j, i) produce different
contributions to the cost function in Equation (8.11).
8.19 Mathematically formalize these two descriptions of how to compute a unit
normal at a 3D point p of a triangular mesh:
a) Compute the normal of a plane fit to all the points within a radius r of
p using principal component analysis.
b) Compute a weighted average of the normals to all the triangles that
have p as a vertex, where the weight for each triangle is proportional to
its area.
8.20 If the radii of a spin image descriptor are given by the increasing sequence
$\{r_i\}$ and the levels of the normal bins are given by the increasing sequence
$\{z_j\}$, then compute the volume of the (i, j)th bin.
8.21 Determine the 3D transformation defined by 3 point matches in 3D given
by $\{(X_i, Y_i, Z_i), (X'_i, Y'_i, Z'_i),\ i = 1, \ldots, 3\}$. Why are 2 matches insufficient?
8.22 Reformulate the VRIP fusion Equation (8.21) so that it can be computed
incrementally by merging the component scans one at a time. Does the
incremental result depend on the order of merging?
A Optimization Algorithms for Computer Vision

$$C(L) = \sum_{i=1}^{M} E^{i}_{\text{data}}(L(i)) + \sum_{i=1}^{M-1} E^{i,i+1}_{\text{smoothness}}(L(i), L(i+1)) \tag{A.1}$$

where $E^{i}_{\text{data}}(k)$ represents the cost of assigning label k to node i (the data term) and
$E^{i,i+1}_{\text{smoothness}}(k, l)$ represents the cost of assigning labels k and l to adjacent nodes i
and i + 1 (the smoothness term). The data term usually involves the intensities of
image pixels (for example, in seam carving, it’s related to the gradient at the pixel
corresponding to the label). The smoothness term reflects constraints or assumptions
about how similar adjacent labels should be. For example, in seam carving, a vertical
seam must separate the image into left and right parts; thus, the label at node i is
constrained to be within the range $\{L_{i-1} - 1, L_{i-1}, L_{i-1} + 1\}$. Similarly, in the stereo
correspondence problem, we usually impose the monotonicity constraint that $L_i > L_{i-1}$,
and may further weight the allowable disparities at each pixel, for example
by assigning higher costs to larger disparities. We can think of a labeling as a path
through a graph of M × N vertices, as illustrated in Figure A.1.
To find the minimum-cost labeling, we apply a recursive algorithm, building tables
of incrementally optimal costs and corresponding minimizers. That is, we fill in the
entries of two M × N matrices. Each entry of the first matrix, S(i, k), is defined as
the minimum cost of the path that begins in row 1 and ends at vertex (i, k) of the
graph. Each entry of the second matrix, R(i, k), is defined as the index of the node in
row i − 1 of the graph that resulted in the cost S(i, k) (which we can think of as the
“predecessor” of that node).
Formally, we apply the following algorithm:
1. Initialize $S(1, k) = E^{1}_{\text{data}}(k)$ for $k = 1, \ldots, N$. That is, the first row is initialized
That is, S(i, k) is the lowest cost that can be achieved considering the allowable
predecessors in the previous row of the matrix, and R(i, k) is the index of the
predecessor in row i − 1 that achieves this cost. The notion of allowability
depends on the application, as illustrated in Figure A.1.
4. The matrix has thus been filled in from the first row to the last. In the stereo
correspondence case (or more generally when we require the path to end at
the corner of the matrix), we fix $L_M = N$. In the seam carving case (or more
generally when the path can end at any point in the last row of the image), we
fix $L_M = N^*$, where
$$N^* = \arg\min_{k} S(M, k) \tag{A.3}$$
5. We finally extract the minimal cost path by backtracking from row M . That is,
for i = M − 1 : −1 : 1, we compute
[Figure A.1: two panels, (a) and (b), each plotting nodes 1 to M (vertical axis) against labels 1 to N (horizontal axis).]
Figure A.1. Paths for dynamic programming problems. (a) A path for a seam carving problem
must connect the top and bottom edges; adjacent labels can differ by at most 1. (b) A path for
a stereo correspondence problem must have non-decreasing labels and connect the lower left
corner to the upper right corner. The smaller circled pixels in each case indicate the allowable
predecessors of the large circled pixel (each of which may have a different weight specified by
the smoothness term).
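The table-filling and backtracking steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `min_cost_labeling`, the zero-based indexing, and the convention that the smoothness callback returns `math.inf` for a disallowed predecessor are our own choices, and the backtracking rule (follow R upward from the last row) is inferred from the definition of R(i, k).

```python
import math

def min_cost_labeling(data_cost, smooth_cost, end_label=None):
    """Minimize a chain energy of data terms plus pairwise smoothness terms.

    data_cost[i][k]      : cost of giving node i label k (M rows, N labels)
    smooth_cost(i, k, l) : cost of label k at node i and label l at node i + 1;
                           math.inf marks a disallowed transition (assumed convention)
    end_label            : label forced on the last node, or None to take the
                           cheapest entry of the last row
    """
    M, N = len(data_cost), len(data_cost[0])
    S = [[math.inf] * N for _ in range(M)]  # S[i][k]: best cost of a path ending at (i, k)
    R = [[-1] * N for _ in range(M)]        # R[i][k]: predecessor label in row i - 1
    S[0] = list(data_cost[0])
    for i in range(1, M):
        for k in range(N):
            for l in range(N):
                c = S[i - 1][l] + smooth_cost(i - 1, l, k)
                if c < S[i][k]:
                    S[i][k], R[i][k] = c, l
            S[i][k] += data_cost[i][k]
    if end_label is None:
        end_label = min(range(N), key=lambda k: S[M - 1][k])
    # Backtrack from the last row to the first using the stored predecessors.
    labels = [end_label]
    for i in range(M - 1, 0, -1):
        labels.append(R[i][labels[-1]])
    labels.reverse()
    return labels, S[M - 1][end_label]

def seam_smoothness(i, k, l):
    # Seam carving: adjacent labels may differ by at most 1, at no extra cost.
    return 0.0 if abs(k - l) <= 1 else math.inf

energy = [[3, 1, 4],
          [2, 9, 1],
          [5, 1, 6]]
seam, cost = min_cost_labeling(energy, seam_smoothness)
print(seam, cost)
```

For this toy 3 × 3 energy, the seam visits columns 1, 2, 1 (zero-based) with total cost 3. A stereo-style problem would instead pass a smoothness callback that forbids decreasing labels and set `end_label` to the last label.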
A.2 Belief Propagation
where V is a set of nodes and E is a set of undirected edges along which we want
to enforce smoothness. When these edges form a one-dimensional chain, we can
find the labeling {L(i), i ∈ V} that minimizes Equation (A.5) in polynomial time using
dynamic programming. However, when the edge set E contains cycles, dynamic
programming no longer applies, and there is no higher-dimensional analogy.
In particular, we frequently want to minimize Equation (A.5) when V is the set of
pixels in an image, and E is the set of all adjacent pixels (for example, 4-neighbors).
The resulting graph is planar and contains a large number of cycles. Unfortunately,
there is no efficient algorithm that provably finds the minimal labeling in this sit-
uation. However, the algorithm called loopy belief propagation has found great
practical success in the computer vision community for approximately minimizing
Equation (A.5) despite its lack of formal guarantees on convergence or exact opti-
mality. For example, we discussed loopy belief propagation’s application to matting
problems in Section 2.5 and to stereo correspondence in Section 5.5.
Minimizing a function like Equation (A.5) often arises from a maximum a posteri-
ori (MAP) estimation problem on a Markov Random Field, in which we want to find
the labeling that maximizes the probability density function1 given by
$$p(L) = \frac{1}{Z} \prod_{i \in \mathcal{V}} \phi_i(L(i)) \prod_{(i,j) \in \mathcal{E}} \psi_{ij}(L(i), L(j)) \tag{A.6}$$
where φi (k) is called the evidence potential function, ψij (k, l) is called the compati-
bility potential function, and Z is a normalization constant so the probability density
function sums to 1. Comparing Equation (A.5) to Equation (A.6), we can see that the
data/smoothness terms and evidence/compatibility potential functions can easily be
related by
$$\begin{aligned}
E^{i}_{\text{data}}(k) &= -\log \phi_i(k) \\
E^{i,j}_{\text{smoothness}}(k, l) &= -\log \psi_{ij}(k, l)
\end{aligned} \tag{A.7}$$
1 Technically, this is a probability mass function since the label set is discrete.
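To make the correspondence explicit, we can take the negative logarithm of Equation (A.6) and substitute Equation (A.7). The short derivation below assumes that the energy in Equation (A.5) has the usual form of a sum of data terms over the nodes plus a sum of smoothness terms over the edges:

$$\begin{aligned}
-\log p(L) &= \log Z - \sum_{i \in \mathcal{V}} \log \phi_i(L(i)) - \sum_{(i,j) \in \mathcal{E}} \log \psi_{ij}(L(i), L(j)) \\
&= \log Z + \sum_{i \in \mathcal{V}} E^{i}_{\text{data}}(L(i)) + \sum_{(i,j) \in \mathcal{E}} E^{i,j}_{\text{smoothness}}(L(i), L(j))
\end{aligned}$$

Since $\log Z$ does not depend on the labeling, the labeling that maximizes $p(L)$ is the same one that minimizes the energy.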
Figure A.2. One iteration of belief propagation. Node i collects incoming messages
$m_{ji}$ from its neighbors {j | (i, j) ∈ E}, which are used to update its belief $b_i$
about its label.
In loopy belief propagation, each node maintains an evolving belief about its
labeling — that is, a probability distribution function over the possible labels,
denoted at node i as $\{b_i(k),\ k = 1, \ldots, N\}$. The beliefs are iteratively updated by means
of messages passed along edges, denoted $\{m_{ji},\ (i, j) \in \mathcal{E}\}$, that convey neighboring
nodes’ current opinions about the belief at node i. The idea is sketched in Figure A.2.
The beliefs and messages are initialized as uniform distributions and iteratively
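updated according to the following rules:
$$b_i(L(i)) \leftarrow \frac{1}{Z_B}\, \phi_i(L(i)) \prod_{j \mid (i,j) \in \mathcal{E}} m_{ji}(L(i)) \tag{A.8}$$
$$m_{ij}(L(j)) \leftarrow \frac{1}{Z_M} \max_{L(i)}\, \phi_i(L(i))\, \psi_{ij}(L(i), L(j)) \prod_{h \mid (i,h) \in \mathcal{E},\, h \neq j} m_{hi}(L(i)) \tag{A.9}$$
$$m_{ij}(L(j)) \leftarrow \frac{1}{Z_M} \sum_{L(i)=1}^{N} \phi_i(L(i))\, \psi_{ij}(L(i), L(j)) \prod_{h \mid (i,h) \in \mathcal{E},\, h \neq j} m_{hi}(L(i)) \tag{A.10}$$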
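The belief update of Equation (A.8) and the max-product message update of Equation (A.9) can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `max_product_bp`, the synchronous update schedule, the fixed iteration count, and the requirement that all potentials are strictly positive are not taken from the text, and a practical implementation would also test for convergence.

```python
def max_product_bp(phi, psi, edges, n_iters=20):
    """Loopy max-product belief propagation on an undirected graph.

    phi[i][k]       : evidence potential of label k at node i (assumed > 0)
    psi(i, j, k, l) : compatibility of label k at node i with label l at node j
    edges           : list of undirected (i, j) pairs
    """
    n_nodes, n_labels = len(phi), len(phi[0])
    nbrs = {i: set() for i in range(n_nodes)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    # msgs[(i, j)][l]: message from node i to node j about label l at node j.
    msgs = {(i, j): [1.0 / n_labels] * n_labels for i in nbrs for j in nbrs[i]}
    for _ in range(n_iters):
        new_msgs = {}
        for (i, j) in msgs:
            out = []
            for l in range(n_labels):
                best = 0.0
                for k in range(n_labels):
                    v = phi[i][k] * psi(i, j, k, l)
                    for h in nbrs[i]:
                        if h != j:              # exclude the recipient, as in (A.9)
                            v *= msgs[(h, i)][k]
                    best = max(best, v)
                out.append(best)
            z = sum(out)                        # normalization constant Z_M
            new_msgs[(i, j)] = [v / z for v in out]
        msgs = new_msgs
    beliefs = []
    for i in range(n_nodes):
        b = list(phi[i])
        for j in nbrs[i]:
            b = [bk * msgs[(j, i)][k] for k, bk in enumerate(b)]
        z = sum(b)                              # normalization constant Z_B
        beliefs.append([bk / z for bk in b])
    return beliefs

# Four nodes on a cycle, two labels, and a Potts-like compatibility function.
phi = [[0.9, 0.1], [0.6, 0.4], [0.4, 0.6], [0.6, 0.4]]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
potts = lambda i, j, k, l: 1.0 if k == l else 0.3
beliefs = max_product_bp(phi, potts, edges)
labels = [max(range(2), key=lambda k: b[k]) for b in beliefs]
print(labels)
```

In this toy example, node 2 weakly prefers label 1 on its own evidence, but the messages from its neighbors pull its belief toward label 0, so every node ends up with label 0. Replacing the inner `max` with a sum over k gives the sum-product rule of Equation (A.10).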
A.3 Graph Cuts and α-Expansion

The main alternative to loopy belief propagation for minimizing energies of the form
of Equation (A.5) is the use of graph cuts. We introduced graph cuts in Section 2.8
in the context of hard segmentation of an image; in this case, there are only two
labels (i.e., a pixel is either part of the foreground or part of the background). We also
discussed graph cuts with binary labels in Section 3.3 for finding good compositing
seams and in Section 3.5 for seam carving.
The main advantage of graph cuts in these two-label situations is that efficient
algorithms exist to globally minimize the Gibbs energy in the special case when the
smoothness term satisfies a Potts model,2 namely
$$E^{i,j}_{\text{smoothness}}(L(i), L(j)) = \begin{cases} 0 & \text{if } L(i) = L(j) \\ V_{ij} & \text{if } L(i) \neq L(j) \end{cases} \tag{A.11}$$
Section 2.8.1 describes how to map a Gibbs energy in this form onto a graph with
weighted edges. To review, we begin with the set of nodes V used to define the Gibbs
energy function, and add two special terminal nodes that we call the source S and
the sink T. We assume the source is associated with label 0 and the sink is associated
with label 1.3 We also augment the set of edges E used to define the Gibbs energy
function, adding edges $e_{iS}$ and $e_{iT}$ between each regular node and each of the two
terminals. We put a nonnegative weight $w_{ij}$ on each edge $e_{ij}$. The weights on each
edge are related to the data and smoothness terms of Equation (A.5) by:
$$\begin{aligned}
w_{iS} &= E_{\text{data}}(L(i) = 1) \\
w_{iT} &= E_{\text{data}}(L(i) = 0) \\
w_{ij} &= V_{ij}
\end{aligned} \tag{A.12}$$
2 More generally, Kolmogorov and Zabih [248] proved that a binary Gibbs energy function can be
minimized using graph cuts if and only if
$E^{i,j}_{\text{smoothness}}(0, 0) + E^{i,j}_{\text{smoothness}}(1, 1) \leq E^{i,j}_{\text{smoothness}}(0, 1) + E^{i,j}_{\text{smoothness}}(1, 0)$
for each (i, j). See that paper for details on how to handle non-Potts models.
3 This is a minor change from the previous section, where we assumed the labels were indexed
starting from 1.
That is, if node i should have label 0, we want $E_{\text{data}}(L(i) = 0)$ to be low and
$E_{\text{data}}(L(i) = 1)$ to be high. Thus we want the weight of the edge attaching node i to the
source (label 0) to be high and the weight of the edge attaching node i to the sink (label 1) to be
low, so that the edge to the sink is cut and node i remains attached to the source.
In this section, we briefly describe how to compute the minimum cut on such a
graph — that is, a subset of edges C such that if we remove these edges from E, there
is no path from S to T in the resulting subgraph, and the subset C minimizes the cost
$$|C| = \sum_{(i,j) \in C} w_{ij} \tag{A.13}$$
The key concept is to transform the minimum cut problem to a maximum flow
problem on the graph. That is, we think of the edge weights as capacities for transport-
ing material (e.g., water), and want to determine the maximum amount of material
that can be flowed from the source to the sink along the edges.4 After computing
the maximum flow, the set of edges at full capacity corresponds to the minimum cut
(and the cost of this cut corresponds to the maximum amount of material that can be
flowed).
Computing the maximum flow is a well-studied problem in combinatorial opti-
mization, and one of the main approaches is called the Ford-Fulkerson method,
which at a high level operates as follows:
The main issue is how to efficiently find good paths in Step 2 to reach the maximum
flow as quickly as possible. Cormen et al. [106] describe various classical approaches,
including the Edmonds-Karp algorithm, in which the augmenting path is the short-
est path from S to T in the residual network (i.e., a graph in which an edge appears
if it has unused capacity). From the perspective of computer vision problems, in
which the graphs have a typical, regularly connected structure, the most important
contribution was made by Boykov and Kolmogorov [60]. They used a pair of search
trees emanating from the source and the sink that explore non-saturated edges to
find augmenting paths, and efficiently reuse these trees in each step. Their algorithm
has superior performance on computer vision problems such as image segmentation
and stereo compared to the leading maximum-flow algorithms. Many computer
vision researchers use Kolmogorov’s publicly available maximum-flow/minimum-cut
implementation (http://vision.csd.uwo.ca/code/), which was also incorporated
into Szeliski et al.’s common interface for minimizing energies over Markov Random
Fields (see the previous section).
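To make the connection between maximum flow and binary labeling concrete, the following Python sketch implements the classical Edmonds-Karp strategy (breadth-first search for the shortest augmenting path in the residual network) and applies it to a three-pixel labeling problem using the edge weights of Equation (A.12). It is a toy illustration, not the Boykov-Kolmogorov algorithm or Kolmogorov's released code; the function names `max_flow_min_cut` and `binary_labeling`, the dictionary-based graph representation, and the example costs are all our own assumptions.

```python
from collections import deque

def max_flow_min_cut(capacity, s, t):
    """Edmonds-Karp maximum flow.

    capacity: dict {u: {v: cap}} of directed edge capacities.
    Returns (flow value, set of nodes reachable from s in the final residual graph).
    """
    # Residual capacities, with a reverse entry for every edge.
    res = {s: {}, t: {}}
    for u in capacity:
        for v, c in capacity[u].items():
            res.setdefault(u, {}).setdefault(v, 0)
            res.setdefault(v, {}).setdefault(u, 0)
            res[u][v] += c

    def bfs():
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    if v == t:
                        return parent
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs()
        if t not in parent:
            # No augmenting path is left; the visited nodes form the source side of the cut.
            return flow, set(parent)
        # Find the bottleneck capacity along the path, then push that much flow.
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        flow += bottleneck

def binary_labeling(data_cost, V, edges):
    """data_cost[i] = (cost of label 0, cost of label 1); V is a constant Potts penalty."""
    cap = {"S": {}, "T": {}}
    for i, (e0, e1) in enumerate(data_cost):
        cap["S"][i] = e1           # w_iS = E_data(L(i) = 1)
        cap[i] = {"T": e0}         # w_iT = E_data(L(i) = 0)
    for i, j in edges:             # two opposing directed edges per undirected edge
        cap[i][j] = V
        cap[j][i] = V
    flow, source_side = max_flow_min_cut(cap, "S", "T")
    labels = [0 if i in source_side else 1 for i in range(len(data_cost))]
    return labels, flow

labels, energy = binary_labeling([(1, 5), (3, 2), (6, 1)], V=2, edges=[(0, 1), (1, 2)])
print(labels, energy)
```

For these costs the minimum cut leaves pixel 0 attached to the source (label 0) and pixels 1 and 2 attached to the sink (label 1), and the maximum flow of 6 equals the energy of that labeling, which can be checked by enumerating all eight labelings.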
Many Gibbs energy problems require more than two labels at a node, and the
algorithm used previously can’t be directly applied. For example, in Section 5.5.2
we discussed the problem of stereo correspondence, in which the labels correspond
4 Maximum flow problems generally require the graph to be directed, not undirected; this is
addressed by creating a directed graph that has two opposing directed edges i → j and j → i in
place of every edge (i, j) in the original undirected graph. Each directed edge is given the same
weight as the original undirected edge.
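$$\begin{aligned}
w_{iT} &= \infty && \text{if } L(i) = \alpha \\
w_{iT} &= E_{\text{data}}(L(i) = \text{current label}) && \text{if } L(i) \neq \alpha \\
w_{iS} &= E_{\text{data}}(L(i) = \alpha) && \text{all } i \in \mathcal{V} \\
w_{i a_{ij}} &= E^{i,j}_{\text{smoothness}}(L(i), \alpha) && \text{if } (i, j) \in \mathcal{E},\ L(i) \neq L(j) \\
w_{a_{ij} j} &= E^{i,j}_{\text{smoothness}}(\alpha, L(j)) && \text{if } (i, j) \in \mathcal{E},\ L(i) \neq L(j) \\
w_{a_{ij} T} &= E^{i,j}_{\text{smoothness}}(L(i), L(j)) && \text{if } (i, j) \in \mathcal{E},\ L(i) \neq L(j) \\
w_{ij} &= E^{i,j}_{\text{smoothness}}(L(i), \alpha) && \text{if } (i, j) \in \mathcal{E},\ L(i) = L(j)
\end{aligned} \tag{A.15}$$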
5 As previously, we require some conditions on the $E_{\text{smoothness}}$ term; specifically that it is a metric.
That is, for any labels L(i), L(j), L(h),
$$\begin{aligned}
E^{i,j}_{\text{smoothness}}(L(i), L(j)) = 0 &\iff L(i) = L(j) \\
E^{i,j}_{\text{smoothness}}(L(i), L(j)) = E^{i,j}_{\text{smoothness}}(L(j), L(i)) &\geq 0 \\
E^{i,j}_{\text{smoothness}}(L(i), L(j)) &\leq E^{i,h}_{\text{smoothness}}(L(i), L(h)) + E^{h,j}_{\text{smoothness}}(L(h), L(j))
\end{aligned} \tag{A.14}$$
However, unlike our interpretation in the binary labeling problem, after we compute
the minimum cut, all of the nodes separated from the source node are given the
label α, and all of the nodes separated from the sink node keep their current label.
This is a little counterintuitive and opposite our interpretation earlier in this section
and in Chapter 2, but we maintain this notation to be consistent with the original
formulation. We can see that since all the nodes already labeled α are connected to
the sink node with infinite weight, the only outcome of solving the subproblem is that
some nodes not currently labeled α change their label to α.
For the overall algorithm with N labels, we iterate the following steps:
While the overall solution to the multi-label problem resulting from α-expansion
doesn’t enjoy the global optimality guarantee that we have for the binary-label prob-
lem, it still has several desirable properties. First, it provably converges in a finite
number of iterations, and convergence is relatively fast since a large number of
spatially distant pixels can change their label simultaneously in each subproblem.
Second, Boykov et al. proved that the cost of the labeling at convergence is within a
known factor of the global minimum cost. In particular, for the multi-label Potts
model, the cost of the converged labeling is at most twice the global minimum
cost. This is an excellent result considering that finding the global minimum of the
multi-label problem is known to be NP-hard.
The notation in Equation (A.18) is shorthand for the positive definiteness of the matrix
$\frac{\partial^2 F}{\partial \theta^2}(\theta^*)$, which is called the Hessian.
∂θ
The condition in Equation (A.17) can be written:
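$$\frac{\partial F}{\partial \theta}(\theta^*) = -2\,(x - f(\theta^*))^\top \frac{\partial f}{\partial \theta}(\theta^*) = 0 \tag{A.19}$$
or equivalently,
$$\frac{\partial f}{\partial \theta}(\theta^*)^\top (x - f(\theta^*)) = 0 \tag{A.20}$$
Note that $\frac{\partial f}{\partial \theta}$ is an N × M matrix. When f is a linear function of the parameters θ
given by $f(\theta) = A\theta$, where $A \in \mathbb{R}^{N \times M}$ doesn’t depend on θ, then Equation (A.20) is a
linear equation in $\theta^*$, the solution of which is:
$$\theta^* = (A^\top A)^{-1} A^\top x \tag{A.21}$$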
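The step from Equation (A.20) to Equation (A.21) can be spelled out as follows, assuming that $A$ has full column rank so that $A^\top A$ is invertible. For the linear model $f(\theta) = A\theta$, the Jacobian $\frac{\partial f}{\partial \theta}$ is simply $A$, so

$$\begin{aligned}
A^\top (x - A\theta^*) &= 0 \\
A^\top A\, \theta^* &= A^\top x \\
\theta^* &= (A^\top A)^{-1} A^\top x
\end{aligned}$$

The middle line is the familiar set of normal equations for linear least squares.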
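$$F(\theta) \approx F(\theta^t) + \frac{\partial F}{\partial \theta}(\theta^t)\,(\theta - \theta^t) + \frac{1}{2}(\theta - \theta^t)^\top \frac{\partial^2 F}{\partial \theta^2}(\theta^t)\,(\theta - \theta^t) \tag{A.22}$$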
We can compute the Hessian matrix of second derivatives in Equation (A.22) as:
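$$\begin{aligned}
\frac{\partial^2 F}{\partial \theta^2}(\theta^t) &= -2 \sum_{k=1}^{N} (x_k - f(x_k; \theta^t))\, \frac{\partial^2 f(x_k; \theta)}{\partial \theta^2}(\theta^t) + 2\, \frac{\partial f}{\partial \theta}(\theta^t)^\top \frac{\partial f}{\partial \theta}(\theta^t) \\
&= -2 \sum_{k=1}^{N} (x_k - f(x_k; \theta^t))\, \frac{\partial^2 f(x_k; \theta)}{\partial \theta^2}(\theta^t) + 2\, J(\theta^t)^\top J(\theta^t)
\end{aligned} \tag{A.23}$$
$$J(\theta^t) = \frac{\partial f}{\partial \theta}(\theta^t) \tag{A.24}$$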
That is, the (j, k)th element of J is the partial derivative of the jth model prediction $\hat{x}_j$
with respect to the kth parameter $\theta_k$.
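The minimizer of this function, which we’ll denote $\theta^{t+1}$, is given by setting the
gradient of Equation (A.22) to zero:
$$\theta^{t+1} = \theta^t - \left( \frac{\partial^2 F}{\partial \theta^2}(\theta^t) \right)^{-1} \frac{\partial F}{\partial \theta}(\theta^t)^\top \tag{A.25}$$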
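When we add a step size parameter to Equation (A.25), and substitute the
expansion in Equation (A.23), we obtain:
$$\theta^{t+1} = \theta^t + \rho \left[ J(\theta^t)^\top J(\theta^t) - \sum_{k=1}^{N} (x_k - f(x_k; \theta^t))\, \frac{\partial^2 f(x_k; \theta)}{\partial \theta^2}(\theta^t) \right]^{-1} J(\theta^t)^\top \left[ x - f(\theta^t) \right] \tag{A.26}$$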
Under certain conditions, if $\theta^t$ is a good estimate of the minimizer $\theta^*$, the parameters
$\theta^{t+1}$ produced by Equation (A.26) are an incrementally better estimate of $\theta^*$. The
new estimate can then replace the old (i.e., incrementing t by 1) and Equation (A.26)
reapplied. The iterations terminate when the term $J(\theta^t)^\top (x - f(\theta^t))$ in Equation (A.26)
becomes vanishingly small (which is the condition in Equation (A.20)). The iteration
suggested by Equation (A.26) is called the Newton-Raphson method.
The convergence of the algorithm is governed by the choice of the step size param-
eter ρ. One possibility is to choose ρ = 1, corresponding to explicitly minimizing the
quadratic approximation at each iteration, but in practice we can choose ρ < 1 to
simply reduce the cost function in the direction of the vector on the right-hand side
of Equation (A.26).
A simplification of Equation (A.26) is obtained by dropping the second term in the
inverted matrix, producing the update equation
$$\theta^{t+1} = \theta^t + \rho \left( J(\theta^t)^\top J(\theta^t) \right)^{-1} J(\theta^t)^\top (x - f(\theta^t)) \tag{A.27}$$
$$\theta^{t+1} = \theta^t + \rho\, J(\theta^t)^\top (x - f(\theta^t)) \tag{A.28}$$
the approximated Hessian is positive definite. This falls into a more general class of
techniques called trust region methods.
More details about these algorithms (e.g., choice of step size, termination criteria,
search directions, pitfalls) can be found in numerical optimization textbooks
(e.g., [118, 351]). Implementations of the algorithms in pseudocode can be found in
Press et al. [374].
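As a small worked illustration of the update in Equation (A.27), commonly known as the Gauss-Newton iteration, the following Python sketch fits a two-parameter exponential model. The model $f_k(\theta) = \theta_0 \exp(\theta_1 s_k)$, the sample locations `s`, the starting point, and the function name `gauss_newton` are all assumptions made for the example; the 2 × 2 system is solved in closed form so that the code needs only the standard library.

```python
import math

def gauss_newton(s, x, theta, rho=1.0, n_iters=20, tol=1e-12):
    """Fit x_k ~ theta0 * exp(theta1 * s_k) with the update of Equation (A.27)."""
    t0, t1 = theta
    for _ in range(n_iters):
        e = [math.exp(t1 * sk) for sk in s]
        r = [xk - t0 * ek for xk, ek in zip(x, e)]          # residual x - f(theta)
        J = [(ek, t0 * sk * ek) for sk, ek in zip(s, e)]    # N x 2 Jacobian of f
        # Entries of J^T J and J^T r.
        a11 = sum(j0 * j0 for j0, _ in J)
        a12 = sum(j0 * j1 for j0, j1 in J)
        a22 = sum(j1 * j1 for _, j1 in J)
        b1 = sum(j0 * rk for (j0, _), rk in zip(J, r))
        b2 = sum(j1 * rk for (_, j1), rk in zip(J, r))
        if math.hypot(b1, b2) < tol:                        # J^T (x - f) is negligible
            break
        det = a11 * a22 - a12 * a12
        # Solve (J^T J) delta = J^T r by Cramer's rule and take a step of size rho.
        t0 += rho * (a22 * b1 - a12 * b2) / det
        t1 += rho * (a11 * b2 - a12 * b1) / det
    return t0, t1

s = [0.0, 1.0, 2.0, 3.0]
x = [2.0 * math.exp(0.5 * sk) for sk in s]    # noise-free data from theta = (2, 0.5)
theta = gauss_newton(s, x, theta=(1.8, 0.45))
print(theta)
```

Starting reasonably close to the true parameters, the iteration recovers them within a few steps. Starting far from the minimizer, or with noisy data and a poorly conditioned $J^\top J$, a smaller step size ρ or a damped variant would usually be needed.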
As discussed in Section 6.5.3.2, the structure of the Jacobian matrix has critical
implications for designing a fast numerical optimization algorithm. In particular, if
$J(\theta^t)$ is sparse (that is, each element of f only depends on a few elements of θ), the
computations should be arranged so that needless multiplications and additions of
elements known to be 0 are avoided.
B Figure Acknowledgments
All figures were created by the author and/or contain the author’s original images,
with the following exceptions:
• Figures 2.4, 2.14 and 2.16: images and trimaps courtesy Christoph Rhemann,
from www.alphamatting.com.
• Figure 2.26: Source Code courtesy of Summit Entertainment, LLC. Images pro-
vided by Oblique FX. Iron Man 2 images appear courtesy of Marvel Studios,
TM & ©2010 Marvel and Subs. www.marvel.com.
• Figure 3.33: image courtesy Jacob Becker.
• Figure 3.38: adidas, the 3-Stripes mark and the Impossible is Nothing mark are
registered trademarks of the adidas Group used with permission. The name,
image and likeness of Muhammad Ali are provided courtesy of Muhammad Ali
Enterprises LLC. Transformers: Dark of the Moon ©2011 Paramount Pictures.
All Rights Reserved. The Mummy: Tomb of the Dragon Emperor courtesy of
Universal Studios Licensing LLC. Images provided by Digital Domain.
• Figure 4.22: QR Code is a registered trademark of Denso Wave Incorporated,
Japan. ARToolKit markers from Kato et al. [231]. ARTags from Fiala [140].
• Figure 4.23: Thor images appear courtesy of Marvel Studios, TM & ©2011
Marvel and Subs. www.marvel.com. Transformers: Revenge of the Fallen ©2009
DW Studios L.L.C. and Paramount Pictures Corporation. All Rights Reserved.
Transformers: Revenge of the Fallen images provided by Digital Domain.
• Figure 5.9: Table image, layers, and flow field from Liu et al. [288].
• Figure 5.18: Tsukuba images from Nakamura et al. [344] and stereo result from
the Middlebury benchmark at http://vision.middlebury.edu/stereo/.
• Figure 5.31: Transformers: Dark of the Moon ©2011 Paramount Pictures.
All Rights Reserved. Transformers: Revenge of the Fallen ©2009 DW Stu-
dios L.L.C. and Paramount Pictures Corporation. All Rights Reserved. Images
provided by Digital Domain.
• Figures 6.5 and 6.6: source images courtesy Ziyan Wu.
• Figure 6.19: Transformers: Dark of the Moon ©2011 Paramount Pictures. All
Rights Reserved. A Beautiful Mind ©2001 Universal Studios and DW Stu-
dios L.L.C. All Rights Reserved. Courtesy of Universal Studios Licensing LLC.
Images provided by Digital Domain.
• Figure 7.1d: XSuit character and posing courtesy of Noah Schnapp.
• Figure 7.8: underlying skeleton image and Figure 7.19 underlying face image by
Patrick J. Lynch, illustrator, and C. Carl Jaffe MD, cardiologist, Yale University
Center for Advanced Instructional Media. Creative Commons Attribution 2.5
generic license.
• Figure 7.27: Rise of the Planet of the Apes ©2011 Twentieth Century Fox. All
rights reserved.
• Figure 8.3 and elsewhere in Chapter 8: building scan provided by David Doria.
• Figure 8.27: multi-view stereo dataset images and PMVS result from the
Middlebury benchmark at http://vision.middlebury.edu/mview/.
• Figure 8.30b: chicken mesh from A. Mian et al.,
http://www.csse.uwa.edu.au/~ajmal/3Dmodeling.html.
• Figure 8.40: Thor images appear courtesy of Marvel Studios, TM & ©2011
Marvel and Subs. www.marvel.com. Fast Five courtesy of Universal Studios
Licensing LLC. Images provided by Gentle Giant Studios.
The algorithmic results in several figures were created using original or modified
versions of publicly available executables and source code from various researchers.
The researchers, webpages, and corresponding figures are listed in this section.
Thanks to the researchers for making their code available.
Bibliography
[1] A. Abdel-Hakim and A. Farag. CSIFT: A SIFT descriptor with color invariant characteris-
tics. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), 2006.
[2] E. Adelson and J. Bergen. The plenoptic function and the elements of early vision. In M. S.
Landy and J. A. Movshon, editors, Computational Models of Visual Processing, chapter 1.
MIT Press, 1991.
[3] A. Agarwal and B. Triggs. Recovering 3D human pose from monocular images. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 28(1):44–58, Jan. 2006.
[4] S. Agarwal, N. Snavely, S. M. Seitz, and R. Szeliski. Bundle adjustment in the large. In
European Conference on Computer Vision (ECCV), 2010.
[5] S. Agarwal, N. Snavely, I. Simon, S. Seitz, and R. Szeliski. Building Rome in a day. In IEEE
International Conference on Computer Vision (ICCV), 2009.
[6] A. Agarwala. Efficient gradient-domain compositing using quadtrees. In ACM SIGGRAPH
(ACM Transactions on Graphics), 2007.
[7] A. Agarwala, M. Dontcheva, M. Agrawala, S. Drucker, A. Colburn, B. Curless, D. Salesin,
and M. Cohen. Interactive digital photomontage. In ACM SIGGRAPH (ACM Transactions
on Graphics), 2004.
[8] A. Agarwala, A. Hertzmann, D. Salesin, and S. Seitz. Keyframe-based tracking for
rotoscoping and animation. In ACM SIGGRAPH (ACM Transactions on Graphics), 2004.
[9] A. Agarwala, K. C. Zheng, C. Pal, M. Agrawala, M. Cohen, B. Curless, D. Salesin,
and R. Szeliski. Panoramic video textures. In ACM SIGGRAPH (ACM Transactions on
Graphics), 2005.
[10] A. Agrawal, R. Raskar, S. K. Nayar, and Y. Li. Removing photography artifacts using gra-
dient projection and flash-exposure sampling. In ACM SIGGRAPH (ACM Transactions on
Graphics), 2005.
[11] O. Alexander, M. Rogers, W. Lambeth, M. Chiang, and P. Debevec. The Digital Emily
project: photoreal facial modeling and animation. In ACM SIGGRAPH Courses, 2009.
[12] B. Allen, B. Curless, and Z. Popović. The space of human body shapes: reconstruction and
parameterization from range scans. In ACM SIGGRAPH (ACM Transactions on Graphics),
2003.
[13] P. K. Allen, A. Troccoli, B. Smith, I. Stamos, and S. Murray. The Beauvais Cathedral project.
In Workshop on Applications of Computer Vision in Archeology, 2003.
[14] M. Andersson and D. Betsis. Point reconstruction from noisy images. Journal of
Mathematical Imaging and Vision, 5(1):77–90, Jan. 1995.
[15] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. SCAPE: shape com-
pletion and animation of people. In ACM SIGGRAPH (ACM Transactions on Graphics),
2005.
[16] N. Apostoloff and A. Fitzgibbon. Bayesian video matting using learnt image priors. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
2004.
[17] O. Arikan and D. A. Forsyth. Interactive motion generation from examples. In ACM
SIGGRAPH (ACM Transactions on Graphics), 2002.
[18] O. Arikan, D. A. Forsyth, and J. F. O’Brien. Motion synthesis from annotations. In ACM
SIGGRAPH (ACM Transactions on Graphics), 2003.
[19] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm
for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM,
45(6):891–923, Nov. 1998.
[20] M. Ashikhmin. Synthesizing natural textures. In Symposium on Interactive 3D Graphics,
2001.
[21] S. Avidan and A. Shamir. Seam carving for content-aware image resizing. In ACM
SIGGRAPH (ACM Transactions on Graphics), 2007.
[22] S. Avidan and A. Shashua. Threading fundamental matrices. In European Conference on
Computer Vision (ECCV), 1998.
[23] S. Ayer and H. Sawhney. Layered representation of motion video using robust maximum-
likelihood estimation of mixture models and MDL encoding. In IEEE International
Conference on Computer Vision (ICCV), 1995.
[24] S. Bae, A. Agarwala, and F. Durand. Computational rephotography. ACM Transactions on
Graphics, 29(3):1–15, June 2010.
[25] X. Bai and G. Sapiro. Geodesic matting: A framework for fast interactive image and
video segmentation and matting. International Journal of Computer Vision, 82(2):113–32,
Apr. 2009.
[26] X. Bai, J. Wang, D. Simons, and G. Sapiro. Video SnapCut: robust video object cutout
using localized classifiers. In ACM SIGGRAPH (ACM Transactions on Graphics), 2009.
[27] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. Black, and R. Szeliski. A database and evalua-
tion methodology for optical flow. International Journal of Computer Vision, 92(1):1–31,
Mar. 2011.
[28] Y. Bando, B. Chen, and T. Nishita. Extracting depth and matte using a color-filtered
aperture. In ACM SIGGRAPH Asia (ACM Transactions on Graphics), 2008.
[29] I. Baran and J. Popović. Automatic rigging and animation of 3D characters. In ACM
SIGGRAPH (ACM Transactions on Graphics), 2007.
[30] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch: a random-
ized correspondence algorithm for structural image editing. In ACM SIGGRAPH (ACM
Transactions on Graphics), 2009.
[31] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow techniques.
International Journal of Computer Vision, 12(1):43–77, Feb. 1994.
[32] A. Baumberg. Reliable feature matching across widely separated views. In IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR), 2000.
[33] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In European
Conference on Computer Vision (ECCV), 2006.
[34] P. Beardsley, A. Zisserman, and D. Murray. Sequential updating of projective and affine
structure from motion. International Journal of Computer Vision, 23(3):235–59, June
1997.
[35] T. Beier and S. Neely. Feature-based image metamorphosis. In ACM SIGGRAPH (ACM
Transactions on Graphics), 1992.
[36] J. Beis and D. Lowe. Shape indexing using approximate nearest-neighbour search in
high-dimensional spaces. In IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR), 1997.
[37] P. N. Belhumeur. A Bayesian approach to binocular stereopsis. International Journal of
Computer Vision, 19(3):237–60, Aug. 1996.
[38] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape
contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509–22,
Apr. 2002.
[39] J. Bergen, P. Anandan, K. Hanna, and R. Hingorani. Hierarchical model-based motion
estimation. In European Conference on Computer Vision (ECCV), 1992.
[40] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester. Image inpainting. In ACM
SIGGRAPH (ACM Transactions on Graphics), 2000.
[41] P. Besl and H. McKay. A method for registration of 3-D shapes. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 14(2):239–56, Feb. 1992.
[42] P. J. Besl. Active, optical range imaging sensors. Machine Vision and Applications,
1(2):127–52, June 1988.
[43] P. Bhat, C. L. Zitnick, M. Cohen, and B. Curless. GradientShop: A gradient-domain
optimization framework for image and video filtering. ACM Transactions on Graphics,
29(2):1–14, Mar. 2010.
[44] B. Bickel, M. Botsch, R. Angst, W. Matusik, M. Otaduy, H. Pfister, and M. Gross. Multi-
scale capture of facial geometry and motion. In ACM SIGGRAPH (ACM Transactions on
Graphics), 2007.
[45] J. Bilmes. A gentle tutorial of the EM algorithm and its application to parameter esti-
mation for Gaussian mixture and hidden Markov models. Technical Report TR-97-021,
University of California, Berkeley, 1997.
[46] S. Birchfield and C. Tomasi. A pixel dissimilarity measure that is insensitive to image
sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(4):401–6,
Apr. 1998.
[47] M. Black and A. Jepson. Estimating optical flow in segmented images using variable-order
parametric models with local deformations. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(10):972–86, Oct. 1996.
[48] M. J. Black and P. Anandan. The robust estimation of multiple motions: Parametric and
piecewise-smooth flow fields. Computer Vision and Image Understanding, 63(1):75–104,
Jan. 1996.
[49] A. Blake, C. Rother, M. Brown, P. Pérez, and P. Torr. Interactive image segmentation using
an adaptive GMMRF model. In European Conference on Computer Vision (ECCV), 2004.
[50] M. Bleyer, C. Rother, and P. Kohli. Surface stereo with soft segmentation. In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[51] B. Bodenheimer, C. Rose, S. Rosenthal, and J. Pella. The process of motion capture: Deal-
ing with the data. In Eurographics Workshop on Computer Animation and Simulation,
1997.
[52] J. Bolz, I. Farmer, E. Grinspun, and P. Schröder. Sparse matrix solvers on the GPU:
conjugate gradients and multigrid. In ACM SIGGRAPH (ACM Transactions on Graphics),
2003.
[53] F. Bookstein. Principal warps: thin-plate splines and the decomposition of deformations.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(6):567–85, June 1989.
[54] G. Borshukov, J. Montgomery, and W. Werner. Playable universal capture: compression
and real-time sequencing of image-based facial animation. In ACM SIGGRAPH Courses,
2006.
[55] J. Bouguet. Pyramidal implementation of the Lucas-Kanade feature tracker: description
of the algorithm. Technical report, Intel Corporation, 1999.
[56] J.-Y. Bouguet and P. Perona. 3D photography on your desk. In IEEE International
Conference on Computer Vision (ICCV), 1998.
[57] K. L. Boyer and A. C. Kak. Color-encoded structured light for rapid active ranging. IEEE
Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(1):14–28, Jan. 1987.
[58] Y. Boykov and D. Huttenlocher. Adaptive Bayesian recognition in tracking rigid objects. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
2000.
[59] Y. Boykov and M. Jolly. Interactive graph cuts for optimal boundary and region segmen-
tation of objects in N-D images. In IEEE International Conference on Computer Vision
(ICCV), 2001.
[60] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algo-
rithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 26(9):1124–37, Sept. 2004.
[61] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph
cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–39,
Nov. 2001.
[62] D. Bradley, T. Boubekeur, and W. Heidrich. Accurate multi-view reconstruction using
robust binocular stereo and surface meshing. In IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR), 2008.
[63] D. Bradley, W. Heidrich, T. Popa, and A. Sheffer. High resolution passive facial
performance capture. In ACM SIGGRAPH (ACM Transactions on Graphics), 2010.
[64] P. Braido and X. Zhang. Quantitative analysis of finger motion coordination in hand
manipulative and gestic acts. Human Movement Science, 22(6):661–78, Apr. 2004.
[65] M. Brand and A. Hertzmann. Style machines. In ACM SIGGRAPH (ACM Transactions on
Graphics), 2000.
[66] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3D shape from
image streams. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), 2000.
[67] C. Bregler, J. Malik, and K. Pullen. Twist based acquisition and tracking of animal and
human kinematics. International Journal of Computer Vision, 56(3):179–94, Feb. 2004.
[68] R. Brinkmann. The Art and Science of Digital Compositing. Morgan Kaufmann, 2nd
edition, 2008.
[69] D. Brown. The bundle adjustment - progress and prospects. International Archives of the
Photogrammetry, Remote Sensing and Spatial Information Sciences, 21(3):1–33, 1976.
[70] M. Brown, D. Burschka, and G. Hager. Advances in computational stereo. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 25(8):993–1008, Aug. 2003.
[71] M. Brown, G. Hua, and S. Winder. Discriminative learning of local image descriptors.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):43–57, Jan. 2011.
[72] M. Brown and D. Lowe. Automatic panoramic image stitching using invariant features.
International Journal of Computer Vision, 74(1):59–73, Aug. 2007.
[73] T. Brox, C. Bregler, and J. Malik. Large displacement optical flow. In IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[74] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation
based on a theory for warping. In European Conference on Computer Vision (ECCV), 2004.
[75] A. Bruderlin and L. Williams. Motion signal processing. In ACM SIGGRAPH (ACM
Transactions on Graphics), 1995.
[76] A. Bruhn, J. Weickert, and C. Schnörr. Lucas/Kanade meets Horn/Schunck: Combining
local and global optic flow methods. International Journal of Computer Vision, 61(3):211–
31, Feb. 2005.
[77] G. J. Burghouts and J.-M. Geusebroek. Performance evaluation of local colour invariants.
Computer Vision and Image Understanding, 113(1):48–62, Jan. 2009.
[78] P. J. Burt and E. H. Adelson. A multiresolution spline with application to image mosaics.
ACM Transactions on Graphics, 2(4):217–36, Oct. 1983.
[79] R. Burtch. History of photogrammetry. Technical report, The Center for Photogrammetric
Training, Ferris State University, 2008.
[80] N. Campbell, G. Vogiatzis, C. Hernández, and R. Cipolla. Using multiple hypotheses to
improve depth-maps for multi-view stereo. In European Conference on Computer Vision
(ECCV), 2008.
[81] B. Caprile and V. Torre. Using vanishing points for camera calibration. International
Journal of Computer Vision, 4(2):127–39, Mar. 1990.
[82] J. C. Carr, R. K. Beatson, J. B. Cherrie, T. J. Mitchell, W. R. Fright, B. C. McCallum, and
T. R. Evans. Reconstruction and representation of 3D objects with radial basis functions.
In ACM SIGGRAPH (ACM Transactions on Graphics), 2001.
[83] D. Caspi, N. Kiryati, and J. Shamir. Range imaging with adaptive color structured light.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):470–80, May 1998.
[84] Y. Caspi and M. Irani. Spatio-temporal alignment of sequences. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 24(11):1409–24, Nov. 2002.
[85] J. Chai and J. K. Hodgins. Performance animation from low-dimensional control signals.
In ACM SIGGRAPH (ACM Transactions on Graphics), 2005.
[86] C. Chen and A. Kak. Modeling and calibration of a structured light scanner for 3-D robot
vision. In IEEE International Conference on Robotics and Automation, 1987.
[87] C. Chen and I. Stamos. Range image segmentation for modeling and object detection in
urban scenes. In International Conference on 3-D Digital Imaging and Modeling (3DIM),
2007.
[88] S. E. Chen and L. Williams. View interpolation for image synthesis. In ACM SIGGRAPH
(ACM Transactions on Graphics), 1993.
[89] Y. Chen and G. Medioni. Object modeling by registration of multiple range images. In
IEEE International Conference on Robotics and Automation, 1991.
[90] M.-M. Cheng, F.-L. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu. Repfinder: finding approxi-
mately repeated scene elements for image editing. In ACM SIGGRAPH (ACM Transactions
on Graphics), 2010.
[91] K. Cheung, S. Baker, and T. Kanade. Shape-from-silhouette of articulated objects and
its use for human body kinematics estimation and motion capture. In IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR), 2003.
[92] S. Chiaverini, G. Oriolo, and I. D. Walker. Kinematically redundant manipulators. In
B. Siciliano and O. Khatib, editors, Springer Handbook of Robotics, pages 245–68. Springer,
2008.
[93] T. S. Cho, M. Butman, S. Avidan, and W. Freeman. The patch transform and its appli-
cations to image editing. In IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR), 2008.
[94] G. Christensen, R. Rabbitt, and M. Miller. Deformable templates using large deformation
kinematics. IEEE Transactions on Image Processing, 5(10):1435–47, Oct. 1996.
[95] H.-K. Chu, W.-H. Hsu, N. J. Mitra, D. Cohen-Or, T.-T. Wong, and T.-Y. Lee. Camouflage
images. In ACM SIGGRAPH (ACM Transactions on Graphics), 2010.
[96] Y. Chuang, A. Agarwala, B. Curless, D. Salesin, and R. Szeliski. Video matting of complex
scenes. In ACM SIGGRAPH (ACM Transactions on Graphics), 2002.
[97] Y. Chuang, D. Goldman, B. Curless, D. Salesin, and R. Szeliski. Shadow matting and
compositing. In ACM SIGGRAPH (ACM Transactions on Graphics), 2003.
[98] Y. Chuang, D. Zongker, J. Hindorff, B. Curless, D. Salesin, and R. Szeliski. Environment
matting extensions: Towards higher accuracy and real-time capture. In ACM SIGGRAPH
(ACM Transactions on Graphics), 2000.
[99] Y.-Y. Chuang, B. Curless, D. Salesin, and R. Szeliski. A Bayesian approach to digital mat-
ting. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), 2001.
[100] R. Collins. A space-sweep approach to true multi-image matching. In IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR), 1996.
[101] R. Collins, Y. Liu, and M. Leordeanu. Online selection of discriminative tracking fea-
tures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1631–43,
Oct. 2005.
[102] D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space anal-
ysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–19,
May 2002.
[103] D. Comaniciu, V. Ramesh, and P. Meer. Kernel-based object tracking. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 25(5):564–77, May 2003.
[104] S. Cooper, A. Hertzmann, and Z. Popović. Active learning for real-time motion controllers.
In ACM SIGGRAPH (ACM Transactions on Graphics), 2007.
[105] S. Corazza, L. Mündermann, E. Gambaretto, G. Ferrigno, and T. Andriacchi. Marker-
less motion capture through visual hull, articulated ICP and subject specific model
generation. International Journal of Computer Vision, 87(1):156–69, Mar. 2010.
[106] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. MIT Press,
3rd edition, 2009.
[107] A. Criminisi, G. Cross, A. Blake, and V. Kolmogorov. Bilayer segmentation of live video. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
2006.
[108] A. Criminisi, P. Pérez, and K. Toyama. Region filling and object removal by exemplar-
based image inpainting. IEEE Transactions on Image Processing, 13(9):1200–12,
Sept. 2004.
[109] G. Csurka, D. Demirdjian, A. Ruf, and R. Horaud. Closed-form solutions for the Euclidean
calibration of a stereo rig. In European Conference on Computer Vision (ECCV), 1998.
[110] Y. Cui, S. Schuon, D. Chan, S. Thrun, and C. Theobalt. 3D shape scanning with a time-
of-flight camera. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), 2010.
[111] B. Curless and M. Levoy. Better optical triangulation through spacetime analysis. In IEEE
International Conference on Computer Vision (ICCV), 1995.
[112] B. Curless and M. Levoy. A volumetric method for building complex models from range
images. In ACM SIGGRAPH (ACM Transactions on Graphics), 1996.
[113] J. Davis and X. Chen. A laser range scanner designed for minimum calibration complexity.
In International Conference on 3-D Digital Imaging and Modeling (3DIM), 2001.
[114] J. Davis, D. Nehab, R. Ramamoorthi, and S. Rusinkiewicz. Spacetime stereo: a unify-
ing framework for depth from triangulation. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 27(2):296–302, Feb. 2005.
[115] A. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–67, June 2007.
[116] P. Debevec, A. Wenger, C. Tchou, A. Gardner, J. Waese, and T. Hawkins. A lighting repro-
duction approach to live-action compositing. In ACM SIGGRAPH (ACM Transactions on
Graphics), 2002.
[117] P. E. Debevec, C. J. Taylor, and J. Malik. Modeling and rendering architecture from
photographs: a hybrid geometry- and image-based approach. In ACM SIGGRAPH (ACM
Transactions on Graphics), 1996.
[118] J. Dennis, Jr. and R. Schnabel. Numerical Methods for Unconstrained Optimization and
Nonlinear Equations. Society for Industrial and Applied Mathematics, 1996.
[119] J. Deutscher and I. Reid. Articulated body motion capture by stochastic search.
International Journal of Computer Vision, 61(2):185–205, Feb. 2005.
[120] A. Dick, P. Torr, and R. Cipolla. Modelling and interpretation of architecture from several
images. International Journal of Computer Vision, 60(2):111–34, Nov. 2004.
[121] H. Dinh and S. Kropac. Multi-resolution spin-images. In IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
[122] T. Dobbert. Matchmoving: the Invisible Art of Camera Tracking. Sybex, 2005.
[123] W. Dong, N. Zhou, J.-C. Paul, and X. Zhang. Optimized image resizing using seam carving
and scaling. In ACM SIGGRAPH Asia (ACM Transactions on Graphics), 2009.
[124] D. Doria and R. J. Radke. Filling large holes in LiDAR data by inpainting depth gradients.
In Workshop on Point Cloud Processing in Computer Vision (PCP2012), 2012.
[125] J. Draréni, S. Roy, and P. Sturm. Methods for geometrical video projector calibration.
Machine Vision and Applications, 23(1):79–89, Jan. 2012.
[126] I. Drori, D. Cohen-Or, and H. Yeshurun. Fragment-based image completion. In ACM
SIGGRAPH (ACM Transactions on Graphics), 2003.
[127] Y. Dufournaud, C. Schmid, and R. Horaud. Matching images with different resolutions. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
2000.
[128] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In ACM
SIGGRAPH (ACM Transactions on Graphics), 2001.
[129] P. Ekman, W. V. Friesen, and J. C. Hager. Facial Action Coding System: The Manual.
A Human Face, 2002.
[130] J. H. Elder and R. M. Goldberg. Image editing in the contour domain. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 23(3):291–6, Mar. 2001.
[131] C. H. Esteban and F. Schmitt. Silhouette and stereo fusion for 3D object modeling.
Computer Vision and Image Understanding, 96(3):367–92, Dec. 2004.
[132] Z. Farbman, G. Hoffer, Y. Lipman, D. Cohen-Or, and D. Lischinski. Coordinates for instant
image cloning. In ACM SIGGRAPH (ACM Transactions on Graphics), 2009.
[133] H. Farid. Image forgery detection. IEEE Signal Processing Magazine, 26(2):16–25,
Mar. 2009.
[134] H. Farid. Seeing is not believing. IEEE Spectrum, 46(8):44–51, Aug. 2009.
[135] G. Farin. Curves and Surfaces for CAGD: A Practical Guide. Morgan Kaufmann, 5th edition,
2001.
[136] R. Fattal, D. Lischinski, and M. Werman. Gradient domain high dynamic range
compression. In ACM SIGGRAPH (ACM Transactions on Graphics), 2002.
[137] O. Faugeras and Q.-T. Luong. The Geometry of Multiple Images: The Laws That Govern
the Formation of Multiple Images of a Scene and Some of Their Applications. MIT Press,
2004.
[138] P. Felzenszwalb and D. Huttenlocher. Efficient belief propagation for early vision. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
2004.
[139] P. Felzenszwalb and R. Zabih. Dynamic programming and graph algorithms in computer
vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):721–40,
Apr. 2011.
[140] M. Fiala. Designing highly reliable fiducial markers. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 32(7):1317–24, July 2010.
[141] G. Finlayson, S. Hordley, C. Lu, and M. Drew. On the removal of shadows from images.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1):59–68, Jan. 2006.
[142] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting
with applications to image analysis and automated cartography. Communications of the
ACM, 24(6):381–95, June 1981.
[143] R. Fisher, A. Ashbrook, C. Robertson, and N. Werghi. A low-cost range finder using a visu-
ally located, structured light source. In International Conference on 3-D Digital Imaging
and Modeling (3DIM), 1999.
[144] A. Fitzgibbon. Simultaneous linear estimation of multiple view geometry and lens distor-
tion. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), 2001.
[145] A. Fitzgibbon and A. Zisserman. Automatic camera recovery for closed or open image
sequences. In European Conference on Computer Vision (ECCV), 1998.
[146] M. Floater. Mean value coordinates. Computer Aided Geometric Design, 20(1):19–27,
Mar. 2003.
[147] J. Flusser. On the independence of rotation moment invariants. Pattern Recognition,
33(9):1405–10, Sept. 2000.
[148] P.-E. Forssén. Maximally stable colour regions for recognition and matching. In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[149] P.-E. Forssén and D. Lowe. Shape descriptors for maximally stable extremal regions. In
IEEE International Conference on Computer Vision (ICCV), 2007.
[150] W. Förstner. A feature based correspondence algorithm for image matching. Interna-
tional Archives of Photogrammetry and Remote Sensing, 26(3):150–66, 1986.
[151] J. Foster. The Green Screen Handbook: Real-World Production Techniques. Sybex, 2010.
[152] J.-M. Frahm, P. Fite-Georgel, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y.-H. Jen, E. Dunn,
B. Clipp, S. Lazebnik, and M. Pollefeys. Building Rome on a cloudless day. In European
Conference on Computer Vision (ECCV), 2010.
[153] W. Freeman and E. Adelson. The design and use of steerable filters. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 13(9):891–906, Sept. 1991.
[154] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael. Learning low-level vision.
International Journal of Computer Vision, 40(1):25–47, Oct. 2000.
[155] A. Frome, D. Huber, R. Kolluri, T. Bülow, and J. Malik. Recognizing objects in range data
using regional point descriptors. In European Conference on Computer Vision (ECCV),
2004.
[156] C. Frueh, S. Jain, and A. Zakhor. Data processing algorithms for generating textured 3D
building facade meshes from laser scans and camera images. International Journal of
Computer Vision, 61(2):159–84, Feb. 2005.
[157] C. Frueh and A. Zakhor. Capturing 2 1/2 D depth and texture of time-varying scenes
using structured infrared light. In International Conference on 3-D Digital Imaging and
Modeling (3DIM), 2005.
[158] C. Früh and A. Zakhor. An automated method for large-scale, ground-based city model
acquisition. International Journal of Computer Vision, 60(1):5–24, Oct. 2004.
[159] Y. Furukawa and J. Ponce. Dense 3D motion capture for human faces. In IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[160] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–76, Aug. 2010.
[161] J. Gall, C. Stoll, E. de Aguiar, C. Theobalt, B. Rosenhahn, and H.-P. Seidel. Motion capture
using joint skeleton tracking and surface estimation. In IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR), 2009.
[162] V. Ganapathi, C. Plagemann, D. Koller, and S. Thrun. Real time motion capture using a
single time-of-flight camera. In IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR), 2010.
[163] E. Gastal and M. Oliveira. Shared sampling for real-time alpha matting. In Eurographics,
2010.
[164] D. Gavrila and L. Davis. 3-D model-based tracking of humans in action: a multi-
view approach. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), 1996.
[165] A. Gelb. Applied Optimal Estimation. MIT Press, 1974.
[166] D. C. Ghiglia and M. D. Pritt. Two-Dimensional Phase Unwrapping: Theory, Algorithms,
and Software. Wiley-Interscience, 1998.
[167] J. Gibson. The Perception of the Visual World. Riverside Press, 1950.
[168] M. Gleicher and N. Ferrier. Evaluating video-based motion capture. In Computer
Animation, 2002.
[169] M. L. Gleicher and F. Liu. Re-cinematography: Improving the camerawork of casual
video. ACM Transactions on Multimedia Computing, Communications and Applications,
5(1):1–28, Oct. 2008.
[170] M. Goesele, B. Curless, and S. Seitz. Multi-view stereo revisited. In IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
[171] M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S. Seitz. Multi-view stereo for com-
munity photo collections. In IEEE International Conference on Computer Vision (ICCV),
2007.
[172] A. Golovinskiy, V. Kim, and T. Funkhouser. Shape-based recognition of 3D point clouds
in urban environments. In IEEE International Conference on Computer Vision (ICCV),
2009.
[173] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins, 3rd edition, 1996.
[174] A. A. Gooch, S. C. Olsen, J. Tumblin, and B. Gooch. Color2Gray: salience-preserving color
removal. In ACM SIGGRAPH (ACM Transactions on Graphics), 2005.
[175] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen. The lumigraph. In ACM
SIGGRAPH (ACM Transactions on Graphics), 1996.
[176] L. Grady. Random walks for image segmentation. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 28(11):1768–83, Nov. 2006.
[177] L. Grady. A lattice-preserving multigrid method for solving the inhomogeneous Poisson
equations used in image analysis. In European Conference on Computer Vision (ECCV),
2008.
[178] L. Grady, T. Schiwietz, S. Aharon, and R. Westermann. Random walks for interactive
alpha-matting. In IASTED International Conference on Visualization, Imaging and Image
Processing, 2005.
[179] K. Grochow, S. L. Martin, A. Hertzmann, and Z. Popović. Style-based inverse kinematics.
In ACM SIGGRAPH (ACM Transactions on Graphics), 2004.
[180] R. Gross, I. Matthews, and S. Baker. Active appearance models with occlusion. Image and
Vision Computing, 24(6):593–604, June 2006.
[181] A. Grundhöfer and O. Bimber. VirtualStudio2Go: digital video composition for real
environments. In ACM SIGGRAPH Asia (ACM Transactions on Graphics), 2008.
[182] Y. Guan, W. Chen, X. Liang, Z. Ding, and Q. Peng. Easy matting – a stroke based approach
for continuous image matting. In Eurographics, 2006.
[183] B. Guenter, C. Grimm, D. Wood, H. Malvar, and F. Pighin. Making faces. In ACM
SIGGRAPH (ACM Transactions on Graphics), 1998.
[184] G. Hager and P. Belhumeur. Real-time tracking of image regions with changes in geom-
etry and illumination. In IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR), 1996.
[185] O. Hall-Holt and S. Rusinkiewicz. Stripe boundary codes for real-time structured-light
range scanning of moving objects. In IEEE International Conference on Computer Vision
(ICCV), 2001.
[186] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision
Conference, 1988.
[187] R. Hartley. Theory and practice of projective rectification. International Journal of
Computer Vision, 35(2):115–27, Nov. 1999.
[188] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge
University Press, 2nd edition, 2004.
[189] R. I. Hartley and P. Sturm. Triangulation. Computer Vision and Image Understanding,
68(2):146–57, Nov. 1997.
[190] N. Hasler, B. Rosenhahn, T. Thormahlen, M. Wand, J. Gall, and H.-P. Seidel. Marker-
less motion capture with unsynchronized moving cameras. In IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[191] J. Hays and A. A. Efros. Scene completion using millions of photographs. In ACM
SIGGRAPH (ACM Transactions on Graphics), 2007.
[192] K. He, J. Sun, and X. Tang. Fast matting using large kernel matting Laplacian matrices. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
2010.
[193] X. He and P. Niyogi. Locality preserving projections. In Advances in Neural Information
Processing Systems, 2003.
[194] L. Herda, P. Fua, R. Plänkers, R. Boulic, and D. Thalmann. Using skeleton-based tracking
to increase the reliability of optical motion capture. Human Movement Science, 20(3):313–
41, June 2001.
[195] L. Herda, R. Urtasun, and P. Fua. Hierarchical implicit surface joint limits to constrain
video-based motion capture. In European Conference on Computer Vision (ECCV), 2004.
[196] C. Hernandez, G. Vogiatzis, and R. Cipolla. Multiview photometric stereo. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 30(3):548–54, Mar. 2008.
[197] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In
ACM SIGGRAPH (ACM Transactions on Graphics), 2001.
[198] V. Hiep, R. Keriven, P. Labatut, and J.-P. Pons. Towards high-resolution large-scale
multi-view stereo. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), 2009.
[199] P. Hillman, J. Hannah, and D. Renshaw. Semi-automatic foreground/background seg-
mentation of motion picture images and image sequences. IEE Proceedings on Vision,
Image, and Signal Processing, 152(4):387–97, Aug. 2005.
[200] H. Hirschmüller and D. Scharstein. Evaluation of stereo matching costs on images with
radiometric differences. IEEE Transactions on Pattern Analysis and Machine Intelligence,
31(9):1582–99, Sept. 2009.
[201] M. Holden. A review of geometric transformations for nonrigid body registration. IEEE
Transactions on Medical Imaging, 27(1):111–28, Jan. 2008.
[202] R. Horaud and G. Csurka. Self-calibration and Euclidean reconstruction using motions
of a stereo rig. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), 1998.
[203] B. K. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17(1-
3):185–203, Aug. 1981.
[204] E. Horn and N. Kiryati. Toward optimal structured light patterns. In International
Conference on 3-D Digital Imaging and Modeling (3DIM), 1997.
[205] M.-K. Hu. Visual pattern recognition by moment invariants. IRE Transactions on
Information Theory, 8(2):179–87, Feb. 1962.
[206] P. S. Huang, C. Zhang, and F.-P. Chiang. High-speed 3-D shape measurement based on
digital fringe projection. Optical Engineering, 42(1):163–8, Jan. 2003.
[207] D. Huber, B. Akinci, P. Tang, A. Adan, B. Okorn, and X. Xiong. Using laser scanners for
modeling and analysis in architecture, engineering, and construction. In Conference on
Information Sciences and Systems (CISS), 2010.
[208] D. Huber, A. Kapuria, R. Donamukkala, and M. Hebert. Parts-based 3D object classifica-
tion. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), 2004.
[209] M. B. Hullin, M. Fuchs, I. Ihrke, H.-P. Seidel, and H. P. A. Lensch. Fluorescent immersion
range scanning. In ACM SIGGRAPH (ACM Transactions on Graphics), 2008.
[210] Y. Hung and W. Tang. Projective reconstruction from multiple views with minimiza-
tion of 2D reprojection error. International Journal of Computer Vision, 66(3):305–17,
Mar. 2006.
[211] D. Huynh. Calibration of a structured light system: a projective approach. In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1997.
[212] M. Isard and A. Blake. CONDENSATION - conditional density propagation for visual
tracking. International Journal of Computer Vision, 29(1):5–28, Aug. 1998.
[213] F. Isgrò and E. Trucco. Projective rectification without epipolar geometry. In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1999.
[214] H. Ishikawa and D. Geiger. Occlusions, discontinuities, and epipolar lines in stereo. In
European Conference on Computer Vision (ECCV), 1998.
[215] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene
analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–9,
Nov. 1998.
[216] R. A. Jarvis. A perspective on range finding techniques for computer vision.
IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2):122–39,
Mar. 1983.
[217] C. Je, S. Lee, and R.-H. Park. High-contrast color-stripe pattern for rapid structured-light
range imaging. In European Conference on Computer Vision (ECCV), 2004.
[218] S. Jeschke, D. Cline, and P. Wonka. A GPU Laplacian solver for diffusion curves and
Poisson image editing. In ACM SIGGRAPH Asia (ACM Transactions on Graphics), 2009.
[219] J. Jia, J. Sun, C. Tang, and H. Shum. Drag-and-drop pasting. In ACM SIGGRAPH (ACM
Transactions on Graphics), 2006.
[220] J. Jia, Y.-W. Tai, T.-P. Wu, and C.-K. Tang. Video repairing under variable illumination
using cyclic motions. IEEE Transactions on Pattern Analysis and Machine Intelligence,
28(5):832–39, May 2006.
[221] J. Jia and C. Tang. Inference of segmented color and texture description by tensor voting.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):771–86, June 2004.
[222] H. Jin, P. Favaro, and S. Soatto. Real-time feature tracking and outlier rejection with
changes in illumination. In IEEE International Conference on Computer Vision (ICCV),
2002.
[223] G. Johansson. Visual perception of biological motion and a model for its analysis.
Attention, Perception, and Psychophysics, 14(2):201–11, June 1973.
[224] A. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered
3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5):433–49,
May 1999.
[225] M. Johnson, G. Brostow, J. Shotton, O. Arandjelovic, V. Kwatra, and R. Cipolla. Semantic
photo synthesis. In Eurographics, 2006.
[226] N. Joshi, W. Matusik, and S. Avidan. Natural video matting using camera arrays. In ACM
SIGGRAPH (ACM Transactions on Graphics), 2006.
[227] N. Joshi, W. Matusik, S. Avidan, H. Pfister, and W. Freeman. Exploring defocus matting:
Nonparametric acceleration, super-resolution, and off-center matting. IEEE Computer
Graphics and Applications, 27(2):43–52, March–April 2007.
[228] S. Joshi and M. Miller. Landmark matching via large deformation diffeomorphisms. IEEE
Transactions on Image Processing, 9(8):1357–70, Aug. 2000.
[229] T. Kadir, A. Zisserman, and M. Brady. An affine invariant salient region detector. In
European Conference on Computer Vision (ECCV), 2004.
[230] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International
Journal of Computer Vision, 1(4):321–31, Jan. 1988.
[231] H. Kato and M. Billinghurst. Marker tracking and HMD calibration for a video-based
augmented reality conferencing system. In IEEE and ACM International Workshop on
Augmented Reality, 1999.
[232] M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. In Eurographics
Symposium on Geometry Processing, 2006.
[233] M. Kazhdan and H. Hoppe. Streaming multigrid for gradient-domain operations on large
images. In ACM SIGGRAPH (ACM Transactions on Graphics), 2008.
[234] Q. Ke and T. Kanade. A robust subspace approach to layer extraction. In Workshop on
Motion and Video Computing, 2002.
[235] Y. Ke and R. Sukthankar. PCA-SIFT: a more distinctive representation for local image
descriptors. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), 2004.
[236] R. Kehl and L. V. Gool. Markerless tracking of complex human motions from multiple
views. Computer Vision and Image Understanding, 104(2-3):190–209, Nov. 2006.
[237] C. Kenney, M. Zuliani, and B. Manjunath. An axiomatic approach to corner detection. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
2005.
[238] G. Kim, D. Huber, and M. Hebert. Segmentation of salient regions in outdoor scenes using
imagery and 3-D data. In IEEE Computer Society Workshop on Applications of Computer
Vision, 2008.
[239] M. Kim, K. Hyun, J. Kim, and J. Lee. Synchronized multi-character motion editing. In
ACM SIGGRAPH (ACM Transactions on Graphics), 2009.
[240] M. Kimura, M. Mochimaru, and T. Kanade. Projector calibration using arbitrary planes
and calibrated camera. In IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR), 2007.
[241] A. Kirk, J. O’Brien, and D. Forsyth. Skeletal parameter estimation from optical motion
capture data. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), 2005.
[242] A. Klaus, M. Sormann, and K. Karner. Segment-based stereo matching using belief prop-
agation and a self-adapting dissimilarity measure. In International Conference on Pattern
Recognition (ICPR), 2006.
[243] A. Kokaram, B. Collis, and S. Robinson. Automated rig removal with Bayesian motion
interpolation. IEE Proceedings on Vision, Image and Signal Processing, 152(4):407–14,
Aug. 2005.
[244] A. Kolb, E. Barth, R. Koch, and R. Larsen. Time-of-flight cameras in computer graphics.
In Eurographics, 2010.
[245] R. Kolluri, J. R. Shewchuk, and J. F. O’Brien. Spectral surface reconstruction from noisy
point clouds. In Eurographics Symposium on Geometry Processing, 2004.
[246] V. Kolmogorov and R. Zabih. Computing visual correspondence with occlusions using
graph cuts. In IEEE International Conference on Computer Vision (ICCV), 2001.
[247] V. Kolmogorov and R. Zabih. Multi-camera scene reconstruction via graph cuts. In
European Conference on Computer Vision (ECCV), 2002.
[248] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts?
IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):147–59, Feb. 2004.
[249] N. Komodakis and G. Tziritas. Image completion using efficient belief propagation
via priority scheduling and dynamic pruning. IEEE Transactions on Image Processing,
16(11):2649–61, Nov. 2007.
[250] T. Koninckx and L. Van Gool. Real-time range acquisition by adaptive structured light.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(3):432–45, Mar. 2006.
[251] S. Koppal, S. Yamazaki, and S. Narasimhan. Exploiting DLP illumination dithering
for reconstruction and photography of high-speed scenes. International Journal of
Computer Vision, 96(1):125–44, Jan. 2012.
[252] L. Kovar and M. Gleicher. Flexible automatic motion blending with registration curves.
In ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2003.
[253] L. Kovar and M. Gleicher. Automated extraction and parameterization of motions in large
data sets. In ACM SIGGRAPH (ACM Transactions on Graphics), 2004.
[254] L. Kovar, M. Gleicher, and F. Pighin. Motion graphs. In ACM SIGGRAPH (ACM
Transactions on Graphics), 2002.
[255] P. Krähenbühl, M. Lang, A. Hornung, and M. Gross. A system for retargeting of streaming
video. In ACM SIGGRAPH Asia (ACM Transactions on Graphics), 2009.
[256] K. Kraus. Photogrammetry: Geometry from Images and Laser Scans. de Gruyter, 2007.
[257] P. G. Kry and D. K. Pai. Interaction capture and synthesis. In ACM SIGGRAPH (ACM
Transactions on Graphics), 2006.
[258] K. N. Kutulakos and S. M. Seitz. A theory of shape by space carving. International Journal
of Computer Vision, 38(3):199–218, July 2000.
[259] V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: image and video
synthesis using graph cuts. In ACM SIGGRAPH (ACM Transactions on Graphics), 2003.
[260] J.-F. Lalonde and A. Efros. Using color compatibility for assessing image realism. In IEEE
International Conference on Computer Vision (ICCV), 2007.
[261] J.-F. Lalonde, D. Hoiem, A. A. Efros, C. Rother, J. Winn, and A. Criminisi. Photo clip art.
In ACM SIGGRAPH (ACM Transactions on Graphics), 2007.
[262] D. Lanman, D. Crispell, and G. Taubin. Surround structured lighting: 3-D scanning with
orthographic illumination. Computer Vision and Image Understanding, 113(11):1107–17,
Nov. 2009.
[263] A. Laurentini. The visual hull concept for silhouette-based image understanding. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 16(2):150–62, Feb. 1994.
[264] S. Lazebnik, C. Schmid, and J. Ponce. A sparse texture representation using local affine
regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1265–78,
Aug. 2005.
[265] J. Lee, J. Chai, P. S. A. Reitsma, J. K. Hodgins, and N. S. Pollard. Interactive control of
avatars animated with human motion data. In ACM SIGGRAPH (ACM Transactions on
Graphics), 2002.
[266] J. Lee and S. Y. Shin. A hierarchical approach to interactive motion editing for human-like
figures. In ACM SIGGRAPH (ACM Transactions on Graphics), 1999.
[267] S. Lee, G. Wolberg, and S. Y. Shin. Scattered data interpolation with multilevel B-splines.
IEEE Transactions on Visualization and Computer Graphics, 3(3):228–44, July 1997.
[268] S.-Y. Lee, K.-Y. Chwa, J. Hahn, and S. Y. Shin. Image morphing using deformation
techniques. The Journal of Visualization and Computer Animation, 7(1):3–23, 1996.
[269] S.-Y. Lee, K.-Y. Chwa, S. Y. Shin, and G. Wolberg. Image metamorphosis using snakes and
free-form deformations. In ACM SIGGRAPH (ACM Transactions on Graphics), 1995.
[270] V. Lepetit and P. Fua. Keypoint recognition using randomized trees. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 28(9):1465–79, Sept. 2006.
[271] A. Levin, D. Lischinski, and Y. Weiss. A closed-form solution to natural image matting.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):228–42, Feb. 2008.
[272] A. Levin, A. Rav-Acha, and D. Lischinski. Spectral matting. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 30(10):1699–1712, Oct. 2008.
[273] A. Levin, A. Zomet, and Y. Weiss. Learning how to inpaint from global image statistics. In
IEEE International Conference on Computer Vision (ICCV), 2003.
[274] M. Levoy and P. Hanrahan. Light field rendering. In ACM SIGGRAPH (ACM Transactions
on Graphics), 1996.
[275] M. Levoy, K. Pulli, B. Curless, S. Rusinkiewicz, D. Koller, L. Pereira, M. Ginzton, S. Ander-
son, J. Davis, J. Ginsberg, J. Shade, and D. Fulk. The digital Michelangelo project: 3D
scanning of large statues. In ACM SIGGRAPH (ACM Transactions on Graphics), 2000.
[276] J. Lewis. Fast template matching. In Vision Interface, 1995.
[277] H. Li, B. S. Manjunath, and S. K. Mitra. Multisensor image fusion using the wavelet
transform. Graphical Models and Image Processing, 57(3):235–45, May 1995.
[278] Y. Li, L. Sharan, and E. H. Adelson. Compressing and companding high dynamic range
images with subband architectures. In ACM SIGGRAPH (ACM Transactions on Graphics),
2005.
[279] Y. Li, J. Sun, and H. Shum. Video object cut and paste. In ACM SIGGRAPH (ACM
Transactions on Graphics), 2005.
[280] Y. Li, J. Sun, C. Tang, and H. Shum. Lazy snapping. In ACM SIGGRAPH (ACM Transactions
on Graphics), 2004.
[281] D. Liebowitz and S. Carlsson. Uncalibrated motion capture exploiting articulated
structure constraints. International Journal of Computer Vision, 51(3):171–87, Feb. 2003.
[282] D. Liebowitz and A. Zisserman. Metric rectification for perspective images of planes. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
1998.
[283] D. Liebowitz and A. Zisserman. Combining scene and auto-calibration constraints. In
IEEE International Conference on Computer Vision (ICCV), 1999.
[284] I.-C. Lin and M. Ouhyoung. Mirror MoCap: Automatic and efficient capture of dense 3D
facial motion parameters from video. The Visual Computer, 21(6):355–72, July 2005.
[285] T. Lindeberg. Detecting salient blob-like image structures and their scales with a scale-
space primal sketch: A method for focus-of-attention. International Journal of Computer
Vision, 11(3):283–318, Dec. 1993.
[286] T. Lindeberg. Feature detection with automatic scale selection. International Journal of
Computer Vision, 30(2):79–116, Nov. 1998.
[287] T. Lindeberg and J. Gårding. Shape-adapted smoothing in estimation of 3-D shape cues
from affine deformations of local 2-D brightness structure. Image and Vision Computing,
15(6):415–34, June 1997.
[288] C. Liu, W. Freeman, E. Adelson, and Y. Weiss. Human-assisted motion annotation. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
2008.
[289] C. Liu, J. Yuen, and A. Torralba. SIFT flow: Dense correspondence across scenes and its
applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):978–
94, May 2011.
[290] C. K. Liu, A. Hertzmann, and Z. Popović. Learning physics-based motion style with
nonlinear inverse optimization. In ACM SIGGRAPH (ACM Transactions on Graphics),
2005.
[291] F. Liu and M. Gleicher. Automatic image retargeting with fisheye-view warping. In ACM
Symposium on User Interface Software and Technology, 2005.
[292] F. Liu and M. Gleicher. Video retargeting: automating pan and scan. In ACM International
Conference on Multimedia, 2006.
[293] F. Liu, M. Gleicher, H. Jin, and A. Agarwala. Content-preserving warps for 3D video
stabilization. In ACM SIGGRAPH (ACM Transactions on Graphics), 2009.
[294] F. Liu, M. Gleicher, J. Wang, H. Jin, and A. Agarwala. Subspace video stabilization. ACM
Transactions on Graphics, 30(1):4:1–4:10, Feb. 2011.
[295] G. Liu and L. McMillan. Estimation of missing markers in human motion capture. The
Visual Computer, 22(9):721–8, Sept. 2006.
[296] J. Liu, J. Sun, and H. Shum. Paint selection. In ACM SIGGRAPH (ACM Transactions on
Graphics), 2009.
[297] L. Liu, R. Chen, L. Wolf, and D. Cohen-Or. Optimizing photo composition. In Eurograph-
ics, 2010.
[298] W.-Y. Lo, J. V. Baar, C. Knaus, M. Zwicker, and M. Gross. Stereoscopic 3D copy and paste.
In ACM SIGGRAPH Asia (ACM Transactions on Graphics), 2010.
[299] H. Lombaert, Y. Sun, L. Grady, and C. Xu. A multilevel banded graph cuts method for fast
image segmentation. In IEEE International Conference on Computer Vision (ICCV), 2005.
[300] H. C. Longuet-Higgins. A computer algorithm for reconstructing a scene from two
projections. Nature, 293:133–5, Sept. 1981.
[301] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3D surface
construction algorithm. In ACM SIGGRAPH (ACM Transactions on Graphics), 1987.
[302] H. Lou and J. Chai. Example-based human motion denoising. IEEE Transactions on
Visualization and Computer Graphics, 16(5):870–9, Sept. 2010.
[303] M. Lourakis and A. Argyros. Is Levenberg-Marquardt the most efficient optimization
algorithm for implementing bundle adjustment? In IEEE International Conference on
Computer Vision (ICCV), 2005.
[304] M. I. Lourakis and A. A. Argyros. Efficient, causal camera tracking in unpre-
pared environments. Computer Vision and Image Understanding, 99(2):259–90,
Aug. 2005.
[305] M. I. A. Lourakis and A. A. Argyros. SBA: a software package for generic sparse bundle
adjustment. ACM Transactions on Mathematical Software, 36(1):2:1–2:30, Mar. 2009.
[306] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International
Journal of Computer Vision, 60(2):91–110, Nov. 2004.
[307] B. D. Lucas and T. Kanade. An iterative image registration technique with an application
to stereo vision. In Imaging Understanding Workshop, 1981.
[308] Q.-T. Luong and O. Faugeras. Self-calibration of a moving camera from point cor-
respondences and fundamental matrices. International Journal of Computer Vision,
22(3):261–89, Mar. 1997.
[309] W.-C. Ma, A. Jones, J.-Y. Chiang, T. Hawkins, S. Frederiksen, P. Peers, M. Vukovic,
M. Ouhyoung, and P. Debevec. Facial performance synthesis using deformation-
driven polynomial displacement maps. In ACM SIGGRAPH Asia (ACM Transactions on
Graphics), 2008.
[310] D. Mahajan, F. Huang, W. Matusik, R. Ramamoorthi, and P. Belhumeur. Moving gradi-
ents: a path-based method for plausible image interpolation. In ACM SIGGRAPH (ACM
Transactions on Graphics), 2009.
[311] S. Mahamud, M. Hebert, Y. Omori, and J. Ponce. Provably-convergent iterative methods
for projective structure from motion. In IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR), 2001.
[312] H. Malm and A. Heyden. Stereo head calibration from a planar object. In IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR), 2001.
[313] A. Mansfield, P. Gehler, L. Van Gool, and C. Rother. Scene carving: Scene consistent image
retargeting. In European Conference on Computer Vision (ECCV), 2010.
[314] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide-baseline stereo from maximally
stable extremal regions. Image and Vision Computing, 22(10):761–7, 2004.
[315] Y. Matsushita, E. Ofek, W. Ge, X. Tang, and H.-Y. Shum. Full-frame video stabilization
with motion inpainting. IEEE Transactions on Pattern Analysis and Machine Intelligence,
28(7):1150–63, July 2006.
[316] I. Matthews and S. Baker. Active appearance models revisited. International Journal of
Computer Vision, 60(2):135–64, Nov. 2004.
[317] S. Maybank. Theory of Reconstruction from Image Motion. Springer-Verlag, 1993.
[318] J. McCann and N. S. Pollard. Real-time gradient-domain painting. In ACM SIGGRAPH
(ACM Transactions on Graphics), 2008.
[319] M. McGuire, W. Matusik, H. Pfister, J. Hughes, and F. Durand. Defocus video matting. In
ACM SIGGRAPH (ACM Transactions on Graphics), 2005.
[320] M. McGuire, W. Matusik, and W. Yerazunis. Practical, real-time studio matting using dual
imagers. In Eurographics Symposium on Rendering, 2006.
[321] P. McLauchlan. Gauge independence in optimization algorithms for 3D vision. In
B. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and Practice,
pages 183–99. Springer, 2000.
[322] L. McMillan and G. Bishop. Plenoptic modeling: an image-based rendering system. In
ACM SIGGRAPH (ACM Transactions on Graphics), 1995.
[323] A. Menache. Understanding Motion Capture for Computer Animation. Morgan Kauf-
mann, 2nd edition, 2011.
[324] I. Mikić, M. Trivedi, E. Hunter, and P. Cosman. Human body model acquisition and
tracking using voxel data. International Journal of Computer Vision, 53(3):199–223, July
2003.
[325] K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In IEEE
International Conference on Computer Vision (ICCV), 2001.
[326] K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In European
Conference on Computer Vision (ECCV), 2002.
[327] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors.
International Journal of Computer Vision, 60(1):63–86, Oct. 2004.
[328] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615–30, Oct. 2005.
[329] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir,
and L. Van Gool. A comparison of affine region detectors. International Journal of
Computer Vision, 65(1):43–72, Nov. 2005.
[330] F. Mindru, T. Tuytelaars, L. Van Gool, and T. Moons. Moment invariants for recognition
under changing viewpoint and illumination. Computer Vision and Image Understanding,
94(1-3):3–27, Apr.–Jun. 2004.
[331] T. B. Moeslund and E. Granum. A survey of computer vision-based human motion
capture. Computer Vision and Image Understanding, 81(3):231–68, Mar. 2001.
[332] T. B. Moeslund, A. Hilton, and V. Krüger. A survey of advances in vision-based human
motion capture and analysis. Computer Vision and Image Understanding, 104(2-3):90–
126, Nov. 2006.
[333] P. Montesinos, V. Gouet, R. Deriche, and D. Pelé. Matching color uncalibrated images
using differential invariants. Image and Vision Computing, 18(9):659–71, June 2000.
[334] R. Morano, C. Ozturk, R. Conn, S. Dubin, S. Zietz, and J. Nissano. Structured light using
pseudorandom codes. IEEE Transactions on Pattern Analysis and Machine Intelligence,
20(3):322–7, Mar. 1998.
[335] H. Moravec. Obstacle avoidance and navigation in the real world by a seeing robot rover.
Technical Report CMU-RI-TR-3, Carnegie Mellon University, 1980.
[336] P. Moreels and P. Perona. Evaluation of features detectors and descriptors based on 3D
objects. International Journal of Computer Vision, 73(3):263–84, July 2007.
[337] D. Morris, K. Kanatani, and T. Kanade. Uncertainty modeling for optimal structure from
motion. In B. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and
Practice, pages 315–45. Springer, 2000.
[338] D. Morris, K. Kanatani, and T. Kanade. Gauge fixing for accurate 3D estimation. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
2001.
[339] E. Mortensen and W. Barrett. Interactive segmentation with intelligent scissors. Graphi-
cal Models and Image Processing, 60(5):349–84, Sept. 1998.
[340] E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd. Generic and real-time
structure from motion using local bundle adjustment. Image and Vision Computing,
27(8):1178–93, July 2009.
[341] K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate inference:
An empirical study. In Uncertainty in AI, 1999.
[342] R. M. Murray, Z. Li, and S. S. Sastry. A Mathematical Introduction to Robotic Manipulation.
CRC Press, 1994.
[343] H.-H. Nagel and W. Enkelmann. An investigation of smoothness constraints for the esti-
mation of displacement vector fields from image sequences. IEEE Transactions on Pattern
Analysis and Machine Intelligence, PAMI-8(5):565–93, Sept. 1986.
[344] Y. Nakamura, T. Matsuura, K. Satoh, and Y. Ohta. Occlusion detectable stereo-occlusion
patterns in camera matrix. In IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR), 1996.
[345] S. Narasimhan, S. Nayar, B. Sun, and S. Koppal. Structured light in scattering media. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
2005.
[346] S. Negahdaripour. Revised definition of optical flow: integration of radiometric and
geometric cues for dynamic scene analysis. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 20(9):961–79, Sept. 1998.
[370] J.-P. Pons, R. Keriven, and O. Faugeras. Multi-view stereo reconstruction and scene flow
estimation with a global image-based matching score. International Journal of Computer
Vision, 72(2):179–93, June 2007.
[371] Z. Popović and A. Witkin. Physically based motion transformation. In ACM SIGGRAPH
(ACM Transactions on Graphics), 1999.
[372] R. Poppe. Vision-based human motion analysis: An overview. Computer Vision and Image
Understanding, 108(1-2):4–18, Oct. 2007.
[373] T. Porter and T. Duff. Compositing digital images. In ACM SIGGRAPH (ACM Transactions
on Graphics), 1984.
[374] W. Press, B. Flannery, S. Teukolsky, W. Vetterling, et al. Numerical Recipes. Cambridge
University Press, 2007.
[375] D. Price. The Pixar Touch. Vintage, 2009.
[376] Y. Pritch, E. Kav-Venaki, and S. Peleg. Shift-map image editing. In IEEE International
Conference on Computer Vision (ICCV), 2009.
[377] A. Protiere and G. Sapiro. Interactive image segmentation via adaptive weighted
distances. IEEE Transactions on Image Processing, 16(4):1046–57, Apr. 2007.
[378] K. Pulli. Multiview registration for large data sets. In International Conference on 3-D
Digital Imaging and Modeling (3DIM), 1999.
[379] R. Radke, S. Andra, O. Al-Kofahi, and B. Roysam. Image change detection algorithms: a
systematic survey. IEEE Transactions on Image Processing, 14(3):294–307, Mar. 2005.
[380] R. Radke, P. Ramadge, S. Kulkarni, and T. Echigo. Efficiently synthesizing virtual video.
IEEE Transactions on Circuits and Systems for Video Technology, 13(4):325–37, Apr. 2003.
[381] R. Raskar and P. Beardsley. A self-correcting projector. In IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR), 2001.
[382] R. Raskar, H. Nii, B. deDecker, Y. Hashimoto, J. Summet, D. Moore, Y. Zhao, J. Westhues,
P. Dietz, J. Barnwell, S. Nayar, M. Inami, P. Bekaert, M. Noland, V. Branzoi, and E. Bruns.
Prakash: lighting aware motion capture using photosensing markers and multiplexed
illuminators. In ACM SIGGRAPH (ACM Transactions on Graphics), 2007.
[383] A. Rav-Acha, P. Kohli, C. Rother, and A. Fitzgibbon. Unwrap mosaics: a new rep-
resentation for video editing. In ACM SIGGRAPH (ACM Transactions on Graphics),
2008.
[384] A. Rav-Acha, Y. Pritch, D. Lischinski, and S. Peleg. Dynamosaicing: Mosaicing of dynamic
scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10):1789–
1801, Oct. 2007.
[385] I. D. Reid. Projective calibration of a laser-stripe range finder. Image and Vision
Computing, 14(9):659–66, Oct. 1996.
[386] E. Reinhard, G. Ward, S. Pattanaik, and P. Debevec. High Dynamic Range Imaging:
Acquisition, Display, and Image-Based Lighting. Morgan Kaufmann, 2005.
[387] L. Ren, A. Patrick, A. A. Efros, J. K. Hodgins, and J. M. Rehg. A data-driven approach to
quantifying natural human motion. In ACM SIGGRAPH (ACM Transactions on Graphics),
2005.
[388] L. Ren, G. Shakhnarovich, J. K. Hodgins, H. Pfister, and P. Viola. Learning silhouette
features for control of human motion. ACM Transactions on Graphics, 24(4):1303–31,
Oct. 2005.
[389] C. Rhemann, C. Rother, and M. Gelautz. Improving color modeling for alpha matting. In
British Machine Vision Conference (BMVC), 2008.
[390] C. Rhemann, C. Rother, P. Kohli, and M. Gelautz. A spatially varying PSF-based prior
for alpha matting. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), 2010.
[391] C. Rhemann, C. Rother, A. Rav-Acha, and T. Sharp. High resolution matting via interactive
trimap segmentation. In IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR), 2008.
[392] C. Rhemann, C. Rother, J. Wang, M. Gelautz, P. Kohli, and P. Rott. A perceptually
motivated online benchmark for image matting. In IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR), 2009.
[393] R. Rickitt. Special Effects: The History and Technique. Billboard Books, 2nd edition, 2007.
[394] M. Ringer and J. Lasenby. A procedure for automatically estimating model parameters in
optical motion capture. Image and Vision Computing, 22(10):843–50, Sept. 2004.
[395] C. Rose, M. Cohen, and B. Bodenheimer. Verbs and adverbs: multidimensional motion
interpolation. IEEE Computer Graphics and Applications, 18(5):32–40, Sept. 1998.
[396] C. Rose, B. Guenter, B. Bodenheimer, and M. F. Cohen. Efficient generation of motion
transitions using spacetime constraints. In ACM SIGGRAPH (ACM Transactions on
Graphics), 1996.
[397] B. Rosenhahn and T. Brox. Scaled motion dynamics for markerless motion capture. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
2007.
[398] B. Rosenhahn, C. Schmaltz, T. Brox, J. Weickert, D. Cremers, and H.-P. Seidel. Markerless
motion capture of man-machine interaction. In IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR), 2008.
[399] D. Ross, D. Tarlow, and R. Zemel. Learning articulated structure and motion.
International Journal of Computer Vision, 88(2):214–37, June 2010.
[400] E. Rosten and T. Drummond. Fusing points and lines for high performance tracking. In
IEEE International Conference on Computer Vision (ICCV), 2005.
[401] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In
European Conference on Computer Vision (ECCV), 2006.
[402] E. Rosten, R. Porter, and T. Drummond. FASTER and better: A machine learning approach
to corner detection. IEEE Transactions on Pattern Analysis and Machine Intelligence,
32(1):105–19, Jan. 2010.
[403] S. Roth and M. Black. On the spatial statistics of optical flow. International Journal of
Computer Vision, 74(1):33–50, Aug. 2007.
[404] C. Rother, L. Bordeaux, Y. Hamadi, and A. Blake. Autocollage. In ACM SIGGRAPH (ACM
Transactions on Graphics), 2006.
[405] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using
iterated graph cuts. In ACM SIGGRAPH (ACM Transactions on Graphics), 2004.
[406] S. Roy and I. Cox. A maximum-flow formulation of the N-camera stereo correspondence
problem. In IEEE International Conference on Computer Vision (ICCV), 1998.
[407] M. Rubinstein, D. Gutierrez, O. Sorkine, and A. Shamir. A comparative study of image
retargeting. In ACM SIGGRAPH Asia (ACM Transactions on Graphics), 2010.
[408] M. Rubinstein, A. Shamir, and S. Avidan. Improved seam carving for video retargeting. In
ACM SIGGRAPH (ACM Transactions on Graphics), 2008.
[409] M. Rubinstein, A. Shamir, and S. Avidan. Multi-operator media retargeting. In ACM
SIGGRAPH (ACM Transactions on Graphics), 2009.
[410] S. Rusinkiewicz, O. Hall-Holt, and M. Levoy. Real-time 3D model acquisition. In ACM
SIGGRAPH (ACM Transactions on Graphics), 2002.
[411] S. Rusinkiewicz and M. Levoy. Efficient variants of the ICP algorithm. In International
Conference on 3-D Digital Imaging and Modeling (3DIM), 2001.
[412] M. Ruzon and C. Tomasi. Alpha estimation in natural images. In IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR), 2000.
[413] Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied
Mathematics, 2003.
[414] F. Sadlo, T. Weyrich, R. Peikert, and M. Gross. A practical structured light acquisition
system for point-based geometry and texture. In Eurographics/IEEE VGTC Symposium
on Point-Based Graphics, 2005.
[415] A. Safonova and J. K. Hodgins. Analyzing the physical correctness of interpolated human
motion. In ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2005.
[416] A. Safonova, J. K. Hodgins, and N. S. Pollard. Synthesizing physically realistic human
motion in low-dimensional, behavior-specific spaces. In ACM SIGGRAPH (ACM
Transactions on Graphics), 2004.
[417] J. Salvi, J. Batlle, and E. Mouaddib. A robust-coded pattern projection for dynamic 3D
scene measurement. Pattern Recognition Letters, 19(11):1055–65, Sept. 1998.
[418] J. Salvi, S. Fernandez, T. Pribanic, and X. Llado. A state of the art in
structured light patterns for surface profilometry. Pattern Recognition, 43(8):2666–80,
Aug. 2010.
[419] J. Salvi, J. Pagès, and J. Batlle. Pattern codification strategies in structured light systems.
Pattern Recognition, 37(4):827–49, Apr. 2004.
[420] P. Sand and S. Teller. Video matching. In ACM SIGGRAPH (ACM Transactions on
Graphics), 2004.
[421] H. S. Sawhney, Y. Guo, K. Hanna, R. Kumar, S. Adkins, and S. Zhou. Hybrid stereo camera:
an IBR approach for synthesis of very high resolution stereoscopic image sequences. In
ACM SIGGRAPH (ACM Transactions on Graphics), 2001.
[422] S. Schaefer, T. McPhail, and J. Warren. Image deformation using moving least squares.
In ACM SIGGRAPH (ACM Transactions on Graphics), 2006.
[423] F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or
“How do I organize my holiday snaps?”. In European Conference on Computer Vision
(ECCV), 2002.
[424] F. Schaffalitzky and A. Zisserman. Automated location matching in movies. Computer
Vision and Image Understanding, 92(2-3):236–64, Nov. 2003.
[425] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms. International Journal of Computer Vision, 47(1):7–42,
Apr. 2002.
[426] D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured light. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
2003.
[427] H. Schey. Div, Grad, Curl, and All That: An Informal Text on Vector Calculus. W.W. Norton
and Company, 2005.
[428] G. Schindler, F. Dellaert, and S. B. Kang. Inferring temporal order of images from 3D
structure. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), 2007.
[429] C. Schmalz and E. Angelopoulou. A graph-based approach for robust single-shot
structured light. In IEEE International Workshop on Projector-Camera Systems (PROCAMS),
2010.
[430] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 19(5):530–5, May 1997.
[431] C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest point detectors.
International Journal of Computer Vision, 37(2):151–72, June 2000.
[432] S. Se, D. Lowe, and J. Little. Vision-based mobile robot localization and mapping using
scale-invariant features. In IEEE International Conference on Robotics and Automation,
2001.
[433] S. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation
of multi-view stereo reconstruction algorithms. In IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR), 2006.
[434] S. M. Seitz and C. R. Dyer. View morphing. In ACM SIGGRAPH (ACM Transactions on
Graphics), 1996.
[435] S. M. Seitz and C. R. Dyer. Photorealistic scene reconstruction by voxel coloring.
International Journal of Computer Vision, 35(2):151–73, Nov. 1999.
[436] J. A. Sethian. Level Set Methods and Fast Marching Methods: Evolving Interfaces in
Computational Geometry, Fluid Mechanics, Computer Vision, and Materials Science. Cambridge
University Press, 1999.
[437] V. Setlur, S. Takagi, R. Raskar, M. Gleicher, and B. Gooch. Automatic image retargeting.
In International Conference on Mobile and Ubiquitous Multimedia, 2005.
[438] M. Shaheen, J. Gall, R. Strzodka, L. Van Gool, and H.-P. Seidel. A comparison of 3D
model-based tracking approaches for human motion capture in uncontrolled environments. In
IEEE Computer Society Workshop on Applications of Computer Vision, 2009.
[439] A. Shashua. Algebraic functions for recognition. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 17(8):779–89, Aug. 1995.
[463] N. Snavely, S. Seitz, and R. Szeliski. Skeletal graphs for efficient structure from motion. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
2008.
[464] N. Snavely, I. Simon, M. Goesele, R. Szeliski, and S. Seitz. Scene reconstruction and
visualization from community photo collections. Proceedings of the IEEE, 98(8):1370–90,
Aug. 2010.
[465] J. M. Soucie, C. Wang, A. Forsyth, S. Funk, M. Denny, K. E. Roach, and D. Boone.
Range of motion measurements: reference values and a database for comparison studies.
Haemophilia, 17(3):500–7, May 2011.
[466] D. Stavens and S. Thrun. Unsupervised learning of invariant features using video. In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[467] D. Steedly and I. Essa. Propagation of innovative information in non-linear least-squares
structure from motion. In IEEE International Conference on Computer Vision (ICCV),
2001.
[468] D. Steedly, I. Essa, and F. Dellaert. Spectral partitioning for structure from motion. In
IEEE International Conference on Computer Vision (ICCV), 2003.
[469] G. Strang. Introduction to Linear Algebra. Wellesley Cambridge Press, 4th edition, 2009.
[470] C. Strecha, W. von Hansen, L. Van Gool, P. Fua, and U. Thoennessen. On benchmarking
camera calibration and multi-view stereo for high resolution imagery. In IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[471] P. Sturm. Critical motion sequences for monocular self-calibration and uncalibrated
Euclidean reconstruction. In IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR), 1997.
[472] P. Sturm. Critical motion sequences for the self-calibration of cameras and stereo systems
with variable focal length. In British Machine Vision Conference (BMVC), 1999.
[473] P. Sturm and S. Maybank. On plane-based camera calibration: A general algorithm,
singularities, applications. In IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR), 1999.
[474] P. Sturm and B. Triggs. A factorization based algorithm for multi-image projective
structure and motion. In European Conference on Computer Vision (ECCV), 1996.
[475] D. Sun, S. Roth, and M. Black. Secrets of optical flow estimation and their principles. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
2010.
[476] D. Sun, S. Roth, J. Lewis, and M. Black. Learning optical flow. In European Conference on
Computer Vision (ECCV), 2008.
[477] D. Sun, E. Sudderth, and M. Black. Layered image motion with explicit occlusions,
temporal consistency, and depth ordering. In Conference on Neural Information Processing
Systems, 2010.
[478] J. Sun, J. Jia, C. Tang, and H. Shum. Poisson matting. In ACM SIGGRAPH (ACM
Transactions on Graphics), 2004.
[479] J. Sun, Y. Li, S. Kang, and H. Shum. Flash matting. In ACM SIGGRAPH (ACM Transactions
on Graphics), 2006.
[480] J. Sun, Y. Li, S. Kang, and H.-Y. Shum. Symmetric stereo matching for occlusion
handling. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), 2005.
[481] J. Sun, L. Yuan, J. Jia, and H.-Y. Shum. Image completion with structure propagation. In
ACM SIGGRAPH (ACM Transactions on Graphics), 2005.
[482] J. Sun, N.-N. Zheng, and H.-Y. Shum. Stereo matching using belief propagation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 25(7):787–800, July 2003.
[483] K. Sunkavalli, M. K. Johnson, W. Matusik, and H. Pfister. Multi-scale image
harmonization. In ACM SIGGRAPH (ACM Transactions on Graphics), 2010.
[484] R. Szeliski. Locally adapted hierarchical basis preconditioning. In ACM SIGGRAPH (ACM
Transactions on Graphics), 2006.
[485] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen,
and C. Rother. A comparative study of energy minimization methods for Markov Random
Fields with smoothness-based priors. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 30(6):1068–80, June 2008.
[486] H. Tao, H. Sawhney, and R. Kumar. A global matching framework for stereo computation.
In IEEE International Conference on Computer Vision (ICCV), 2001.
[487] D. Tell and S. Carlsson. Wide baseline point matching using affine invariants computed
from intensity profiles. In European Conference on Computer Vision (ECCV), 2000.
[488] J.-P. Thirion. Image matching as a diffusion process: an analogy with Maxwell’s demons.
Medical Image Analysis, 2(3):243–60, Sept. 1998.
[489] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, 2005.
[490] E. Tola, V. Lepetit, and P. Fua. DAISY: An efficient dense descriptor applied to
wide-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence,
32(5):815–30, May 2010.
[491] D. Tolani, A. Goswami, and N. I. Badler. Real-time inverse kinematics techniques for
anthropomorphic limbs. Graphical Models, 62(5):353–88, Sept. 2000.
[492] C. Tomasi and T. Kanade. Detection and tracking of point features. Technical Report
CMU-CS-91-132, Carnegie Mellon University, 1991.
[493] C. Tomasi and T. Kanade. Shape and motion from image streams under
orthography: a factorization method. International Journal of Computer Vision, 9(2):137–54,
Nov. 1992.
[494] T. Tommasini, A. Fusiello, E. Trucco, and V. Roberto. Making good features track better. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
1998.
[495] P. H. Torr, A. W. Fitzgibbon, and A. Zisserman. The problem of degeneracy in
structure and motion recovery from uncalibrated image sequences. International Journal of
Computer Vision, 32(1):27–44, Aug. 1999.
[496] L. Torresani, A. Hertzmann, and C. Bregler. Learning non-rigid 3D shape from 2D motion.
In Conference on Neural Information Processing Systems, 2004.
[497] M. Trajković and M. Hedley. Fast corner detection. Image and Vision Computing,
16(2):75–87, 1998.
[498] B. Triggs. Factorization methods for projective structure and motion. In IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR), 1996.
[499] B. Triggs. Autocalibration and the absolute quadric. In IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR), 1997.
[500] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment — a modern
synthesis. In B. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory
and Practice, pages 153–77. Springer, 2000.
[501] G. Turk and M. Levoy. Zippered polygon meshes from range images. In ACM SIGGRAPH
(ACM Transactions on Graphics), 1994.
[502] G. Turk and J. F. O’Brien. Shape transformation using variational implicit functions. In
ACM SIGGRAPH (ACM Transactions on Graphics), 1999.
[503] T. Tuytelaars and L. Van Gool. Matching widely separated views based on affine invariant
regions. International Journal of Computer Vision, 59(1):61–85, Aug. 2004.
[504] S. Umeyama. Least-squares estimation of transformation parameters between two point
patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):376–80,
Apr. 1991.
[505] R. Unnikrishnan and M. Hebert. Extracting scale and illuminant invariant regions
through color. In British Machine Vision Conference (BMVC), 2006.
[506] S. Uras, F. Girosi, A. Verri, and V. Torre. A computational approach to motion perception.
Biological Cybernetics, 60(2):79–87, Dec. 1988.
[507] R. Urtasun, D. Fleet, and P. Fua. 3D people tracking with Gaussian process
dynamical models. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), 2006.
[508] K. van de Sande, T. Gevers, and C. Snoek. Evaluating color descriptors for object and scene
recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
32(9):1582–96, Sept. 2010.
[509] J. van de Weijer and C. Schmid. Coloring local feature extraction. In European Conference
on Computer Vision (ECCV), 2006.
[510] L. Van Gool, T. Moons, and D. Ungureanu. Affine / photometric invariants for planar
intensity patterns. In European Conference on Computer Vision (ECCV), 1996.
[511] A. Vasile and R. Marino. Pose-independent automatic target detection and recognition
using 3D laser radar imagery. Lincoln Laboratory Journal, 15(1):61–78, 2005.
[512] M. V. Venkatesh, S. S. Cheung, and J. Zhao. Efficient object-based video inpainting.
Pattern Recognition Letters, 30(2):168–79, 2009.
[513] V. Verma, R. Kumar, and S. Hsu. 3D building detection and modeling from aerial LIDAR
data. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), 2006.
[514] L. Vincent and P. Soille. Watersheds in digital spaces: an efficient algorithm based on
immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence,
13(6):583–98, June 1991.
[515] V. Vineet and P. Narayanan. CUDA cuts: Fast graph cuts on the GPU. In CVPR Workshop
on Visual Computer Vision on GPUs, 2008.
[516] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of
Computer Vision, 57(2):137–54, May 2004.
[517] P. Vlahos. Composite photography utilizing sodium vapor illumination, 1963. US Patent
3,095,304.
[518] P. Vlahos. Electronic composite photography, 1971. US Patent 3,595,987.
[519] D. Vlasic, R. Adelsberger, G. Vannucci, J. Barnwell, M. Gross, W. Matusik, and J. Popović.
Practical motion capture in everyday surroundings. In ACM SIGGRAPH (ACM
Transactions on Graphics), 2007.
[520] D. Vlasic, I. Baran, W. Matusik, and J. Popović. Articulated mesh animation from
multi-view silhouettes. In ACM SIGGRAPH (ACM Transactions on Graphics), 2008.
[521] D. Vlasic, P. Peers, I. Baran, P. Debevec, J. Popović, S. Rusinkiewicz, and W. Matusik.
Dynamic shape capture using multi-view photometric stereo. In ACM SIGGRAPH Asia
(ACM Transactions on Graphics), 2009.
[522] G. Vogiatzis, C. Hernandez, P. Torr, and R. Cipolla. Multiview stereo via volumetric
graph-cuts and occlusion robust photo-consistency. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 29(12):2241–6, Dec. 2007.
[523] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg. Real-time
detection and tracking for augmented reality on mobile phones. IEEE Transactions on
Visualization and Computer Graphics, 16(3):355–68, May 2010.
[524] D. Wagner and D. Schmalstieg. ARToolKitPlus for pose tracking on mobile devices. In
Computer Vision Winter Workshop, 2007.
[525] M. Wainwright, T. Jaakkola, and A. Willsky. MAP estimation via agreement on trees:
message-passing and linear programming. IEEE Transactions on Information Theory,
51(11):3697–717, Nov. 2005.
[526] K. Waldron and J. Schmiedeler. Kinematics. In B. Siciliano and O. Khatib, editors, Springer
Handbook of Robotics, pages 9–33. Springer, 2008.
[527] H. Wang, R. Raskar, and N. Ahuja. Seamless video editing. In International Conference on
Pattern Recognition (ICPR), 2004.
[528] J. Wang and E. Adelson. Representing moving images with layers. IEEE Transactions on
Image Processing, 3(5):625–38, Sept. 1994.
[529] J. Wang, M. Agrawala, and M. Cohen. Soft scissors: an interactive tool for realtime high
quality matting. In ACM SIGGRAPH (ACM Transactions on Graphics), 2007.
[530] J. Wang, P. Bhat, R. Colburn, M. Agrawala, and M. Cohen. Interactive video cutout. In
ACM SIGGRAPH (ACM Transactions on Graphics), 2005.
[531] J. Wang and M. Cohen. An iterative optimization approach for unified image
segmentation and matting. In IEEE International Conference on Computer Vision (ICCV),
2005.
[532] J. Wang and M. Cohen. Optimized color sampling for robust matting. In IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[533] J. Wang and M. Cohen. Simultaneous matting and compositing. In IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[534] Y.-S. Wang, H. Fu, O. Sorkine, T.-Y. Lee, and H.-P. Seidel. Motion-aware temporal
coherence for video resizing. In ACM SIGGRAPH Asia (ACM Transactions on Graphics),
2009.
[535] Y.-S. Wang, H.-C. Lin, O. Sorkine, and T.-Y. Lee. Motion-based video retargeting with
optimized crop-and-warp. In ACM SIGGRAPH (ACM Transactions on Graphics), 2010.
[536] Y.-S. Wang, C.-L. Tai, O. Sorkine, and T.-Y. Lee. Optimized scale-and-stretch for image
resizing. In ACM SIGGRAPH Asia (ACM Transactions on Graphics), 2008.
[537] Z.-F. Wang and Z.-G. Zheng. A region based stereo matching algorithm using
cooperative optimization. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), 2008.
[538] A. Wedel, T. Pock, C. Zach, H. Bischof, and D. Cremers. An improved algorithm for TV-L1
optical flow. In Statistical and Geometrical Approaches to Visual Motion Analysis, 2009.
[539] L.-Y. Wei and M. Levoy. Fast texture synthesis using tree-structured vector quantization.
In ACM SIGGRAPH (ACM Transactions on Graphics), 2000.
[540] T. Weise, B. Leibe, and L. Van Gool. Fast 3D scanning with automatic motion
compensation. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), 2007.
[541] Y. Weiss. Smoothness in layers: Motion segmentation using nonparametric mixture
estimation. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), 1997.
[542] G. Welch and E. Foxlin. Motion tracking: no silver bullet, but a respectable arsenal. IEEE
Computer Graphics and Applications, 22(6):24–38, Nov. 2002.
[543] W. Wells III, P. Viola, H. Atsumi, S. Nakajima, and R. Kikinis. Multi-modal volume
registration by maximization of mutual information. Medical Image Analysis, 1(1):35–51, Mar.
1996.
[544] Y. Wexler, A. Fitzgibbon, and A. Zisserman. Bayesian estimation of layers from multiple
images. In European Conference on Computer Vision (ECCV), 2002.
[545] Y. Wexler, A. Fitzgibbon, and A. Zisserman. Image-based environment matting. In
Eurographics Workshop on Rendering, 2002.
[546] Y. Wexler, E. Shechtman, and M. Irani. Space-time completion of video. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 29(3):463–76, Mar. 2007.
[547] L. Williams. Performance-driven facial animation. In ACM SIGGRAPH (ACM Transactions
on Graphics), 1990.
[548] S. Winder and M. Brown. Learning local image descriptors. In IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[549] S. Winder, G. Hua, and M. Brown. Picking the best DAISY. In IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[550] A. Witkin and Z. Popović. Motion warping. In ACM SIGGRAPH (ACM Transactions on
Graphics), 1995.
[551] G. Wolberg. Image morphing: a survey. The Visual Computer, 14(8):360–72, Dec. 1998.
[552] L. Wolf, M. Guttmann, and D. Cohen-Or. Non-homogeneous content-driven
video-retargeting. In IEEE International Conference on Computer Vision (ICCV), 2007.
[553] S. Wright. Digital Compositing for Film and Video. Focal Press, 3rd edition, 2010.
[554] H. Wu, R. Chellappa, A. Sankaranarayanan, and S. Zhou. Robust visual tracking using
the time-reversibility constraint. In IEEE International Conference on Computer Vision
(ICCV), 2007.
[555] T. Wu, C. Tang, M. Brown, and H. Shum. Natural shadow matting. ACM Transactions on
Graphics, 26(2), June 2007.
[556] J. Xiao, H. Cheng, H. Sawhney, C. Rao, and M. Isnardi. Bilateral filtering-based optical
flow estimation with occlusion detection. In European Conference on Computer Vision
(ECCV), 2006.
[557] L. Xu and J. Jia. Stereo matching: An outlier confidence approach. In European Conference
on Computer Vision (ECCV), 2008.
[558] L. Xu, J. Jia, and Y. Matsushita. Motion detail preserving optical flow estimation. In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[559] K. Yamane and Y. Nakamura. Natural motion animation through constraining and
deconstraining at will. IEEE Transactions on Visualization and Computer Graphics,
9(3):352–60, July 2003.
[560] J. Yan and M. Pollefeys. A factorization-based approach for articulated nonrigid shape,
motion and kinematic chain recovery from video. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 30(5):865–77, May 2008.
[561] G. Yang, J. Becker, and C. Stewart. Estimating the location of a camera with respect to
a 3D model. In International Conference on 3-D Digital Imaging and Modeling (3DIM),
2007.
[562] G. Yang, C. Stewart, M. Sofka, and C.-L. Tsai. Registration of challenging image pairs:
Initialization, estimation, and decision. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 29(11):1973–89, Nov. 2007.
[563] Q. Yang, L. Wang, R. Yang, H. Stewenius, and D. Nister. Stereo matching with
color-weighted correlation, hierarchical belief propagation, and occlusion handling. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 31(3):492–504, Mar. 2009.
[564] Q. Yang, R. Yang, J. Davis, and D. Nister. Spatial-depth super resolution for range
images. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), 2007.
[565] L. Yatziv, A. Bartesaghi, and G. Sapiro. O(N) implementation of the fast marching
algorithm. Journal of Computational Physics, 212(2):393–9, Mar. 2006.
[566] J. Yedidia, W. Freeman, and Y. Weiss. Understanding belief propagation and its
generalizations. In G. Lakemeyer and B. Nebel, editors, Exploring Artificial Intelligence in the
New Millennium, pages 239–70. Elsevier, 2003.
[567] R. Zabih and J. Woodfill. Non-parametric local transforms for computing visual
correspondence. In European Conference on Computer Vision (ECCV), 1994.
[568] L. Zhang, B. Curless, and S. Seitz. Rapid shape acquisition using color structured light and
multi-pass dynamic programming. In International Symposium on 3D Data Processing
Visualization and Transmission (3DPVT), 2002.
[569] L. Zhang, B. Curless, and S. Seitz. Spacetime stereo: shape recovery for dynamic scenes. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR),
2003.
[570] L. Zhang, N. Snavely, B. Curless, and S. M. Seitz. Spacetime faces: high resolution capture
for modeling and animation. In ACM SIGGRAPH (ACM Transactions on Graphics), 2004.
[571] S. Zhang and P. S. Huang. High-resolution, real-time three-dimensional shape
measurement. Optical Engineering, 45(12):123601:1–8, Dec. 2006.
[572] S. Zhang and S.-T. Yau. High-speed three-dimensional shape measurement system using
a modified two-plus-one phase-shifting algorithm. Optical Engineering, 46(11):113603:1–6,
Nov. 2007.
[573] Z. Zhang. Iterative point matching for registration of free-form curves and surfaces.
International Journal of Computer Vision, 13(2):119–52, Oct. 1994.
[574] Z. Zhang. On the epipolar geometry between two images with lens distortion. In
International Conference on Pattern Recognition (ICPR), 1996.
[575] Z. Zhang. A flexible new technique for camera calibration. Technical Report
MSR-TR-98-71, Microsoft Research, 1998.
[576] Z. Zhang, R. Deriche, O. Faugeras, and Q.-T. Luong. A robust technique for matching two
uncalibrated images through the recovery of the unknown epipolar geometry. Artificial
Intelligence, 78(1-2):87–119, Oct. 1995.
[577] J. Zhao and N. I. Badler. Inverse kinematics positioning using nonlinear programming
for highly articulated figures. ACM Transactions on Graphics, 13(4):313–36, Oct. 1994.
[578] W. Zhao, D. Nister, and S. Hsu. Alignment of continuous video onto 3D point clouds. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 27(8):1305–18, Aug. 2005.
[579] Y. Zheng and C. Kambhamettu. Learning based digital matting. In IEEE International
Conference on Computer Vision (ICCV), 2009.
[580] J. Zhu, M. Liao, R. Yang, and Z. Pan. Joint depth and alpha matte optimization via fusion
of stereo and time-of-flight sensor. In IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR), 2009.
[581] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski. High-quality video
view interpolation using a layered representation. In ACM SIGGRAPH (ACM Transactions
on Graphics), 2004.
[582] D. Zongker, D. Werner, B. Curless, and D. Salesin. Environment matting and compositing.
In ACM SIGGRAPH (ACM Transactions on Graphics), 1999.
[583] V. B. Zordan, A. Majkowska, B. Chiu, and M. Fast. Dynamic response for motion capture
animation. In ACM SIGGRAPH (ACM Transactions on Graphics), 2005.
[584] V. B. Zordan and N. C. Van Der Horst. Mapping optical motion capture data to
skeletal motion using a physical model. In ACM SIGGRAPH/Eurographics Symposium on
Computer Animation, 2003.
Index