Introduction To Bayesian Learning
Aaron Hertzmann
University of Toronto
SIGGRAPH 2004 Tutorial
Evaluations: www.siggraph.org/courses_evaluation
CG is maturing …
… but it’s still hard to create
… it’s hard to create in real-time
Data-driven computer graphics
mocap.cs.cmu.edu
Example: character posing
Example: shape modeling
[Thrun et al.]
Bayesian reasoning is …
A standard tool of computer vision
and …
Applications in:
• Data mining
• Robotics
• Signal processing
• Bioinformatics
• Text analysis (inc. spam filters)
• and (increasingly) graphics!
Outline for this course
3:45-4:00pm: Introduction
4:00-4:45pm: Fundamentals
- From axioms to probability theory
- Prediction and parameter estimation
4:45-5:15pm: Statistical shape models
- Gaussian models and PCA
- Applications: facial modeling, mocap
5:15-5:30pm: Summary and questions
More about the course
• Prerequisites
– Linear algebra, multivariate calculus, graphics, optimization
• Unique features
– Start from first principles
– Emphasis on graphics problems
– Bayesian prediction
– Take-home “principles”
Bayesian vs. Frequentist
• Frequentist statistics
– a.k.a. “orthodox statistics”
– Probability = frequency of occurrences in an infinite # of trials
– Arose from sciences with populations
– p-values, t-tests, ANOVA, etc.
• Bayesian vs. frequentist debates
have been long and acrimonious
Bayesian vs. Frequentist
“In academia, the Bayesian
revolution is on the verge of
becoming the majority viewpoint,
which would have been
unthinkable 10 years ago.”
- Bradley P. Carlin, professor of
public health, University of
Minnesota
New York Times, Jan 20, 2004
Bayesian vs. Frequentist
If necessary, please leave these
assumptions behind (for today):
• “A probability is a frequency”
• “Probability theory only applies
to large populations”
• “Probability theory is arcane and
boring”
Fundamentals
What is reasoning?
• How do we infer properties of the
world?
• How should computers do it?
Aristotelian logic
• If A is true, then B is true
• A is true
• Therefore, B is true
Probabilistic reasoning: degrees of belief, e.g. P(A) = .1, P(A) = .5, P(B | A) = .99, P(A | B) = .3
Basic rules
Sum rule:
P(A) + P(¬A) = 1
Example:
A: “it will rain today”
p(A) = .9 p(¬A) = .1
Basic rules
Sum rule:
∑i P(Ai) = 1
when exactly one of Ai must be true
Basic rules
Product rule:
P(A,B) = P(A|B) P(B)
Sum rule (also holds when conditioning on another variable):
∑i P(Ai) = 1    ∑i P(Ai|B) = 1
Summary
Product rule P(A,B) = P(A|B) P(B)
Sum rule ∑i P(Ai) = 1
What is P(A)?
Example, continued
Model: P(B) = .7, P(A|B) = .8, P(A|¬B) = .5
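A worked step, filling in the numbers with the sum and product rules above:
P(A) = P(A,B) + P(A,¬B) = P(A|B) P(B) + P(A|¬B) P(¬B) = .8 × .7 + .5 × .3 = .71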
P(M|D) = P(D|M) P(M) / P(D)    (posterior)
Principle #3:
Describe your model of the
world, and then compute the
probabilities of the unknowns
given the observations
Principle #3a:
P(M|D) = P(D|M) P(M) / P(D)    (posterior)
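As a small illustration (my own, reusing the earlier model with P(A) = .71 from the worked step above), Bayes' rule inverts the conditional:
P(B|A) = P(A|B) P(B) / P(A) = .8 × .7 / .71 ≈ .79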
Discrete variables
Probabilities over discrete
variables
C ∈ { Heads, Tails }
P(C=Heads) = .5
P(C=Heads) + P(C=Tails) = 1
Continuous variables
Let x ∈ R^N
How do we describe beliefs over x?
e.g., x is a face, joint angles, …
Continuous variables
Probability density function (PDF)
a.k.a. “marginal probability”: p(x)
P(x0 ≤ x ≤ x1) = ∫[x0, x1] p(x) dx   (the area under p(x) between x0 and x1)
Gaussian distributions
x ∼ N(µ, σ²)
p(x|µ,σ²) = exp(−(x−µ)²/(2σ²)) / √(2πσ²)
(bell-shaped density centered at the mean µ)
Why use Gaussians?
• Convenient analytic properties
• Central Limit Theorem
• Works well
• Not for everything, but a good
building block
• For more reasons, see
[Bishop 1995, Jaynes 2003]
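The Central Limit Theorem bullet can be checked numerically; the following is a small illustrative sketch of mine (not from the course), assuming numpy: averages of many independent uniform draws have approximately the Gaussian mean and variance the CLT predicts.

```python
# Sketch: Central Limit Theorem illustration.
# Averages of N independent Uniform(0,1) draws are approximately Gaussian
# with mean 0.5 and variance (1/12)/N.
import numpy as np

rng = np.random.default_rng(0)
K, N = 100000, 50                 # number of averages, samples per average
means = rng.uniform(0, 1, size=(K, N)).mean(axis=1)

print(means.mean(), means.var())  # ≈ 0.5, ≈ (1/12)/N ≈ 0.00167
```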
Rules for continuous PDFs
Sum rule: ∫ p(x) dx = 1
Product rule: p(x,y) = p(x|y) p(y)
Marginalization: p(x) = ∫ p(x,y) dy
Coin-flipping example: a coin with unknown P(heads) = θ is flipped N = 1000 times, giving H = 750 heads.
“Posterior distribution:” p(θ | C1:N), our new beliefs about θ
Bayesian prediction
What is the probability of another
head?
P(C=h | C1:N) = ∫ P(C=h, θ | C1:N) dθ
= ∫ P(C=h | θ, C1:N) P(θ | C1:N) dθ
= (H+1)/(N+2)
= 751 / 1002 ≈ 74.95%
Note: we never computed θ
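This marginalization can be checked numerically; a minimal sketch (my own, assuming a uniform prior over θ and numpy) that integrates over θ on a grid and recovers (H+1)/(N+2):

```python
# Sketch: Bayesian prediction for the coin by marginalizing out theta numerically.
# Assumes a uniform prior on theta; H heads observed in N flips.
import numpy as np

H, N = 750, 1000
theta = np.linspace(1e-6, 1 - 1e-6, 200001)
dtheta = theta[1] - theta[0]

# Unnormalized posterior p(theta | C_1:N) ∝ theta^H (1 - theta)^(N - H)
log_post = H * np.log(theta) + (N - H) * np.log(1 - theta)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dtheta                      # normalize

# P(C = h | C_1:N) = ∫ P(C = h | theta) p(theta | C_1:N) dtheta
p_head = (theta * post).sum() * dtheta
print(p_head, (H + 1) / (N + 2))                 # both ≈ 0.7495
```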
Parameter estimation
• What if we want an estimate of θ?
• Maximum A Posteriori (MAP):
θ* = arg maxθ p(θ | C1, …, CN)
=H/N
= 750 / 1000 = 75%
A problem
Suppose we flip the coin once
What is P(C2 = h | C1 = h)?
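Working this out with the two estimators above (H = 1, N = 1): the ML/MAP estimate is θ* = H/N = 1, so it predicts P(C2 = h | C1 = h) = 1, complete certainty after a single flip. The Bayesian prediction gives (H+1)/(N+2) = 2/3, a much more reasonable belief.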
Learning a Gaussian
Given data {xi}, estimate µ, σ²
p(x|µ,σ²) = exp(−(x−µ)²/(2σ²)) / √(2πσ²)
p(x1:K|µ, σ²) = ∏i p(xi | µ, σ²)
Want: max p(x1:K|µ, σ²)
= min −ln p(x1:K|µ, σ²)
= ∑i (xi−µ)²/(2σ²) + (K/2) ln 2πσ²
Closed-form solution:
µ = ∑i xi / K
σ² = ∑i (xi − µ)² / K
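A minimal sketch of this closed-form fit (my own illustration, assuming numpy), on synthetic data:

```python
# Sketch: maximum-likelihood fit of a 1D Gaussian to data x_1:K.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=5000)        # synthetic data: true mu = 2, sigma = 3

K = len(x)
mu_ml = x.sum() / K                                  # mu = sum_i x_i / K
sigma2_ml = ((x - mu_ml) ** 2).sum() / K             # sigma^2 = sum_i (x_i - mu)^2 / K
print(mu_ml, np.sqrt(sigma2_ml))                     # ≈ 2.0, ≈ 3.0
```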
Stereology
[Jagnow et al. 2004 (this morning)]
Model:
Marginalize out S:
p(θ | I) = ∫ p(θ, S | I) dS
which can then be maximized over θ
Principle #4b:
When estimating variables,
marginalize out as many
unknowns as possible.
Linear regression
Model:
y = ax + b + ε,   ε ∼ N(0, σ²I)
Or:
p(y|x,a,b,σ²) = N(ax + b, σ²I)
Linear regression
p(y|x, a, b, σ²) = N(ax + b, σ²I)
p(y1:K | x1:K, a, b, σ²) = ∏i p(yi | xi, a, b, σ²)
Maximum likelihood:
a*, b*, σ²* = arg max ∏i p(yi|xi,a,b,σ²)
= arg min −ln ∏i p(yi|xi, a, b, σ²)
Minimize:
∑i (yi − (axi+b))²/(2σ²) + (K/2) ln 2πσ²
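A minimal maximum-likelihood fit of this linear model (my own sketch, assuming numpy): the quadratic objective above is ordinary least squares, and σ² is the mean squared residual.

```python
# Sketch: ML estimation for y = a x + b + eps, eps ~ N(0, sigma^2).
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=200)
y = 1.5 * x + 0.3 + rng.normal(0, 0.2, size=200)     # synthetic data

# Least squares on the design matrix [x, 1] minimizes sum_i (y_i - (a x_i + b))^2
X = np.column_stack([x, np.ones_like(x)])
(a_ml, b_ml), *_ = np.linalg.lstsq(X, y, rcond=None)

sigma2_ml = np.mean((y - (a_ml * x + b_ml)) ** 2)    # ML noise variance
print(a_ml, b_ml, sigma2_ml)                         # ≈ 1.5, 0.3, 0.04
```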
Nonlinear regression
Model:
y = f(x;w) + ε,   ε ∼ N(0, σ²I)
(w: curve parameters)
Or:
p(y|x,w,σ²) = N(f(x;w), σ²I)
Typical curve models
Line
f(x;w) = w0 x + w1
B-spline, Radial Basis Functions
f(x;w) = ∑i wi Bi(x)
Artificial neural network
f(x;w) = ∑i wi tanh(∑j wj x + w0)+w1
Nonlinear regression
p(y|x, w, σ²) = N(f(x;w), σ²I)
p(y1:K | x1:K, w, σ²) = ∏i p(yi | xi, w, σ²)
Maximum likelihood:
w*, σ²* = arg max ∏i p(yi|xi,w,σ²)
= arg min −ln ∏i p(yi|xi, w, σ²)
Minimize:
∑i (yi − f(xi;w))²/(2σ²) + (K/2) ln 2πσ²
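As one concrete instance (my own sketch, assuming numpy, using the B-spline/RBF form f(x;w) = ∑i wi Bi(x) from the previous slide): the objective is still quadratic in w, so the ML weights again come from least squares.

```python
# Sketch: ML fit of y = f(x; w) + eps with f(x; w) = sum_i w_i B_i(x),
# using Gaussian radial basis functions B_i (centers and width are modeling choices).
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=300)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=300)    # synthetic data

centers = np.linspace(0, 1, 10)
width = 0.1
B = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))   # K x 10 basis matrix

w_ml, *_ = np.linalg.lstsq(B, y, rcond=None)         # minimizes sum_i (y_i - f(x_i; w))^2
sigma2_ml = np.mean((y - B @ w_ml) ** 2)
print(sigma2_ml)                                     # ≈ 0.01
```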
Smoothness priors
Bending energy:
p(w|λ) ∝ exp( −∫ ‖∇f‖² dx / (2λ²) )
Weight decay:
p(w|λ) ∝ exp( −‖w‖² / (2λ²) )
MAP estimation:
arg max p(w|y) = arg max p(y|w) p(w)/p(y)
= arg min −ln p(y|w) p(w)
i.e., minimize ∑i (yi − f(xi; w))²/(2σ²) + ‖w‖²/(2λ²) + K ln σ
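With the weight-decay prior, the MAP objective above is ridge regression in w; a minimal sketch (my own, reusing the basis-matrix setup from the previous sketch):

```python
# Sketch: MAP weights under the weight-decay prior p(w) ∝ exp(-||w||^2 / (2 lambda^2)).
# Setting the gradient of the MAP objective to zero gives
#   (B^T B / sigma^2 + I / lambda^2) w = B^T y / sigma^2.
import numpy as np

def map_weights(B, y, sigma2, lam2):
    """MAP estimate of w for y ≈ B w with Gaussian noise and a weight-decay prior."""
    D = B.shape[1]
    A = B.T @ B / sigma2 + np.eye(D) / lam2
    return np.linalg.solve(A, B.T @ y / sigma2)

# Example usage (with B, y from the previous sketch):
#   w_map = map_weights(B, y, sigma2=0.01, lam2=1.0)
```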
Bayesian regression
y = ∑i wi yi (s.t., ∑i wi = 1)
= ∑i xi ai + µ
=Ax+µ
Conventional PCA
(Bayesian formulation)
x, A, µ ∼ Uniform,  AᵀA = I
ε ∼ N(0, σ²I)
y=Ax+µ+ε
Given training y1:K, what are A, x, µ, σ2?
[Figure: the ML point is constrained to lie on the linear subspace]
Problems:
• Estimated point far from data if data is noisy
• High-dimensional y is a uniform distribution
• Low-dimensional x is overconstrained
Why? Because x ∼ Uniform
Probabilistic PCA
x ∼ N(0, I)
y = Ax + b + ε
PPCA vs. Gaussians
However…
PPCA: p(y) = ∫ p(x,y) dx
= N(b, AAᵀ + σ²I)
This is a special case of a Gaussian!
PCA is a degenerate case (σ² = 0)
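A sketch of this relationship (my own illustration, assuming numpy), using the standard closed-form ML fit of PPCA from the eigendecomposition of the sample covariance: the marginal covariance AAᵀ + σ²I approximates the full sample covariance, and shrinks to the rank-d PCA covariance as σ² → 0.

```python
# Sketch: closed-form ML probabilistic PCA and its Gaussian marginal p(y) = N(b, A A^T + sigma^2 I).
import numpy as np

def ppca_ml(Y, d):
    """Y: K x D data matrix, d: latent dimension. Returns A (D x d), b (D,), sigma2."""
    b = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)                      # D x D sample covariance
    evals, evecs = np.linalg.eigh(S)                 # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]       # sort descending
    sigma2 = evals[d:].mean()                        # noise = mean of discarded eigenvalues
    A = evecs[:, :d] * np.sqrt(np.maximum(evals[:d] - sigma2, 0.0))
    return A, b, sigma2

rng = np.random.default_rng(4)
Y = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 5))    # synthetic correlated data
A, b, sigma2 = ppca_ml(Y, d=2)
print(np.round(A @ A.T + sigma2 * np.eye(5), 2))     # compare with np.round(np.cov(Y, rowvar=False), 2)
```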
Face estimation in an image
p(y) = N(µ, Σ)
p(Image | y) = N(Is(y), σ²I)
[Blanz and Vetter 1999]
Lucas-Kanade tracking
Tracking result
3D reconstruction
Reference frame
Results
Robust algorithm
3D reconstruction
[Almodovar 2002]
Inverse kinematics
DOFs (y)
Constraints
Non-linear dimension reduction
y = f(x;w) + ε
Like non-linear regression w/o x
[Figures: f(x;w) maps latent points x to data y; example: walk cycle]
Course evaluation
http://www.siggraph.org/courses_evaluation
Thank you!