Lecture 12
Volodymyr Kuleshov
Cornell Tech
Pros:
Extreme flexibility: we can use pretty much any function fθ(x) we want.
Cons: computing Z(θ) is generally intractable. As a result:
Sampling from pθ(x) is hard.
Evaluating and optimizing the likelihood pθ(x) is hard (so learning is hard).
No feature learning (but we can add latent variables).
What if we could train models without using Z(θ)?
1 Score Functions
    Motivation
    Definitions
    Score estimation
2 Score Matching
    Fisher Divergences
    Score Matching
    Sliced Score Estimation
3 Sample Generation
    Langevin Dynamics
    Manifold Hypothesis
    Gaussian Smoothing
4 Connections to Diffusion Models
Figures and contents adapted from Stefano Ermon and Lily Weng
The (Stein) Score Function: Definition
Given a probability density p(x), its (Stein) score function is defined as
$$\nabla_x \log p(x)$$
On the left is a mixture of Gaussians; on the right is its score function in 1D. Note that, between the two modes, the score is negative (points toward the left) near the left mode and positive (points toward the right) near the right mode: it points toward the nearest mode.
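As a concrete check (a minimal sketch with a hypothetical toy mixture, not from the slides), we can approximate the score of a 1D two-component Gaussian mixture numerically and verify the sign behavior described above:

```python
import numpy as np
from scipy.stats import norm

# Toy example: an equal-weight mixture of N(-2, 1) and N(2, 1).
def mixture_pdf(x):
    return 0.5 * norm.pdf(x, loc=-2.0) + 0.5 * norm.pdf(x, loc=2.0)

def mixture_score(x, eps=1e-4):
    """Score d/dx log p(x), approximated by a centered finite difference."""
    return (np.log(mixture_pdf(x + eps)) - np.log(mixture_pdf(x - eps))) / (2 * eps)

# Between the modes, the score points toward the nearest mode:
print(mixture_score(-1.5))  # < 0: points left, back toward the mode at -2
print(mixture_score(+1.5))  # > 0: points right, toward the mode at +2
```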
Score Matching via Fisher Divergences
How do we fit score functions?
Data and model scores are vector fields; we want them to be aligned. The Fisher divergence measures how far apart they are:

$$D_F(p_{\text{data}} \,\|\, p_\theta) = \frac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\Big[\big\|\nabla_x \log p_{\text{data}}(x) - s_\theta(x)\big\|_2^2\Big]$$

It is the average squared Euclidean distance between the two score vectors over all x. The Fisher divergence is zero if and only if ∇x log pdata(x) = sθ(x) (pdata-almost everywhere).
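To make the definition concrete, here is a tiny Monte Carlo check (a sketch with hypothetical toy densities): when pdata = N(0, 1), the data score is −x in closed form, so the Fisher divergence to a mismatched Gaussian model score can be estimated directly. Of course, this only works because we know ∇x log pdata analytically; the next step addresses the realistic case.

```python
import numpy as np

def data_score(x):
    return -x                  # score of p_data = N(0, 1)

def model_score(x, sigma=1.5):
    return -x / sigma**2       # score of a (deliberately wrong) model N(0, sigma^2)

x = np.random.randn(100_000)   # Monte Carlo samples from p_data
fisher = 0.5 * np.mean((data_score(x) - model_score(x)) ** 2)
print(fisher)  # ~ 0.5 * (1 - 1/sigma^2)^2, about 0.154 for sigma = 1.5
```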
A Practical Objective for Score Matching
The Fisher divergence doesn't directly give us an objective we can optimize:

$$\frac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\Big[\big\|\nabla_x \log p_{\text{data}}(x) - s_\theta(x)\big\|_2^2\Big]$$

We don't know log pdata(x), let alone its gradient. However, via an integration-by-parts trick, we can obtain an equivalent objective (Hyvärinen, 2005):

$$\mathbb{E}_{x \sim p_{\text{data}}}\Big[\frac{1}{2}\,\|s_\theta(x)\|_2^2 + \mathrm{tr}\big(\nabla_x s_\theta(x)\big)\Big]$$
Proof sketch: expanding the square, the first term, ½‖∇x log pdata(x)‖², does not depend on θ and can be dropped; we apply integration by parts to the cross term (shown here in 1D):
$$\int_{-\infty}^{\infty} p(x)\,\nabla_x \log p(x)\, s_\theta(x)\, dx = \int_{-\infty}^{\infty} p(x)\,\frac{\nabla_x p(x)}{p(x)}\, s_\theta(x)\, dx = \int_{-\infty}^{\infty} \nabla_x p(x)\, s_\theta(x)\, dx$$
$$= \underbrace{p(x)\, s_\theta(x)\,\Big|_{-\infty}^{\infty}}_{=0} - \int_{-\infty}^{\infty} p(x)\, \nabla_x s_\theta(x)\, dx$$
The boundary term is zero because we assume p(x)sθ(x) → 0 as x → ±∞ (for instance, when x has bounded support). Plugging the above into the first expression yields the desired result.
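A minimal PyTorch sketch of this objective (our own illustration, not the lecture's code; `score_net` is any network mapping a batch of d-dimensional inputs to d-dimensional score estimates):

```python
import torch

def score_matching_loss(score_net, x):
    """Hyvarinen score matching loss:
    E[ 0.5 * ||s_theta(x)||^2 + tr(grad_x s_theta(x)) ].
    score_net: maps (batch, d) -> (batch, d), estimating grad_x log p_data(x).
    """
    x = x.detach().requires_grad_(True)
    s = score_net(x)                          # (batch, d)
    norm_term = 0.5 * (s ** 2).sum(dim=1)     # 0.5 * ||s(x)||^2

    # Trace of the Jacobian: one backward pass per input dimension.
    trace = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0]
        trace = trace + grad_i[:, i]          # accumulate ds_i / dx_i

    return (norm_term + trace).mean()

# Usage sketch: fit a small MLP score model on 2D data batches.
net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 2))
loss = score_matching_loss(net, torch.randn(128, 2))
loss.backward()
```

Note that the trace term costs one backward pass per input dimension; this O(d) overhead is exactly what sliced score matching avoids by projecting the scores onto random directions.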
Sample Generation: Intuition
Next, we want to generate samples using a trained score function. We can do this by following the gradient of the log data density. Intuitively, we start from noise z and follow the score field until we arrive at a high-probability sample x.
Langevin Dynamics
Langevin dynamics generates samples by repeatedly taking a noisy step along the score:

$$x_{t+1} = x_t + \frac{\alpha_t}{2}\, s_\theta(x_t) + \sqrt{\alpha_t}\, z_t, \qquad z_t \sim \mathcal{N}(0, I)$$

This process is guaranteed to produce samples from pθ(x) when the step size αt → 0 and the number of steps T → ∞.
Thus, our high-level strategy is to learn sθ(x) ≈ ∇x log pdata(x) from a set of samples via score matching, and then generate new x via Langevin dynamics.
This runs into a problem: under the manifold hypothesis, data concentrates near a low-dimensional manifold, so pdata(x) = 0 over most of the ambient space, and ∇x log pdata(x) is not defined there. The fix is to smooth the data with Gaussian noise at several scales, which makes the density (and hence the score) well-defined everywhere. We can then run Langevin dynamics with decreasing noise levels, starting each level where we left off at the previous one; a minimal sketch follows below.
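A minimal sketch of this annealed procedure (our own illustration; the noise-conditional score model `score_net(x, sigma)` and the schedule `sigmas` are assumed placeholders, and the per-level step-size scaling follows Song & Ermon, 2019):

```python
import torch

@torch.no_grad()
def annealed_langevin(score_net, sigmas, shape, steps_per_level=100, eps=2e-5):
    """Annealed Langevin dynamics: run Langevin sampling at each noise level
    sigma_1 > ... > sigma_L, warm-starting each level at the previous result.
    score_net(x, sigma) estimates the score of the sigma-smoothed data density.
    """
    x = torch.randn(shape)                       # start from pure noise
    for sigma in sigmas:
        alpha = eps * (sigma / sigmas[-1]) ** 2  # per-level step size
        for _ in range(steps_per_level):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_net(x, sigma) + alpha ** 0.5 * z
    return x

# Usage sketch, with a hypothetical trained model:
# samples = annealed_langevin(score_net, sigmas=[1.0, 0.5, 0.1, 0.01],
#                             shape=(16, 2))
```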
Diffusion Models: Intuition
The intuition behind diffusion models is to define a process that gradually turns data into noise, and to learn the inverse of this process.
$$\begin{aligned}
-\log p_\theta(x_0) &\le -\log p_\theta(x_0) + D_{\mathrm{KL}}\big(q(x_{1:T}\mid x_0)\,\big\|\,p_\theta(x_{1:T}\mid x_0)\big) \\
&= -\log p_\theta(x_0) + \mathbb{E}_{x_{1:T}\sim q(x_{1:T}\mid x_0)}\left[\log \frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})/p_\theta(x_0)}\right] \\
&= -\log p_\theta(x_0) + \mathbb{E}_q\left[\log \frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})} + \log p_\theta(x_0)\right] \\
&= \mathbb{E}_q\left[\log \frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}\right]
\end{aligned}$$

Let
$$L_{\mathrm{VLB}} = \mathbb{E}_{q(x_{0:T})}\left[\log \frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}\right] \ge -\,\mathbb{E}_{q(x_0)}\big[\log p_\theta(x_0)\big]$$

This bound decomposes as
$$L_{\mathrm{VLB}} = L_T + L_{T-1} + \cdots + L_0$$
where
$$L_T = D_{\mathrm{KL}}\big(q(x_T\mid x_0)\,\big\|\,p_\theta(x_T)\big)$$
$$L_t = D_{\mathrm{KL}}\big(q(x_t\mid x_{t+1}, x_0)\,\big\|\,p_\theta(x_t\mid x_{t+1})\big) \quad \text{for } 1 \le t \le T-1$$
$$L_0 = -\log p_\theta(x_0\mid x_1)$$
It's not hard to show that q(xt|x0) has an analytic form:

$$q(x_t\mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t)\, I\big)$$

where $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{i=1}^{t} \alpha_i$. From this we can also derive the posteriors $q(x_t\mid x_{t+1}, x_0)$ in closed form, so every KL term in LVLB is a KL divergence between Gaussians.
This gives us all the ingredients to learn these models by gradient descent.
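For illustration, here is a minimal PyTorch sketch of this pipeline (our own sketch, not the lecture's code; the linear β schedule and the noise-prediction loss follow the simplified DDPM objective of Ho et al., 2020, and `eps_model` is a hypothetical network taking (xt, t)):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # a common linear beta schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{i<=t} alpha_i

def q_sample(x0, t, noise):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over dims
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

def diffusion_loss(eps_model, x0):
    """Simplified DDPM objective: predict the injected noise from (x_t, t)."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    xt = q_sample(x0, t, noise)
    return ((eps_model(xt, t) - noise) ** 2).mean()
```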
Pros:
1 Make very few model assumptions
2 Can train energy-based models without manipulating the normalizing constant
3 Can produce very high-quality samples (matching GANs)
Cons:
1 Sampling is still quite slow (need to run an MCMC chain)
2 Do not always provide likelihoods (though we get them for certain models)