
Statistical Learning for Text Data Analytics

Latent Dirichlet Allocation

Yangqiu Song

Hong Kong University of Science and Technology


[email protected]

Spring 2018

∗Contents are based on materials created by David MacKay (Introduction to Monte Carlo Methods, 1998)
Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 1 / 40


Course Topics

Representation: language models, word embeddings, topic models


Learning: supervised learning, semi-supervised learning, sequence models,
deep learning, optimization techniques
Inference: constraint modeling, joint inference, search algorithms

NLP applications: tasks introduced in Lecture 1


Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 2 / 40
Overview

1 Language Models: Recap

2 Topic Models

3 Probabilistic Latent Semantic Analysis (PLSA)

4 Latent Dirichlet Allocation (LDA)


Motivation: Bayesian Modeling
Background of Monte Carlo Methods
Importance Sampling
Rejection Sampling
Metropolis Methods
Gibbs Sampling
Sampling for EM Algorithm
Collapsed Gibbs Sampling for LDA

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 3 / 40


Alternative Way for PLSA to Generate Texts

P(D, W) = ∏_{m=1}^M ∏_{i=1}^{N_m} ∑_{k=1}^K P(z_{m,i} = k) P(d_m|θ_k) P(w_i|φ_k)
        = ∏_{m=1}^M ∏_{i=1}^V [ ∑_{k=1}^K P(z_{m,i} = k) P(d_m|θ_k) P(w_i|φ_k) ]^{c_{d_m}(w_i)}

P(D, W) = ∏_{m=1}^M ∏_{i=1}^V P(d_m) [ ∑_{k=1}^K P(z_{m,i} = k|θ_m) P(w_i|φ_k) ]^{c_{d_m}(w_i)}

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 4 / 40


Bayesian Modeling: Topic Models

Figure: PLSA
Figure: LDA

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 5 / 40


Generative Process of Latent Dirichlet Allocation

For all clusters/components k ∈ [1, K ]:


  Choose mixture components φ_k ∼ Dir(φ|β)
For all documents m ∈ [1, M]:
  Choose N_m ∼ Poisson(ξ)
  Choose mixture probability θ_m ∼ Dir(θ|α)
  For all words n ∈ [1, N_m] in document d_m:
    Choose a component index z_{m,n} ∼ Mult(z|θ_m)
    Choose a word w_{m,n} ∼ Mult(w|φ_{z_{m,n}})

Figure: LDA
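
To make the generative process concrete, here is a minimal Python sketch that simulates it step by step; the corpus size, vocabulary size, and hyperparameter values are arbitrary illustrative choices, not values used in the course.

import numpy as np

rng = np.random.default_rng(0)
K, M, V = 3, 5, 10                  # topics, documents, vocabulary size (illustrative)
alpha, beta, xi = 0.1, 0.01, 8.0    # illustrative hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=K)      # phi_k ~ Dir(beta), one row per topic
docs = []
for m in range(M):
    N_m = rng.poisson(xi)                          # document length N_m ~ Poisson(xi)
    theta_m = rng.dirichlet(np.full(K, alpha))     # theta_m ~ Dir(alpha)
    words = []
    for n in range(N_m):
        z = rng.choice(K, p=theta_m)               # z_{m,n} ~ Mult(theta_m)
        w = rng.choice(V, p=phi[z])                # w_{m,n} ~ Mult(phi_{z_{m,n}})
        words.append(w)
    docs.append(words)
print(docs)                                        # word indices for each generated document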

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 6 / 40


Overview

1 Language Models: Recap

2 Topic Models

3 Probabilistic Latent Semantic Analysis (PLSA)

4 Latent Dirichlet Allocation (LDA)


Motivation: Bayesian Modeling
Background of Monte Carlo Methods
Importance Sampling
Rejection Sampling
Metropolis Methods
Gibbs Sampling
Sampling for EM Algorithm
Collapsed Gibbs Sampling for LDA

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 7 / 40


Bayesian Inference
Suppose we have a Bayesian learning problem
P(Θ|X) ∝ P(X|Θ)P(Θ), where X = {x_1, . . . , x_N}
If we want to predict for a new data point x:
Maximum a posteriori (MAP) makes a point estimate
Θ* = argmax_Θ P(Θ|X), and makes the prediction P(x|Θ*)
Full Bayesian inference uses P(x|X) = ∫_Θ P(x|Θ)P(Θ|X) dΘ

In general, we need to estimate many quantities of the form

E_{P(x)}[φ(x)] = ∫_x φ(x)P(x) dx

One way to solve this (especially when P(x) is difficult to compute) is sampling:

Ê_{P(x)}[φ(x)] = (1/R) ∑_{r=1}^R φ(x^{(r)}),   x^{(r)} ∼ P(x)

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 8 / 40


Sampling
Φ = E_{P(x)}[φ(x)] = ∫_x φ(x)P(x) dx

We call P(x) the target density
We assume x is a vector in R^d with real or discrete components x_i
We concentrate on the sampling problem, because if we have solved it, then we can solve the expectation problem by

Φ̂ = Ê_{P(x)}[φ(x)] = (1/R) ∑_{r=1}^R φ(x^{(r)}),   x^{(r)} ∼ P(x)

The expectation of Φ̂ is Φ
The variance of Φ̂ decreases as σ²/R, where σ² is the variance of φ(x) under P(x):

σ² = ∫_x [φ(x) − Φ]² P(x) dx

which means the accuracy of the sampling estimate is independent of the dimensionality of the space sampled
As few as a dozen independent samples x^{(r)} may suffice to estimate Φ satisfactorily
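
As a quick illustration of these claims, here is a toy Python sketch, assuming a case where we can already sample from P(x) directly (a standard Gaussian) and φ(x) = x², so the true value Φ = 1 is known:

import numpy as np

rng = np.random.default_rng(0)
phi = lambda x: x ** 2                 # test function with known expectation under N(0, 1)
for R in (10, 100, 10000):
    x = rng.standard_normal(R)         # R independent samples x^(r) ~ P(x) = N(0, 1)
    Phi_hat = phi(x).mean()            # (1/R) * sum_r phi(x^(r))
    print(R, Phi_hat)                  # approaches Phi = E[x^2] = 1 at rate sigma^2 / R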
Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 9 / 40
However, why is sampling from P(x) hard?

We assume that the density from which we wish to draw samples, P(x), can be evaluated, at least to within a multiplicative constant:

P(x) = P*(x)/Z

If we can evaluate P*(x), why can we not easily obtain Φ?

We do not know the normalizing constant

Z = ∫_x P*(x) dx

Even if we know Z, drawing samples from P(x) is still challenging, especially in high-dimensional spaces

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 10 / 40


One Dimensional Sampling Example
Consider P*(x) = exp{0.4(x − 0.4)² − 0.08x⁴}, x ∈ (−∞, ∞)
To give a simpler problem, we can discretize the variable x and ask for samples from the discrete probability distribution

If we evaluate p_i* = P*(x_i) at each grid point x_i, we can compute Z = ∑_i p_i* and p_i = p_i*/Z
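
A small Python sketch of this discretization; the grid range [−5, 5] and the number of grid points are arbitrary assumptions:

import numpy as np

x = np.linspace(-5.0, 5.0, 50)                         # 50 uniformly spaced points x_i
p_star = np.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)  # unnormalized p_i* = P*(x_i)
Z = p_star.sum()                                       # Z = sum_i p_i*
p = p_star / Z                                         # normalized probabilities p_i
print(Z, p.sum())                                      # p now sums to 1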
Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 11 / 40
Recall: Generative View of Text Documents

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 12 / 40


Generating Text from Language Models

Example

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 13 / 40


Generating Text from Language Models

Example

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 14 / 40


Recall: Computer Simulation

Sample from a discrete distribution P(X), assuming n outcomes in the event space X

Algorithm 1 Sample from a distribution P(X )


1: for t = 1 to T do
2: Divide the interval [0, 1] into n intervals according to the probabilities
of the outcomes
3: Generate a random number r between 0 and 1
4: Return x_i where r falls into the interval [∑_{j=1}^{i−1} p_j, ∑_{j=1}^{i} p_j]
5: end for
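
A minimal Python sketch of Algorithm 1, assuming the outcome probabilities are given as an array; np.searchsorted plays the role of locating the interval that contains r:

import numpy as np

def sample_discrete(p, T, seed=0):
    """Draw T samples from a discrete distribution with probabilities p."""
    rng = np.random.default_rng(seed)
    cum = np.cumsum(p)                 # interval boundaries [p_1, p_1 + p_2, ..., 1]
    r = rng.random(T)                  # T random numbers between 0 and 1
    return np.searchsorted(cum, r)     # index i of the interval that contains r

print(sample_discrete(np.array([0.2, 0.5, 0.3]), T=10))

This is the usual inverse-CDF construction for a discrete distribution.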

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 15 / 40


The Cost of Computing Z

To compute Z , we have to visit every point in the space

There are 50 uniformly spaced points in one dimension


If our system had N dimensions, e.g., K = 2 clusters (latent variables) and N = 1000 data examples,
then the corresponding number of points would be 2^1000
Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 16 / 40
Uniform Sampling

Having agreed that we cannot visit every location in the space, we might consider trying to solve sampling by uniform sampling:
Sample x^{(r)} uniformly and evaluate P*(x^{(r)}) to give

Z_R = ∑_{r=1}^R P*(x^{(r)})

and estimate Φ = E_{P(x)}[φ(x)] = ∫_x φ(x)P(x) dx by

Φ̂ = ∑_{r=1}^R φ(x^{(r)}) P*(x^{(r)}) / Z_R

Is there anything wrong with this strategy?

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 17 / 40


Is there anything wrong with this strategy?

Let’s assume φ(x) is a benign, smoothly varying function, and concentrate on the nature of P*(x)
A high-dimensional distribution is often concentrated in a small region of the state space known as its typical set T,
whose volume is given by |T| ≃ 2^{H(X)}
H(X) is the entropy of the probability distribution:

H(X) = ∑_x P(x) log₂ (1/P(x))

Φ = ∫_x φ(x)P(x) dx will be principally determined by values in the typical set
If we have N random variables with binary values, the total size of the state space is 2^N and the typical set has size 2^H
Each sample has a chance of 2^H/2^N of falling into the typical set
We need R_min ≃ O(2^{N−H}) samples

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 18 / 40


Overview

1 Language Models: Recap

2 Topic Models

3 Probabilistic Latent Semantic Analysis (PLSA)

4 Latent Dirichlet Allocation (LDA)


Motivation: Bayesian Modeling
Background of Monte Carlo Methods
Importance Sampling
Rejection Sampling
Metropolis Methods
Gibbs Sampling
Sampling for EM Algorithm
Collapsed Gibbs Sampling for LDA

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 19 / 40


Importance Sampling
Importance sampling is not a method for generating samples from P(x)
It is just a method for estimating the expectation of a function φ(x)
Let's imagine the target distribution is a one-dimensional density

P(x) = P*(x)/Z

but P(x) is too complicated to sample from directly

We assume Q(x) = Q*(x)/Z_Q is a simpler density from which we can generate samples
In importance sampling, we generate R samples {x^{(r)}}_{r=1}^R from Q(x)
Then Φ can be estimated by

Φ̂ = ∑_{r=1}^R w_r φ(x^{(r)}) / ∑_{r'} w_{r'}

where w_r = P*(x^{(r)}) / Q*(x^{(r)})
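
A minimal Python sketch of this estimator, assuming the target is the unnormalized one-dimensional P*(x) from the earlier example and the proposal Q is a broad Gaussian we can sample from (both choices are illustrative):

import numpy as np

rng = np.random.default_rng(0)
p_star = lambda x: np.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)  # unnormalized target P*(x)
sigma_q = 3.0
q_star = lambda x: np.exp(-0.5 * (x / sigma_q) ** 2)             # unnormalized proposal Q*(x)
phi = lambda x: x                                                # estimate the mean under P

R = 100000
x = rng.normal(0.0, sigma_q, size=R)      # R samples x^(r) ~ Q(x) = N(0, sigma_q^2)
w = p_star(x) / q_star(x)                 # importance weights w_r = P*(x^(r)) / Q*(x^(r))
Phi_hat = np.sum(w * phi(x)) / np.sum(w)  # self-normalized importance sampling estimate
print(Phi_hat)

The unknown normalizing constants of P and Q cancel in the ratio of sums, which is why only P* and Q* are needed.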
Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 20 / 40
Importance Sampling: A Toy Example
Φ can be estimated by
Φ̂ = ∑_{r=1}^R w_r φ(x^{(r)}) / ∑_{r'} w_{r'}

where w_r = P*(x^{(r)}) / Q*(x^{(r)})

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 21 / 40


Importance Sampling: A Toy Example

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 22 / 40


Importance Sampling in Many Dimensions

Importance sampling suffers from two difficulties


We clearly need to obtain samples that lie in the typical set
Even if we obtain samples in the typical set, the weights associated with those samples are likely to vary by large factors

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 23 / 40


Overview

1 Language Models: Recap

2 Topic Models

3 Probabilistic Latent Semantic Analysis (PLSA)

4 Latent Dirichlet Allocation (LDA)


Motivation: Bayesian Modeling
Background of Monte Carlo Methods
Importance Sampling
Rejection Sampling
Metropolis Methods
Gibbs Sampling
Sampling for EM Algorithm
Collapsed Gibbs Sampling for LDA

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 24 / 40


Rejection Sampling
We again assume a complicated one-dimensional density P(x) = P*(x)/Z
We assume a simpler proposal density Q(x) which we can evaluate and from which we can generate samples
We further assume that for all x, cQ*(x) > P*(x)

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 25 / 40


Rejection Sampling

We first generate x from Q(x)

We evaluate cQ*(x) and generate a uniformly distributed variable u from the interval [0, cQ*(x)]
We then evaluate P*(x) and accept or reject the sample x by comparing u with P*(x)
If u > P*(x), then x is rejected; otherwise it is accepted
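
A minimal Python sketch of this procedure for the same one-dimensional P*(x), assuming a Gaussian proposal and a constant c found by a crude numerical search over a grid (both are illustrative choices, and the grid bound is an assumption):

import numpy as np

rng = np.random.default_rng(0)
p_star = lambda x: np.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)  # unnormalized target
sigma_q = 3.0
q_star = lambda x: np.exp(-0.5 * (x / sigma_q) ** 2)             # unnormalized Gaussian proposal

grid = np.linspace(-10, 10, 2001)
c = 1.1 * np.max(p_star(grid) / q_star(grid))    # crude bound so that c * Q*(x) > P*(x)

samples = []
while len(samples) < 1000:
    x = rng.normal(0.0, sigma_q)                 # draw x from Q
    u = rng.uniform(0.0, c * q_star(x))          # u uniform on [0, c Q*(x)]
    if u <= p_star(x):                           # accept if u lies under P*(x), reject otherwise
        samples.append(x)
print(np.mean(samples))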
Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 26 / 40
Rejection Sampling

Rejection sampling will work best when Q is a good approximation of P


c grows exponentially with the dimensionality N (MacKay (1998))
While it is a useful method for one-dimensional problems, it is not a
practical technique for high-dimensional distributions P(x)

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 27 / 40


Overview

1 Language Models: Recap

2 Topic Models

3 Probabilistic Latent Semantic Analysis (PLSA)

4 Latent Dirichlet Allocation (LDA)


Motivation: Bayesian Modeling
Background of Monte Carlo Methods
Importance Sampling
Rejection Sampling
Metropolis Methods
Gibbs Sampling
Sampling for EM Algorithm
Collapsed Gibbs Sampling for LDA

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 28 / 40


Motivation

Importance sampling and rejection sampling only work well if the proposal density Q(x) is similar to P(x)
In large and complex problems, it is difficult to create a single density Q(x) that has this property
The Metropolis algorithm makes use of a proposal density Q(x′|x^{(t)}) which depends on the current state x^{(t)}
It is not necessary for Q(x′|x^{(t)}) to look at all similar to P(x)

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 29 / 40


Metropolis Method
We again assume that we can evaluate P ∗ (x) for any x
A tentative new state x′ is generated from the proposal density Q(x′|x^{(t)})
To decide whether to accept the new state, we compute the quantity:

a = P*(x′) Q(x^{(t)}|x′) / [ P*(x^{(t)}) Q(x′|x^{(t)}) ]

If a ≥ 1, then accept the new state x′
Otherwise, the new state is accepted with probability a
If the state is accepted, we set x^{(t+1)} = x′
If the state is rejected, we set x^{(t+1)} = x^{(t)}
This is different from rejection sampling: in the Metropolis method, a rejection causes the current state to be written onto the list of samples another time
The samples in a Metropolis simulation of T iterations are correlated
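
A minimal random-walk Metropolis sketch for the same one-dimensional P*(x); the Gaussian step size is an arbitrary choice, and because this proposal is symmetric the Q terms cancel in a:

import numpy as np

rng = np.random.default_rng(0)
p_star = lambda x: np.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)  # unnormalized target

T, step = 20000, 1.0
x = 0.0
chain = np.empty(T)
for t in range(T):
    x_new = x + step * rng.standard_normal()   # symmetric proposal Q(x'|x) = N(x, step^2)
    a = p_star(x_new) / p_star(x)              # acceptance ratio (Q terms cancel)
    if rng.random() < a:                       # accept with probability min(1, a)
        x = x_new                              # accepted: move to x'
    chain[t] = x                               # on rejection the current state is recorded again
print(chain[1000:].mean())                     # discard burn-in; successive samples are correlated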
Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 30 / 40
Variants

a = P*(x′) Q(x^{(t)}|x′) / [ P*(x^{(t)}) Q(x′|x^{(t)}) ]

Steps that are accepted are shown as green lines, and rejected steps are shown in red.

When Q(x′|x^{(t)}) = Q(x^{(t)}|x′), it is called the Metropolis algorithm (Metropolis et al., 1953)
When Q(x′|x^{(t)}) ≠ Q(x^{(t)}|x′), it is known as the Metropolis-Hastings algorithm (Hastings, 1970)
Metropolis methods are known as Markov chain Monte Carlo methods

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 31 / 40


Markov Chains

A first-order Markov chain is defined to be a series of random variables x^{(1)}, . . . , x^{(M)} such that the following conditional independence property holds:

P(x^{(t+1)} | x^{(1)}, . . . , x^{(t)}) = P(x^{(t+1)} | x^{(t)})

where we define the transition probability T(x^{(t)}, x^{(t+1)}) = P(x^{(t+1)} | x^{(t)})
A Markov chain is called homogeneous if the transition probabilities are the same for all t
The marginal probability for a particular variable can be expressed in
terms of the marginal probability for the previous variable in the chain
in the form

P(x^{(t+1)}) = ∑_{x^{(t)}} P(x^{(t+1)} | x^{(t)}) P(x^{(t)})

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 32 / 40


Markov Chains (Cont’d)
A distribution is said to be invariant, or stationary, with respect to a
Markov chain if each step in the chain leaves that distribution
invariant
For a homogeneous Markov chain, P(x) is invariant if

P(x) = ∑_{x′} T(x′, x) P(x′)

A sufficient (but not necessary) condition for ensuring that the required distribution P(x) is invariant is to choose the transition probabilities to satisfy the property of detailed balance, defined by

P(x) T(x, x′) = P(x′) T(x′, x)

It is easy to verify that

∑_{x′} T(x′, x) P(x′) = ∑_{x′} P(x) T(x, x′) = P(x) ∑_{x′} T(x, x′) = P(x)
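
A tiny numerical check of these definitions, using an arbitrary two-state transition matrix (the matrix and its invariant distribution are illustrative assumptions):

import numpy as np

T = np.array([[0.9, 0.1],        # T(x, x'): row = current state, column = next state
              [0.3, 0.7]])
P = np.array([0.75, 0.25])       # candidate invariant distribution for this T
print(P @ T)                     # equals P, so P(x) = sum_x' T(x', x) P(x') holds
# Detailed balance: P(x) T(x, x') == P(x') T(x', x) for every pair of states
print(P[0] * T[0, 1], P[1] * T[1, 0])   # both equal 0.075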

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 33 / 40


Metropolis methods are known as Markov chain Monte Carlo methods

From the probability of accepting a new state:

a(x^{(t)}, x′) = min( 1, P*(x′) Q(x^{(t)}|x′) / [ P*(x^{(t)}) Q(x′|x^{(t)}) ] )

We have the joint probability of two consecutive states as

P*(x^{(t)}) · Q(x′|x^{(t)}) a(x^{(t)}, x′)
= min( P*(x^{(t)}) Q(x′|x^{(t)}), P*(x′) Q(x^{(t)}|x′) )
= min( P*(x′) Q(x^{(t)}|x′), P*(x^{(t)}) Q(x′|x^{(t)}) )
= P*(x′) · Q(x^{(t)}|x′) a(x′, x^{(t)})

as required by detailed balance
The Metropolis method therefore samples from the required distribution P(x)
Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 34 / 40
Sampling Effects

The specific choice of proposal distribution can have a marked effect


on the performance of the algorithm

The scale ρ of the proposal distribution should be on the order of the


smallest standard deviation σmin
The number of iterations T should be at least of order (σ_max/σ_min)² to obtain approximately independent samples

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 35 / 40


Simulation of Sampling

P(x) = 1/21   if x ∈ {0, 1, . . . , 20}
       0      otherwise

Q(x′|x) = 1/2   if x′ = x ± 1
          0     otherwise

Rejection will occur only when the proposal takes the state x′ = −1 or x′ = 21
It takes ≈ T² = 100 (178) iterations to reach 0 or 20
It takes ≈ 400 (540) iterations to reach both 0 and 20
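
A small Python simulation of this example; the starting state 10 (the middle of the range) is an assumption, and the exact counts vary from run to run:

import numpy as np

rng = np.random.default_rng(0)
x, visited = 10, {10}                    # start in the middle of {0, ..., 20}
t, t_first_end, t_both_ends = 0, None, None
while t_both_ends is None:
    t += 1
    proposal = x + rng.choice([-1, 1])   # Q proposes x' = x - 1 or x + 1 with probability 1/2
    if 0 <= proposal <= 20:              # P* is zero outside {0, ..., 20}, so such moves are rejected
        x = proposal
    visited.add(x)
    if t_first_end is None and (0 in visited or 20 in visited):
        t_first_end = t
    if 0 in visited and 20 in visited:
        t_both_ends = t
print(t_first_end, t_both_ends)          # on the order of 100 and 400 iterations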

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 36 / 40


Overview

1 Language Models: Recap

2 Topic Models

3 Probabilistic Latent Semantic Analysis (PLSA)

4 Latent Dirichlet Allocation (LDA)


Motivation: Bayesian Modeling
Background of Monte Carlo Methods
Importance Sampling
Rejection Sampling
Metropolis Methods
Gibbs Sampling
Sampling for EM Algorithm
Collapsed Gibbs Sampling for LDA

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 37 / 40


Gibbs Sampling
In the general case of a system with K variables, a single iteration
involves sampling one parameter at a time:
x_1^{(t+1)} ∼ P(x_1 | x_2^{(t)}, x_3^{(t)}, . . . , x_K^{(t)})
x_2^{(t+1)} ∼ P(x_2 | x_1^{(t+1)}, x_3^{(t)}, . . . , x_K^{(t)})
x_3^{(t+1)} ∼ P(x_3 | x_1^{(t+1)}, x_2^{(t+1)}, x_4^{(t)}, . . . , x_K^{(t)})
. . .
x_K^{(t+1)} ∼ P(x_K | x_1^{(t+1)}, x_2^{(t+1)}, . . . , x_{K−1}^{(t+1)})

Denote by x_{\k} the state of all variables except x_k, e.g., x_{\k}^{(t)} = (x_1^{(t+1)}, . . . , x_{k−1}^{(t+1)}, x_{k+1}^{(t)}, . . . , x_K^{(t)})
Gibbs sampling can be viewed as a Metropolis method whose proposal is the full conditional, Q(x′|x^{(t)}) = P(x_k′ | x_{\k}^{(t)}) with x_{\k}′ = x_{\k}^{(t)}:

a_G = P*(x′) Q(x^{(t)}|x′) / [ P*(x^{(t)}) Q(x′|x^{(t)}) ]
    = P(x′) P(x_k^{(t)} | x_{\k}′) / [ P(x^{(t)}) P(x_k′ | x_{\k}^{(t)}) ]
    = P(x_k′ | x_{\k}′) P(x_{\k}′) P(x_k^{(t)} | x_{\k}′) / [ P(x_k^{(t)} | x_{\k}^{(t)}) P(x_{\k}^{(t)}) P(x_k′ | x_{\k}^{(t)}) ]
    = P(x_k′ | x_{\k}′) P(x_{\k}′) P(x_k^{(t)} | x_{\k}′) / [ P(x_k^{(t)} | x_{\k}′) P(x_{\k}′) P(x_k′ | x_{\k}′) ]   (using x_{\k}^{(t)} = x_{\k}′)
    = 1

The samples are always accepted
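
A minimal Gibbs-sampling sketch for a case where the full conditionals are available in closed form, a standard bivariate Gaussian with correlation ρ (this example distribution is an assumption, not one used in the slides):

import numpy as np

rng = np.random.default_rng(0)
rho, T = 0.8, 20000
x1, x2 = 0.0, 0.0
samples = np.empty((T, 2))
for t in range(T):
    # P(x1 | x2) = N(rho * x2, 1 - rho^2) for a standard bivariate Gaussian
    x1 = rho * x2 + np.sqrt(1 - rho ** 2) * rng.standard_normal()
    # P(x2 | x1) = N(rho * x1, 1 - rho^2)
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.standard_normal()
    samples[t] = (x1, x2)
print(np.corrcoef(samples[1000:].T)[0, 1])   # close to rho; every conditional draw is accepted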


Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 38 / 40
Example of Gibbs Sampling

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 39 / 40


References I

MacKay, D. J. C. (1998). Introduction to Monte Carlo methods. In Jordan, M. I., editor, Learning in Graphical Models, NATO Science Series, pages 175–204. Kluwer Academic Press.

Yangqiu Song (HKUST) Learning for Text Analytics Spring 2018 40 / 40
