
Bayesian Inference for Dirichlet-Multinomials and Dirichlet Processes

Mark Johnson

Macquarie University
Sydney, Australia

MLSS "Summer School"
1 / 73
Random variables and "distributed according to" notation

• A probability distribution F is a non-negative function on some set X whose values sum (or integrate) to 1
• A random variable X is distributed according to a distribution F, or more simply, X has distribution F, written X ∼ F, iff:

      P(X = x) = F(x)  for all x

  (This is for discrete RVs.)
• You'll sometimes see the notation

      X | Y ∼ F

  which means "X is generated conditional on Y with distribution F" (where F usually depends on Y), i.e.,

      P(X | Y) = F(X | Y)
2 / 73
Outline

Introduction to Bayesian Inference

Mixture models

Sampling with Markov Chains

The Gibbs sampler

Gibbs sampling for Dirichlet-Multinomial mixtures

Topic modeling with Dirichlet multinomial mixtures

Chinese Restaurant Processes

3 / 73
Bayes' rule

      P(Hypothesis | Data) = P(Data | Hypothesis) P(Hypothesis) / P(Data)

• Bayesians use Bayes' Rule to update beliefs in hypotheses in response to data
• P(Hypothesis | Data) is the posterior distribution,
• P(Hypothesis) is the prior distribution,
• P(Data | Hypothesis) is the likelihood, and
• P(Data) is a normalising constant sometimes called the evidence

4 / 73
Computing the normalising constant

      P(Data) = ∑_{Hypothesis′ ∈ H} P(Data, Hypothesis′)
              = ∑_{Hypothesis′ ∈ H} P(Data | Hypothesis′) P(Hypothesis′)

• If the set of hypotheses H is small, we can calculate P(Data) by enumeration
• But often these sums are intractable

5 / 73
Bayesian belief updating

• Idea: treat the posterior from the last observation as the prior for the next
• Consistency follows because the likelihood factors
  – Suppose d = (d1, d2). Then the posterior of a hypothesis h is:

      P(h | d1, d2) ∝ P(h) P(d1, d2 | h)
                    = P(h) P(d1 | h) P(d2 | h, d1)
                    ∝ P(h | d1) P(d2 | h, d1)

    where P(h | d1) acts as the updated prior and P(d2 | h, d1) as the likelihood

6 / 73
Discrete distributions

• A discrete distribution has a finite set of outcomes 1, . . . , m
• A discrete distribution is parameterized by a vector θ = (θ_1, . . . , θ_m), where P(X = j | θ) = θ_j (so ∑_{j=1}^m θ_j = 1)
  – Example: an m-sided die, where θ_j = probability of face j
• Suppose X = (X_1, . . . , X_n) and each X_i | θ ∼ Discrete(θ). Then:

      P(X | θ) = ∏_{i=1}^n Discrete(X_i; θ) = ∏_{j=1}^m θ_j^{N_j}

  where N_j is the number of times j occurs in X.
• Goal of the next few slides: compute P(θ | X)

7 / 73
Multinomial distributions

• Suppose X_i ∼ Discrete(θ) for i = 1, . . . , n, and N_j is the number of times j occurs in X
• Then N | n, θ ∼ Multi(θ, n), and

      P(N | n, θ) = (n! / ∏_{j=1}^m N_j!) ∏_{j=1}^m θ_j^{N_j}

  where n! / ∏_{j=1}^m N_j! is the number of sequences of values with occurrence counts N
• The vector N is known as a sufficient statistic for θ because it supplies as much information about θ as the original sequence X does.

8 / 73
Dirichlet distributions

• Dirichlet distributions are probability distributions over multinomial parameter vectors
  – called Beta distributions when m = 2
• Parameterized by a vector α = (α_1, . . . , α_m) with α_j > 0 that determines the shape of the distribution

      Dir(θ | α) = (1 / C(α)) ∏_{j=1}^m θ_j^{α_j − 1}

      C(α) = ∫_Δ ∏_{j=1}^m θ_j^{α_j − 1} dθ = ∏_{j=1}^m Γ(α_j) / Γ(∑_{j=1}^m α_j)

• Γ is a generalization of the factorial function
  – Γ(k) = (k − 1)! for positive integer k
  – Γ(x) = (x − 1) Γ(x − 1) for all x
9 / 73
Plots of the Dirichlet distribution

      P(θ | α) = (Γ(∑_{j=1}^m α_j) / ∏_{j=1}^m Γ(α_j)) ∏_{k=1}^m θ_k^{α_k − 1}

[Plot: density P(θ_1 | α) for α = (1,1), α = (5,2) and α = (0.1,0.1); horizontal axis is θ_1, the probability of outcome 1.]

10 / 73
Dirichlet distributions as priors for θ

• Generative model:

      θ | α ∼ Dir(α)
      X_i | θ ∼ Discrete(θ),  i = 1, . . . , n

• We can depict this as a Bayes net using plates, which indicate replication

[Bayes net: α → θ → X_i, with X_i inside a plate replicated n times.]
11 / 73
Inference for θ with Dirichlet priors

• Data X = (X_1, . . . , X_n) generated i.i.d. from Discrete(θ)
• Prior is Dir(α). By Bayes' Rule, the posterior is:

      P(θ | X) ∝ P(X | θ) P(θ)
               ∝ (∏_{j=1}^m θ_j^{N_j}) (∏_{j=1}^m θ_j^{α_j − 1})
               = ∏_{j=1}^m θ_j^{N_j + α_j − 1},  so
      P(θ | X) = Dir(N + α)

• So if the prior is Dirichlet with parameters α, the posterior is Dirichlet with parameters N + α
  ⇒ can regard the Dirichlet parameters α as "pseudo-counts" from "pseudo-data"
12 / 73
Conjugate priors

• If the prior is Dir(α) and the likelihood is i.i.d. Discrete(θ), then the posterior is Dir(N + α)
  ⇒ prior parameters α specify "pseudo-observations"
• A class C of prior distributions P(H) is conjugate to a class of likelihood functions P(D | H) iff the posterior P(H | D) is also a member of C
• In general, conjugate priors encode "pseudo-observations"
  – the difference between the prior P(H) and the posterior P(H | D) is the observations in D
  – but P(H | D) belongs to the same family as P(H), and can serve as the prior for inferences about more data D′
  ⇒ it must be possible to encode the observations D using the parameters of the prior
• In general, the likelihood functions that have conjugate priors belong to the exponential family
13 / 73
Point estimates from Bayesian posteriors

• A "true" Bayesian prefers to use the full P(H | D), but sometimes we have to choose a "best" hypothesis
• The maximum a posteriori (MAP) estimate, or posterior mode, is

      Ĥ = argmax_H P(H | D) = argmax_H P(D | H) P(H)

• The expected value E_P[X] of X under distribution P is:

      E_P[X] = ∫ x P(X = x) dx

  The expected value is a kind of average, weighted by P(X).
  The expected value E[θ] of θ is an estimate of θ.
14 / 73
The posterior mode of a Dirichlet

• The maximum a posteriori (MAP) estimate, or posterior mode, is

      Ĥ = argmax_H P(H | D) = argmax_H P(D | H) P(H)

• For a Dirichlet with parameters α, the MAP estimate is:

      θ̂_j = (α_j − 1) / ∑_{j′=1}^m (α_{j′} − 1)

  so if the posterior is Dir(N + α), the MAP estimate for θ is:

      θ̂_j = (N_j + α_j − 1) / (n + ∑_{j′=1}^m (α_{j′} − 1))

• If α = 1 then θ̂_j = N_j / n, which is also the maximum likelihood estimate (MLE) for θ
15 / 73
The expected value of θ for a Dirichlet

• The expected value E_P[X] of X under distribution P is:

      E_P[X] = ∫ x P(X = x) dx

• For a Dirichlet with parameters α, the expected value of θ_j is:

      E_{Dir(α)}[θ_j] = α_j / ∑_{j′=1}^m α_{j′}

• Thus if the posterior is Dir(N + α), the expected value of θ_j is:

      E_{Dir(N+α)}[θ_j] = (N_j + α_j) / (n + ∑_{j′=1}^m α_{j′})

• E[θ] smooths or regularizes the MLE by adding pseudo-counts α to N
16 / 73
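Below is a minimal Python sketch of the two point estimates on these slides, the posterior mode and the posterior mean of Dir(N + α). The counts and prior values are made up for illustration; only the formulas come from the slides.

```python
import numpy as np

def dirichlet_map(N, alpha):
    """Posterior mode of Dir(N + alpha), following the slide's formula (assumes a positive denominator)."""
    w = N + alpha - 1.0
    return w / w.sum()

def dirichlet_mean(N, alpha):
    """Posterior mean of Dir(N + alpha): counts smoothed by pseudo-counts alpha."""
    w = N + alpha
    return w / w.sum()

N = np.array([3., 0., 1.])        # observed counts for a 3-outcome discrete distribution (illustrative)
alpha = np.array([1., 1., 1.])    # uniform Dirichlet prior (pseudo-counts)

print(dirichlet_map(N, alpha))    # [0.75 0.   0.25]  -- equals the MLE because alpha = 1
print(dirichlet_mean(N, alpha))   # approx. [0.571 0.143 0.286]  -- smoothed estimate
```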
Sampling from a Dirichlet

      θ | α ∼ Dir(α)  iff  P(θ | α) = (1 / C(α)) ∏_{j=1}^m θ_j^{α_j − 1},  where:

      C(α) = ∏_{j=1}^m Γ(α_j) / Γ(∑_{j=1}^m α_j)

• There are several algorithms for producing samples from Dir(α). A simple one relies on the following result:
• If V_k ∼ Gamma(α_k) and θ_k = V_k / ∑_{k′=1}^m V_{k′}, then θ ∼ Dir(α)
• This leads to the following algorithm for producing a sample θ from Dir(α):
  – Sample v_k from Gamma(α_k) for k = 1, . . . , m
  – Set θ_k = v_k / ∑_{k′=1}^m v_{k′}
17 / 73
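Below is a short Python sketch of the Gamma-normalization recipe just described, using NumPy's gamma sampler with unit scale; NumPy's built-in Dirichlet sampler is printed alongside purely as a sanity check. The α vector and seed are arbitrary.

```python
import numpy as np

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalizing independent Gamma(alpha_k) draws."""
    v = rng.gamma(shape=alpha, scale=1.0)    # v_k ~ Gamma(alpha_k)
    return v / v.sum()                       # theta_k = v_k / sum_k' v_k'

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 5.0])
print(sample_dirichlet(alpha, rng))          # one sample from Dir(1, 2, 5)
print(rng.dirichlet(alpha))                  # NumPy's built-in sampler, for comparison
```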
Posterior with Dirichlet priors

      θ | α ∼ Dir(α)
      X_i | θ ∼ Discrete(θ),  i = 1, . . . , n

• Integrate out θ to calculate the posterior probability of X:

      P(X | α) = ∫ P(X, θ | α) dθ = ∫ P(X | θ) P(θ | α) dθ
               = ∫_Δ (∏_{j=1}^m θ_j^{N_j}) (1 / C(α)) ∏_{j=1}^m θ_j^{α_j − 1} dθ
               = (1 / C(α)) ∫_Δ ∏_{j=1}^m θ_j^{N_j + α_j − 1} dθ
               = C(N + α) / C(α),  where C(α) = ∏_{j=1}^m Γ(α_j) / Γ(∑_{j=1}^m α_j)

• Collapsed Gibbs samplers and the Chinese Restaurant Process rely on this result

18 / 73
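Below is a small Python sketch of this marginal likelihood computed in log space with scipy's gammaln; the helper names (log_C, log_marginal) and the example counts are mine, not from the slides.

```python
import numpy as np
from scipy.special import gammaln

def log_C(alpha):
    """log C(alpha) = sum_j log Gamma(alpha_j) - log Gamma(sum_j alpha_j)."""
    return gammaln(alpha).sum() - gammaln(alpha.sum())

def log_marginal(N, alpha):
    """log P(X | alpha) = log C(N + alpha) - log C(alpha) for a Dirichlet-Discrete model."""
    return log_C(N + alpha) - log_C(alpha)

N = np.array([3., 0., 1.])              # counts of each outcome in the sequence X (illustrative)
alpha = np.array([1., 1., 1.])
print(np.exp(log_marginal(N, alpha)))   # probability of any one sequence with counts (3, 0, 1): 1/60
```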
Predictive distribution for Dirichlet-Multinomials

• The predictive distribution is the distribution of observation X_{n+1} given observations X = (X_1, . . . , X_n) and prior Dir(α):

      P(X_{n+1} = k | X, α) = ∫_Δ P(X_{n+1} = k | θ) P(θ | X, α) dθ
                            = ∫_Δ θ_k Dir(θ | N + α) dθ
                            = (N_k + α_k) / ∑_{j=1}^m (N_j + α_j)
19 / 73
Example: rolling a die

• Data d = (2, 5, 4, 2, 6)

[Plot: posterior density P(θ_2 | α) of θ_2, the probability of side 2, as the observations arrive, for α = (1,1,1,1,1,1), (1,2,1,1,1,1), (1,2,1,1,2,1), (1,2,1,2,2,1), (1,3,1,2,2,1) and (1,3,1,2,2,2).]

20 / 73
Inference in complex models

• If the model is simple enough we can calculate the posterior exactly (conjugate priors)
• When the model is more complicated, we can only approximate the posterior
• Variational Bayes calculates the function closest to the posterior within a class of functions
• Sampling algorithms produce samples from the posterior distribution
  – Markov chain Monte Carlo (MCMC) algorithms use a Markov chain to produce samples
  – A Gibbs sampler is a particular MCMC algorithm
• Particle filters are a kind of on-line sampling algorithm (on-line algorithms only make one pass through the data)
21 / 73
Outline

Introduction to Bayesian Inference

Mixture models

Sampling with Markov Chains

The Gibbs sampler

Gibbs sampling for Dirichlet-Multinomial mixtures

Topic modeling with Dirichlet multinomial mixtures

Chinese Restaurant Processes

22 / 73
Mixture models

• Observations X_i are a mixture of ℓ source distributions F(θ_k), k = 1, . . . , ℓ
• The value of Z_i specifies which source distribution is used to generate X_i (Z is like a switch)
• If Z_i = k, then X_i ∼ F(θ_k)
• Here we assume the Z_i are not observed, i.e., hidden

      X_i | Z_i, θ ∼ F(θ_{Z_i}),  i = 1, . . . , n

[Bayes net: θ_k in a plate of size ℓ; Z_i → X_i inside a plate of size n, with θ feeding into X_i.]
23 / 73
Applications of mixture models

• Blind source separation: data X_i come from ℓ different sources
  – Which X_i come from which source? (Z_i specifies the source of X_i)
  – What are the sources? (θ_k specifies the properties of source k)
• X_i could be a document and Z_i the topic of X_i
• X_i could be an image and Z_i the object(s) in X_i
• X_i could be a person's actions and Z_i the "cause" of X_i
• These are unsupervised learning problems, which are kinds of clustering problems
• In a Bayesian setting, we compute the posterior P(Z, θ | X). But how can we compute this?

24 / 73
Dirichlet-Multinomial mixtures

      φ | β ∼ Dir(β)
      Z_i | φ ∼ Discrete(φ),  i = 1, . . . , n
      θ_k | α ∼ Dir(α),  k = 1, . . . , ℓ
      X_{i,j} | Z_i, θ ∼ Discrete(θ_{Z_i}),  i = 1, . . . , n;  j = 1, . . . , d_i

• Z_i is generated from a multinomial φ
• Dirichlet priors on φ and θ_k
• Easy to modify this framework for other applications
• Why does each observation X_i consist of d_i elements?
• What effect do the priors α and β have?

[Bayes net: β → φ → Z_i and α → θ_k (plate of size ℓ); Z_i selects the θ_{Z_i} that generates X_{i,j} (plates of size n and d_i).]
25 / 73
Outline

Introduction to Bayesian Inference

Mixture models

Sampling with Markov Chains

The Gibbs sampler

Gibbs sampling for Dirichlet-Multinomial mixtures

Topic modeling with Dirichlet multinomial mixtures

Chinese Restaurant Processes

26 / 73
Why sample?

• Setup: a Bayes net has variables X, whose value x we observe, and variables Y, whose value we don't know
  – Y includes any parameters we want to estimate, such as θ
• Goal: compute the expected value of some function f:

      E[f | X = x] = ∑_y f(x, y) P(Y = y | X = x)

  – E.g., f(x, y) = 1 if x_1 and x_2 are both generated from the same hidden state, and 0 otherwise
• In what follows, everything is conditioned on X = x, so take P(Y) to mean P(Y | X = x)
• Suppose we can produce n samples y^{(t)}, where Y^{(t)} ∼ P(Y). Then we can estimate:

      E[f | X = x] ≈ (1/n) ∑_{t=1}^n f(x, y^{(t)})
27 / 73
Markov chains

• A (first-order) Markov chain is a distribution over random variables S^{(0)}, . . . , S^{(n)}, all ranging over the same state space S, where:

      P(S^{(0)}, . . . , S^{(n)}) = P(S^{(0)}) ∏_{t=0}^{n−1} P(S^{(t+1)} | S^{(t)})

  S^{(t+1)} is conditionally independent of S^{(0)}, . . . , S^{(t−1)} given S^{(t)}
• A Markov chain is homogeneous or time-invariant iff:

      P(S^{(t+1)} = s′ | S^{(t)} = s) = P_{s′,s}  for all t, s, s′

  The matrix P is called the transition probability matrix (tpm) of the Markov chain
• If P(S^{(t)} = s) = π_s^{(t)} (i.e., π^{(t)} is a vector of state probabilities at time t) then:
  – π^{(t+1)} = P π^{(t)}
  – π^{(t)} = P^t π^{(0)}
28 / 73
Ergodicity

• A Markov chain with tpm P is ergodic iff there is a positive integer m s.t. all elements of P^m are positive (i.e., there is an m-step path between any two states)
• Informally, an ergodic Markov chain "forgets" its past states
• Theorem: for each homogeneous ergodic Markov chain with tpm P there is a unique limiting distribution D_P, i.e., as n approaches infinity, the distribution of S^{(n)} converges to D_P
• D_P is called the stationary distribution of the Markov chain
• Let π be the vector representation of D_P, i.e., D_P(y) = π_y. Then:

      π = P π,  and
      π = lim_{n→∞} P^n π^{(0)}  for every initial distribution π^{(0)}

29 / 73
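Below is a tiny Python sketch of this convergence: repeatedly applying an ergodic transition matrix to an arbitrary initial distribution approaches the stationary distribution satisfying π = Pπ. The 2-state matrix is invented for illustration and follows the column-stochastic convention π^{(t+1)} = P π^{(t)} used on the previous slide.

```python
import numpy as np

# Columns are "from" states, rows are "to" states, so pi_next = P @ pi (column-stochastic).
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])
pi = np.array([1.0, 0.0])      # arbitrary initial distribution pi^(0)

for _ in range(100):           # pi^(t) = P^t pi^(0)
    pi = P @ pi

print(pi)                      # approx. [2/3, 1/3], the solution of pi = P pi
```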
Using a Markov chain for inference of P(Y)

• Set the state space S of the Markov chain to the range of Y (S may be astronomically large)
• Find a tpm P whose stationary distribution D_P is P(Y)
• "Run" the Markov chain, i.e.,
  – Pick y^{(0)} somehow
  – For t = 0, . . . , n − 1: sample y^{(t+1)} from P(Y^{(t+1)} | Y^{(t)} = y^{(t)}), i.e., from P_{·, y^{(t)}}
  – After discarding the first burn-in samples, use the remaining samples to calculate statistics
• WARNING: in general the samples y^{(t)} are not independent

30 / 73
Outline

Introduction to Bayesian Inference

Mixture models

Sampling with Markov Chains

The Gibbs sampler

Gibbs sampling for Dirichlet-Multinomial mixtures

Topic modeling with Dirichlet multinomial mixtures

Chinese Restaurant Processes

31 / 73
The Gibbs sampler

• The Gibbs sampler is useful when:
  – Y is multivariate, i.e., Y = (Y_1, . . . , Y_m), and
  – it is easy to sample from P(Y_j | Y_{−j})
• The Gibbs sampler for P(Y) is the tpm P = ∏_{j=1}^m P^{(j)}, where:

      P^{(j)}_{y′,y} = 0                                if y′_{−j} ≠ y_{−j}
      P^{(j)}_{y′,y} = P(Y_j = y′_j | Y_{−j} = y_{−j})  if y′_{−j} = y_{−j}

• Informally, the Gibbs sampler cycles through each of the variables Y_j, replacing the current value y_j with a sample from P(Y_j | Y_{−j} = y_{−j})
• There are sequential-scan and random-scan variants of Gibbs sampling

32 / 73
A simple example of Gibbs sampling

      P(Y_1, Y_2) = c  if |Y_1| < 5, |Y_2| < 5 and |Y_1 − Y_2| < 1
      P(Y_1, Y_2) = 0  otherwise

• The Gibbs sampler for P(Y_1, Y_2) samples repeatedly from:

      P(Y_2 | Y_1) = Uniform(max(−5, Y_1 − 1), min(5, Y_1 + 1))
      P(Y_1 | Y_2) = Uniform(max(−5, Y_2 − 1), min(5, Y_2 + 1))

• Sample run (Y_1, Y_2): (0, 0), (0, −0.119), (0.363, −0.119), (0.363, 0.146), (−0.681, 0.146), (−0.681, −1.551), . . .

[Plot: the sampled points wander along the diagonal band |Y_1 − Y_2| < 1 inside the square [−5, 5]^2.]

33 / 73
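Below is a short Python sketch of this toy Gibbs sampler; the seed, iteration count and starting point (0, 0) are arbitrary choices, not prescribed by the slides.

```python
import numpy as np

def gibbs_band(n_iter, rng):
    """Gibbs sampler for the uniform distribution on {|y1| < 5, |y2| < 5, |y1 - y2| < 1}."""
    y1, y2 = 0.0, 0.0                                               # initial state
    samples = []
    for _ in range(n_iter):
        y2 = rng.uniform(max(-5.0, y1 - 1.0), min(5.0, y1 + 1.0))   # resample Y2 | Y1
        y1 = rng.uniform(max(-5.0, y2 - 1.0), min(5.0, y2 + 1.0))   # resample Y1 | Y2
        samples.append((y1, y2))
    return np.array(samples)

samples = gibbs_band(10000, np.random.default_rng(0))
print(samples[:5])            # first few (Y1, Y2) pairs, all inside the diagonal band
```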
A non-ergodic Gibbs sampler

      P(Y_1, Y_2) = c  if 1 < Y_1, Y_2 < 5 or −5 < Y_1, Y_2 < −1
      P(Y_1, Y_2) = 0  otherwise

• The Gibbs sampler for P(Y_1, Y_2), initialized at (2, 2), samples repeatedly from:

      P(Y_2 | Y_1) = Uniform(1, 5)
      P(Y_1 | Y_2) = Uniform(1, 5)

  i.e., it never visits the negative values of Y_1, Y_2
• Sample run (Y_1, Y_2): (2, 2), (2, 2.72), (2.84, 2.72), (2.84, 4.71), (2.63, 4.71), (2.63, 4.52), (1.11, 4.52), . . .

[Plot: the sampled points stay inside the positive square (1, 5)^2 and never reach the negative square (−5, −1)^2.]

34 / 73
Why does the Gibbs sampler work?

• The Gibbs sampler tpm is P = ∏_{j=1}^m P^{(j)}, where P^{(j)} replaces y_j with a sample from P(Y_j | Y_{−j} = y_{−j}) to produce y′
• But if y is a sample from P(Y), then so is y′, since y′ differs from y only by replacing y_j with a sample from P(Y_j | Y_{−j} = y_{−j})
• Since each P^{(j)} maps samples from P(Y) to samples from P(Y), so does P
  ⇒ P(Y) is a stationary distribution for P
• If P is ergodic, then P(Y) is the unique stationary distribution for P, i.e., the sampler converges to P(Y)

35 / 73
Gibbs sampling with Bayes nets

• Gibbs sampler: update y_j with a sample from P(Y_j | Y_{−j}) ∝ P(Y_j, Y_{−j})
• Only need to evaluate the terms that depend on Y_j in the Bayes net factorization:
  – Y_j appears once in a term P(Y_j | Y_{Pa_j})
  – Y_j can appear multiple times in terms P(Y_k | . . . , Y_j, . . .)
• In graphical terms, we need to know the value of:
  – Y_j's parents
  – Y_j's children, and their other parents
36 / 73
Outline

Introduction to Bayesian Inference

Mixture models

Sampling with Markov Chains

The Gibbs sampler

Gibbs sampling for Dirichlet-Multinomial mixtures

Topic modeling with Dirichlet multinomial mixtures

Chinese Restaurant Processes

37 / 73
Dirichlet-Multinomial mixtures

      φ | β ∼ Dir(β)
      Z_i | φ ∼ Discrete(φ),  i = 1, . . . , n
      θ_k | α ∼ Dir(α),  k = 1, . . . , ℓ
      X_{i,j} | Z_i, θ ∼ Discrete(θ_{Z_i}),  i = 1, . . . , n;  j = 1, . . . , d_i

      P(φ, Z, θ, X | α, β)
        = (1 / C(β)) ∏_{k=1}^ℓ φ_k^{β_k − 1 + N_k(Z)}
          × ∏_{k=1}^ℓ (1 / C(α)) ∏_{j=1}^m θ_{k,j}^{α_j − 1 + ∑_{i: Z_i = k} N_j(X_i)}

      where C(α) = ∏_{j=1}^m Γ(α_j) / Γ(∑_{j=1}^m α_j)

38 / 73
Gibbs sampling for D-M mixtures

      φ | β ∼ Dir(β)
      Z_i | φ ∼ Discrete(φ),  i = 1, . . . , n
      θ_k | α ∼ Dir(α),  k = 1, . . . , ℓ
      X_{i,j} | Z_i, θ ∼ Discrete(θ_{Z_i}),  i = 1, . . . , n;  j = 1, . . . , d_i

• The sampler cycles through these conditional distributions:

      P(φ | Z, β) = Dir(φ; β + N(Z))
      P(Z_i = k | φ, θ, X_i) ∝ φ_k ∏_{j=1}^m θ_{k,j}^{N_j(X_i)}
      P(θ_k | α, X, Z) = Dir(θ_k; α + ∑_{i: Z_i = k} N(X_i))
39 / 73
Collapsed Dirichlet-Multinomial mixtures

      P(Z | β) = C(N(Z) + β) / C(β)
      P(X | α, Z) = ∏_{k=1}^ℓ C(α + ∑_{i: Z_i = k} N(X_i)) / C(α),  so

      P(Z_i = k | Z_{−i}, α, β) ∝ (N_k(Z_{−i}) + β_k) / (n − 1 + β_•)
          × C(α + ∑_{i′ ≠ i: Z_{i′} = k} N(X_{i′}) + N(X_i)) / C(α + ∑_{i′ ≠ i: Z_{i′} = k} N(X_{i′}))

• P(Z_i = k | Z_{−i}, α, β) is proportional to the probability of generating:
  – Z_i = k, given the other Z_{−i}, and
  – X_i in cluster k, given X_{−i} and Z_{−i}
40 / 73
Gibbs sampling for Dirichlet-Multinomial mixtures

• Each X_i could be generated from one of several Dirichlet-Multinomials
• The variable Z_i indicates the source for X_i
• The uncollapsed sampler samples Z, θ and φ
• The collapsed sampler integrates out θ and φ and just samples Z
• Collapsed samplers often (but not always) converge faster than uncollapsed samplers
• Collapsed samplers are usually easier to implement

41 / 73
Outline

Introduction to Bayesian Inference

Mixture models

Sampling with Markov Chains

The Gibbs sampler

Gibbs sampling for Dirichlet-Multinomial mixtures

Topic modeling with Dirichlet multinomial mixtures

Chinese Restaurant Processes

42 / 73
Topic modeling of child-directed speech

• Data: Adam, Eve and Sarah's mothers' child-directed utterances

      I like it .
      why don't you read Shadow yourself ?
      that's a terribly small horse for you to ride .
      why don't you look at some of the toys in the basket .
      want to ?
      do you want to see what I have ?
      what is that ?
      not in your mouth .

• 59,959 utterances, composed of 337,751 words
43 / 73
Uncollapsed Gibbs sampler for topic model

• Data consists of "documents" X_i
• Each X_i is a sequence of "words" X_{i,j}
• Initialize by randomly assigning each document X_i to a topic Z_i
• Repeat the following:
  – Replace φ with a sample from a Dirichlet with parameters β + N(Z)
  – For each topic k, replace θ_k with a sample from a Dirichlet with parameters α + ∑_{i: Z_i = k} N(X_i)
  – For each document i, replace Z_i with a sample from P(Z_i = k | φ, θ, X_i) ∝ φ_k ∏_{j=1}^m θ_{k,j}^{N_j(X_i)}
44 / 73
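Below is a compact Python sketch of this uncollapsed sampler, under the simplifying assumption that each document X_i has already been reduced to a vector of word counts N(X_i). The function name, toy corpus and hyperparameter values are illustrative only.

```python
import numpy as np

def uncollapsed_gibbs(docs, n_topics, alpha, beta, n_iter, rng):
    """docs: (n_docs, vocab) array of word counts N(X_i). Returns the final topic assignments Z."""
    n_docs, vocab = docs.shape
    z = rng.integers(n_topics, size=n_docs)               # random initial topic for each document
    for _ in range(n_iter):
        # Resample phi | Z, beta ~ Dir(beta + N(Z))
        topic_counts = np.bincount(z, minlength=n_topics)
        phi = rng.dirichlet(beta + topic_counts)
        # Resample theta_k | alpha, X, Z ~ Dir(alpha + sum_{i: Z_i = k} N(X_i))
        theta = np.empty((n_topics, vocab))
        for k in range(n_topics):
            theta[k] = rng.dirichlet(alpha + docs[z == k].sum(axis=0))
        # Resample each Z_i | phi, theta, X_i
        log_p = np.log(phi) + docs @ np.log(theta).T       # (n_docs, n_topics) unnormalized log probs
        p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(n_topics, p=p_i) for p_i in p])
    return z

rng = np.random.default_rng(0)
docs = np.array([[5, 0, 1], [4, 1, 0], [0, 6, 2], [1, 5, 1]])   # 4 toy documents, 3-word vocabulary
print(uncollapsed_gibbs(docs, n_topics=2,
                        alpha=np.ones(3), beta=np.ones(2),
                        n_iter=50, rng=rng))
```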
Collapsed Gibbs sampler for topic model

• Initialize by randomly assigning each document X_i to a topic Z_i
• Repeat the following:
  – For each document i in 1, . . . , n (in random order): replace Z_i with a random sample from P(Z_i | Z_{−i}, α, β)

      P(Z_i = k | Z_{−i}, α, β) ∝ (N_k(Z_{−i}) + β_k) / (n − 1 + β_•)
          × C(α + ∑_{i′ ≠ i: Z_{i′} = k} N(X_{i′}) + N(X_i)) / C(α + ∑_{i′ ≠ i: Z_{i′} = k} N(X_{i′}))

45 / 73
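Below is a matching Python sketch of the collapsed sampler. It reuses the log C(·) idea from the marginal-likelihood sketch above, drops the denominator n − 1 + β_• (constant in k), and again assumes count-vector documents; the toy data and settings are illustrative.

```python
import numpy as np
from scipy.special import gammaln

def log_C(a):
    """log of the Dirichlet normalizer C(a) = prod_j Gamma(a_j) / Gamma(sum_j a_j)."""
    return gammaln(a).sum() - gammaln(a.sum())

def collapsed_gibbs(docs, n_topics, alpha, beta, n_iter, rng):
    """docs: (n_docs, vocab) word-count vectors N(X_i). Returns the final topic assignments Z."""
    n_docs, _ = docs.shape
    z = rng.integers(n_topics, size=n_docs)
    topic_counts = np.bincount(z, minlength=n_topics)                               # N_k(Z)
    word_counts = np.stack([docs[z == k].sum(axis=0) for k in range(n_topics)])     # per-topic word totals
    for _ in range(n_iter):
        for i in rng.permutation(n_docs):
            # Retract document i from its current topic
            topic_counts[z[i]] -= 1
            word_counts[z[i]] -= docs[i]
            # log P(Z_i = k | Z_-i, alpha, beta), up to a constant
            log_p = np.array([
                np.log(topic_counts[k] + beta[k])
                + log_C(alpha + word_counts[k] + docs[i]) - log_C(alpha + word_counts[k])
                for k in range(n_topics)])
            p = np.exp(log_p - log_p.max())
            z[i] = rng.choice(n_topics, p=p / p.sum())
            # Add document i back under its new topic
            topic_counts[z[i]] += 1
            word_counts[z[i]] += docs[i]
    return z

rng = np.random.default_rng(0)
docs = np.array([[5, 0, 1], [4, 1, 0], [0, 6, 2], [1, 5, 1]])
print(collapsed_gibbs(docs, n_topics=2, alpha=np.ones(3), beta=np.ones(2), n_iter=50, rng=rng))
```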
Topics assigned after 100 iterations
1 big drum ?
3 horse .
8 who is that ?
9 those are checkers .
3 two checkers # yes .
1 play checkers ?
1 big horn ?
2 get over # Mommy .
1 shadow ?
9 I like it .
1 why don’t you read Shadow yourself ?
9 that’s a terribly small horse for you to ride .
2 why don’t you look at some of the toys in the basket .
1 want to ?
1 do you want to see what I have ?
8 what is that ?
2 not in your mouth .
2 let me put them together .
2 no # put floor .
3 no # that’s his pencil .
46 / 73
Most probable words in each cluster
P(Z=4) = 0.4334 P(Z=9) = 0.3111 P(Z=7) = 0.2555 P(Z=3) = 5.003e-
X P(X|Z) X P(X|Z) X P(X|Z) X P(X|Z)
. 0.12526 ? 0.19147 . 0.2258 quack 0.85
# 0.045402 you 0.062577 # 0.0695 . 0.15
you 0.040475 what 0.061256 that’s 0.034538
the 0.030259 that 0.022295 a 0.034066
it 0.024154 the 0.022126 no 0.02649
I 0.021848 # 0.021809 oh 0.023558
to 0.018473 is 0.021683 yeah 0.020332
don’t 0.015473 do 0.016127 the 0.014907
a 0.013662 it 0.015927 xxx 0.014288
? 0.013459 a 0.015092 not 0.013864
in 0.011708 to 0.013783 it’s 0.013343
on 0.011064 did 0.012631 ? 0.013033
your 0.010145 are 0.011427 yes 0.011795
and 0.009578 what’s 0.011195 right 0.0094166
that 0.0093303 your 0.0098961 alright 0.0088953
have 0.0088019 huh 0.0082591 is 0.0087975
no 0.0082514 want 0.0076782 you’re 0.0076571
put 0.0067486 where 0.0072346 one 0.006647
47 / 73
Remarks on cluster results

• The samplers cluster words by clustering the documents they appear in, and cluster documents by clustering the words that appear in them
• Even though there were ℓ = 10 clusters and α = 1, β = 1, typically only 4 clusters were occupied after convergence
• Words x with high marginal probability P(X = x) are typically so frequent that they occur in all clusters
  ⇒ Listing the most probable words in each cluster may not be a good way of characterizing the clusters
• Instead, we can Bayes-invert and find the words that are most strongly associated with each class:

      P(Z = k | X = x) = (N_{k,x}(Z, X) + ε) / (N_x(X) + εℓ)
48 / 73
Purest words of each cluster
P(Z=4) = 0.4334 P(Z=9) = 0.3111 P(Z=7) = 0.2555 P(Z=3) = 5.0
X P(Z|X) X P(Z|X) X P(Z|X) X P(Z|
I’ll 0.97168 d(o) 0.97138 0 0.94715 quack 0.64
we’ll 0.96486 what’s 0.95242 mmhm 0.944 . 0.00
c(o)me 0.95319 what’re 0.94348 www 0.90244
you’ll 0.95238 happened 0.93722 m:hm 0.83019
may 0.94845 hmm 0.93343 uhhuh 0.81667
let’s 0.947 whose 0.92437 uh(uh) 0.78571
thought 0.94382 what 0.9227 uhuh 0.77551
won’t 0.93645 where’s 0.92241 that’s 0.7755
come 0.93588 doing 0.90196 yep 0.76531
let 0.93255 where’d 0.9009 um 0.76282
I 0.93192 don’t] 0.89157 oh+boy 0.73529
(h)ere 0.93082 whyn’t 0.89157 d@l 0.72603
stay 0.92073 who 0.88527 goodness 0.7234
later 0.91964 how’s 0.875 s@l 0.72
thank 0.91667 who’s 0.85068 sorry 0.70588
them 0.9124 [: 0.85047 thank+you 0.6875
can’t 0.90762 ? 0.84783 o:h 0.68
never 0.9058 matter 0.82963 nope 0.67857
49 / 73
Summary

• Complex models often don’t have analytic solutions


• Approximate inference can be used on many such models
• Markov chain Monte Carlo methods produce samples from
(an approximation to) the posterior distribution
• Gibbs sampling is an MCMC procedure that resamples each
variable conditioned on the values of the other variables
• If you can sample from the conditional distribution of each
hidden variable in a Bayes net, you can use Gibbs sampling to
sample from the joint posterior distribution
• We applied Gibbs sampling to Dirichlet-multinomial mixtures
to cluster sentences

50 / 73
Outline

Introduction to Bayesian Inference

Mixture models

Sampling with Markov Chains

The Gibbs sampler

Gibbs sampling for Dirichlet-Multinomial mixtures

Topic modeling with Dirichlet multinomial mixtures

Chinese Restaurant Processes

51 / 73
Bayesian inference for Dirichlet-multinomials

• Probability of the next event with a uniform Dirichlet prior with mass α over m outcomes and observed data Z_{1:n} = (Z_1, . . . , Z_n):

      P(Z_{n+1} = k | Z_{1:n}, α) ∝ n_k(Z_{1:n}) + α/m

  where n_k(Z_{1:n}) is the number of times k appears in Z_{1:n}
• Example: coin (m = 2), α = 1, Z_{1:2} = (heads, heads)
  – P(Z_3 = heads | Z_{1:2}, α) ∝ 2.5
  – P(Z_3 = tails | Z_{1:2}, α) ∝ 0.5

52 / 73
Dirichlet-multinomials with many outcomes

• Predictive probability:

      P(Z_{n+1} = k | Z_{1:n}, α) ∝ n_k(Z_{1:n}) + α/m

• Suppose the number of outcomes m ≫ n. Then:

      P(Z_{n+1} = k | Z_{1:n}, α) ∝ n_k(Z_{1:n})  if n_k(Z_{1:n}) > 0
      P(Z_{n+1} = k | Z_{1:n}, α) ∝ α/m           if n_k(Z_{1:n}) = 0

• But most outcomes will be unobserved, so:

      P(Z_{n+1} ∉ Z_{1:n} | Z_{1:n}, α) ∝ α

53 / 73
From Dirichlet-multinomials to Chinese Restaurant Processes

• Suppose the number of outcomes is unbounded, but we pick the event labels
• If we number event types in order of occurrence ⇒ Chinese Restaurant Process

      Z_1 = 1
      P(Z_{n+1} = k | Z_{1:n}, α) ∝ n_k(Z_{1:n})  if k ≤ m = max(Z_{1:n})
      P(Z_{n+1} = k | Z_{1:n}, α) ∝ α             if k = m + 1

54 / 73
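Below is a small Python sketch that draws a sequence of table assignments from this process; the concentration value, sequence length and seed are arbitrary.

```python
import numpy as np

def sample_crp(n, alpha, rng):
    """Sample table assignments Z_1, ..., Z_n from a Chinese Restaurant Process with concentration alpha."""
    z = [1]                                   # the first customer always sits at table 1
    counts = [1]                              # counts[k-1] = number of customers at table k
    for _ in range(n - 1):
        weights = np.array(counts + [alpha])  # existing tables prop. to occupancy, new table prop. to alpha
        k = rng.choice(len(weights), p=weights / weights.sum()) + 1
        if k == len(counts) + 1:
            counts.append(1)                  # open a new table
        else:
            counts[k - 1] += 1
        z.append(k)
    return z

print(sample_crp(20, alpha=1.0, rng=np.random.default_rng(0)))
```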
Chinese Restaurant Process (0)

• Customer → table mapping Z =


• P( z ) = 1

• Next customer chooses a table according to:



      P(Z_{n+1} = k | Z_{1:n}) ∝ n_k(Z_{1:n})  if k ≤ m = max(Z_{1:n})
      P(Z_{n+1} = k | Z_{1:n}) ∝ α             if k = m + 1

55 / 73
Chinese Restaurant Process (1)

• Customer → table mapping Z = 1


• P(z) = α/α

• Next customer chooses a table according to:



      P(Z_{n+1} = k | Z_{1:n}) ∝ n_k(Z_{1:n})  if k ≤ m = max(Z_{1:n})
      P(Z_{n+1} = k | Z_{1:n}) ∝ α             if k = m + 1

56 / 73
Chinese Restaurant Process (2)

1 α

• Customer → table mapping Z = 1, 1


• P(z) = α/α × 1/(1 + α)

• Next customer chooses a table according to:



      P(Z_{n+1} = k | Z_{1:n}) ∝ n_k(Z_{1:n})  if k ≤ m = max(Z_{1:n})
      P(Z_{n+1} = k | Z_{1:n}) ∝ α             if k = m + 1

57 / 73
Chinese Restaurant Process (3)

2 α

• Customer → table mapping Z = 1, 1, 2


• P(z) = α/α × 1/(1 + α) × α/(2 + α)

• Next customer chooses a table according to:



      P(Z_{n+1} = k | Z_{1:n}) ∝ n_k(Z_{1:n})  if k ≤ m = max(Z_{1:n})
      P(Z_{n+1} = k | Z_{1:n}) ∝ α             if k = m + 1

58 / 73
Chinese Restaurant Process (4)

2 1 α

• Customer → table mapping Z = 1, 1, 2, 1


• P(z) = α/α × 1/(1 + α) × α/(2 + α) × 2/(3 + α)

• Next customer chooses a table according to:



      P(Z_{n+1} = k | Z_{1:n}) ∝ n_k(Z_{1:n})  if k ≤ m = max(Z_{1:n})
      P(Z_{n+1} = k | Z_{1:n}) ∝ α             if k = m + 1

59 / 73
Pitman-Yor Process (0)

• Customer → table mapping z =


• P( z ) = 1

• In CRPs, probability of choosing a table ∝ number of


customers ⇒ strong rich get richer effect
• Pitman-Yor processes take mass a from each occupied table
and give it to the new table

      P(Z_{n+1} = k | z) ∝ n_k(z) − a   if k ≤ m = max(z)
      P(Z_{n+1} = k | z) ∝ ma + b       if k = m + 1
60 / 73
Pitman-Yor Process (1)

• Customer → table mapping z = 1


• P(z) = b/b

• In CRPs, probability of choosing a table ∝ number of


customers ⇒ strong rich get richer effect
• Pitman-Yor processes take mass a from each occupied table
and give it to the new table

      P(Z_{n+1} = k | z) ∝ n_k(z) − a   if k ≤ m = max(z)
      P(Z_{n+1} = k | z) ∝ ma + b       if k = m + 1
61 / 73
Pitman-Yor Process (2)

1−a a+b

• Customer → table mapping z = 1, 1


• P(z) = b/b × (1 − a)/(1 + b)

• In CRPs, probability of choosing a table ∝ number of


customers ⇒ strong rich get richer effect
• Pitman-Yor processes take mass a from each occupied table
and give it to the new table

      P(Z_{n+1} = k | z) ∝ n_k(z) − a   if k ≤ m = max(z)
      P(Z_{n+1} = k | z) ∝ ma + b       if k = m + 1
62 / 73
Pitman-Yor Process (3)

2−a a+b

• Customer → table mapping z = 1, 1, 2


• P(z) = b/b × (1 − a)/(1 + b) × ( a + b)/(2 + b)

• In CRPs, probability of choosing a table ∝ number of


customers ⇒ strong rich get richer effect
• Pitman-Yor processes take mass a from each occupied table
and give it to the new table

      P(Z_{n+1} = k | z) ∝ n_k(z) − a   if k ≤ m = max(z)
      P(Z_{n+1} = k | z) ∝ ma + b       if k = m + 1
63 / 73
Pitman-Yor Process (4)

2−a 1−a 2a + b

• Customer → table mapping z = 1, 1, 2, 1


• P( z ) =
b/b × (1 − a)/(1 + b) × ( a + b)/(2 + b) × (2 − a)/(3 + b)

• In CRPs, probability of choosing a table ∝ number of


customers ⇒ strong rich get richer effect
• Pitman-Yor processes take mass a from each occupied table
and give it to the new table

      P(Z_{n+1} = k | z) ∝ n_k(z) − a   if k ≤ m = max(z)
      P(Z_{n+1} = k | z) ∝ ma + b       if k = m + 1

64 / 73
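Below is a sketch of the Pitman-Yor seating rule, assuming a discount 0 ≤ a < 1 and strength b > −a (the values used are arbitrary); with a = 0 and b = α it reduces to the CRP sketch above.

```python
import numpy as np

def sample_pitman_yor(n, a, b, rng):
    """Sample table assignments from a Pitman-Yor process with discount a and strength b."""
    z = [1]
    counts = [1]
    for _ in range(n - 1):
        m = len(counts)
        weights = np.array([c - a for c in counts] + [m * a + b])  # occupied tables lose mass a; new table gets m*a + b
        k = rng.choice(m + 1, p=weights / weights.sum()) + 1
        if k == m + 1:
            counts.append(1)
        else:
            counts[k - 1] += 1
        z.append(k)
    return z

print(sample_pitman_yor(20, a=0.5, b=1.0, rng=np.random.default_rng(0)))
```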
Labeled Chinese Restaurant Process (0)

• Table → label mapping Y =


• Customer → table mapping Z =
• Output sequence X =
• P( X ) = 1

• Base distribution P0 (Y ) generates a label Yk for each table k


• All customers sitting at table k (i.e., Zi = k) share label Yk
• Customer i sitting at table Zi has label Xi = YZi
65 / 73
Labeled Chinese Restaurant Process (1)

fish

• Table → label mapping Y = fish


• Customer → table mapping Z = 1
• Output sequence X = fish
• P( X ) = α/α × P0 (fish)

• Base distribution P0 (Y ) generates a label Yk for each table k


• All customers sitting at table k (i.e., Zi = k) share label Yk
• Customer i sitting at table Zi has label Xi = YZi
66 / 73
Labeled Chinese Restaurant Process (2)

fish

1 α

• Table → label mapping Y = fish


• Customer → table mapping Z = 1, 1
• Output sequence X = fish,fish
• P( X ) = P0 (fish) × 1/(1 + α)

• Base distribution P0 (Y ) generates a label Yk for each table k


• All customers sitting at table k (i.e., Zi = k) share label Yk
• Customer i sitting at table Zi has label Xi = YZi
67 / 73
Labeled Chinese Restaurant Process (3)

fish apple

2 α

• Table → label mapping Y = fish,apple


• Customer → table mapping Z = 1, 1, 2
• Output sequence X = fish,fish,apple
• P(X) = P0(fish) × 1/(1 + α) × α/(2 + α) × P0(apple)

• Base distribution P0 (Y ) generates a label Yk for each table k


• All customers sitting at table k (i.e., Zi = k) share label Yk
• Customer i sitting at table Zi has label Xi = YZi
68 / 73
Labeled Chinese Restaurant Process (4)

fish apple

2 1 α

• Table → label mapping Y = fish,apple


• Customer → table mapping Z = 1, 1, 2, 1
• Output sequence X = fish, fish, apple, fish
• P(X) = P0(fish) × 1/(1 + α) × α/(2 + α) × P0(apple) × 2/(3 + α)

• Base distribution P0 (Y ) generates a label Yk for each table k


• All customers sitting at table k (i.e., Zi = k) share label Yk
• Customer i sitting at table Zi has label Xi = YZi
69 / 73
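Below is a sketch of a labeled CRP that maps a base distribution P0 to a stream of samples, as in the slides above; the particular P0 used here (a three-outcome discrete distribution over the labels fish, apple, dog) is invented for the example.

```python
import numpy as np

def labeled_crp(n, alpha, base_outcomes, base_probs, rng):
    """Generate X_1, ..., X_n: each new table k draws a label Y_k from P0; customers output their table's label."""
    labels = []                               # Y_k: label of table k
    counts = []                               # number of customers at table k
    xs = []
    for _ in range(n):
        weights = np.array(counts + [alpha])
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(counts):                  # new table: draw its label from the base distribution P0
            counts.append(1)
            labels.append(rng.choice(base_outcomes, p=base_probs))
        else:
            counts[k] += 1
        xs.append(labels[k])
    return xs

rng = np.random.default_rng(0)
print(labeled_crp(10, alpha=1.0,
                  base_outcomes=["fish", "apple", "dog"], base_probs=[0.5, 0.3, 0.2],
                  rng=rng))
```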
From Chinese restaurants to Dirichlet
processes
• Labeled Chinese restaurant processes take a distribution P0
and return a stream of samples from a different distribution
with the same support
• The Chinese restaurant process is a sequential process,
generating the next item conditioned on the previous ones
• We can get a different distribution each time we run a CRP
(allocation of customers to tables and labeling of tables are
randomized)
• Abstracting away from the sequential generation of the CRP,
we can view it as a mapping from a base distribution P0 to a
distribution over distributions DP(α, P0 )
• DP(α, P0 ) is called a Dirichlet process with concentration
parameter α and base distribution P0
• Distributions drawn from DP(α, P0) are discrete (with probability 1) even if the base distribution P0 is continuous

70 / 73
Gibbs sampling with Chinese restaurants

• Idea: resample z_i as if z_{−i} were "real" data
• The CRP is exchangeable: all ways of generating an assignment of customers to labeled tables have the same probability
• This means P(z_i | z_{−i}) is the same as if z_i were generated after z_{−i}
  – Exchangeability means "treat every customer as if they were your last"
• Tables are generated and garbage-collected during sampling
• The probability of generating a new table includes the probability of generating its label
• When retracting z_i reduces the number of customers at a table to 0, garbage-collect the table
• CRPs not only estimate model parameters, they also estimate the number of components (tables)
71 / 73
A DP clustering model

• Idea: replace the multinomials with Chinese restaurants
• P(z) is a distribution over integers (clusters), generated by a CRP
• For each cluster z, run a separate Chinese restaurant for P(x | z)
• The P(x | z) are distributions over words, so they need generator distributions
  – generators could be uniform over the named entities/contexts in the training data, or
  – (n-gram) language models generating possible named entities/contexts (unbounded vocabulary)
• In a hierarchical Dirichlet process, these generators could themselves be Dirichlet processes that possibly share a common vocabulary

72 / 73
Summary: Chinese Restaurant Processes

• Chinese Restaurant Processes (CRPs) generalize Dirichlet-Multinomials to an unbounded number of outcomes
  – the concentration parameter α controls how likely a new outcome is
  – CRPs exhibit a rich-get-richer power-law behaviour
• Labeled CRPs use a base distribution to label each table
  – the base distribution can have infinite support
  – the labeled CRP concentrates mass on a countable subset
  – power-law behaviour ⇒ Zipfian distributions
73 / 73
