Johnson11MLSS Talk Extras
Mark Johnson
Macquarie University
Sydney, Australia
Random variables and “distributed according to” notation
• A probability distribution F is a non-negative function from
some set X whose values sum (integrate) to 1
• A random variable X is distributed according to a distribution
F, or more simply, X has distribution F, written X ∼ F, iff:
P( X = x ) = F ( x ) for all x
(This is for discrete RVs).
• You’ll sometimes see the notation
X|Y ∼ F
which means “X is generated conditional on Y with
distribution F” (where F usually depends on Y), i.e.,
P( X | Y ) = F ( X | Y )
Outline
Mixture models
Bayes’ rule
Computing the normalising constant
Bayesian belief updating
• Idea: treat posterior from last observation as the prior for next
• Consistency follows because the likelihood factors
  ▶ Suppose d = (d1, d2). Then the posterior of a hypothesis h is:

        P(h | d1, d2) ∝ P(h) P(d1, d2 | h)
                      = P(h) P(d1 | h) P(d2 | h, d1)
                      ∝ P(h | d1) P(d2 | h, d1)

    where P(h | d1) is the updated prior and P(d2 | h, d1) is the likelihood
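The factorisation above is easy to check numerically. Below is a minimal sketch (not from the slides) that updates a three-hypothesis coin model one observation at a time and compares the result with a single batch update; the hypotheses and data are invented for illustration.

    # A minimal sketch: sequential Bayesian updating over a small discrete
    # hypothesis space, checking that updating one observation at a time
    # gives the same posterior as conditioning on all the data at once.
    import numpy as np

    hypotheses = np.array([0.2, 0.5, 0.8])   # P(heads) under each hypothesis h
    prior = np.array([1/3, 1/3, 1/3])        # P(h)
    data = [1, 0, 1, 1]                      # invented coin flips (1 = heads)

    def likelihood(h_heads, x):
        return h_heads if x == 1 else 1.0 - h_heads

    # Sequential updating: yesterday's posterior is today's prior.
    posterior = prior.copy()
    for x in data:
        posterior = posterior * np.array([likelihood(h, x) for h in hypotheses])
        posterior /= posterior.sum()

    # Batch updating: condition on all the data in one step.
    batch = prior * np.prod([[likelihood(h, x) for h in hypotheses] for x in data], axis=0)
    batch /= batch.sum()

    print(posterior)   # identical (up to rounding) ...
    print(batch)       # ... because the likelihood factors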
Discrete distributions
Multinomial distributions
      P(N | n, θ) = ( n! / ∏_{j=1}^m N_j! ) ∏_{j=1}^m θ_j^{N_j}
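As a quick sanity check, here is a minimal sketch (not from the slides) that evaluates this probability directly from the formula; the counts and θ are invented for illustration.

    # Multinomial probability P(N | n, θ) = (n! / ∏_j N_j!) ∏_j θ_j^{N_j}
    from math import factorial
    from functools import reduce

    def multinomial_pmf(N, theta):
        n = sum(N)
        coef = factorial(n) / reduce(lambda a, b: a * b, (factorial(Nj) for Nj in N), 1)
        prob = reduce(lambda a, b: a * b, (t ** Nj for Nj, t in zip(N, theta)), 1.0)
        return coef * prob

    print(multinomial_pmf([2, 1, 3], [0.2, 0.3, 0.5]))   # 60 * 0.2^2 * 0.3 * 0.5^3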
Dirichlet distributions
• Dirichlet distributions are probability distributions over
multinomial parameter vectors
  ▶ called Beta distributions when m = 2
      Dir(θ | α) = (1 / C(α)) ∏_{j=1}^m θ_j^{α_j − 1}

      C(α) = ∫_Δ ∏_{j=1}^m θ_j^{α_j − 1} dθ = ∏_{j=1}^m Γ(α_j) / Γ(∑_{j=1}^m α_j)

  i.e., P(θ | α) = ( Γ(∑_{k=1}^m α_k) / ∏_{j=1}^m Γ(α_j) ) ∏_{k=1}^m θ_k^{α_k − 1}
  [Plot: Dir(θ | α) densities for m = 2, shown as P(θ₁ | α) against θ₁ (the probability of outcome 1), for α = (1,1), (5,2) and (0.1,0.1)]
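To recreate curves like the ones above, a minimal sketch (not from the slides) using scipy follows; the θ₁ grid is deliberately coarse so the printed output stays short.

    # Dirichlet density for m = 2 (i.e. a Beta density) evaluated along θ₁
    import numpy as np
    from scipy.stats import dirichlet

    theta1 = np.linspace(0.01, 0.99, 5)          # grid over θ₁; θ₂ = 1 − θ₁
    for alpha in [(1.0, 1.0), (5.0, 2.0), (0.1, 0.1)]:
        dens = [dirichlet.pdf([t, 1.0 - t], alpha) for t in theta1]
        print(alpha, np.round(dens, 3))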
Dirichlet distributions as priors for θ
• Generative model:
      θ | α ∼ Dir(α)
      Xi | θ ∼ Discrete(θ),   i = 1, . . . , n

  [Plate diagram: θ generates Xi, plate over i = 1, . . . , n]
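A minimal sampling sketch (not from the slides) of this generative model; the α, n and random seed are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = np.array([1.0, 1.0, 1.0])            # Dirichlet prior parameters
    theta = rng.dirichlet(alpha)                 # θ | α ~ Dir(α)
    X = rng.choice(len(alpha), size=20, p=theta) # X_i | θ ~ Discrete(θ)
    print(theta, np.bincount(X, minlength=len(alpha)))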
Inference for θ with Dirichlet priors
• Data X = (X1, . . . , Xn) generated i.i.d. from Discrete(θ)
• Prior is Dir(α). By Bayes' rule, the posterior is:

      P(θ | X, α) ∝ P(X | θ) P(θ | α) ∝ ∏_{j=1}^m θ_j^{N_j + α_j − 1}

  i.e., the posterior is Dir(N + α), where N_j is the number of times outcome j
  appears in X
The posterior mode of a Dirichlet
• The Maximum a posteriori (MAP) or posterior mode is
      Ĥ = argmax_H P(H | D) = argmax_H P(D | H) P(H)

• For a Dir(α) distribution (with α_j > 1), the mode is:

      θ̂_j = (α_j − 1) / ∑_{j′=1}^m (α_{j′} − 1)

• The expected value (mean) is:

      E_{Dir(α)}[θ_j] = α_j / ∑_{j′=1}^m α_{j′}

  so with a Dir(α) prior and counts N, the posterior mean is:

      E_{Dir(N+α)}[θ_j] = (N_j + α_j) / (n + ∑_{j′=1}^m α_{j′})

• Recall θ | α ∼ Dir(α) iff P(θ | α) = (1 / C(α)) ∏_{j=1}^m θ_j^{α_j − 1}, where:

      C(α) = ∏_{j=1}^m Γ(α_j) / Γ(∑_{j=1}^m α_j)

• Integrating out θ gives the marginal probability of the data:

      P(X | α) = ∫_Δ P(X | θ) P(θ | α) dθ = C(N + α) / C(α)

• Collapsed Gibbs samplers and the Chinese Restaurant Process rely
  on this result
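The quantities above are easy to compute in log space. Below is a minimal sketch (not from the slides) with invented counts and α; gammaln is used because C(α) overflows quickly for realistic counts.

    import numpy as np
    from scipy.special import gammaln

    alpha = np.array([1.0, 2.0, 2.0])
    N = np.array([3, 0, 5])                      # observed outcome counts
    n = N.sum()

    def log_C(a):
        # log C(α) = ∑_j log Γ(α_j) − log Γ(∑_j α_j)
        return gammaln(a).sum() - gammaln(a.sum())

    post_mean = (N + alpha) / (n + alpha.sum())                   # E_{Dir(N+α)}[θ]
    post_mode = (N + alpha - 1) / (n + alpha.sum() - len(alpha))  # mode of Dir(N+α), needs N_j + α_j > 1
    log_marginal = log_C(N + alpha) - log_C(alpha)                # log P(X | α) = log C(N+α)/C(α)

    print(post_mean, post_mode, np.exp(log_marginal))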
Predictive distribution for Dirichlet-Multinomial

      P(X_{n+1} = j | X, α) = E_{Dir(N+α)}[θ_j] = (N_j + α_j) / (n + ∑_{j′=1}^m α_{j′})
Example: rolling a die
• Data d = (2, 5, 4, 2, 6)
  [Plot: P(θ₂ | α) as the Dirichlet parameters are updated after each observation:
   α = (1,1,1,1,1,1), (1,2,1,1,1,1), (1,2,1,1,2,1), (1,2,1,2,2,1), (1,3,1,2,2,1), (1,3,1,2,2,2);
   x-axis is θ₂, the probability of side 2]
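A minimal sketch (not from the slides) of the die example, adding one observation at a time to the Dirichlet parameters exactly as in the figure above.

    import numpy as np

    alpha = np.ones(6)                 # uniform Dir(1,...,1) prior over the 6 sides
    for side in (2, 5, 4, 2, 6):       # data d from the slide
        alpha[side - 1] += 1           # posterior after this roll is Dir(alpha)
        print(alpha, "E[P(side 2)] =", alpha[1] / alpha.sum())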
Inference in complex models
• If the model is simple enough we can calculate the posterior
exactly (conjugate priors)
• When the model is more complicated, we can only
approximate the posterior
• Variational Bayes calculates the function closest to the posterior
  within a class of functions
• Sampling algorithms produce samples from the posterior
  distribution
  ▶ Markov chain Monte Carlo algorithms (MCMC) use a Markov chain whose
    stationary distribution is the posterior
Mixture models
      Xi | Zi, θ ∼ F(θ_{Z_i}),   i = 1, . . . , n

  [Plate diagram: Z_i → X_i with component parameters θ = (θ_1, …, θ_ℓ), plate over the n observations]
Applications of mixture models
• Blind source separation: data Xi come from ℓ different sources
  ▶ Which Xi come from which source?
    (Zi specifies the source of Xi)
  ▶ What are the sources?
Dirichlet Multinomial mixtures
      φ | β ∼ Dir(β)
      Zi | φ ∼ Discrete(φ),              i = 1, . . . , n
      θk | α ∼ Dir(α),                   k = 1, . . . , ℓ
      Xi,j | Zi, θ ∼ Discrete(θ_{Z_i}),  i = 1, . . . , n;  j = 1, . . . , di
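A minimal sketch (not from the slides) that samples from this generative model; the number of components, vocabulary size, document lengths and hyperparameters are all arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    ell, m, n = 3, 8, 5                        # components, vocabulary size, documents
    beta = np.ones(ell)
    alpha = np.ones(m)
    doc_lengths = rng.integers(3, 7, size=n)   # d_i

    phi = rng.dirichlet(beta)                        # φ | β ~ Dir(β)
    theta = rng.dirichlet(alpha, size=ell)           # θ_k | α ~ Dir(α), k = 1..ℓ
    Z = rng.choice(ell, size=n, p=phi)               # Z_i | φ ~ Discrete(φ)
    X = [rng.choice(m, size=d, p=theta[z])           # X_{i,j} | Z_i, θ ~ Discrete(θ_{Z_i})
         for z, d in zip(Z, doc_lengths)]
    print(Z)
    print(X)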
Why sample?
• Setup: Bayes net has variables X, whose value x we observe,
and variables Y, whose value we don’t know
  ▶ Y includes any parameters we want to estimate, such as θ
  ▶ π^{(t)} = P^t π^{(0)}, the state distribution after t steps of a Markov
    chain with transition probability matrix (tpm) P
Ergodicity
• A Markov chain with tpm P is ergodic iff there is a positive
integer m s.t. all elements of Pm are positive (i.e., there is an
m-step path between any two states)
• Informally, an ergodic Markov chain “forgets” its past states
• Theorem: For each homogeneous ergodic Markov chain with tpm P there is a
  unique limiting distribution DP, i.e., as n approaches infinity, the
  distribution of Sn converges to DP
• DP is called the stationary distribution of the Markov chain
• Let π be the vector representation of DP, i.e., DP(y) = π_y. Then:

      π = P π,   and
      π = lim_{n→∞} P^n π^{(0)}   for every initial distribution π^{(0)}
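A minimal sketch (not from the slides) illustrating the limit above by power iteration on an invented 3-state ergodic chain; the matrix is column-stochastic to match the π = Pπ convention.

    import numpy as np

    P = np.array([[0.9, 0.2, 0.1],     # P[y', y] = P(next state = y' | current = y)
                  [0.05, 0.7, 0.3],
                  [0.05, 0.1, 0.6]])
    pi = np.array([1.0, 0.0, 0.0])     # an arbitrary initial distribution π^(0)
    for _ in range(200):
        pi = P @ pi                    # π^(t+1) = P π^(t)
    print(pi)                          # ≈ the stationary distribution, independent of π^(0)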
Using a Markov chain for inference of P(Y )
  ▶ For t = 0, . . . , n − 1:
The Gibbs sampler
• The Gibbs sampler is useful when:
  ▶ Y is multivariate, i.e., Y = (Y1, . . . , Ym), and
  ▶ it is easy to sample from P(Yj | Y_{−j})
• The Gibbs sampler for P(Y) is the tpm P = ∏_{j=1}^m P^{(j)}, where:

      P^{(j)}_{y′,y} = 0                                 if y′_{−j} ≠ y_{−j}
      P^{(j)}_{y′,y} = P(Yj = y′_j | Y_{−j} = y_{−j})    if y′_{−j} = y_{−j}
A simple example of Gibbs sampling
      P(Y1, Y2) = c   if |Y1| < 5, |Y2| < 5 and |Y1 − Y2| < 1
                = 0   otherwise

  Sample run (alternating updates of Y1 and Y2):
      Y1      Y2
      0.363  −0.119
      0.363   0.146
     −0.681   0.146
     −0.681  −1.551

  [Scatter plot of the sample path over −5 ≤ Y1, Y2 ≤ 5]
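A minimal sketch (not from the slides) of this Gibbs sampler; because the target is uniform on a band, each conditional is uniform on an interval, as the comments note.

    import numpy as np

    rng = np.random.default_rng(2)
    y1, y2 = 0.0, 0.0                      # any point inside the support works
    samples = []
    for _ in range(1000):
        # P(Y2 | Y1) is Uniform(max(−5, Y1 − 1), min(5, Y1 + 1))
        y2 = rng.uniform(max(-5.0, y1 - 1.0), min(5.0, y1 + 1.0))
        # P(Y1 | Y2) is Uniform(max(−5, Y2 − 1), min(5, Y2 + 1))
        y1 = rng.uniform(max(-5.0, y2 - 1.0), min(5.0, y2 + 1.0))
        samples.append((y1, y2))
    print(samples[:4])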
A non-ergodic Gibbs sampler

      P(Y1, Y2) = c   if 1 < Y1, Y2 < 5 or −5 < Y1, Y2 < −1
                = 0   otherwise

• The Gibbs sampler for P(Y1, Y2), initialized at (2,2), samples
  repeatedly from:
      P(Y2 | Y1) = Uniform(1, 5)
      P(Y1 | Y2) = Uniform(1, 5)
  I.e., never visits the negative values of Y1, Y2

  Sample run:
      Y1     Y2
      2      2
      2      2.72
      2.84   2.72
      2.84   4.71
      2.63   4.71
      2.63   4.52
      1.11   4.52

  [Scatter plot of the sample path over −5 ≤ Y1, Y2 ≤ 5]
Why does the Gibbs sampler work?
Gibbs sampling with Bayes nets
Dirichlet-Multinomial mixtures
      φ | β ∼ Dir(β)
      Zi | φ ∼ Discrete(φ),              i = 1, . . . , n
      θk | α ∼ Dir(α),                   k = 1, . . . , ℓ
      Xi,j | Zi, θ ∼ Discrete(θ_{Z_i}),  i = 1, . . . , n;  j = 1, . . . , di

  [Plate diagram: hyperparameters β, α; parameters φ and θ_k (k = 1, …, ℓ); Z_i and X_{i,j} inside plates over the n documents and the d_i words of each]

      P(φ, Z, θ, X | α, β) = (1 / C(β)) ∏_{k=1}^ℓ φ_k^{β_k − 1 + N_k(Z)}
                             × ∏_{k=1}^ℓ (1 / C(α)) ∏_{j=1}^m θ_{k,j}^{α_j − 1 + ∑_{i:Z_i=k} N_j(X_i)}

      where C(α) = ∏_{j=1}^m Γ(α_j) / Γ(∑_{j=1}^m α_j)
Gibbs sampling for D-M mixtures
      φ | β ∼ Dir(β)
      Zi | φ ∼ Discrete(φ),              i = 1, . . . , n
      θk | α ∼ Dir(α),                   k = 1, . . . , ℓ
      Xi,j | Zi, θ ∼ Discrete(θ_{Z_i}),  i = 1, . . . , n;  j = 1, . . . , di

• Conditionals used by the (uncollapsed) Gibbs sampler:

      P(φ | Z, β) = Dir(φ; β + N(Z))

      P(Zi = k | φ, θ, X_i) ∝ φ_k ∏_{j=1}^m θ_{k,j}^{N_j(X_i)}

• Collapsing out φ and θ gives:

      P(Zi = k | Z_{−i}, X, α, β)
        ∝ (N_k(Z_{−i}) + β_k) / (n − 1 + β_•)
          × C(α + ∑_{i′≠i: Z_{i′}=k} N(X_{i′}) + N(X_i)) / C(α + ∑_{i′≠i: Z_{i′}=k} N(X_{i′}))
Topic modeling of child-directed speech
Uncollapsed Gibbs sampler for topic model
• Data consists of “documents” X_i
• Each X_i is a sequence of “words” X_{i,j}
• Initialize by randomly assigning each document X_i to a topic Z_i
• Repeat the following (a code sketch follows below):
  ▶ Replace φ with a sample from a Dirichlet with parameters β + N(Z)
  ▶ For each topic k, replace θ_k with a sample from a Dirichlet with
    parameters α + ∑_{i:Z_i=k} N(X_i)
  ▶ For each document i, replace Z_i with a sample from
      P(Zi = k | φ, θ, X_i) ∝ φ_k ∏_{j=1}^m θ_{k,j}^{N_j(X_i)}

  [Plate diagram: β, α → φ, θ_k → Z_i, X_{i,j}, with plates over the ℓ topics, the n documents and the d_i words of each]
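A minimal sketch (not from the slides) of the uncollapsed sampler just described, written with numpy; the toy corpus, vocabulary size and hyperparameters are invented, and documents are simply lists of word ids.

    import numpy as np

    rng = np.random.default_rng(3)

    def uncollapsed_gibbs(docs, m, ell, alpha=1.0, beta=1.0, iters=100):
        n = len(docs)
        # N(X_i): word-count vector for each document
        counts = np.array([np.bincount(d, minlength=m) for d in docs])
        Z = rng.integers(ell, size=n)              # random initial topic assignments
        for _ in range(iters):
            # Resample φ | Z, β ~ Dir(β + N(Z))
            phi = rng.dirichlet(beta + np.bincount(Z, minlength=ell))
            # Resample θ_k | Z, X, α ~ Dir(α + Σ_{i:Z_i=k} N(X_i)) for each topic k
            theta = np.array([rng.dirichlet(alpha + counts[Z == k].sum(axis=0))
                              for k in range(ell)])
            # Resample each Z_i from P(Z_i = k | φ, θ, X_i) ∝ φ_k Π_j θ_{k,j}^{N_j(X_i)}
            log_p = np.log(phi) + counts @ np.log(theta).T   # shape (n, ℓ)
            p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            Z = np.array([rng.choice(ell, p=p[i]) for i in range(n)])
        return Z, phi, theta

    docs = [[0, 0, 1], [0, 1, 1, 2], [3, 4, 4], [4, 3, 3, 3]]   # toy "documents"
    print(uncollapsed_gibbs(docs, m=5, ell=2)[0])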
Collapsed Gibbs sampler for topic model
      P(Zi = k | Z_{−i}, X, α, β)
        ∝ (N_k(Z_{−i}) + β_k) / (n − 1 + β_•)
          × C(α + ∑_{i′≠i: Z_{i′}=k} N(X_{i′}) + N(X_i)) / C(α + ∑_{i′≠i: Z_{i′}=k} N(X_{i′}))
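A minimal sketch (not from the slides) of this collapsed update, again on an invented toy corpus; the constant denominator (n − 1 + β•) is dropped because it cancels when the probabilities are normalised.

    import numpy as np
    from scipy.special import gammaln

    rng = np.random.default_rng(4)

    def log_C(a):
        # log C(α) = Σ_j log Γ(α_j) − log Γ(Σ_j α_j)
        return gammaln(a).sum() - gammaln(a.sum())

    def collapsed_gibbs(docs, m, ell, alpha=1.0, beta=1.0, iters=100):
        n = len(docs)
        counts = np.array([np.bincount(d, minlength=m) for d in docs])  # N(X_i)
        Z = rng.integers(ell, size=n)
        for _ in range(iters):
            for i in range(n):
                log_p = np.empty(ell)
                for k in range(ell):
                    # word counts of the other documents currently in topic k
                    other = counts[(Z == k) & (np.arange(n) != i)].sum(axis=0)
                    # (n − 1 + β•) is the same for every k, so it drops out
                    log_p[k] = (np.log(np.sum(Z[np.arange(n) != i] == k) + beta)
                                + log_C(alpha + other + counts[i])
                                - log_C(alpha + other))
                p = np.exp(log_p - log_p.max())
                Z[i] = rng.choice(ell, p=p / p.sum())
        return Z

    docs = [[0, 0, 1], [0, 1, 1, 2], [3, 4, 4], [4, 3, 3, 3]]   # toy "documents"
    print(collapsed_gibbs(docs, m=5, ell=2))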
Topics assigned after 100 iterations
1 big drum ?
3 horse .
8 who is that ?
9 those are checkers .
3 two checkers # yes .
1 play checkers ?
1 big horn ?
2 get over # Mommy .
1 shadow ?
9 I like it .
1 why don’t you read Shadow yourself ?
9 that’s a terribly small horse for you to ride .
2 why don’t you look at some of the toys in the basket .
1 want to ?
1 do you want to see what I have ?
8 what is that ?
2 not in your mouth .
2 let me put them together .
2 no # put floor .
3 no # that’s his pencil .
Most probable words in each cluster
P(Z=4) = 0.4334 P(Z=9) = 0.3111 P(Z=7) = 0.2555 P(Z=3) = 5.003e-
X P(X|Z) X P(X|Z) X P(X|Z) X P(X|Z)
. 0.12526 ? 0.19147 . 0.2258 quack 0.85
# 0.045402 you 0.062577 # 0.0695 . 0.15
you 0.040475 what 0.061256 that’s 0.034538
the 0.030259 that 0.022295 a 0.034066
it 0.024154 the 0.022126 no 0.02649
I 0.021848 # 0.021809 oh 0.023558
to 0.018473 is 0.021683 yeah 0.020332
don’t 0.015473 do 0.016127 the 0.014907
a 0.013662 it 0.015927 xxx 0.014288
? 0.013459 a 0.015092 not 0.013864
in 0.011708 to 0.013783 it’s 0.013343
on 0.011064 did 0.012631 ? 0.013033
your 0.010145 are 0.011427 yes 0.011795
and 0.009578 what’s 0.011195 right 0.0094166
that 0.0093303 your 0.0098961 alright 0.0088953
have 0.0088019 huh 0.0082591 is 0.0087975
no 0.0082514 want 0.0076782 you’re 0.0076571
put 0.0067486 where 0.0072346 one 0.006647
Remarks on cluster results
• The samplers cluster words by clustering the documents they
appear in, and cluster documents by clustering the words that
appear in them
• Even though there were ℓ = 10 clusters and α = 1, β = 1,
  typically only 4 clusters were occupied after convergence
• Words x with high marginal probability P(X = x) are
  typically so frequent that they occur in all clusters
⇒ Listing the most probable words in each cluster may not be a
  good way of characterizing the clusters
• Instead, we can Bayes invert and find the words that are most
  strongly associated with each class:

      P(Z = k | X = x) = (N_{k,x}(Z, X) + e) / (N_x(X) + e ℓ)

  where e is a small smoothing constant
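A minimal sketch (not from the slides) of this "Bayes inversion": given an invented topic-by-word count matrix, it ranks words by their smoothed purity P(Z = k | X = x).

    import numpy as np

    N = np.array([[50, 3, 0],          # invented counts N_{k,x}(Z, X) for ℓ = 2 topics
                  [ 5, 2, 9]])
    vocab = ["you", "huh", "quack"]
    e, ell = 0.1, N.shape[0]

    purity = (N + e) / (N.sum(axis=0) + e * ell)   # P(Z = k | X = x) for each k, x
    for k in range(ell):
        order = np.argsort(-purity[k])
        print(k, [(vocab[x], round(purity[k, x], 3)) for x in order])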
Purest words of each cluster
P(Z=4) = 0.4334 P(Z=9) = 0.3111 P(Z=7) = 0.2555 P(Z=3) = 5.0
X P(Z|X) X P(Z|X) X P(Z|X) X P(Z|
I’ll 0.97168 d(o) 0.97138 0 0.94715 quack 0.64
we’ll 0.96486 what’s 0.95242 mmhm 0.944 . 0.00
c(o)me 0.95319 what’re 0.94348 www 0.90244
you’ll 0.95238 happened 0.93722 m:hm 0.83019
may 0.94845 hmm 0.93343 uhhuh 0.81667
let’s 0.947 whose 0.92437 uh(uh) 0.78571
thought 0.94382 what 0.9227 uhuh 0.77551
won’t 0.93645 where’s 0.92241 that’s 0.7755
come 0.93588 doing 0.90196 yep 0.76531
let 0.93255 where’d 0.9009 um 0.76282
I 0.93192 don’t] 0.89157 oh+boy 0.73529
(h)ere 0.93082 whyn’t 0.89157 d@l 0.72603
stay 0.92073 who 0.88527 goodness 0.7234
later 0.91964 how’s 0.875 s@l 0.72
thank 0.91667 who’s 0.85068 sorry 0.70588
them 0.9124 [: 0.85047 thank+you 0.6875
can’t 0.90762 ? 0.84783 o:h 0.68
never 0.9058 matter 0.82963 nope 0.67857
Summary
Bayesian inference for Dirichlet-multinomials
Dirichlet-multinomials with many outcomes
• Predictive probability:
From Dirichlet-multinomials to Chinese Restaurant Processes

• Suppose the number of outcomes is unbounded,
  but we pick the event labels
• If we number event types in order of occurrence
  ⇒ Chinese Restaurant Process

      Z1 = 1
      P(Z_{n+1} = k | Z_{1:n}, α) ∝  n_k(Z_{1:n})   if k ≤ m = max(Z_{1:n})
                                     α              if k = m + 1
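A minimal sketch (not from the slides) that draws a sequence Z₁, …, Zₙ from this process; α and the sequence length are arbitrary.

    import numpy as np

    def sample_crp(n, alpha, rng):
        Z = [1]                                   # Z_1 = 1
        counts = [1]                              # n_k(Z_{1:n}) for each table k
        for _ in range(n - 1):
            weights = np.array(counts + [alpha], dtype=float)
            k = rng.choice(len(weights), p=weights / weights.sum()) + 1
            if k == len(counts) + 1:
                counts.append(1)                  # sit at a new table
            else:
                counts[k - 1] += 1                # join an existing table
            Z.append(k)
        return Z

    print(sample_crp(20, alpha=1.0, rng=np.random.default_rng(5)))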
Chinese Restaurant Process (0)
Chinese Restaurant Process (1)
Chinese Restaurant Process (2)
  [Diagram: one occupied table with 1 customer; a new table has weight α]
Chinese Restaurant Process (3)
  [Diagram: one occupied table with 2 customers; a new table has weight α]
Chinese Restaurant Process (4)
  [Diagram: two occupied tables with 2 and 1 customers; a new table has weight α]
Pitman-Yor Process (0)
  [Diagrams: Pitman-Yor seating weights, where a table with n_k customers gets
   weight n_k − a and a new table gets weight Ka + b when K tables are occupied
   (the figures show 1 − a, a + b; 2 − a, a + b; and 2 − a, 1 − a, 2a + b),
   followed by labelled Chinese-restaurant configurations with tables labelled
   "fish" and "apple", their customer counts, and weight α for a new table]
• Exchangeability means that when resampling zi we can treat customer i as if it
  were “your last” customer to enter the restaurant
• Tables are generated and garbage-collected during sampling
• The probability of generating a new table includes the
probability of generating its label
• When retracting zi reduces the number of customers at a table
to 0, garbage-collect the table
• CRPs not only estimate model parameters, they also estimate
the number of components (tables)
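A minimal sketch (not from the slides) of the retract / garbage-collect / reseat cycle for a plain CRP; for simplicity only the CRP prior term is used when reseating, whereas a real sampler would also include the probability of the table's label or data, as the bullets above note.

    import numpy as np
    from collections import Counter

    def resample_customer(i, z, alpha, rng):
        counts = Counter(z)
        counts[z[i]] -= 1                     # retract customer i
        if counts[z[i]] == 0:
            del counts[z[i]]                  # garbage-collect the now-empty table
        tables = list(counts)
        new_table = (max(tables) + 1) if tables else 1
        weights = np.array([counts[t] for t in tables] + [alpha], dtype=float)
        choice = rng.choice(len(weights), p=weights / weights.sum())
        z[i] = tables[choice] if choice < len(tables) else new_table  # reseat

    rng = np.random.default_rng(6)
    z = [1, 1, 2, 2, 3]                       # current table assignments
    for i in range(len(z)):
        resample_customer(i, z, alpha=1.0, rng=rng)
    print(z)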
A DP clustering model
• Idea: replace multinomials with Chinese restaurants
• P(z) is a distribution over integers (clusters), generated by a
CRP
• For each cluster z, run a separate Chinese restaurant for P(x | z)
• The P(x | z) are distributions over words, so they need generator
  distributions
  ▶ generators could be uniform over the named
Summary: Chinese Restaurant Processes
  ▶ CRPs are useful when the number of possible outcomes is unbounded; a new
    outcome is generated with probability proportional to α
  ▶ CRPs exhibit a rich-get-richer power-law behaviour