
Machine Learning CS 4641

Gaussian Mixture Model


Nakul Gopalan
Georgia Tech

Some of the slides are based on slides from Jiawei Han, Chao Zhang, Mahdi Roozbahani, and Barnabás Póczos.
Outline

• Overview
• Gaussian Mixture Model
• The Expectation-Maximization Algorithm
Recap

Conditional probabilities:

p(A, B) = p(A|B) p(B) = p(B|A) p(A)

Bayes rule:

p(A|B) = p(A, B) / p(B) = p(B|A) p(A) / p(B)

p(A = 1) = Σ_{i=1..K} p(A = 1, B_i) = Σ_{i=1..K} p(A = 1 | B_i) p(B_i)
                    Tomorrow = Rainy     Tomorrow = Cold     P(Today)
Today = Rainy       4/9                  2/9                 4/9 + 2/9 = 2/3
Today = Cold        2/9                  1/9                 2/9 + 1/9 = 1/3
P(Tomorrow)         4/9 + 2/9 = 2/3      2/9 + 1/9 = 1/3

P(Tomorrow = Rainy) = P(Tomorrow = Rainy, Today = Rainy) + P(Tomorrow = Rainy, Today = Cold) = 4/9 + 2/9 = 2/3
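As a quick check, the marginalization above can be reproduced directly from the joint table; a minimal Python sketch (the probabilities are the ones from the table):

```python
# Joint distribution p(Today, Tomorrow) from the table above.
joint = {
    ("Rainy", "Rainy"): 4/9, ("Rainy", "Cold"): 2/9,
    ("Cold",  "Rainy"): 2/9, ("Cold",  "Cold"): 1/9,
}

# Marginalize out Today: P(Tomorrow = Rainy) = sum over Today of p(Today, Tomorrow = Rainy).
p_tomorrow_rainy = sum(p for (today, tomorrow), p in joint.items() if tomorrow == "Rainy")
print(p_tomorrow_rainy)  # 0.666... = 2/3
```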
Hard Clustering Can Be Difficult

• Hard Clustering: K-Means, Hierarchical Clustering, DBSCAN


Towards Soft Clustering
Outline

• Overview
• Gaussian Mixture Model
• The Expectation-Maximization Algorithm
Gaussian Distribution
1-d Gaussian

N(x | μ, σ²) = (1 / √(2πσ²)) · exp( −(x − μ)² / (2σ²) )
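As a minimal sketch, this density can be evaluated directly with NumPy (the same value scipy.stats.norm.pdf would give):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """1-d Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))  # ~0.3989, the standard normal at its mean
```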
Mixture Models
• Formally, a mixture model is a weighted sum of a number of pdfs, where the
weights are determined by a distribution π: p(x) = Σ_k π_k f_k(x)

What is f in GMM?

[Figure: three component densities f₀(x), f₁(x), f₂(x), their mixture weights π₀, π₁, π₂, and the resulting mixture density p(x).]
Why is p(x) a pdf?
Why GMM?
It gives us a new pdf from which we can generate random variables: it is
a generative model.

It models each cluster (component) with its own Gaussian distribution.

So it gives us the opportunity to do inference. Soft assignment!!


Some notes:

• Is the sum of a bunch of Gaussians a Gaussian itself? (In general, no.)

• p(x) is a probability density function; it is also called a marginal
  distribution function.

• p(x) is the density of selecting a data point from the pdf created by the
  mixture model. Also, we know that the area under a density function is
  equal to 1.
Mixture Models are Generative
• Generative simply means modeling the joint probability p(x, z)

p(x) = π₀ f₀(x) + π₁ f₁(x) + ⋯ + π_K f_K(x)

Let's say f(·) is a Gaussian density:

p(x) = π₀ N(x | μ₀, σ₀) + π₁ N(x | μ₁, σ₁) + ⋯ + π_K N(x | μ_K, σ_K)

p(x) = Σ_k π_k N(x | μ_k, σ_k)

p(x) = Σ_k p(x | z_k) p(z_k)        where z_k denotes component k

p(x) = Σ_k p(x, z_k)
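The generative reading of these sums is: pick a component z with probability π_k, then draw x from that component's Gaussian. A minimal 1-d sketch, with made-up illustrative weights and component parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed for this sketch): weights pi_k, means mu_k, std devs sigma_k.
pi    = np.array([0.5, 0.3, 0.2])
mu    = np.array([-4.0, 0.0, 3.0])
sigma = np.array([1.0, 0.5, 1.5])

def sample_gmm(n):
    z = rng.choice(len(pi), size=n, p=pi)   # pick component k with probability pi_k
    x = rng.normal(mu[z], sigma[z])         # sample from that component's Gaussian
    return x, z

x, z = sample_gmm(1000)   # x is distributed as p(x) = sum_k pi_k N(x | mu_k, sigma_k)
```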
GMM with graphical model concept

[Graphical model: for each datapoint n = 1..N, a latent indicator z_n with prior π
generates the observation x_n from the component parameters μ_k, Σ_k.]

z_nk is the latent variable, in a 1-of-K representation:

p(z_nk | π) = Π_{k=1..K} π_k^{z_nk}

p(x | z_nk, π, μ, Σ) = Π_{k=1..K} N(x | μ_k, Σ_k)^{z_nk}

Given z, π, μ, and Σ, this is the probability of x in component k.
What is soft assignment?

[Figure: three Gaussian components with mixture weights π₀, π₁, π₂ and datapoints x.]

What is the probability of a datapoint x under each component?

How many components do we have here? 3

How many probability distributions? 3

What do the 3 probabilities (responsibilities) for each datapoint sum to? 1
How do we calculate the probability that a datapoint belongs to the
first component (inference)?

p(x) = π₀ N(x | μ₀, σ₀) + π₁ N(x | μ₁, σ₁) + π₂ N(x | μ₂, σ₂)
Let’s calculate the responsibility of the first component among the rest for one point x

Let’s call that 𝜏0

τ₀ = π₀ N(x | μ₀, σ₀) / [ π₀ N(x | μ₀, σ₀) + π₁ N(x | μ₁, σ₁) + π₂ N(x | μ₂, σ₂) ]

τ₀ = p(x | z₀) p(z₀) / [ p(x | z₀) p(z₀) + p(x | z₁) p(z₁) + p(x | z₂) p(z₂) ]

τ₀ = p(x, z₀) / Σ_{k=0..2} p(x, z_k) = p(x, z₀) / p(x) = p(z₀ | x)
Given a datapoint x, this is the probability that it belongs to component 0.

If we have 100 datapoints and 3 components, what is the size of τ? 100 × 3


Inferring Cluster Membership

• We have representations of the joint p(x, z_nk | θ) and the marginal p(x | θ).

• The conditional p(z_nk | x, θ) can be derived using Bayes rule:

  p(z_nk | x, θ) = p(x, z_nk | θ) / p(x | θ)

This is the responsibility that a mixture component takes for explaining an
observation x.
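A minimal NumPy sketch of this Bayes-rule computation, producing one row of responsibilities τ per datapoint and one column per component (the parameter values are placeholders):

```python
import numpy as np

def responsibilities(x, pi, mu, sigma):
    """tau[n, k] = p(z_k | x_n) for a 1-d GMM."""
    x, pi, mu, sigma = (np.asarray(a, dtype=float) for a in (x, pi, mu, sigma))
    # Joint p(x_n, z_k) = pi_k * N(x_n | mu_k, sigma_k), shape (N, K) by broadcasting.
    joint = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return joint / joint.sum(axis=1, keepdims=True)   # divide by p(x_n)

tau = responsibilities(x=[-4.2, 0.1, 2.5], pi=[0.5, 0.3, 0.2],
                       mu=[-4.0, 0.0, 3.0], sigma=[1.0, 0.5, 1.5])
print(tau.shape)        # (3, 3): 3 datapoints x 3 components
print(tau.sum(axis=1))  # each row sums to 1
```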
Mixtures of Gaussians
Probability of picking a mixture component (a Gaussian model): p(z) = π_i

AND

probability of picking data from that specific mixture component: p(x | z)

z is latent: we observe x, but z is hidden.

p(x, z) = p(x | z) p(z)   ➔ generative model, joint distribution

p(x, z_k) = π_k N(x | μ_k, σ_k)

What are the GMM parameters?
Mean μ_k, variance σ_k, and size (mixing weight) π_k for each component.

Marginal probability distribution

p(x | θ) = Σ_k p(x, z_k | θ) = Σ_k p(x | z_k, θ) p(z_k | θ) = Σ_k π_k N(x | μ_k, σ_k)

Here N(x | μ_k, σ_k) plays the role of f_k(x) and π_k is its weight.

p(z_k | θ) = π_k                      Select a mixture component with probability π_k.

p(x | z_k, θ) = N(x | μ_k, σ_k)       Sample from that component's Gaussian.

How about a GMM for a multimodal distribution?
Gaussian Mixture Model
Why have a "latent variable"?
• A variable can be unobserved (latent) because:

  - It is an imaginary quantity meant to provide a simplified and abstract
    view of the data-generation process.
    e.g., speech recognition models, mixture models (soft clustering), ...

  - It is a real-world object and/or phenomenon, but difficult or impossible
    to measure.
    e.g., the temperature of a star, causes of a disease, evolutionary ancestors, ...

  - It is a real-world object and/or phenomenon, but sometimes wasn't
    measured, because of faulty sensors, etc.

• Discrete latent variables can be used to partition/cluster data into sub-groups.

• Continuous latent variables (factors) can be used for dimensionality
  reduction (factor analysis, etc.).
Latent variable representation

p(x | θ) = Σ_k p(x, z_nk | θ) = Σ_k p(z_nk | θ) p(x | z_nk, θ) = Σ_k π_k N(x | μ_k, Σ_k)

p(z_nk | θ) = Π_{k=1..K} π_k^{z_nk}

p(x | z_nk, θ) = Π_{k=1..K} N(x | μ_k, Σ_k)^{z_nk}

Why have the latent variable?

The distribution we can model using a mixture of Gaussian components is much
more expressive than what we could model using a single component.
Well, we don't know π_k, μ_k, Σ_k.
What should we do?
We use a method called Maximum Likelihood Estimation (MLE) to solve the problem.
p(x) = p(x | θ) = Σ_k p(x, z_k | θ) = Σ_k p(z_k | θ) p(x | z_k, θ) = Σ_{k=1..K} π_k N(x | μ_k, Σ_k)

Let's identify a likelihood function. Why?

Because we use the likelihood function to optimize the probabilistic model's
parameters!

arg max_θ p(x | θ) = p(x | π, μ, Σ) = Π_{n=1..N} p(x_n | θ) = Π_{n=1..N} Σ_{k=1..K} π_k N(x_n | μ_k, Σ_k)

Taking the log:

ln p(x | θ) = ln p(x | π, μ, Σ) = Σ_{n=1..N} ln [ Σ_{k=1..K} π_k N(x_n | μ_k, Σ_k) ]
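This log-likelihood is the objective the rest of the lecture maximizes; a minimal sketch of evaluating it for a 1-d GMM (parameter values are placeholders):

```python
import numpy as np

def gmm_log_likelihood(x, pi, mu, sigma):
    """ln p(x | pi, mu, sigma) = sum_n ln( sum_k pi_k N(x_n | mu_k, sigma_k) )."""
    x, pi, mu, sigma = (np.asarray(a, dtype=float) for a in (x, pi, mu, sigma))
    comp = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return np.log(comp.sum(axis=1)).sum()

ll = gmm_log_likelihood(x=[-4.2, 0.1, 2.5, 3.3],
                        pi=[0.5, 0.3, 0.2], mu=[-4.0, 0.0, 3.0], sigma=[1.0, 0.5, 1.5])
```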

• As usual: Identify a likelihood function

• And set partials to zero…


Maximum Likelihood of a GMM

• Optimization of means.

Maximum Likelihood of a GMM

• Optimization of covariance
Maximum Likelihood of a GMM

• Optimization of mixing term

MLE of a GMM

Not a closed-form solution!

The responsibilities γ(z_nk) (the τ values) are not known exactly.
What next?
Outline

• Overview
• Gaussian Mixture Model
• The Expectation-Maximization Algorithm
EM for GMMs

• E-step: Evaluate the responsibilities using the current parameter values:

  γ(z_nk) = π_k N(x_n | μ_k, Σ_k) / Σ_{j=1..K} π_j N(x_n | μ_j, Σ_j)


EM for GMMs

• M-Step: Re-estimate the parameters using the current responsibilities:

  N_k = Σ_{n=1..N} γ(z_nk)
  μ_k = (1 / N_k) Σ_{n=1..N} γ(z_nk) x_n
  Σ_k = (1 / N_k) Σ_{n=1..N} γ(z_nk) (x_n − μ_k)(x_n − μ_k)ᵀ
  π_k = N_k / N


Expectation Maximization
• Expectation Maximization (EM) is a general algorithm for dealing with
  hidden variables.

• Two steps:
  E-Step: Fill in the hidden values using inference.
  M-Step: Apply the standard MLE method to estimate the parameters.

• EM always converges to a local maximum of the likelihood.
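Putting the two steps together, here is a minimal 1-d EM sketch following the formulas above (an illustrative implementation, not the course's reference code); it alternates the E-step and M-step until the log-likelihood stops improving:

```python
import numpy as np

def em_gmm_1d(x, K, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Initialization: equal weights, K distinct datapoints as means, global variance.
    pi  = np.full(K, 1.0 / K)
    mu  = rng.choice(x, size=K, replace=False)
    var = np.full(K, x.var())
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: tau[n, k] = pi_k N(x_n | mu_k, var_k) / sum_j pi_j N(x_n | mu_j, var_j)
        dens  = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        joint = pi * dens
        tau   = joint / joint.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        Nk  = tau.sum(axis=0)
        mu  = (tau * x[:, None]).sum(axis=0) / Nk
        var = (tau * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi  = Nk / N
        # Convergence check on the log-likelihood (evaluated with the E-step's parameters).
        ll = np.log(joint.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, var, tau

# Usage: x is a 1-d array of datapoints; each row of tau is one point's soft assignment.
# pi_hat, mu_hat, var_hat, tau = em_gmm_1d(x, K=3)
```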


EM for Gaussian Mixture Model: Example
Demo

• Demo link: https://lukapopijac.github.io/gaussian-mixture-model/
EM Algorithm for GMM (matrix form)

Book : C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
EM Algorithm for GMM (matrix form)

[Matrix-form expressions for the responsibilities γ(z_nk) and the parameter updates.]

Book : C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
Relationship to K-means

• K-means makes hard decisions.
  Each data point gets assigned to a single cluster.

• GMM/EM makes soft decisions.
  Each data point yields a posterior p(z|x).

• K-means is a special case of EM: fit a GMM whose components share a
  spherical covariance εI; as ε → 0 the responsibilities become hard 0/1
  assignments and the EM updates reduce to the K-means updates.
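To see the hard vs. soft distinction concretely, here is a small scikit-learn sketch: KMeans returns one cluster id per point, while GaussianMixture returns a full posterior p(z|x) per point (the data below is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Illustrative 2-d data: two well-separated blobs.
X = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
               rng.normal([4, 4], 1.0, size=(200, 2))])

hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_posteriors = gmm.predict_proba(X)   # shape (400, 2): responsibilities p(z | x)

print(hard_labels[:3])       # one cluster id per point
print(soft_posteriors[:3])   # each row sums to 1
```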


General form of EM
• Given a joint distribution p(x, z | θ) over observed variables x and latent
  variables z:

• We want to maximize the likelihood p(x | θ) with respect to θ.

1. Initialize the parameters θ_old.

2. E-Step: Evaluate p(z | x, θ_old).

3. M-Step: Re-estimate the parameters based on the expectation of the
   complete-data log likelihood:

   θ_new = argmax_θ E_{p(z | x, θ_old)}[ ln p(x, z | θ) ]

4. Check for convergence of the parameters or the likelihood; if not
   converged, set θ_old ← θ_new and return to step 2.


Maximizing the lower bound on the likelihood will lead to maximizing the
likelihood itself:

ln p(x | θ) ≥ L(q, θ) = Σ_z q(z) ln [ p(x, z | θ) / q(z) ]
                      = E_q[ ln p(x, z | θ) ] − E_q[ ln q(z) ]

The first term is the expected complete-data log likelihood and the second
term, which does not depend on θ, is the entropy of q.

Thus, in the M-step, maximizing with respect to θ for fixed q, we only need
to consider the first term:

θ_new = argmax_θ E_q[ ln p(x, z | θ) ]
EM for Gaussian Mixture Model: Example

covariance_type="diag" or "spherical" or "full"

Source: Python Data Science Handbook by Jake VanderPlas
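The covariance_type setting above is scikit-learn's knob for the shape of each component's covariance Σ_k; a minimal sketch of fitting the same illustrative data with each option:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], [1.0, 0.3], size=(300, 2)),   # elongated blob
               rng.normal([5, 5], [0.5, 1.5], size=(300, 2))])

for cov_type in ["spherical", "diag", "full"]:
    gmm = GaussianMixture(n_components=2, covariance_type=cov_type, random_state=0).fit(X)
    # "spherical": one variance per component; "diag": one variance per dimension;
    # "full": a complete covariance matrix per component.
    print(cov_type, gmm.covariances_.shape)
# spherical -> (2,), diag -> (2, 2), full -> (2, 2, 2)
```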


Silhouette Coefficient

[Figure: a datapoint X_i with its mean intra-cluster distance μ_in(X_i) and
its mean distances μ_out1(X_i), μ_out2(X_i) to the other clusters.]

μ_out^min(X_i) = min{ μ_out2(X_i), μ_out1(X_i) }
Silhouette Coefficient

The Silhouette Coefficient for a clustering C:

s(X_i) = ( μ_out^min(X_i) − μ_in(X_i) ) / max{ μ_in(X_i), μ_out^min(X_i) }

SC(C) = (1 / N) Σ_i s(X_i)

SC close to 1 implies a good clustering (points are close to their own
cluster but far from other clusters).
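In practice the Silhouette Coefficient can be computed with scikit-learn from the hard labels of a fitted GMM; a minimal sketch on illustrative data:

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
               rng.normal([6, 6], 1.0, size=(200, 2))])

labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
sc = silhouette_score(X, labels)   # mean s(X_i) over all points, in [-1, 1]
print(sc)                          # close to 1 for well-separated clusters
```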
Take-Home Messages

• The generative process of a Gaussian Mixture Model


• Inferring cluster membership based on a learned GMM
• The general idea of Expectation-Maximization
• Expectation-Maximization for GMM
• Silhouette Coefficient
