Gaussian Mixture Models
Thomas Bonald
Telecom ParisTech
[email protected]
January 2019
In this note, we present a clustering technique based on the Gaussian mixture model. Data samples
are assumed to be generated by a mixture of k Gaussian distributions, whose parameters are estimated by
an iterative method known as Expectation-Maximization (the EM algorithm). We show that the k-means
algorithm corresponds to the particular case where all Gaussian distributions are assumed to have the same
diagonal covariance matrix, with infinitely small variance.
1 Gaussian mixture model
Gaussian distribution. Consider a random vector X of dimension d with Gaussian distribution of mean µ and covariance matrix Σ:
X ∼ N (µ, Σ).
The random vector X has a density f if and only if its covariance matrix is invertible, in which case:
f(x) = (1/√((2π)^d |Σ|)) exp(−(1/2)(x − µ)^T Σ^{−1}(x − µ)),
where |Σ| denotes the determinant of the covariance matrix Σ.
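As a quick illustration, here is a minimal NumPy sketch of this density (the function name gaussian_density and the example parameters are ours):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, assuming Sigma is invertible."""
    d = len(mu)
    diff = x - mu
    # Work with the log-density for numerical safety:
    # -1/2 * (d log(2 pi) + log|Sigma| + (x - mu)^T Sigma^{-1} (x - mu))
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * (d * np.log(2 * np.pi) + logdet + quad))

# Standard bivariate Gaussian evaluated at the origin: 1 / (2 pi) ≈ 0.159
print(gaussian_density(np.zeros(2), np.zeros(2), np.eye(2)))
```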
Gaussian mixture model. Now consider k such distributions, with respective density functions f1 , . . . , fk
and respective parameters (µ1 , Σ1 ), . . . , (µk , Σk ). Let π1 , . . . , πk be any probability distribution on {1, . . . , k}.
Select the j-th distribution with probability πj and draw X from it, that is:
P(Z = j) = πj ,   X | Z = j ∼ N (µj , Σj ), (1)
where Z denotes the (latent) index of the selected distribution. The density of X is then:
pθ(x) = Σ_{j=1}^k πj fj(x), (2)
with parameter θ = (π1 , . . . , πk , µ1 , Σ1 , . . . , µk , Σk ).
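A minimal NumPy sketch of the generative process (1), with illustrative parameters of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary two-component mixture in dimension 2, for illustration only.
pi = np.array([0.3, 0.7])
mus = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])

def sample_mixture(n):
    """Draw n samples: first the latent index Z ~ pi, then X | Z = j ~ N(mu_j, Sigma_j)."""
    z = rng.choice(len(pi), size=n, p=pi)
    x = np.array([rng.multivariate_normal(mus[j], Sigmas[j]) for j in z])
    return x, z

X, Z = sample_mixture(1000)
print(X.shape, np.bincount(Z) / len(Z))  # empirical mixing proportions, close to pi
```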
2 Maximum likelihood
Assume we seek to estimate the parameter θ based on n i.i.d. samples x1 , . . . , xn of the Gaussian mixture
model. Denoting by x the vector (x1 , . . . , xn ), we get the likelihood:
pθ(x) = ∏_{i=1}^n pθ(xi),
and the log-likelihood ℓ(θ) = log pθ(x).
In view of (2),
ℓ(θ) = Σ_{i=1}^n log ( Σ_{j=1}^k πj fj(xi) ). (3)
The maximum likelihood estimator of θ is then the solution to the optimization problem:
max_θ ℓ(θ). (4)
This problem is hard to solve in practice, even numerically, since the function θ ↦ −ℓ(θ) is not convex in general.
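A minimal sketch of the computation of the log-likelihood (3) in NumPy/SciPy, using the log-sum-exp trick for numerical stability (the function name is ours):

```python
import numpy as np
from scipy.special import logsumexp

def log_likelihood(X, pi, mus, Sigmas):
    """Log-likelihood (3) of the data X (shape (n, d)) under a k-component Gaussian mixture."""
    n, d = X.shape
    k = len(pi)
    log_prob = np.empty((n, k))
    for j in range(k):
        diff = X - mus[j]                                   # (n, d)
        _, logdet = np.linalg.slogdet(Sigmas[j])
        quad = np.einsum('ni,ni->n', diff, np.linalg.solve(Sigmas[j], diff.T).T)
        # log(pi_j f_j(x_i)) for all samples i
        log_prob[:, j] = np.log(pi[j]) - 0.5 * (d * np.log(2 * np.pi) + logdet + quad)
    return logsumexp(log_prob, axis=1).sum()
```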
3 Expectation-Maximization
The Expectation-Maximization (EM) algorithm is an iterative method for finding a local maximum of the
likelihood. This technique applies to any mixture model (in fact, any model with latent variables). It is
based on the observation that problem (4) would be easy to solve if the latent variables z1 , . . . , zn , used
implicitly in (1) to generate the data samples x1 , . . . , xn , were known.
Latent variables. Let z be the vector of latent variables (z1 , . . . , zn ). In view of (2), the Gaussian mixture
model is the marginal distribution of the joint distribution:
pθ(x, z) = pθ(z)pθ(x|z),
with
pθ(z) = ∏_{i=1}^n pθ(zi) = ∏_{i=1}^n πzi    and    pθ(x|z) = ∏_{i=1}^n pθ(xi |zi) = ∏_{i=1}^n fzi(xi).
Given the latent variables, the log-likelihood becomes:
ℓ(θ; z) = log pθ(x, z) = Σ_{i=1}^n log πzi + Σ_{i=1}^n log fzi(xi),
so that each set of parameters (µ1 , Σ1 ), . . . , (µk , Σk ) can be estimated separately using the corresponding
samples (we refer the reader to Appendix A for the maximum likelihood estimator of a Gaussian distribution).
Specifically, the log-likelihood ℓ(θ; z) is maximized by the empirical mixing distribution (π̂1 , . . . , π̂k ):
∀j = 1, . . . , k,   π̂j = nj / n, (5)
and the empirical means and covariance matrices (µ̂1 , Σ̂1 ), . . . , (µ̂k , Σ̂k ):
∀j = 1, . . . , k,   µ̂j = (1/nj) Σ_{i=1}^n 1{zi =j} xi ,   Σ̂j = (1/nj) Σ_{i=1}^n 1{zi =j} (xi − µ̂j)(xi − µ̂j)^T, (6)
where
nj = Σ_{i=1}^n 1{zi =j}
is the number of samples generated according to the j-th Gaussian distribution (which we assume positive).
The key problem is that the latent variables z1 , . . . , zn are unknown.
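If the labels z were known, the estimates (5) and (6) could be computed directly; a minimal NumPy sketch (the function name is ours, and the labels are taken in {0, . . . , k − 1}):

```python
import numpy as np

def complete_data_mle(X, z, k):
    """Estimates (5)-(6) given the data X (shape (n, d)) and known labels z in {0, ..., k-1}."""
    n, d = X.shape
    pi_hat = np.empty(k)
    mu_hat = np.empty((k, d))
    Sigma_hat = np.empty((k, d, d))
    for j in range(k):
        mask = (z == j)
        nj = mask.sum()                      # number of samples of the j-th distribution (assumed > 0)
        pi_hat[j] = nj / n                   # (5)
        mu_hat[j] = X[mask].mean(axis=0)     # (6), empirical mean
        diff = X[mask] - mu_hat[j]
        Sigma_hat[j] = diff.T @ diff / nj    # (6), empirical covariance matrix
    return pi_hat, mu_hat, Sigma_hat
```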
Estimation of the latent variables. The conditional distribution of the latent variables given the data
samples follows from:
pθ (x, z) = pθ (x)pθ (z|x). (7)
Since:
pθ(x, z) = ∏_{i=1}^n πzi fzi(xi),
we get:
pθ(z|x) = ∏_{i=1}^n pθ(zi |xi) ∝ ∏_{i=1}^n πzi fzi(xi). (8)
Now given some initial parameter θ0 , we can use the corresponding distribution of the latent variables
given by (8) to get the expected log-likelihood of x:
ℓθ0(θ) = Σ_z pθ0(z|x) ℓ(θ; z) = Σ_{j=1}^k Σ_{i=1}^n pij (log πj + log fj(xi)),
where pij = pθ0(zi = j|xi) is the conditional probability that sample i was generated by the j-th Gaussian distribution, given by (8).
This expected log-likelihood is maximized by the empirical mixing distribution (π̂1 , . . . , π̂k ):
∀j = 1, . . . , k,   π̂j = nj / n, (9)
and the empirical means and covariance matrices (µ̂1 , Σ̂1 ), . . . , (µ̂k , Σ̂k ):
∀j = 1, . . . , k,   µ̂j = (1/nj) Σ_{i=1}^n pij xi ,   Σ̂j = (1/nj) Σ_{i=1}^n pij (xi − µ̂j)(xi − µ̂j)^T, (10)
where
nj = Σ_{i=1}^n pij
is the expected number of samples generated according to the j-th Gaussian distribution.
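A minimal NumPy/SciPy sketch of one EM iteration, computing the probabilities pij from (8) and the updates (9)-(10) (the function names are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mus, Sigmas):
    """Probabilities p_ij = p(z_i = j | x_i), from (8)."""
    weighted = np.stack([pi[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j])
                         for j in range(len(pi))], axis=1)      # (n, k), entries pi_j f_j(x_i)
    return weighted / weighted.sum(axis=1, keepdims=True)

def m_step(X, p):
    """Updates (9)-(10) from the probabilities p, of shape (n, k)."""
    n, d = X.shape
    nj = p.sum(axis=0)                                          # expected cluster sizes
    pi = nj / n                                                 # (9)
    mus = (p.T @ X) / nj[:, None]                               # (10), weighted means
    Sigmas = np.empty((len(nj), d, d))
    for j in range(len(nj)):
        diff = X - mus[j]
        Sigmas[j] = (p[:, j, None] * diff).T @ diff / nj[j]     # (10), weighted covariance matrices
    return pi, mus, Sigmas
```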
Thus, starting from some initial parameter θ0 , we can compute the conditional distribution of the latent
variables, given the data samples, and deduce a new estimate of the parameter, θ1 = (π̂, µ̂, Σ̂). By successive
iterations, we obtain a sequence of parameters θ0 , θ1 , θ2 , . . . which is expected to converge to a good approximation of the optimal parameter θ⋆ (i.e., the one solving problem (4)). We shall prove in the next section that the corresponding sequence of log-likelihoods ℓ(θ0 ), ℓ(θ1 ), ℓ(θ2 ), . . . is non-decreasing, which guarantees that the EM algorithm converges to a local maximum of the likelihood. This is not the global maximum of the likelihood in general.
EM algorithm. The algorithm is summarized below. The outcome is a soft clustering of the data samples, with pij the probability that sample i belongs to cluster j. A regular (hard) clustering can be obtained by selecting for each sample i the cluster j maximizing pij.
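A minimal Python/NumPy sketch of one possible implementation (the function name em_gmm, the iteration cap and the stopping test on cluster assignments are our choices; the initialization follows the choices discussed below):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """EM algorithm for a Gaussian mixture with k components.

    Returns the estimated parameters (pi, mus, Sigmas) and the soft clustering p of shape (n, k).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Initialization: uniform mixing distribution, means sampled among the data samples,
    # covariance matrices sigma^2 I with sigma^2 the average square distance to the center.
    pi = np.full(k, 1 / k)
    mus = X[rng.choice(n, size=k, replace=False)]
    sigma2 = np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1))
    Sigmas = np.array([sigma2 * np.eye(d) for _ in range(k)])

    labels = None
    for _ in range(n_iter):
        # E-step: p_ij proportional to pi_j f_j(x_i), see (8).
        weighted = np.stack([pi[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j])
                             for j in range(k)], axis=1)
        p = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step: updates (9)-(10).
        nj = p.sum(axis=0)
        pi = nj / n
        mus = (p.T @ X) / nj[:, None]
        for j in range(k):
            diff = X - mus[j]
            Sigmas[j] = (p[:, j, None] * diff).T @ diff / nj[j]

        # Stopping criterion: the main cluster of each sample no longer changes.
        new_labels = p.argmax(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels

    return pi, mus, Sigmas, p
```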
A key issue is the choice of the initial parameter θ0 , and more specifically of the initial means µ1 , . . . , µk , corresponding to the cluster centers. In the above sketch, these are obtained by randomly sampling k values among the data samples x1 , . . . , xn , as in the k-means algorithm. Since this initial choice has a strong impact on the final result, several independent instances of the algorithm can be run, and the best instance, i.e., the one maximizing (3), selected in the end. Another common strategy consists in
selecting the cluster centers far from one another, as in the k-means++ algorithm.
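A sketch of such an initialization in the spirit of k-means++ (the D²-weighted sampling below is one common variant; the note only refers to the k-means++ algorithm):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """Cluster centers sampled far from one another, in the spirit of k-means++."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(X)
    centers = [X[rng.integers(n)]]          # first center chosen uniformly at random
    for _ in range(k - 1):
        # squared distance of each sample to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # sample the next center with probability proportional to this squared distance
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```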
The choice of the initial values of the covariance matrices is also critical. Here σ 2 is chosen as the average
square distance between data points, in view of the equality
(1/n) Σ_{i=1}^n ||xi − x̄||² = (1/(2n²)) Σ_{i,j=1}^n ||xi − xj||².
If σ² is much larger than the typical square distance between data samples, then the probability distributions p1 , . . . , pn tend to be close to uniform, and the means µ1 , . . . , µk all converge to x̄, the center of the data samples.
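A quick NumPy check of this choice of σ² and of the above identity (the random data are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# sigma^2 as the average square distance to the center of the data samples...
sigma2 = np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1))

# ...coincides with the sum of square distances between data points divided by 2 n^2.
pairwise = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
print(sigma2, pairwise.sum() / (2 * len(X) ** 2))
```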
The number of iterations depends on the stopping criterion. For instance, one may decide that convergence has occurred whenever the main cluster of each data sample (the cluster maximizing pij for sample i) remains unchanged.
The complexity of the algorithm is O(nk) per iteration, which may be prohibitive for large values of k. The complexity may be reduced to O(nm), for some integer m < k, by looking for the m nearest clusters of each data sample, using some appropriate data structure.
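As an illustration of this idea, a sketch restricting each sample to its m nearest cluster centers with a k-d tree (the choice of scipy.spatial.cKDTree is ours; the note only mentions an appropriate data structure):

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_clusters(X, mus, m):
    """Indices of the m cluster centers closest to each sample, shape (n, m)."""
    tree = cKDTree(mus)
    _, idx = tree.query(X, k=m)
    return idx.reshape(len(X), m)

# In the E-step, p_ij can then be computed for these m clusters only and set to 0 for the others.
```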
4 Convergence
In view of (7), the log-likelihood can be written, for any parameter θt :
ℓ(θ) = Σ_z pθt(z|x) (log pθ(x, z) − log pθ(z|x)) = ℓθt(θ) − Σ_z pθt(z|x) log pθ(z|x),
so that:
ℓ(θ) − ℓ(θt) = ℓθt(θ) − ℓθt(θt) + D(θt ||θ),
where
D(θt ||θ) = Σ_z pθt(z|x) log ( pθt(z|x) / pθ(z|x) )
is the Kullback-Leibler divergence between the probability distributions pθt (z|x) and pθ (z|x). This quantity
is non-negative (this is Gibbs' inequality, see Appendix B). Since θt+1 maximizes the expected log-likelihood ℓθt(θ), we have:
ℓθt(θt+1) ≥ ℓθt(θt),
and we get:
ℓ(θt+1) − ℓ(θt) = ℓθt(θt+1) − ℓθt(θt) + D(θt ||θt+1) ≥ D(θt ||θt+1) ≥ 0,
showing that the corresponding sequence of log-likelihoods ℓ(θ0 ), ℓ(θ1 ), ℓ(θ2 ), . . . is non-decreasing and thus
converges.
5 Comparison with k-means
Consider the Gaussian mixture model with common covariance matrix σ 2 I, for some parameter σ > 0, and
uniform mixing distribution:
∀j = 1, . . . , k,   πj = 1/k,
with:
fj(x) = (2πσ²)^{−d/2} exp(−||x − µj||²/(2σ²)).
The variance σ 2 is assumed to be known so that the only parameter is θ = µ, the vector of means. We refer
to this model as the symmetric Gaussian mixture model.
In this case, the E-step (8) gives pij ∝ exp(−||xi − µj||²/(2σ²)), and the M-step reduces to the update of the means by (10). When σ → 0, pij → 1{j=l} , where l is the index for which the distance ||xi − µl || is minimum (assuming this index is unique). The algorithm is then exactly k-means.
In general, the algorithm provides a soft clustering, with the parameter σ controlling the spread of each
cluster. When σ 2 → +∞, each sample belongs to each cluster with the same probability 1/k.
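A small NumPy sketch of the soft assignment in this symmetric model, illustrating the role of σ (the example values are ours):

```python
import numpy as np

def soft_assignment(X, mus, sigma):
    """p_ij proportional to exp(-||x_i - mu_j||^2 / (2 sigma^2)) in the symmetric model."""
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # squared distances, shape (n, k)
    logits = -d2 / (2 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)                 # for numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [2.9, 3.1]])
mus = np.array([[0.0, 0.0], [3.0, 3.0]])
print(soft_assignment(X, mus, sigma=0.1))    # nearly hard assignment, as in k-means
print(soft_assignment(X, mus, sigma=100.0))  # nearly uniform, each probability close to 1/2
```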
6 A simple Gaussian mixture model
Finally, consider the Gaussian mixture model where all covariance matrices are diagonal. This is a good trade-
off between the Gaussian mixture model (with O(kd2 ) parameters to be learned, where d is the dimension
of the data samples) and the k-means algorithm, corresponding to the simplistic case of a uniform mixing
distribution and covariance matrices equal to σ 2 I for some fixed, small parameter σ 2 . The EM algorithm is
the same as above, with the update of the covariance matrices replaced by:
Σj ← Σj + pij diag((xi − µj )2 ),
where (xi − µj )2 refers to the vector equal to the square of vector xi − µj componentwise. The dependency
across dimensions is no longer taken into account, but the algorithm is more robust in that the inversion of
the covariance matrices is straightforward.
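A minimal NumPy sketch of the corresponding update of the covariance matrices, storing only the diagonals and normalizing by nj as in (10) (the function name is ours):

```python
import numpy as np

def m_step_diagonal_cov(X, p, mus):
    """Diagonal covariance update: Sigma_j = diag( sum_i p_ij (x_i - mu_j)^2 / n_j )."""
    nj = p.sum(axis=0)                                  # expected cluster sizes, shape (k,)
    k, d = mus.shape
    variances = np.empty((k, d))
    for j in range(k):
        diff2 = (X - mus[j]) ** 2                       # componentwise squares (x_i - mu_j)^2
        variances[j] = (p[:, j, None] * diff2).sum(axis=0) / nj[j]
    return variances                                    # row j is the diagonal of Sigma_j
```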
Further reading
• The initial paper on the EM algorithm [Dempster et al., 1977].
• A concise tutorial on the EM algorithm and variants [Roche, 2011].
Appendix
A Maximum likelihood for the Gaussian model
Consider the Gaussian model in dimension d:
pθ(x) = (1/√((2π)^d |Σ|)) exp(−(1/2)(x − µ)^T Σ^{−1}(x − µ)),
with parameter θ = (µ, Σ). For n i.i.d. samples x1 , . . . , xn of the distribution, we get:
ℓ(θ) = Σ_{i=1}^n log pθ(xi),
that is:
ℓ(θ) = c − (n/2) log |Σ| − (1/2) Σ_{i=1}^n (xi − µ)^T Σ^{−1}(xi − µ),
with c = −(nd/2) log(2π).
The gradient in µ is the vector:
∂ℓ(θ)/∂µ = Σ_{i=1}^n Σ^{−1}(xi − µ) = Σ^{−1} Σ_{i=1}^n (xi − µ),
which is equal to zero for µ = µ̂ = (1/n) Σ_{i=1}^n xi , the sample mean.
Now let Λ = Σ−1 . Since |Λ| = |Σ|−1 , we can rewrite the log-likelihood as:
ℓ(θ) = c + (n/2) log |Λ| − (1/2) Σ_{i=1}^n (xi − µ)^T Λ (xi − µ).
We obtain¹:
∂ℓ(θ)/∂Λ = (n/2) Λ^{−1} − (1/2) Σ_{i=1}^n (xi − µ)(xi − µ)^T,
which is equal to zero for Σ = Λ^{−1} = (1/n) Σ_{i=1}^n (xi − µ)(xi − µ)^T; with µ = µ̂, this is the empirical covariance matrix.
B Kullback-Leibler divergence
Let p and q be two probability distributions on the same finite set, with respective probabilities pj and qj . The Kullback-Leibler divergence of p with respect to q is defined by:
D(p||q) = Σ_j pj log(pj /qj ).
It is a measure of how the probability distributions p and q differ. Observe that D(p||q) = +∞ whenever qj = 0 while pj > 0 for some j. We have:
D(p||q) ≥ 0,
with equality if and only if p = q. This is Gibbs’ inequality, which follows from Jensen’s inequality on
observing that:
D(p||q) = E[log(pZ /qZ )] = −E[log(qZ /pZ )] ≥ − log E[qZ /pZ ] = − log 1 = 0,
where the expectation is taken over Z, a random variable having distribution p. Since the log is strictly concave, the inequality is an equality if and only if qj /pj is a constant, for each j such that pj > 0, which means that p = q.
References
[Dempster et al., 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological).
[Roche, 2011] Roche, A. (2011). EM algorithm and variants: An informal tutorial. arXiv:1105.1476.
¹ For any matrix A, the gradient of the determinant |A| in A is the comatrix of A. In particular, the gradient of log |A| in A is A^{−1} for any symmetric positive definite matrix A. Moreover, for any vectors u, v, the gradient of u^T Av in A is the matrix uv^T.