
Expectation-Maximization for the Gaussian Mixture Model

Thomas Bonald
Telecom ParisTech
[email protected]
January 2019

In this note, we present a clustering technique based on the Gaussian mixture model. Data samples
are assumed to be generated by a mixture of k Gaussian distributions, whose parameters are estimated by
an iterative method known as Expectation-Maximization (the EM algorithm). We show that the k-means
algorithm corresponds to the particular case where all Gaussian distributions are assumed to have the same
diagonal covariance matrix, with infinitely small variance.

1 Gaussian mixture model


Gaussian model. The Gaussian distribution of some d-dimensional vector X is characterized by its mean
µ and its covariance matrix Σ. We use the following notation:

X ∼ N (µ, Σ).

The random vector X has a density f if and only if its covariance matrix is invertible, in which case:
\[
f(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)},
\]

where |Σ| is the determinant of Σ.

Gaussian mixture model. Now consider k such distributions, with respective density functions f1 , . . . , fk
and respective parameters (µ1 , Σ1 ), . . . , (µk , Σk ). Let π1 , . . . , πk be any probability distribution on {1, . . . , k}.
Select the j-th distribution with probability πj , that is:

X ∼ N (µZ , ΣZ ) with Z ∼ π. (1)

The vector X has density:
\[
p_\theta(x) = \sum_{j=1}^k \pi_j f_j(x), \tag{2}
\]

where the parameter θ = (π, µ, Σ), illustrated by the sampling sketch after the list, consists of:


• the mixing distribution π = (π1 , . . . , πk ),

• the set of means µ = (µ1 , . . . , µk ),


• the set of covariance matrices Σ = (Σ1 , . . . , Σk ).
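
As a concrete illustration of the generative model (1), here is a minimal sampling sketch in Python; the values of π, µ and Σ are arbitrary, chosen only for the example:

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative parameters for a 2-dimensional mixture with k = 3 components.
    pi = np.array([0.5, 0.3, 0.2])                                   # mixing distribution
    mu = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])              # means
    Sigma = np.array([np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)])  # covariance matrices

    def sample_gmm(n):
        # Draw Z ~ pi, then X ~ N(mu_Z, Sigma_Z), as in (1).
        z = rng.choice(len(pi), size=n, p=pi)
        x = np.array([rng.multivariate_normal(mu[zi], Sigma[zi]) for zi in z])
        return x, z

    x, z = sample_gmm(1000)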

2 Maximum likelihood
Assume we seek to estimate the parameter θ based on n i.i.d. samples x1 , . . . , xn of the Gaussian mixture
model. Denoting by x the vector (x1 , . . . , xn ), we get the likelihood:
\[
p_\theta(x) = \prod_{i=1}^n p_\theta(x_i),
\]

and the log-likelihood:


\[
\ell(\theta) = \log p_\theta(x) = \sum_{i=1}^n \log p_\theta(x_i).
\]

In view of (2),
\[
\ell(\theta) = \sum_{i=1}^n \log\left( \sum_{j=1}^k \pi_j f_j(x_i) \right). \tag{3}
\]

The maximum likelihood estimator is:


\[
\theta^\star = \arg\max_\theta \ell(\theta). \tag{4}
\]

This problem is hard to solve in practice, even numerically, since the function θ ↦ −ℓ(θ) is not convex in
general.
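
For reference, the log-likelihood (3) of any candidate parameter θ can still be evaluated numerically; a minimal sketch, with x an n × d array of samples and pi, mu, Sigma arrays of illustrative shapes (k,), (k, d) and (k, d, d):

    import numpy as np
    from scipy.stats import multivariate_normal

    def log_likelihood(x, pi, mu, Sigma):
        # densities[i, j] = f_j(x_i); the log-likelihood (3) is sum_i log(sum_j pi_j f_j(x_i)).
        densities = np.column_stack(
            [multivariate_normal.pdf(x, mean=mu[j], cov=Sigma[j]) for j in range(len(pi))]
        )
        return float(np.sum(np.log(densities @ pi)))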

3 Expectation-Maximization
The Expectation-Maximization (EM) algorithm is an iterative method for finding a local maximum of the
likelihood. This technique applies to any mixture model (in fact, any model with latent variables). It is
based on the observation that the problem (4) would be easy to solve if the latent variables z1 , . . . , zn , used
implicitly in (1) to generate the data samples x1 , . . . , xn , were known.

Latent variables. Let z be the vector of latent variables (z1 , . . . , zn ). In view of (2), the Gaussian mixture
model is the marginal distribution of the joint distribution:

\[
p_\theta(x, z) = p_\theta(z)\, p_\theta(x \mid z),
\]

with
\[
p_\theta(z) = \prod_{i=1}^n p_\theta(z_i) = \prod_{i=1}^n \pi_{z_i}
\quad \text{and} \quad
p_\theta(x \mid z) = \prod_{i=1}^n p_\theta(x_i \mid z_i) = \prod_{i=1}^n f_{z_i}(x_i).
\]
Given the latent variables, the log-likelihood becomes:
\[
\ell(\theta; z) = \log p_\theta(x, z) = \sum_{i=1}^n \log \pi_{z_i} + \sum_{i=1}^n \log f_{z_i}(x_i),
\]

so that each set of parameters (µ1 , Σ1 ), . . . , (µk , Σk ) can be estimated separately using the corresponding
samples (we refer the reader to Appendix A for the maximum likelihood estimator of a Gaussian distribution).
Specifically, the log-likelihood ℓ(θ; z) is maximum for the empirical mixing distribution (π̂1 , . . . , π̂k ):
\[
\forall j = 1, \dots, k, \quad \hat{\pi}_j = \frac{n_j}{n}, \tag{5}
\]
and the empirical means and covariance matrices (µ̂1 , Σ̂1 ), . . . , (µ̂k , Σ̂k ):
\[
\forall j = 1, \dots, k, \quad
\hat{\mu}_j = \frac{1}{n_j} \sum_{i=1}^n 1_{\{z_i = j\}}\, x_i, \qquad
\hat{\Sigma}_j = \frac{1}{n_j} \sum_{i=1}^n 1_{\{z_i = j\}}\, (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^T, \tag{6}
\]

where
\[
n_j = \sum_{i=1}^n 1_{\{z_i = j\}}
\]

is the number of samples generated according to the j-th Gaussian distribution (which we assume positive).
The key problem is that the latent variables z1 , . . . , zn are unknown.
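
If the latent variables were observed, the estimators (5) and (6) could be computed directly; a minimal sketch, assuming x is an n × d array and z an array of labels in {0, . . . , k − 1}:

    import numpy as np

    def fit_known_labels(x, z, k):
        # Maximum likelihood estimates (5)-(6) when the latent variables z are observed.
        n, d = x.shape
        pi_hat = np.zeros(k)
        mu_hat = np.zeros((k, d))
        Sigma_hat = np.zeros((k, d, d))
        for j in range(k):
            xj = x[z == j]                 # samples generated by component j
            nj = len(xj)                   # n_j, assumed positive
            pi_hat[j] = nj / n
            mu_hat[j] = xj.mean(axis=0)
            centered = xj - mu_hat[j]
            Sigma_hat[j] = centered.T @ centered / nj
        return pi_hat, mu_hat, Sigma_hat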

Estimation of the latent variables. The conditional distribution of the latent variables given the data
samples follows from:
\[
p_\theta(x, z) = p_\theta(x)\, p_\theta(z \mid x). \tag{7}
\]
Since:
\[
p_\theta(x, z) = \prod_{i=1}^n \pi_{z_i} f_{z_i}(x_i),
\]

we get:
\[
p_\theta(z \mid x) = \prod_{i=1}^n p_\theta(z_i \mid x_i) \propto \prod_{i=1}^n \pi_{z_i} f_{z_i}(x_i).
\]

In particular, the probability that sample i comes from distribution j is:

\[
p_{ij} \propto \pi_j f_j(x_i). \tag{8}
\]

Now given some initial parameter θ0 , we can use the corresponding distribution of the latent variables
given by (8) to get the expected log-likelihood of x:
\[
\ell_{\theta_0}(\theta) = \sum_z p_{\theta_0}(z \mid x)\, \ell(\theta; z)
= \sum_{j=1}^k \sum_{i=1}^n p_{ij} \left( \log \pi_j + \log f_j(x_i) \right).
\]

This expected log-likelihood is maximum for the empirical mixing distribution (π̂1 , . . . , π̂k ):
\[
\forall j = 1, \dots, k, \quad \hat{\pi}_j = \frac{n_j}{n}, \tag{9}
\]

and the empirical means and covariance matrices (µ̂1 , Σ̂1 ), . . . , (µ̂k , Σ̂k ):
\[
\forall j = 1, \dots, k, \quad
\hat{\mu}_j = \frac{1}{n_j} \sum_{i=1}^n p_{ij}\, x_i, \qquad
\hat{\Sigma}_j = \frac{1}{n_j} \sum_{i=1}^n p_{ij}\, (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^T, \tag{10}
\]

where
\[
n_j = \sum_{i=1}^n p_{ij}
\]

is the expected number of samples generated according to the j-th Gaussian distribution.
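
The two steps can be written compactly; a minimal sketch of the expectation step (8) and of the maximization step (9)-(10), with illustrative function names, x an n × d array and p the n × k matrix of probabilities pij:

    import numpy as np
    from scipy.stats import multivariate_normal

    def e_step(x, pi, mu, Sigma):
        # Responsibilities p_ij proportional to pi_j f_j(x_i), normalized over j (eq. (8)).
        k = len(pi)
        p = np.column_stack(
            [pi[j] * multivariate_normal.pdf(x, mean=mu[j], cov=Sigma[j]) for j in range(k)]
        )
        return p / p.sum(axis=1, keepdims=True)

    def m_step(x, p):
        # Weighted updates (9)-(10), with n_j = sum_i p_ij.
        n, d = x.shape
        nj = p.sum(axis=0)                       # expected cluster sizes
        pi = nj / n
        mu = (p.T @ x) / nj[:, None]
        Sigma = np.empty((len(nj), d, d))
        for j in range(len(nj)):
            centered = x - mu[j]
            Sigma[j] = (p[:, j, None] * centered).T @ centered / nj[j]
        return pi, mu, Sigma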
Thus, starting from some initial parameter θ0 , we can compute the conditional distribution of the latent
variables given the data samples, and deduce a new estimate of the parameter, θ1 = (π̂, µ̂, Σ̂). By successive
iterations, we obtain a sequence of parameters θ0 , θ1 , θ2 , . . . which is expected to converge to a good
approximation of the optimal parameter θ⋆ (i.e., the solution of problem (4)). We shall prove in the next section
that the corresponding sequence of log-likelihoods ℓ(θ0 ), ℓ(θ1 ), ℓ(θ2 ), . . . is non-decreasing, which guarantees
that the EM algorithm converges to a local maximum of the likelihood. This is not the global maximum of the
likelihood in general.

EM algorithm. A pseudo-code of the algorithm is shown below. The outcome is a soft clustering of
the data samples, with pij the probability that sample i belongs to cluster j. A regular (hard) clustering can be
obtained by selecting for each sample i the cluster j maximizing pij .

Algorithm 1: EM algorithm for the Gaussian mixture model

Input: Data samples x1 , . . . , xn ; number of clusters k
Output: p1 , . . . , pn , probability distributions of samples 1, . . . , n over the k clusters
 1  Sample random values µ1 , . . . , µk from x1 , . . . , xn
 2  x̄ ← (1/n) ∑i xi
 3  σ² ← (1/n) ∑i ||xi − x̄||²
 4  for j = 1 to k do
 5      Σj ← (σ²/k) I
 6      πj ← 1
 7  while no convergence do
        // Expectation
 8      for i = 1 to n do
 9          s ← 0
10          for j = 1 to k do
11              pij ← (πj / √|Σj|) exp(−(1/2)(xi − µj)^T Σj^{-1} (xi − µj))
12              s ← s + pij
13          for j = 1 to k do
14              pij ← pij / s
        // Maximization
15      for j = 1 to k do
16          πj ← 0
17          µj ← 0
18          for i = 1 to n do
19              πj ← πj + pij
20              µj ← µj + pij xi
21          µj ← µj / πj
22          Σj ← 0
23          for i = 1 to n do
24              Σj ← Σj + pij (xi − µj)(xi − µj)^T
25          Σj ← Σj / πj

A key problem is the choice of the initial parameter θ0 , and more specifically of the initial means
µ1 , . . . , µk , corresponding to the cluster centers. In the above pseudo-code, these are obtained by randomly
sampling k values among the data samples x1 , . . . , xn , as in the k-means algorithm. Since this initial
choice has a strong impact on the final result, several independent instances of the algorithm can be run,
the best instance, i.e., that maximizing (3), being selected eventually (see the sketch below). Another common
strategy consists in selecting the cluster centers far from one another, as in the k-means++ algorithm.
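
A possible sketch of the restart strategy; em_gmm stands for any EM routine returning (pi, mu, Sigma, p) for a given seed, and log_likelihood for the log-likelihood (3), e.g. the earlier sketch (both are passed as arguments, not defined here):

    import numpy as np

    def best_of_restarts(x, k, em_gmm, log_likelihood, n_restarts=10):
        # Run several independent instances and keep the one maximizing (3).
        best, best_ll = None, -np.inf
        for seed in range(n_restarts):
            pi, mu, Sigma, p = em_gmm(x, k, seed)
            ll = log_likelihood(x, pi, mu, Sigma)
            if ll > best_ll:
                best_ll, best = ll, (pi, mu, Sigma, p)
        return best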
The choice of the initial values of the covariance matrices is also critical. Here σ 2 is chosen as the average
square distance between data points, in view of the equality
\[
\frac{1}{n} \sum_{i=1}^n \|x_i - \bar{x}\|^2 = \frac{1}{2n^2} \sum_{i,j=1}^n \|x_i - x_j\|^2.
\]

If σ² is much larger than the typical square distance between data samples, then the probability distributions
p1 , . . . , pn tend to be close to uniform, and the means µ1 , . . . , µk all converge to x̄, the center of the data samples.
The number of iterations depends on the stopping criterion. For instance, one may decide that convergence
has occurred whenever the main cluster of each data sample (that maximizing pij for data sample i)
remains unchanged.
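
A minimal sketch of this stopping test, given the responsibility matrices of two consecutive iterations:

    import numpy as np

    def converged(p_old, p_new):
        # Stopping test: the main cluster of every sample (argmax_j p_ij) is unchanged.
        return np.array_equal(p_old.argmax(axis=1), p_new.argmax(axis=1))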
The complexity of the algorithm is in O(nk) per iteration, which may be prohibitive for large values of
k. The complexity may be reduced to O(nm), for some integer m, by looking for the m nearest clusters of
each data sample, using some appropriate data structure.

4 Convergence
In view of (7), the log-likelihood can be written:

\[
\ell(\theta) = \log p_\theta(x, z) - \log p_\theta(z \mid x), \tag{11}
\]

whenever pθ (x, z) > 0.


Now let θt be the estimate of θ at step t of the algorithm. Since the equality (11) holds for each value of z,
provided pθ (x, z) > 0, we can take the expectation with respect to the corresponding conditional distribution
of the latent variables, pθt (z|x), and we obtain:
\[
\ell(\theta) = \ell_{\theta_t}(\theta) - \sum_z p_{\theta_t}(z \mid x) \log p_\theta(z \mid x).
\]

Now the difference in log-likelihood is:

\[
\ell(\theta) - \ell(\theta_t) = \ell_{\theta_t}(\theta) - \ell_{\theta_t}(\theta_t) + D(\theta_t \,\|\, \theta),
\]

where
\[
D(\theta_t \,\|\, \theta) = \sum_z p_{\theta_t}(z \mid x) \log \frac{p_{\theta_t}(z \mid x)}{p_\theta(z \mid x)}
\]

is the Kullback-Leibler divergence between the probability distributions pθt (z|x) and pθ (z|x). This quantity
is non-negative (this is Gibbs’ inequality, see Appendix B). Since:

\[
\theta_{t+1} = \arg\max_\theta \ell_{\theta_t}(\theta),
\]

we get:
\[
\ell(\theta_{t+1}) - \ell(\theta_t) = \ell_{\theta_t}(\theta_{t+1}) - \ell_{\theta_t}(\theta_t) + D(\theta_t \,\|\, \theta_{t+1}) \ge D(\theta_t \,\|\, \theta_{t+1}) \ge 0,
\]
showing that the corresponding sequence of log-likelihoods ℓ(θ0 ), ℓ(θ1 ), ℓ(θ2 ), . . . is non-decreasing and thus
converges.
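
In practice, this monotonicity is a useful sanity check on an implementation; a minimal sketch, assuming the log-likelihood is recorded at each iteration:

    import numpy as np

    def check_monotone(log_likelihoods, tol=1e-9):
        # The sequence ell(theta_0), ell(theta_1), ... returned by EM should be
        # non-decreasing, up to numerical tolerance.
        ll = np.asarray(log_likelihoods)
        return bool(np.all(np.diff(ll) >= -tol))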

5 Comparison with k-means
Consider the Gaussian mixture model with common covariance matrix σ 2 I, for some parameter σ > 0, and
uniform mixing distribution:

\[
X \sim \mathcal{N}(\mu_Z, \sigma^2 I) \quad \text{with} \quad Z \sim \mathcal{U}(\{1, \dots, k\}).
\]

The density becomes:


\[
f(x) = \frac{1}{k} \sum_{j=1}^k f_j(x),
\]

with:
\[
f_j(x) = \frac{1}{(2\pi\sigma^2)^{d/2}}\, e^{-\frac{\|x - \mu_j\|^2}{2\sigma^2}}.
\]

The variance σ 2 is assumed to be known so that the only parameter is θ = µ, the vector of means. We refer
to this model as the symmetric Gaussian mixture model.

Algorithm 2: EM algorithm for the symmetric Gaussian mixture model

Input: Data samples x1 , . . . , xn ; distance σ; number of clusters k
Output: p1 , . . . , pn , probability distributions of samples 1, . . . , n over the k clusters
 1  Sample random values µ1 , . . . , µk from x1 , . . . , xn
 2  while no convergence do
        // Expectation
 3      for i = 1 to n do
 4          s ← 0
 5          for j = 1 to k do
 6              pij ← exp(−||xi − µj||² / (2σ²))
 7              s ← s + pij
 8          for j = 1 to k do
 9              pij ← pij / s
        // Maximization
10      for j = 1 to k do
11          πj ← 0
12          µj ← 0
13          for i = 1 to n do
14              πj ← πj + pij
15              µj ← µj + pij xi
16          µj ← µj / πj

When σ 2 → 0, the expectation step becomes:



\[
p_{ij} = \begin{cases} 1 & \text{if } j = l, \\ 0 & \text{otherwise,} \end{cases}
\]

where l is the index for which the distance ||xi − µl || is minimum (assuming this index is unique). The
algorithm is exactly k-means.
In general, the algorithm provides a soft clustering, with the parameter σ controlling the spread of each
cluster. When σ 2 → +∞, each sample belongs to each cluster with the same probability 1/k.
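
A minimal sketch of the expectation step of the symmetric model, showing the role of σ; the shift by the row-wise minimum is only for numerical stability and does not change the normalized probabilities:

    import numpy as np

    def soft_assign(x, mu, sigma):
        # E-step of the symmetric model: p_ij proportional to exp(-||x_i - mu_j||^2 / (2 sigma^2)).
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)    # squared distances, shape (n, k)
        logits = -(d2 - d2.min(axis=1, keepdims=True)) / (2 * sigma ** 2)
        p = np.exp(logits)
        return p / p.sum(axis=1, keepdims=True)

    # Small sigma -> near one-hot assignments (k-means behaviour);
    # large sigma -> assignments close to uniform (1/k).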

6 A simple Gaussian mixture model
Finally, consider the Gaussian mixture model where all covariance matrices are diagonal. This is a good
trade-off between the Gaussian mixture model (with O(kd²) parameters to be learned, where d is the dimension
of the data samples) and the k-means algorithm, corresponding to the simplistic case of a uniform mixing
distribution and covariance matrices equal to σ²I for some fixed, small parameter σ². The EM algorithm is
the same as Algorithm 1, with the update of the covariance matrices replaced by:

\[
\Sigma_j \leftarrow \Sigma_j + p_{ij}\, \mathrm{diag}\big((x_i - \mu_j)^2\big),
\]

where (xi − µj )² denotes the componentwise square of the vector xi − µj . The dependency across dimensions
is no longer taken into account, but the algorithm is more robust in that the inversion of
the covariance matrices is straightforward.
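
A minimal sketch of the corresponding maximization step, where only the componentwise variances are estimated; x is an n × d array and p the n × k matrix of responsibilities:

    import numpy as np

    def m_step_diagonal(x, p):
        # M-step with diagonal covariance matrices: for each cluster j, only the
        # componentwise variances sum_i p_ij (x_i - mu_j)^2 / n_j are estimated.
        nj = p.sum(axis=0)
        pi = nj / len(x)
        mu = (p.T @ x) / nj[:, None]
        var = np.stack([(p[:, j, None] * (x - mu[j]) ** 2).sum(axis=0) / nj[j]
                        for j in range(len(nj))])          # shape (k, d): diag(Sigma_j)
        return pi, mu, var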

Further reading
• The initial paper on the EM algorithm [Dempster et al., 1977].
• A concise tutorial on the EM algorithm and variants [Roche, 2011].

Appendix
A Maximum likelihood for the Gaussian model
Consider the Gaussian model in dimension d:
\[
p_\theta(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)},
\]

with parameter θ = (µ, Σ). For n i.i.d. samples x1 , . . . , xn of the distribution, we get:

\[
p_\theta(x) = p_\theta(x_1) \cdots p_\theta(x_n).
\]

The log-likelihood is:


\[
\ell(\theta) = \log p_\theta(x) = \sum_{i=1}^n \log p_\theta(x_i),
\]

that is:
\[
\ell(\theta) = c - \frac{n}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu),
\]
with c = −(nd/2) log(2π).
The gradient in µ is the vector:
\[
\frac{\partial \ell(\theta)}{\partial \mu} = \sum_{i=1}^n \Sigma^{-1} (x_i - \mu),
\]

which is equal to 0 for µ = µ̂ with:


\[
\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i.
\]
This is the empirical mean of the samples.

Now let Λ = Σ⁻¹. Since |Λ| = |Σ|⁻¹, we can rewrite the log-likelihood as:
\[
\ell(\theta) = c + \frac{n}{2} \log |\Lambda| - \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^T \Lambda (x_i - \mu).
\]

We obtain¹:
\[
\frac{\partial \ell(\theta)}{\partial \Lambda} = \frac{n}{2} \Lambda^{-1} - \frac{1}{2} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T,
\]

which is equal to 0 for Λ⁻¹ = Σ̂, with
\[
\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T.
\]

For µ = µ̂, this is the empirical covariance matrix of the samples.
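
These estimators can be checked numerically; a minimal sketch, where the biased empirical covariance (dividing by n rather than n − 1) matches the maximum likelihood estimator:

    import numpy as np

    def gaussian_mle(x):
        # Maximum likelihood estimates of a single Gaussian: empirical mean and covariance.
        mu_hat = x.mean(axis=0)
        centered = x - mu_hat
        Sigma_hat = centered.T @ centered / len(x)   # divides by n, not n - 1
        return mu_hat, Sigma_hat

    # Equivalent check with NumPy's built-in estimator:
    # np.allclose(Sigma_hat, np.cov(x, rowvar=False, bias=True))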

B Kullback-Leibler divergence and Gibbs’ inequality


Let p and q be two probability distributions over {1, . . . , k}. The Kullback-Leibler divergence between p and
q is defined by:
\[
D(p \,\|\, q) = \sum_{j=1}^k p_j \log \frac{p_j}{q_j}.
\]

It is a measure of how the probability distributions p and q differ. Observe that D(p||q) = +∞ whenever
qj = 0 while pj > 0 for some j. We have:
\[
D(p \,\|\, q) \ge 0,
\]
with equality if and only if p = q. This is Gibbs’ inequality, which follows from Jensen’s inequality on
observing that:
     
\[
D(p \,\|\, q) = \mathbb{E}\left[\log \frac{p_Z}{q_Z}\right]
= -\mathbb{E}\left[\log \frac{q_Z}{p_Z}\right]
\ge -\log \mathbb{E}\left[\frac{q_Z}{p_Z}\right] = -\log 1 = 0,
\]

where the expectation is taken over Z, a random variable having distribution p. Since the log is strictly
concave, the inequality is an equality if and only if qj /pj is a constant, for each j such that pj > 0, which means
that p = q.
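
A small numerical check of Gibbs' inequality; the distributions p and q are arbitrary examples:

    import numpy as np

    def kl_divergence(p, q):
        # Kullback-Leibler divergence D(p||q) for distributions over {1, ..., k}.
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        mask = p > 0                      # terms with p_j = 0 contribute 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))   # +inf if some q_j = 0 < p_j

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])
    assert kl_divergence(p, q) >= 0          # Gibbs' inequality
    assert kl_divergence(p, p) == 0          # equality iff p = q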

References
[Dempster et al., 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological).
[Roche, 2011] Roche, A. (2011). EM algorithm and variants: An informal tutorial. arXiv:1105.1476.

¹ For any matrix A, the gradient of the determinant |A| in A is the comatrix of A. In particular, the gradient of log |A| in A
is A⁻¹ for any symmetric positive definite matrix A. Moreover, for any vectors u, v, the gradient of uᵀAv in A is the matrix uvᵀ.
