Gaussian Mixture Models
Thomas Bonald
Telecom ParisTech
[email protected]
January 2019
In this note, we present a clustering technique based on the Gaussian mixture model. Data samples
are assumed to be generated by a mixture of k Gaussian distributions, whose parameters are estimated by
an iterative method known as Expectation-Maximization (the EM algorithm). We show that the k-means
algorithm corresponds to the particular case where all Gaussian distributions are assumed to have the same
diagonal covariance matrix, with infinitely small variance.
1 Gaussian mixture model
Gaussian distribution. Consider a random vector X of dimension d with Gaussian distribution of mean µ and covariance matrix Σ:
X ∼ N (µ, Σ).
The random vector X has a density f if and only if its covariance matrix is invertible, in which case:
f(x) = (1/√((2π)^d |Σ|)) exp(−(1/2)(x − µ)^T Σ^{−1}(x − µ)),
where |Σ| denotes the determinant of the covariance matrix Σ.
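As a quick illustration, here is a minimal NumPy sketch of this density (the function name gaussian_density and the example parameters are ours):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, assuming Sigma is invertible."""
    d = len(mu)
    diff = x - mu
    # Work with the log-density for numerical safety:
    # -1/2 * (d log(2 pi) + log|Sigma| + (x - mu)^T Sigma^{-1} (x - mu))
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * (d * np.log(2 * np.pi) + logdet + quad))

# Standard bivariate Gaussian evaluated at the origin: 1 / (2 pi) ≈ 0.159
print(gaussian_density(np.zeros(2), np.zeros(2), np.eye(2)))
```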
Gaussian mixture model. Now consider k such distributions, with respective density functions f1 , . . . , fk
and respective parameters (µ1 , Σ1 ), . . . , (µk , Σk ). Let π1 , . . . , πk be any probability distribution on {1, . . . , k}.
Select the j-th distribution with probability πj and draw X from it, that is:
P(Z = j) = πj ,   X | Z = j ∼ N (µj , Σj ), (1)
where Z denotes the (latent) index of the selected distribution. The density of X is then:
pθ(x) = Σ_{j=1}^k πj fj(x), (2)
with parameter θ = (π1 , . . . , πk , µ1 , Σ1 , . . . , µk , Σk ).
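A minimal NumPy sketch of the generative process (1), with illustrative parameters of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary two-component mixture in dimension 2, for illustration only.
pi = np.array([0.3, 0.7])
mus = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])

def sample_mixture(n):
    """Draw n samples: first the latent index Z ~ pi, then X | Z = j ~ N(mu_j, Sigma_j)."""
    z = rng.choice(len(pi), size=n, p=pi)
    x = np.array([rng.multivariate_normal(mus[j], Sigmas[j]) for j in z])
    return x, z

X, Z = sample_mixture(1000)
print(X.shape, np.bincount(Z) / len(Z))  # empirical mixing proportions, close to pi
```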
2 Maximum likelihood
Assume we seek to estimate the parameter θ based on n i.i.d. samples x1 , . . . , xn of the Gaussian mixture
model. Denoting by x the vector (x1 , . . . , xn ), we get the likelihood:
pθ(x) = ∏_{i=1}^n pθ(xi),
and the log-likelihood ℓ(θ) = log pθ(x).
In view of (2),
ℓ(θ) = Σ_{i=1}^n log ( Σ_{j=1}^k πj fj(xi) ). (3)
The maximum likelihood estimator of θ is then the solution to the optimization problem:
max_θ ℓ(θ). (4)
This problem is hard to solve in practice, even numerically, since the function θ ↦ −ℓ(θ) is not convex in general.
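A minimal sketch of the computation of the log-likelihood (3) in NumPy/SciPy, using the log-sum-exp trick for numerical stability (the function name is ours):

```python
import numpy as np
from scipy.special import logsumexp

def log_likelihood(X, pi, mus, Sigmas):
    """Log-likelihood (3) of the data X (shape (n, d)) under a k-component Gaussian mixture."""
    n, d = X.shape
    k = len(pi)
    log_prob = np.empty((n, k))
    for j in range(k):
        diff = X - mus[j]                                   # (n, d)
        _, logdet = np.linalg.slogdet(Sigmas[j])
        quad = np.einsum('ni,ni->n', diff, np.linalg.solve(Sigmas[j], diff.T).T)
        # log(pi_j f_j(x_i)) for all samples i
        log_prob[:, j] = np.log(pi[j]) - 0.5 * (d * np.log(2 * np.pi) + logdet + quad)
    return logsumexp(log_prob, axis=1).sum()
```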
3 Expectation-Maximization
The Expectation-Maximization (EM) algorithm is an iterative method for finding a local maximum of the
likelihood. This technique applies to any mixture model (in fact, any model with latent variables). It is
based on the observation that problem (4) would be easy to solve if the latent variables z1 , . . . , zn , used
implicitly in (1) to generate the data samples x1 , . . . , xn , were known.
Latent variables. Let z be the vector of latent variables (z1 , . . . , zn ). In view of (2), the Gaussian mixture
model is the marginal distribution of the joint distribution:
pθ(x, z) = pθ(z)pθ(x|z),
with
pθ(z) = ∏_{i=1}^n pθ(zi) = ∏_{i=1}^n πzi    and    pθ(x|z) = ∏_{i=1}^n pθ(xi |zi) = ∏_{i=1}^n fzi(xi).
Given the latent variables, the log-likelihood becomes:
ℓ(θ; z) = log pθ(x, z) = Σ_{i=1}^n log πzi + Σ_{i=1}^n log fzi(xi),
so that each set of parameters (µ1 , Σ1 ), . . . , (µk , Σk ) can be estimated separately using the corresponding
samples (we refer the reader to Appendix A for the maximum likelihood estimator of a Gaussian distribution).
Specifically, the log-likelihood ℓ(θ; z) is maximized by the empirical mixing distribution (π̂1 , . . . , π̂k ):
∀j = 1, . . . , k,   π̂j = nj / n, (5)
and the empirical means and covariance matrices (µ̂1 , Σ̂1 ), . . . , (µ̂k , Σ̂k ):
∀j = 1, . . . , k,   µ̂j = (1/nj) Σ_{i=1}^n 1{zi =j} xi ,   Σ̂j = (1/nj) Σ_{i=1}^n 1{zi =j} (xi − µ̂j)(xi − µ̂j)^T, (6)
where
nj = Σ_{i=1}^n 1{zi =j}
is the number of samples generated according to the j-th Gaussian distribution (which we assume positive).
The key problem is that the latent variables z1 , . . . , zn are unknown.
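If the labels z were known, the estimates (5) and (6) could be computed directly; a minimal NumPy sketch (the function name is ours, and the labels are taken in {0, . . . , k − 1}):

```python
import numpy as np

def complete_data_mle(X, z, k):
    """Estimates (5)-(6) given the data X (shape (n, d)) and known labels z in {0, ..., k-1}."""
    n, d = X.shape
    pi_hat = np.empty(k)
    mu_hat = np.empty((k, d))
    Sigma_hat = np.empty((k, d, d))
    for j in range(k):
        mask = (z == j)
        nj = mask.sum()                      # number of samples of the j-th distribution (assumed > 0)
        pi_hat[j] = nj / n                   # (5)
        mu_hat[j] = X[mask].mean(axis=0)     # (6), empirical mean
        diff = X[mask] - mu_hat[j]
        Sigma_hat[j] = diff.T @ diff / nj    # (6), empirical covariance matrix
    return pi_hat, mu_hat, Sigma_hat
```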
Estimation of the latent variables. The conditional distribution of the latent variables given the data
samples follows from:
pθ (x, z) = pθ (x)pθ (z|x). (7)
Since:
pθ(x, z) = ∏_{i=1}^n πzi fzi(xi),
we get:
pθ(z|x) = ∏_{i=1}^n pθ(zi |xi) ∝ ∏_{i=1}^n πzi fzi(xi). (8)
Now given some initial parameter θ0 , we can use the corresponding distribution of the latent variables
given by (8) to get the expected log-likelihood of x:
ℓθ0(θ) = Σ_z pθ0(z|x) ℓ(θ; z) = Σ_{j=1}^k Σ_{i=1}^n pij (log πj + log fj(xi)),
where pij = pθ0(zi = j|xi) is the conditional probability that sample i was generated by the j-th Gaussian distribution, given by (8).
This expected log-likelihood is maximized by the empirical mixing distribution (π̂1 , . . . , π̂k ):
∀j = 1, . . . , k,   π̂j = nj / n, (9)
and the empirical means and covariance matrices (µ̂1 , Σ̂1 ), . . . , (µ̂k , Σ̂k ):
∀j = 1, . . . , k,   µ̂j = (1/nj) Σ_{i=1}^n pij xi ,   Σ̂j = (1/nj) Σ_{i=1}^n pij (xi − µ̂j)(xi − µ̂j)^T, (10)
where
nj = Σ_{i=1}^n pij
is the expected number of samples generated according to the j-th Gaussian distribution.
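A minimal NumPy/SciPy sketch of one EM iteration, computing the probabilities pij from (8) and the updates (9)-(10) (the function names are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mus, Sigmas):
    """Probabilities p_ij = p(z_i = j | x_i), from (8)."""
    weighted = np.stack([pi[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j])
                         for j in range(len(pi))], axis=1)      # (n, k), entries pi_j f_j(x_i)
    return weighted / weighted.sum(axis=1, keepdims=True)

def m_step(X, p):
    """Updates (9)-(10) from the probabilities p, of shape (n, k)."""
    n, d = X.shape
    nj = p.sum(axis=0)                                          # expected cluster sizes
    pi = nj / n                                                 # (9)
    mus = (p.T @ X) / nj[:, None]                               # (10), weighted means
    Sigmas = np.empty((len(nj), d, d))
    for j in range(len(nj)):
        diff = X - mus[j]
        Sigmas[j] = (p[:, j, None] * diff).T @ diff / nj[j]     # (10), weighted covariance matrices
    return pi, mus, Sigmas
```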
Thus, starting from some initial parameter θ0 , we can compute the conditional distribution of the latent
variables, given the data samples, and deduce a new estimate of the parameter, θ1 = (π̂, µ̂, Σ̂). By successive
iterations, we obtain a sequence of parameters θ0 , θ1 , θ2 , . . . which is expected to converge to a good approximation of the optimal parameter θ⋆ (i.e., the one solving problem (4)). We shall prove in the next section that the corresponding sequence of log-likelihoods ℓ(θ0 ), ℓ(θ1 ), ℓ(θ2 ), . . . is non-decreasing, which guarantees that the EM algorithm converges to a local maximum of the likelihood. This is not the global maximum of the likelihood in general.
EM algorithm. The algorithm is summarized below. The outcome is a soft clustering of the data samples, with pij the probability that sample i belongs to cluster j. A regular (hard) clustering can be obtained by selecting for each sample i the cluster j maximizing pij.
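A minimal Python/NumPy sketch of one possible implementation (the function name em_gmm, the iteration cap and the stopping test on cluster assignments are our choices; the initialization follows the choices discussed below):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """EM algorithm for a Gaussian mixture with k components.

    Returns the estimated parameters (pi, mus, Sigmas) and the soft clustering p of shape (n, k).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Initialization: uniform mixing distribution, means sampled among the data samples,
    # covariance matrices sigma^2 I with sigma^2 the average square distance to the center.
    pi = np.full(k, 1 / k)
    mus = X[rng.choice(n, size=k, replace=False)]
    sigma2 = np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1))
    Sigmas = np.array([sigma2 * np.eye(d) for _ in range(k)])

    labels = None
    for _ in range(n_iter):
        # E-step: p_ij proportional to pi_j f_j(x_i), see (8).
        weighted = np.stack([pi[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j])
                             for j in range(k)], axis=1)
        p = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step: updates (9)-(10).
        nj = p.sum(axis=0)
        pi = nj / n
        mus = (p.T @ X) / nj[:, None]
        for j in range(k):
            diff = X - mus[j]
            Sigmas[j] = (p[:, j, None] * diff).T @ diff / nj[j]

        # Stopping criterion: the main cluster of each sample no longer changes.
        new_labels = p.argmax(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels

    return pi, mus, Sigmas, p
```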
A key issue is the choice of the initial parameter θ0 , and more specifically of the initial means µ1 , . . . , µk , corresponding to the cluster centers. In the above sketch, these are obtained by randomly sampling k values among the data samples x1 , . . . , xn , as in the k-means algorithm. Since this initial choice has a strong impact on the final result, several independent instances of the algorithm can be run, and the best instance, i.e., the one maximizing (3), selected in the end. Another common strategy consists in
selecting the cluster centers far from one another, as in the k-means++ algorithm.
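A sketch of such an initialization in the spirit of k-means++ (the D²-weighted sampling below is one common variant; the note only refers to the k-means++ algorithm):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """Cluster centers sampled far from one another, in the spirit of k-means++."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(X)
    centers = [X[rng.integers(n)]]          # first center chosen uniformly at random
    for _ in range(k - 1):
        # squared distance of each sample to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # sample the next center with probability proportional to this squared distance
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```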
The choice of the initial values of the covariance matrices is also critical. Here σ 2 is chosen as the average
square distance between data points, in view of the equality
(1/n) Σ_{i=1}^n ||xi − x̄||² = (1/(2n²)) Σ_{i,j=1}^n ||xi − xj||².
If σ² is much larger than the typical square distance between data samples, then the probability distributions p1 , . . . , pn tend to be close to uniform, and the means µ1 , . . . , µk all converge to x̄, the center of the data samples.
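A quick NumPy check of this choice of σ² and of the above identity (the random data are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# sigma^2 as the average square distance to the center of the data samples...
sigma2 = np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1))

# ...coincides with the sum of square distances between data points divided by 2 n^2.
pairwise = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
print(sigma2, pairwise.sum() / (2 * len(X) ** 2))
```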
The number of iterations depends on the stopping criterion. For instance, one may decide that convergence has occurred whenever the main cluster of each data sample (the cluster maximizing pij for sample i) remains unchanged.
The complexity of the algorithm is O(nk) per iteration, which may be prohibitive for large values of k. The complexity may be reduced to O(nm), for some integer m < k, by looking for the m nearest clusters of each data sample, using some appropriate data structure.
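As an illustration of this idea, a sketch restricting each sample to its m nearest cluster centers with a k-d tree (the choice of scipy.spatial.cKDTree is ours; the note only mentions an appropriate data structure):

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_clusters(X, mus, m):
    """Indices of the m cluster centers closest to each sample, shape (n, m)."""
    tree = cKDTree(mus)
    _, idx = tree.query(X, k=m)
    return idx.reshape(len(X), m)

# In the E-step, p_ij can then be computed for these m clusters only and set to 0 for the others.
```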
4 Convergence
In view of (7), the log-likelihood can be written, for any parameter θt :
ℓ(θ) = Σ_z pθt(z|x) (log pθ(x, z) − log pθ(z|x)) = ℓθt(θ) − Σ_z pθt(z|x) log pθ(z|x),
so that:
ℓ(θ) − ℓ(θt) = ℓθt(θ) − ℓθt(θt) + D(θt ||θ),
where
D(θt ||θ) = Σ_z pθt(z|x) log ( pθt(z|x) / pθ(z|x) )
is the Kullback-Leibler divergence between the probability distributions pθt (z|x) and pθ (z|x). This quantity
is non-negative (this is Gibbs' inequality, see Appendix B). Since θt+1 maximizes the expected log-likelihood ℓθt(θ), we have:
ℓθt(θt+1) ≥ ℓθt(θt),
and we get:
ℓ(θt+1) − ℓ(θt) = ℓθt(θt+1) − ℓθt(θt) + D(θt ||θt+1) ≥ D(θt ||θt+1) ≥ 0,
showing that the corresponding sequence of log-likelihoods ℓ(θ0 ), ℓ(θ1 ), ℓ(θ2 ), . . . is non-decreasing and thus
converges.
5 Comparison with k-means
Consider the Gaussian mixture model with common covariance matrix σ 2 I, for some parameter σ > 0, and
uniform mixing distribution:
∀j = 1, . . . , k,   πj = 1/k,
with:
fj(x) = (2πσ²)^{−d/2} exp(−||x − µj||²/(2σ²)).
The variance σ 2 is assumed to be known so that the only parameter is θ = µ, the vector of means. We refer
to this model as the symmetric Gaussian mixture model.
In this case, the E-step (8) gives pij ∝ exp(−||xi − µj||²/(2σ²)), and the M-step reduces to the update of the means by (10). When σ → 0, pij → 1{j=l} , where l is the index for which the distance ||xi − µl || is minimum (assuming this index is unique). The algorithm is then exactly k-means.
In general, the algorithm provides a soft clustering, with the parameter σ controlling the spread of each
cluster. When σ 2 → +∞, each sample belongs to each cluster with the same probability 1/k.
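A small NumPy sketch of the soft assignment in this symmetric model, illustrating the role of σ (the example values are ours):

```python
import numpy as np

def soft_assignment(X, mus, sigma):
    """p_ij proportional to exp(-||x_i - mu_j||^2 / (2 sigma^2)) in the symmetric model."""
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # squared distances, shape (n, k)
    logits = -d2 / (2 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)                 # for numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [2.9, 3.1]])
mus = np.array([[0.0, 0.0], [3.0, 3.0]])
print(soft_assignment(X, mus, sigma=0.1))    # nearly hard assignment, as in k-means
print(soft_assignment(X, mus, sigma=100.0))  # nearly uniform, each probability close to 1/2
```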
6 A simple Gaussian mixture model
Finally, consider the Gaussian mixture model where all covariance matrices are diagonal. This is a good trade-
off between the Gaussian mixture model (with O(kd2 ) parameters to be learned, where d is the dimension
of the data samples) and the k-means algorithm, corresponding to the simplistic case of a uniform mixing
distribution and covariance matrices equal to σ 2 I for some fixed, small parameter σ 2 . The EM algorithm is
the same as above, with the update of the covariance matrices replaced by:
Σj ← Σj + pij diag((xi − µj )2 ),
where (xi − µj )2 refers to the vector equal to the square of vector xi − µj componentwise. The dependency
across dimensions is no longer taken into account, but the algorithm is more robust in that the inversion of
the covariance matrices is straightforward.
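A minimal NumPy sketch of the corresponding update of the covariance matrices, storing only the diagonals and normalizing by nj as in (10) (the function name is ours):

```python
import numpy as np

def m_step_diagonal_cov(X, p, mus):
    """Diagonal covariance update: Sigma_j = diag( sum_i p_ij (x_i - mu_j)^2 / n_j )."""
    nj = p.sum(axis=0)                                  # expected cluster sizes, shape (k,)
    k, d = mus.shape
    variances = np.empty((k, d))
    for j in range(k):
        diff2 = (X - mus[j]) ** 2                       # componentwise squares (x_i - mu_j)^2
        variances[j] = (p[:, j, None] * diff2).sum(axis=0) / nj[j]
    return variances                                    # row j is the diagonal of Sigma_j
```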
Further reading
• The initial paper on the EM algorithm [Dempster et al., 1977].
• A concise tutorial on the EM algorithm and variants [Roche, 2011].
Appendix
A Maximum likelihood for the Gaussian model
Consider the Gaussian model in dimension d:
pθ(x) = (1/√((2π)^d |Σ|)) exp(−(1/2)(x − µ)^T Σ^{−1}(x − µ)),
with parameter θ = (µ, Σ). For n i.i.d. samples x1 , . . . , xn of the distribution, we get:
ℓ(θ) = Σ_{i=1}^n log pθ(xi),
that is:
ℓ(θ) = c − (n/2) log |Σ| − (1/2) Σ_{i=1}^n (xi − µ)^T Σ^{−1}(xi − µ),
with c = −(nd/2) log(2π).
The gradient in µ is the vector:
∂ℓ(θ)/∂µ = Σ_{i=1}^n Σ^{−1}(xi − µ) = Σ^{−1} Σ_{i=1}^n (xi − µ),
which is equal to zero for µ = µ̂ = (1/n) Σ_{i=1}^n xi , the sample mean.
Now let Λ = Σ−1 . Since |Λ| = |Σ|−1 , we can rewrite the log-likelihood as:
ℓ(θ) = c + (n/2) log |Λ| − (1/2) Σ_{i=1}^n (xi − µ)^T Λ (xi − µ).
We obtain¹:
∂ℓ(θ)/∂Λ = (n/2) Λ^{−1} − (1/2) Σ_{i=1}^n (xi − µ)(xi − µ)^T,
which is equal to zero for Σ = Λ^{−1} = (1/n) Σ_{i=1}^n (xi − µ)(xi − µ)^T; with µ = µ̂, this is the empirical covariance matrix.
B Kullback-Leibler divergence
Let p and q be two probability distributions on the same finite set, with respective probabilities pj and qj . The Kullback-Leibler divergence of p with respect to q is defined by:
D(p||q) = Σ_j pj log(pj /qj ).
It is a measure of how the probability distributions p and q differ. Observe that D(p||q) = +∞ whenever qj = 0 while pj > 0 for some j. We have:
D(p||q) ≥ 0,
with equality if and only if p = q. This is Gibbs’ inequality, which follows from Jensen’s inequality on
observing that:
D(p||q) = E[log(pZ /qZ )] = −E[log(qZ /pZ )] ≥ − log E[qZ /pZ ] = − log 1 = 0,
where the expectation is taken over Z, a random variable having distribution p. Since the log is strictly concave, the inequality is an equality if and only if qj /pj is a constant, for each j such that pj > 0, which means that p = q.
References
[Dempster et al., 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological).
[Roche, 2011] Roche, A. (2011). EM algorithm and variants: An informal tutorial. arXiv:1105.1476.
¹ For any matrix A, the gradient of the determinant |A| in A is the comatrix of A. In particular, the gradient of log |A| in A is A^{−1} for any symmetric positive definite matrix A. Moreover, for any vectors u, v, the gradient of u^T Av in A is the matrix uv^T.