then, by continuity, any local maximum of L(q, θ) will also be a local maximum of
ln p(X|θ).
Consider the case of N independent data points x1 , . . . , xN with corresponding
latent variables z1 , . . . , zN . The joint distribution p(X, Z|θ) factorizes over the data
points, and this structure can be exploited in an incremental form of EM in which
at each EM cycle only one data point is processed at a time. In the E step, instead
of recomputing the responsibilities for all of the data points, we just re-evaluate the
responsibilities for one data point. It might appear that the subsequent M step would
require computation involving the responsibilities for all of the data points. How-
ever, if the mixture components are members of the exponential family, then the
responsibilities enter only through simple sufficient statistics, and these can be up-
dated efficiently. Consider, for instance, the case of a Gaussian mixture, and suppose
we perform an update for data point m in which the corresponding old and new
values of the responsibilities are denoted γ^old(z_mk) and γ^new(z_mk). In the M step,
the required sufficient statistics can be updated incrementally. For instance, for the
means the sufficient statistics are defined by (9.17) and (9.18), from which we obtain (Exercise 9.26)
$$\boldsymbol{\mu}_k^{\text{new}} = \boldsymbol{\mu}_k^{\text{old}} + \left(\frac{\gamma^{\text{new}}(z_{mk}) - \gamma^{\text{old}}(z_{mk})}{N_k^{\text{new}}}\right)\left(\mathbf{x}_m - \boldsymbol{\mu}_k^{\text{old}}\right) \tag{9.78}$$
together with
$$N_k^{\text{new}} = N_k^{\text{old}} + \gamma^{\text{new}}(z_{mk}) - \gamma^{\text{old}}(z_{mk}). \tag{9.79}$$
The corresponding results for the covariances and the mixing coefficients are analo-
gous.
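As an illustration of how lightweight these updates are, the following Python sketch applies (9.78) and (9.79) to a single component after the responsibility of one data point has been re-evaluated; the function and variable names are illustrative, not taken from the text.

import numpy as np

def incremental_mean_update(mu_k, N_k, x_m, gamma_old, gamma_new):
    # Effective number of points assigned to component k, as in (9.79)
    N_k_new = N_k + gamma_new - gamma_old
    # Incremental update of the component mean, as in (9.78)
    mu_k_new = mu_k + ((gamma_new - gamma_old) / N_k_new) * (x_m - mu_k)
    return mu_k_new, N_k_new

# Example usage (illustrative values)
mu_k, N_k = np.array([0.5, -1.0]), 12.0
x_m = np.array([1.0, 0.0])
mu_k, N_k = incremental_mean_update(mu_k, N_k, x_m, gamma_old=0.2, gamma_new=0.7)

Note that the previous responsibility of the data point being updated must be remembered, so that its old contribution can be subtracted from the sufficient statistics.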
Thus both the E step and the M step take fixed time that is independent of the
total number of data points. Because the parameters are revised after each data point,
rather than waiting until after the whole data set is processed, this incremental ver-
sion can converge faster than the batch version. Each E or M step in this incremental
algorithm increases the value of L(q, θ) and, as we have shown above, if the
algorithm converges to a local (or global) maximum of L(q, θ), this will correspond
to a local (or global) maximum of the log likelihood function ln p(X|θ).
Exercises
9.1 ( ) www Consider the K-means algorithm discussed in Section 9.1. Show that as
a consequence of there being a finite number of possible assignments for the set of
discrete indicator variables rnk , and that for each such assignment there is a unique
optimum for the {µk }, the K-means algorithm must converge after a finite number
of iterations.
9.2 ( ) Apply the Robbins-Monro sequential estimation procedure described in Sec-
tion 2.3.5 to the problem of finding the roots of the regression function given by
the derivatives of J in (9.1) with respect to µk . Show that this leads to a stochastic
K-means algorithm in which, for each data point xn , the nearest prototype µk is
updated using (9.5).
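For orientation (the details are the subject of the exercise), the sequential update (9.5) has the form µ_k^new = µ_k^old + η_n(x_n − µ_k^old); a minimal Python sketch of the resulting stochastic K-means pass might look as follows, where the decaying step size eta0/(t+1) is an illustrative choice of the Robbins-Monro coefficients.

import numpy as np

def sequential_kmeans_pass(X, mu, eta0=0.5):
    # One pass through the data: for each point, move the nearest
    # prototype towards it using an update of the form (9.5).
    for t, x in enumerate(X):
        k = int(np.argmin(np.sum((mu - x) ** 2, axis=1)))  # nearest prototype
        eta = eta0 / (t + 1)        # illustrative Robbins-Monro step size
        mu[k] += eta * (x - mu[k])  # stochastic update of prototype k
    return mu

# Example usage: two prototypes in two dimensions
X = np.random.randn(100, 2)
mu = np.array([[1.0, 0.0], [-1.0, 0.0]])
mu = sequential_kmeans_pass(X, mu)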
9.3 ( ) www Consider a Gaussian mixture model in which the marginal distribution
p(z) for the latent variable is given by (9.10), and the conditional distribution p(x|z)
for the observed variable is given by (9.11). Show that the marginal distribution
p(x), obtained by summing p(z)p(x|z) over all possible values of z, is a Gaussian
mixture of the form (9.7).
9.4 ( ) Suppose we wish to use the EM algorithm to maximize the posterior distri-
bution over parameters p(θ|X) for a model containing latent variables, where X is
the observed data set. Show that the E step remains the same as in the maximum
likelihood case, whereas in the M step the quantity to be maximized is given by
Q(θ, θ old ) + ln p(θ) where Q(θ, θ old ) is defined by (9.30).
9.5 ( ) Consider the directed graph for a Gaussian mixture model shown in Figure 9.6.
By making use of the d-separation criterion discussed in Section 8.2, show that the
posterior distribution of the latent variables factorizes with respect to the different
data points so that
$$p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) = \prod_{n=1}^{N} p(\mathbf{z}_n|\mathbf{x}_n, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}). \tag{9.80}$$
9.6 ( ) Consider a special case of a Gaussian mixture model in which the covari-
ance matrices Σk of the components are all constrained to have a common value
Σ. Derive the EM equations for maximizing the likelihood function under such a
model.
9.7 ( ) www Verify that maximization of the complete-data log likelihood (9.36) for
a Gaussian mixture model leads to the result that the means and covariances of each
component are fitted independently to the corresponding group of data points, and
the mixing coefficients are given by the fractions of points in each group.
9.8 ( ) www Show that if we maximize (9.40) with respect to µk while keeping the
responsibilities γ(znk ) fixed, we obtain the closed form solution given by (9.17).
9.9 ( ) Show that if we maximize (9.40) with respect to Σk and πk while keeping the
responsibilities γ(znk ) fixed, we obtain the closed form solutions given by (9.19)
and (9.22).
9.10 ( ) Consider a density model given by a mixture distribution
$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, p(\mathbf{x}|k) \tag{9.81}$$
and suppose that we partition the vector x into two parts so that x = (xa , xb ).
Show that the conditional density p(xb |xa ) is itself a mixture distribution and find
expressions for the mixing coefficients and for the component densities.
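As a sketch of the structure to aim for, applying the product rule to the mixture (9.81) suggests the form
$$p(\mathbf{x}_b|\mathbf{x}_a) = \sum_{k=1}^{K} \lambda_k\, p(\mathbf{x}_b|\mathbf{x}_a, k), \qquad \lambda_k = \frac{\pi_k\, p(\mathbf{x}_a|k)}{\sum_{l=1}^{K} \pi_l\, p(\mathbf{x}_a|l)},$$
so the new mixing coefficients λ_k are the posterior probabilities of the components given x_a.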
9.12 ( ) Consider a mixture distribution of the form
$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, p(\mathbf{x}|k) \tag{9.82}$$
where the elements of x could be discrete or continuous or a combination of these. Denote the mean and covariance of p(x|k) by µ_k and Σ_k, respectively. Show that the mean and covariance of the mixture distribution are given by (9.49) and (9.50).
9.13 ( ) Using the re-estimation equations for the EM algorithm, show that a mixture of Bernoulli distributions, with its parameters set to values corresponding to a maximum of the likelihood function, has the property that
$$\mathbb{E}[\mathbf{x}] = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \equiv \bar{\mathbf{x}}. \tag{9.83}$$
Hence show that if the parameters of this model are initialized such that all compo-
nents have the same mean µk = µ for k = 1, . . . , K, then the EM algorithm will
converge after one iteration, for any choice of the initial mixing coefficients, and that
this solution has the property µk = x̄. Note that this represents a degenerate case of
the mixture model in which all of the components are identical, and in practice we
try to avoid such solutions by using an appropriate initialization.
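A sketch of the key observation: if all components share the same mean, the E step gives γ(z_nk) = π_k, independent of n, so the M step yields
$$\boldsymbol{\mu}_k^{\text{new}} = \frac{\sum_{n=1}^{N} \pi_k\, \mathbf{x}_n}{N \pi_k} = \bar{\mathbf{x}} \quad \text{for all } k,$$
while the mixing coefficients are unchanged, after which no further iteration alters the parameters.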
9.14 ( ) Consider the joint distribution of latent and observed variables for the Bernoulli
distribution obtained by forming the product of p(x|z, µ) given by (9.52) and p(z|π)
given by (9.53). Show that if we marginalize this joint distribution with respect to z,
then we obtain (9.47).
9.15 ( ) www Show that if we maximize the expected complete-data log likelihood
function (9.55) for a mixture of Bernoulli distributions with respect to µk , we obtain
the M step equation (9.59).
9.16 ( ) Show that if we maximize the expected complete-data log likelihood function
(9.55) for a mixture of Bernoulli distributions with respect to the mixing coefficients
πk , using a Lagrange multiplier to enforce the summation constraint, we obtain the
M step equation (9.60).
9.17 ( ) www Show that as a consequence of the constraint 0 ⩽ p(xn|µk) ⩽ 1 for
the discrete variable xn , the incomplete-data log likelihood function for a mixture
of Bernoulli distributions is bounded above, and hence that there are no singularities
for which the likelihood goes to infinity.
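The essence of the bound: since the mixing coefficients sum to one and each component density satisfies 0 ⩽ p(x_n|µ_k) ⩽ 1, every factor in the likelihood obeys
$$p(\mathbf{x}_n) = \sum_{k=1}^{K} \pi_k\, p(\mathbf{x}_n|\boldsymbol{\mu}_k) \leqslant \sum_{k=1}^{K} \pi_k = 1,$$
so that the incomplete-data log likelihood is bounded above by zero.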
where
$$p(\mathbf{x}|\boldsymbol{\mu}_k) = \prod_{i=1}^{D} \prod_{j=1}^{M} \mu_{kij}^{x_{ij}}. \tag{9.85}$$
9.25 ( ) www Show that the lower bound L(q, θ) given by (9.71), with q(Z) =
p(Z|X, θ^(old)), has the same gradient with respect to θ as the log likelihood function
ln p(X|θ) at the point θ = θ^(old).
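One possible route, sketched here: with q(Z) = p(Z|X, θ^(old)), the decomposition of the log likelihood gives ln p(X|θ) = L(q, θ) + KL(q‖p(Z|X, θ)). The KL term is nonnegative and equals zero at θ = θ^(old), so (assuming differentiability) its gradient vanishes there, and hence
$$\nabla_{\boldsymbol{\theta}} \mathcal{L}(q, \boldsymbol{\theta})\big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{(\text{old})}} = \nabla_{\boldsymbol{\theta}} \ln p(\mathbf{X}|\boldsymbol{\theta})\big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{(\text{old})}}.$$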
9.26 ( ) www Consider the incremental form of the EM algorithm for a mixture of
Gaussians, in which the responsibilities are recomputed only for a specific data point
xm . Starting from the M-step formulae (9.17) and (9.18), derive the results (9.78)
and (9.79) for updating the component means.
9.27 ( ) Derive M-step formulae for updating the covariance matrices and mixing
coefficients in a Gaussian mixture model when the responsibilities are updated in-
crementally, analogous to the result (9.78) for updating the means.