Bishop, Pattern Recognition and Machine Learning (2006), pages 455-459


then, by continuity, any local maximum of L(q, θ) will also be a local maximum of
ln p(X|θ).
Consider the case of N independent data points x1 , . . . , xN with corresponding
latent variables z1 , . . . , zN . The joint distribution p(X, Z|θ) factorizes over the data
points, and this structure can be exploited in an incremental form of EM in which
at each EM cycle only one data point is processed at a time. In the E step, instead
of recomputing the responsibilities for all of the data points, we just re-evaluate the
responsibilities for one data point. It might appear that the subsequent M step would
require computation involving the responsibilities for all of the data points. How-
ever, if the mixture components are members of the exponential family, then the
responsibilities enter only through simple sufficient statistics, and these can be up-
dated efficiently. Consider, for instance, the case of a Gaussian mixture, and suppose
we perform an update for data point m in which the corresponding old and new
values of the responsibilities are denoted $\gamma^{\text{old}}(z_{mk})$ and $\gamma^{\text{new}}(z_{mk})$. In the M step,
the required sufficient statistics can be updated incrementally. For instance, for the
means the sufficient statistics are defined by (9.17) and (9.18), from which we obtain (Exercise 9.26)

$$\boldsymbol{\mu}_k^{\text{new}} = \boldsymbol{\mu}_k^{\text{old}} + \left(\frac{\gamma^{\text{new}}(z_{mk}) - \gamma^{\text{old}}(z_{mk})}{N_k^{\text{new}}}\right)\left(\mathbf{x}_m - \boldsymbol{\mu}_k^{\text{old}}\right) \tag{9.78}$$

together with

$$N_k^{\text{new}} = N_k^{\text{old}} + \gamma^{\text{new}}(z_{mk}) - \gamma^{\text{old}}(z_{mk}). \tag{9.79}$$
The corresponding results for the covariances and the mixing coefficients are analo-
gous.
Thus both the E step and the M step take fixed time that is independent of the
total number of data points. Because the parameters are revised after each data point,
rather than waiting until after the whole data set is processed, this incremental ver-
sion can converge faster than the batch version. Each E or M step in this incremental
algorithm is increasing the value of L(q, θ) and, as we have shown above, if the
algorithm converges to a local (or global) maximum of L(q, θ), this will correspond
to a local (or global) maximum of the log likelihood function ln p(X|θ).
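As a concrete illustration, a minimal NumPy sketch of the updates (9.78) and (9.79) might take the following form; the function name and the array layout (K components of dimension D) are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

def incremental_m_step_means(mu, N_k, x_m, gamma_old, gamma_new):
    """Revise the effective counts N_k and the component means mu after the
    responsibilities of a single data point x_m change from gamma_old to
    gamma_new, following (9.78) and (9.79).

    mu        : (K, D) array of current component means
    N_k       : (K,)   array of current effective counts N_k
    x_m       : (D,)   data point whose responsibilities were re-evaluated
    gamma_old : (K,)   old responsibilities gamma^old(z_mk)
    gamma_new : (K,)   new responsibilities gamma^new(z_mk)
    """
    delta = gamma_new - gamma_old                   # change in responsibility per component
    N_k_new = N_k + delta                           # equation (9.79)
    mu_new = mu + (delta / N_k_new)[:, None] * (x_m[None, :] - mu)   # equation (9.78)
    return mu_new, N_k_new
```

Each call costs O(KD), independent of the total number of data points, consistent with the fixed per-step cost noted above.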

Exercises
9.1 ( ) www Consider the K-means algorithm discussed in Section 9.1. Show that as
a consequence of there being a finite number of possible assignments for the set of
discrete indicator variables rnk , and that for each such assignment there is a unique
optimum for the {µk }, the K-means algorithm must converge after a finite number
of iterations.
9.2 ( ) Apply the Robbins-Monro sequential estimation procedure described in Sec-
tion 2.3.5 to the problem of finding the roots of the regression function given by
the derivatives of J in (9.1) with respect to µk . Show that this leads to a stochastic
K-means algorithm in which, for each data point xn , the nearest prototype µk is
updated using (9.5).
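As an illustration of the kind of algorithm this exercise leads to, here is a minimal NumPy sketch of a stochastic K-means pass, assuming that (9.5) has the usual sequential form $\boldsymbol{\mu}_k^{\text{new}} = \boldsymbol{\mu}_k^{\text{old}} + \eta_n(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{old}})$; the learning-rate schedule $\eta_n = 1/n$ is just one choice satisfying the Robbins-Monro conditions.

```python
import numpy as np

def stochastic_kmeans_pass(X, mu, eta=lambda n: 1.0 / n):
    """One sequential pass of stochastic K-means: for each data point x_n,
    move the nearest prototype a fraction eta(n) of the way towards x_n.

    X  : (N, D) data matrix
    mu : (K, D) initial prototypes
    """
    mu = mu.copy()
    for n, x_n in enumerate(X, start=1):
        k = np.argmin(np.sum((mu - x_n) ** 2, axis=1))   # nearest prototype
        mu[k] += eta(n) * (x_n - mu[k])                   # update of the form (9.5)
    return mu
```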

9.3 ( ) www Consider a Gaussian mixture model in which the marginal distribution
p(z) for the latent variable is given by (9.10), and the conditional distribution p(x|z)
for the observed variable is given by (9.11). Show that the marginal distribution
p(x), obtained by summing p(z)p(x|z) over all possible values of z, is a Gaussian
mixture of the form (9.7).
9.4 ( ) Suppose we wish to use the EM algorithm to maximize the posterior distri-
bution over parameters p(θ|X) for a model containing latent variables, where X is
the observed data set. Show that the E step remains the same as in the maximum
likelihood case, whereas in the M step the quantity to be maximized is given by
Q(θ, θ old ) + ln p(θ) where Q(θ, θ old ) is defined by (9.30).
9.5 ( ) Consider the directed graph for a Gaussian mixture model shown in Figure 9.6.
By making use of the d-separation criterion discussed in Section 8.2, show that the
posterior distribution of the latent variables factorizes with respect to the different
data points so that

$$p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) = \prod_{n=1}^{N} p(\mathbf{z}_n|\mathbf{x}_n, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}). \tag{9.80}$$

9.6 ( ) Consider a special case of a Gaussian mixture model in which the covari-
ance matrices Σk of the components are all constrained to have a common value
Σ. Derive the EM equations for maximizing the likelihood function under such a
model.
9.7 ( ) www Verify that maximization of the complete-data log likelihood (9.36) for
a Gaussian mixture model leads to the result that the means and covariances of each
component are fitted independently to the corresponding group of data points, and
the mixing coefficients are given by the fractions of points in each group.
9.8 ( ) www Show that if we maximize (9.40) with respect to µk while keeping the
responsibilities γ(znk ) fixed, we obtain the closed form solution given by (9.17).
9.9 ( ) Show that if we maximize (9.40) with respect to Σk and πk while keeping the
responsibilities γ(znk ) fixed, we obtain the closed form solutions given by (9.19)
and (9.22).
9.10 ( ) Consider a density model given by a mixture distribution

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, p(\mathbf{x}|k) \tag{9.81}$$

and suppose that we partition the vector x into two parts so that x = (xa , xb ).
Show that the conditional density p(xb |xa ) is itself a mixture distribution and find
expressions for the mixing coefficients and for the component densities.

9.11 ( ) In Section 9.3.2, we obtained a relationship between K-means and EM for
Gaussian mixtures by considering a mixture model in which all components have
covariance $\epsilon\mathbf{I}$. Show that in the limit $\epsilon \rightarrow 0$, maximizing the expected complete-data
log likelihood for this model, given by (9.40), is equivalent to minimizing the
distortion measure J for the K-means algorithm given by (9.1).
9.12 ( ) www Consider a mixture distribution of the form

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, p(\mathbf{x}|k) \tag{9.82}$$

where the elements of $\mathbf{x}$ could be discrete or continuous or a combination of these.
Denote the mean and covariance of $p(\mathbf{x}|k)$ by $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$, respectively. Show that
the mean and covariance of the mixture distribution are given by (9.49) and (9.50).
9.13 ( ) Using the re-estimation equations for the EM algorithm, show that a mix-
ture of Bernoulli distributions, with its parameters set to values corresponding to a
maximum of the likelihood function, has the property that

$$\mathbb{E}[\mathbf{x}] = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n \equiv \overline{\mathbf{x}}. \tag{9.83}$$

Hence show that if the parameters of this model are initialized such that all components
have the same mean $\boldsymbol{\mu}_k = \widehat{\boldsymbol{\mu}}$ for $k = 1, \ldots, K$, then the EM algorithm will
converge after one iteration, for any choice of the initial mixing coefficients, and that
this solution has the property $\boldsymbol{\mu}_k = \overline{\mathbf{x}}$. Note that this represents a degenerate case of
the mixture model in which all of the components are identical, and in practice we
try to avoid such solutions by using an appropriate initialization.
9.14 ( ) Consider the joint distribution of latent and observed variables for the Bernoulli
distribution obtained by forming the product of p(x|z, µ) given by (9.52) and p(z|π)
given by (9.53). Show that if we marginalize this joint distribution with respect to z,
then we obtain (9.47).
9.15 ( ) www Show that if we maximize the expected complete-data log likelihood
function (9.55) for a mixture of Bernoulli distributions with respect to µk , we obtain
the M step equation (9.59).
9.16 ( ) Show that if we maximize the expected complete-data log likelihood function
(9.55) for a mixture of Bernoulli distributions with respect to the mixing coefficients
πk , using a Lagrange multiplier to enforce the summation constraint, we obtain the
M step equation (9.60).
9.17 ( ) www Show that as a consequence of the constraint $0 \leqslant p(\mathbf{x}_n|\boldsymbol{\mu}_k) \leqslant 1$ for
the discrete variable xn , the incomplete-data log likelihood function for a mixture
of Bernoulli distributions is bounded above, and hence that there are no singularities
for which the likelihood goes to infinity.

9.18 ( ) Consider a Bernoulli mixture model as discussed in Section 9.3.3, together


with a prior distribution p(µk |ak , bk ) over each of the parameter vectors µk given
by the beta distribution (2.13), and a Dirichlet prior p(π|α) given by (2.38). Derive
the EM algorithm for maximizing the posterior probability p(µ, π|X).
9.19 ( ) Consider a D-dimensional variable $\mathbf{x}$ each of whose components $i$ is itself a
multinomial variable of degree $M$, so that $\mathbf{x}$ is a binary vector with components $x_{ij}$,
where $i = 1, \ldots, D$ and $j = 1, \ldots, M$, subject to the constraint that $\sum_j x_{ij} = 1$ for
all $i$. Suppose that the distribution of these variables is described by a mixture of the
discrete multinomial distributions considered in Section 2.2 so that

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, p(\mathbf{x}|\boldsymbol{\mu}_k) \tag{9.84}$$

where

$$p(\mathbf{x}|\boldsymbol{\mu}_k) = \prod_{i=1}^{D}\prod_{j=1}^{M} \mu_{kij}^{x_{ij}}. \tag{9.85}$$

The parameters $\mu_{kij}$ represent the probabilities $p(x_{ij} = 1|\boldsymbol{\mu}_k)$ and must satisfy
$0 \leqslant \mu_{kij} \leqslant 1$ together with the constraint $\sum_j \mu_{kij} = 1$ for all values of $k$ and $i$.
Given an observed data set {xn }, where n = 1, . . . , N , derive the E and M step
equations of the EM algorithm for optimizing the mixing coefficients πk and the
component parameters µkij of this distribution by maximum likelihood.
9.20 ( ) www Show that maximization of the expected complete-data log likelihood
function (9.62) for the Bayesian linear regression model leads to the M step re-
estimation result (9.63) for α.
9.21 ( ) Using the evidence framework of Section 3.5, derive the M-step re-estimation
equations for the parameter β in the Bayesian linear regression model, analogous to
the result (9.63) for α.
9.22 ( ) By maximization of the expected complete-data log likelihood defined by
(9.66), derive the M step equations (9.67) and (9.68) for re-estimating the hyperpa-
rameters of the relevance vector machine for regression.
9.23 ( ) www In Section 7.2.1 we used direct maximization of the marginal like-
lihood to derive the re-estimation equations (7.87) and (7.88) for finding values of
the hyperparameters α and β for the regression RVM. Similarly, in Section 9.3.4
we used the EM algorithm to maximize the same marginal likelihood, giving the
re-estimation equations (9.67) and (9.68). Show that these two sets of re-estimation
equations are formally equivalent.
9.24 ( ) Verify the relation (9.70) in which L(q, θ) and KL(q∥p) are defined by (9.71)
and (9.72), respectively.

9.25 ( ) www Show that the lower bound L(q, θ) given by (9.71), with q(Z) =
p(Z|X, θ (old) ), has the same gradient with respect to θ as the log likelihood function
ln p(X|θ) at the point θ = θ (old) .
9.26 ( ) www Consider the incremental form of the EM algorithm for a mixture of
Gaussians, in which the responsibilities are recomputed only for a specific data point
xm . Starting from the M-step formulae (9.17) and (9.18), derive the results (9.78)
and (9.79) for updating the component means.
9.27 ( ) Derive M-step formulae for updating the covariance matrices and mixing
coefficients in a Gaussian mixture model when the responsibilities are updated in-
crementally, analogous to the result (9.78) for updating the means.
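One possible way to organize these updates, kept deliberately as a sketch, is to carry the raw second moments $\mathbf{S}_k = \sum_n \gamma(z_{nk})\,\mathbf{x}_n\mathbf{x}_n^{\mathrm{T}}$ as additional sufficient statistics and to recover the covariances as $\boldsymbol{\Sigma}_k = \mathbf{S}_k/N_k - \boldsymbol{\mu}_k\boldsymbol{\mu}_k^{\mathrm{T}}$; the parameterization, function name, and array layout below are illustrative assumptions, not the unique answer to the exercise.

```python
import numpy as np

def incremental_m_step_cov_and_pi(S_k, N_k_new, mu_new, x_m, gamma_old, gamma_new, N):
    """Sketch of incremental covariance and mixing-coefficient updates using
    the raw second moments S_k = sum_n gamma(z_nk) x_n x_n^T.

    S_k     : (K, D, D) current second-moment statistics
    N_k_new : (K,)      updated effective counts from (9.79)
    mu_new  : (K, D)    updated means from (9.78)
    x_m     : (D,)      data point whose responsibilities changed
    N       : total number of data points (unchanged)
    """
    delta = gamma_new - gamma_old
    S_k_new = S_k + delta[:, None, None] * np.outer(x_m, x_m)[None, :, :]
    Sigma_new = (S_k_new / N_k_new[:, None, None]
                 - np.einsum('ki,kj->kij', mu_new, mu_new))   # Sigma_k = S_k/N_k - mu_k mu_k^T
    pi_new = N_k_new / N                                      # pi_k = N_k / N
    return S_k_new, Sigma_new, pi_new
```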
