
Pattern Recognition and Machine Learning

Christopher Bishop

Exercise Solutions

Stefan Stefanache

October 10, 2022


Chapter 1

Introduction

TODO: 1.15, 1.16, 1.20, 1.26, 1.27 + CALCULUS OF VARIATIONS: 1.25, 1.34

Exercise 1.1 ⋆
Consider the sum-of-squares error function given by (1.2) in which the function y(x, w) is given
by the polynomial (1.1). Show that the coefficients w = {w_i} that minimize this error function
are given by the solution to the following set of linear equations

    ∑_{j=0}^{M} A_ij w_j = T_i    (1.122)

where

    A_ij = ∑_{n=1}^{N} (x_n)^{i+j},    T_i = ∑_{n=1}^{N} (x_n)^i t_n.    (1.123)

Here a suffix i or j denotes the index of a component, whereas (x)^i denotes x raised to the power
of i.

Proof. The function y(x, w) is given by

    y(x, w) = ∑_{j=0}^{M} w_j x^j    (1.1)

and the error function is given by

    E(w) = (1/2) ∑_{n=1}^{N} {y(x_n, w) − t_n}²    (1.2)

Since we want to find the coefficients w for which the error function is minimized, we compute
its derivative with respect to w:

    (d/dw) E(w) = (d/dw) [ (1/2) ∑_{n=1}^{N} {y(x_n, w) − t_n}² ]
                = (1/2) ∑_{n=1}^{N} (d/dw) { y(x_n, w)² − 2 t_n y(x_n, w) + t_n² }
                = ∑_{n=1}^{N} y(x_n, w) (d/dw) y(x_n, w) − ∑_{n=1}^{N} t_n (d/dw) y(x_n, w)    (1.1.1)

We continue by computing the derivative of y(x_n, w) separately and obtain that:

    (d/dw) y(x_n, w) = (x_n^0, x_n^1, . . . , x_n^M)^T    (1.1.2)

By substituting the result of (1.1.2) into (1.1.1) we get that:

    (d/dw) E(w) = B − T    (1.1.3)

where T is given by (1.123) and

    B_i = ∑_{n=1}^{N} x_n^i y(x_n, w)

Now, we easily find that

    B_i = ∑_{n=1}^{N} x_n^i ∑_{j=0}^{M} w_j x_n^j = ∑_{n=1}^{N} ∑_{j=0}^{M} x_n^{i+j} w_j = A_i w

where A is given by (1.123). Setting the derivative to zero, the critical point of E(w) is given by
the equation

    A_i w = T_i

which is equivalent to (1.122).
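As a quick numerical sanity check of this result (my addition, not part of the original solution; the synthetic data, the degree M and the variable names are arbitrary), the sketch below builds A and T from (1.123) and verifies that solving (1.122) matches a direct least-squares fit.

```python
import numpy as np

# Synthetic data: noisy samples of a cubic polynomial.
rng = np.random.default_rng(0)
N, M = 50, 3
x = rng.uniform(-1.0, 1.0, size=N)
t = 1.0 - 2.0 * x + 0.5 * x**3 + 0.05 * rng.standard_normal(N)

# Build A_ij = sum_n x_n^(i+j) and T_i = sum_n x_n^i t_n, with i, j = 0..M  (1.123).
powers = np.arange(M + 1)
A = np.array([[np.sum(x ** (i + j)) for j in powers] for i in powers])
T = np.array([np.sum((x ** i) * t) for i in powers])

w = np.linalg.solve(A, T)            # solution of (1.122)

# Cross-check against lstsq on the design matrix Phi_nj = x_n^j.
Phi = x[:, None] ** powers[None, :]
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(np.allclose(w, w_lstsq))       # True: both minimize E(w)
```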

Exercise 1.2 ⋆
Write down the set of coupled linear equations, analogous to (1.122), satisfied by the coefficients
wi which minimize the regularized sum-of-squares error function given by (1.4).

Proof. The regularized sum-of-squares error function is given by

    Ẽ(w) = (1/2) ∑_{n=1}^{N} {y(x_n, w) − t_n}² + (λ/2) ‖w‖²    (1.4)

We take a similar approach to the previous exercise, i.e. we compute the derivative of the
regularized error function and find the associated critical point. We notice that

    Ẽ(w) = E(w) + (λ/2) ‖w‖²

so

    (d/dw) Ẽ(w) = (d/dw) E(w) + (λ/2) · (d/dw) ‖w‖²

One could easily prove that

    (d/dw) ‖w‖² = 2w

so by using this and (1.1.3) (where we substitute B = Aw), we have that:

    (d/dw) Ẽ(w) = Aw + λw − T = (A + λI)w − T

We obtain the critical point when the derivative is 0, so when

    (A + λI)w = T

which is equivalent to the system of linear equations

    ∑_{j=0}^{M} C_ij w_j = T_i

where

    C_ij = A_ij + λ I_ij
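A minimal follow-up sketch (again my addition, with arbitrary synthetic data) showing that the only change for the regularized case is the λI term added to A:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, lam = 50, 3, 1e-2
x = rng.uniform(-1.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

Phi = x[:, None] ** np.arange(M + 1)[None, :]   # Phi_nj = x_n^j
A = Phi.T @ Phi                                  # A_ij = sum_n x_n^(i+j)
T = Phi.T @ t                                    # T_i  = sum_n x_n^i t_n

w_reg = np.linalg.solve(A + lam * np.eye(M + 1), T)   # (A + lambda I) w = T

# The gradient of the regularized error vanishes at the solution.
grad = Phi.T @ (Phi @ w_reg - t) + lam * w_reg
print(np.allclose(grad, 0.0, atol=1e-9))
```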

Exercise 1.3 ⋆⋆
Suppose that we have three coloured boxes r (red), b (blue), and g (green). Box r contains 3 apples,
4 oranges and 3 limes, box b contains 1 apple, 1 orange, and 0 limes, and box g contains 3 apples,
3 oranges, and 4 limes. If a box is chosen at random with probabilities p(r) = 0.2, p(b) = 0.2,
p(g) = 0.6, and a piece of fruit is removed from the box (with equal probability of selecting any of
the items in the box), then what is the probability of selecting an apple? If we observe that the
selected fruit is in fact an orange, what is the probability that it came from the green box?

Proof. The conditional probabilities of obtaining a fruit given that we are drawing from a certain
box are easily found since the fruits are equally likely to be extracted. We also know the probabilities
of choosing a specific box, so we can simply apply the sum rule to obtain the probability of getting
an apple:

    p(apple) = p(apple|r)p(r) + p(apple|b)p(b) + p(apple|g)p(g)
             = (3/10) · 0.2 + (1/2) · 0.2 + (3/10) · 0.6 = 34%

If we know the selected fruit is an orange, the probability that it came from the green box is
given by Bayes' theorem:

    p(g|orange) = p(g) p(orange|g) / p(orange)    (1.3.1)

The probability of choosing the green box is known and the probability of getting an orange
from the green box is also easily found. We only need to find the probability of extracting an
orange in the general case:

    p(orange) = p(orange|r)p(r) + p(orange|b)p(b) + p(orange|g)p(g)
              = (4/10) · 0.2 + (1/2) · 0.2 + (3/10) · 0.6 = 36%

The needed probability is now found by substituting the values into (1.3.1):

    p(g|orange) = 0.6 · (3/10) / 0.36 = 1/2 = 50%
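The arithmetic is easy to confirm mechanically; this short sketch (mine, not part of the original solution) recomputes p(apple) and p(g|orange) exactly from the box contents:

```python
from fractions import Fraction as F

# Box contents: (apples, oranges, limes) and the prior probability of each box.
boxes = {
    "r": ((3, 4, 3), F(2, 10)),
    "b": ((1, 1, 0), F(2, 10)),
    "g": ((3, 3, 4), F(6, 10)),
}

def p_fruit_given_box(counts, idx):
    return F(counts[idx], sum(counts))

# Sum rule: fruit index 0 = apple, 1 = orange.
p_apple = sum(p_fruit_given_box(c, 0) * prior for c, prior in boxes.values())
p_orange = sum(p_fruit_given_box(c, 1) * prior for c, prior in boxes.values())

# Bayes' theorem for p(g | orange).
c_g, prior_g = boxes["g"]
p_g_given_orange = p_fruit_given_box(c_g, 1) * prior_g / p_orange

print(p_apple, p_orange, p_g_given_orange)   # 17/50, 9/25, 1/2
```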

Exercise 1.4 ⋆⋆
Consider a probability density p_x(x) defined over a continuous variable x, and suppose that we
make a nonlinear change of variable using x = g(y), so that the density transforms according to
(1.27). By differentiating (1.27), show that the location ŷ of the maximum of the density in y is not
in general related to the location x̂ of the maximum of the density over x by the simple functional
relation x̂ = g(ŷ) as a consequence of the Jacobian factor. This shows that the maximum of a
probability density (in contrast to a simple function) is dependent on the choice of variable. Verify
that, in the case of a linear transformation, the location of the maximum transforms in the same
way as the variable itself.

Proof. If we make a nonlinear change of variable x = g(y) in the probability density p_x(x), it
transforms according to

    p_y(y) = p_x(g(y)) |g'(y)|    (1.27)

We assume that the mode of p_x(x) is given by a unique x̂, i.e.

    p_x'(x) = 0 ⇐⇒ x = x̂

Now, let s ∈ {−1, 1} be such that |g'(y)| = s g'(y). The derivative of (1.27) with respect to y is
given by:

    p_y'(y) = s p_x'(g(y)) {g'(y)}² + s p_x(g(y)) g''(y)

For a linear change of variable, we have that g''(y) = 0, so the stationarity condition p_y'(y) = 0
reduces to p_x'(g(y)) = 0 (assuming g'(y) ≠ 0), which by our assumption on the mode of p_x gives
g(ŷ) = x̂. Therefore, for a linear change of variable, the location of the maximum transforms in the
same way as the variable itself.

For a nonlinear change of variable, the second derivative g''(y) will not in general be 0, so the
extra term p_x(g(y)) g''(y) shifts the stationary point and the mode is no longer determined by
p_x'(g(y)) = 0 alone. As a result, in general x̂ ≠ g(ŷ), so the location of the mode
will transform differently from the variable itself.

Exercise 1.5 ⋆
Using the definition (1.38) show that var[f(x)] satisfies (1.39).

Proof. The variance is defined by

    var[f] = E[ (f(x) − E[f(x)])² ]    (1.38)

We expand the square and then use the linearity of expectation to obtain:

    var[f] = E[ f(x)² − 2 f(x) E[f(x)] + E[f(x)]² ] = E[f(x)²] − 2 E[ f(x) E[f(x)] ] + E[ E[f(x)]² ]

Since E[f(x)] is a constant, the expression of the variance becomes:

    var[f] = E[f(x)²] − 2 E[f(x)]² + E[f(x)]² = E[f(x)²] − E[f(x)]²    (1.39)

Exercise 1.6 ⋆
Show that if two variables x and y are independent, then their covariance is zero.

Proof. The covariance of two random variables is given by:

    cov[x, y] = E_{x,y}[xy] − E[x]E[y]    (1.41)

We assume that the variables are continuous, but the discrete case is obtained similarly.
If x and y are independent, we have that p(x, y) = p(x)p(y), so

    E_{x,y}[xy] = ∬ p(x, y) xy dx dy = ∬ p(x)p(y) xy dx dy = ( ∫ p(x) x dx ) ( ∫ p(y) y dy ) = E[x]E[y]

and (1.41) becomes 0.

Exercise 1.7 ⋆⋆
In this exercise, we prove the normalization condition (1.48) for the univariate Gaussian. To do
this consider the integral

    I = ∫_{−∞}^{∞} exp(−x²/(2σ²)) dx    (1.124)

which we can evaluate by first writing its square in the form

    I² = ∫_{−∞}^{∞} ∫_{−∞}^{∞} exp(−x²/(2σ²) − y²/(2σ²)) dx dy    (1.125)

Now make the transformation from Cartesian coordinates (x, y) to polar coordinates (r, θ) and
then substitute u = r². Show that, by performing the integrals over θ and u, and then taking the
square root of both sides, we obtain

    I = (2πσ²)^{1/2}    (1.126)

Finally, use this result to show that the Gaussian distribution N(x|µ, σ²) is normalized.

Proof. We transform (1.125) from Cartesian coordinates to polar coordinates and obtain:

    I² = ∫_0^{2π} ∫_0^{∞} exp(−(r² sin²θ + r² cos²θ)/(2σ²)) r dr dθ = ∫_0^{2π} ∫_0^{∞} exp(−r²/(2σ²)) r dr dθ

We use the substitution u = r² and then compute the integral to get:

    I² = (1/2) ∫_0^{2π} ∫_0^{∞} exp(−u/(2σ²)) du dθ = (1/2) ∫_0^{2π} [ −2σ² exp(−u/(2σ²)) ]_0^{∞} dθ = σ² ∫_0^{2π} dθ = 2πσ²

If we take the square root of this we see that

    I = (2πσ²)^{1/2}    (1.126)

We can assume without loss of generality that the mean of the Gaussian is 0, as we could make
the change of variable y = x − µ. Therefore, by using (1.126) we obtain

    ∫_{−∞}^{∞} N(x|0, σ²) dx = (1/√(2πσ²)) ∫_{−∞}^{∞} exp(−x²/(2σ²)) dx = I/√(2πσ²) = 1

which shows that the Gaussian distribution is normalized.
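A quick numerical check of (1.126) and of the normalization, assuming SciPy is available (an added sketch, not part of the original solution):

```python
import numpy as np
from scipy.integrate import quad

sigma = 1.7
gauss = lambda x: np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

I, _ = quad(lambda x: np.exp(-x**2 / (2 * sigma**2)), -np.inf, np.inf)
Z, _ = quad(gauss, -np.inf, np.inf)

print(np.isclose(I, np.sqrt(2 * np.pi * sigma**2)))   # I = (2*pi*sigma^2)^(1/2)
print(np.isclose(Z, 1.0))                             # the Gaussian integrates to 1
```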

Exercise 1.8 ⋆⋆
By using a change of variables, verify that the univariate Gaussian given by (1.46) satisfies (1.49).
Next, by differentiating both sides of the normalization condition

    ∫_{−∞}^{∞} N(x|µ, σ²) dx = 1    (1.127)

with respect to σ², verify that the Gaussian satisfies (1.50). Finally, show that (1.51) holds.

Proof. We start by computing the expected value of the Gaussian:

    E[x] = ∫_{−∞}^{∞} N(x|µ, σ²) x dx = (1/√(2πσ²)) ∫_{−∞}^{∞} exp(−(x − µ)²/(2σ²)) x dx

We do a little trick to prepare for the substitution u = (x − µ)²:

    E[x] = (1/√(2πσ²)) ∫_{−∞}^{∞} exp(−(x − µ)²/(2σ²)) (x − µ) dx + (µ/√(2πσ²)) ∫_{−∞}^{∞} exp(−(x − µ)²/(2σ²)) dx

Since the Gaussian is normalized, the second term of the expression equals µ. By using the
substitution u = (x − µ)², the expected value becomes:

    E[x] = (1/(2√(2πσ²))) ∫_{∞}^{∞} exp(−u/(2σ²)) du + µ

We notice that the endpoints of the integral are "equal" (one could rewrite it as a limit of an
integral with actually equal endpoints), so its value is 0. Therefore,

    E[x] = µ    (1.49)

Now, we take the derivative of (1.127) with respect to σ² and obtain:

    (∂/∂σ²) [ (1/√(2πσ²)) ∫_{−∞}^{∞} exp(−(x − µ)²/(2σ²)) dx ] = 0

    −I/(2σ³√(2π)) + (1/√(2πσ²)) ∫_{−∞}^{∞} (∂/∂σ²) exp(−(x − µ)²/(2σ²)) dx = 0

    −1/(2σ²) + (1/√(2πσ²)) ∫_{−∞}^{∞} ((x − µ)²/(2σ⁴)) exp(−(x − µ)²/(2σ²)) dx = 0

We let J be the integral term and compute it separately:

    J = (1/(2σ⁴)) ∫_{−∞}^{∞} (x − µ)² exp(−(x − µ)²/(2σ²)) dx
      = (1/(2σ⁴)) ∫_{−∞}^{∞} x² exp(−(x − µ)²/(2σ²)) dx − (2µ/(2σ⁴)) ∫_{−∞}^{∞} x exp(−(x − µ)²/(2σ²)) dx + (µ²/(2σ⁴)) I

If we multiply by the normalization constant, the integrals become expected values and the I
factor becomes 1. Therefore:

    J = √(2πσ²) [ (1/(2σ⁴)) E[x²] − (2µ/(2σ⁴)) E[x] + µ²/(2σ⁴) ]

We substitute J back into the initial expression to obtain:

    −1/(2σ²) + (1/(2σ⁴)) ( E[x²] − 2µ² + µ² ) = 0

from which it is straightforward to show that

    E[x²] = σ² + µ²    (1.50)

Finally, one can easily see that:

    var[x] = E[x²] − E[x]² = σ²    (1.51)

Exercise 1.9 ⋆
Show that the mode (i.e. the maximum) of the Gaussian distribution (1.46) is given by µ. Similarly,
show that the mode of the multivariate Gaussian (1.52) is given by µ.

Proof. In the univariate case, we start by taking the derivative of (1.46) with respect to x:

    (∂/∂x) N(x|µ, σ²) = (1/√(2πσ²)) (∂/∂x) exp(−(x − µ)²/(2σ²)) = −((x − µ)/σ²) N(x|µ, σ²)

We notice that the derivative is 0 for x = µ, so the mode of the univariate Gaussian is given
by the mean.

Analogously, we take the derivative of (1.52) with respect to x and get:

    (∂/∂x) N(x|µ, Σ) = (1/((2π)^{D/2} |Σ|^{1/2})) (∂/∂x) exp(−(1/2)(x − µ)^T Σ^{−1} (x − µ))

The covariance matrix Σ is both nonsingular and symmetric, so one can easily show that Σ^{−1}
is symmetric too. Therefore, we have that (see the Matrix Cookbook):

    (∂/∂x) (x − µ)^T Σ^{−1} (x − µ) = 2 Σ^{−1} (x − µ)

As a result, our derivative becomes

    (∂/∂x) N(x|µ, Σ) = −(1/((2π)^{D/2} |Σ|^{1/2})) exp(−(1/2)(x − µ)^T Σ^{−1} (x − µ)) Σ^{−1} (x − µ)

and is 0 for x = µ, so as in the case of the univariate distribution, the mode of the multivariate
distribution is given by the mean µ.

Exercise 1.10 ⋆
Suppose that the two variables x and z are statistically independent. Show that the mean and
variance of their sum satisfy

    E[x + z] = E[x] + E[z]    (1.128)
    var[x + z] = var[x] + var[z]    (1.129)

Proof. Since the variables are independent, we have that p(x, z) = p(x)p(z). Therefore, by using
this, the expression of the expected value and the fact that the distributions are normalized, we
have that

    E[x + z] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} p(x, z)(x + z) dx dz
             = ∫_{−∞}^{∞} ∫_{−∞}^{∞} ( p(x)p(z)x + p(x)p(z)z ) dx dz
             = ∫_{−∞}^{∞} ( p(z) ∫_{−∞}^{∞} p(x)x dx + p(z)z ∫_{−∞}^{∞} p(x) dx ) dz
             = ∫_{−∞}^{∞} ( p(z)E[x] + p(z)z ) dz
             = E[x] ∫_{−∞}^{∞} p(z) dz + ∫_{−∞}^{∞} p(z)z dz
             = E[x] + E[z]    (1.128)

The discrete case is handled analogously. Now, by using all the available tools, i.e. (1.39)
and (1.128), the linearity of the expectation and the independence of the variables (which gives
E[xz] = E[x]E[z]), the variance of the sum is given by:

    var[x + z] = E[(x + z)²] − E[x + z]² = E[x² + 2xz + z²] − (E[x] + E[z])²
               = E[x²] + 2E[x]E[z] + E[z²] − E[x]² − 2E[x]E[z] − E[z]²
               = E[x²] − E[x]² + E[z²] − E[z]²
               = var[x] + var[z]    (1.129)
Exercise 1.11 ⋆
By setting the derivatives of the log likelihood function (1.54) with respect to µ and σ² equal to
zero, verify the results (1.55) and (1.56).

Proof. The log likelihood of the Gaussian is given by:

    ln p(x|µ, σ²) = −(1/(2σ²)) ∑_{n=1}^{N} (x_n − µ)² − (N/2) ln σ² − (N/2) ln(2π)    (1.54)

By taking the derivative of (1.54) with respect to µ we get that:

    (∂/∂µ) ln p(x|µ, σ²) = −(1/(2σ²)) (∂/∂µ) ∑_{n=1}^{N} (x_n − µ)²
                         = −(1/(2σ²)) (∂/∂µ) ( ∑_{n=1}^{N} x_n² − 2µ ∑_{n=1}^{N} x_n + Nµ² )
                         = (1/σ²) ( ∑_{n=1}^{N} x_n − Nµ )

which is 0 for the maximum point:

    µ_ML = (1/N) ∑_{n=1}^{N} x_n    (1.55)

Now, we want the variance that maximizes the log likelihood, so we take the derivative of (1.54)
(evaluated at µ_ML) with respect to σ²:

    (∂/∂σ²) ln p(x|µ_ML, σ²) = (1/(2σ⁴)) ∑_{n=1}^{N} (x_n − µ_ML)² − N/(2σ²)
                             = (1/(2σ⁴)) ( ∑_{n=1}^{N} (x_n − µ_ML)² − Nσ² )

The derivative is 0 for the maximum point

    σ²_ML = (1/N) ∑_{n=1}^{N} (x_n − µ_ML)²    (1.56)
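An added sketch (arbitrary synthetic data, not part of the original solution) checking that the closed-form estimates (1.55) and (1.56) do maximize the log likelihood, by comparing against a coarse grid search:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)

mu_ml = x.mean()                      # (1.55)
var_ml = ((x - mu_ml) ** 2).mean()    # (1.56), note the 1/N (not 1/(N-1)) factor

def log_lik(mu, var):
    return -0.5 * np.sum((x - mu) ** 2) / var - 0.5 * len(x) * np.log(2 * np.pi * var)

mus = np.linspace(mu_ml - 1, mu_ml + 1, 201)
vars_ = np.linspace(0.5 * var_ml, 2.0 * var_ml, 201)
best = max((log_lik(m, v), m, v) for m in mus for v in vars_)
print(mu_ml, var_ml, best[1], best[2])   # the grid optimum lands near the closed form
```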

Exercise 1.12 ⋆⋆
Using the results (1.49) and (1.50), show that

    E[x_n x_m] = µ² + I_nm σ²    (1.130)

where x_n and x_m denote data points sampled from a Gaussian distribution with mean µ and
variance σ², and I_nm satisfies I_nm = 1 if n = m and I_nm = 0 otherwise. Hence prove the results
(1.57) and (1.58).

Proof. We assume that the data points are i.i.d., so the variables x_n and x_m are independent
for n ≠ m, which gives E[x_n x_m] = E[x_n]E[x_m] = µ², while for n = m we have
E[x_n x_m] = E[x_n²] = µ² + σ² by (1.50). Therefore,

    E[x_n x_m] = µ²        for n ≠ m
    E[x_n x_m] = µ² + σ²   for n = m

which is equivalent to (1.130). Now, the expectation of µ_ML is given by:

    E[µ_ML] = E[ (1/N) ∑_{n=1}^{N} x_n ] = (1/N) ∑_{n=1}^{N} E[x_n] = µ    (1.57)

Similarly, the expectation of σ²_ML is given by:

    E[σ²_ML] = E[ (1/N) ∑_{n=1}^{N} (x_n − µ_ML)² ] = (1/N) ∑_{n=1}^{N} E[x_n² − 2 x_n µ_ML + µ²_ML]
             = (1/N) ∑_{n=1}^{N} ( µ² + σ² − 2 E[x_n µ_ML] + E[µ²_ML] )

We compute each expectation separately and get:

    E[µ²_ML] = (1/N²) E[ ∑_{n=1}^{N} x_n² + 2 ∑_{i=1}^{N−1} ∑_{j=i+1}^{N} x_i x_j ]
             = (1/N²) ( ∑_{n=1}^{N} E[x_n²] + 2 ∑_{i=1}^{N−1} ∑_{j=i+1}^{N} E[x_i x_j] )
             = σ²/N + µ²

    E[x_n µ_ML] = (1/N) E[ x_n ∑_{i=1}^{N} x_i ] = (1/N) (σ² + Nµ²) = σ²/N + µ²

By putting everything together, we obtain

    E[σ²_ML] = ((N − 1)/N) σ²    (1.58)
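An added Monte Carlo sketch of (1.57) and (1.58): averaged over many datasets, µ_ML is unbiased while σ²_ML is biased by the factor (N − 1)/N (parameter values are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, N, trials = 1.0, 4.0, 5, 200_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mu_ml = samples.mean(axis=1)
var_ml = samples.var(axis=1)          # numpy's default ddof=0 is the 1/N estimator

print(mu_ml.mean())                            # ~ 1.0
print(var_ml.mean(), (N - 1) / N * sigma2)     # both ~ 3.2
```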

Exercise 1.13 ⋆
Suppose that the variance of a Gaussian is estimated using the result (1.56) but with the maximum
likelihood estimate µ_ML replaced with the true value µ of the mean. Show that this estimator has
the property that its expectation is given by the true variance σ².

Proof. Let

    σ*²_ML = (1/N) ∑_{n=1}^{N} (x_n − µ)²

be the estimator described in the hypothesis. It is straightforward to show that the expectation of
the estimator is the actual variance:

    E[σ*²_ML] = (1/N) ∑_{n=1}^{N} ( E[x_n²] − 2µ E[x_n] + µ² ) = (1/N) ∑_{n=1}^{N} (σ² + µ² − 2µ² + µ²) = σ²
Exercise 1.14 ⋆⋆
Show that an arbitrary square matrix with elements w_ij can be written in the form w_ij = w^S_ij + w^A_ij
where w^S_ij and w^A_ij are symmetric and anti-symmetric matrices, respectively, satisfying w^S_ij = w^S_ji
and w^A_ij = −w^A_ji for all i and j. Now consider the second order term in a higher order polynomial
in D dimensions, given by

    ∑_{i=1}^{D} ∑_{j=1}^{D} w_ij x_i x_j    (1.131)

Show that

    ∑_{i=1}^{D} ∑_{j=1}^{D} w_ij x_i x_j = ∑_{i=1}^{D} ∑_{j=1}^{D} w^S_ij x_i x_j    (1.132)

so that the contribution from the anti-symmetric matrix vanishes. We therefore see that, without loss
of generality, the matrix of coefficients w_ij can be chosen to be symmetric, and so not all of the
D² elements of this matrix can be chosen independently. Show that the number of independent
parameters in the matrix w^S_ij is given by D(D + 1)/2.

Proof. If we consider the system of equations

    w_ij = w^S_ij + w^A_ij,    w_ji = w^S_ij − w^A_ij

we quickly reach the conclusion that the solutions are given by

    w^S_ij = (w_ij + w_ji)/2,    w^A_ij = (w_ij − w_ji)/2    (1.14.1)

such that for all i and j,

    w_ij = w^S_ij + w^A_ij

For the anti-symmetric part, note that x_i x_j = x_j x_i, so its contribution to (1.131) satisfies

    S = ∑_{i=1}^{D} ∑_{j=1}^{D} w^A_ij x_i x_j = ∑_{i=1}^{D} ∑_{j=1}^{D} w^A_ji x_j x_i = −S

by relabelling the summation indices and using w^A_ij = −w^A_ji. Hence S = 0, the anti-symmetric
contribution vanishes, and (1.132) holds.

We consider as independent parameters of the matrix w^S the elements on and above the diagonal,
since the ones under the diagonal are reflections of the ones above. There are

    ∑_{i=1}^{D} (D − i + 1) = D² + D − ∑_{i=1}^{D} i = D² + D − D(D + 1)/2 = D(D + 1)/2

such independent parameters.
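A small added sketch (random matrix, arbitrary dimension) illustrating the decomposition (1.14.1), the vanishing anti-symmetric contribution (1.132) and the D(D + 1)/2 parameter count:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
W = rng.standard_normal((D, D))
WS = 0.5 * (W + W.T)          # symmetric part, (1.14.1)
WA = 0.5 * (W - W.T)          # anti-symmetric part, (1.14.1)
x = rng.standard_normal(D)

print(np.allclose(W, WS + WA))
print(np.isclose(x @ W @ x, x @ WS @ x))   # quadratic forms agree, (1.132)
print(np.isclose(x @ WA @ x, 0.0))         # anti-symmetric contribution vanishes

# Independent parameters of the symmetric part: the upper triangle incl. diagonal.
print(np.triu_indices(D)[0].size, D * (D + 1) // 2)
```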

Exercise 1.15 ⋆⋆⋆
In this exercise and the next, we explore how the number of independent parameters in a polynomial
grows with the order M of the polynomial and with the dimensionality D of the input space. We
start by writing down the Mth order term for a polynomial in D dimensions in the form

    ∑_{i1=1}^{D} ∑_{i2=1}^{D} · · · ∑_{iM=1}^{D} w_{i1 i2 ... iM} x_{i1} x_{i2} · · · x_{iM}    (1.133)

The coefficients w_{i1 i2 ... iM} comprise D^M elements, but the number of independent parameters
is significantly fewer due to the many interchange symmetries of the factor x_{i1} x_{i2} · · · x_{iM}. Begin
by showing that the redundancy in the coefficients can be removed by rewriting the Mth order
term in the form

    ∑_{i1=1}^{D} ∑_{i2=1}^{i1} · · · ∑_{iM=1}^{i_{M−1}} w̃_{i1 i2 ... iM} x_{i1} x_{i2} · · · x_{iM}    (1.134)

Note that the precise relationship between the w̃ coefficients and the w coefficients need not be
made explicit. Use this result to show that the number of independent parameters n(D, M), which
appear at order M, satisfies the following recursion relation

    n(D, M) = ∑_{i=1}^{D} n(i, M − 1)    (1.135)

Next use proof by induction to show that the following result holds

    ∑_{i=1}^{D} (i + M − 2)! / ((i − 1)!(M − 1)!) = (D + M − 1)! / ((D − 1)!M!)    (1.136)

which can be done by first proving the result for D = 1 and arbitrary M by making use of the result
0! = 1, then assuming it is correct for dimension D and verifying that it is correct for dimension
D + 1. Finally, use the two previous results, together with proof by induction, to show

    n(D, M) = (D + M − 1)! / ((D − 1)!M!)    (1.137)

To do this, first show that the result is true for M = 2, and any value of D ≥ 1, by comparison
with the result of Exercise 1.14. Then make use of (1.135), together with (1.136), to show that, if
the result holds at order M − 1, then it will also hold at order M.

Proof.

Exercise 1.17 ⋆⋆
The gamma function is defined by

    Γ(x) = ∫_0^{∞} u^{x−1} e^{−u} du    (1.141)

Using integration by parts, prove the relation Γ(x + 1) = xΓ(x). Show also that Γ(1) = 1 and
hence that Γ(x + 1) = x! when x is an integer.

Proof. Knowing that −u^x e^{−u} → 0 as u → ∞, we integrate Γ(x + 1) by parts and obtain:

    Γ(x + 1) = ∫_0^{∞} u^x e^{−u} du = [ −u^x e^{−u} ]_0^{∞} + x ∫_0^{∞} u^{x−1} e^{−u} du = xΓ(x)

Computing Γ(1) is immediate from the definition:

    Γ(1) = ∫_0^{∞} e^{−u} du = [ −e^{−u} ]_0^{∞} = 1

We can prove by induction that Γ(x + 1) = x! when x is an integer. This is obviously valid for
x = 0, since Γ(1) = 1 = 0!. Now, assume that Γ(k) = (k − 1)! for some k ∈ N. Then

    Γ(k + 1) = kΓ(k) = k · (k − 1)! = k!

Therefore, Γ(n + 1) = n! for all n ∈ N.

Exercise 1.18 ⋆⋆
We can use the result (1.126) to derive an expression for the surface area S_D and the volume V_D
of a sphere of unit radius in D dimensions. To do this, consider the following result, which is
obtained by transforming from Cartesian to polar coordinates

    ∏_{i=1}^{D} ∫_{−∞}^{∞} e^{−x_i²} dx_i = S_D ∫_0^{∞} e^{−r²} r^{D−1} dr    (1.142)

Using the definition (1.141) of the Gamma function, together with (1.126), evaluate both sides
of this equation, and hence show that

    S_D = 2π^{D/2} / Γ(D/2)    (1.143)

Next, by integrating with respect to radius from 0 to 1, show that the volume of the unit sphere
in D dimensions is given by

    V_D = S_D / D    (1.144)

Finally, use the results Γ(1) = 1 and Γ(3/2) = √π/2 to show that (1.143) and (1.144) reduce
to the usual expressions for D = 2 and D = 3.

Proof. We observe that each factor on the left side of (1.142) looks like (1.126) with σ² = 1/2. Therefore,

    ∏_{i=1}^{D} ∫_{−∞}^{∞} e^{−x_i²} dx_i = ∏_{i=1}^{D} π^{1/2} = π^{D/2}

One can easily notice that the integral on the right side of (1.142) can be written as:

    ∫_0^{∞} e^{−r²} r^{D−1} dr = ∫_0^{∞} e^{−r²} (r²)^{(D−2)/2} r dr = (1/2) ∫_0^{∞} e^{−u} u^{(D−2)/2} du = (1/2) Γ(D/2)

where we made the substitution u = r².

Therefore, from those results and from (1.142), we find that

    S_D = 2π^{D/2} / Γ(D/2)    (1.143)

The volume of the unit hypersphere is now given by the integral

    V_D = S_D ∫_0^1 r^{D−1} dr = S_D / D    (1.144)

Now, we get the expected results for D = 2 and D = 3:

    S_2 = 2π/Γ(1) = 2π,    V_2 = π,    S_3 = 2π^{3/2}/Γ(3/2) = 4π,    V_3 = 4π/3
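An added sketch evaluating (1.143) and (1.144) with the standard-library gamma function and checking the familiar D = 2, 3 cases:

```python
import math

def surface_area(D: int) -> float:
    return 2 * math.pi ** (D / 2) / math.gamma(D / 2)   # (1.143)

def volume(D: int) -> float:
    return surface_area(D) / D                           # (1.144)

print(math.isclose(surface_area(2), 2 * math.pi))   # circle circumference, r = 1
print(math.isclose(volume(2), math.pi))             # disc area
print(math.isclose(surface_area(3), 4 * math.pi))   # sphere surface
print(math.isclose(volume(3), 4 * math.pi / 3))     # ball volume

# The unit ball's volume peaks around D = 5 and then shrinks towards zero.
print([round(volume(D), 4) for D in range(1, 11)])
```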

Exercise 1.19 ⋆⋆
Consider a sphere of radius a in D dimensions together with the concentric hypercube of side 2a,
so that the sphere touches the hypercube at the centres of each of its sides. By using the results of
Exercise 1.18, show that the ratio of the volume of the sphere to the volume of the cube is given by

    volume of sphere / volume of cube = π^{D/2} / (D 2^{D−1} Γ(D/2))    (1.145)

Now, make use of Stirling's formula in the form

    Γ(x + 1) ≃ (2π)^{1/2} e^{−x} x^{x+1/2}    (1.146)

which is valid for x ≫ 1, to show that, as D → ∞, the ratio (1.145) goes to zero. Show also
that the ratio of the distance from the centre of the hypercube to one of the corners, divided by
the perpendicular distance to one of the sides, is √D, which therefore goes to ∞ as D → ∞.
From these results we see that, in a space of high dimensionality, most of the volume of a cube is
concentrated in a large number of corners, which themselves become very long "spikes"!

Proof. Using the results of Exercise 1.18, we have that the volume of a D-dimensional hypersphere
of radius a is

    V_D^{sphere}(a) = 2π^{D/2} a^D / (D Γ(D/2))

We also know that the volume of the D-dimensional hypercube of side 2a is given by:

    V_D^{cube}(2a) = (2a)^D = 2^D a^D

Therefore the ratio of the volumes is given by

    V_D^{sphere}(a) / V_D^{cube}(2a) = π^{D/2} / (D 2^{D−1} Γ(D/2))    (1.145)

By using Stirling's approximation in the form Γ(D/2) = Γ((D/2 − 1) + 1) ≃ (2π)^{1/2} e^{1−D/2} (D/2 − 1)^{(D−1)/2},
we have that

    lim_{D→∞} π^{D/2} / (D 2^{D−1} Γ(D/2)) = lim_{D→∞} π^{D/2} / ( D 2^{D−1} (2π)^{1/2} e^{1−D/2} (D/2 − 1)^{(D−1)/2} )
                                           = lim_{D→∞} ( πe/(2(D − 2)) )^{D/2} · √(D − 2)/(e D √π) = 0

since the first factor decays to zero faster than exponentially while the second factor varies only
polynomially in D.

Now, we want to find the ratio between the distance from the centre of the hypercube to one of
the corners and the distance from the centre to a side. We can consider without loss of generality
a D-dimensional hypercube of side 2a, centred at the origin 0_D of the R^D Cartesian system.
The centre of a hypercube side takes the form s = (α_1, α_2, . . . , α_D), where exactly one coordinate
equals ±a and the rest are 0, so ‖s‖ = a. On the other hand, the corners of the hypercube take the
form c = (β_1, β_2, . . . , β_D), where β_i ∈ {±a}, so ‖c‖ = a√D. As a result, the ratio is as expected:

    distance from centre to corner / distance from centre to side = ‖c‖/‖s‖ = a√D / a = √D

Exercise 1.21 ⋆⋆
Consider two nonnegative numbers a and b, and show that, if a ≤ b, then a ≤ (ab)^{1/2}. Use
this result to show that, if the decision regions of a two-class classification problem are chosen to
minimize the probability of misclassification, this probability will satisfy

    p(mistake) ≤ ∫ {p(x, C_1) p(x, C_2)}^{1/2} dx    (1.150)

Proof. We start by proving the inequality. Since a and b are nonnegative,

    a ≤ (ab)^{1/2} ⇐⇒ a² ≤ ab ⇐⇒ a² − ab ≤ 0 ⇐⇒ a(a − b) ≤ 0

which is true since a ≤ b.

Now, since the regions are chosen to minimize the probability of misclassification, for an
individual value of x, the region R_k with the higher joint/posterior probability associated to C_k is
chosen, so:

    p(x, C_2) ≤ p(x, C_1) for all x ∈ R_1,    p(x, C_1) ≤ p(x, C_2) for all x ∈ R_2

By applying the inequality a ≤ (ab)^{1/2} above, we get that

    p(x, C_2) ≤ {p(x, C_1) p(x, C_2)}^{1/2} for all x ∈ R_1,    p(x, C_1) ≤ {p(x, C_1) p(x, C_2)}^{1/2} for all x ∈ R_2

If we integrate the inequalities over the associated regions, we have that:

    ∫_{R_1} p(x, C_2) dx ≤ ∫_{R_1} {p(x, C_1) p(x, C_2)}^{1/2} dx

    ∫_{R_2} p(x, C_1) dx ≤ ∫_{R_2} {p(x, C_1) p(x, C_2)}^{1/2} dx

By summing the above inequalities, we find that:

    ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx ≤ ∫ {p(x, C_1) p(x, C_2)}^{1/2} dx

which is equivalent to (1.150).
Exercise 1.22 ⋆
Given a loss matrix with elements L_kj, the expected risk is minimized if, for each x, we choose the
class that minimizes (1.81). Verify that, when the loss matrix is given by L_kj = 1 − I_kj, where I_kj
are the elements of the identity matrix, this reduces to the criterion of choosing the class having
the largest posterior probability. What is the interpretation of this form of loss matrix?

Proof. The expectation is minimized if for each x we choose the class C_j such that the quantity

    ∑_k L_kj p(C_k|x)    (1.81)

is minimized. For L_kj = 1 − I_kj the quantity becomes

    ∑_k (1 − I_kj) p(C_k|x) = ∑_k p(C_k|x) − p(C_j|x) = 1 − p(C_j|x)

and it is obviously minimized by choosing the class C_j having the largest posterior probability
p(C_j|x).

This form of loss matrix gives every mistake the same "weight": no misclassification is considered
worse than another.

Exercise 1.23 ⋆
Derive the criterion for minimizing the expected loss when there is a general loss matrix and general
prior probabilities for the classes.

Proof. Minimizing the expected loss

    E[L] = ∑_k ∑_j ∫_{R_j} L_kj p(x, C_k) dx    (1.80)

is equivalent to minimizing

    ∑_k L_kj p(x, C_k)

for each x. Therefore, by using the product rule p(x, C_k) = p(x|C_k) p(C_k), the criterion for
minimizing the expected loss is to assign each x to the class C_j for which

    ∑_k L_kj p(x|C_k) p(C_k)

is minimized.
Exercise 1.24 ⋆⋆
Consider a classification problem in which the loss incurred when an input vector from class C_k is
classified as belonging to class C_j is given by the loss matrix L_kj, and for which the loss incurred
in selecting the reject option is λ. Find the decision criterion that will give the minimum expected
loss. Verify that this reduces to the reject criterion discussed in Section 1.5.3 when the loss matrix
is given by L_kj = 1 − I_kj. What is the relationship between λ and the rejection threshold θ?

Proof. The decision criterion reduces to choosing the smaller of the expected loss of the best class
and the reject loss λ. Therefore, if

    α = argmin_j ∑_k L_kj p(C_k|x)

we choose the class α if the associated minimum expected loss is less than λ, and use the reject
option otherwise. If the loss matrix is given by L_kj = 1 − I_kj, then

    α = argmin_j {1 − p(C_j|x)}

which makes C_α the class with the highest posterior probability. Therefore the criterion reduces to
the one discussed in Section 1.5.3: if the highest posterior probability is smaller than 1 − λ, then
we use the reject option. This is equivalent to using θ = 1 − λ in Section 1.5.3.

Exercise 1.25 ⋆ CALCULUS OF VARIATIONS

Consider the generalization of the squared loss function (1.87) for a single target variable t to the
case of multiple target variables described by the vector t, given by

    E[L(t, y(x))] = ∬ ‖y(x) − t‖² p(x, t) dx dt    (1.151)

Using the calculus of variations, show that the function y(x) for which this expected loss is
minimized is given by y(x) = E_t[t|x]. Show that this result reduces to (1.89) for the case of a
single target variable t.

Exercise 1.26 ⋆ TODO

By expansion of the square in (1.151), derive a result analogous to (1.90), and hence show that the
function y(x) that minimizes the expected squared loss for the case of a vector t of target variables
is again given by the conditional expectation of t.

Exercise 1.27 ⋆⋆ TODO

Consider the expected loss for regression problems under the L_q loss function given by (1.91).
Write down the condition that y(x) must satisfy in order to minimize E[L_q]. Show that, for q = 1,
this solution represents the conditional median, i.e., the function y(x) such that the probability
mass for t < y(x) is the same as for t ≥ y(x). Also show that the minimum expected L_q loss for q → 0
is given by the conditional mode, i.e., by the function y(x) equal to the value of t that maximizes
p(t|x) for each x.

Proof.

Exercise 1.28 ⋆
In Section 1.6, we introduced the idea of entropy h(x) as the information gained on observing the
value of a random variable x having distribution p(x). We saw that, for independent variables x and
y for which p(x, y) = p(x)p(y), the entropy functions are additive, so that h(x, y) = h(x) + h(y). In
this exercise, we derive the relation between h and p in the form of a function h(p). First show
that h(p²) = 2h(p), and hence by induction that h(p^n) = nh(p) where n is a positive integer. Hence
show that h(p^{n/m}) = (n/m)h(p) where m is also a positive integer. This implies that h(p^x) = xh(p)
where x is a positive rational number, and hence by continuity when it is a positive real number.
Finally, show that this implies h(p) must take the form h(p) ∝ ln p.

Proof. For independent variables x and y we have that:

    h(x, y) = − log₂ p(x, y) = − log₂ p(x)p(y) = − log₂ p(x) − log₂ p(y) = h(x) + h(y)

Next, we show that:

    h(p²) = − log₂ p² = −2 log₂ p = 2h(p)

and more generally for a positive integer n:

    h(p^n) = − log₂ p^n = −n log₂ p = nh(p)

This can be extended to rational numbers by letting n, m ∈ N and showing that:

    h(p^{n/m}) = − log₂ p^{n/m} = −(n/m) log₂ p = (n/m) h(p)

Finally, since

    h(p) = − log₂ p = −(1/ln 2) ln p

we have that h(p) ∝ ln p.

Exercise 1.29 ⋆
Consider an M-state discrete random variable x, and use Jensen's inequality in the form (1.115)
to show that the entropy of the distribution p(x) satisfies H[x] ≤ ln M.

Proof. The entropy of the distribution p(x) is given by:

    H[x] = − ∑_{i=1}^{M} p(x_i) ln p(x_i) = ∑_{i=1}^{M} p(x_i) ln (1/p(x_i))

We apply Jensen's inequality (1.115) with λ_i = p(x_i), the points 1/p(x_i), and the convex function
f(x) = −ln x, which gives

    − ln ( ∑_{i=1}^{M} p(x_i) · (1/p(x_i)) ) ≤ ∑_{i=1}^{M} p(x_i) ( −ln (1/p(x_i)) ) = −H[x]    (1.29.1)

The argument of the logarithm on the left-hand side equals M, so (1.29.1) reads −ln M ≤ −H[x],
and therefore

    H[x] ≤ ln M

Exercise 1.30 ⋆⋆
Evaluate the Kullback-Leibler divergence (1.113) between two Gaussians p(x) = N(x|µ, σ²) and
q(x) = N(x|m, s²).

Proof. The Kullback-Leibler divergence is given by

    KL(p‖q) = − ∫ p(x) ln ( q(x)/p(x) ) dx    (1.113)

We start by splitting the integral into:

    KL(p‖q) = − ∫ p(x) ln q(x) dx + ∫ p(x) ln p(x) dx

The negation of the second term is equal to the entropy of the Gaussian, that is:

    H_p[x] = (1/2) {1 + ln(2πσ²)}    (1.110)

We have that

    ln q(x) = ln N(x|m, s²) = −(1/2) ln(2πs²) − (x − m)²/(2s²)

so by using the fact that the Gaussian is normalized and by recognizing the expected values, the KL
divergence becomes:

    KL(p‖q) = (1/(2s²)) ∫ p(x)x² dx − (m/s²) ∫ p(x)x dx + (1/2) ln(2πs²) + (m²/(2s²)) ∫ p(x) dx − H_p[x]
            = (1/(2s²)) E[x²] − (m/s²) E[x] + m²/(2s²) + (1/2) ln(2πs²) − (1/2) − (1/2) ln(2πσ²)
            = (σ² + µ² − 2mµ + m²)/(2s²) + ln(s/σ) − 1/2
            = ln(s/σ) + (σ² + (µ − m)²)/(2s²) − 1/2
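An added sketch comparing this closed-form result with a Monte Carlo estimate of the divergence (parameter values are arbitrary):

```python
import numpy as np

mu, sigma = 0.5, 1.2      # parameters of p
m, s = -0.3, 2.0          # parameters of q

kl_closed = np.log(s / sigma) + (sigma**2 + (mu - m)**2) / (2 * s**2) - 0.5

# Monte Carlo estimate of E_p[ln p(x) - ln q(x)].
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=1_000_000)
log_p = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
log_q = -0.5 * np.log(2 * np.pi * s**2) - (x - m)**2 / (2 * s**2)
kl_mc = np.mean(log_p - log_q)

print(kl_closed, kl_mc)   # the two estimates agree to a few decimal places
```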

Exercise 1.31 ⋆⋆
Consider two variables x and y having joint distribution p(x, y). Show that the differential entropy
of this pair of variables satisfies

    H[x, y] ≤ H[x] + H[y]    (1.152)

with equality if, and only if, x and y are statistically independent.

Proof. The differential entropy of the two variables x and y is given by

    H[x, y] = H[y|x] + H[x]    (1.112)

so (1.152) becomes equivalent to

    H[y|x] − H[y] ≤ 0    (1.31.1)

which we are going to prove now.

We start by rewriting the entropy H[y] as

    H[y] = − ∫ p(y) ln p(y) dy = − ∬ p(x, y) ln p(y) dx dy

Therefore, since the conditional entropy is given by

    H[y|x] = − ∬ p(x, y) ln p(y|x) dx dy    (1.111)

we have that

    H[y|x] − H[y] = − ∬ p(x, y) ln p(y|x) dx dy + ∬ p(x, y) ln p(y) dx dy
                  = ∬ p(x, y) ln ( p(y)/p(y|x) ) dx dy

By using the inequality ln α ≤ α − 1, valid for all α > 0, we obtain:

    H[y|x] − H[y] ≤ ∬ p(x) p(y|x) ( p(y)/p(y|x) − 1 ) dx dy
                  = ∬ p(x) p(y) dx dy − ∬ p(x, y) dx dy
                  = 1 − 1 = 0

which proves (1.31.1) and hence (1.152).
Exercise 1.32 ⋆
Consider a vector x of continuous variables with distribution p(x) and corresponding entropy H[x].
Suppose that we make a nonsingular linear transformation of x to obtain a new variable y = Ax.
Show that the corresponding entropy is given by H[y] = H[x] + ln |A|, where |A| denotes the
determinant of A.

Proof. By generalizing (1.27) to the multivariate case, we have that:

    p_y(y) = p_x(x) |∂x/∂y| = p_x(x) |∂(A^{−1}y)/∂y| = p_x(x) |A^{−1}| = p_x(x) |A|^{−1}

where J = |∂x/∂y| = |A|^{−1} is the (absolute value of the) Jacobian determinant.

Now, the entropy of y is given by:

    H[y] = − ∫ p_y(y) ln p_y(y) dy = − ∫ (p_x(x)/|A|) ln (p_x(x)/|A|) |dy/dx| dx = − ∫ p_x(x) ln (p_x(x)/|A|) dx
         = − ∫ p_x(x) ln p_x(x) dx + ln |A| ∫ p_x(x) dx
         = H[x] + ln |A|

Exercise 1.33 ⋆⋆
Suppose that the conditional entropy H[y|x] between two discrete random variables x and y is
zero. Show that, for all values of x such that p(x) > 0, the variable y must be a function of x; in
other words, for each x there is only one value of y such that p(y|x) ≠ 0.

Proof. Assuming x and y have N and M outcomes respectively, we can rewrite the conditional entropy as:

    H[y|x] = − ∑_{i=1}^{N} ∑_{j=1}^{M} p(x_i, y_j) ln p(y_j|x_i) = − ∑_{i=1}^{N} p(x_i) ∑_{j=1}^{M} p(y_j|x_i) ln p(y_j|x_i)

Since all the sum terms have the same sign, the entropy is 0 only if each term is 0. Therefore, the
entropy is 0 if for each x_i with p(x_i) > 0 the inner sum terms are 0. This happens only for
p(y_j|x_i) = 0 or ln p(y_j|x_i) = 0, which means that p(y_j|x_i) ∈ {0, 1}. Since ∑_{j=1}^{M} p(y_j|x_i) = 1,
we conclude that for each x_i there is a unique y_j such that p(y_j|x_i) = 1, which proves the claim.

Exercise 1.34 ⋆⋆ CALCULUS OF VARIATIONS

Use the calculus of variations to show that the stationary point of the functional (1.108) is given by
(1.108). Then use the constraints (1.105), (1.106) and (1.107) to eliminate the Lagrange multipliers
and hence show that the maximum entropy solution is given by the Gaussian (1.109).

Exercise 1.35 ⋆
Use the results (1.106) and (1.107) to show that the entropy of the univariate Gaussian (1.109) is
given by (1.110).

Proof. The entropy of the univariate Gaussian is given by:

    H[x] = − ∫ N(x|µ, σ²) ln N(x|µ, σ²) dx
         = (1/2) ln(2πσ²) ∫ N(x|µ, σ²) dx + ∫ N(x|µ, σ²) (x − µ)²/(2σ²) dx
         = (1/2) ln(2πσ²) + (µ²/(2σ²)) ∫ N(x|µ, σ²) dx + (1/(2σ²)) ∫ N(x|µ, σ²) x² dx − (µ/σ²) ∫ N(x|µ, σ²) x dx

By using the fact that the Gaussian is normalized and by recognizing the expressions of the expected
values, we have that

    H[x] = (1/2) ln(2πσ²) + µ²/(2σ²) + (1/(2σ²)) E[x²] − (µ/σ²) E[x] = (1/2) {1 + ln(2πσ²)}    (1.110)

Exercise 1.36 ⋆
A strictly convex function is defined as one for which every chord lies above the function. Show
that this is equivalent to the condition that the second derivative of the function be positive.

Proof. Suppose that f is a twice differentiable function. By summing the Taylor expansions of
f(x + h) and f(x − h), one can show that

    f''(x) = lim_{h→0} ( f(x + h) + f(x − h) − 2f(x) ) / h²

Therefore, establishing f''(x) > 0 amounts to showing that, for every h ≠ 0,

    f(x + h) + f(x − h) − 2f(x) > 0 ⇐⇒ (1/2) f(x + h) + (1/2) f(x − h) − f(x) > 0

If f is strictly convex, we can apply (1.114) in its strict form with λ = 1/2 to obtain

    (1/2) f(x + h) + (1/2) f(x − h) − f(x) > f( (1/2)(x + h) + (1/2)(x − h) ) − f(x) = 0

Therefore, the second derivative of a strictly convex function is positive.
Exercise 1.37 ⋆
Using the definition (1.111) together with the product rule of probability, prove the result (1.112).

Proof. Using the product rule of probability, one can rewrite the entropy of x as:

    H[x] = − ∫ p(x) ln p(x) dx = − ∬ p(x, y) ln p(x) dx dy

Now, by summing this with (1.111) we see that:

    H[y|x] + H[x] = − ∬ p(x, y) ln p(y|x) dx dy − ∬ p(x, y) ln p(x) dx dy
                  = − ∬ p(x, y) ln { p(y|x) p(x) } dx dy
                  = − ∬ p(x, y) ln p(x, y) dx dy
                  = H[x, y]    (1.112)

Exercise 1.38 ⋆⋆
Using proof by induction, show that the inequality (1.114) for convex functions implies the result
(1.115).

Proof. We'll prove Jensen's inequality by induction, i.e. if we have N points x_1, . . . , x_N, f is a
convex function and λ_i ≥ 0 with ∑_{i=1}^{N} λ_i = 1, then

    f( ∑_{i=1}^{N} λ_i x_i ) ≤ ∑_{i=1}^{N} λ_i f(x_i)    (1.115)

We consider the base case of the induction to be given by

    f(λa + (1 − λ)b) ≤ λf(a) + (1 − λ)f(b)    (1.114)

Now, we assume that Jensen's inequality is true for any set of N points and want to prove that
it is also true for N + 1 points. Since ∑_{i=1}^{N+1} λ_i = 1, we may assume without loss of generality
that λ_1 < 1 (otherwise all the other λ_i vanish and the inequality is trivial). Therefore, we have that

    f( ∑_{i=1}^{N+1} λ_i x_i ) = f( λ_1 x_1 + (1 − λ_1) ∑_{i=2}^{N+1} (λ_i/(1 − λ_1)) x_i )

Since λ_1 and 1 − λ_1 are both nonnegative and sum to 1, we can apply (1.114) to the right-hand
side; the coefficients λ_i/(1 − λ_1), i = 2, . . . , N + 1, are nonnegative and sum to 1, so the induction
hypothesis then applies to the remaining N points:

    f( ∑_{i=1}^{N+1} λ_i x_i ) ≤ λ_1 f(x_1) + (1 − λ_1) f( ∑_{i=2}^{N+1} (λ_i/(1 − λ_1)) x_i )
                               ≤ λ_1 f(x_1) + (1 − λ_1) ∑_{i=2}^{N+1} (λ_i/(1 − λ_1)) f(x_i)
                               = ∑_{i=1}^{N+1} λ_i f(x_i)    (1.115)

Therefore, we have proved Jensen's inequality by induction.

Exercise 1.39 ⋆⋆⋆
Consider two binary variables x and y having the joint distribution given in Table 1.3. Evaluate
the following quantities:

(a) H[x]    (c) H[y|x]    (e) H[x, y]
(b) H[y]    (d) H[x|y]    (f) I[x, y]

Draw a diagram to show the relationship between these various quantities.

            y = 0    y = 1
    x = 0    1/3      1/3
    x = 1     0       1/3

Table 1.3  The joint distribution p(x, y) used in Exercise 1.39.

Proof. Through straightforward computations using the discrete formula for the entropy, we have

(a) H[x] = −(2/3) ln 2 + ln 3    (c) H[y|x] = (2/3) ln 2    (e) H[x, y] = ln 3
(b) H[y] = −(2/3) ln 2 + ln 3    (d) H[x|y] = (2/3) ln 2    (f) I[x, y] = −(4/3) ln 2 + ln 3

The diagram shows the relationship between the entropies. Note that the joint entropy H[x, y]
occupies all three coloured areas.
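An added sketch computing all six quantities directly from Table 1.3 and checking (a) and (f) against the closed forms above:

```python
import numpy as np

p = np.array([[1/3, 1/3],
              [0.0, 1/3]])            # p[x, y] from Table 1.3

def H(dist):
    dist = dist[dist > 0]
    return -np.sum(dist * np.log(dist))

px, py = p.sum(axis=1), p.sum(axis=0)
H_x, H_y, H_xy = H(px), H(py), H(p.ravel())
H_y_given_x = H_xy - H_x              # via (1.112)
H_x_given_y = H_xy - H_y
I_xy = H_x - H_x_given_y              # via (1.121)

print(H_x, H_y, H_y_given_x, H_x_given_y, H_xy, I_xy)
print(np.isclose(H_x, np.log(3) - 2/3 * np.log(2)))    # matches (a)
print(np.isclose(I_xy, np.log(3) - 4/3 * np.log(2)))   # matches (f)
```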

Exercise 1.40 ⋆
By applying Jensen's inequality (1.115) with f(x) = ln x, show that the arithmetic mean of a set
of real numbers is never less than their geometric mean.

[Diagram for Exercise 1.39: the relationship between H[x], H[y], H[x|y], H[y|x], H[x, y] and I[x, y].]

Proof. Let N be the cardinality of the considered set of positive real numbers. Jensen's inequality
(1.115) applied with the convex function f(x) = −ln x and λ_i = 1/N gives:

    − ln( (1/N) ∑_{i=1}^{N} x_i ) ≤ (1/N) ∑_{i=1}^{N} ( −ln x_i ) = − ln( ( ∏_{i=1}^{N} x_i )^{1/N} )

Since ln x is increasing, the above inequality is equivalent to:

    (1/N) ∑_{i=1}^{N} x_i ≥ ( ∏_{i=1}^{N} x_i )^{1/N}

which proves that the arithmetic mean of a set of real numbers is never less than their geometric
mean.

Exercise 1.41 ⋆
Using the sum and product rules of probability, show that the mutual information I[x, y] satisfies
the relation (1.121).

Proof. The mutual information between the variables x and y is given by:

    I[x, y] = − ∬ p(x, y) ln ( p(x)p(y)/p(x, y) ) dx dy    (1.120)

We split the integral and use the product and sum rules of probability to obtain the desired
result:

    I[x, y] = − ∬ p(x, y) ln p(x) dx dy + ∬ p(x, y) ln p(x|y) dx dy
            = − ∫ p(x) ln p(x) dx + ∬ p(x, y) ln p(x|y) dx dy
            = H[x] − H[x|y]    (1.121)

Analogously, one can easily show that I[x, y] = H[y] − H[y|x].
Chapter 2

Probability Distributions

Exercise 2.1 ⋆
Verify that the Bernoulli distribution (2.2) satisfies the following properties

    ∑_{x=0}^{1} p(x|µ) = 1    (2.257)
    E[x] = µ    (2.258)
    var[x] = µ(1 − µ)    (2.259)

Show that the entropy H[x] of a Bernoulli distributed random binary variable x is given by

    H[x] = −µ ln µ − (1 − µ) ln(1 − µ)    (2.260)

Proof. The Bernoulli distribution is given by

    Bern(x|µ) = µ^x (1 − µ)^{1−x}    (2.2)

The properties are easily verified:

    ∑_{x=0}^{1} p(x|µ) = p(x = 0|µ) + p(x = 1|µ) = µ⁰(1 − µ)¹ + µ¹(1 − µ)⁰ = 1    (2.257)

    E[x] = ∑_{x=0}^{1} x p(x|µ) = 0 · p(x = 0|µ) + 1 · p(x = 1|µ) = µ    (2.258)

    var[x] = E[x²] − E[x]² = ∑_{x=0}^{1} x² p(x|µ) − µ² = 0² · p(x = 0|µ) + 1² · p(x = 1|µ) − µ² = µ(1 − µ)    (2.259)

The entropy is also straightforward to derive:

    H[x] = − ∑_{x=0}^{1} p(x|µ) ln p(x|µ) = −p(x = 0|µ) ln p(x = 0|µ) − p(x = 1|µ) ln p(x = 1|µ)
         = −µ ln µ − (1 − µ) ln(1 − µ)
Exercise 2.2 ⋆⋆
The form of the Bernoulli distribution given by (2.2) is not symmetric between the two values
of x. In some situations, it will be more convenient to use an equivalent formulation for which
x ∈ {−1, 1}, in which case the distribution can be written

    p(x|µ) = ((1 − µ)/2)^{(1−x)/2} ((1 + µ)/2)^{(1+x)/2}    (2.261)

where µ ∈ [−1, 1]. Show that the distribution (2.261) is normalized, and evaluate its mean, variance
and entropy.

Proof. The distribution is normalized since

    ∑_x p(x|µ) = p(x = −1|µ) + p(x = 1|µ) = (1 − µ)/2 + (1 + µ)/2 = 1

The other properties are also easily derived:

    E[x] = ∑_x x p(x|µ) = p(x = 1|µ) − p(x = −1|µ) = (1 + µ)/2 − (1 − µ)/2 = µ

    var[x] = E[x²] − E[x]² = ∑_x x² p(x|µ) − µ² = p(x = −1|µ) + p(x = 1|µ) − µ²
           = (1 + µ)/2 + (1 − µ)/2 − µ² = (1 − µ)(1 + µ)

    H[x] = − ∑_x p(x|µ) ln p(x|µ) = −p(x = −1|µ) ln p(x = −1|µ) − p(x = 1|µ) ln p(x = 1|µ)
         = −((1 − µ)/2) ln((1 − µ)/2) − ((1 + µ)/2) ln((1 + µ)/2)

Exercise 2.3 ⋆⋆
In this exercise, we prove that the binomial distribution (2.9) is normalized. First use the definition
(2.10) of the number of combinations of m identical objects chosen from a total of N to show that

    \binom{N}{m} + \binom{N}{m−1} = \binom{N+1}{m}    (2.262)

Use this result to prove by induction the following result

    (1 + x)^N = ∑_{m=0}^{N} \binom{N}{m} x^m    (2.263)

which is known as the binomial theorem, and which is valid for all real values of x. Finally, show
that the binomial distribution is normalized, so that

    ∑_{m=0}^{N} \binom{N}{m} µ^m (1 − µ)^{N−m} = 1    (2.264)

which can be done by first pulling a factor (1 − µ)^N out of the summation and then making
use of the binomial theorem.

Proof. The binomial distribution is given by

    Bin(m|N, µ) = \binom{N}{m} µ^m (1 − µ)^{N−m}    (2.9)

By using (2.10), we prove (2.262):

    \binom{N}{m} + \binom{N}{m−1} = N!/((N − m)! m!) + N!/((N − m + 1)!(m − 1)!)
                                  = (N − m + 1)N!/((N − m + 1)! m!) + m N!/((N − m + 1)! m!)
                                  = (N + 1)!/((N − m + 1)! m!)
                                  = \binom{N+1}{m}    (2.262)

We aim to prove (2.263) by induction. The base case for N = 1 is obviously true since

    1 + x = \binom{1}{0} + \binom{1}{1} x

Now, suppose that the case N = k ∈ N* is true, i.e.

    (1 + x)^k = ∑_{m=0}^{k} \binom{k}{m} x^m

By using this and (2.262), we show that

    (1 + x)^{k+1} = (1 + x) ∑_{m=0}^{k} \binom{k}{m} x^m
                  = ∑_{m=0}^{k} \binom{k}{m} x^m + ∑_{m=0}^{k} \binom{k}{m} x^{m+1}
                  = 1 + ∑_{m=1}^{k} \binom{k}{m} x^m + ∑_{m=1}^{k+1} \binom{k}{m−1} x^m
                  = \binom{k+1}{0} + \binom{k+1}{k+1} x^{k+1} + ∑_{m=1}^{k} [ \binom{k}{m} + \binom{k}{m−1} ] x^m
                  = ∑_{m=0}^{k+1} \binom{k+1}{m} x^m

which by induction proves that (2.263) is indeed true.

Finally, we use this result to show that the binomial distribution is normalized:

    ∑_{m=0}^{N} \binom{N}{m} µ^m (1 − µ)^{N−m} = (1 − µ)^N ∑_{m=0}^{N} \binom{N}{m} ( µ/(1 − µ) )^m
                                               = (1 − µ)^N ( 1 + µ/(1 − µ) )^N
                                               = 1    (2.264)

Exercise 2.4 ⋆⋆
Show that the mean of the binomial distribution is given by (2.11). To do this, differentiate both
sides of the normalization condition (2.264) with respect to µ and then rearrange to obtain an
expression for the mean of m. Similarly, by differentiating (2.264) twice with respect to µ and
making use of the result (2.11) for the mean of the binomial distribution, prove the result (2.12)
for the variance of the binomial.

Proof. We start by differentiating both sides of (2.264) with respect to µ:

    (∂/∂µ) ∑_{m=0}^{N} \binom{N}{m} µ^m (1 − µ)^{N−m} = 0

    ∑_{m=0}^{N} \binom{N}{m} µ^m (1 − µ)^{N−m} ( m/µ + (m − N)/(1 − µ) ) = 0

    ( 1/µ + 1/(1 − µ) ) ∑_{m=0}^{N} m \binom{N}{m} µ^m (1 − µ)^{N−m} − ( N/(1 − µ) ) ∑_{m=0}^{N} \binom{N}{m} µ^m (1 − µ)^{N−m} = 0

We recognize the expression of the binomial distribution and use the fact that it is normalized to
obtain:

    ( 1/µ + 1/(1 − µ) ) ∑_{m=0}^{N} m Bin(m|N, µ) = ( N/(1 − µ) ) ∑_{m=0}^{N} Bin(m|N, µ)

    ( (1 − µ)/µ + 1 ) E[m] = N

which directly gives us the desired result, that is

    E[m] = ∑_{m=0}^{N} m Bin(m|N, µ) = Nµ    (2.11)

To derive the variance, we differentiate both sides of (2.264) twice, so

    (∂²/∂µ²) ∑_{m=0}^{N} \binom{N}{m} µ^m (1 − µ)^{N−m} = 0

    (1/(µ²(1 − µ)²)) ∑_{m=0}^{N} Bin(m|N, µ) { m² + m(2µ − 2Nµ − 1) + (N − 1)Nµ² } = 0

    ∑_{m=0}^{N} Bin(m|N, µ) (m − Nµ)² + (2µ − 1) ∑_{m=0}^{N} m Bin(m|N, µ) − Nµ² ∑_{m=0}^{N} Bin(m|N, µ) = 0

    var[m] + (2µ − 1)E[m] − Nµ² = 0

which gives us the desired result, i.e.

    var[m] ≡ ∑_{m=0}^{N} (m − E[m])² Bin(m|N, µ) = Nµ(1 − µ)    (2.12)

Exercise 2.5 ⋆⋆
In this exercise, we prove that the beta distribution, given by (2.13), is correctly normalized, so
that (2.14) holds. This is equivalent to showing that

    ∫_0^1 µ^{a−1} (1 − µ)^{b−1} dµ = Γ(a)Γ(b)/Γ(a + b)    (2.265)

From the definition (1.141) of the gamma function, we have

    Γ(a)Γ(b) = ∫_0^{∞} exp(−x) x^{a−1} dx ∫_0^{∞} exp(−y) y^{b−1} dy    (2.266)

Use this expression to prove (2.265) as follows. First bring the integral over y inside the integrand
of the integral over x, next make the change of variable t = y + x, where x is fixed, then interchange
the order of the x and t integrations, and finally make the change of variable x = tµ where t is
fixed.

Proof. The problem is easily solved by following the provided steps. By bringing the integral over
y inside the integrand of the integral over x we obtain that

    Γ(a)Γ(b) = ∫_0^{∞} ∫_0^{∞} exp{−(x + y)} x^{a−1} y^{b−1} dy dx

We now use the change of variable t = y + x with x fixed (so y = t − x and t ranges from x to ∞) to get

    Γ(a)Γ(b) = ∫_0^{∞} ∫_x^{∞} exp(−t) x^{a−1} (t − x)^{b−1} dt dx

Interchanging the order of the integrations yields

    Γ(a)Γ(b) = ∫_0^{∞} ∫_0^t exp(−t) x^{a−1} (t − x)^{b−1} dx dt

which, by making the change of variable x = tµ with t fixed (so µ ranges from 0 to 1), becomes

    Γ(a)Γ(b) = ∫_0^{∞} ∫_0^1 exp(−t) (tµ)^{a−1} (t − tµ)^{b−1} t dµ dt

By separating the t terms into the first integral, we have that

    Γ(a)Γ(b) = ∫_0^{∞} exp(−t) t^{a+b−1} dt ∫_0^1 µ^{a−1} (1 − µ)^{b−1} dµ

Finally, we notice that the first integral is equal to Γ(a + b), and we obtain the desired result:

    ∫_0^1 µ^{a−1} (1 − µ)^{b−1} dµ = Γ(a)Γ(b)/Γ(a + b)    (2.265)

Exercise 2.6 ⋆
Make use of the result (2.265) to show that the mean, variance, and mode of the beta distribution
(2.13) are given respectively by

    E[µ] = a/(a + b)    (2.267)
    var[µ] = ab/((a + b)²(a + b + 1))    (2.268)
    mode[µ] = (a − 1)/(a + b − 2)    (2.269)

Proof. The beta distribution is given by

    Beta(µ|a, b) = (Γ(a + b)/(Γ(a)Γ(b))) µ^{a−1} (1 − µ)^{b−1}    (2.13)

By using (2.265) and the fact that Γ(x + 1) = xΓ(x), we obtain the mean of the beta distribution:

    E[µ] = ∫_0^1 µ Beta(µ|a, b) dµ = (Γ(a + b)/(Γ(a)Γ(b))) ∫_0^1 µ^a (1 − µ)^{b−1} dµ
         = (Γ(a + b)/(Γ(a)Γ(b))) · (Γ(a + 1)Γ(b)/Γ(a + b + 1)) = a/(a + b)    (2.267)

From this result, we can also easily get the variance:

    var[µ] = ∫_0^1 ( µ − a/(a + b) )² Beta(µ|a, b) dµ
           = ∫_0^1 µ² Beta(µ|a, b) dµ − (2a/(a + b)) ∫_0^1 µ Beta(µ|a, b) dµ + (a²/(a + b)²) ∫_0^1 Beta(µ|a, b) dµ
           = (Γ(a + b)/(Γ(a)Γ(b))) · (Γ(a + 2)Γ(b)/Γ(a + b + 2)) − (2a/(a + b)) · (a/(a + b)) + a²/(a + b)²
           = a(a + 1)/((a + b)(a + b + 1)) − a²/(a + b)²
           = ab/((a + b)²(a + b + 1))    (2.268)

Finally, the mode of the distribution is obtained as the value of µ for which the derivative of
the distribution is 0:

    (∂/∂µ) Beta(µ|a, b) = 0 ⇐⇒ (∂/∂µ) µ^{a−1}(1 − µ)^{b−1} = 0
                         ⇐⇒ (a − 1)µ^{a−2}(1 − µ)^{b−1} − (b − 1)µ^{a−1}(1 − µ)^{b−2} = 0
                         ⇐⇒ µ^{a−2}(1 − µ)^{b−2} { (a − 1)(1 − µ) − (b − 1)µ } = 0
                         ⇐⇒ (a − 1)(1 − µ) − (b − 1)µ = 0
                         ⇐⇒ µ = (a − 1)/(a + b − 2)

so indeed

    mode[µ] = (a − 1)/(a + b − 2)    (2.269)
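An added sketch checking (2.267)-(2.269) for one arbitrary choice of (a, b), using sampling for the moments and a grid search for the mode:

```python
import numpy as np

a, b = 3.0, 5.0
rng = np.random.default_rng(0)
samples = rng.beta(a, b, size=2_000_000)

print(samples.mean(), a / (a + b))                            # mean (2.267)
print(samples.var(), a * b / ((a + b)**2 * (a + b + 1)))      # variance (2.268)

mu = np.linspace(1e-6, 1 - 1e-6, 100_001)
density = mu**(a - 1) * (1 - mu)**(b - 1)      # unnormalized Beta(mu|a,b)
print(mu[np.argmax(density)], (a - 1) / (a + b - 2))          # mode (2.269)
```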

Exercise 2.7 ⋆⋆
Consider a binomial random variable x given by (2.9), with prior distribution for µ given by the
beta distribution (2.13), and suppose we have observed m occurrences of x = 1 and l occurrences
of x = 0. Show that the posterior mean value of µ lies between the prior mean and the maximum
likelihood estimate for µ. To do this, show that the posterior mean can be written as λ times the
prior mean plus (1 − λ) times the maximum likelihood estimate, where 0 ≤ λ ≤ 1. This illustrates
the concept of the posterior distribution being a compromise between the prior distribution and
the maximum likelihood solution.

Proof. The prior mean is a/(a + b), the posterior mean is (a + m)/(a + m + b + l) and the maximum
likelihood estimate is m/(m + l). Suppose that our hypothesis is true, i.e. there exists a λ with
0 ≤ λ ≤ 1 such that

    (a + m)/(a + m + b + l) = λ · a/(a + b) + (1 − λ) · m/(m + l)

Solving for λ,

    (a + m)/(a + m + b + l) − m/(m + l) = λ ( a/(a + b) − m/(m + l) )

    (al − bm)/((a + m + b + l)(m + l)) = λ (al − bm)/((a + b)(m + l))

    λ = (a + b)/(a + m + b + l)

This λ obviously exists and satisfies 0 ≤ λ ≤ 1, so our hypothesis is true and the posterior mean
value of µ lies between the prior mean and the maximum likelihood estimate for µ.

Exercise 2.8 ⋆
Consider two variables x and y with joint distribution p(x, y). Prove the following two results

    E[x] = E_y[E_x[x|y]]    (2.270)
    var[x] = E_y[var_x[x|y]] + var_y[E_x[x|y]]    (2.271)

Here E_x[x|y] denotes the expectation of x under the conditional distribution p(x|y), with a similar
notation for the conditional variance.

Proof. The first result is straightforward to derive:

    E[x] = ∬ x p(x, y) dx dy = ∬ x p(x|y) p(y) dx dy = ∫ ( ∫ x p(x|y) dx ) p(y) dy
         = ∫ E_x[x|y] p(y) dy = E_y[E_x[x|y]]    (2.270)

For (2.271), we compute each term of var[x] = E[x²] − E[x]² separately. Applying (2.270) to x²
and then (1.39) to the conditional distribution gives

    E[x²] = E_y[E_x[x²|y]] = E_y[ var_x[x|y] + E_x[x|y]² ]

while (2.270) itself gives E[x]² = E_y[E_x[x|y]]². Therefore,

    var[x] = E_y[var_x[x|y]] + E_y[E_x[x|y]²] − E_y[E_x[x|y]]² = E_y[var_x[x|y]] + var_y[E_x[x|y]]    (2.271)

Exercise 2.10 ⋆⋆
Using the property Γ(x + 1) = xΓ(x) of the gamma function, derive the following results for the
mean, variance, and covariance of the Dirichlet distribution given by (2.38)

    E[µ_j] = α_j/α_0    (2.273)
    var[µ_j] = α_j(α_0 − α_j)/(α_0²(α_0 + 1))    (2.274)
    cov[µ_j µ_l] = −α_j α_l/(α_0²(α_0 + 1)),  j ≠ l    (2.275)

where α_0 is defined by (2.39).

Proof. The Dirichlet distribution is given by

    Dir(µ|α) = (Γ(α_0)/(Γ(α_1) · · · Γ(α_K))) ∏_{k=1}^{K} µ_k^{α_k − 1}    (2.38)

Besides the property Γ(x + 1) = xΓ(x), we will use the fact that the distribution is
normalized, specifically that

    ∫ ∏_{k=1}^{K} µ_k^{α_k − 1} dµ = Γ(α_1)Γ(α_2) · · · Γ(α_K)/Γ(α_0)

where α_0 is defined by (2.39).

The expected value is then given by

    E[µ_j] = ∫ µ_j Dir(µ|α) dµ
           = (Γ(α_0)/(Γ(α_1)Γ(α_2) · · · Γ(α_K))) ∫ µ_1^{α_1 − 1} · · · µ_j^{α_j} · · · µ_K^{α_K − 1} dµ
           = (Γ(α_0)/(Γ(α_1)Γ(α_2) · · · Γ(α_K))) · (Γ(α_1) · · · Γ(α_j + 1) · · · Γ(α_K)/Γ(α_0 + 1))
           = α_j/α_0    (2.273)

This can now be used to derive the variance:

    var[µ_j] = ∫ (µ_j − E[µ_j])² Dir(µ|α) dµ
             = ∫ ( µ_j − α_j/α_0 )² Dir(µ|α) dµ
             = ∫ µ_j² Dir(µ|α) dµ − (2α_j/α_0) ∫ µ_j Dir(µ|α) dµ + (α_j²/α_0²) ∫ Dir(µ|α) dµ
             = (Γ(α_0)/(Γ(α_1)Γ(α_2) · · · Γ(α_K))) ∫ µ_1^{α_1 − 1} · · · µ_j^{α_j + 1} · · · µ_K^{α_K − 1} dµ − (2α_j/α_0) E[µ_j] + α_j²/α_0²
             = (Γ(α_0)/(Γ(α_1)Γ(α_2) · · · Γ(α_K))) · (Γ(α_1) · · · Γ(α_j + 2) · · · Γ(α_K)/Γ(α_0 + 2)) − α_j²/α_0²
             = α_j(α_j + 1)/(α_0(α_0 + 1)) − α_j²/α_0²
             = α_j(α_0 − α_j)/(α_0²(α_0 + 1))    (2.274)

The covariance is given by

    cov[µ_j µ_l] = E[µ_j µ_l] − E[µ_j]E[µ_l] = E[µ_j µ_l] − α_j α_l/α_0²

By computing the expectation separately, we find that

    E[µ_j µ_l] = ∫ µ_j µ_l Dir(µ|α) dµ
               = (Γ(α_0)/(Γ(α_1)Γ(α_2) · · · Γ(α_K))) ∫ µ_1^{α_1 − 1} · · · µ_j^{α_j} · · · µ_l^{α_l} · · · µ_K^{α_K − 1} dµ
               = (Γ(α_0)/(Γ(α_1)Γ(α_2) · · · Γ(α_K))) · (Γ(α_1) · · · Γ(α_j + 1) · · · Γ(α_l + 1) · · · Γ(α_K)/Γ(α_0 + 2))
               = α_j α_l/(α_0(α_0 + 1))

Finally, the covariance becomes

    cov[µ_j µ_l] = α_j α_l/(α_0(α_0 + 1)) − α_j α_l/α_0² = −α_j α_l/(α_0²(α_0 + 1))    (2.275)
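An added Monte Carlo sketch of (2.273)-(2.275) for one arbitrary choice of α:

```python
import numpy as np

alpha = np.array([2.0, 3.0, 5.0])
a0 = alpha.sum()
rng = np.random.default_rng(0)
mu = rng.dirichlet(alpha, size=1_000_000)

print(mu.mean(axis=0), alpha / a0)                                   # (2.273)
print(mu.var(axis=0), alpha * (a0 - alpha) / (a0**2 * (a0 + 1)))     # (2.274)
cov_01 = np.cov(mu[:, 0], mu[:, 1])[0, 1]
print(cov_01, -alpha[0] * alpha[1] / (a0**2 * (a0 + 1)))             # (2.275)
```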

Exercise 2.11 ⋆⋆
By expressing the expectation of ln µ_j under the Dirichlet distribution (2.38) as a derivative with
respect to α_j, show that

    E[ln µ_j] = ψ(α_j) − ψ(α_0)    (2.276)

where α_0 is given by (2.39) and

    ψ(a) ≡ (d/da) ln Γ(a)    (2.277)

is the digamma function.

Proof. We start by taking the partial derivative of the Dirichlet distribution with respect to α_j:

    (∂/∂α_j) Dir(µ|α) = (∂/∂α_j) [ (Γ(α_0)/(Γ(α_1)Γ(α_2) · · · Γ(α_K))) ∏_{k=1}^{K} µ_k^{α_k − 1} ]
                      = [ (∂/∂α_j) (Γ(α_0)/(Γ(α_1)Γ(α_2) · · · Γ(α_K))) ] ∏_{k=1}^{K} µ_k^{α_k − 1}
                        + (Γ(α_0)/(Γ(α_1)Γ(α_2) · · · Γ(α_K))) (∂/∂α_j) ∏_{k=1}^{K} µ_k^{α_k − 1}

Our goal is to compute both terms separately. Firstly, since a small change in α_j produces the same
change in the sum α_0, i.e.

    (∂/∂α_j) Γ(α_0) = (∂/∂α_0) Γ(α_0)

we have that

    (∂/∂α_j) ( Γ(α_0)/Γ(α_j) ) = ( (∂/∂α_0)Γ(α_0) · Γ(α_j) − Γ(α_0) · (∂/∂α_j)Γ(α_j) ) / Γ(α_j)²
                               = (Γ(α_0)/Γ(α_j)) ( (∂/∂α_0) ln Γ(α_0) − (∂/∂α_j) ln Γ(α_j) )
                               = (Γ(α_0)/Γ(α_j)) ( ψ(α_0) − ψ(α_j) )

and therefore, that

    [ (∂/∂α_j) (Γ(α_0)/(Γ(α_1)Γ(α_2) · · · Γ(α_K))) ] ∏_{k=1}^{K} µ_k^{α_k − 1} = (ψ(α_0) − ψ(α_j)) Dir(µ|α)

Now, since

    (∂/∂α_j) ∏_{k=1}^{K} µ_k^{α_k − 1} = ( µ_1^{α_1 − 1} · · · µ_{j−1}^{α_{j−1} − 1} µ_{j+1}^{α_{j+1} − 1} · · · µ_K^{α_K − 1} ) (∂/∂α_j) µ_j^{α_j − 1} = ln µ_j ∏_{k=1}^{K} µ_k^{α_k − 1}

it follows that

    (Γ(α_0)/(Γ(α_1)Γ(α_2) · · · Γ(α_K))) (∂/∂α_j) ∏_{k=1}^{K} µ_k^{α_k − 1} = ln µ_j Dir(µ|α)

By substituting into the initial expression,

    (∂/∂α_j) Dir(µ|α) = Dir(µ|α) ( ln µ_j + ψ(α_0) − ψ(α_j) )

and then integrating both sides with respect to µ (the left-hand side integrates to (∂/∂α_j) 1 = 0), we
obtain the desired result:

    E[ln µ_j] = ψ(α_j) − ψ(α_0)    (2.276)
Chapter 3

Linear Models for Regression

Note that the results (3.50*) and (3.51*) derived in Exercise 3.12 seem to differ from (3.50)
and (3.51) in the book. There doesn't seem to be any mention of them in the errata comments,
but the results used in the web solution for Exercise 3.23 seem to be the ones we've got, and not
the ones from the book.

Exercise 3.1 ⋆
Show that the tanh function and the logistic sigmoid function (3.6) are related by

    tanh(a) = 2σ(2a) − 1    (3.100)

Hence show that a general linear combination of logistic sigmoid functions of the form

    y(x, w) = w_0 + ∑_{j=1}^{M} w_j σ( (x − µ_j)/s )    (3.101)

is equivalent to a linear combination of tanh functions of the form

    y(x, u) = u_0 + ∑_{j=1}^{M} u_j tanh( (x − µ_j)/(2s) )    (3.102)

and find expressions to relate the new parameters {u_0, . . . , u_M} to the original parameters
{w_0, . . . , w_M}.

Proof. The logistic sigmoid function is given by

    σ(x) = 1/(1 + exp(−x))    (3.6)

and the tanh function is given by

    tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}) = (e^{2x} − 1)/(e^{2x} + 1)

By starting from the right-hand side of (3.100) and then using the fact that tanh is odd, we obtain

    2σ(2a) − 1 = 2/(1 + e^{−2a}) − 1 = (1 − e^{−2a})/(1 + e^{−2a}) = −tanh(−a) = tanh(a)    (3.100)

Now, we can express the logistic sigmoid function as

    σ(x) = (1/2) tanh(x/2) + 1/2

By substituting this into (3.101), we have that

    y(x, w) = w_0 + ∑_{j=1}^{M} w_j/2 + ∑_{j=1}^{M} (w_j/2) tanh( (x − µ_j)/(2s) ) = y(x, u)

where

    u_0 = w_0 + (1/2) ∑_{j=1}^{M} w_j,    u_j = (1/2) w_j,  j ≥ 1

Therefore, we have proved that (3.101) is equivalent to (3.102).
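An added numerical sketch of (3.100) and of the w → u parameter mapping (random parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

a = rng.standard_normal(1000)
print(np.allclose(np.tanh(a), 2 * sigmoid(2 * a) - 1))        # (3.100)

M, s = 4, 0.7
w = rng.standard_normal(M + 1)                                 # w_0 ... w_M
mu = rng.standard_normal(M)
x = rng.standard_normal(200)

y_sigmoid = w[0] + sigmoid((x[:, None] - mu) / s) @ w[1:]      # (3.101)
u0 = w[0] + 0.5 * w[1:].sum()
u = 0.5 * w[1:]
y_tanh = u0 + np.tanh((x[:, None] - mu) / (2 * s)) @ u         # (3.102)
print(np.allclose(y_sigmoid, y_tanh))
```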

Exercise 3.2 ⋆⋆
Show that the matrix

    Φ(Φ^T Φ)^{−1} Φ^T    (3.103)

takes any vector v and projects it onto the space spanned by the columns of Φ. Use this result to
show that the least-squares solution (3.15) corresponds to an orthogonal projection of the vector
t onto the manifold S, as shown in Figure 3.2.

Proof. Let p be the orthogonal projection of v onto the space spanned by the columns of Φ. Then p is
contained in that space, so p can be written as a linear combination of the columns of Φ,
i.e. there exists x such that p = Φx. By using this and the fact that p − v is orthogonal to the
space, we have that

    Φ^T (p − v) = 0
    Φ^T (Φx − v) = 0
    Φ^T Φ x = Φ^T v
    x = (Φ^T Φ)^{−1} Φ^T v

and since p = Φx, this proves our claim, i.e.

    p = Φ(Φ^T Φ)^{−1} Φ^T v

This translates directly to the least-squares geometry described in Section 3.1.3, where the
manifold S is the space spanned by the columns of Φ. From what we proved above, the orthogonal
projection of t onto the manifold S is given by y = Φ w_ML, where

    w_ML = (Φ^T Φ)^{−1} Φ^T t    (3.15)

is the least-squares solution.
Exercise 3.3 ?
Consider a data set in which each data point tn is associated with a weighting factor rn > 0, so
that the sum of squares error function becomes
N
1X
ED (w) = rn {tn − wT φ(xn )}2 (3.104)
2 n=1

Find an expression for the solution w? that minimizes this error function. Give two alternative
interpretations of the weighted sum-of-squares error function in terms of (i) data dependent noise
variance and (ii) replicated data points.

Method 1.
Proof. Since the least-squares error function is convex, the function is minimized in its only critical
point. Similarly to (3.13), the derivative is given by:
N  
∂ 1X ∂ T 2
ED (w) = rn {tn − w φ(xn )}
∂w 2 n=1 ∂w
N
X
= rn {wT φ(xn ) − tn }φ(xn )T
n=1
N
X  N
X
T T
=w rn φ(xn )φ(xn ) − rn tn φ(xn )T
i=1 n=1

By defining the matrix R = diag(r1 , r2 , . . . , rn ) and then setting the derivative to 0, we obtain the
equality
wT ΦRΦT = tT RΦ
which gives the weighted least-squares solution (we get the column vector form):
w? = (ΦT RΦ)−1 ΦT Rt

Method 2.
Proof. Let R = diag(r1, r2, . . . , rN) and R1/2 = diag(√r1, √r2, . . . , √rN) be diagonal matrices such
that R1/2 R1/2 = R. We notice that we can rewrite (3.104) as:

ED(w) = (1/2) Σ_{n=1}^{N} { √rn (tn − wT φ(xn)) }²

which we can translate into matrix notation as:

ED(w) = (1/2) ( R1/2 (t − Φw) )T ( R1/2 (t − Φw) )

Since the least-squares error function is convex, the function is minimized in its only critical point.
The derivative is given by

∂ED(w)/∂w = −ΦT (R1/2)T ( R1/2 t − R1/2 Φw ) = ΦT RΦw − ΦT Rt

By setting it to 0, we obtain the solution that minimizes the weighted least-squares error function:

w? = (ΦT RΦ)−1 ΦT Rt

Regarding the two requested interpretations: (i) if the noise on tn is Gaussian with a data-dependent
precision βn = rn β, then the negative log likelihood is proportional to (3.104), so rn simply rescales the
noise variance of each observation; (ii) if rn is a positive integer, (3.104) is exactly the ordinary
sum-of-squares error for a data set in which the point (xn, tn) has been replicated rn times.
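As a sanity check of the closed-form solution and of interpretation (ii), the following NumPy sketch (the polynomial basis, data and integer weights are arbitrary illustrative choices) compares w? = (ΦT RΦ)−1 ΦT Rt against ordinary least squares on the correspondingly replicated data set:

import numpy as np

rng = np.random.default_rng(0)

# Arbitrary toy data and a small polynomial basis (illustrative choices only).
x = rng.uniform(-1.0, 1.0, size=8)
t = np.sin(np.pi * x) + 0.1 * rng.standard_normal(8)
Phi = np.vander(x, 4, increasing=True)          # basis functions 1, x, x^2, x^3
r = rng.integers(1, 4, size=8)                  # positive integer weights r_n
R = np.diag(r.astype(float))

# Weighted least-squares solution w* = (Phi^T R Phi)^{-1} Phi^T R t
w_weighted = np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ R @ t)

# Interpretation (ii): replicate each (x_n, t_n) exactly r_n times and fit by ordinary least squares.
Phi_rep = np.repeat(Phi, r, axis=0)
t_rep = np.repeat(t, r)
w_replicated, *_ = np.linalg.lstsq(Phi_rep, t_rep, rcond=None)

print(np.allclose(w_weighted, w_replicated))    # True: the two solutions coincide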

Exercise 3.4 ?
Consider a linear model of the form
D
X
y(x, w) = w0 + w i xi (3.105)
i=1

together with a sum-of-squares error function of the form


N
1X
ED (w) = {y(xn , w) − tn }2 (3.106)
2 n=1

Now suppose that Gaussian noise εi with zero mean and variance σ² is added independently to each
of the input variables xi. By making use of E[εi] = 0 and E[εi εj] = δij σ², show that minimizing
ED averaged over the noise distribution is equivalent to minimizing the sum-of-squares error for
noise-free input variables with the addition of a weight-decay regularization term, in which the
bias parameter w0 is omitted from the regularizer.

Proof. Let the noise-free input variables be denoted by x∗, such that xi = x∗i + εi. (3.105) will
then be equivalent to

y(x, w) = w0 + Σ_{i=1}^{D} wi x∗i + Σ_{i=1}^{D} wi εi = y(x∗, w) + Σ_{i=1}^{D} wi εi

Now, we aim to find the expression of ED averaged over the noise distribution, that is:

E[ED(w)] = (1/2) Σ_{n=1}^{N} { E[y(xn, w)²] − 2 tn E[y(xn, w)] + tn² }

The individual expectations are straightforward to compute. Since E[εi] = 0, we have that

E[y(xn, w)] = y(x∗n, w) + Σ_{i=1}^{D} wi E[εi] = y(x∗n, w)

Also, E[εi εj] = δij σ², so

E[y(xn, w)²] = E[ ( y(x∗n, w) + Σ_{i=1}^{D} wi εi )² ]
            = y(x∗n, w)² + Σ_{i=1}^{D} wi² E[εi²] + 2 Σ_{i=1}^{D} Σ_{j=i+1}^{D} wi wj E[εi εj]
            = y(x∗n, w)² + σ² Σ_{i=1}^{D} wi²

where the cross term 2 y(x∗n, w) Σi wi E[εi] vanishes. Therefore, we have that

E[ED(w)] = (1/2) Σ_{n=1}^{N} { y(x∗n, w) − tn }² + (N σ²/2) Σ_{i=1}^{D} wi²

which shows that ED averaged over the noise distribution is equivalent to the regularized least-
squares error function with λ = N σ², in which the bias parameter w0 does not appear in the
regularizer. Hence, since the expressions are equivalent, minimizing them
is also equivalent, proving our hypothesis.
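A small Monte Carlo check of this result (a NumPy sketch; the data, weights and noise level are arbitrary illustrative choices): the average of ED over many draws of the input noise should approach the noise-free error plus the weight-decay term (N σ²/2) Σi wi².

import numpy as np

rng = np.random.default_rng(1)

N, D = 50, 3
sigma = 0.2
X_star = rng.standard_normal((N, D))            # noise-free inputs
w0, w = 0.5, rng.standard_normal(D)             # arbitrary fixed model parameters
t = rng.standard_normal(N)                      # arbitrary targets

def sum_of_squares(X):
    y = w0 + X @ w
    return 0.5 * np.sum((y - t) ** 2)

# Average E_D over many independent draws of the input noise.
S = 20000
E_avg = np.mean([sum_of_squares(X_star + sigma * rng.standard_normal((N, D)))
                 for _ in range(S)])

# Noise-free error plus the weight-decay term derived above (bias w0 excluded).
E_pred = sum_of_squares(X_star) + 0.5 * N * sigma**2 * np.sum(w ** 2)

print(E_avg, E_pred)   # the two values agree up to Monte Carlo error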

Exercise 3.5 ?
Using the technique of Lagrange multipliers, discussed in Appendix E, show that minimization of
the regularized error function (3.29) is equivalent to minimizing the unregularized sum-of-squares
error (3.12) subject to the constraint (3.30). Discuss the relationship between the parameters η
and λ.
Proof. Minimizing the unregularized sum-of-squares error (3.12) subject to the constraint (3.30)
is equivalent to minimizing the Lagrangian

L(w, λ) = (1/2) Σ_{n=1}^{N} { y(xn, w) − tn }² + (λ/2) ( Σ_{j=1}^{M} |wj|^q − η )

subject to the KKT conditions (see E.9, E.10, E.11 in Appendix E). For fixed λ, our Lagrangian and the
regularized sum-of-squares error (3.29) have the same dependency on w, so their minimization over w is
equivalent. By following (E.11), we have that

λ ( Σ_{j=1}^{M} |wj|^q − η ) = 0

which means that if w?(λ) is the solution of the minimization for a fixed λ > 0, then η and λ are
related by

η = Σ_{j=1}^{M} |w?(λ)j|^q

Exercise 3.6 ?
Consider a linear basis function regression model for a multivariate target variable t having a
Gaussian distribution of the form
p(t|W, Σ) = N (t|y(x, W), Σ) (3.107)

where
y(x, W) = WT φ(x) (3.108)
together with a training data set comprising input basis vectors φ(xn ) and corresponding target
vectors tn , with n = 1, . . . , N . Show that the maximum likelihood solution WML for the parameter
matrix W has the property that each column is given by an expression of the form (3.15), which
was the solution for an isotropic noise distribution. Note that this is independent of the covariance
matrix Σ. Show that the maximum likelihood solution for Σ is given by
Σ = (1/N) Σ_{n=1}^{N} ( tn − WML^T φ(xn) ) ( tn − WML^T φ(xn) )^T                  (3.109)

Proof. Similarly to what we did in Section 3.1.5, we combine the set of target vectors into a matrix
T of size N × K such that the nth row is given by tnT. We do the same for X. The log likelihood
function is then given by

ln p(T|X, W, Σ) = ln Π_{n=1}^{N} N(tn | WT φ(xn), Σ)
                = Σ_{n=1}^{N} ln N(tn | WT φ(xn), Σ)
                = −(NK/2) ln(2π) − (N/2) ln |Σ| − (1/2) Σ_{n=1}^{N} (tn − WT φ(xn))T Σ−1 (tn − WT φ(xn))

Our goal is to maximize this function with respect to W. One could prove that for a symmetric
matrix B,

∂/∂A [ (a − Ab)T B (a − Ab) ] = −2 B (a − Ab) bT

Therefore, we take the derivative of the log likelihood with A = WT, a = tn, b = φ(xn), B = Σ−1,
using the fact that Σ−1 is symmetric, to obtain:

∂/∂WT ln p(T|X, W, Σ) = −(1/2) Σ_{n=1}^{N} ∂/∂WT [ (tn − WT φ(xn))T Σ−1 (tn − WT φ(xn)) ]
                       = Σ_{n=1}^{N} Σ−1 (tn − WT φ(xn)) φ(xn)T

By setting the derivative equal to 0, we find the maximum likelihood solution for W:

Σ_{n=1}^{N} Σ−1 ( tn − WML^T φ(xn) ) φ(xn)T = 0
Σ−1 Σ_{n=1}^{N} tn φ(xn)T = Σ−1 WML^T Σ_{n=1}^{N} φ(xn) φ(xn)T
Σ−1 TT Φ = Σ−1 WML^T ΦT Φ

and, taking the transpose of both sides,

ΦT T Σ−1 = ΦT Φ WML Σ−1

Note that Σ−1 cancels out and we finally get that:

WML = (ΦT Φ)−1 ΦT T

Now, let A be an M × N matrix and B an N × K matrix with column vectors b1, b2, . . . , bK. One
could easily prove that

A B = A [ b1  b2  · · ·  bK ] = [ Ab1  Ab2  · · ·  AbK ]

By applying this with A = (ΦT Φ)−1 ΦT and B = T, we find that the columns of WML are of the form
(3.15), i.e. the kth column of WML is given by

WML^(k) = (ΦT Φ)−1 ΦT T^(k)

where T^(k) is the kth column of T, which collects the kth component of every target vector. This is
exactly the least-squares solution for an isotropic noise distribution, and it does not involve Σ.
Finally, the maximum likelihood solution (3.109) for Σ follows by setting the derivative of the log
likelihood with respect to Σ−1 to zero: using ∂ ln|Σ−1|/∂Σ−1 = Σ and the trace trick (see also the
analogous calculation in Exercise 5.3),

∂/∂Σ−1 ln p(T|X, WML, Σ) = (N/2) Σ − (1/2) Σ_{n=1}^{N} ( tn − WML^T φ(xn) ) ( tn − WML^T φ(xn) )^T = 0

which rearranges to (3.109).

Exercise 3.7 ?
By using the technique of completing the square, verify the result (3.49) for the posterior distri-
bution of the parameters w in the linear basis function model in which mN and SN are defined by
(3.50) and (3.51) respectively.

Proof. Since

p(w|t) ∝ p(w) p(t|X, w, β−1) ∝ N(w|m0, S0) Π_{n=1}^{N} N(tn | wT φ(xn), β−1)

we have that

ln p(w|t) = ln N(w|m0, S0) + Σ_{n=1}^{N} ln N(tn | wT φ(xn), β−1) + const                  (3.7.1)

We compute the first logarithm, expand the square and keep only the terms that depend on w to
obtain:

ln N(w|m0, S0) = −(1/2) wT S0−1 w + wT S0−1 m0 + const

By doing the same for the second term, we have that:

Σ_{n=1}^{N} ln N(tn | wT φ(xn), β−1) = β wT Σ_{n=1}^{N} tn φ(xn) − (β/2) wT ( Σ_{n=1}^{N} φ(xn) φ(xn)T ) w + const
                                     = β wT ΦT t − (β/2) wT ΦT Φ w + const

By replacing back into (3.7.1),

ln p(w|t) = −(1/2) wT S0−1 w + wT S0−1 m0 + β wT ΦT t − (β/2) wT ΦT Φ w + const
          = −(1/2) wT ( S0−1 + βΦT Φ ) w + wT ( S0−1 m0 + βΦT t ) + const

The quadratic term corresponds to a Gaussian with covariance matrix SN, where

SN−1 = S0−1 + βΦT Φ                  (3.51)

Now, since the mean is found in the linear term, we have that

wT ( S0−1 m0 + βΦT t ) = wT SN−1 mN

which gives

mN = SN ( S0−1 m0 + βΦT t )                  (3.50)
Since we proved both (3.50) and (3.51), we showed that

p(w|t) = N (w|mN , SN ) (3.49)

Exercise 3.8 ??
Consider the linear basis function model in Section 3.1, and suppose that we already have ob-
served N data points, so that the posterior distribution over w is given by (3.49). This posterior
can be regarded as the prior for the next observation. By considering an additional data point
(xN +1 , tN +1 ), and by completing the square in the exponential, show that the resulting posterior
distribution is again given by (3.49) but with SN replaced by SN +1 and mN replaced by mN +1 .

Proof. Our approach will be very similar to the previous exercise. The posterior distribution is
given by the proportionality relation

p(w|t) ∝ p(w)p(tN +1 |xN +1 , w, β −1 )


∝ N (w|mN , SN )N (tN +1 |wT φ(xN +1 ), β −1 )

, so
ln p(w|t) = ln N(w|mN, SN) + ln N(tN+1 | wT φ(xN+1), β−1) + const                  (3.8.1)

We now compute the log likelihood and keep only the terms depending on w to obtain:

ln N(tN+1 | wT φ(xN+1), β−1) = −(β/2) wT φ(xN+1) φ(xN+1)T w + β tN+1 wT φ(xN+1) + const

By expanding the square and then doing the same with the prior, we have that:

ln N(w|mN, SN) = −(1/2) wT SN−1 w + wT SN−1 mN + const

Substituting these results back into (3.8.1) yields:

ln p(w|t) = −(1/2) wT ( SN−1 + β φ(xN+1) φ(xN+1)T ) w + wT ( SN−1 mN + β tN+1 φ(xN+1) ) + const

which is the log of a Gaussian, ln N(w|mN+1, SN+1) + const, with

SN+1−1 = SN−1 + β φ(xN+1) φ(xN+1)T                  (3.8.2)

and

mN+1 = SN+1 ( SN−1 mN + β tN+1 φ(xN+1) )
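The sequential update (3.8.2) can be checked numerically: starting from the prior and absorbing the data points one at a time must give the same (mN, SN) as the batch formulas (3.50) and (3.51). The following NumPy sketch does this on arbitrary synthetic data (the design matrix, prior and noise precision are illustrative choices only):

import numpy as np

rng = np.random.default_rng(2)

N, M = 30, 4
beta, alpha = 25.0, 2.0
Phi = rng.standard_normal((N, M))               # arbitrary design matrix
t = rng.standard_normal(N)                      # arbitrary targets
m0, S0 = np.zeros(M), np.eye(M) / alpha         # prior N(w | m0, S0)

# Batch posterior, equations (3.50) and (3.51).
SN_inv = np.linalg.inv(S0) + beta * Phi.T @ Phi
SN = np.linalg.inv(SN_inv)
mN = SN @ (np.linalg.inv(S0) @ m0 + beta * Phi.T @ t)

# Sequential posterior: treat the current posterior as the prior for the next point.
m, S = m0.copy(), S0.copy()
for n in range(N):
    phi = Phi[n]
    S_new = np.linalg.inv(np.linalg.inv(S) + beta * np.outer(phi, phi))   # (3.8.2)
    m = S_new @ (np.linalg.inv(S) @ m + beta * t[n] * phi)
    S = S_new

print(np.allclose(m, mN), np.allclose(S, SN))   # True True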

Exercise 3.9 ??
Repeat the previous exercise but instead of completing the square by hand, make use of the general
result for linear-Gaussian models given by (2.116).

Proof. As shown in Section 2.3.3, given a marginal Gaussian distribution for x and a conditional
Gaussian distribution for y given x in the form
p(x) = N (x|µ, Λ−1 ) (2.113)

p(y|x) = N (y|Ax + b, L−1 ) (2.114)


the conditional distribution of x given y is given by
p(x|y) = N (x|Σ{AT L(y − b) + Λµ}, Σ) (2.116)
where
Σ = (Λ + AT LA)−1 (2.117)
Our goal is to match these results with our model. The prior is given by
p(w) = N (mN , SN )
and the likelihood is
p(tN +1 |xN +1 , w, β −1 ) = N (tN +1 |wT φ(xn ), β −1 )
By comparing those with (2.113) and (2.114), we have that the variables are related as follows:

x = w        y = tN+1        µ = mN        Λ−1 = SN        A = φ(xN+1)T        b = 0        L−1 = β−1

Therefore, the covariance matrix Σ of the conditional (the SN+1 of our posterior) will be given by
substituting our variables into (2.117), so

SN+1−1 = SN−1 + β φ(xN+1) φ(xN+1)T

The mean can also be easily obtained from (2.116) as

mN+1 = SN+1 ( SN−1 mN + β tN+1 φ(xN+1) )

Exercise 3.10 ??
By making use of the result (2.115) to evaluate the integral in (3.57), verify that the predictive
distribution for the Bayesian linear regression model is given by (3.58) in which the input-dependent
variance is given by (3.59).

Proof. We’ve seen in Section 2.3.3 that given a marginal Gaussian distribution for x and a con-
ditional Gaussian distribution for y given x in the forms (2.113) and (2.114), we have that the
marginal distribution of y is given by
p(y) = N (y|Aµ + b, L−1 + AΛ−1 AT ) (2.115)
Therefore, if we consider the terms under the integral in (3.57), we have that
p(w|t, x, α, β) = N (w|mN , SN )
p(t|w, x, α, β) = N (t|wT φ(x), β −1 )
so the integral now becomes:
Z
p(t|x, t, α, β) = p(t|x, w, β)p(w|t, x, α, β) dw (3.57)
Z
= N (w|mN , SN )N (t|wT φ(x), β −1 ) dw

Our goal is to find the parameters of this distribution. Since the integral involves the convolution
of two Gaussians, by following the notation used in (2.113), (2.114), and (2.115), we’d have that
µ = mN SN = Λ−1 A = φ(x)T b = 0 L−1 = β −1
Finally, by substituting our values into (2.115), it is straightforward to see that the predictive
distribution for the Bayesian linear regression model is given by

p(t|x, t, α, β) = N( t | φ(x)T mN, σN²(x) )                  (3.58)

where the input-dependent variance is given by

σN²(x) = 1/β + φ(x)T SN φ(x)                  (3.59)

Exercise 3.11 ??
We have seen that, as the size of a data set increases, the uncertainty associated with the posterior
distribution over model parameters decreases. Make use of the matrix identity (Appendix C)
( M + vvT )−1 = M−1 − (M−1 v)(vT M−1) / ( 1 + vT M−1 v )                  (3.110)
2
to show that the uncertainty σN +1 (x) associated with the linear regression function given by (3.59)
satisfies
2 2
σN +1 (x) ≤ σN (x) (3.111)

Proof. By using (3.59) and then (3.8.2) we have that:

σN+1²(x) = 1/β + φ(x)T SN+1 φ(x) = 1/β + φ(x)T ( SN−1 + β φ(xN+1) φ(xN+1)T )−1 φ(x)

We apply (3.110) with M = SN−1 and v = β^{1/2} φ(xN+1) and get that

σN+1²(x) = 1/β + φ(x)T ( SN − β SN φ(xN+1) φ(xN+1)T SN / ( 1 + β φ(xN+1)T SN φ(xN+1) ) ) φ(x)
         = 1/β + φ(x)T SN φ(x) − φ(x)T SN φ(xN+1) φ(xN+1)T SN φ(x) / ( 1/β + φ(xN+1)T SN φ(xN+1) )
         = σN²(x) − φ(x)T SN φ(xN+1) φ(xN+1)T SN φ(x) / ( 1/β + φ(xN+1)T SN φ(xN+1) )

Therefore,

σN²(x) − σN+1²(x) = ( φ(x)T SN φ(xN+1) ) ( φ(xN+1)T SN φ(x) ) / ( 1/β + φ(xN+1)T SN φ(xN+1) )                  (3.11.1)

Since SN is a covariance matrix, it is symmetric, so the numerator satisfies

( φ(x)T SN φ(xN+1) ) ( φ(xN+1)T SN φ(x) ) = ( φ(xN+1)T SN φ(x) )² ≥ 0

Moreover, SN is positive semidefinite and the noise precision β is positive, so the denominator satisfies

1/β + φ(xN+1)T SN φ(xN+1) > 0

Hence the right-hand side of (3.11.1) is non-negative, which is equivalent to (3.111).
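A quick numerical illustration (a NumPy sketch with an arbitrary Gaussian-basis model; none of the specific values come from the text): the predictive variance (3.59), with SN given by (3.54), evaluated on a grid never increases when one extra observation is absorbed via (3.8.2).

import numpy as np

rng = np.random.default_rng(3)

alpha, beta = 1.0, 10.0
centres = np.linspace(-1, 1, 6)

def phi(x):
    # Gaussian basis functions plus a bias term (an arbitrary illustrative choice).
    return np.concatenate(([1.0], np.exp(-(x - centres) ** 2 / (2 * 0.3 ** 2))))

X = rng.uniform(-1, 1, size=15)
S = np.linalg.inv(alpha * np.eye(len(centres) + 1)
                  + beta * sum(np.outer(phi(x), phi(x)) for x in X))   # S_N, eq. (3.54)

x_new = 0.3                                                            # the extra data point x_{N+1}
S_next = np.linalg.inv(np.linalg.inv(S) + beta * np.outer(phi(x_new), phi(x_new)))  # (3.8.2)

grid = np.linspace(-1.5, 1.5, 200)
var_N  = np.array([1 / beta + phi(x) @ S @ phi(x) for x in grid])       # sigma_N^2,     eq. (3.59)
var_N1 = np.array([1 / beta + phi(x) @ S_next @ phi(x) for x in grid])  # sigma_{N+1}^2

print(np.all(var_N1 <= var_N + 1e-12))   # True: the uncertainty never grows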

Exercise 3.12 ??
We saw in Section 2.3.6 that the conjugate prior for a Gaussian distribution with unknown mean
and unknown precision (inverse variance) is a normal-gamma distribution. This property also
holds for the case of the conditional Gaussian distribution p(t|x, w, β) of the linear regression
model. If we consider the likelihood function (3.10), then the conjugate prior for w and β is given
by
p(w, β) = N (w|m0 , β −1 S0 )Gam(β|a0 , b0 ) (3.112)
Show that the corresponding posterior distribution takes the same functional form, so that
p(w, β|t) = N (w|mN , β −1 SN )Gam(β|aN , bN ) (3.113)
and find expressions for the posterior parameters mN , SN , aN , and bN .

Proof. We have that

p(w, β|t) ∝ p(w, β)p(t|w, β)


N
Y
−1
∝ N (w|m0 , β S0 )Gam(β|a0 , b0 ) N (tn |wT φ(xn ), β −1 )
n=1

so
N
Y
ln p(w, β|t) = ln N (w|m0 , β −1 S0 ) + ln Gam(β|a0 , b0 ) + ln N (tn |wT φ(xn ), β −1 ) + const
n=1

We decompose each logarithm, this time also keeping each term depending on β. The log likelihood
is derived like in Exercise 3.7, that is:
N
Y β β N
ln N (tn |wT φ(xn ), β −1 ) = − wT ΦT Φw + βwT ΦT t − tT t + ln β
n=1
2 2 2

The logarithms of factors in the prior are given by:


β β T −1
ln N (w|m0 , β −1 S0 ) = − wT S−1 T −1
0 w + βw S0 m0 − m S m0
2 2 0 0
ln Gam(β|a0 , b0 ) = − ln Γ(a0 ) + a0 ln b0 + a0 ln β − ln β − b0 β
Now, the log of the posterior is given by:
β β T β
ln p(w, β|t) = − wT (S−1 T T −1
0 + Φ Φ)w + βw (S0 m0 + Φ t) −
T
t t − mT0 S−1
0 m0
2 2 2
N
+ ln β + (a0 − 1) ln β − b0 β + const
2
The covariance matrix of the posterior is easily found from the quadratic term, that is:

S−1 −1 T
N = S0 + Φ Φ (3.51*)

The mean is obtained from the linear term by using the fact that

wT S−1 T −1 T
N mN = w (S0 m0 + Φ t)

so
mN = SN (S−1 T
0 m0 + Φ t) (3.50*)
From the constant terms with respect to w we’ll obtain the parameters of the Gamma distribution.
bN is obtained by using the linear terms containing β. Since we already know the covariance and
the mean, we can deduce the linear terms of the posterior distribution, so we’ll have that:
β T −1 β β
−βbN − mN SN mN = − tT t − mT0 S−1
0 m0 − βb0
2 2 2
which gives
1 T T −1
t t + mT0 S−1

bN = b0 + 0 m0 − m S
N N m N (3.12.1)
2
Finally, aN is given by the terms containing ln β. By knowing the ln β terms that will be used in
the expansion of the log posterior, we have that
N
(aN − 1) ln β = ln β + (a0 − 1) ln β
2
Hence, it is straightforward to obtain the result
N
aN = a0 + (2.150)
2

Exercise 3.13 ??
Show that the predictive distribution p(t|x, t) for the model discussed in Exercise 3.12 is given by
a Student’s t-distribution of the form

p(t|x, t) = St(t|µ, λ, ν) (3.114)

and obtain expressions for µ, λ, ν.

Proof. The Student’s t-distribution is given by


 1/2  −ν/2−1/2
Γ(ν/2 + 1/2) λ λ(x − µ)2
St(x|µ, λ, ν) = 1+ (2.159)
Γ(ν/2) πν ν
However, our goal is to obtain it in the form
Z ∞
N x|µ, (ηλ)−1 Gam(η|ν/2, ν/2) dη

St(x|µ, λ, ν) = (2.158)
0

which for ν = 2a and λ = a/b is equivalent to (2.159).


We have that the predictive distribution is given by
ZZ
p(t|x, t) = p(t|x, w, β)p(w, β|x, t) dw dβ

The factors under the integral are already known from (3.8) and (3.113), so
ZZ
p(t|x, t) = N (t|wT φ(x), β −1 )N (w|mN , β −1 SN )Gam(β|aN , bN ) dw dβ
Z Z 
T −1 −1
= N (t|w φ(x), β )N (w|mN , β SN ) dw Gam(β|aN , bN ) dβ

The integral with respect to w is of the same form as (3.57), so by (3.58) it evaluates to a Gaussian
in t. Knowing this, we have that

p(t|x, t) = ∫ N( t | φ(x)T mN, β−1 (1 + φ(x)T SN φ(x)) ) Gam(β|aN, bN) dβ

Changing variables to τ = β / (1 + φ(x)T SN φ(x)) turns this into

p(t|x, t) = ∫ N( t | φ(x)T mN, τ−1 ) Gam( τ | aN, bN (1 + φ(x)T SN φ(x)) ) dτ

and, since such a Gaussian–Gamma mixture with Gam(τ|a, b) gives a Student's t-distribution with
λ = a/b and ν = 2a (as noted above), the predictive distribution is (3.114) with

µ = φ(x)T mN        λ = aN / ( bN (1 + φ(x)T SN φ(x)) )        ν = 2aN
Exercise 3.14 ??
In this exercise, we explore in more detail the properties of the equivalent kernel defined by (3.62),
where SN is defined by (3.54). Suppose that the basis functions φj (x) are linearly independent and
that the number N of data points is greater than the number M of basis functions. Furthermore,
let one of the basis functions be constant, say φ0 (x) = 1. By taking suitable linear combinations
of these basis functions, we can construct a new basis set ψj (x) spanning the same space but
orthonormal, so that
XN
ψj (xn )ψk (xn ) = Ijk (3.115)
n=1

where Ijk is defined to be 1 if j = k and 0 otherwise, and we take ψ0 (x) = 1. Show that for α = 0,
the equivalent kernel can be written as k(x, x0 ) = ψ(x)T ψ(x0 ) where ψ = (ψ0 , . . . , ψM −1 )T . Use
this result to show that the kernel satisfies the summation constraint
N
X
k(x, xn ) = 1 (3.116)
n=1

Proof. The equivalent kernel is defined by


k(x, x0 ) = βφ(x)T SN φ(x0 ) (3.62)
where
S−1 T
N = αI + βΦ Φ (3.54)
We’ll use the newly defined basis set and construct the coresponding design matrix, whose elements
are given by Ψnj = ψj (xn ), so that
 
ψ0 (x1 ) ψ1 (x1 ) · · · ψM −1 (x1 )
 ψ0 (x2 ) ψ1 (x2 ) · · · ψM −1 (x2 ) 
Ψ =  ..
 
.. . . .. 
 . . . . 
ψ0 (xN ) ψ1 (xN ) · · · ψM −1 (xN )
Since the basis set is orthonormal, we have that
N
X
ΨT Ψ = ψ(xn )ψ(xn )T = I
n=1

Now, for α = 0, SN becomes

SN = ( βΨT Ψ )−1 = (1/β) I
Therefore, by following (3.62) the equivalent kernel can be written as
k(x, x0 ) = βψ(x)T SN ψ(x0 ) = ψ(x)T ψ(x0 )
Finally, the summation constraint (3.116) obviously holds, since from ψ0 (x) = 1, we have that
N
X N
X N M
X X −1 M
X −1 N
X
T
k(x, xn ) = ψ(x) ψ(xn ) = ψj (x)ψj (xn ) = ψj (x) ψj (xn )ψ0 (xn )
n=1 n=1 n=1 j=0 j=0 n=1

M
X −1
= ψj (x)Ij+1,1 = 1
j=0

Exercise 3.15 ?
Consider a linear basis function model for regression in which the parameters α and β are set using
the evidence framework. Show that the function E(mN ) defined by (3.82) satisfies the relation
2E(mN ) = N .

Proof. Our function is given by


E(mN) = (β/2) ||t − ΦmN||² + (α/2) mNT mN                  (3.82)

We will be using the quantity γ defined in Section 3.5.2 to derive our result. From (3.92) we get
that

mNT mN = γ/α

and (3.95) gives

||t − ΦmN||² = (N − γ)/β

Therefore,

2E(mN) = β ||t − ΦmN||² + α mNT mN = (N − γ) + γ = N
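The fixed point of the evidence framework can also be checked numerically: iterating the re-estimation equations (3.92) and (3.95) (derived in Exercises 3.20 and 3.22) to convergence and then evaluating (3.82) gives 2E(mN) ≈ N. Below is a NumPy sketch on arbitrary synthetic data; the sinusoidal targets and polynomial basis are illustrative choices only.

import numpy as np

rng = np.random.default_rng(4)

N, M = 40, 6
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)
Phi = np.vander(x, M, increasing=True)          # polynomial basis

alpha, beta = 1.0, 1.0                          # arbitrary starting values
for _ in range(200):
    A = alpha * np.eye(M) + beta * Phi.T @ Phi  # eq. (3.81)
    mN = beta * np.linalg.solve(A, Phi.T @ t)   # eq. (3.84)
    lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)
    gamma = np.sum(lam / (alpha + lam))         # eq. (3.91)
    alpha = gamma / (mN @ mN)                   # eq. (3.92)
    beta = (N - gamma) / np.sum((t - Phi @ mN) ** 2)   # eq. (3.95)

E_mN = 0.5 * beta * np.sum((t - Phi @ mN) ** 2) + 0.5 * alpha * (mN @ mN)  # eq. (3.82)
print(2 * E_mN, N)   # the two numbers agree at the fixed point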

Exercise 3.16 ??
Derive the result (3.86) for the log evidence function p(t|α, β) of the linear regression model by
making use of (2.115) to evaluate the integral (3.77) directly.

Proof. The marginal likelihood function is given by the integral


Z
p(t|α, β) = p(t|w, β)p(w|α) dw (3.77)

The first factor under the integral is the likelihood (3.10), while the second factor is given by (3.52).
Therefore, the evidence function becomes
Z Y N
p(t|α, β) = p(tn |wφ(xN ), β −1 )N (w|0, α−1 I) dw
n=1

Our aim is to find a proportional Gaussian form for the likelihood term and then use (2.115) to
evaluate the integral directly. We’ve seen in Exercise 3.12 that
N
Y β β
ln N (tn |wT φ(xn ), β −1 ) = − wT ΦT Φw + βwT ΦT t − tT t + const
n=1
2 2

This can be rewritten as a quadratic form which corresponds to a Gaussian:
N
Y β β β β
ln N (tn |wT φ(xn ), β −1 ) = − wT ΦT Φw + wT ΦT t + tT Φw − tT t + const
n=1
2 2 2 2
β
= − ||t − Φw||2 + const
2
1
= − (t − Φw)T (βI)(t − Φw) + const
2
= ln N (t|Φw, β −1 I) + const

Therefore,
N
Y
p(t|w, β) = N (tn |wT φ(xn ), β −1 ) ∝ N (t|Φw, β −1 I)
n=1

so the evidence function is now given by


Z
p(t|α, β) ∝ N (t|Φw, β −1 I)N (w|0, α−1 I) dw

Since the integral involves the convolution of two Gaussians, by following the notation used in
(2.113) and (2.114), we’d have that

x=w y=t µ=0 Λ−1 = αI A=Φ b=0 L−1 = βI

Applying (2.115) yields the Gaussian form of the evidence function:

p(t|α, β) ∝ N (t|0, β −1 I + α−1 ΦΦT )

By applying the Woodbury identity (C.7) with

A = β −1 I B=Φ C = α−1 I D = ΦT

the precision matrix of this Gaussian will be given by

(β −1 I + α−1 ΦΦT )−1 = βI − β 2 Φ(αI + βΦT Φ)−1 ΦT


= βI − β 2 ΦA−1 ΦT

where A is given by (3.81). Hence, by using (3.81) and (3.84), we obtain that the quadratic term
in the exponential of the Gaussian has the form:
1 1
− tT (β −1 I + α−1 ΦΦT )−1 t = − tT (βI − β 2 ΦA−1 ΦT )−1 t
2 2
β β
= − tT t + tT ΦmN
2 2
β T β T
= − t t + mN AmN
2 2
β T β
= − t t + βmTN AmN − mTN AmN
2 2
β T α β
= − t t + βmTN AmN − mTN mN − mN ΦT ΦmN
2 2 2

As seen in Exercise 3.18, this is actually equal to −E(mN ), so
1
− tT (β −1 I + α−1 ΦΦT )−1 t = −E(mN )
2
Now, since Φ is a N × M matrix, we apply (C.14) and have that:

β
|β −1 IN + α−1 ΦΦT | = β −N IN + ΦΦT
α
β
= β −N IM + ΦT Φ
α
−M −N
= α β |αIM + βΦT Φ|
= α−M β −N |A|

Therefore, we finally expand the Gaussian form of the evidence function to obtain:

p(t|α, β) ∝ N (t|0, β −1 I + α−1 ΦΦT )


 
1 1 1 T −1 −1 T −1
∝ exp − t (β I + α ΦΦ ) t
(2π)N/2 |β −1 IN + α−1 ΦΦT |1/2 2
1
∝ α−M/2 β −N/2 |A|−1/2 exp{−E(mN )}
(2π)N/2

Hence, we easily derive the log marginal likelihood as


N M N 1
ln p(t|α, β) = − ln(2π) + ln α + ln β − ln |A| − E(mN ) + const (3.86)
2 2 2 2

Exercise 3.17 ?
Show that the evidence function for the Bayesian linear regression model can be written in the
form (3.78) in which E(w) is defined by (3.79).

Proof. The log likelihood is given by


N N
ln p(t|w, β) = ln β − ln(2π) − βED (w) (3.11)
2 2
By applying the exponential function on both sides of the expression, we obtain that
 N/2
β
p(t|w, β) = exp{−βED (w)}

We continue by expanding the Gaussian

p(w|α) = N (w|0, α−1 I) (3.52)

to get that
 M/2    M/2
α α α
p(w|α) = exp − wT w = exp{−αEW (w)}
2π 2 2π

Therefore, by replacing into (3.77), we obtain


Z  N/2  M/2 Z
β α
p(t|α, β) = p(t|w, β)p(w|α) dw = exp{−αEW (w) − βED (w)} dw
2π 2π
 N/2  M/2 Z
β α
= exp{−E(w)} dw (3.78)
2π 2π

where
β α
E(w) = βED (w) + αEW (w) = ||t − Φw||2 + wT w (3.79)
2 2

Exercise 3.18 ??
By completing the square over w, show that the error function (3.79) in Bayesian linear regression
can be written in the form (3.80).

Proof. Our first step is expanding E(w):

β α
E(w) = ||t − Φw||2 + wT w (3.79)
2 2
β T β T β β α
= t t − t Φw − wT ΦT t + wT ΦT Φw + wT w
2 2 2 2 2
We continue by doing the same for E(mN ) and obtain that:

β α
E(mN ) = ||t − ΦmN ||2 + mTN mN
2 2
β T β T β β α
= t t − t ΦmN − mTN ΦT t + mTN ΦT ΦmN + mTN mN
2 2 2 2 2
A is a Hessian matrix, so it’s symmetric. By using this and the expressions

A = αI + βΦT Φ (3.81)

mN = βA−1 ΦT t (3.84)
We notice that the negative terms in the expansions can be written as
β β 1 1
− tT Φw − wT Φt = − mTN Aw − wT AmN = −mTN Aw
2 2 2 2
β β 1 1
− tT ΦmN − mTN Φt = − mTN AmN − mTN AmN = −mTN AmN
2 2 2 2

Hence,
β T β α
E(w) = t t − mTN Aw + wT ΦT Φw + wT w
2 2 2
β β α
E(mN ) = tT t − mTN AmN + mTN ΦT ΦmN + mTN mN
2 2 2
By taking the difference of the error functions and then repeatedly making use of (3.81), we reach
a point when we can complete the square:
β T T β α α
E(w) − E(mN ) = −mTN Aw + mTN AmN + w Φ Φw − mTN ΦT ΦmN + wT w − mTN mN
2 2 2 2
1
= −mTN Aw + mTN AmN + wT αI + βΦT Φ w − mTN αI + βΦT Φ mN
 
2
1 1
= −mTN Aw + mTN AmN + wT Aw − mTN AmN
2 2
1 T 1 T 1 T 1
= w Aw − mN Aw − w AmN + mTN AmN
2 2 2 2
1
= (w − mN )T A(w − mN )
2
which directly proves that (3.79) can be written as
1
E(w) = E(mN ) + (w − mN )T A(w − mN ) (3.80)
2

Exercise 3.19 ??
Show that the integration over w in the Bayesian linear regression model gives the result (3.85).
Hence show that the log marginal likelihood is given by (3.86).

Proof. We start by rewriting E(w) like in (3.80) and obtain that


Z Z  
1 T
exp{−E(w)} dw = exp − E(mN ) − (w − mN ) A(w − mN ) dw
2
Z  
1 T
= exp{−E(mN )} exp − (w − mN ) A(w − mN ) dw
2
Z  
1 T
= exp{−E(mN )} exp − (w − mN ) A(w − mN ) dw
2

The integral is easily solved by noticing that the quadratic term under the exponential term
corresponds to a Gaussian of the form N (w|mN , A−1 ). Because the Gaussian distribution is
normalized, we then have that
Z
exp{−E(w)} dw
Z  
M/2 −1/2 1 1 1 T
= exp{−E(mN )}(2π) |A| exp (w − mN ) A(w − mN ) dw
(2π)M/2 |A|−1/2 2

Z
−1/2
= exp{−E(mN )}(2π) M/2
|A| N (w|mN , A−1 ) dw

= exp{−E(mN )}(2π)M/2 |A|−1/2 (3.85)

By substituting this into (3.78), the evidence function becomes


 N/2
β
p(t|α, β) = αM/2 |A|−1/2 exp{−E(mN )}

Hence, the log marginal likelihood is given by


M N 1 N
ln p(t|α, β) = ln α + ln β − E(mN ) − ln |A| − ln(2π) (3.86)
2 2 2 2

Exercise 3.20 ??
Verify all of the steps needed to show that maximization of the log marginal likelihood function
(3.86) with respect to α leads to the re-estimation equation (3.92).

Proof. The steps taken in the maximization of the (3.86) are quite straightforward, so this proof
will be very similar to what’s in the book. By defining the eigenvector equation

(βΦT Φ)ui = λi ui (3.87)

we’d have that


Aui = (αI + βΦT Φ)ui = αui + (βΦT Φ)ui = (α + λi )ui
which shows that the eigenvalues of A are α + λi , where A is given by (3.81). Now, since the
determinant of a matrix is the product of its eigenvalues, we have that
d d Y d X X 1
ln |A| = ln (λi + α) = ln(λi + α) = (3.88)
dα dα i
dα i i
λi + α

The derivative of (3.86) with respect to α is given by

∂ M 1 1X 1
ln p(t|α, β) = − mTN mN −
∂α 2α 2 2 i λi + α

so the stationary points will satisfy


M 1 1X 1
0= − mTN mN − (3.89)
2α 2 2 i λi + α

Multiplying by 2α and then rearranging, we have that


X 1 X α
 X
λi
T
αm mN = M − α = 1− = =γ
i
λi + α i
λi + α i
α + λi

Therefore, the value of α that maximizes the marginal likelihood is given by
γ
α= (3.92)
mTN mN
where γ is defined by
X λi
γ= (3.91)
i
α + λi

Exercise 3.21 ??
An alternative way to derive the result (3.92) for the optimal value of α in the evidence framework
is to make use of the identity (note that we changed the variables A and α for D and δ so that
there is no confusion with the variables used in our framework)
 
d −1 d
ln |D| = Tr D D (3.117)
dδ dδ
Prove this identity by considering the eigenvalue expansion of a real symmetric matrix D, and
making use of the standard results for the determinant and trace of D expressed in terms of its
eigenvalues (Appendix C). Then make use of (3.117) to derive (3.92) starting from (3.86).

Proof. Since D is symmetric, we can consider the N eigenvector equations

Dui = ωi ui

where the eigenvectors ui were chosen to be orthonormal, as seen in Appendix C. By using (C.47)
to rewrite the determinant, the left-hand side of (3.117) becomes
N N N N
d d Y d X X d X 1 d
ln |D| = ln ωi = ln ωi = ln ωi = ωi
dδ dδ i=1 dδ i=1 i=1
dδ i=1
ωi dδ

Now, by considering the eigenvalue expansions given by (C.45) and (C.46),


N N
X
−1
X 1
D= ωi ui uTi D = ui uTi
i=1 i=1
ω i

we can rewrite the term inside the trace operator as:


N
X  N 
−1 d 1 d X
D D= ui uTi T
ωi ui ui
dδ i=1
ωi dδ j=1
N
X N
X
1 d
ui uTi ωj uj uTj

=
i=1
ωi j=1

XN XN      
1 T d T d T d T
= ui ui ωj uj uj + ωj uj uj + ωi ui u
i=1
ωi j=1
dδ dδ dδ j

N X N   N X N    
X 1 d T T
X ωj T d T d T
= ωj ui ui uj uj + ui ui uj uj + uj uj
i=1 j=1
ωi dδ i=1 j=1
ωi dδ dδ

We’ll analyze the two sum terms separately. Since the eigenvectors were chosen to be orthonormal,
we have that
uTi uj = Iij (C.33)
Therefore, we’ll only have to keep the terms for which i = j in the first sum, so
N X N   N X N   N  
X 1 d T T
X 1 d T
X 1 d
ωj ui ui uj uj = ωj ui Iij uj = ωj ui ui T
i=1 j=1
ωi dδ i=1 j=1
ωi dδ i=1
ω i dδ

Notice that the trace of this term is actually the left-hand side of (3.117), so we’ll continue by
proving that the second sum term has a trace of 0. The second sum term can then be expanded
as
N X N     X N X N  
X ωj T d T d T 2ωj T d T
ui ui uj uj + uj uj = ui ui uj uj
i=1 j=1
ω i dδ dδ i=1 j=1
ω i dδ
N X N  
X 2ωj d T
= ui Iij u
i=1 j=1
ωi dδ j
N   X N    
X d T d T d T
= 2ui ui = ui ui + ui ui
i=1
dδ i=1
dδ dδ
N N
X d d X d
ui uTi = ui uTi = I = 0N

=
i=1
dδ dδ i=1 dδ

Finally, we have that


  X N X N  
−1 d 1 d
Tr D D = Tr ωj ui uTi uj uTj +
dδ ω dδ
i=1 j=1 i
N X N    
X ωj T d T d T
+ ui ui uj uj + uj uj
i=1 j=1
ω i dδ dδ
X N X N   
1 d T T
= Tr ωj ui ui uj uj +
ω dδ
i=1 j=1 i
X N X N    
ωj T d T d T
+ Tr ui ui uj uj + uj uj
i=1 j=1
ωi dδ dδ
XN   
1 d T
= Tr ωi ui ui + Tr(0N )
i=1
ωi dδ
N  
X 1 d
= ωi
i=1
ωi dδ
d
= ln |D| (3.117)

which proves the needed identity. We can derive (3.92) by following exactly the same steps as in
Exercise 3.20 or in the book, except that now we compute (3.88) by using the previously proven
identity and (C.47). Note that A is given by (3.81) and it has the eigenvalues λi + α, so
   
d −1 d −1 d
X 1
αI + βΦ Φ = Tr(A−1 ) =
T

ln |A| = Tr A A = Tr A
dα dα dα i
λi + α

Exercise 3.22 ??
Starting from (3.86) verify all of the steps needed to show that maximization of the log marginal
likelihood function (3.86) with respect to β leads to the re-estimation equation (3.95).

Proof. As in Exercise 3.20, one could prove that A has the eigenvalues λi + α. Also, from (3.87)
the eigenvalues are proportional to β, so dλi /dβ = λi /β, so
d d X 1 X λi γ
ln |A| = ln(λi + α) = = (3.93)
dβ dβ i β i λi + α β

where γ is given by (3.91). Therefore, the derivative of (3.87) with respect to β is given by
N
∂ N −γ 1X
p(t|α, β) = − {tn − mTN φ(xn )}2
∂β 2β 2 n=1

and the critical points satisfy


N
N −γ 1X
= {tn − mTN φ(xn )}2
2β 2 n=1

which after multiplying both sides by 2/(N − γ) gives us the re-estimation equation
N
1 1 X
= {tn − mTN φ(xn )}2 (3.95)
β N − γ n=1

Exercise 3.23 ??
Show that the marginal probability of the data, in other words the model evidence, for the model
described in Exercise 3.12 is given by

1 ba00 Γ(aN ) |SN |1/2


p(t) = (3.118)
(2π)N/2 baNN Γ(a0 ) |S0 |1/2
by first marginalizing with respect to w and then with respect to β.

Proof. By marginalizing with respect to w and then with respect to β, the model evidence will be
given by

p(t) = ∫∫ p(w, β) p(t|w, β) dw dβ

The first factor under the integral is the prior given by (3.112), while the second factor is the
likelihood (3.10). We proved in Exercise 3.16 that the likelihood is proportional to N (t|Φw, β −1 I),
so the marginal probability becomes
ZZ
p(t) = N (w|m0 , β −1 S0 )Gam(β|a0 , b0 )N (t|Φw, β −1 I) dw dβ
ZZ
= Gam(β|a0 , b0 )N (w|m0 , β −1 S0 )N (t|Φw, β −1 I) dw dβ

By expanding the three distributions, we have that

ba00
ZZ
1 1
p(t) = N +M 1/2
β a0 −1+N/2+M/2 exp{βb0 }
(2π) 2 Γ(a0 ) |S0 |
   
β 2 β T −1
exp − ||t − Φw|| exp − (w − m0 ) S0 (w − m0 ) dw dβ
2 2

Let’s expand the term under the w integral, and then use (3.50) and (3.51) to complete the square:
   
β 2 β T −1
exp − ||t − Φw|| exp − (w − m0 ) S0 (w − m0 )
2 2
 
β 2 β T −1
= exp − ||t − Φw|| − (w − m0 ) S0 (w − m0 )
2 2
 
β T T T T T T −1 T −1 T −1

= exp − t t − 2w Φ t + w Φ Φw + w S0 w − 2w S0 m0 + m0 S0 m0
2
 
β  T −1 T
 T −1
 T
 T T −1

= exp − w S0 + Φ Φ w − 2w S0 m0 + Φ t + t t + m0 S0 m0
2
 
β T −1 T −1 T T −1

= exp − w SN w − 2w SN mN + t t + m0 S0 m0
2
 
β T −1 T −1 T T −1

= exp − (w − mN ) SN (w − mN ) − mN SN mN + t t + m0 S0 m0
2

We can rewrite (3.12.1) as


1 1 1 T −1
b0 = bN − tT t − mT0 S−1
0 m0 + mN SN mN
2 2 2
Therefore,
   
β 2 β T −1
exp{βb0 } exp − ||t − Φw|| exp − (w − m0 ) S0 (w − m0 )
2 2
 
β T −1
= exp{βbN } exp − (w − mN ) SN (w − mN )
2

and since both the Gaussian and Gamma distributions are normalized, the marginal probability
finally becomes what we wanted:

1 ba00 1
p(t) = N +M
(2π) 2 Γ(a0 ) |S0 |1/2
Z Z  
a0 −1+N/2+M/2 β T −1
β exp{βbN } exp − (w − mN ) SN (w − mN ) dw dβ
2
a0
1 b0 1
= N +M
(2π) 2 Γ(a0 ) |S0 |1/2
Z Z  M/2
a0 −1+N/2+M/2 2π
β |SN |1/2 exp{βbN }N (w|mN , β −1 SN ) dw dβ
β
ba00 |SN |1/2
Z Z
1 a0 −1+N/2
= N β exp{βbN } N (w|mN , β −1 SN ) dw dβ
(2π) 2 Γ(a0 ) |S0 |1/2
ba00 |SN |1/2
Z
1
= N β a0 −1+N/2 exp{βbN } dβ
(2π) 2 Γ(a0 ) |S0 |1/2
ba00 |SN |1/2
Z
1 Γ(aN )
= Gam(β|aN , bN ) dβ
N
(2π) 2 Γ(a0 ) |S0 | 1/2 baNN
1 ba00 Γ(aN ) |SN |1/2
= (3.118)
(2π)N/2 baNN Γ(a0 ) |S0 |1/2

where aN and bN were derived in Exercise 3.12 and are given by (2.150), respectively (3.12.1).

Exercise 3.24 ??
Repeat the previous exercise but now use Bayes’ theorem in the form

p(t|w, β)p(w, β)
p(t) = (3.119)
p(w, β|t)

and then substitute for the prior and posterior distributions and the likelihood function in order
to derive the result (3.118).

Proof. We start by substituting the prior (3.112), the posterior (3.113) and the likelihood (3.10).
The evidence function becomes
N (t|Φw, β −1 I)N (w|m0 , β −1 S0 )Gam(β|a0 , b0 )
p(t) =
N (w|mN , β −1 SN )Gam(β|aN , bN )

Let’s expand the numerator:


 M/2  N/2 a0 a0 −1
−1 −1 β β b0 β
N (t|Φw, β I)N (w|m0 , β S0 )Gam(β|a0 , b0 ) = exp{βb0 }
2π 2π Γ(a0 )|S0 |1/2
   
β β T −1
exp − ||t − Φw||2 exp − (w − m0 ) S0 (w − m0 )
2 2

We’ve seen in the previous exercise that
   
β 2 β T −1
exp{βb0 } exp − ||t − Φw|| exp − (w − m0 ) S0 (w − m0 )
2 2
 
β T −1
= exp{βbN } exp − (w − mN ) SN (w − mN )
2

so the numerator can be written as

N (t|Φw,β −1 I)N (w|m0 , β −1 S0 )Gam(β|a0 , b0 ) =


 M/2  N/2 a0 a0 −1  
β β b0 β β T −1
= exp{βbN } exp − (w − mN ) SN (w − mN )
2π 2π Γ(a0 )|S0 |1/2 2

Since we already have the exponential terms of the Gamma and Gaussian distributions, we can
obtain the distributions by dividing by their normalization constants, so:

N (t|Φw,β −1 I)N (w|m0 , β −1 S0 )Gam(β|a0 , b0 ) =


 M/2  N/2 a0 a0 −1
β β b0 β
=
2π 2π Γ(a0 )|S0 |1/2
  M/2 
Γ(aN ) 2π 1/2 −1
Gam(β|aN , bN ) |SN | N (w|mN , β SN )
baNN β aN −1 β
1 ba00 Γ(aN ) |SN |1/2
= N/2 aN 1/2
Gam(β|aN , bN )N (w|mN , β −1 SN )
(2π) bN Γ(a0 ) |S0 |

Finally, we substitute the numerator back into the evidence function and since the distribution
forms factor out, we prove our hypothesis, that:

1 ba00 Γ(aN ) |SN |1/2


p(t) = (3.118)
(2π)N/2 baNN Γ(a0 ) |S0 |1/2

Chapter 4

Linear Models for Classification

Exercise 4.1 ??
Given a set of data points {xn }, we can define the convex hull to be the data set of all points x
given by X
x= αn xn (4.156)
n
where αn ≥ 0 and Σn αn = 1. Consider a second set of points {yn} together with their corre-
sponding convex hull. By definition, the two sets of points will be linearly separable if there exists
a vector ŵ and a scalar w0 such that ŵT xn + w0 > 0 for all xn, and ŵT yn + w0 < 0 for all yn.
Show that if their convex hulls intersect, the two sets of points cannot be linearly separable, and
conversely that if they are linearly separable, their convex hulls do not intersect.

Proof. The vertices of the convex hulls are data points from {xn} and {yn}. Therefore, the edges
of the hulls are segments between data points. As a result, any point situated on the boundary of a
hull can be written as a convex combination of the end-points of the segment that contains it. Also,
one can easily see that if two hulls intersect, they intersect in at least one point that lies on the
boundaries of both hulls.

1st Hypothesis: If the hulls intersect, the two sets of points are not linearly separable

Assume that the two hulls intersect in the point z situated on both hulls boundaries. From
what we’ve seen above, the point z can be expressed as a convex combination between two data
points of each set of data points. Therefore, there exist xA , xB from {xn }, yA , yB from {yn } and
λx , λy ∈ [0, 1] such that we can express z as

λx xA + (1 − λx )xB = λy yA + (1 − λy )yB

Suppose that the sets {xn } and {yn } are linearly separable. Then there exists a discriminant
function
θ(a) = wb T a + w0
such that θ(xn ) > 0 for all xn and θ(yn ) < 0 for all yn . From the linearity of the discriminant
function, and rewriting θ(z) using the two convex combination forms, we have that

λx θ(xA ) + (1 − λx )θ(xB ) = λy θ(yA ) + (1 − λy )θ(yB )

Since θ(xA ), θ(xB ) > 0 and θ(yA ), θ(yB ) < 0, this expression is obviously false, since the left-hand
side of the equality is positive and the right-hand one is negative. Therefore, our supposition that
the data sets are linearly separable is false and our main hypothesis is true.

2nd Hypothesis: If the two sets of points are linearly separable, then the hulls don’t intersect.

This statement is the contrapositive of the 1st hypothesis, so it holds as well.

Exercise 4.2 ?? TODO


Consider the minimization of a sum-of-squares error function (4.15), and suppose that all of the
target vectors in the training set satisfy a linear constraint
aT tn + b = 0 (4.157)
where tn corresponds to the nth row of the matrix T in (4.15). Show that as a consequence of this
constraint, the elements of the model prediction y(x) given by the least-squares solution (4.17)
also satisfy this constraint, so that
aT y(x) + b = 0 (4.158)
To do so, assume that one of the basis functions φ0 (x) = 1, so that the corresponding parameter
w0 plays the role of a bias.

Exercise 4.4 ?
Show that maximization of the class separation criterion given by (3.24) with respect to w, using
a Lagrange multiplier to enforce the constraint wT w = 1, leads to the result that w ∝ (m2 − m1 ).

Proof. Our goal is to maximize


m2 − m1 = wT (m2 − m1 ) (3.24)
T
with the constraint that w w = 1. The corresponding Lagrangian is given by
L(w, λ) = wT (m2 − m1 ) + λ(wT w − 1)
By taking the gradient of this with respect to w and λ, we have that
 
m2 − m1 + 2λw
∇w,λ L(w, λ) =
wT w − 1
Setting to 0 the derivative with respect to w gives the initial result, that is
m1 − m2
w=

By replacing into the λ derivative and setting it to 0, we’d obtain that
1
λ = ||m1 − m2 ||2
4
which gives
2(m1 − m2 )
w= ∝ (m2 − m1 )
||m1 − m2 ||2

Exercise 4.5 ?
By making use of (4.20), (4.23), and (4.24), show that the Fisher criterion (4.25) can be written
in the form (4.26).

Proof. The Fisher criterion is defined to be the ratio of the between-class variance to the within-
class variance and is given by
(m2 − m1 )2
J(w) = (4.25)
s21 + s22
where
mk = wT mk (4.23)
and X
s2k = (yn − mk )2 (4.24)
n∈Ck

By substituting (4.23) into the numerator of the Fisher expression,


(m2 − m1 )2 = (wT m2 − wT m1 )2 = wT (m2 − m1 )wT (m2 − m1 ) = wT (m2 − m1 )(m2 − m1 )T w
= wT SB w
where SB is the between-class covariance matrix and is given by
SB = (m2 − m1 )(m2 − m1 )T (4.27)
Similarly, we use the fact that the projection of the D-dimensional input vector x to one dimension
is given by

y = wT x                  (4.20)
along with (4.23) and (4.24) to rewrite the denominator as
X X
s21 + s22 = (yn − m1 )2 + (yn − m2 )2
n∈C1 n∈C2
X X
= (wT xn − wT m1 )2 + (wT xn − wT m2 )2
n∈C1 n∈C2
X X
= wT (xn − m1 )(xn − m1 )T w + wT (xn − m2 )(xn − m2 )T w
n∈C1 n∈C2
X X 
T T T
=w (xn − m1 )(xn − m1 ) + (xn − m2 )(xn − m2 ) w
n∈C1 n∈C2
T
= w SW w
where SW is the within-class covariance matrix and is given by
X X
SW = (xn − m1 )(xn − m1 )T + (xn − m2 )(xn − m2 )T (4.28)
n∈C1 n∈C2

Finally, by substituting the new expressions into (4.25), we can rewrite the Fisher criterion as

J(w) = ( wT SB w ) / ( wT SW w )                  (4.26)
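The identity can be confirmed numerically: for random two-class data and a random projection direction w, evaluating (4.25) from the projected one-dimensional points and (4.26) from SB and SW gives the same value. A NumPy sketch (all data below are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(5)

X1 = rng.standard_normal((30, 2)) + np.array([2.0, 0.0])    # class C1
X2 = rng.standard_normal((40, 2)) + np.array([0.0, 1.0])    # class C2
w = rng.standard_normal(2)                                  # arbitrary projection direction

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Form (4.25): ratio computed from the projected one-dimensional data.
y1, y2 = X1 @ w, X2 @ w
J_projected = (y2.mean() - y1.mean()) ** 2 / (np.sum((y1 - y1.mean()) ** 2)
                                              + np.sum((y2 - y2.mean()) ** 2))

# Form (4.26): ratio of quadratic forms in the between- and within-class covariance matrices.
SB = np.outer(m2 - m1, m2 - m1)                              # eq. (4.27)
SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)       # eq. (4.28)
J_matrix = (w @ SB @ w) / (w @ SW @ w)

print(np.isclose(J_projected, J_matrix))   # True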

Exercise 4.7 ?
Show that the logistic sigmoid function (4.59) satisfies the property σ(−a) = 1 − σ(a) and that its
inverse is given by σ−1(y) = ln( y / (1 − y) ).

Proof. The sigmoid function is given by

σ(a) = 1 / (1 + e−a)                  (4.59)

The symmetry property is easily verified, as

σ(−a) = 1 / (1 + ea) = e−a / (1 + e−a) = 1 − 1 / (1 + e−a) = 1 − σ(a)                  (4.60)

The sigmoid function is bijective, hence invertible. Therefore, let σ(x) = y. Then,

y = 1 / (1 + e−x)  ⇐⇒  (1 + e−x) y = 1  ⇐⇒  e−x = (1 − y)/y  ⇐⇒  x = ln( y / (1 − y) )

so the inverse of the sigmoid function is given by

σ−1(y) = ln( y / (1 − y) )

Exercise 4.8 ?
Using (4.57) and (4.58), derive the result (4.65) for the posterior class probability in the two-
class generative model with Gaussian densities, and verify the results (4.66) and (4.67) for the
parameters w and w0 .

Proof. It is known that the posterior probability for class C1 can be written as

p(C1 |x) = σ(a) (4.57)

where we have defined


p(x|C1 )p(C1 )
a = ln (4.58)
p(x|C2 )p(C2 )
and σ is the logistic sigmoid function defined by (4.59). We start by expanding a and rewriting
it as
p(C1 )
a = ln p(x|C1 ) − ln p(x|C2 ) + ln
p(C2 )
Since the class-conditional densities are Gaussian, i.e. the density for a class Ck is given by
 
1 1 1 T −1
p(x|Ck ) = N (x|µk , Σ) = exp − (x − µk ) Σ (x − µk ) (4.64)
(2π)D/2 |Σ|1/2 2

one can easily obtain that
p(C1 )
a = ln p(x|C1 ) − ln p(x|C2 ) + ln
p(C2 )
1 1 p(C1 )
= µT1 Σ−1 x − µT2 Σ−1 x − µT1 Σ−1 µ1 + µT2 Σ−1 µ2 + ln
2 2 p(C2 )
1 1 p(C1 )
= (µ1 − µ2 )T Σ−1 x − µT1 Σ−1 µ1 + µT2 Σ−1 µ2 + ln
2 2 p(C2 )
T
= w x + w0

where we have defined


w = Σ−1 (µ1 − µ2 ) (4.66)
1 1 p(C1 )
w0 = − µT1 Σ−1 µ1 + µT2 Σ−1 µ2 + ln (4.67)
2 2 p(C2 )
Therefore, the posterior probability for class C1 is given by

p(C1 |x) = σ(wT x + w0 ) (4.65)

Exercise 4.9 ?
Consider a generative classification model for K classes defined by prior class probabilities p(Ck ) =
πk and general class-conditional densities p(φ|Ck ) where φ is the input feature vector. Suppose we
are given a training set {φn , tn } where n = 1, . . . , N , and tn is a binary target vector of length
K that use the 1-of-K coding scheme, so that it has components tnj = Ijk if pattern n is from
class Ck . Assuming that the data points are drawn independently from this model, show that the
maximum-likelihood solution for the prior probabilities is given by
Nk
πk =
N
where Nk is the number of data points assigned to class Ck .

Proof. Let T be the N × K matrix with the rows tTn and Φ the N × M matrix with the rows φTn .
Also, let’s define the column vector π = (π1 , π2 , . . . , πK )T . We have that

p(φn , Ck ) = p(Ck )p(φn |Ck ) = πk p(φn |Ck )

so the likelihood function is given by


N
Y N Y
Y K  tnj
p(T|Φ, π) = p(tn |Φ, π) = πj p(φn |Cj )
n=1 n=1 j=1

The log likelihood is then easily derived as


N X
X K  
ln p(T|Φ, π) = tnj ln πj + tnj ln p(φn |Cj )
n=1 j=1

We aim to maximize this with respect to πk, while maintaining the constraint Σ_{k=1}^{K} πk = 1.
Therefore, by only keeping the terms depending on π, we obtain the Lagrangian

L(π, λ) = Σ_{n=1}^{N} Σ_{j=1}^{K} tnj ln πj + λ ( Σ_{j=1}^{K} πj − 1 )

whose gradient has the components

∂L/∂πk = (1/πk) Σ_{n=1}^{N} tnk + λ        ∂L/∂λ = Σ_{k=1}^{K} πk − 1

By setting this gradient to 0, from the first relation we have that


N
X
πk λ = − tnk = −Nk
n=1

Summing this over k, one can see that


λ = −N
After substituting this into the derivative and then setting it to 0, we obtain the maximum-
likelihood solution for πk , that is
Nk
πk =
N

Exercise 4.10 ??
Consider the classification model of Exercise 4.9 and now suppose that the class-conditional den-
sities are given by Gaussian distributions with a shared covariance matrix, so that

p(φ|Ck ) = N (φ|µk , Σ) (4.160)

Show that the maximum likelihood solution for the mean of the Gaussian distribution for class Ck
is given by
N
1 X
µk = tnk φn (4.161)
Nk n=1
which represents the mean of those feature vectors assigned to class Ck . Similarly, show that the
maximum likelihood solution for the shared covariance matrix is given by
K
X Nk
Σ= Sk (4.162)
k=1
N

where
N
1 X
Sk = tnk (φn − µk )(φk − µk )T (4.163)
Nk n=1

Thus Σ is given by a weighted average of the covariances of the data associated with each class,
in which the weighting coefficients are given by the prior probabilities of the classes.

Proof. Using the same notations as in the last exercise, we remember that the log likelihood is
given by
XN X K  
ln p(T|Φ, π) = tnj ln πj + tnj ln p(φn |Cj )
n=1 j=1

By keeping only the parts depending on µk ,


N K
1 XX
ln p(T|Φ, π) = − tnj (φn − µk )T Σ−1 (φn − µk ) + const
2 n=1 j=1

For a symmetric matrix W, one could show that



(x − s)T W(x − s) = −2W(x − s)
∂s
Therefore, the derivative with respect to µk of the log-likelihood is given by
N
∂ X
ln p(T|Φ, π) = tnk Σ−1 (φn − µk )
∂µk n=1

Since N
P
n=1 tnk = Nk , by setting the derivative to 0 and rearranging the terms, we have that the
solution for maximum likelihood is
N
1 X
µk = tnk φn (4.161)
Nk n=1

Similarly, we do the same for the shared covariance matrix. By keeping only the terms depending
on Σ, the log likelihood is given by
N K
N 1 XX
ln p(T|Φ, π) = − ln |Σ| − tnk (φn − µk )T Σ−1 (φn − µk ) + const
2 2 n=1 k=1
N K
N 1 XX
tnk φTn Σ−1 φn − 2φTn Σ−1 µk + µTk Σ−1 µk

=− ln |Σ| −
2 2 n=1 k=1

By using (C.28) and


∂ T −1
a X b = −X−T abT X−T
∂X
we take the derivative of the log likelihood with respect to Σ and obtain:
N K
∂ N 1 XX
ln p(T|Φ, π) = − Σ−1 + tnk Σ−1 φn φTn − 2φn µTk + µk µTk Σ−1

∂Σ 2 2 n=1 j=k
N K
N 1 XX
= − Σ−1 + tnk Σ−1 (φn − µk )(φn − µk )T Σ−1
2 2 n=1 k=1

 K 
N −1 1 −1 X
=− Σ + Σ Nk Sk Σ−1
2 2 k=1

where Sk is defined by (4.163). Therefore, by setting this derivative to 0 and rearranging the terms,
we obtain the maximum-likelihood solution for the shared covariance matrix
K
X Nk
Σ= Sk (4.162)
k=1
N

Exercise 4.11 ??
Consider a classification problem with K classes for which the feature vector φ has M components
each of which can take L discrete states. Let the values of the components be represented by a 1-
of-L binary coding scheme. Further suppose that, conditioned on the class Ck , the M components
of φ are independent, so that the class-conditional density factorizes with respect to the feature
vector components. Show that the quantities given by (4.63), which appear in the argument to the
softmax function describing the posterior class probabilities, are linear functions of the components
of φ. Note that this represents an example of the naive Bayes model which is discussed in Section
8.2.2.

Proof. We’ve seen in Section 4.2 that the posterior probabilities can be written as normalized
exponentials:
exp(ak )
p(Ck |φ) = P (4.62)
j exp(aj )

where
ak = ln p(φ|Ck )p(Ck ) (4.63)
Considering the setup of our classification problem, our class-conditional distribution will be of
the form

p(φ|Ck) = Π_{i=1}^{M} Π_{j=1}^{L} µkij^{φij}

where the parameters µkij have maximum likelihood estimates of the form (4.161). Therefore, by
replacing into (4.63), the arguments of the softmax function are given by

ak = ln p(Ck) + Σ_{i=1}^{M} Σ_{j=1}^{L} φij ln µkij

and are obviously linear functions of the components of φ.

Exercise 4.12 ?
Verify the relation (4.88) for the derivative of the logistic sigmoid function defined by (4.59).

Proof. By taking the derivative of (4.59), we have that:

∂σ(a)/∂a = ∂/∂a [ 1/(1 + e−a) ] = e−a / (1 + e−a)² = (1 + e−a)/(1 + e−a)² − 1/(1 + e−a)²
         = 1/(1 + e−a) − 1/(1 + e−a)²

We recognize the expression of the logistic sigmoid function, so

∂σ(a)/∂a = σ(a) − σ(a)² = σ(a) ( 1 − σ(a) )                  (4.88)

Exercise 4.13 ?
By making use of the result (4.88) for the derivative of the logistic sigmoid, show that the derivative
of the error function (4.90) for the logistic regression model is given by (4.91).

Proof. The error function for the logistic regression is given by


N
X
E(w) = − ln p(t|w) = − {tn ln yn + (1 − tn ) ln(1 − yn )} (4.90)
n=1

where yn = σ(an ) and an = wT φn . Taking the derivative of the log likelihood function with respect
to w gives
N
X 
∇w ln p(t|w) = ∇w tn ln yn + (1 − tn ) ln(1 − yn )
n=1
N
X  
= tn ∇w ln yn + (1 − tn )∇w ln(1 − yn )
n=1
N  
X tn 1 − tn
= ∇w yn + ∇w (1 − yn )
n=1
yn 1 − yn
N
X tn (1 − yn ) − yn (1 − tn )
= ∇w yn
n=1
yn (1 − yn )
N
X tn − yn
= ∇w yn (4.13.1)
n=1
yn (1 − yn )

Using (4.88), we can compute the gradient term:

∇w yn = ∇w σ(wT φn) = (∂σ/∂an)(∂an/∂w) = σ(an)(1 − σ(an)) φn = yn (1 − yn) φn
so the gradient of the log likelihood is
N
X
∇w ln p(t|w) = (tn − yn )φn
n=1

and the gradient of the error function is then given by
N
X
∇w E(w) = −∇w ln p(t|w) = − (tn − yn )φn (4.91)
n=1
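As a sanity check of (4.91), the analytic gradient can be compared against a finite-difference approximation of the cross-entropy error (4.90). A NumPy sketch on arbitrary synthetic data (none of these values come from the text):

import numpy as np

rng = np.random.default_rng(6)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

N, M = 50, 3
Phi = rng.standard_normal((N, M))               # rows are the feature vectors phi_n
t = rng.integers(0, 2, size=N).astype(float)    # binary targets
w = rng.standard_normal(M)

def E(w):
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))   # eq. (4.90)

grad_analytic = Phi.T @ (sigmoid(Phi @ w) - t)  # eq. (4.91): sum_n (y_n - t_n) phi_n

eps = 1e-6
grad_numeric = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps)
                         for e in np.eye(M)])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # True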

Exercise 4.14 ?
Show that for a linearly separable data set, the maximum likelihood solution for the logistic
regression model is obtained by finding a vector w whose decision boundary wT φ(x) = 0 separates
the classes and then taking the magnitude of w to infinity.

Proof. Suppose there exists w such that the hyperplane wT φ = 0 separates the data set, i.e.
wT φa > 0 for all φa ∈ C1 and wT φb < 0 for all φb ∈ C2 . One can observe that the likelihood is
maximized when p(C1 |φa ) = 1 and p(C2 |φb ) = 1 for all φa ∈ C1 , φb ∈ C2 . Since our hyperplane is
already chosen, there is a fixed angle θn between each φn and w such that cos θn ≠ 0. Therefore, by
using the geometric definition of the dot product

wT φn = ||w|| ||φn|| cos θn

we see that this maximization is achieved by taking the magnitude ||w|| to infinity, as

lim_{||w||→∞} p(C1 |φa ) = lim_{||w||→∞} σ( ||w|| ||φa|| cos θa ) = lim_{||w||→∞} 1 / ( 1 + exp{−||w|| ||φa|| cos θa} ) = 1

since cos θa > 0, and

lim_{||w||→∞} p(C1 |φb ) = lim_{||w||→∞} σ( ||w|| ||φb|| cos θb ) = 0 ,   i.e.   p(C2 |φb ) → 1

since cos θb < 0, where we have used the fact that wT φa > 0 and wT φb < 0.

Exercise 4.15 ??
Show that the Hessian matrix H for the logistic regression model, given by (4.97), is positive
definite. Here R is a diagonal matrix with elements yn (1 − yn ), and yn is the output of the logistic
regression model for input vector xn . Hence show that the error function is a convex function of
w and has a unique minimum.

Proof. The Hessian of the error function is given by


N
X
H = ∇∇E(w) = yn (1 − yn )φn φTn = ΦT RΦ (4.97)
n=1

Let u be a M -dimensional column vector. By using the sum formulation for the hessian matrix,
we have that
N N N
T
X
T
X T X
u Hu = yn (1 − yn )u φn φTn u = yn (1 − yn ) φTn u φTn u = yn (1 − yn )kφTn uk2
n=1 n=1 n=1

which is non-negative since yn is the output of the logistic sigmoid function, so 0 < yn < 1, and is
strictly positive for every u ≠ 0 provided Φ has full column rank (so that not all φnT u vanish).
Because u was chosen arbitrarily, H is positive definite. As a result, the error function is convex
and has a unique minimum.
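A small numerical illustration (a NumPy sketch with arbitrary data): the eigenvalues of H = ΦT RΦ computed at a random weight vector are all strictly positive.

import numpy as np

rng = np.random.default_rng(7)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

N, M = 100, 4
Phi = rng.standard_normal((N, M))
w = rng.standard_normal(M)

y = sigmoid(Phi @ w)
R = np.diag(y * (1 - y))                 # R_nn = y_n (1 - y_n)
H = Phi.T @ R @ Phi                      # eq. (4.97)

print(np.linalg.eigvalsh(H))             # all eigenvalues are strictly positive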

Exercise 4.16 ?
Consider a binary classification problem in which each observation xn is known to belong to one
of two classes, corresponding to t = 0 and t = 1, and suppose that the procedure for collecting
training data is imperfect, so that training points are sometimes mislabelled. For every data point
xn , instead of having a value t for the class label, we have instead a value πn representing the
probability that tn = 1. Given a probabilistic model p(t = 1|φ), write down the log likelihood
function appropriate for such a data set.

Proof. Straight away, we can see that p(t = 0|φ) = 1 − p(t = 1|φ). A natural approach is to weight
the two possible labels of each point by πn and 1 − πn. The likelihood is then given by

p(t|φ) = Π_{n=1}^{N} p(tn|φ) = Π_{n=1}^{N} p(tn = 1|φ)^{πn} p(tn = 0|φ)^{1−πn}
       = Π_{n=1}^{N} p(tn = 1|φ)^{πn} ( 1 − p(tn = 1|φ) )^{1−πn}

which has the log likelihood

ln p(t|φ) = Σ_{n=1}^{N} { πn ln p(tn = 1|φ) + (1 − πn) ln( 1 − p(tn = 1|φ) ) }
n=1

Exercise 4.17 ?
Show that the derivatives of the softmax activation function (4.104) where the ak are defined by
(4.105), are given by (4.106).

Proof. The softmax activation function is given by

exp(ak )
yk = P (4.104)
j exp(aj )

where
ak = wkT φ (4.105)

Taking the derivative of (4.104) with respect to aj and applying the quotient rule gives

∂yk/∂aj = ∂/∂aj [ exp(ak) / Σi exp(ai) ]
        = ( Ikj exp(ak) Σi exp(ai) − exp(ak) exp(aj) ) / ( Σi exp(ai) )²
        = ( exp(ak) / Σi exp(ai) ) ( Ikj − exp(aj) / Σi exp(ai) )

which is equivalent to

∂yk/∂aj = yk ( Ikj − yj )                  (4.106)

Exercise 4.18 ?
Using the result (4.106) for the derivatives of the softmax activation function, show that the
gradients of the cross-entropy error (4.108) are given by (4.109).

Proof. The cross-entropy error is given by


N X
X K
E(w1 , . . . , wK ) = − ln p(T|w1 , . . . , wK ) = − tnk ln ynk (4.108)
n=1 k=1

Taking its derivative with respect to wj yields


 K X
K  N X
K
∂ ∂ X X ∂
E(w1 , . . . , wK ) = − tnk ln ynk =− tnk ln ynk
∂wj ∂wj n=1 k=1 n=1 k=1
∂wj

By using (4.106) and the chain rule, we have that

∂/∂wj ln ynk = (1/ynk) ∂ynk/∂wj = (1/ynk) (∂ynk/∂anj) (∂anj/∂wj) = (1/ynk) ynk (Ikj − ynj) φn
             = (Ikj − ynj) φn

Replacing back into the gradient,


N X K N N XN 
∂ X X X
E(w1 , . . . , wK ) = − tnk (Ikj − ynj )φn = − tnj φn + tnk ynj φn
∂wj n=1 k=1 n=1 n=1 k=1

gives the desired result


N
∂ X
E(w1 , . . . , wK ) = (ynj − tnj )φn (4.109)
∂wj n=1
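As in the two-class case, (4.109) can be checked against finite differences of the cross-entropy error (4.108). A NumPy sketch with arbitrary data and 1-of-K targets (the sizes and random values are illustrative choices only):

import numpy as np

rng = np.random.default_rng(8)

N, M, K = 40, 3, 4
Phi = rng.standard_normal((N, M))
T = np.eye(K)[rng.integers(0, K, size=N)]       # 1-of-K target matrix
W = rng.standard_normal((M, K))                 # columns are w_1, ..., w_K

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)        # numerical stabilisation
    e = np.exp(A)
    return e / e.sum(axis=1, keepdims=True)

def E(W):
    return -np.sum(T * np.log(softmax(Phi @ W)))        # eq. (4.108)

grad_analytic = Phi.T @ (softmax(Phi @ W) - T)  # column j is eq. (4.109)

eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(M):
    for j in range(K):
        dW = np.zeros_like(W); dW[i, j] = eps
        grad_numeric[i, j] = (E(W + dW) - E(W - dW)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # True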

Exercise 4.19 ?
Write down expressions for the gradient of the log likelihood, as well as the corresponding Hessian
matrix, for the probit regression model defined in Section 4.3.5. These are quantities that would
be required to train such a model using IRLS.

Proof. The probit function is given by

Φ(a) = ∫_{−∞}^{a} N(θ|0, 1) dθ                  (4.114)

Therefore, from the fundamental theorem of calculus, we have that

∂Φ(a)/∂a = N(a|0, 1) = (1/√2π) exp(−a²/2)

so

∇w yn = ∇w Φ(an) = (∂Φ/∂an)(∂an/∂w) = (1/√2π) exp(−an²/2) φn
The probit regression model has a very similar form with what we’ve used for the logistic
regression model in Section 4.3.2. More specifically, the log likelihood is still given by (4.90), but
this time with yn = Φ(an ). Therefore, the general form for gradient of the log likelihood derived
in Exercise 4.13, (4.13.1) can be used here too, so:
N N  2
X tn − yn 1 X tn − yn a
∇w ln p(t|w) = ∇w yn = √ exp − n φn
y (1 − yn )
n=1 n
2π n=1 yn (1 − yn ) 2
By taking the gradient of this again, we find the Hessian matrix using the chain rule:
H = ∇w ∇w ln p(t|w)
N   2 
1 X tn − yn a
=√ ∇w exp − n φn
2π n=1 yn (1 − yn ) 2
N    2   2 
1 X tn − yn an tn − yn a
=√ ∇w exp − + ∇w exp − n φn
2π n=1 yn (1 − yn ) 2 yn (1 − yn ) 2
We compute each gradient term separately, so
tn − yn yn (1 − yn ) + (tn − yn )(1 − 2yn ) yn2 − 2tn yn + tn
∇w =− ∇ y
w n = ∇w yn
yn (1 − yn ) yn2 (1 − yn )2 yn2 (1 − yn )2
yn2 − 2tn yn + tn
 2
a
= √ exp − n φn
yn2 (1 − yn )2 2π 2
and  2  2  2
an an a
∇w exp − = −an exp − ∇w an = −an exp − n φn
2 2 2
Hence, the hessian matrix becomes
N 
1 X yn2 − 2tn yn + tn
 2 
an (tn − yn ) a
exp − n φn φn
 2
H= √ √ exp −an φn −
2 2
2π n=1 yn (1 − yn ) 2π yn (1 − yn ) 2

Exercise 4.21 ?
Show that the probit function (4.114) and the erf function (4.115) are related by (4.116).

Proof. The erf function is given by

erf(a) = (2/√π) ∫_{0}^{a} exp(−θ²/2) dθ                  (4.115)

By using the fact that the Gaussian is symmetric around its mean, the probit function can be
rewritten as

Φ(a) = ∫_{−∞}^{a} N(θ|0, 1) dθ = ∫_{−∞}^{0} N(θ|0, 1) dθ + ∫_{0}^{a} N(θ|0, 1) dθ
     = 1/2 + (1/√2π) ∫_{0}^{a} exp(−θ²/2) dθ
     = (1/2) { 1 + (1/√2) erf(a) }                  (4.116)

Exercise 4.22 ?
Using the result (4.135), derive the expression (4.137) for the log model evidence under the Laplace
approximation.

Proof. The proof of this is almost identical to the one in Section 4.4.1. From Bayes’ theorem the
model evidence is given by Z
p(D) = p(D|θ)p(θ) dθ (4.136)

Identifying f (θ) = p(D|θ)p(θ) and Z = p(D), and applying the result (4.135), we obtain the
model evidence under the Laplace approximation:

(2π)M/2
p(D) = p(D|θM AP )p(θM AP )
|A|1/2

where θM AP is the value of θ at the mode of the posterior distribution, and A is the Hessian
matrix of second derivatives of the negative log posterior

A = −∇∇ ln p(D|θM AP )p(θM AP ) = −∇∇ ln p(θM AP |D) (4.138)

Therefore, the log model evidence is given by


M 1
ln p(D) = ln p(D|θM AP ) + ln p(θM AP ) + ln(2π) − ln |A| (4.137)
2 2

Exercise 4.25 ??
Suppose we wish to approximate the logistic sigmoid σ(a) defined by (4.59) by a scaled probit
function Φ(λa) where Φ(a) is defined by (4.114). Show that if λ is chosen so that the derivatives
of the two functions are equal at a = 0, then λ2 = π/8.

Proof. We start by evaluating both functions' derivatives at a = 0. We've seen in Exercise 4.19
that the derivative of the probit function is given by

∂Φ(a)/∂a = (1/√2π) exp(−a²/2)

so

∂Φ(λa)/∂a |_{a=0} = (λ/√2π) exp(−λ²a²/2) |_{a=0} = λ/√2π

From (4.88) we also obtain the derivative of the sigmoid function:

∂σ(a)/∂a |_{a=0} = σ(a)(1 − σ(a)) |_{a=0} = 1/4

Finally, equating the two derivatives at a = 0 gives λ/√2π = 1/4, i.e. λ² = 2π/16 = π/8, as required.
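The quality of the approximation σ(a) ≈ Φ(λa) with λ = √(π/8) is easy to inspect numerically. The Python sketch below uses the standard-library erf (which differs from Bishop's (4.115) by a scaling) only as a way to evaluate the standard normal CDF Φ of (4.114); the grid of points is an arbitrary choice.

import numpy as np
from math import erf, sqrt, pi

lam = sqrt(pi / 8.0)                         # the matching scale derived above
a = np.linspace(-6, 6, 13)

sigmoid = 1.0 / (1.0 + np.exp(-a))
# Phi is the standard normal CDF, i.e. the probit function of (4.114).
probit = np.array([0.5 * (1.0 + erf(lam * ai / sqrt(2.0))) for ai in a])

for ai, s, p in zip(a, sigmoid, probit):
    print(f"a = {ai:5.1f}   sigma(a) = {s:.4f}   Phi(lambda*a) = {p:.4f}")

print("max |difference| =", np.max(np.abs(sigmoid - probit)))   # roughly 0.02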

Chapter 5

Neural Networks

Exercise 5.1 ??
Consider a two-layer network function of the form (5.7) in which the hidden-unit nonlinear acti-
vation functions h(·) are given by logistic sigmoid functions of the form
σ(a) = 1 / (1 + exp(−a)) .                  (5.191)
Show that there exists an equivalent network, which computes exactly the same function, but with
hidden activation functions given by tanh(a) where the tanh function is defined by (5.59). Hint:
first find the relation between σ(a) and tanh(a), and then show that the parameters of the two
networks differ by linear transformations.

Proof. The considered two-layer network has the form

yk(x, w) = σ( Σ_{j=1}^{M} wkj^(2) h( Σ_{i=1}^{D} wji^(1) xi + wj0^(1) ) + wk0^(2) )                  (5.7)

with h given by the logistic sigmoid (5.191). Now, we've proved in Exercise 3.1 that

σ(x) = (1/2) tanh(x/2) + 1/2

Therefore, we can rewrite yk as

yk(x, w) = σ( Σ_{j=1}^{M} (1/2) wkj^(2) tanh( Σ_{i=1}^{D} (1/2) wji^(1) xi + (1/2) wj0^(1) ) + (1/2) Σ_{j=1}^{M} wkj^(2) + wk0^(2) )
         = σ( Σ_{j=1}^{M} ωkj^(2) tanh( Σ_{i=1}^{D} ωji^(1) xi + ωj0^(1) ) + ωk0^(2) )

where

ωji^(1) = (1/2) wji^(1)        ωj0^(1) = (1/2) wj0^(1)        ωkj^(2) = (1/2) wkj^(2)        ωk0^(2) = (1/2) Σ_{j=1}^{M} wkj^(2) + wk0^(2)
Both new parameter sets can be obtained as linear transformations of the old ones, so there exists an
equivalent two-layer network using tanh hidden activation functions, but different parameters.
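This parameter mapping is easy to verify numerically. The NumPy sketch below (arbitrary random weights and input; the layer sizes are illustrative choices) builds a sigmoid-hidden network, converts its weights with the linear transformations above, and checks that the tanh-hidden network produces identical outputs.

import numpy as np

rng = np.random.default_rng(9)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

D, M, K = 3, 5, 2
W1, b1 = rng.standard_normal((M, D)), rng.standard_normal(M)   # first-layer weights and biases
W2, b2 = rng.standard_normal((K, M)), rng.standard_normal(K)   # second-layer weights and biases
x = rng.standard_normal(D)

# Original network: logistic-sigmoid hidden units, eq. (5.7).
y_sigmoid = sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)

# Equivalent network: tanh hidden units with linearly transformed parameters.
V1, c1 = W1 / 2.0, b1 / 2.0
V2, c2 = W2 / 2.0, b2 + 0.5 * W2.sum(axis=1)
y_tanh = sigmoid(V2 @ np.tanh(V1 @ x + c1) + c2)

print(np.allclose(y_sigmoid, y_tanh))   # True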

Exercise 5.2 ?
Show that maximizing the likelihood function under the conditional distribution (5.16) for a mul-
tioutput network is equivalent to minimizing the sum-of-squares error function (5.11).

Proof. The likelihood function is given by


N
Y
p(T|X, w, β) = p(tn |xn , w, β)
n=1

The target variables are assumed to be distributed normally

p(tn |xn , w, β) = N (tn |y(xn , w), β −1 I) (5.16)

and since

ln N(tn | y(xn, w), β−1 I) = (K/2) ln β − (K/2) ln(2π) − (β/2) ||y(xn, w) − tn||²
the negative log-likelihood is given by
N
βX
− ln p(t|X, w, β) = ||y(xn , w) − tn ||2 + const
2 n=1

where we grouped the terms that don’t depend on w under the constant term. Maximization of
the likelihood function is equivalent to minimizing the negative log-likelihood. Therefore, one can
easily find that this is equivalent to minimizing the error function
N
1X
E(w) = ||y(xn , w) − tn ||2 (5.11)
2 n=1

Exercise 5.3 ??
Consider a regression problem involving multiple target variables in which it is assumed that the
distribution of the targets, conditioned on the input vector x, is a Gaussian of the form

p(t|x, w) = N (t|y(x, w), Σ) (5.192)

where y(x, w) is the output of a neural network with input vector x and weight vector w, and
Σ is the covariance of the assumed Gaussian noise on the targets. Given a set of independent
observations of x and t, write down the error function that must be minimized in order to find the
maximum likelihood solution for w, if we assume that Σ is fixed and known. Now assume that Σ
is also to be determined from the data, and write down an expression for the maximum likelihood
solution for Σ. Note that the optimizations of w and Σ are now coupled, in contrast to the case
of independent target variables discussed in Section 5.2.

Proof. The negative log-likelihood is given by
N
X
− ln p(T|X, w) = − ln N (tn |y(xn , w), Σ)
i=1
N
NK N 1X T
y(xn , w) − tn Σ−1 y(xn , w) − tn

= ln(2π) + ln |Σ| +
2 2 2 n=1

Maximizing the likelihood is equivalent to minimizing the negative log-likelihood. Therefore, the
error function that must be minimized to obtain maximum likelihood is given by
N
N 1X T
y(xn , w) − tn Σ−1 y(xn , w) − tn

E(w, Σ) = ln |Σ| +
2 2 n=1

In the case when Σ is known, we can simply treat the determinant term as a constant, so minimizing
the error function
N
1X T
y(xn , w) − tn Σ−1 y(xn , w) − tn

E(w) =
2 n=1
would yield the maximum likelihood solution wML . If Σ is unknown, we can’t do that and the
determination of wML would use Σ, so that’s why this time the optimizations of w and Σ are
coupled. The MLE for the covariance matrix is obtained by taking the derivative of the negative
log-likelihood wrt. Σ−1 , equalizing it to 0 and then solving for Σ. Taking the derivative of the
negative log-likelihood yields
N
∂ ln p(T|X, w) N ∂ 1X ∂ T −1 
= ln |Σ| + y(x n , w) − tn Σ y(xn , w) − tn
∂Σ−1 2 ∂Σ−1 2 n=1 ∂Σ−1
N
N ∂ −1 1X ∂  T −1 
=− ln |Σ | + Tr y(xn , w) − tn Σ y(xn , w) − t n
2 ∂Σ−1 2 n=1 ∂Σ−1
N
N 1X ∂  T  −1
=− Σ+ Tr y(xn , w) − tn y(xn , w) − t n Σ
2 2 n=1 ∂Σ−1
N
N 1X T 
=− Σ+ y(xn , w) − tn y(xn , w) − tn
2 2 n=1

where we’ve used the cyclic property of the trace operator and the fact that

ln |A| = A−T
∂A
Now, setting the derivative to 0 and solving for Σ gives the MLE for the covariance matrix:
$$\Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}\big(y(x_n,\mathbf{w}) - t_n\big)\big(y(x_n,\mathbf{w}) - t_n\big)^{T}$$
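The expression for ΣML can also be checked numerically; a minimal sketch (not part of the original solution, assuming NumPy; the residuals standing in for y(xn, w) − tn are synthetic):

```python
# Check that Sigma_ML attains the smallest negative log-likelihood among a few
# positive-definite alternatives, for fixed residuals r_n = y(x_n, w) - t_n.
import numpy as np
rng = np.random.default_rng(1)

N, K = 200, 3
R = rng.normal(size=(N, K)) @ np.diag([1.0, 2.0, 0.5])   # synthetic residuals

def nll(Sigma):
    _, logdet = np.linalg.slogdet(Sigma)
    inv = np.linalg.inv(Sigma)
    quad = np.einsum('nk,kl,nl->', R, inv, R)            # sum_n r_n^T Sigma^{-1} r_n
    return 0.5 * N * logdet + 0.5 * quad

Sigma_ml = R.T @ R / N
base = nll(Sigma_ml)
for _ in range(5):
    P = rng.normal(size=(K, K)) * 0.05
    assert nll(Sigma_ml + P @ P.T) >= base               # perturbed matrices never do better
print("Sigma_ML attains the smallest NLL among the tested alternatives.")
```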

Exercise 5.4 ??
Consider a binary classification problem where the target values are t ∈ {0, 1}, with a network
output y(x, w) that represents p(t = 1|x), and suppose that there is a probability ε that the class
label on a training data point has been incorrectly set. Assuming independent and identically
distributed data, write down the error function corresponding to the negative log likelihood. Verify
that the error function (5.21) is obtained when ε = 0. Note that this error function makes the
model robust to incorrectly labelled data, in contrast to the usual error function.

Proof. We model the problem similarly to Section 4.2.2, but this time taking into account the
probability ε that a training label has been flipped. Let r ∈ {0, 1} denote the observed (possibly
mislabelled) target, while t denotes the true class label, so that y(x, w) = p(t = 1|x). Weighting
in the error chance gives the observed-label probabilities
$$p(r=1|x_n,\mathbf{w}) = (1-\epsilon)\,p(t=1|x_n,\mathbf{w}) + \epsilon\,p(t=0|x_n,\mathbf{w}) = (1-\epsilon)\,y(x_n,\mathbf{w}) + \epsilon\,\big(1 - y(x_n,\mathbf{w})\big)$$
$$p(r=0|x_n,\mathbf{w}) = (1-\epsilon)\,p(t=0|x_n,\mathbf{w}) + \epsilon\,p(t=1|x_n,\mathbf{w}) = (1-\epsilon)\,\big(1 - y(x_n,\mathbf{w})\big) + \epsilon\,y(x_n,\mathbf{w})$$
We can combine both of these into
$$p(r|x_n,\mathbf{w}) = p(r=1|x_n,\mathbf{w})^{r}\, p(r=0|x_n,\mathbf{w})^{1-r}$$
Therefore, the negative log-likelihood is given by
$$-\ln p(\mathbf{r}|X,\mathbf{w}) = -\ln\prod_{n=1}^{N} p(r_n|x_n,\mathbf{w}) = -\sum_{n=1}^{N}\big\{r_n\ln p(r_n=1|x_n,\mathbf{w}) + (1-r_n)\ln p(r_n=0|x_n,\mathbf{w})\big\}$$
As a result, maximizing the likelihood is equivalent to minimizing the error function
$$E(\mathbf{w}) = -\sum_{n=1}^{N}\Big\{r_n\ln\big[(1-\epsilon)\,y(x_n,\mathbf{w}) + \epsilon\,(1-y(x_n,\mathbf{w}))\big] + (1-r_n)\ln\big[(1-\epsilon)\,(1-y(x_n,\mathbf{w})) + \epsilon\, y(x_n,\mathbf{w})\big]\Big\}$$
which for ε = 0 reduces to (5.21).
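A direct transcription of this error function (assuming NumPy; the array names are illustrative), together with a check that ε = 0 recovers (5.21):

```python
# Noise-robust cross-entropy of the form derived above.
import numpy as np

def robust_cross_entropy(y, t, eps):
    # y: network outputs in (0, 1); t: observed (possibly mislabelled) targets in {0, 1}
    p1 = (1 - eps) * y + eps * (1 - y)      # p(r = 1 | x)
    p0 = (1 - eps) * (1 - y) + eps * y      # p(r = 0 | x)
    return -np.sum(t * np.log(p1) + (1 - t) * np.log(p0))

y = np.array([0.9, 0.2, 0.7])
t = np.array([1.0, 0.0, 1.0])
standard = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))   # error function (5.21)
print(np.isclose(robust_cross_entropy(y, t, 0.0), standard))  # True
```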

Exercise 5.5 ?
Show that maximizing likelihood for a multiclass neural network model in which the network
outputs have the interpretation yk (x, w) = p(tk = 1|x) is equivalent to minimization of the cross-
entropy function (5.24).

Proof. Let’s consider the binary target variables tk ∈ {0, 1} have a 1-of-K coding scheme indicating
the class. If we assume the class labels are independent, given the input vector, the conditional
distribution of the targets is
K
Y
p(tk |x) = p(tk = 1|x)tk
k=1

As a result, the corresponding negative log likelihood is given by
$$-\ln p(\mathbf{T}|X,\mathbf{w}) = -\ln\prod_{n=1}^{N}\prod_{k=1}^{K} p(t_{nk}=1|x_n)^{t_{nk}} = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln p(t_{nk}=1|x_n)$$

Therefore, maximizing the likelihood of the model is equivalent to minimization of the cross entropy
function
XN XK
E(w) = − tnk ln p(tnk = 1|xn ) (5.24)
n=1 k=1

Exercise 5.6 ?
Show the derivative of the error function (5.21) with respect to the activation ak for output units
having a softmax activation function satisfies (5.18).

Proof. The general result for the derivative of the softmax function with respect to the activation
ak was proved in Exercise 4.17 and is given by (4.106). Therefore, for the diagonal terms we have that

∂yk
= yk (1 − yk )
∂ak
Taking the derivative of
N
X
E(w) = − {tn ln yn + (1 − tn ) ln(1 − yn )} (5.21)
n=1

with respect to ak yields


$$\frac{\partial}{\partial a_k}E(\mathbf{w}) = -t_k\frac{\partial}{\partial a_k}\ln y_k - (1-t_k)\frac{\partial}{\partial a_k}\ln(1-y_k) = -t_k(1-y_k) + (1-t_k)\,y_k = y_k - t_k$$
As a result,
∂E
= yk − tk (5.18)
∂ak

Exercise 5.7 ?
Show the derivative of the error function (5.21) with respect to the activation ak for an output
unit having a logistic sigmoid activation function satisfies (5.18).

Proof. We’ve seen in Exercise 4.12 that



σ(a) = σ(a)(1 − σ(a) (4.88)
∂a

Since the output unit has a logistic sigmoid activation function, then
yk = σ(ak )
Therefore, using (4.88) gives
∂yk 
= σ(ak ) 1 − σ(ak ) = yk (1 − yk )
∂ak
Analogously to Exercise 5.6, one can quickly reach that (5.18) holds.

Exercise 5.8 ?
We saw in (4.88) that the derivative of the logistic sigmoid activation function can be expressed in
terms of the function value itself. Derive the corresponding result for the ’tanh’ activation function
defined by (5.59).

Proof. Taking the derivative of the ’tanh’ function is straightforward:


$$\frac{\partial}{\partial a}\tanh(a) = \frac{\partial}{\partial a}\left(\frac{e^{a}-e^{-a}}{e^{a}+e^{-a}}\right) = \frac{\big(e^{a}+e^{-a}\big)^{2} - \big(e^{a}-e^{-a}\big)^{2}}{\big(e^{a}+e^{-a}\big)^{2}} = 1 - \left(\frac{e^{a}-e^{-a}}{e^{a}+e^{-a}}\right)^{2} = 1 - \tanh(a)^{2}$$
Notice that the derivative of the ‘tanh‘ function can also be expressed as a function of itself.
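A finite-difference check of this identity (not part of the original solution; it assumes NumPy):

```python
# Verify numerically that d tanh(a) / da = 1 - tanh(a)^2.
import numpy as np

a = np.linspace(-3, 3, 7)
eps = 1e-6
numeric = (np.tanh(a + eps) - np.tanh(a - eps)) / (2 * eps)
analytic = 1 - np.tanh(a) ** 2
print(np.allclose(numeric, analytic, atol=1e-8))   # True
```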

Exercise 5.9 ?
The error function (5.21) for binary classification problems was derived for a network having a
logistic-sigmoid output activation function, so that 0 ≤ y(x, w) ≤ 1, and data having target values
t ∈ {0, 1}. Derive the corresponding error function if we consider a network having an output
−1 ≤ y(x, w) ≤ 1 and target values t = 1 for class C1 and t = −1 for class C2 . What would be the
appropiate choice of output unit activation function?

Proof. The hyperbolic tangent is the appropiate choice for the ouput unit activation function,
because ‘tanh‘ is a sigmoid function and its values range between −1 and 1. Let’s consider the
case of binary classification in which we interpret the network output y(x, w) as the conditional
probability p(C1 |x), with p(C2 |x) given by 1−y(x, w). The conditional distribution of targets given
inputs is then of the form
1+t  1−t
p(t|x, w) = y(x, w) 2 1 − y(x, w) 2
Taking the negative log-likelihood then yields
N N  
Y X 1+t 1−t 
− ln p(t|X, w) = − ln p(tn |xn , w) = − ln y(x, w) + ln 1 − y(x, w)
n=1 n=1
2 2
As a result, maximizing the likelihood is equivalent to minimizing the error function
N  
X 1+t 1−t
E(w) = − ln yn + ln(1 − yn )
n=1
2 2
where yn denotes y(xn , w).

Exercise 5.10 ?
Consider a Hessian matrix H with eigenvector equation (5.33). By setting the vector v in (5.39)
equal to each of the eigenvectors ui in turn, show that H is positive definite if, and only if, all of
its eigenvalues are positive.

Proof. Consider the eigenvector equation

Hui = λi ui (5.33)

→ Assume that H is positive definite. Then,

uTi Hui = λi ||ui ||2 > 0

which happens only if the eigenvalues λi are positive.

← Suppose that the eigenvalues λi are positive. Since the eigenvectors form an orthonormal
basis, an arbitrary vector v can be written in the form
X
v= ci ui (5.38)
i

Therefore,
X T X  X T  X 
T
v Hv = ci ui H ci ui = ci ui ci λi ui
i i i i
XX X
= λj ci cj uTi uj = λi c2i
i j i

Since the eigenvalues λi are positive,


X
vT Hv = λi c2i > 0
i

for all v, which proves that H is positive definite.

Exercise 5.11 ??
Consider a quadratic error function defined by (5.32), in which the Hessian matrix H has an
eigenvalue equation given by (5.33). Show that the contours of constant error are ellipses whose
axes are aligned with eigenvectors ui with lengths that are inversely proportional to the square
roots of the corresponding eigenvalues λi.

Proof. Analogously to what we’ve seen in Section 5.3.2, we’re going to rewrite
1
E(w) ' E(w? ) + (w − w? )T H(w − w? ) (5.32)
2
as
1X
E(w) ' E(w? ) + λi αi2 (5.36)
2 i
where we’ve expanded (w − w? ) as a linear combination of H’s eigenvectors:
X
w − w? = αi ui (5.35)
i

Now, since w and w? are fixed, let ξ = 2E(w) − 2E(w? ). Therefore, one can rewrite (5.36) as
X X  αi 2
ξ' λi αi2 = −1/2
i i λi

This equation describes an N-dimensional ellipsoid. Since the coordinates αi that define it are
expressed in the orthonormal basis formed by {ui}, its axes are aligned with the eigenvectors ui. The
length of the j-th axis can be obtained by setting αi = 0 for all i ≠ j, so that
 2
αj
ξ' −1/2
λj

and respectively
 1/2
ξ
αj '
λj
Therefore, the lengths of the axes are inversely proportional to the square roots of the corresponding
eigenvalues λi.

Exercise 5.12 ??
By considering the local Taylor expansion (5.32) of an error function about a stationary point w? ,
show that the necessary and sufficient condition for the stationary point to be a local minimum
of the error function is that the Hessian matrix H, defined by (5.30) with w b = w? , be positive
definite.

Proof.

→ Suppose that H is positive definite. From (5.32) one could then find that

E(w) − E(w? ) > 0

for w 6= w? . Therefore, E(w? ) would be the minimum value of E.

← Assume that w? is a local minimum of E. Then,

E(w) − E(w? ) > 0

which would mean that


1
(w − w? )T H(w − w? ) > 0
2
for w 6= w? , i.e. H is positive definite, since w respectively w − w? can be chosen arbitrarily.

Exercise 5.13 ?
Show that as a consequence of the symmetry of the Hessian matrix H, the number of independent
elements in the quadratic error function (5.28) is given by W (W + 3)/2.

Proof. The independent parameters in the quadratic approximation
$$E(\mathbf{w}) \simeq E(\hat{\mathbf{w}}) + (\mathbf{w}-\hat{\mathbf{w}})^{T}\mathbf{b} + \frac{1}{2}(\mathbf{w}-\hat{\mathbf{w}})^{T}\mathbf{H}(\mathbf{w}-\hat{\mathbf{w}}) \qquad (5.28)$$
are given by the terms containing b and H. Since b has W elements and H is a symmetric matrix
with W (W + 1)/2 independent elements (see Exercise 2.21), one has a total of
W (W + 1) W (W + 3)
W+ =
2 2
independent elements, where W is the dimensionality of w.

Exercise 5.14 ?
By making a Taylor expansion, verify that the terms that are O(ε) cancel on the right-hand side
of (5.69).

Proof. Taking the Taylor expansion around wji of the terms on the right hand side of
$$\frac{\partial E_n}{\partial w_{ji}} = \frac{E_n(w_{ji}+\epsilon) - E_n(w_{ji}-\epsilon)}{2\epsilon} + O(\epsilon^2) \qquad (5.69)$$
yields
$$E_n(w_{ji}+\epsilon) \simeq E_n(w_{ji}) + \epsilon E_n'(w_{ji}) + \frac{\epsilon^2}{2}E_n''(w_{ji}) + O(\epsilon^3)$$
$$E_n(w_{ji}-\epsilon) \simeq E_n(w_{ji}) - \epsilon E_n'(w_{ji}) + \frac{\epsilon^2}{2}E_n''(w_{ji}) + O(\epsilon^3)$$
Substituting these results into (5.69) cancels the O(ε) terms and gives
$$\frac{\partial E_n}{\partial w_{ji}} \simeq E_n'(w_{ji}) + O(\epsilon^2)$$
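The accuracy claim can be illustrated numerically; in the sketch below (not part of the original solution, assuming NumPy; the stand-in error function is arbitrary) the central-difference error falls as O(ε²) while a one-sided difference falls only as O(ε):

```python
# Compare central and forward finite differences against the exact derivative.
import numpy as np

E = lambda w: np.sin(3.0 * w) + w ** 3             # stand-in for E_n as a function of one weight
dE = lambda w: 3.0 * np.cos(3.0 * w) + 3 * w ** 2  # its exact derivative
w0 = 0.4

for eps in (1e-1, 1e-2, 1e-3):
    central = (E(w0 + eps) - E(w0 - eps)) / (2 * eps)
    forward = (E(w0 + eps) - E(w0)) / eps
    print(eps, abs(central - dE(w0)), abs(forward - dE(w0)))
# The central-difference error shrinks roughly 100x per decade of eps,
# whereas the forward-difference error shrinks only about 10x.
```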

Exercise 5.15 ??
In Section 5.3.4, we derived a procedure for evaluating the Jacobian matrix of a neural network
using a backpropagation procedure. Derive an alternative formalism for finding the Jacobian based
on forward propagation equations.

Proof. The Jacobian can be obtained by using the forward propagation technique. This is similar
to what we’ve seen in Section 5.3.4, but this time the computations will start from the output end
of the network. We have that
∂yk ∂yk ∂ak
Jki = =
∂xi ∂ak ∂xi
Summing over the j hidden units that have links to k units yields
∂ak X ∂ak ∂aj
=
∂xi j
∂aj ∂xi

From (5.48), it’s obvious that


∂aj
= wji
∂xi
As a result,
∂yk X ∂ak ∂yk X ∂ak ∂zj ∂yk X ∂zj
Jki = wji = wji = wkj wji
∂ak j ∂aj ∂ak j ∂zj ∂aj ∂ak j ∂aj

Suppose that h is the activation function for the output layer and g for the hidden layer.
Then, X
Jki = h0 (ak ) g 0 (aj )wkj wji
j

Since the main steps are computing aj and ak (in this order), the process of evaluating the Jacobian
can be thought of as a forward propagation process.
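The forward-propagation formula can be checked against finite differences; a minimal sketch (not part of the original solution, assuming NumPy; the network sizes and tanh activations are illustrative):

```python
# Check J_ki = h'(a_k) * sum_j g'(a_j) w_kj w_ji for a small random two-layer network.
import numpy as np
rng = np.random.default_rng(2)

D, M, K = 3, 5, 2
W1 = rng.normal(size=(M, D))
W2 = rng.normal(size=(K, M))
g, dg = np.tanh, lambda a: 1 - np.tanh(a) ** 2   # hidden activation and its derivative
h, dh = np.tanh, lambda a: 1 - np.tanh(a) ** 2   # output activation and its derivative

def forward(x):
    aj = W1 @ x
    ak = W2 @ g(aj)
    return aj, ak, h(ak)

x = rng.normal(size=D)
aj, ak, y = forward(x)
J = dh(ak)[:, None] * (W2 * dg(aj)[None, :]) @ W1   # the forward-propagation formula

# Finite-difference Jacobian for comparison
eps, J_fd = 1e-6, np.zeros((K, D))
for i in range(D):
    xp, xm = x.copy(), x.copy()
    xp[i] += eps; xm[i] -= eps
    J_fd[:, i] = (forward(xp)[2] - forward(xm)[2]) / (2 * eps)
print(np.allclose(J, J_fd, atol=1e-5))   # True
```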

Exercise 5.16 ?
The outer product approximation to the Hessian matrix for a neural network using a sum-of-squares
error function is given by (5.84). Extend this result to the case of multiple outputs.

Proof. The sum-of-square error function for multiple outputs is given by


n
1X
E= ||yn − tn ||2
2 n=1

Similarly to Section 5.4.2, our goal is to obtain the outer product approximation for the Hessian
matrix of the error. Hence, computing the Hessian yields:
 X N  XN 
1 2 T
H = ∇∇E = ∇ ∇||yn − tn || = ∇ (yn − tn ) ∇yn
2 n=1 n=1

N
X N
X
= ∇yn ∇ynT + (yn − tn )T ∇∇yn
n=1 n=1

Neglecting the second term yields the outer product approximation for the Hessian matrix:
N
X
H' ∇yn ∇ynT
n=1

which is analogous to (5.84) for bn = ∇yn in the multiple output case. Note that, for simplicity
all the ∇ symbols refer to ∇w

Exercise 5.17 ?
Consider a squared loss function of the form
ZZ
1
E= {y(x, w) − t}2 p(x, t) dx dt (5.193)
2
where y(x, w) is a parametric function such as a neural network. The result (1.89) shows that the
function y(x, w) that minimizes this error is given by the conditional expectation of t given x. Use
this result to show that the second derivative of E with respect to two elements wr and ws of the
vector w, is given by
∂ 2E
Z
∂y ∂y
= p(x) dx (5.194)
∂wr ∂ws ∂wr ∂ws
Note that, for a finite sample from p(x), we obtain (5.84).

Proof. To simplify the notation, we’ll denote y(x, w) as y. Taking the second derivative of E yields
∂ 2E ∂2
 ZZ 
1 2
= {y − t} p(x, t) dx dt
∂ws ∂wr ∂ws ∂wr 2
ZZ
1 ∂ ∂y
= 2(y − t) p(x, t) dx dt
2 ∂w ∂wr
ZZ s
∂ 2y
ZZ
∂y ∂y
= p(x, t) dx dt + (y − t) p(x, t) dx dt
∂ws ∂wr ∂ws ∂wr
Using (1.89), i.e. that y = Et [t|x] minimizes the error, proves that the second integral term is null:
∂ 2y ∂ 2y ∂ 2y
ZZ Z Z Z 
(y − t) p(x, t) dx dt = yp(x) dx − tp(t|x) dt p(x) dx
∂ws ∂wr ∂ws ∂wr ∂ws ∂wr
∂ 2y
Z
= (y − Et [t|x])p(x) dx
∂ws ∂wr
=0
As a result, the second derivative can be written as
∂ 2E
ZZ Z
∂y ∂y ∂y ∂y
= p(x)p(t|x) dx dt = p(x) dx (5.194)
∂ws ∂wr ∂ws ∂wr ∂ws ∂wr

Exercise 5.18 ?
Consider a two-layer network of the form shown in Figure 5.1 with the addition of extra parameters
corresponding to skip-layer connections that go directly from inputs to the outputs. By extending
the discussion of Section 5.3.2, write down the equations for the derivatives of the error function
with respect to these additional parameters.

(s)
Proof. Let the weight corresponding to the skip-layer connections be denoted by wki . The outputs
will gain an extra sum corresponding to those connections:
X (2) X (s)
yk = wkj zj + wki xi
j i

Since δk ’s functional form remains unchanged, the derivatives with respect to the first-layer and
second-layer weights remain the same as before, i.e. (5.67). The derivative with respect to the
skip-layer is now given by
∂En ∂En ∂ak
(s)
= (s)
= δk xi
∂wki ∂ak ∂wki
the same as the one of the second-layer.

Exercise 5.19 ?
Derive the expression (5.85) for the outer product approximation to the Hessian matrix for a
network having a single output with a logistic sigmoid output-unit activation function and a cross-
entropy error function, corresponding to the result (5.84) for the sum-of-squares error function.

Proof. For simplicity, let yn denote y(xn , w). Consider a network with the cross-entropy error
function
XN
E(w) = − {tn ln yn + (1 − tn ) ln(1 − yn )} (5.21)
n=1

and a single output with activation


yn = σ(an )
From (4.88) one has that

∇yn = σ(an )[1 − σ(an )]∇an = yn (1 − yn )∇an

Using the chain rule of differential calculus,


N N N
X ∂E ∂an X ∂E X ∂E
H = ∇∇E(w) = ∇ =∇ ∇an = ∇ ∇an
n=1
∂a n ∂w n=1
∂a n n=1
∂a n

Computing the derivative term is straightforward:


 
∂E tn ∂yn 1 − tn ∂yn yn (1 − tn ) − tn (1 − yn ) ∂yn
=− − = = yn − tn
∂an yn ∂an 1 − yn ∂an yn (1 − yn ) ∂an

Substituting this into the initial expression gives
N
X N
X
∇yn ∇aTn + (yn − tn )∇∇an

H=∇ (yn − tn )∇an =
n=1 n=1

The second term inside the sum contains the term (yn − tn ), so we can neglect it as seen in Section
5.4.2 and arrive at the outer approximation for the Hessian matrix by expanding ∇yn :
N
X
H' yn (1 − yn )∇an ∇aTn
n=1

which is equivalent to (5.85) for bn = ∇an . Note that, for simplicity all the ∇ symbols refer to
∇w

Exercise 5.20 ?
Derive an expression for the outer product approximation to the Hessian matrix for a network hav-
ing K outputs with a softmax output-unit activation function and a cross-entropy error function,
corresponding to the result (5.84) for the sum-of-squares error function.

Proof. We’ll take a similar approach to the previous exercises. For simplicity, let ynk denote
yk (xn , wk ). Consider a network with the cross-entropy error function
N X
X K
E(w) = − tnk ln ynk
n=1 k=1

and K outputs with the softmax activation function

exp{ank }
ynk = PK
j=1 exp{anj }

Note that both yn and an are vectors of size 1 × K. Using the chain rule of calculus yields
N N
X ∂E ∂an X ∂E
H = ∇∇E(w) = ∇ =∇ ∇an
n=1
∂an ∂w n=1
∂an

Now, the derivative term will be of size 1 × K. The value of the i-th element will be given by
  K K
∂E ∂E X ∂ ln ynk X tnk ∂ynk
= =− tnk =−
∂an i ∂ani k=1
∂ani k=1
ynk ∂ani

From Exercise 4.17, respectively (4.106), we have that

∂ynk
= ynk (δik − yni )
∂ani

Therefore,
  K K K
∂E X tnk X X
=− ynk (δik − yni ) = − tnk (δik − yni ) = yni tnk − tni = yni − tni
∂an i k=1
ynk k=1 k=1

Hence, it’s obvious that


∂E
= yn − tn
∂an
Substituting this back into the Hessian,
N
X N
X
 
H=∇ yn − tn ∇an = ∇yn ∇an + (yn − tn )∇∇an }
n=1 n=1

As before, the second term inside the sum contains the term (yn −tn ), so we can neglect it similarly
to what we do in the previous exercises or Section 5.4.2. By ignoring the term, one arrives at the
outer approximation of the Hessian matrix:
N N
X X ∂yn
H' ∇yn ∇an = ∇an ∇an
n=1 n=1
∂an

where the derivative term is a K × K matrix with the elements


 
∂yn ∂yni
= = yni (δij − yni )
∂an ij ∂anj

Note that for notation simplicity, all ∇ symbols refer to ∇w .

Exercise 5.21 ? ? ? TODO


Extend the expression (5.86) for the outer product approximation of the Hessian matrix to the case
of K > 1 output units. Hence, derive a recursive expression analogous to (5.87) for incrementing
the number N of patterns and a similar expression for incrementing the number K of outputs. Use
these results, together with the identity (5.88), to find sequential update expressions analogous to
(5.89) for finding the inverse of the Hessian by incrementally including both extra patterns and
extra outputs.

Proof.

Exercise 5.22 ??
Derive the results (5.93), (5.94), and (5.95) for the elements of the Hessian matrix of a two-layer
feed-forward network by application of the chain rule of calculus.

Proof. As seen in Section 5.4.5, we consider the three separate blocks of the Hessian:

1. Both weights are in the second layer:

∂ 2 En
 
∂ ∂En
(2) (2)
= (2) (2)
∂wkj ∂wk0 j 0 ∂wkj ∂wk0 j 0
 
∂ ∂En ∂ak0
= (2) (2)
∂wkj ∂ak0 ∂wk0 j 0
 
∂ ∂En
= (2)
zj 0
∂wkj ∂ak0
 
∂ ∂En ∂ak
= zj 0 (2)
∂ak0 ∂ak ∂wkj
∂ 2 En
= zj zj 0
∂ak ∂ak0
= zj zj 0 Mkk0 (5.93)

2. Both weights are in the first layer:

∂ 2 En
 
∂ ∂En
(1) (1)
= (1) (1)
∂wji ∂wj 0 i0 ∂wji ∂wj 0 i0
 
∂ ∂En ∂aj 0
= (1)
∂wji ∂aj 0 ∂wj(1)
0 i0

∂ 
= (1)
x i 0 δj 0
∂wji

Using the backpropagation formula (5.56), we have that

∂ 2 En
 X (2) 
∂ 0
(1) (1)
= (1)
xi0 h (aj 0 ) wkj 0 δk
∂wji ∂wj 0 i0 ∂wji k
∂  X (2)  X (2) ∂δk
= xi0 (1) h0 (aj 0 ) wkj 0 δk + xi0 h0 (aj 0 ) wkj 0 (1)
∂wji k k ∂wji

For j 6= j 0 the derivative in the first term is null. As a result, the first term can be written as
∂  X (2) ∂ X (2)
xi 0 (1)
h0 (aj 0 ) wkj 0 δk = Ijj 0 xi0 (1)
h0 (aj 0 ) wkj 0 δk
∂wji k ∂wj 0 i k
X (2)
= Ijj 0 xi xi0 h00 (aj 0 ) wkj 0 δk
k

Now, let’s compute the derivative in the second term:

∂δk X ∂δk ∂ak0 X ∂ 2 En ∂a0 X ∂


X
(2) (1)

k
(1)
= (1)
= 0 (1)
= Mkk0 (1) wk0 j h(xi wji )
∂wji k 0
∂a k0 ∂wji
k 0
∂a k ∂ak ∂wji
k 0 ∂w ji j
X (2)
0
= xi h (aj ) Mkk0 wk0 j
k0

Putting everything together yields the desired result:
∂ 2 En X (2)
XX (2) (2)
(1) (1)
= Ijj 0 xi xi0 h00 (aj 0 ) wkj 0 δk + xi xi0 h0 (aj )h0 (aj 0 ) wkj 0 wk0 j Mkk0 (5.94)
∂wji ∂wj 0 i0 k k k0

Note that this result is equivalent with the one in the book even if the k and k 0 are inter-
changed in the second term. This is because the sum ranges are chosen arbitrarily.

3. One weight in each layer:


∂ 2 En
 
∂ ∂En
(1) (2)
= (1) (2)
∂wji ∂wkj 0 ∂wji ∂wkj 0
 
∂ ∂En ∂ak
= (1) (2)
∂wji ∂ak ∂wkj 0

∂ 
= (1)
δk zj 0
∂wji
∂δk ∂zj 0
= zj 0 (1)
+ δk (1)
∂wji ∂wji

We found the value of the first term in the previous case. Also, the derivative in the second
term is null for j 6= j 0 . Therefore, the above expression becomes
∂ 2 En X (2)
(1) (2)
= zj 0 xi h0 (aj ) Mkk0 wk0 j + Ijj 0 δk xi h0 (aj )
∂wji ∂wkj 0 k0
 X (2) 
0
= xi h (aj ) δk Ijj 0 + zj 0 wk0 j Mkk0 (5.95)
k0

Exercise 5.23 ??
Extend the results of Section 5.4.5 for the exact Hessian of two-layer network to include skip-layer
connections that go directly from input to outputs.

Proof. Let’s denote the skip layer by the (s) superscript. One has 3 cases for the weight combina-
tions that include skip weights:
1. The non-skip activation is in the second layer:
∂ 2 En
 
∂ ∂En
(s) (2)
= (s) (2)
∂wki ∂wk0 j ∂wki ∂wk0 j
 
∂ ∂En ∂ak0
= (s)
∂wki ∂ak0 ∂wk(2)
0j

∂ 
= (s)
δk0 zj
∂wki

∂δk0
= zj (s)
∂wki
Computing the derivative term separately yields
∂δk0 ∂δk0 ∂ak ∂ 2 En
(s)
= (s)
= xi = Mkk0 xi
∂wki ∂ak ∂wki ∂ak ∂ak0

Hence,
∂ 2 En
(s) (2)
= xi zj Mkk0
∂wki ∂wk0 j

2. The non-skip activation is in the first layer:


∂ 2 En
 
∂ ∂En
(s) (1)
= (s) (1)
∂wki ∂wji0 ∂wki ∂wji0
 
∂ ∂En ∂aj
= (s) (1)
∂wki ∂aj ∂wji0

∂ 
= (s)
xi0 δj
∂wki
∂δj
= xi0 (s)
∂wki
Using the back-propagation formula (5.56) gives
∂ 2 En
 X (2) 
∂δj 0
(s) (1)
= xi0 (s) h (aj ) wk0 j δk0
∂wki ∂wji0 ∂wki k0
X (2) ∂δk0
= xi0 h0 (aj ) wk0 j (s)
k0 ∂wki

We’ve already computed the derivative term in the last case. Therefore,
∂ 2 En X (2)
(s) (1)
= xi xi0 h0 (aj ) wk0 j Mkk0
∂wki ∂wji0 k0

3. Both weights are skip weights:


∂ 2 En
 
∂ ∂En
(s) (s)
= (s) (s)
∂wki ∂wk0 i0 ∂wki ∂wk0 i0
 
∂ ∂En ∂ak0
= (s) (s)
∂wki ∂ak0 ∂wk0 i0
∂ 
= (s)
δ k 0 xi 0
∂wki
∂δk0
= xi0 (s)
∂wki

We’ve already computed the derivative term in the first case. As a result,
∂ 2 En
(s) (s)
= xi xi0 Mkk0
∂wki ∂wk0 i0

Exercise 5.24 ??
Verify that the network function defined by (5.113) and (5.114) is invariant under the transforma-
tion (5.115) applied to the inputs, provided the weights and biases are simultaneously transformed
using (5.116) and (5.117). Similarly, show that the network outputs can be transformed according
to (5.118) by applying the transformation (5.119) and (5.120) to the second-layer weights and
biases.

Proof. Let’s make the transformations (5.115), (5.116) and (5.117) and check the new value of aej .
X X1 bX X
aej = w
eji x
ei + w
ej0 = wji axi + b) + wj0 − wji = wji xi + wj0 = aj
i i
a a i i

Since the activations of the hidden layer remain the same after the transformation, we can conclude
that the outputs are invariant under the above transformations. Now, let’s apply the transforma-
tions (5.119) and (5.120) to the second-layer weights and biases. The new outputs look like
X X X 
yek = wekj zj + w
ek0 = cwkj zj + cwk0 + d = c wkj zj + wk0 + d = cyk + d
j j j

which proves that the network outputs can be transformed as in (5.118).

Exercise 5.25 ? ? ?
Consider a quadratic error function of the form
1
E = E0 + (w − w? )T H(w − w? ) (5.195)
2
where w∗ represents the minimum, and the Hessian matrix H is positive definite and constant.
Suppose the initial weight vector w(0) is chosen to be at the origin and is updated using simple
gradient descent
w(τ ) = w(τ −1) − ρ∇E (5.196)
where τ denotes the step number, and ρ is the learning rate (which is assumed to be small). Show
that, after τ steps, the components of the weight vector parallel to the eigenvectors of H can be
written
(τ )
wj = {1 − (1 − ρηj )τ }wj? (5.197)
where wj = wT uj , uj and ηj are the eigenvectors and eigenvalues, respectively, of H so that

Huj = ηj uj (5.198)

Show that as τ → ∞, this gives w(τ ) → w? as expected, provided |1 − ρηj | < 1. Now suppose that
training is halted after a finite number τ of steps. Show that the components of the weight vector
parallel to the eigenvectors of the Hessian satisfy
$$w_j^{(\tau)} \simeq w_j^{\star} \quad \text{when} \quad \eta_j \gg (\rho\tau)^{-1} \qquad (5.199)$$
$$|w_j^{(\tau)}| \ll |w_j^{\star}| \quad \text{when} \quad \eta_j \ll (\rho\tau)^{-1} \qquad (5.200)$$
Compare this result with the discussion in Section 3.5.3 of regularization with simple weight decay,
and hence show that (ρτ )−1 is analogous to the regularization parameter λ. The above results also
show that the effective number of parameters in the network, as defined by (3.91), grows as the
training progresses.

Proof. Taking the gradient of the error with respect to w yields


 
1
∇E = ∇ E0 + (w − w ) H(w − w ) = H(w − w? )
? T ?
2

Substituting this into (5.196) gives

w(τ ) = w(τ −1) − ρH(w(τ −1) − w? )

Now, we left-multiply by uTj both sides of the expression and use the fact that wj = wT uj , along
with (5.198) to obtain that
(τ ) (τ −1) (τ −1) (τ −1)
wj = wj − ρηj wj + ρηj wj? = (1 − ρηj )wj + ρηj wj?

Let’s prove (5.197) by induction. The base case for τ = 1 is obviously holding since w(0) is the
origin:
(1) (0)
wj = (1 − ρηj )wj + ρηj wj? = ρηj wj? = {1 − (1 − ρηj )}wj?
For the general case, let τ = t ∈ N and assume that (5.197) holds:
(t)
wj = {1 − (1 − ρηj )t }wj?
(t)
Substituting the value of wj into
(t+1) (t)
wj = (1 − ρηj )wj + ρηj wj?

gives
(t+1)
wj = (1 − ρηj ){1 − (1 − ρηj )t }wj? + ρηj wj?
= (1 − ρηj ) − (1 − ρηj )t+1 + ρηj wj?


= {1 − (1 − ρηj )(t+1) }wj?

Since the base case and general recursive implication holds, we proved by induction that (5.197)
holds. Now, taking the number of steps τ to infinity yields
(τ )
lim wj = lim {1 − (1 − ρηj )τ }wj? = wj?
τ →∞ τ →∞

for |1 − ρηj| < 1, since in that case (1 − ρηj)^τ → 0 as τ → ∞. Now suppose training is halted after a
finite number τ of steps. When ηj ≫ (ρτ)⁻¹ (and |1 − ρηj| < 1), the factor (1 − ρηj)^τ is negligible, so,
as shown above, wj^(τ) ≃ wj⋆, which is (5.199). When ηj ≪ (ρτ)⁻¹, the product ρτηj is very small, so we
can expand the power and ignore the higher-order terms:
$$|w_j^{(\tau)}| = \big|1-(1-\rho\eta_j)^{\tau}\big|\,|w_j^{\star}| = \Big|1 - \big(1 - \tau\rho\eta_j + O\big((\tau\rho\eta_j)^2\big)\big)\Big|\,|w_j^{\star}| \simeq \tau\rho\eta_j\,|w_j^{\star}| \ll |w_j^{\star}|$$
which is (5.200).

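The closed form (5.197) itself is easy to verify against an explicit gradient-descent run; a minimal sketch (not part of the original solution, assuming NumPy; the Hessian and w⋆ are random and illustrative):

```python
# Compare iterated gradient descent on the quadratic error with the closed form
# w_j^(tau) = {1 - (1 - rho*eta_j)^tau} w_j*, expressed in the eigenbasis of H.
import numpy as np
rng = np.random.default_rng(3)

W = 4
A = rng.normal(size=(W, W)); H = A @ A.T + np.eye(W)   # positive-definite Hessian
w_star = rng.normal(size=W)
eta, U = np.linalg.eigh(H)                              # eigenvalues / eigenvectors of H
rho, tau = 0.01, 200

w = np.zeros(W)                                         # w^(0) at the origin
for _ in range(tau):
    w = w - rho * H @ (w - w_star)                      # simple gradient descent (5.196)

closed_form = (1 - (1 - rho * eta) ** tau) * (U.T @ w_star)
print(np.allclose(U.T @ w, closed_form))                # True
```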
Exercise 5.26 ??
Consider a multilayer perceptron with arbitrary feed-forward topology, which is to be trained by
minimizing the tangent propagation error function in which the regularizing function is given by
(5.128). Show that the regularization term Ω can be written as a sum over patterns of terms of
the form
1X
Ωn = (Gynk )2 (5.201)
2 k
where G is a differential operator defined by
X ∂
G≡ τi (5.202)
i
∂xi

By acting on the forward propagation equations


X
zj = h(aj ), aj = wji zi (5.203)
i

with the operator G, show that Ωn can be evaluated by forward propagation using the following
equations: X
αj = h0 (aj )βj , βj = wji αi (5.204)
i
where we have defined the new variables
αj ≡ Gzj , βj ≡ Gaj (5.205)
Now show that the derivatives of Ωn with respect to a weight wrs in the network can be written
in the form
∂Ωn X
= αk {φkr zs + δkr αs } (5.206)
∂wrs k
where we have defined
∂yk
δkr ≡
, φkr = Gδkr . (5.207)
∂ar
Write down the backpropagation equations for δkr , and hence derive the set of backpropagation
equations for the evaluation of φkr .

Proof. Simply evaluating (5.201) using (5.202) gives
 2  2
1X 2 1 X X ∂ynk 1X X
Ωn = (Gynk ) = τni = τni Jnki
2 k 2 k i
∂xni 2 k i

where Jnki is the (k, i)-th element of the Jacobian for the n-th observation. Summing the above
expression over n yields the regularization function:
 2
X 1X X
Ωn = τni Jnki = Ω (5.128)
n
2 n k

By applying the differential operator G on the forward propagation equations (5.203), one obtains
the same results as (5.204):
X ∂zj X ∂zj ∂aj X ∂aj
αj = Gzj = τi = τi = h0 (aj ) τi = h0 (aj )Gaj = h0 (aj )βj
i
∂x i i
∂a j ∂x i i
∂x i

X ∂ X   X X ∂zi0 X X
βj = Gaj = τi wji0 zi0 = τi wji0 = wji0 Gzi0 = wji0 αi0
i
∂x i
i 0 i i 0
∂x i
i 0 i 0

We notice that we can rewrite Gyk as


X ∂yk
Gyk = τi
i
∂xi
X ∂yk X ∂ak ∂zj
= τi
i
∂ak j ∂zj ∂xi
X X ∂zj
= h0 (ak ) τi wkj
i j
∂xi
X X ∂zj
= h0 (ak ) wkj τi
j i
∂xi
X
= h0 (ak ) wkj Gzj
j
0
= h (ak )βk
= αk
Since αj can be obtained from aj and βj (see (5.204)), this is proof that Ωn can be evaluated by
forward propagation using the equations (5.204) successively. We can see that the derivative of
the differential operator can be written as
 
∂Gf ∂ X ∂f X ∂ ∂f ∂f
= τi = τi =G
∂γ ∂γ i ∂xi i
∂xi ∂γ ∂γ

Therefore, the derivatives of the regularizers Ωn with respect to a weight wrs can be written as
 X  X   X  
∂Ωn ∂ 1 2 ∂Gyk X ∂yk ∂yk ∂ar
= Gyk = Gyk = αk G = αk G
∂wrs ∂wrs 2 k k
∂w rs
k
∂w rs
k
∂ar ∂wrs

 
X ∂yk
= αk G zs
k
∂ar

Applying the product rule on the obtained expression and substituting the variables defined in
(5.207) yields
∂Ωn X X
= αk {zs Gδkr + δkr Gzs } = αk {φkr zs + δkr αs } (5.206)
∂wrs k k

The backpropagation formula for δkr is obtained similarly to the one in Section 5.3.1. Suppose
that the units l come after the units r. Then,
X ∂yk ∂al X ∂yk ∂al ∂zr X
δkr = = = h0 (ar ) wlr δkl
l
∂al ∂ar l
∂al ∂zr ∂ar l

As a result, the backpropagation equations for φkr can be obtained by applying the differential
operator G on this result:

φkr = Gδkr
 X 
0
= G h (ar ) wlr δkl
l
X X
= G h0 (ar ) wlr δkl + h0 (ar ) wlr G(δkl )
l l

Exercise 5.27 ??
Consider the framework for training with transformed data in the special case in which the trans-
formation consists simply of the addition of random noise x → x + ξ where ξ has a Gaussian
distribution with zero mean and unit covariance. By following an argument analogous to that of
Section 5.5.5, show that the resulting regularizer reduces to the Tikhonov form (5.135).

Proof. Using the same arguments as in Section 5.5.5, we have that the regularizer is given by
Z
1 2
Ω= τ T ∇y(x) p(x) dx (5.134)
2
Under our specific transformation, we have that
∂s(x, ξ) ∂ 
τ = = x+ξ =I
∂ξ ξ=0 ∂ξ ξ=0

Hence, substituting back into (5.134) gives the Tikhonov form


Z
1
Ω= k∇y(x)k2 p(x) dx (5.135)
2

Exercise 5.29 ?
Verify the result (5.141). (See PRML errata for the removal of λ factors).

Proof. Taking the derivative of the total error function

E(w)
e = E(w) + Ω(w) (5.139)

with respect to wi yields

∂E
e ∂E ∂Ω
= +
∂wi ∂wi ∂wi
Our goal is to compute the second derivative term. Hence,
 X X 
∂Ω ∂ 2
= − ln πj N (wi |µj , σj )
∂wi ∂wi i j
X −1 X
2 ∂
= πk N (wi |µk , σk ) πj N (wi |µj , σj2 )
k j
∂w i
X −1 X
w i − µj
= πk N (wi |µk , σk2 ) πj 2
N (wi |µj , σj2 )
k j
σ j

X πj N (wi |µj , σj2 ) wi − µj


= 2
σj2
P
j k πk N (wi |µk , σk )
X w i − µj
= γj (wi )
j
σj2

where γj (w) is defined by (5.140). Substituting into the expression above, one obtains

∂E
e ∂E X wi − µj
= + γj (wi ) (5.141)
∂wi ∂wi j
σj2

Exercise 5.30 ?
Verify the result (5.142). (See PRML errata for the removal of λ factors).

Proof. Similarly to the previous exercise, we’re going to compute the derivative of Ω(w) with
respect to µj :
 X X 
∂Ω ∂ 2
= − ln πj N (wi |µj , σj )
∂µj ∂µj i j
X  X −1 
2 ∂ 2
= πk N (wi |µk , σk ) πj N (wi |µj , σj )
i k
∂µ j

X  X −1 X 
µj − w i
= πk N (wi |µk , σk2 ) πj 2
N (wi |µj , σj )
i k j
σj2
X πj N (wi |µj , σj2 ) µj − w i
= 2
σj2
P
i k πk N (wi |µk , σk )
X µj − w i
= γj (wi )
i
σj2

Since E(w) has no dependence on µj the derivative term reduces, so

$$\frac{\partial \widetilde{E}}{\partial \mu_j} = \frac{\partial E}{\partial \mu_j} + \frac{\partial \Omega}{\partial \mu_j} = \frac{\partial \Omega}{\partial \mu_j} = \sum_i \gamma_j(w_i)\,\frac{\mu_j - w_i}{\sigma_j^2} \qquad (5.142)$$

Exercise 5.31 ?
Verify the result (5.143). (See PRML errata for the removal of λ factors).

Proof. Like in the previous exercise, since E(w) does not depend on the distribution of w,

∂E
e ∂E ∂Ω ∂Ω
= + =
∂σj ∂σj ∂σj ∂σj

The derivative term is now given by


 X X 
∂Ω ∂ 2
= − ln πj N (wi |µj , σj )
∂σj ∂σj i j
X  X −1 
2 ∂ 2
= πk N (wi |µk , σk ) πj N (wi |µj , σj )
i k
∂σj
−1 
X  X (wi − µj )2
 
2 1 2
= πk N (wi |µk , σk ) πj − N (wi |µj , σj )
i k
σj σj3
X πj N (wi |µj , σj2 )  1 (wi − µj )2

= 2

σj3
P
i k πk N (wi |µk , σk ) σj

(wi − µj )2
 
X 1
= γj (wi ) −
i
σ j σj3

Therefore,
(wi − µj )2
 
∂E
e X 1
= γj (wi ) − (5.143)
∂σj i
σj σj3

Exercise 5.32 ??
Show that the derivatives of the mixing coefficients {πk }, defined by (5.146), with respect to the
auxiliary parameters {ηj } are given by

∂πk
= δjk πj − πj πk (5.208)
∂ηj
P
Hence, by making use of the constraint k πk = 1, derive the result (5.147).

Proof. The mixing coefficients are given by the softmax function

exp{ηj }
πj = P (5.146)
k exp{ηk }

Therefore, as seen in Exercise 4.17, the derivative of πj is given by

∂πk
= δjk πj − πj πk (5.208)
∂ηj

Since E(w) does not depend on the distribution of w, the derivative of the total error is given by
 X −1 X 
∂E
e ∂Ω X ∂πk
= =− πl N (wi |µl , σl2 ) 2
N (wi |µk , σk )
∂ηj ∂ηj i l k
∂ηj
X  X −1 X 
2 2
=− πl N (wi |µl , σl ) (δkj πk − πk πj )N (wi |µk , σk )
i l k
X  πj N (wi |µj , σj2 ) P
πk N (wi |µk , σk2 )

k
=− P 2
− πj P 2
i l π l N (w i |µ l , σl ) l πl N (wi |µl , σl )
X
= {πj − γj (wi )} (5.147)
i

Exercise 5.33 ?
Write down a pair of equations that express the Cartesian coordinates (x1 , x2 ) for the robot arm
shown in Figure 5.18 in terms of the joint angles θ1 and θ2 and the lengths L1 and L2 of the links.
Assume the origin of the coordinate system is given by the attachment point of the lower arm.
These equations define the ‘forward kinematics‘ of the robot arm.

Proof. Let the origin O be the attachment point of the lower arm and A be the attachment point
of the upper arm. Also, let C be the end effector position and OABC be a parallelogram. One
can easily find the coordinates of the end effector by noticing that
−−→ −→ −→ −−→ −−→ −−→ −−→ → − →

OB = OA + OC = OD + DA + OE + EC = i (OE − OD) + j (DA + EC)

[Figure 5.1: Robot arm geometric interpretation — links of lengths L1 and L2 with joint angles θ1 and θ2; points O, A, B, C with projections D, E onto the x-axis.]

where D, E are the projections of A and C over the x-axis. Since the angles around a parallelogram
sum to 2π, it’s straightforward to show that

∠COA = π − θ2 ∠COE = θ1 + θ2 − π ∠DOA = π − θ1

Hence, (
DA = OA sin(∠DOA) = L1 sin(θ1 )
∆OAD :
OD = OA cos(∠DOA) = −L1 cos(θ1 )
(
EC = OC sin(∠COE) = −L2 sin(θ1 + θ2 )
∆OCE :
OE = OC cos(∠COE) = −L2 cos(θ1 + θ2 )
Finally, the equations that define the ’forward kinematics’ of the arm will be given by
(
x1 = OE − OD = L1 cos(θ1 ) − L2 cos(θ1 + θ2 )
y1 = DA + EC = L1 sin(θ1 ) − L2 sin(θ1 + θ2 )
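These equations can be transcribed directly (not part of the original solution, assuming NumPy; the sample angles and link lengths are arbitrary):

```python
# Forward kinematics of the two-link arm, as derived above.
import numpy as np

def forward_kinematics(theta1, theta2, L1, L2):
    x1 = L1 * np.cos(theta1) - L2 * np.cos(theta1 + theta2)
    x2 = L1 * np.sin(theta1) - L2 * np.sin(theta1 + theta2)
    return x1, x2

print(forward_kinematics(np.pi / 3, np.pi / 2, 1.0, 1.5))   # end-effector coordinates
```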

Exercise 5.34 ?
Derive the result (5.155) for the derivative of the error function with respect to the network output
activations controlling the mixing coefficients in the mixture density network.

Proof. The error function for the network takes the form
N
X K
X 
tn |µk (xn , w), Iσk2 (xn , w)

E(w) = − ln πk (x, w)N (5.153)
n=1 k=1

Taking the derivative of the error function over an individual data point wrt. to the network
output activations controlling the mixing coefficients yields:
XK −1 XK
∂En ∂πl
π
= − π N
j nj N
π nl
∂ak j=1 l=1
∂a k
XK −1 XK

=− πj Nnj δlk πk − πl πk Nnl
j=1 l=1
K
X −1  K
X 
=− πj Nnj πk Nnk − πk πl Nnl
j=1 l=1

= πk − γnk (5.155)

where Nnk denotes N tn |µk (xn , w), Iσk2 (xn , w) and πk , γnk are defined by (5.150) and (5.154).

Exercise 5.35 ?
Derive the result (5.156) for the derivative of the error function with respect to the network output
activations controlling the component means in the mixture density network.

Proof. Similarly to the previous exercise, using (5.152) and then taking the derivative of the
individual error wrt. the component means gives
X K −1 X K
∂En ∂Nni
µ = − πj Nnj πi µ
∂akl j=1 i=1
∂akl
X K −1 X K  
1 ∂ 1 2
=− πj Nnj πi p µ exp − 2
ktn − µi k
(2π) L σ 2L ∂a 2σ
j=1 i=1 i kl i
X K −1  
tnl − µkl 1 1 2
=− πj Nnj πk p exp − 2 ktn − µk k
j=1
σk2 (2π)L σk2L 2σk
X K −1
tnl − µkl
=− πj Nnj 2
πk Nnk
j=1
σ k
 
tnl − µkl
= γnk (5.156)
σk2

Exercise 5.36 ?
Derive the result (5.157) for the derivative of the error function with respect to the network output
activations controlling the component variances in the mixture density network. (See errata for
correct version of (5.157)).

Proof. Analogously to previous exercises, using (5.151) and taking the derivative of the individual
error wrt. the component variances yields
X K −1 X K
∂En ∂Nni
= − π j Nnj π i
∂aσk j=1 i=1
∂aσk
X K −1
∂Nnk
=− πj Nnj πk
j=1
∂aσk
X K −1   −L  
1 ∂σk 1 2
=− πj Nnj πk p exp − 2 ktn − µk k +
j=1 (2π)L ∂aσk 2σk
 
L ∂ 1 2
σk σ exp − 2 ktn − µk k
∂ak 2σk
K −1
ktn − µk k2
X   
=− πj Nnj πk − LNnk + 2
Nnk
j=1
σk

ktn − µk k2
 
= γnk L − (5.157)
σk2

Exercise 5.37 ?
Verify the results (5.158) and (5.160) for the conditional mean and variance of the mixture density
network model.

Proof. The conditional mean result is easily verified by using (5.148) and identifying the expected
value factor:
Z
E[t|x] = tp(t|x) dt
Z K
X
πk (x)N t|µk (x), σk2 (x) dt

= t
k=1
K
X Z
tN t|µk (x), σk2 (x) dt

= πk (x)
k=1
XK
= πk (x)µk (x) (5.158)
k=1

The proof for the variance is a little bit more involved. Note that
 2 
E ktk2 |x = E µk (x) + t − µk (x) |x
  

= E kµk (x)k2 |x + 2E µk (x)|x E t − µk (x)|x + E[kt − µk (x)k2 |x


      

= E kµk (x)k2 |x + Tr E[kt − µk (x)k2 |x


   

T
= E kµk (x)k2 |x + E Tr t − µk (x) t − µk (x)
     
|x

Since the second term vanished and the third term is actually the variance of N µk (x), Iσk2 (x) ,
one has that
K   X K  
 2  X 2  2 2 2
E ktk |x = πk (x) kµk (x)k + Tr Iσk (x) = πk (x) kµk (x)k + M σk (x)
k=1 k=1

where M is the dimensionality of t. Now, the variance of the density function about the conditional
average is given by
s2 (x) = E kt − E[t|x]k2 |x
 

= E ktk2 |x − 2E[t|x]T E[E[t|x]|x + E[kE[t|xk2 |x]


  

XK  
2 2 T 2
= πk (x) M σk (x) + kµk (x)k − 2µk E[t|x] + kE[t|x]k
k=1
XK  K
X 2
2
= πk (x) M σk (x) + µk (x) − πl (x)µl (x) (5.160)
k=1 l=1

Exercise 5.38 ?
Using the general result (2.115), derive the predictive distribution (5.172) for the Laplace approx-
imation to the Bayesian neural network model.

Proof. We’ve seen in Section 5.7.1 that


Z
p(t|x, D) ' p(t|x, w, β)q(w|D) dw (5.168)

where
q(w|D) = N (w|wMAP , A−1 ) (5.167)
and
p(t|x, w, β) ' N (t|y(x, wMAP ) + gT (w − wMAP ), β −1 ) (5.171)
where g is defined by (5.170). Therefore, we can rewrite the marginal distribution p(t) as
Z
p(t|x, D) ' N (t|y(x, wMAP ) + gT (w − wMAP ), β −1 )N (w|wMAP , A−1 ) dw

Now, using (2.115) for the mean of the predictive distribution and setting the parameters to their
MAP value gives
µ(x) = y(x, wMAP ) − gT wMAP + gT w|w=wMAP = y(x, wMAP )
The input-dependent variance is obtained analogously using (2.115):
σ 2 (x) = β −1 + gT A−1 g (5.173)
Finally, the predictive distribution can be written as
p(t|x, D, α, β) ' N (t|y(x, wM AP ), β −1 + gT A−1 g) (5.172)

Exercise 5.39 ?
Make use of the Laplace approximation result (4.135) to show that the evidence function for the
hyperparameters α and β in the Bayesian neural network model can be approximated by (5.175).

Proof. Since the marginal likelihood is given by


Z
p(D|α, β) = p(D|w, β)p(w|α) dw (5.174)

by identifying f (w) = p(D|w, β)p(w|α) and Z = p(D|α, β) and then making use of the Laplace
approximation result (4.135) similarly to Section 4.4.1 gives

1 
ln p(D|α, β) ' ln p(D|wMAP , β) + ln p(wMAP |α) − ln − ∇∇ ln p(D|wMAP , β)p(wMAP |α)
2
W N N
+ ln α + ln β − ln(2π)
2 2 2
where W is the total number of parameters in w. Since

p(wMAP |D, α, β) ∝ p(D|wMAP , β)p(wMAP |α)

we have that
1 W N N
ln p(D|α, β) ' ln p(wMAP |D, α, β) − ln |A| + ln α + ln β − ln(2π)
2 2 2 2
where
A = −∇∇ ln p(w|D, α, β) (5.166)
From (5.165) and and (5.176), one can easily see that

ln p(wMAP |D, α, β) = −E(wMAP )

so
1 W N N
ln p(D|α, β) ' −E(wMAP ) − ln |A| + ln α + ln β − ln(2π) (5.175)
2 2 2 2

Chapter 6

Kernel Methods

Exercise 6.1 ??
Consider the dual formulation of the least squares linear regression problem given in Section
6.1. Show that the solution for the components an of the vector a can be expressed as a linear
combination of the elements of the vector φ(xn ). Denoting these coefficients by the vector w,
show that the dual of the dual formulation is given by the original representation in terms of the
parameter vector w.

Proof. By rewriting (6.4), one has that


1
an = − {wT φ(xn ) − tn }
λ
 M M 
1 X tn X
=− wi φi (xn ) − PM φi (xn )
λ i=1 i=1 φi (xn ) i=1
M  
X tn wi
= PM − φi (xn )
i=1 λ i=1 φ i (x n ) λ
M
X
= Ωni φi (xn )
i=1
= ΩTn φ(xn )
where
tn wi
Ωni = PM −
λ i=1 φi (xn ) λ
Therefore, an can be written as a linear combination of the elements of φ(xn ) and
a = diag(ΩΦ)

Exercise 6.3 ?
The nearest-neighbour classifier (Section 2.5.2) assigns a new input vector x to the same class as
that of the nearest input vector xn from the training set, where in the simple case, the distance

is defined by the Euclidean metric kx − xn k2 . By expressing this rule in terms of scalar product
and then making use of kernel substitution, formulate the nearest-neighbour classifier for a general
nonlinear kernel.

Proof. Since we are dealing with inner products over ℝ, the squared Euclidean distance can be rewritten as
$$\|x - x_n\|^2 = \langle x - x_n,\, x - x_n\rangle = \langle x, x\rangle - 2\langle x, x_n\rangle + \langle x_n, x_n\rangle$$
Similarly to Section 6.2, kernel substitution replaces each scalar product ⟨x, x'⟩ with a nonlinear kernel k(x, x'), giving the distance
$$d(x, x_n)^2 = k(x, x) - 2k(x, x_n) + k(x_n, x_n)$$
so the nearest-neighbour classifier assigns x to the class of the training point xn that minimizes this quantity.

Exercise 6.4 ?
In Appendix C, we give an example of a matrix that has positive elements but that has a negative
eigenvalue and hence that is not positive definite. Find an example of the converse property, namely
a 2 × 2 matrix with positive eigenvalues that has at least one negative element.

Proof. Consider the symmetric matrix
$$A = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}$$
A contains negative elements, yet its eigenvalues are λ1 = 1 and λ2 = 3, both positive, which proves
that a matrix can be positive definite and still have negative elements.
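A one-line numerical confirmation (assuming NumPy):

```python
# Eigenvalues of the matrix above are positive even though it has negative elements.
import numpy as np
A = np.array([[2.0, -1.0], [-1.0, 2.0]])
print(np.linalg.eigvalsh(A))   # [1. 3.]
```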

Exercise 6.5 ?
Verify the results (6.13) and (6.14) for constructing valid kernels.

Proof. Since k1 is a valid kernel, let α be a feature mapping such that


k1 (x, x0 ) = hα(x), α(x0 )i
Using the fact that an inner product on a real vector space is a positive-definite symmetric bilinear
form, we have that
√ √
ck1 (x, x0 ) = chα(x), α(x0 )i = h cα(x), cα(x0 )i = hβ(x), β(x0 )i

where c > 0 is a constant and β(x) = cα(x). Therefore, the new kernel
k(x, x0 ) = ck1 (x, x0 ) (6.13)
is valid. Analogously, since f (·) is a real-valued function,
f (x)k1 (x, x0 )f (x0 ) = f (x)hα(x), α(x0 )if (x0 ) = hf (x)α(x), f (x0 )α(x0 )i = hγ(x), γ(x0 )i
where γ(x) = f (x)α(x). As a result, the kernel
k(x, x0 ) = f (x)k1 (x, x0 )f (x0 ) (6.14)
will also be valid.

Exercise 6.6 ?
Verify the results (6.15) and (6.16) for constructing valid kernels.

Proof. Let q(·) be a polynomial with nonnegative coefficients. Evaluating the polynomial at k1(x, x')
only sums kernels and multiplies them by nonnegative constants or by other kernels, so combining
(6.13), (6.17) and (6.18) proves that the kernel
k(x, x0 ) = q k1 (x, x0 )

(6.15)
is valid. Now, the exponential function is defined as

$$\exp(x) := \sum_{i=0}^{\infty} \frac{x^i}{i!}$$
so
$$\exp\big(k_1(x,x')\big) = \sum_{i=0}^{\infty} \frac{k_1(x,x')^i}{i!}$$
Note that the exponential of a kernel is an infinite sequence of kernel sums and products (with
itself or nonnegative constants), so by using (6.13), (6.17), (6.18) again, one has that the new
kernel
k(x, x0 ) = exp k1 (x, x0 )

(6.16)
is valid.

Exercise 6.7 ?
Verify the results (6.17) and (6.18) for constructing valid kernels.

Proof. Let K1 and K2 be the Gram matrices corresponding to the kernels k1 and k2. They are
positive semidefinite, so for any a ∈ ℝⁿ one has that
$$\mathbf{a}^{T}\mathbf{K}\mathbf{a} = \mathbf{a}^{T}(\mathbf{K}_1 + \mathbf{K}_2)\mathbf{a} = \mathbf{a}^{T}\mathbf{K}_1\mathbf{a} + \mathbf{a}^{T}\mathbf{K}_2\mathbf{a} \geq 0$$
Since K = K1 + K2 is positive semidefinite and is the Gram matrix of the kernel
k(x, x') = k1(x, x') + k2(x, x'), one has that the kernel

k(x, x0 ) = k1 (x, x0 ) + k2 (x, x0 ) (6.17)

is valid. Now, let α, β be feature mappings such that

k1 (x, x0 ) = hα(x), α(x0 )i

k2 (x, x0 ) = hβ(x), β(x0 )i


As a result,

k1 (x, x0 )k2 (x, x0 ) = hα(x), α(x0 )ihβ(x), β(x0 )i


= α(x)T α(x0 )β T (x)β(x0 )

N
X M
 X 
0 0
= αi (x)αi (x ) βi (x)βi (x )
i=1 j=1
N X
X M
= αi (x)βj (x)αi (x0 )βj (x0 ) (*)
i=1 j=1
N X
X M
= Aij (x)Aij (x0 )
i=1 j=1

= hA(x), A(x0 )iF

where A is a matrix with


Aij (x) = αi (x)βj (x)
and h·, ·iF is the Frobenius inner product. Since the product kernel can be rewritten as a valid
inner product in the feature space defined by the feature mapping A(x), the new kernel

k(x, x0 ) = k1 (x, x0 )k2 (x, x0 ) (6.18)

is valid. Note that we can continue differently from (*), so


K
X
0 0
k1 (x, x )k2 (x, x ) = φk (x)φk (x0 ) = hφ(x), φ(x0 )i
k=1

where K = N M and
$$\phi_k(x) = \alpha_{\lfloor (k-1)/M\rfloor + 1}(x)\,\beta_{((k-1)\bmod M) + 1}(x)$$
where ⌊·⌋ and mod denote integer division and remainder, respectively.

Exercise 6.8 ?
Verify the results (6.19) and (6.20) for constructing valid kernels.

Proof. Let ψ be a feature mapping such that

k3 (x, x0 ) = hψ(x), ψ(x0 )i

Then,

k3 φ(x), φ(x0 ) = hψ φ(x) , ψ φ(x0 ) i


  

= h ψ ◦ φ (x), ψ ◦ φ (x0 )i
 

= hγ(x), γ(x0 )i

where φ is a function from x to RM and γ = ψ ◦ φ. Therefore, the kernel

k(x, x0 ) = k3 φ(x), φ(x0 )



(6.19)

is valid. For the second part, since A is a symmetric, positive semidefinite matrix, one can use the
Cholesky decomposition to obtain a matrix L such that

A = LLT

As a result, one can show that
T
xT Ax = xT LLT x = LT x LT x = hζ(x), ζ(x0 )i


where ζ(x) = LT x. Hence, the kernel


k(x, x0 ) = xT Ax (6.20)
is valid.

Exercise 6.9 ?
Verify the results (6.21) and (6.22) for constructing valid kernels.

Proof. Let φa and φb be feature mappings so that


ka (x, x0 ) = hφa (x), φa (x0 )i
kb (x, x0 ) = hφb (x), φb (x0 )i
Therefore, since the inner product becomes a bilinear form on R,
ka (xa , x0 a ) + kb (xb , x0b ) = hφa (xa ), φa (x0a )i + hφb (xb ), φb (x0b )i
= h φa (xa ), φa (x0a ) , φb (xb ), φb (x0b ) i
 

= hφ(x), φ(x0 )i
where  
φa (xa )
φ(x) =
φb (xb )
Hence, the kernel
k(x, x0 ) = ka (xa , x0a ) + kb (xb , x0b ) (6.21)
is valid. The product identity is obtained similarly to what we do in Exercise 6.7. One has that
ka (xa , x0a )kb (xb , x0b ) = hφa (xa ), φa (x0a )ihφb (xb ), φb (x0b )i
XNa  XNb 
0 0
= φai (xa )φai (xa ) φbj (xb )φbj (xb )
i=1 j=1
Nb
Na X
X
= φai (xa )φbj (xb )φai (x0a )φbj (x0b )
i=1 j=1
Nb
Na X
X
= Aij (x)Aij (x0 )
i=1 j=1

= hA(x), A(x0 )iF


where h·, ·iF is the Frobenius inner product, φai (x) is the i-th element of φa (x) and
Aij (x) = φai (xa )φbj (xb )
Therefore, the new kernel
k(x, x0 ) = ka (xa , x0a )kb (xb , x0b ) (6.22)
will also be valid.

Exercise 6.10 ?
Show that an excellent choice of kernel for learning a function f (x) is given by k(x, x0 ) = f (x)f (x0 )
by showing that a linear learning machine based on this kernel will always find a solution propor-
tional to f (x).

Proof. By substituting the kernel and (6.8) into (6.9), one has that
N
X N
X 
T −1 T
y(x) = k(x) (K + λIN ) t = k(x) a = k(x, xn )an = f (x) f (xn )an
n=1 n=1

which shows that the prediction function will always be proportional to f (x).

Exercise 6.11 ?
By making use of the expansion (6.25), and then expanding the middle factor as a power se-
ries, show that the Gaussian kernel (6.23) can be expressed as the inner product of an infinite-
dimensional feature vector.

Proof. We’ve seen in Section 6.2 that the Gaussian kernel can be expanded as
( ) ( )
2  0
 0 2
kxk hx, x i kx k
k(x, x0 ) = exp − 2 exp exp − (6.25)
2σ σ2 2σ 2

In Exercise 6.7 we proved that if α, β are feature maps, there exists a feature map ψ such that

hα(x), α(x0 )ihβ(x), β(x0 )i = hψ(x), ψ(x0 )i

Therefore, one can prove using induction that there exists a feature map ζ such that for n ∈ N,

hα(x), α(x0 )in = hζ(x), ζ(x0 )i

Now, using the definition of the exponential function for the middle term gives
∞ ∞ ∞  r r
hx, x0 i
  X 
1 0 i
X 1 0
X 1 1 1 1 0
exp = hx, x i = hΨi (x), Ψi (x )i = Ψi (x), Ψi (x )
σ2 i=0
i!σ 2i i=0
i!σ 2i i=0
σ i! σ i!

where Ψi are feature maps such that

hx, x0 ii = hΨi (x), Ψi (x0 )i

Substituting this result back into (6.25) yields


( ) ( ) ∞  r
2 0 2
r 
0 kxk kx k X 1 1 1 1 0
k(x, x ) = exp − 2 exp − Ψi (x), Ψi (x )
2σ 2σ 2 i=0
σ i! σ i!

( ) ( ) r
2 0 2
r 
X kxk kx k 1 1 1 1 0
= exp − 2 exp − 2
Ψi (x), Ψi (x )
i=0
2σ 2σ σ i! σ i!

∞ 
( ) r ( )
kxk2 0 2
r
X 1 1 1 1 kx k
= Ψi (x) exp − 2 , Ψi (x0 ) exp −
i=0
σ i! 2σ σ i! 2σ 2
X∞
= φi (x)φi (x0 )
i=0
= hφ(x), φ(x0 )i

where φ(x) is a feature vector of infinite dimensionality with


$$\phi_i(x) = \frac{1}{\sigma^i}\sqrt{\frac{1}{i!}}\,\Psi_i(x)\exp\left(-\frac{\|x\|^2}{2\sigma^2}\right)$$
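Truncating the power series of the middle factor gives a finite-dimensional approximation whose accuracy improves with the number of terms; a small numerical illustration (not part of the original solution, assuming NumPy; the inputs and σ are arbitrary):

```python
# The truncated series for exp(<x, x'>/sigma^2), multiplied by the two prefactors,
# converges to the Gaussian kernel as the number of terms grows.
import numpy as np
from math import factorial

rng = np.random.default_rng(4)
x, xp, sigma = rng.normal(size=3), rng.normal(size=3), 1.0

gauss = np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))
prefac = np.exp(-x @ x / (2 * sigma ** 2)) * np.exp(-xp @ xp / (2 * sigma ** 2))
for n_terms in (2, 5, 10, 20):
    series = sum((x @ xp) ** i / (factorial(i) * sigma ** (2 * i)) for i in range(n_terms))
    print(n_terms, abs(prefac * series - gauss))   # the error shrinks toward 0
```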

Exercise 6.12 ??
Consider the space of all possible subsets A of a given fixed set D. Show that the kernel function
(6.27) corresponds to an inner product in a feature space of dimensionality 2|D| defined by the
mapping φ(A) where A is a subset of D and the element φU (A), indexed by the subset U , is given
by (
1, if U ⊆ A
φU (A) = (6.95)
0, otherwise
Here U ⊆ A denotes that U is either a subset of A or is equal to A.

Proof. Using simple combinatorics, one can easily show that the number of subsets of a given fixed
set D is given by 2|D| . Therefore, φ(A) will be of dimensionality 2|D| . Since the element φU (A) is
1 if U ⊆ A and 0 otherwise, the result of the inner product hφ(A1 ), φ(A2 )i will give the number of
subsets of D contained by both A1 and A2 . However, since A1 , A2 ⊆ D this can also be expressed
by counting the number of subsets of A1 ∩ A2 . This is done by the kernel

k(A1 , A2 ) = 2|A1 ∩A2 | (6.27)

Hence, the kernel can be written as an inner product in the space defined by the mapping φ(A)
since
k(A1 , A2 ) = 2|A1 ∩A2 | = hφ(A1 ), φ(A2 )i

Exercise 6.13 ? TODO


Show that the Fisher kernel, defined by (6.33), remains invariant if we make a nonlinear transfor-
mation of the parameter vector θ → ψ(θ), where the function ψ(·) is invertible and differentiable.

Proof. The Fisher kernel is defined by

k(x, x0 ) = g(θ, x)T F−1 g(θ, x0 ) (6.33)

where
g(θ, x) = ∇θ ln p(x|θ) (6.32)
is the Fisher score and F is the Fisher information matrix, given by

F = Ex g(θ, x)g(θ, x)T


 
(6.34)

Exercise 6.14 ?
Write down the form of the Fisher kernel, defined by (6.33), for the case of a distribution p(x|µ) =
N (x|µ, S) that is Gaussian with mean µ and fixed covariance S.

Proof. We start by evaluating the Fisher score using (6.32):

g(µ, x) = ∇µ ln p(x|µ)
= ∇µ ln N (x|µ, S)
  
1 1 T −1
= ∇µ ln exp − (x − µ) S (x − µ)
(2π)k/2 |S|1/2 2
 
1 −1
= ∇µ − (x − µ)S (x − µ)
2
−1
= S (x − µ)

Now, the information matrix will be given by (6.34):

F = Ex g(µ, x)g(µ, x)T = Ex S−1 (x − µ)(x − µ)T S−1 = S−1 Ex (x − µ)(x − µ)T S−1
     

Since the expectation corresponds to the covariance matrix, we have that

F = S−1

Finally, the Fisher kernel can be obtained by substituting the obtained values into (6.33):

$$k(x, x') = g(\mu, x)^{T}\mathbf{F}^{-1}g(\mu, x') = (x-\mu)^{T}\mathbf{S}^{-1}\mathbf{S}\mathbf{S}^{-1}(x'-\mu) = (x-\mu)^{T}\mathbf{S}^{-1}(x'-\mu)$$
which is a Mahalanobis-type inner product between the centred inputs (for x = x' it reduces to the
squared Mahalanobis distance).
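A quick numerical confirmation (not part of the original solution, assuming NumPy; the dimensionality and parameter values are arbitrary):

```python
# Fisher kernel for N(x | mu, S) with fixed S: score = S^{-1}(x - mu), F = S^{-1},
# so k(x, x') = g(x)^T F^{-1} g(x') = (x - mu)^T S^{-1} (x' - mu).
import numpy as np
rng = np.random.default_rng(5)

d = 3
mu = rng.normal(size=d)
L = rng.normal(size=(d, d)); S = L @ L.T + np.eye(d)     # fixed covariance
S_inv = np.linalg.inv(S)

x, xp = rng.normal(size=d), rng.normal(size=d)
g_x, g_xp = S_inv @ (x - mu), S_inv @ (xp - mu)          # Fisher scores
kernel = g_x @ S @ g_xp                                  # g(x)^T F^{-1} g(x') with F = S^{-1}
print(np.isclose(kernel, (x - mu) @ S_inv @ (xp - mu)))  # True
```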

Exercise 6.15 ?
By considering the determinant of a 2×2 Gram matrix, show that a positive definite kernel function
k(x, x0 ) satisfies the Cauchy-Schwartz inequality

k(x1 , x2 )2 ≤ k(x1 , x1 )k(x2 , x2 ) (6.96)

Proof. Consider the 2 × 2 Gram matrix corresponding to the kernel k:
 
k(x1 , x1 ) k(x1 , x2 )
K=
k(x2 , x1 ) k(x2 , x2 )
Since the kernel function is symmetric, the determinant of K is given by
|K| = k(x1 , x1 )k(x2 , x2 ) − k(x1 , x2 )2
Now, since the Gram matrix K is positive definite, its eigenvalues are positive. Since the determi-
nant of a matrix is given by the product of its eigenvalues, then the determinant of K must then
be positive. Hence,
k(x1 , x2 )2 ≤ k(x1 , x1 )k(x2 , x2 ) (6.96)

Exercise 6.16 ??
Consider a parametric model governed by the parameter vector w together with a data set of input
values x1 . . . , xN and a nonlinear feature mapping φ(x). Suppose that the dependence of the error
function on w takes the form
J(w) = f wT φ(x1 ), . . . , wT φ(xN ) + g(wT w)

(6.97)
where g(·) is a monotonically increasing function. By writing w in the form
N
X
w= αn φ(xn ) + w⊥ (6.98)
n=1

show that the value of w that minimizes J(w) takes the form of a linear combination of the basis
functions φ(xn ) for n = 1, . . . , N .

Proof. Taking the derivative of (6.97) with respect to w yields:


d ∂ ∂
f wT φ(x1 ), . . . , wT φ(xN ) + g(wT w)

J(w) =
dw ∂w ∂w
N T
X ∂f ∂w φ(xn ) ∂g ∂wT w
= +
i=1
∂wT φ(xn ) ∂w ∂wT w ∂w
N
X ∂f ∂g
= φ(xn )T + 2wT
i=1
∂wT φ(xn ) ∂wT w

Since the error function is convex, it is minimized when the derivative is 0, so when
 −1 X N N
1 ∂g ∂f X
w=− φ(xn ) = αn φ(xn )
2 ∂wT w i=1
∂wT φ(xn ) i=1

with  −1
1 ∂g ∂f
αn = − T T
2 ∂w w ∂w φ(xn )
which is equivalent to (6.98) for w⊥ = 0.

Exercise 6.18 ?
Consider a Nadaraya-Watson model with one input variable x and one target variable t having
Gaussian components with isotropic covariances, so that the covariance matrix is given by σ 2 I
where I is the unit matrix. Write down expressions for the conditional density p(t|x) and for the
conditional mean E[t|x] and variance var[t|x], in terms of the kernel function k(x, xn ).

Proof. To simplify the notation, let  


x
z=
t
Since the model has Gaussian components with isotropic covariances,

f (x, t) = f (z) = N (z|0, σ 2 I)

and therefore,

f (x − xn , t − tn ) = f (z − zn ) = N (z − zn |0, σ 2 I) = N (z|zn , σ 2 I)

To aid computation, we note that

f (z − zn ) = N (z|zn , σ 2 I)
 
1 1 2
= exp − 2 ||z − zn ||
2πσ 2 2σ
(x − xn )2 + (t − tn )2
 
1
= exp −
2πσ 2 2σ 2
(x − xn )2 (t − tn )2
   
1 1
=√ exp − √ exp −
2πσ 2 2σ 2 2πσ 2 2σ 2
= N (x|xn , σ 2 )N (t|tn , σ 2 )

and
Z Z Z
2 2 2
f (z − zn ) dt = N (x|xn , σ )N (t|tn , σ ) dt = N (x|xn , σ ) N (t|tn , σ 2 ) dt = N (x|xn , σ 2 )

By the Nadaraya-Watson model, from (6.42) we have that


N N N
1 X 1 X 1 X
p(x, t) = f (x − xn , t − tn ) = f (z − zn ) = N (x|xn , σ 2 )N (t|tn , σ 2 )
N n=1 N n=1 N n=1

The conditional probability is now given by


N
X
f (z − zn ) N
p(t, x) n=1
X N (x|xn , σ 2 )N (t|tn , σ 2 )
p(t|x) = Z = N Z
= N
p(t, x) dt n=1
X X
f (z − zm ) dt N (x|xm , σ 2 )
m=1 m=1
N
X
= k(x, xn )N (t|tn , σ 2 )
n=1

where
N (x|xn , σ 2 )
k(x, xn ) = N
X
N (x|xm , σ 2 )
m=1
is the kernel function corresponding to the model. The conditional expectation is obtained as
Z XN Z N
X
2
E[t|x] = tp(t|x) dt = tk(x, xn )N (t|tn , σ ) dt = k(x, xn )tn
n=1 n=1

The variance is then given by


$$\begin{aligned}
\mathrm{var}[t|x] &= \int (t - \mathbb{E}[t|x])^{2}\, p(t|x)\,\mathrm{d}t \\
&= \int t^{2} p(t|x)\,\mathrm{d}t - 2\,\mathbb{E}[t|x]\int t\, p(t|x)\,\mathrm{d}t + \mathbb{E}[t|x]^{2}\int p(t|x)\,\mathrm{d}t \\
&= \sum_{n=1}^{N} k(x, x_n)\int t^{2}\,\mathcal{N}(t|t_n,\sigma^{2})\,\mathrm{d}t - \mathbb{E}[t|x]^{2} \\
&= \sum_{n=1}^{N} k(x, x_n)\,\big(t_n^{2} + \sigma^{2}\big) - \left(\sum_{n=1}^{N} k(x, x_n)\, t_n\right)^{2}
\end{aligned}$$
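A small numerical sketch of these expressions (not part of the original solution, assuming NumPy; the training data are synthetic and illustrative):

```python
# Nadaraya-Watson conditional mean and variance with Gaussian components.
import numpy as np
rng = np.random.default_rng(6)

x_train = rng.uniform(-3, 3, size=20)
t_train = np.sin(x_train) + 0.1 * rng.normal(size=20)
sigma = 0.5

def kernel_weights(x):
    w = np.exp(-0.5 * (x - x_train) ** 2 / sigma ** 2)   # N(x | x_n, sigma^2) up to a constant
    return w / w.sum()                                   # the kernel k(x, x_n)

x = 0.7
k = kernel_weights(x)
mean = k @ t_train                                       # E[t | x]
var = k @ (t_train ** 2 + sigma ** 2) - mean ** 2        # var[t | x]
print(mean, var)
```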

Exercise 6.20 ??
Verify the results (6.66) and (6.67).

Proof. By considering the same setup as the one described in Section 6.4.2, e.g. splitting the
covariance matrix CN +1 into (6.65), one has that
p(tN ) = N (tN |0, CN )
p(tN +1 ) = N (tN +1 |0, c)
cov[tN , tN +1 ] = k
where k is a N -dimensional vector with
ki = k(xi , xN +1 )
and
c = k(xN +1 , xN +1 ) + β −1
Simply matching our variables with the general formulas (2.81), (2.82) for the mean and covariance
of the conditional distribution p(tN +1 |tN ) gives the desired results
m(xN +1 ) = E tN +1 |tN , xN +1 = kT C−1
 
N tN (6.66)

σ 2 (xN +1 ) = var tN +1 |tN , xN +1 = c − kT C−1


 
N k (6.67)

Exercise 6.21 ??
Consider a Gaussian process regression model in which the kernel function is defined in terms of
a fixed set of nonlinear basis functions. Show that the predictive distribution is identical to the
result (3.58) obtained in Section 3.3.2 for linear regression. Note that both predictive distributions
are Gaussian, so it is only necessary to show that the conditional mean and variance are the same. For the mean, make use of the
matrix identity (C.6), and for the variance, make use of the matrix identity (C.7).

Proof. Let the kernel function of our Gaussian process regression model be defined by
1
k(xn , xm ) = φ(xn )T φ(xm )
α
where φ are nonlinear basis functions and α represents the precision of p(w). As stated above,
our goal is to show that the mean m(xN +1 ) and covariance σ 2 (xN +1 ) are equivalent to the ones
obtained in Section 3.3.2 for the predictive distribution given by (3.58). Notice that both k and
CN can be rewritten as forms depending on the basis functions. Since
1
ki = k(xi , xN +1 ) = φ(xi )T φ(xN +1 )
α
then k = α−1 Φφ(xN +1 ), where Φ is the design matrix. Now, the elements of CN can be written
as
1 1
CN nm = CN (xN , xM ) = k(xN , xM ) + β −1 Inm = Knm + β −1 Inm = ΦΦT nm + Inm

α β
where K is the Gram matrix given by (6.54), so
1 1
CN = ΦΦT + I
α β
The mean of the predictive distribution in the Gaussian process model is given by (6.66) and can
be rewritten as
m(xN +1 ) = kT C−1
N t
 −1
1 T T 1 T 1
= φ (xN +1 ) Φ ΦΦ + I t
α α β
 −1
T T T α
= φ(xN +1 ) Φ ΦΦ + I t
β
−1
= βφ(xN +1 )T ΦT βΦΦT + αI t
−1
= βφ(xN +1 )T ΦT βΦT Φ + αI t
= βφ(xN +1 )T ΦT SN t
where SN is given by (3.54) and we’ve used (C.6) to obtain that
−1 −1
ΦT βΦΦT + αI = ΦT βΦT Φ + αI
Starting from (6.67), the variance can be obtained as
σ 2 (xN +1 ) = c − kT C−1
N k

1 β
= φ(xN +1 )T ΦT SN Φφ(xN +1 )
+ k(xN +1 , xN +1 ) −
β α
1
1 β
= + φ(xN +1 )T φ(xN +1 ) − φ(xN +1 )T ΦT SN Φφ(xN +1 )
β
α α
 
1 T 1 β T
= + φ(xN +1 ) I − Φ SN Φ φ(xN +1 )
β α α
 
1 T 1 β T T
−1
= + φ(xN +1 ) I − Φ αI + βΦ Φ Φ φ(xN +1 )
β α α
 
1 T 1 β T T
−1
= + φ(xN +1 ) I − Φ αI + βΦ Φ Φ φ(xN +1 )
β α α
  −1 
1 T 1 1 T 1 1 T
= + φ(xN +1 ) I − 2Φ I+ Φ Φ Φ φ(xN +1 )
β α α β α

Using the Woodbury identity (C.7) combined with (3.54), one could show that
−1
SN = αI + βΦT Φ
 −1
1 1 T 1 1 T
= I − 2Φ I+ Φ Φ Φ
α α β α

Hence, the variance is the same as in the result (3.58):


1
σ 2 (xN +1 ) = + φ(xN +1 )T SN φ(xN +1 )
β
Therefore, since both our predictive distributions are Gaussian and have the same mean and
variance, they are identical.
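The equivalence can also be observed numerically; the following sketch (not part of the original solution, assuming NumPy; the polynomial basis, data set and sizes are illustrative, not from the text) computes both predictive means and variances and confirms they agree:

```python
# GP regression with k(x, x') = phi(x)^T phi(x') / alpha versus the
# Bayesian linear-regression predictive distribution (3.58).
import numpy as np
rng = np.random.default_rng(7)

alpha, beta, N = 2.0, 25.0, 30
X = rng.uniform(-1, 1, size=N)
phi = lambda x: np.stack([np.ones_like(x), x, x ** 2], axis=-1)   # simple polynomial basis
Phi = phi(X)
t = np.sin(3 * X) + rng.normal(scale=1 / np.sqrt(beta), size=N)
x_new = np.array(0.3)

# Bayesian linear regression, Section 3.3.2
S_N = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)
m_lin = beta * phi(x_new) @ S_N @ Phi.T @ t
v_lin = 1 / beta + phi(x_new) @ S_N @ phi(x_new)

# Gaussian process view, (6.66) and (6.67)
C_N = Phi @ Phi.T / alpha + np.eye(N) / beta
k_vec = Phi @ phi(x_new) / alpha
c = phi(x_new) @ phi(x_new) / alpha + 1 / beta
m_gp = k_vec @ np.linalg.solve(C_N, t)
v_gp = c - k_vec @ np.linalg.solve(C_N, k_vec)

print(np.isclose(m_lin, m_gp), np.isclose(v_lin, v_gp))   # True True
```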

Exercise 6.24 ?
Show that a diagonal matrix W whose elements satisfy 0 < Wii < 1 is positive definite. Show that
the sum of two positive definite matrices is itself positive definite.

Proof. Since W is a N × N diagonal matrix with 0 < Wii < 1,


N
X
T
a Wa = a2i Wii > 0, ∀a ∈ RN
i=1

Therefore, W is a positive definite matrix. Let C1 , C2 be N × N positive definite matrices and


A = C1 + C2 . Then,

aT Aa = aT C1 + C2 a = aT C1 a + aT C2 a > 0, ∀a ∈ RN


As a result, we showed that the sum of two positive definite matrices is also a positive definite
matrix.
