Introduction to Neural Networks
Solutions
Solution (1) We assume that there is an input space X, an output space Y, and a
probability distribution on X × Y. Given a map h : X → Y and a loss function L
defined on Y × Y, the generalization risk of h is the expected value
\[
R(h) = \mathbb{E}\left[L(h(X), Y)\right],
\]
while the empirical risk on observed data (X_1, Y_1), …, (X_n, Y_n) is the average
\[
\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(X_i), Y_i).
\]
Empirical risk minimization chooses ĥ by minimizing the empirical risk R̂(h)
over a class of functions H. When the class H is very large (for example, if it consists
of all possible functions from X to Y), then we can find a function ĥ for which the
empirical risk, R̂(ĥ), is very small or even zero (if we simply “memorise” the observed
data). Such a function ĥ can have a large generalization risk: it has been adapted too
closely to the observed data, at the expense of not generalizing well to unobserved data.
This is called overfitting.
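To see this numerically, consider the following toy Python sketch; the construction (labels independent of the inputs, a default label of 0 away from the training points) is chosen purely for illustration. A classifier that memorises the training data attains zero empirical risk, while its generalization risk stays close to 1/2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Labels are independent of the inputs, so no classifier can have risk below 1/2.
n = 200
X_train = rng.uniform(size=n)
Y_train = rng.integers(0, 2, size=n)

# "Memorising" classifier: return the stored label on training points,
# and a fixed default label (0) everywhere else.
memory = dict(zip(X_train.tolist(), Y_train.tolist()))

def h_hat(x):
    return memory.get(float(x), 0)

# Empirical risk under the 0-1 loss is zero: every training point is memorised.
emp_risk = np.mean([h_hat(x) != y for x, y in zip(X_train, Y_train)])

# The generalization risk, estimated on fresh data, is close to 1/2: overfitting.
X_test = rng.uniform(size=10_000)
Y_test = rng.integers(0, 2, size=10_000)
gen_risk = np.mean([h_hat(x) != y for x, y in zip(X_test, Y_test)])

print(f"empirical risk: {emp_risk:.3f}, estimated generalization risk: {gen_risk:.3f}")
```

Nothing about the data can be learned here, yet the memorising classifier has zero empirical risk, which is exactly the overfitting phenomenon described above.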
Solution (2) The Markov inequality states that, given t > 0 and a non-negative
random variable X, we have
E[X]
P(X t) .
t
Given any t ∈ ℝ and λ > 0, the map x ↦ e^{λx} is increasing and e^{λt} > 0, and hence
\[
\mathbb{P}(X \ge t) = \mathbb{P}\left(e^{\lambda X} \ge e^{\lambda t}\right) \le \frac{\mathbb{E}\left[e^{\lambda X}\right]}{e^{\lambda t}},
\]
where the last inequality follows by applying Markov's inequality to the non-negative
random variable e^{λX}.
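As a quick numerical sanity check (purely illustrative; the standard normal X and the choice t = 2 are arbitrary), one can compare a Monte Carlo estimate of P(X ≥ t) with the bound E[e^{λX}]/e^{λt} for a few values of λ:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(1_000_000)   # samples of a standard normal X
t = 2.0

# Monte Carlo estimate of the tail probability P(X >= t).
tail = np.mean(X >= t)
print(f"P(X >= {t}) ~ {tail:.4f}")

# The bound E[exp(lam * X)] / exp(lam * t) holds for every lam > 0; for a
# standard normal it equals exp(lam^2 / 2 - lam * t), tightest around lam = t.
for lam in [0.5, 1.0, 1.5, 2.0]:
    bound = np.mean(np.exp(lam * X)) / np.exp(lam * t)
    print(f"lam = {lam:.1f}: bound ~ {bound:.4f}")
```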
(b) Assume that we have n random samples (X_1, Y_1), …, (X_n, Y_n) that are independent
and identically distributed. For each fixed h ∈ H, the losses L(h(X_i), Y_i) are also
independent and identically distributed, so a concentration bound (for example, one
derived from the inequality in part (a)) controls the deviation |R̂(h) − R(h)|. If, however,
ĥ is chosen depending on the data, for example by empirical risk minimization,
then ĥ is also a random variable: it depends on the random data (X_i, Y_i). Hence, in the
expression |R̂(ĥ) − R(ĥ)|, the function ĥ is not fixed and will vary with the data (X_i, Y_i).
The expected value may therefore be larger.
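The following sketch makes the difference concrete; the specific construction (labels independent of the inputs, a class H of random labellings of a finite domain) is chosen only for illustration. For a single fixed h the deviation |R̂(h) − R(h)| is small, while for the empirical risk minimizer ĥ it is noticeably larger.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, K = 100, 50, 2000   # sample size, domain size, number of classifiers in H

# Labels are independent of the inputs, so every h has generalization risk R(h) = 1/2.
X = rng.integers(0, m, size=n)
Y = rng.integers(0, 2, size=n)

# H: K classifiers, each a fixed random labelling of the domain {0, ..., m-1}.
H = rng.integers(0, 2, size=(K, m))

# Empirical risk (0-1 loss) of every h in H on the same sample.
emp_risks = np.mean(H[:, X] != Y, axis=1)

# For a fixed h the gap is of order 1/sqrt(n); for the minimizer it is much larger.
print(f"fixed h:    |R_hat(h) - R(h)|         ~ {abs(emp_risks[0] - 0.5):.3f}")
print(f"ERM choice: |R_hat(h_hat) - R(h_hat)| = {abs(emp_risks.min() - 0.5):.3f}")
```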
\[
\mathbb{P}(Y = 1 \mid X = x) = \frac{\rho_1(x)\,\mathbb{P}(Y = 1)}{\rho(x)} = \frac{p\,\rho_1(x)}{(1 - p)\,\rho_0(x) + p\,\rho_1(x)}.
\]
We can rearrange:
\[
\frac{p\,\rho_1(x)}{(1 - p)\,\rho_0(x) + p\,\rho_1(x)} > \frac{1}{2} \iff \frac{\rho_1(x)}{\rho_0(x)} > \frac{1 - p}{p}.
\]
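As a concrete example (the Gaussian densities and the prior p = 0.3 below are illustrative choices, not taken from the problem), the two descriptions of the Bayes classifier agree: the posterior exceeds 1/2 exactly where the likelihood ratio ρ_1(x)/ρ_0(x) exceeds (1 − p)/p.

```python
import numpy as np

def gauss_pdf(x, mu, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

p = 0.3                                # prior probability P(Y = 1)
rho0 = lambda x: gauss_pdf(x, 0.0)     # density of X given Y = 0
rho1 = lambda x: gauss_pdf(x, 2.0)     # density of X given Y = 1

x = np.linspace(-3.0, 5.0, 17)

# Posterior P(Y = 1 | X = x) computed via Bayes' rule.
posterior = p * rho1(x) / ((1 - p) * rho0(x) + p * rho1(x))

# Classify 1 when the posterior exceeds 1/2, or equivalently when the
# likelihood ratio rho1/rho0 exceeds (1 - p)/p; the two rules coincide.
print(np.array_equal(posterior > 0.5, rho1(x) / rho0(x) > (1 - p) / p))   # True
```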
Solution (6) (a) A function h ∈ H that implements XOR would correspond to a line
in ℝ² that separates the points {(0, 0), (1, 1)} from the points {(0, 1), (1, 0)}. Any line
h(x) = w^T x + b that has (0, 0) on one side, say h(0, 0) < 0, must have b < 0. Assume
that h separates (0, 0) from (0, 1) and (1, 0). Then h(1, 0) > 0 implies w_1 + b > 0, and
h(0, 1) > 0 implies that w_2 + b > 0. Summing these two inequalities gives
w_1 + w_2 + 2b > 0. But since b < 0, this implies that h(1, 1) = w_1 + w_2 + b > −b > 0.
This shows that any line separating (0, 0) from (0, 1) and (1, 0) also separates (0, 0)
from (1, 1), and hence XOR is not possible via linear classifiers.
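The chain of implications can also be checked numerically (an illustrative sketch, not a proof; the sampling range is arbitrary): among randomly drawn lines, every one that puts (0, 0) on the negative side and (0, 1), (1, 0) on the positive side also puts (1, 1) on the positive side.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample many candidate lines h(x) = w1*x1 + w2*x2 + b.
w1, w2, b = rng.uniform(-5, 5, size=(3, 100_000))

# Keep the lines with h(0,0) < 0, h(0,1) > 0 and h(1,0) > 0.
mask = (b < 0) & (w2 + b > 0) & (w1 + b > 0)

# Every such line also satisfies h(1,1) = w1 + w2 + b > 0, so XOR is never realised.
print(np.all(w1[mask] + w2[mask] + b[mask] > 0))   # True
```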
(b) The problem amounts to finding the ways in which the corners of a square can
be separated by lines. There are 16 = 2^4 possible binary functions.
x_1 x_2 |  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
 0   0  |  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1
 0   1  |  0  0  1  1  0  0  1  1  0  0  1  1  0  0  1  1
 1   0  |  0  0  0  0  1  1  1  1  0  0  0  0  1  1  1  1
 1   1  |  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1
The task is to figure out which of these can be implemented by a linear classifier.
Clearly, for every such function f that can be implemented by linear separation, the
inverted function 1 − f can also be implemented (if f(x) = 1{h(x) > 0}, then
1 − f(x) = 1{−h(x) ≥ 0}, and since only four points are involved the line can be shifted
slightly to make the inequality strict). Cases 1 and 16 can be implemented (take any line
that has all the points on one side). We also get all the cases in the above table with
exactly one 1 or exactly one 0 from a line that has one point on one side and all the
others on the other side (like the line x_2 = x_1 + 1/2, for example). This makes another
8 cases. By considering conditions of the form x_1 > 1/2, x_1 < 1/2, x_2 > 1/2, and
x_2 < 1/2, we get four additional cases, taking the total number of functions that can be
implemented using H to 14. The remaining two cases are XOR and 1 − XOR, which
cannot be implemented using H, as seen in Part (a).
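The count of 14 can be reproduced with a brute-force search (a sketch only: a finite grid of lines cannot prove that XOR is impossible, but Part (a) already does):

```python
import itertools
import numpy as np

corners = np.array([(0, 0), (0, 1), (1, 0), (1, 1)])
grid = np.linspace(-3, 3, 25)   # candidate values for w1, w2 and b

# Collect every 0-1 pattern on the four corners realised by some 1{w^T x + b > 0}.
realised = set()
for w1, w2, b in itertools.product(grid, repeat=3):
    pattern = tuple((corners @ np.array([w1, w2]) + b > 0).astype(int))
    realised.add(pattern)

# 14 of the 16 patterns appear; the missing two are XOR and 1 - XOR.
print(len(realised))   # 14
```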
We can take the set x = (e_1, …, e_d), where e_i is the i-th unit vector. Then
\[
|\{h(x) : h \in H\}| = 2^d,
\]
since for any vector σ ∈ {0, 1}^d, taking the function h_I, where I consists of those
indices i for which σ_i = 1, gives the sign pattern h_I(x) = σ. This shows that the
VC dimension is at least d.
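Assuming that H contains the linear classifiers x ↦ 1{w^T x > 0} (an assumption made only for this sketch; the weights below are one convenient choice of h_I, not necessarily the one intended in the solution), the following code confirms that all 2^d sign patterns appear on the unit vectors e_1, …, e_d:

```python
import itertools
import numpy as np

d = 4
E = np.eye(d)   # row i is the unit vector e_i

patterns = set()
for sigma in itertools.product([0, 1], repeat=d):
    # Weights +1 on I = {i : sigma_i = 1} and -1 elsewhere give h(e_i) = sigma_i.
    w = 2 * np.array(sigma) - 1
    pattern = tuple((E @ w > 0).astype(int))
    patterns.add(pattern)

print(len(patterns) == 2 ** d)   # True: the d unit vectors are shattered
```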
Solution (8) Set K = |H|. For z = (z_1, …, z_n), with z_i = (x_i, y_i), we can define
the empirical Rademacher complexity as
\[
\hat{\mathcal{R}}_z(L \circ H) = \mathbb{E}\left[\sup_{g \in L \circ H} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, g(z_i)\right],
\]