
MA3K1 Mathematics of Machine Learning April 10, 2021

Solutions
Solution (1) We assume that there is an input space X, an output space Y, and a
probability distribution on X × Y. Given a map h : X → Y and a loss function L
defined on Y × Y, the generalization risk of h is the expected value

R(h) = E[L(h(X), Y)],

where (X, Y) is distributed on X × Y according to the given distribution. The generalization risk gives an indication of how h "performs", on average, on unseen data.
If we observe n samples (x_1, y_1), . . . , (x_n, y_n) drawn from this distribution, we can
construct a classifier ĥ by minimizing the empirical risk

R̂(h) = (1/n) ∑_{i=1}^{n} L(h(x_i), y_i)

over a class of functions H. When the class H is very large (for example, if it consists
of all possible functions from X to Y), then we can find a function ĥ for which the
empirical risk, R̂(ĥ), is very small or even zero (if we simply “memorise” the observed
data). Such a function ĥ can have a large generalization risk: it has been adapted too
closely to the observed data, at the expense of not generalizing well to unobserved data.
This is called overfitting.
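
To make the overfitting phenomenon concrete, here is a small numerical sketch (an illustration only, not part of the solution): the data distribution, the "memorising" classifier and the constant classifier are all invented for this example. The memoriser attains zero empirical risk but a much larger generalization risk than the simple constant rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented example: X uniform on [0, 1], Y = 1 with probability 0.7 independently of X,
# so the best possible classifier is the constant 1, with risk 0.3.
n = 30
x_train = rng.random(n)
y_train = (rng.random(n) < 0.7).astype(int)

# "Memorising" classifier: returns the stored label on the training inputs, 0 elsewhere.
memory = dict(zip(x_train, y_train))

def h_memorise(x):
    return memory.get(x, 0)

def h_constant(x):
    return 1

def empirical_risk(h, xs, ys):
    # 0-1 loss averaged over the sample.
    return float(np.mean([h(xi) != yi for xi, yi in zip(xs, ys)]))

# Estimate the generalization risk on a large fresh sample from the same distribution.
m = 100_000
x_test = rng.random(m)
y_test = (rng.random(m) < 0.7).astype(int)

for name, h in [("memoriser", h_memorise), ("constant 1", h_constant)]:
    print(name,
          "empirical risk =", empirical_risk(h, x_train, y_train),
          "estimated generalization risk =", round(empirical_risk(h, x_test, y_test), 3))
```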

Solution (2) The Markov inequality states that, given t > 0 and a non-negative
random variable X, we have

P(X ≥ t) ≤ E[X] / t.

Given any t ∈ ℝ and λ > 0, the map x ↦ e^{λx} is increasing and e^{λX} is a non-negative random variable, and hence

P(X ≥ t) = P(e^{λX} ≥ e^{λt}) ≤ E[e^{λX}] / e^{λt},

where the last inequality follows by applying Markov's inequality to e^{λX} (note that e^{λt} > 0).
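
As a quick numerical illustration (not part of the solution), the sketch below compares the two bounds with the exact tail probability for X ~ Exponential(1); this distribution and the values of t and λ are chosen only because every quantity is available in closed form.

```python
import numpy as np

# Illustration with X ~ Exponential(1): E[X] = 1, E[exp(lam*X)] = 1/(1 - lam) for lam < 1,
# and the exact tail is P(X >= t) = exp(-t).
t = 5.0

exact = np.exp(-t)        # exact tail probability P(X >= t)
markov = 1.0 / t          # Markov bound E[X] / t

print(f"exact P(X >= {t}) = {exact:.4f},  Markov bound = {markov:.4f}")
for lam in (0.2, 0.5, 0.8):
    # Exponential-moment bound E[exp(lam*X)] / exp(lam*t).
    bound = (1.0 / (1.0 - lam)) / np.exp(lam * t)
    print(f"lam = {lam}: exponential-moment bound = {bound:.4f}")
```

Every choice of λ gives a valid bound, so one typically optimises over λ.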

Solution (3) (a) The generalization risk is the expected value

R(h) = E[1{h(X) ≠ Y}] = P(h(X) ≠ Y).

The empirical risk is

R̂(h) = (1/n) ∑_{i=1}^{n} 1{h(X_i) ≠ Y_i}.

(b) Assume that we have n random samples (X_1, Y_1), . . . , (X_n, Y_n) that are independent and identically distributed. For each fixed h ∈ H, we have the bound

|R̂(h) − R(h)| ≤ C(n, δ),

with probability at least 1 − δ, by assumption. Now if ĥ is the minimizer of R̂(h), that is,

ĥ = argmin_{h ∈ H} (1/n) ∑_{i=1}^{n} 1{h(X_i) ≠ Y_i},

then ĥ is also a random variable: it depends on the random data (X_i, Y_i). Hence, in the
expression |R̂(ĥ) − R(ĥ)|, the classifier ĥ is not fixed and will vary with the data (X_i, Y_i).
The deviation may therefore be larger, and the bound for a fixed h does not apply directly.
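
The following simulation sketch illustrates this gap (it is not part of the solution; the distribution and the finite class H of random labelings are invented for the illustration): every h in this class has true risk 1/2, yet the empirical risk of the empirical minimiser ĥ deviates from 1/2 much more than that of any fixed h.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented setup: X uniform on {0,...,19}, Y a fair coin independent of X,
# so every classifier h has true risk R(h) = 1/2.
n_points, n, K, n_trials = 20, 50, 200, 500

# A fixed finite class H: K random labelings of {0,...,19}, drawn once and then frozen.
H = rng.integers(0, 2, size=(K, n_points))

dev_fixed, dev_erm = [], []
for _ in range(n_trials):
    x = rng.integers(0, n_points, size=n)
    y = rng.integers(0, 2, size=n)
    emp = (H[:, x] != y).mean(axis=1)        # empirical risk of every h in H on this sample
    dev_fixed.append(abs(emp[0] - 0.5))      # deviation |R^(h) - R(h)| for one fixed h
    dev_erm.append(abs(emp.min() - 0.5))     # deviation for the empirical risk minimiser

print("average deviation, fixed h        :", np.mean(dev_fixed))
print("average deviation, empirical min. :", np.mean(dev_erm))
```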

Solution (4) (a) The regression function for binary classification is

f(x) = E[Y | X = x] = P(Y = 1 | X = x).

The Bayes classifier can be defined as

h*(x) = 1 if f(x) > 1/2,  and  h*(x) = 0 otherwise.

(b) To compute the Bayes classifier, we use Bayes' rule:

P(Y = 1 | X = x) = ρ_1(x) P(Y = 1) / ρ(x) = p ρ_1(x) / ((1 − p) ρ_0(x) + p ρ_1(x)).

We can rearrange:

p ρ_1(x) / ((1 − p) ρ_0(x) + p ρ_1(x)) > 1/2  ⟺  ρ_1(x) / ρ_0(x) > (1 − p) / p.
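
As an illustration of this threshold rule (not part of the solution), the sketch below assumes, purely for concreteness, Gaussian class-conditional densities ρ_0 = N(0, 1), ρ_1 = N(2, 1) and prior p = 0.3, and compares the resulting Bayes classifier with a naive threshold at x = 1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed example densities (not from the problem): rho_0 = N(0,1), rho_1 = N(2,1), p = P(Y=1) = 0.3.
p = 0.3

def gauss_pdf(x, mu):
    return np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

def bayes_classifier(x):
    # Predict 1 exactly when the likelihood ratio rho_1(x)/rho_0(x) exceeds (1 - p)/p.
    return (gauss_pdf(x, 2.0) / gauss_pdf(x, 0.0) > (1 - p) / p).astype(int)

# Monte Carlo estimate of the risks.
n = 200_000
y = (rng.random(n) < p).astype(int)
x = np.where(y == 1, rng.normal(2.0, 1.0, n), rng.normal(0.0, 1.0, n))

print("Bayes classifier risk     ~", np.mean(bayes_classifier(x) != y))
print("naive threshold at 1 risk ~", np.mean((x > 1.0).astype(int) != y))
```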

Solution (5) A hypothesis class H is called PAC-learnable if there exists a classifier
ĥ ∈ H depending on n random samples (X_i, Y_i), i ∈ {1, . . . , n}, and a polynomial
function p(x, y, z, w), such that for any ε > 0 and δ ∈ (0, 1), and for all distributions on
X × Y,

P( R(ĥ) ≤ inf_{h ∈ H} R(h) + ε ) ≥ 1 − δ

holds whenever n ≥ p(1/ε, 1/δ, size(X), size(H)).

Solution (6) (a) A function h ∈ H that implements XOR would correspond to a line
in ℝ² that separates the points {(0, 0), (1, 1)} from the points {(0, 1), (1, 0)}. Any line
h(x) = wᵀx + b that has (0, 0) on one side, say, h(0, 0) < 0, has to have b < 0. Assume
that h separates (0, 0) from (0, 1) and (1, 0). Then h(1, 0) > 0 implies w_1 + b > 0, and
h(0, 1) > 0 implies that w_2 + b > 0. In particular, if we sum these two expressions, we
get w_1 + w_2 + 2b > 0. But since b < 0, this implies that h(1, 1) = w_1 + w_2 + b > 0.
This shows that any line separating (0, 0) from (0, 1) and (1, 0) also separates (0, 0)
from (1, 1), and hence XOR is not possible via linear classifiers.
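
A randomized sanity check of this argument (an illustration only; the algebra above is the actual proof): among many randomly drawn linear functions that put (0, 0) on the negative side and (1, 0), (0, 1) on the positive side, none puts (1, 1) on the negative side.

```python
import numpy as np

rng = np.random.default_rng(3)

# Sanity check: whenever h(x) = w1*x1 + w2*x2 + b has h(0,0) < 0, h(1,0) > 0 and h(0,1) > 0,
# it also has h(1,1) > 0, so no line implements XOR.
counterexamples = 0
for _ in range(200_000):
    w1, w2, b = rng.uniform(-10, 10, size=3)
    if b < 0 and w1 + b > 0 and w2 + b > 0:      # (0,0) negative, (1,0) and (0,1) positive
        if w1 + w2 + b <= 0:                     # (1,1) would be on the negative side
            counterexamples += 1

print("counterexamples found:", counterexamples)  # prints 0, in line with the argument
```
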
(b) The problem amounts to finding the ways in which the corners of a square can
be separated by lines. There are 16 = 2^4 possible binary functions.


x1 x2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

The task is to figure out which of these can be implemented by a linear classifier.
Clearly, for every such function f that can be implemented by linear separation, the
inverted function 1 − f can also be implemented (if f(x) = 1{h(x) > 0} with h non-zero at the
four corners, then 1 − f(x) = 1{−h(x) > 0}). Clearly, cases 1 and 16 can be implemented (take any line
that has all the points on one side). We also get all the cases in the above table with exactly one
1 or exactly one 0 by a line that has one point on one side and all the others on the other
side (such as h(x) = x_1 + x_2 − 1/2, which has only (0, 0) on its negative side). This makes another 8 cases. By considering
conditions of the form

x_i − 1/2 > 0,   x_i − 1/2 < 0,   i ∈ {1, 2},

we get four additional cases, taking the total number of functions that can be implemented
using H to 14. The remaining two cases are XOR and 1 − XOR, which cannot be
implemented using H, as seen in Part (a).
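
The count can also be checked by brute force. The sketch below (an illustration; the grid search is heuristic and the grid resolution is an arbitrary choice) enumerates which of the 16 labelings of the four corners are realised by classifiers of the form 1{w_1 x_1 + w_2 x_2 + b > 0}; exactly 14 appear, and the two missing ones are XOR and its complement.

```python
import itertools
import numpy as np

# Brute-force check over a grid of linear classifiers h(x) = 1{w1*x1 + w2*x2 + b > 0}.
corners = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
grid = np.linspace(-2, 2, 41)        # arbitrary grid; fine enough for these four points

realisable = set()
for w1, w2, b in itertools.product(grid, grid, grid):
    labels = tuple(int(v) for v in (corners @ np.array([w1, w2]) + b > 0))
    realisable.add(labels)

print("realisable labelings:", len(realisable), "out of 16")
missing = [lab for lab in itertools.product([0, 1], repeat=4) if lab not in realisable]
print("not realisable:", missing)    # XOR (0,1,1,0) and its complement (1,0,0,1)
```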

Solution (7) Let H be a set of classifiers h : X → Y. The growth function Π_H is
defined as

Π_H(n) = max_{x ∈ X^n} |{h(x) : h ∈ H}|,

where h(x) = (h(x_1), . . . , h(x_n)) if x = (x_1, . . . , x_n). The VC dimension of H is

VC(H) = max{n ∈ ℕ : Π_H(n) = 2^n}.

We can take the set x = (e_1, . . . , e_d), where e_i is the i-th unit vector. Then

|{h(x) : h ∈ H}| = 2^d,

since for any 0–1 vector σ of length d, taking the function h_I, where I consists of
those indices i for which σ_i = 1, gives the sign pattern h_I(x) = σ. This shows that the
VC dimension is at least d.
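
Below is a small sketch of this shattering argument. The solution does not spell out the class H, so the code assumes, purely for illustration, that H consists of linear classifiers h_w(x) = 1{⟨w, x⟩ > 0} on ℝ^d and that h_I uses the weight vector with entry +1 on I and −1 elsewhere; under that assumption every sign pattern is realised on (e_1, . . . , e_d).

```python
import itertools
import numpy as np

# Assumed class (not stated in the solution): linear classifiers h_w(x) = 1{<w, x> > 0} on R^d.
d = 4
points = np.eye(d)                   # rows are the unit vectors e_1, ..., e_d

patterns = set()
for sigma in itertools.product([0, 1], repeat=d):
    # h_I with I = {i : sigma_i = 1}: weight +1 on I and -1 off I realises the pattern sigma.
    w = np.where(np.array(sigma) == 1, 1.0, -1.0)
    patterns.add(tuple(int(v) for v in (points @ w > 0)))

print("distinct label patterns on (e_1, ..., e_d):", len(patterns), "=", 2 ** d)
```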

Solution (8) Set K = |H|. For z = (z_1, . . . , z_n), with z_i = (x_i, y_i), we can define
the empirical Rademacher complexity as

R̂_z(L ∘ H) = E_σ [ sup_{g ∈ L∘H} (1/n) ∑_{i=1}^{n} σ_i g(z_i) ],

where g(z_i) = L(h(x_i), y_i) and σ_1, . . . , σ_n are independent Rademacher random variables (uniform on {−1, +1}), and the Rademacher complexity as

R_n(L ∘ H) = E_Z [ R̂_Z(L ∘ H) ].
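
For intuition, here is a Monte Carlo sketch of the empirical Rademacher complexity for an invented finite class and the 0–1 loss (the sample, the class, and the number of sign draws are all arbitrary choices for this illustration).

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented example: K = |H| random classifiers on n fixed sample points, 0-1 loss,
# so g_k(z_i) = 1{h_k(x_i) != y_i}; row k of `losses` holds (g_k(z_1), ..., g_k(z_n)).
n, K = 30, 50
losses = rng.integers(0, 2, size=(K, n)).astype(float)

n_sigma = 20_000
estimate = 0.0
for _ in range(n_sigma):
    sigma = rng.choice([-1.0, 1.0], size=n)          # Rademacher signs
    estimate += np.max(losses @ sigma) / n           # sup over g of (1/n) sum_i sigma_i g(z_i)

print("estimated empirical Rademacher complexity:", estimate / n_sigma)
```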
