
Statistical Machine Learning, 1RT700: Exercises Lesson 3 - Classification (January 29, 2019)

Classification

3.1 k-nearest neighbors (k-NN)


The table below is the training data set with n = 6 observations of a 3-dimensional quantitative input x = [x1 x2 x3]^T and one qualitative output y (the color green or red).

 i |  x1  x2  x3  |   y
 1 |   0   3   0  | Red
 2 |   2   0   0  | Red
 3 |   0   1   3  | Red
 4 |   0   1   2  | Green
 5 |  -1   0   1  | Green
 6 |   1   1   1  | Red

(a) Compute the Euclidean distance between each observation in the training data and the test point x⋆ = [0 0 0]^T.
(b) What is our prediction for the test point x⋆, if we use k-NN with k = 1?
(c) What is our prediction for the test point x⋆, if we use k-NN with k = 3?

3.2 Logistic regression


Suppose we collect data from a group of students in a Machine learning class, with variables x1 = hours studied, x2 = grade point average, and y = a binary output indicating whether the student received grade 5 (y = 1) or not (y = 0). We learn a logistic regression model

p(y = 1 | x) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}    (3.1)

with estimated parameters β̂0 = −6, β̂1 = 0.05 and β̂2 = 1.


(a) Estimate the probability, according to the logistic regression model, that a student who studies for 40 h and has a grade point average of 3.5 gets a 5 in the Machine learning class.
(b) According to the logistic regression model, how many hours would the student in part (a) need to study to have a 50% chance of getting a 5 in the class?


3.3 Difference between LDA and QDA


We now examine the differences between LDA and QDA. The Bayes decision boundary is the decision boundary of the
Bayes classifier, which is the ‘optimal’ classifier (Section 3.4 in the lecture notes).

(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? What
do we expect on the test set?
(b) If the Bayes decision boundary is nonlinear, do we expect LDA or QDA to perform better on the training set?
What do we expect on the test set?
(c) In general, as the sample size n increases, do we expect the test error rate of QDA relative to LDA to increase,
decrease or be unchanged? Why?
(d) True or false: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a smaller
test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary.
Justify your answer.

3.4 Bayes’ classifier


Suppose you work at a clinic in a field mission and want to predict whether a patient has a particular (potentially deadly) disease or not. You have a limited supply of an effective drug, which, however, has severe side effects. Due to unfortunate circumstances, the only diagnostic tool you have access to is a clinical thermometer, with which you can measure the body temperature of the patient. From previous studies made on the disease, you know the following:

• the distribution of body temperatures in infected patients is approximately Gaussian with mean 38.5 °C and standard deviation 1 °C.
• the distribution of body temperatures in patients not infected by the disease (either healthy or infected by other diseases) is approximately Gaussian with mean 37.5 °C and standard deviation √0.5 °C.
• the prevalence of the disease is 5% (i.e., 5% of the population is infected).
The body temperatures of three patients are as follows: patient A 38.5 ◦ C, patient B 39.2 ◦ C, and patient C 40.1 ◦ C.
(a) What is the probability that patients A, B, and C, respectively, are infected by the disease?
Hint: Use Bayes' theorem

p(y | x) = \frac{p(x | y)\, p(y)}{\sum_{k=1}^{K} p(x | k)\, p(k)},

where p(y) is the prior probability of class y, and p(x | y) is the probability density of x for an observation from class y.
(b) Which prediction should you make for each patient, in order to make on average as few misclassifications as
possible? (Hint: Bayes’ classifier)
(c) Argue why a performance metric other than standard accuracy ('on average as few misclassifications as possible') should be considered for this problem. How would that affect your decisions in (b)?
(d) For most applications, Bayes’ classifier cannot be used. Why is it possible to use Bayes’ classifier for this problem?

3.5 Error rates


Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different
classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30%
on the test data. Next we use 1-nearest neighbors (i.e. k-NN with k = 1) and get an average error rate (averaged over
both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification
of new observations? Why?


3.6 Quadratic Discriminant Analysis


Consider a classification problem with input x ∈ R^p and output y ∈ {1, . . . , K}. Consider also Bayes' classifier

\hat{y} = \arg\max_{k \in \{1,\dots,K\}} p(y = k | x), \quad \text{where} \quad p(y | x) = \frac{p(x | y)\, p(y)}{\sum_{k=1}^{K} p(x | k)\, p(k)}.    (3.2)

In Quadratic Discriminant Analysis (QDA) we assume that p(x | y) is a multivariate Gaussian density with mean µk and covariance Σk,

p(x | y = k) = \mathcal{N}(x | \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)},    (3.3a)
p(y = k) = \pi_k,    (3.3b)

(each class with its own µk, Σk and πk).

(a) Show that under the assumptions in (3.3), Bayes' classifier becomes

\hat{y} = \arg\max_{k \in \{1,\dots,K\}} \delta_k(x), \quad \text{where} \quad \delta_k(x) = -\tfrac{1}{2} x^T \Sigma_k^{-1} x + x^T \Sigma_k^{-1} \mu_k - \tfrac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k - \tfrac{1}{2} \log|\Sigma_k| + \log \pi_k.    (3.4)

This is QDA, and δk(x) is called the discriminant function.

Hint: In lecture 4, an equivalent derivation was made for LDA, assuming that Σk = Σ is the same for all k. Look at that derivation and extend it to QDA by relaxing this assumption.
(b) Consider two classes k and l. Show that the decision boundary between these two classes is given by a quadratic function.

3.7 Curse of dimensionality


For a large number of inputs p, some methods, such as the nonparametric k-NN, may perform badly due to the high dimensionality p of the input space. The problem is that the notion of 'near' or 'close' depends strongly on the number of dimensions p; this is commonly referred to as the curse of dimensionality. To investigate this, we will now consider an alternative version of the k-NN method, which uses all neighbors within a fixed hypercube (instead of the k nearest) for making the decision.

(a) Suppose that p = 1, and that the inputs x are uniformly distributed on [0, 1]. We decide to consider all observations
with an input within a ±0.05 interval (as an alternative to using the k nearest observed inputs in the k-NN method)
when making predictions. We now want to predict a test observation with input X = 0.3. On average, what fraction
of all training observations will be used in making the prediction?
(b) Now consider the corresponding situation for p = 2: The inputs are uniformly distributed on [0, 1] × [0, 1], and for
making predictions we use all training observations within ±0.05 in each dimension. On average, what fraction of
all training observations will we use when making a prediction for a test observation with input x⋆ = [0.3 0.6]^T?
(c) In general, what fraction of all training observations will be used in predictions if there are p dimensions? As before,
all inputs are uniformly distributed on [0, 1]p and for prediction we consider the training observations within ±0.05
for each dimension. You may ignore the boundary effects if the test input is within 0.05 from the borders 0 or 1.
(d) Based on your answers to (a)-(c), argue why the prediction performance of k-NN might deteriorate for large p.
(e) If the inputs are distributed as in (c), and we want to make predictions using 10% of the training data inputs, what side length should a symmetric hypercube have in order to cover, on average, 10% of the inputs?


Solutions

3.1 (a) The Euclidean distances are given in the rightmost column below:

 i |  x1  x2  x3  |   y   | distance ‖x − x⋆‖
 1 |   0   3   0  | Red   | 3
 2 |   2   0   0  | Red   | 2
 3 |   0   1   3  | Red   | √10 ≈ 3.2
 4 |   0   1   2  | Green | √5 ≈ 2.2
 5 |  -1   0   1  | Green | √2 ≈ 1.4
 6 |   1   1   1  | Red   | √3 ≈ 1.7

(b) With k-NN and k = 1, the single closest observation, i.e., observation 5, determines the prediction. Thus, the prediction is green.
(c) The 3 closest observations are observations 5, 6 and 2. Two of them are red and one is green, so the prediction is red.
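
These computations can be checked with a short script; the following is a minimal sketch (not part of the original solution), assuming NumPy and a plain majority vote among the k nearest neighbors.

```python
import numpy as np
from collections import Counter

# Training data from Exercise 3.1: inputs X and labels y.
X = np.array([[0, 3, 0],
              [2, 0, 0],
              [0, 1, 3],
              [0, 1, 2],
              [-1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array(["Red", "Red", "Red", "Green", "Green", "Red"])
x_star = np.zeros(3)

# (a) Euclidean distances to the test point x*.
dist = np.linalg.norm(X - x_star, axis=1)
print(np.round(dist, 2))           # [3.   2.   3.16 2.24 1.41 1.73]

# (b), (c) k-NN prediction: majority vote among the k nearest neighbors.
def knn_predict(k):
    nearest = np.argsort(dist)[:k]
    return Counter(y[nearest]).most_common(1)[0][0]

print(knn_predict(1))              # Green
print(knn_predict(3))              # Red
```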

3.2 (a) With the parameters β̂0 = −6, β̂1 = 0.05 and β̂2 = 1, the probability of getting a 5 is

p(y = 1 | x) = \frac{e^{\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2}}{1 + e^{\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2}}    (3.5)
             = \frac{e^{-6 + 0.05 x_1 + x_2}}{1 + e^{-6 + 0.05 x_1 + x_2}}.    (3.6)

Now, with x1 = 40 and x2 = 3.5,

p(y = 1 | x) = \frac{e^{-6 + 0.05 \cdot 40 + 1 \cdot 3.5}}{1 + e^{-6 + 0.05 \cdot 40 + 1 \cdot 3.5}}    (3.7)
             = \frac{e^{-0.5}}{1 + e^{-0.5}}    (3.8)
             = \frac{1}{1 + e^{0.5}} \approx 38\%.    (3.9)

(b) Set p(y = 1 | x) = 0.5 and x2 = 3.5. This gives

0.5 = \frac{e^{-6 + 0.05 x_1 + 3.5}}{1 + e^{-6 + 0.05 x_1 + 3.5}}    (3.10)
    = \frac{1}{e^{2.5 - 0.05 x_1} + 1}  ⇒    (3.11)
0.5\,(1 + e^{2.5 - 0.05 x_1}) = 1  ⇒    (3.12)
e^{2.5 - 0.05 x_1} = \frac{1}{0.5} - 1 = 1  ⇒    (3.13)
2.5 - 0.05 x_1 = \log(1) = 0  ⇒    (3.14)
x_1 = \frac{2.5}{0.05} = 50 h.    (3.15)
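
As a quick sanity check (not part of the original solution), a minimal Python sketch of these two calculations, assuming the fitted parameters above:

```python
import numpy as np

# Fitted logistic regression parameters from Exercise 3.2.
b0, b1, b2 = -6.0, 0.05, 1.0

def p_grade5(x1, x2):
    """Probability of y = 1 under the logistic regression model (3.1)."""
    z = b0 + b1 * x1 + b2 * x2
    return 1.0 / (1.0 + np.exp(-z))

# (a) 40 hours of study, grade point average 3.5.
print(p_grade5(40, 3.5))            # approximately 0.378

# (b) p = 0.5 means the exponent is zero: b0 + b1*x1 + b2*3.5 = 0.
print((0 - b0 - b2 * 3.5) / b1)     # 50.0 hours
```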


3.3 (a) We can always expect QDA to perform better than LDA on the training set because it is more flexible and is
capable of fitting the training data better. If the Bayes decision boundary is linear, we expect LDA to perform
better on test data because it does not overfit.
(b) If the Bayes decision boundary is nonlinear, we expect QDA to be able to perform better also on the test set.
(c) In general, we expect the test error rate of QDA relative to LDA to improve as the sample size n increases, since QDA is more flexible and can therefore come closer to the Bayes decision boundary. (However, for small n, QDA may overfit the training data.)
(d) False. With few data points n, QDA is likely to overfit, yielding a higher test error rate than LDA, whereas LDA does not suffer from this extra variance when the Bayes decision boundary is indeed linear.
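
A small simulation can illustrate (c) and (d). The sketch below (added here, not from the original text) draws two Gaussian classes with a common covariance, so the Bayes decision boundary is linear, and compares LDA and QDA test errors as the training set grows; it assumes scikit-learn is available.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)

def draw(n, p=2):
    """Two classes with equal covariance: the Bayes boundary is linear."""
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, p)) + y[:, None]   # class 1 is shifted by 1 in each dimension
    return X, y

X_test, y_test = draw(5000)
for n in (20, 100, 1000):
    X_train, y_train = draw(n)
    lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
    qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)
    print(n, 1 - lda.score(X_test, y_test), 1 - qda.score(X_test, y_test))
# QDA's extra variance typically hurts for small n and becomes negligible as n grows.
```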

3.4 (a) We have the output y describing the patient status as {infected, healthy}, and the input x being the body temper-
ature:

• p(x | y = infected) = N(x | 38.5, 1)
• p(x | y = healthy) = N(x | 37.5, 0.5)
• p(infected) = 0.05, and hence p(healthy) = 0.95,

where N(x | µ, σ²) denotes the Gaussian density with mean µ and variance σ².

Inserting these expressions and the patients' temperatures into Bayes' theorem, we get
p(patient A is infected) = p(y = infected | x = 38.5) ≈ 0.09,
p(patient B is infected) = p(y = infected | x = 39.2) ≈ 0.34,
p(patient C is infected) = p(y = infected | x = 40.1) ≈ 0.90.
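
These posterior probabilities can be reproduced with a few lines of code; the sketch below (not part of the original solution) evaluates Bayes' theorem with the two Gaussian class densities, parameterized by their variances as above.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Univariate Gaussian density N(x | mean, var)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

prior_infected, prior_healthy = 0.05, 0.95

def p_infected(temp):
    """Posterior p(y = infected | x = temp) via Bayes' theorem."""
    num = gauss_pdf(temp, 38.5, 1.0) * prior_infected       # infected: mean 38.5, variance 1
    den = num + gauss_pdf(temp, 37.5, 0.5) * prior_healthy  # healthy: mean 37.5, variance 0.5
    return num / den

for name, temp in [("A", 38.5), ("B", 39.2), ("C", 40.1)]:
    print(name, round(p_infected(temp), 2))   # A 0.09, B 0.34, C 0.9
```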

(b) Bayes’ classifier (predicting the most likely class) minimizes the average number of misclassifications. For this
problem, it gives the following predictions: patient A healthy, patient B healthy, patient C infected.
(c) Since the disease is potentially deadly, there is an asymmetry in the problem. The consequences of falsely classifying an infected patient as healthy are probably worse than those of falsely classifying a healthy patient as infected (despite the side effects of the drug). A classifier designed with this asymmetry in mind would probably also predict patient B, and perhaps also patient A, as infected.
A useful tool for such a design could be the confusion matrix.
(d) What is special in this problem is that we are not given training data; instead we have access to the class-conditional distributions p(x | y) and the prior p(y), so Bayes' classifier can be evaluated exactly.

3.5 Logistic regression has a training error rate of Ptraining = 20% and a test error rate of Ptest = 30%. k-NN (k = 1) has an average error rate (Ptraining + Ptest)/2 = 18%.
However, for k-NN with k = 1 the training error rate is Ptraining = 0%, because every training observation is its own nearest neighbor. So k-NN must have a test error rate of Ptest = 2 · 18% − 0% = 36%. We would therefore choose logistic regression because of its lower test error rate of 30%.

3.6 (a) Since the denominator in (3.2) does not depend on k, we get

\hat{y} = \arg\max_{k \in \{1,\dots,K\}} p(y = k | x) = \arg\max_{k \in \{1,\dots,K\}} p(x | k)\, p(k) = \arg\max_{k \in \{1,\dots,K\}} \log\big(p(x | k)\, p(k)\big).

Further, we see that

\log\big(p(x | k)\, p(k)\big) = \log p(k) + \log p(x | k)
  = -\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) - \tfrac{1}{2}\log|\Sigma_k| + \log \pi_k + \underbrace{\text{const.}}_{\text{independent of } k}
  = \underbrace{-\tfrac{1}{2} x^T \Sigma_k^{-1} x + x^T \Sigma_k^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma_k^{-1}\mu_k - \tfrac{1}{2}\log|\Sigma_k| + \log \pi_k}_{=\delta_k(x)} + \underbrace{\text{const.}}_{\text{independent of } k},


where the classification problem can be written as

\hat{y} = \arg\max_{k \in \{1,\dots,K\}} \delta_k(x).    (3.16)

(b) Compare two classes y = k and y = l. The decision boundary between these two classes is given by

p(y = k | x) = p(y = l | x)  ⇒  δk(x) − δl(x) = 0,

i.e., where the predicted probabilities for the two classes are equally high. This gives

\delta_k(x) - \delta_l(x) = -\tfrac{1}{2} x^T \Sigma_k^{-1} x + x^T \Sigma_k^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma_k^{-1}\mu_k - \tfrac{1}{2}\log|\Sigma_k| + \log \pi_k
  - \Big(-\tfrac{1}{2} x^T \Sigma_l^{-1} x + x^T \Sigma_l^{-1}\mu_l - \tfrac{1}{2}\mu_l^T \Sigma_l^{-1}\mu_l - \tfrac{1}{2}\log|\Sigma_l| + \log \pi_l\Big)
  = -\tfrac{1}{2} x^T (\Sigma_k^{-1} - \Sigma_l^{-1}) x + x^T(\Sigma_k^{-1}\mu_k - \Sigma_l^{-1}\mu_l)
    - \tfrac{1}{2}\mu_k^T \Sigma_k^{-1}\mu_k - \tfrac{1}{2}\log|\Sigma_k| + \log \pi_k + \tfrac{1}{2}\mu_l^T \Sigma_l^{-1}\mu_l + \tfrac{1}{2}\log|\Sigma_l| - \log \pi_l,

which is a quadratic function of x as long as Σk ≠ Σl.
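
To make the discriminant function (3.4) concrete, here is a minimal sketch (added, not from the original text) that evaluates δk(x) for each class and picks the maximizer; the class parameters below are made-up numbers for illustration only.

```python
import numpy as np

def qda_discriminant(x, mu, Sigma, pi):
    """QDA discriminant delta_k(x) from (3.4) for one class."""
    S_inv = np.linalg.inv(Sigma)
    return (-0.5 * x @ S_inv @ x
            + x @ S_inv @ mu
            - 0.5 * mu @ S_inv @ mu
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(pi))

# Hypothetical two-class example in p = 2 dimensions: (mu_k, Sigma_k, pi_k).
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.6),
    (np.array([2.0, 1.0]), np.array([[2.0, 0.3], [0.3, 0.5]]), 0.4),
]

x = np.array([1.0, 0.5])
deltas = [qda_discriminant(x, mu, Sigma, pi) for mu, Sigma, pi in params]
print(deltas, "predicted class:", int(np.argmax(deltas)) + 1)   # classes numbered 1, ..., K
```

Since the x-dependent part of δk(x) − δl(x) contains the term −½ xᵀ(Σk⁻¹ − Σl⁻¹)x, the boundary is quadratic whenever the class covariances differ; with a common Σ this term cancels and the linear LDA boundary is recovered.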

3.7 (a) All training observations with inputs in the interval [0.25, 0.35] will be used for making the prediction. Since the inputs are uniformly distributed on [0, 1], on average 10% of them will fall in [0.25, 0.35] and hence be used for the prediction.
(b) In this case, we will use all training observations with inputs in the square [0.25, 0.35] × [0.55, 0.65]. The square covers 1% of [0, 1] × [0, 1], and hence, on average, only 1% of the training observations will be used in the prediction.
(c) The probability that an input lies inside the hypercube with side 0.1 in each dimension is 0.1 per dimension, and thus 0.1^p for all p dimensions.
(d) If p is large, the nearest neighbor to a test input might still be quite far away from it. When p = 1, as in (a), about 10% of the training data can be expected to lie within ±0.05, so we can expect to find an observation 'similar' to the test case among the training data, yielding a hopefully useful prediction. If instead p = 10 in (c), only a fraction 0.1^10 = 10^-10 of the training data can be expected to lie within ±0.05 in every dimension around the test input. It is then much less likely that the training data contains an observation 'similar' to the test case, and the prediction performance may therefore deteriorate.
(e)
• For p = 1, the side needs to be 0.1.
• For p = 2, the side needs to be 0.1^(1/2) ≈ 0.316.
• For p = 3, the side needs to be 0.1^(1/3) ≈ 0.464.
• ...
• For general p, the side needs to be 0.1^(1/p).

Thus, if the number of inputs is high, say p = 100, the side of the cube needs to be 0.1^(1/100) ≈ 0.977. This means that if we want to use, on average, 10% of the training data for a prediction, we need to include almost the entire range of each input dimension, because of the large number of dimensions.
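
The side lengths in (e) are simply 0.1^(1/p); a one-line check (added here, not part of the original solution):

```python
# Side length of a hypercube that covers, on average, 10% of uniformly distributed inputs in p dimensions.
for p in (1, 2, 3, 10, 100):
    print(p, round(0.1 ** (1 / p), 3))   # 0.1, 0.316, 0.464, 0.794, 0.977
```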
