Problem 2: Overfitting (20 points)
For each of the supervised learning methods that we have studied, indicate how the method could overfit the training data (consider both your design choices as well as the training) and what you can do to minimize this possibility. There may be more than one mechanism for overfitting; make sure that you identify them all.
Part A: Nearest Neighbors (5 Points)
1. How does it overfit?
Every point in the dataset (including noise) defines its own decision boundary.
The distance function can be chosen to do well on the training set but less well on new data.
2. How can you reduce overfitting?
Use k-NN with a larger k.
Use cross-validation to choose k and the distance function (see the sketch below).
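As an added illustration (not part of the original answer; scikit-learn is assumed and the dataset and candidate grid are synthetic stand-ins), a minimal sketch of choosing k by cross-validation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

best_k, best_score = None, -np.inf
for k in [1, 3, 5, 9, 15, 25]:
    # Mean accuracy over 5 folds estimates generalization, not training fit.
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print(f"chosen k = {best_k} (cv accuracy {best_score:.3f})")
```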
Part B: Decision Trees (5 Points)
1. How does it overfit?
By adding new tests to the tree to correctly classify every data point in the training set.
2. How can you reduce overfitting?
By pruning the resulting tree based on performance on a validation set (sketched below).
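An added sketch of validation-set pruning (scikit-learn assumed, synthetic data): generate candidate pruned trees via cost-complexity pruning and keep the one that scores best on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths; larger ccp_alpha prunes more aggressively.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),  # validation accuracy, not training
)
print("leaves in pruned tree:", best.get_n_leaves())
```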
Part C: Neural Nets (5 Points)
1. How does it overfit?
By having too many units and therefore too many weights, thus enabling it to
fit every nuance of the training set.
By training too long so as to fit the training data better.
2. How can you reduce overfitting?
Using cross-validation to choose a network that is not too complex.
By using a validation set to decide when to stop training (see the sketch below).
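An added sketch of both remedies (scikit-learn assumed, synthetic data): a modestly sized network combined with early stopping against an internal validation split.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)

net = MLPClassifier(
    hidden_layer_sizes=(10,),    # few units -> fewer weights to overfit with
    early_stopping=True,         # hold out part of the data as a validation set
    validation_fraction=0.2,     # ...and stop when its score stops improving
    n_iter_no_change=10,
    max_iter=1000,
    random_state=0,
)
net.fit(X, y)
print("stopped after", net.n_iter_, "epochs")
```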
Part D: SVM [Radial Basis and Polynomial kernels] (5 Points)
1. How does it overfit?
In RBF, by choosing a value of sigma (the standard deviation of the Gaussian) too small.
In Polynomial, by choosing the degree of the polynomial too high
By allowing the Lagrange multipliers to get too large
2. How can you reduce overfitting?
Using cross-validation to choose the kernel parameters and the maximum value for the multipliers (see the sketch below).
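An added sketch (scikit-learn assumed, synthetic data) of cross-validating both knobs at once: the RBF width via gamma = 1/(2*sigma^2) and the cap on the Lagrange multipliers via C.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={
        "gamma": [0.001, 0.01, 0.1, 1, 10],  # large gamma = small sigma
        "C": [0.1, 1, 10, 100],              # large C lets the multipliers grow
    },
    cv=5,
)
grid.fit(X, y)
print("best parameters:", grid.best_params_)
```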
1 [ Points] Short Questions
The following short questions should be answered with at most two sentences, and/or a picture. For the true/false questions, answer true or false. If you answer true, provide a short justification; if false, explain why or provide a small counterexample.
1. [ points] Your billionaire friend needs your help. She needs to classify job applications into good/bad categories, and also to detect job applicants who lie in their applications using density estimation to detect outliers. To meet these needs, do you recommend using a discriminative or generative classifier? Why?
Generative: density estimation to detect outliers requires a model of P(X) (or P(X | y)), which a generative classifier provides and a purely discriminative one does not.
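As an added illustration of the generative approach (numpy/scipy assumed; the applicant features are synthetic stand-ins): fit a density to the training data and flag low-likelihood points as possible lies.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # stand-in for honest applications
outlier = np.array([[6.0, -5.0, 7.0]])  # a fabricated-looking application

# Fit a single Gaussian to the training data (the "density estimate").
mean, cov = X.mean(axis=0), np.cov(X, rowvar=False)
logp = multivariate_normal(mean, cov).logpdf

threshold = np.quantile(logp(X), 0.01)  # flag the lowest 1% of densities
print("outlier flagged:", logp(outlier)[0] < threshold)
```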
2. [ points] Your billionaire friend also wants to classify software applications to detect bug-prone applications using features of the source code. This pilot project only has a few applications to be used as training data, though. To create the most accurate classifier, do you recommend using a discriminative or generative classifier? Why?
Generative: with very few training examples, a generative classifier (e.g., naive Bayes) typically approaches its asymptotic error faster, so it tends to be more accurate in the small-sample regime.
3. [ points] Finally, your billionaire friend also wants to classify companies to decide which one to acquire. This project has lots of training data based on several decades of research. To create the most accurate classifier, do you recommend using a discriminative or generative classifier? Why?
Discriminative: given plenty of training data, a discriminative classifier typically attains a lower asymptotic error, since it does not rely on possibly incorrect assumptions about the form of P(X | y). (A sketch contrasting the two regimes follows.)
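An added sketch of the regime change behind questions 2 and 3 (scikit-learn assumed, synthetic data; on any particular dataset the ordering can differ): compare a generative and a discriminative classifier as the training set grows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for n in [20, 100, 1000]:  # few examples vs. lots of examples
    nb = GaussianNB().fit(X_tr[:n], y_tr[:n])                        # generative
    lr = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])   # discriminative
    print(f"n={n:5d}  naive Bayes {nb.score(X_te, y_te):.3f}  "
          f"logistic regression {lr.score(X_te, y_te):.3f}")
```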
4. [ points] Assume that we are using some classifier of fixed complexity. Draw a graph showing two curves: test error vs. the number of training examples and cross-validation error vs. the number of training examples.
[Figure: hand-drawn answer; both curves decrease toward the same asymptotic error as the number of training examples grows.]
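An added numeric counterpart (scikit-learn assumed, synthetic data): the test error of a fixed-complexity classifier typically falls toward an asymptote as the training set grows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for n in [10, 30, 100, 300, 1000]:
    # Same model class throughout; only the amount of training data changes.
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])
    print(f"n={n:5d}  test error {1 - clf.score(X_te, y_te):.3f}")
```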
5. [ points] Assume that we are using an SVM classifier with a Gaussian kernel. Draw a graph showing two curves: training error vs. kernel bandwidth and test error vs. kernel bandwidth.
[Figure: hand-drawn answer with "bandwidth" on the horizontal axis; as the bandwidth shrinks the training error goes to zero while the test error rises, and at very large bandwidths both errors are high.]
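An added sweep over the bandwidth (scikit-learn assumed, synthetic data with label noise) showing the two curves numerically; note gamma = 1/(2*sigma^2), so a small sigma means a very flexible fit.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=5, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for sigma in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma**2)).fit(X_tr, y_tr)
    print(f"sigma={sigma:7.2f}  train error {1 - clf.score(X_tr, y_tr):.3f}"
          f"  test error {1 - clf.score(X_te, y_te):.3f}")
```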
6. [ points] Assume that we are modeling a number of random variables using a Bayesian network with n edges. Draw a graph showing two curves: bias of the estimate of the joint probability vs. n and variance of the estimate of the joint probability vs. n.
[Figure: hand-drawn answer with n on the horizontal axis; the variance curve rises with n while the bias curve falls.]
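An added note making the answer concrete: each edge enlarges some node's parent set, and the number of free parameters grows with the parent sets. For binary variables:

```latex
% Parameter count of a Bayesian network over binary variables (added note):
% a node X_i with p_i parents contributes 2^{p_i} free parameters, so
\[
  \#\text{params} \;=\; \sum_i 2^{p_i}, \qquad \sum_i p_i = n \ (\text{edges}).
\]
% More edges => more parameters => a richer family of joint distributions
% (bias decreases), estimated from the same data (variance increases).
```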
7. [ points] (a) Both PCA and linear regression can be thought of as algorithms for minimizing a sum of squared errors. Explain which error is being minimized in each algorithm.
PCA: $\arg\min \sum_i \|x_i - \hat{x}_i\|^2$, the "reconstruction" error (the perpendicular distance from each point to its projection onto the subspace).
Linear regression: $\arg\min \sum_i (y_i - \hat{y}_i)^2$, the "vertical" error (the residual in the predicted variable only).
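An added numpy demonstration of the difference: on the same centered 2-D data, the regression slope and the first principal component generally disagree, because they minimize different errors.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)
x, y = x - x.mean(), y - y.mean()        # center so no intercept is needed

# Linear regression: argmin_b sum_i (y_i - b*x_i)^2  ("vertical" error)
b = (x @ y) / (x @ x)

# PCA: the top covariance eigenvector minimizes the summed squared
# perpendicular distance ("reconstruction" error) to a line through the mean.
cov = np.cov(np.column_stack([x, y]), rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
v = eigvecs[:, -1]                       # eigenvector with largest eigenvalue

print("regression slope:   ", b)
print("PCA direction slope:", v[1] / v[0])  # generally steeper than b
```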
8. [ points] A long time ago there was a village amidst hundreds of lakes. Two types of fish lived in the region, but only one type in each lake. These types of fish both looked exactly the same, smelled exactly the same when cooked, and had the exact same delicious taste - except one was poisonous and would kill any villager who ate it. The only other difference between the fish was their effect on the pH (acidity) of the lake they occupy. The pH for lakes occupied by the non-poisonous type of fish was distributed according to a Gaussian with unknown mean ($\mu_{safe}$) and variance ($\sigma^2_{safe}$).

4 Bias-Variance Decomposition (12 pts)
1. (6 pts) Suppose you have regression data generated by a polynomial of degree 3. Characterize the bias-variance of the estimates of the following models on the data with respect to the true model by circling the appropriate entry.
                                      Bias            Variance
Linear regression                     low / [high]    [low] / high
Polynomial regression with degree 3   [low] / high    [low] / high
Polynomial regression with degree 10  [low] / high    low / [high]
(brackets mark the circled entry)
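An added simulation (numpy assumed) that reproduces the table empirically: fit each model to many noisy samples of a true cubic and measure the bias squared and variance of the predictions at fixed test inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
true = lambda x: 1 - 2 * x + 0.5 * x**2 + x**3   # the degree-3 ground truth
x_test = np.linspace(-1, 1, 50)

for degree in (1, 3, 10):
    preds = []
    for _ in range(500):                          # 500 independent training sets
        x = rng.uniform(-1, 1, 30)
        y = true(x) + rng.normal(scale=0.3, size=30)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - true(x_test)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"degree {degree:2d}: bias^2 {bias2:.4f}  variance {var:.4f}")
```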
2. Let $Y = f(X) + \epsilon$, where $\epsilon$ has mean zero and variance $\sigma^2$. In k-nearest neighbor (kNN) regression, the prediction of $Y$ at a point $x_0$ is given by the average of the values of $Y$ at the $k$ neighbors closest to $x_0$.
(a) (2 pts) Denote the $\ell$-th nearest neighbor to $x_0$ by $x_{(\ell)}$ and its corresponding $Y$ value by $y_{(\ell)}$. Write the prediction $\hat{f}(x_0)$ of the kNN regression for $x_0$ in terms of $y_{(\ell)}$, $1 \le \ell \le k$.
$$\hat{f}(x_0) = \frac{1}{k} \sum_{\ell=1}^{k} y_{(\ell)}$$
(b) (2 pts) What is the behavior of the bias as $k$ increases?
It increases.
(c) (2 pts) What is the behavior of the variance as $k$ increases?
It decreases (see the derivation below).
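An added derivation making both answers precise, using the prediction from part (a) and the independence of the noise terms:

```latex
% Bias and variance of the kNN estimate at x_0 (added derivation).
% Since \hat f(x_0) = \frac{1}{k}\sum_{\ell=1}^{k} y_{(\ell)} and
% y_{(\ell)} = f(x_{(\ell)}) + \epsilon_\ell with independent noise:
\[
  \operatorname{Bias}\bigl[\hat f(x_0)\bigr]
    = \frac{1}{k}\sum_{\ell=1}^{k} f\bigl(x_{(\ell)}\bigr) - f(x_0),
  \qquad
  \operatorname{Var}\bigl[\hat f(x_0)\bigr] = \frac{\sigma^2}{k}.
\]
% As k grows, the average includes ever more distant neighbors, so the bias
% term typically grows, while the \sigma^2/k variance term shrinks.
```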