Lecture 31-36
Model Selection
Goal
• Model Selection
• Model Assessment
A Regression Problem
• y = f(x) + noise
Linear Regression
Quadratic Regression
Joining the dots
Which is best?
Why Errors?
Measuring Errors:
Loss Functions
• Typical regression loss functions
Squared error: L(Y, f̂(X)) = (Y − f̂(X))²
Absolute error: L(Y, f̂(X)) = |Y − f̂(X)|
Measuring Errors:
Loss Functions
• Typical classification loss functions
0-1 Loss: L(Y, Ĝ(X)) = I(Y ≠ Ĝ(X))
The Goal: Low Test Error
• We want to minimize the generalization (test) error:
Err = E[L(Y, f̂(X))]
• But all we really know is the training error:
err = (1/N) Σ_{i=1}^N L(y_i, f̂(x_i))
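The gap between these two quantities can be seen in a small simulation. This is an illustrative sketch, not from the slides: the sine target, the noise level, `make_data`, `mse`, and the polynomial degrees are all assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw noisy samples from y = sin(2*pi*x) + noise (assumed toy target)."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

def mse(y, yhat):
    return float(np.mean((y - yhat) ** 2))

x_tr, y_tr = make_data(30)      # small training set
x_te, y_te = make_data(1000)    # large test set approximates Err

err_train, err_test = {}, {}
for degree in (1, 3, 9):
    coef = np.polyfit(x_tr, y_tr, degree)   # least-squares polynomial fit
    err_train[degree] = mse(y_tr, np.polyval(coef, x_tr))
    err_test[degree] = mse(y_te, np.polyval(coef, x_te))

# Training error always falls as the model grows; test error need not.
print(err_train)
print(err_test)
```

Because the polynomial families are nested, the training error can only decrease with degree, while the test error eventually turns back up.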
Bias-Variance Decomposition
For squared-error loss and an estimate f̂(x₀):
Err(x₀) = E[(Y − f̂(x₀))² | X = x₀]
        = σ_ε² + [E f̂(x₀) − f(x₀)]² + E[f̂(x₀) − E f̂(x₀)]²
        = Irreducible Error + Bias² + Variance
where σ_ε² is the variance of the additive noise around the true function's mean.
Graphical representation of
bias & variance
[Figure: schematic of model fitting in model space. The truth lies outside the hypothesis space (basic linear regression). The closest fit in population (if ε = 0) differs from the truth by the model bias; the closest fit given our observed realization differs from it by the estimation variance. A shrunken fit in the regularized model space (ridge regression) adds estimation bias but reduces estimation variance.]
Bias & Variance
Decomposition Examples
• kNN Regression
Err(x₀) = σ_ε² + [f(x₀) − (1/k) Σ_{ℓ=1}^k f(x_(ℓ))]² + σ_ε²/k
• Linear Regression
Linear weights on y; averaging over the training set:
(1/N) Σ_i Err(x_i) = σ_ε² + (1/N) Σ_i [f(x_i) − E f̂(x_i)]² + (p/N) σ_ε²
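The σ_ε²/k variance term for kNN can be checked by simulation. This is an illustrative sketch, not from the slides: `knn_predict`, the sine target, and all constants are assumptions; the prediction at a fixed x₀ is averaged over many freshly drawn training sets.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, x0, n, reps = 0.5, 1.0, 200, 500
f = np.sin                       # assumed true regression function

def knn_predict(x, y, x0, k):
    """Average the y-values of the k training points nearest to x0."""
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()

bias2, var = {}, {}
for k in (1, 10):
    preds = []
    for _ in range(reps):        # fresh training set each repetition
        x = rng.uniform(0, 2, n)
        y = f(x) + rng.normal(0, sigma, n)
        preds.append(knn_predict(x, y, x0, k))
    preds = np.array(preds)
    bias2[k] = float((preds.mean() - f(x0)) ** 2)
    var[k] = float(preds.var())

# Err(x0) ≈ sigma^2 + bias^2 + variance; the variance term behaves like sigma^2/k
print(bias2, var)
```

With a dense training sample, the variance for k = 1 should land near σ² = 0.25 and shrink roughly tenfold for k = 10.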
Simulated Example of
Bias-Variance Decomposition
[Figure: for regression with squared-error loss, Bias² + Variance = Prediction error. For classification with 0-1 loss, Bias² + Variance ≠ Prediction error: the decomposition behaves differently, because estimation errors on the right side of the decision boundary don't hurt.]
Optimism of The
Training Error Rate
• Typically, training error rate < true error, because the same data are used both to fit the method and to assess its error:
err = (1/N) Σ_{i=1}^N L(y_i, f̂(x_i)) < Err = E[L(Y, f̂(X))]
i.e., err is overly optimistic.
Estimating Test Error
Adjustment for optimism of training error
Summary: E_y[Err_in] = E_y[err] + (2/N) Σ_{i=1}^N Cov(ŷ_i, y_i)
The optimism is the covariance term: the harder we fit the data, the larger Cov(ŷ_i, y_i), and the more optimistic the training error.
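For OLS this covariance term has a closed form, Σ_i Cov(ŷ_i, y_i) = σ² trace(S) = p σ², which a simulation can verify. This sketch is not from the slides; the fixed design, seed, and constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, sigma, reps = 50, 5, 1.0, 4000
X = rng.normal(size=(N, p))
beta = rng.normal(size=p)
S = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix: yhat = S y

# Simulate many response vectors from the same fixed design
Y = X @ beta + rng.normal(0, sigma, size=(reps, N))
Yhat = Y @ S.T                            # each row: fitted values for that draw

# Empirical sum_i Cov(yhat_i, y_i); theory says sigma^2 * trace(S) = p * sigma^2
cov_sum = sum(float(np.cov(Yhat[:, i], Y[:, i])[0, 1]) for i in range(N))
print(cov_sum)
```

The optimism here is (2/N)·p·σ², which is exactly the penalty that appears in the Cp/AIC-style estimates on the next slides.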
Estimates of In-Sample
Prediction Error
• General form of the in-sample estimate:
Êrr_in = err + ω̂
where ω̂ is an estimate of the optimism.
AIC & BIC
AIC = −LL(Data | MLE params) + (# of parameters)
BIC = −LL(Data | MLE params) + (log N / 2) · (# of parameters)
(smaller is better; BIC penalizes model size more heavily once log N > 2)
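BIC-based selection in the slide's form can be sketched as follows. The Gaussian log-likelihood, the polynomial family, and all data-generating constants are my assumptions; the true curve is quadratic, so BIC should favor low-degree fits over high-degree ones.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
N = 200
x = rng.uniform(-1, 1, N)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.2, N)   # true model: quadratic

def bic(degree):
    """BIC = -LL(Data | MLE params) + (log N / 2) * (# of parameters)."""
    coef = np.polyfit(x, y, degree)
    sigma2 = float(np.mean((y - np.polyval(coef, x)) ** 2))  # MLE noise variance
    loglik = -N / 2 * (math.log(2 * math.pi * sigma2) + 1)   # Gaussian LL at MLE
    n_params = degree + 2                  # polynomial coefficients + variance
    return -loglik + (math.log(N) / 2) * n_params

scores = {d: bic(d) for d in range(1, 7)}
best = min(scores, key=scores.get)
print(best, scores)
```

Degree 1 underfits (poor likelihood), while degrees above 2 buy almost no likelihood for an extra (log N)/2 penalty per parameter.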
MDL
(Minimum Description Length)
• Regularity ~ Compressibility
• Learning ~ Finding regularities
[Diagram: input samples (Rⁿ) feed the learning model, which outputs predictions (R¹); the real model produces the real class (R¹); comparing the two gives the error.]
MDL
(Minimum Description Length)
length = −log Pr(y | θ, M, X) − log Pr(θ | M)
MDL principle: choose the model with the minimum description length.

Vapnik's Structural Risk Minimization
Err_true ≤ Err_train + (ε/2)(1 + √(1 + 4·Err_train/ε))
where ε = a₁ · [h(log(a₂N/h) + 1) − log(η/4)] / N, with probability 1 − η
h = VC dimension (a measure of f's power)
As h increases, the bound becomes looser.
SRM: a method of selecting a class F from a family of nested classes by minimizing the bound.
Err Estimation
• Cross-Validation
• Bootstrap
Cross-Validation
[Diagram: K-fold split — the data are divided into K parts; each part in turn serves as the test set while the remaining K−1 parts form the training set.]
CV(α) = (1/N) Σ_{i=1}^N L(y_i, f̂^(−κ(i))(x_i, α))
where f̂^(−κ(i)) is fit with the fold containing observation i removed.
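The CV formula can be sketched from scratch. This is illustrative only: `kfold_cv`, the synthetic data, and the degrees are my assumptions, with numpy's `polyfit` standing in for a generic learner.

```python
import numpy as np

rng = np.random.default_rng(4)

def kfold_cv(x, y, degree, K=5):
    """K-fold CV estimate of prediction error for a polynomial fit."""
    folds = np.array_split(rng.permutation(len(x)), K)
    losses = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coef = np.polyfit(x[train], y[train], degree)   # fit with fold k held out
        losses.extend((y[test] - np.polyval(coef, x[test])) ** 2)
    return float(np.mean(losses))          # (1/N) * sum of held-out losses

x = rng.uniform(-1, 1, 100)
y = np.sin(3 * x) + rng.normal(0, 0.2, 100)
cv = {d: kfold_cv(x, y, d) for d in (1, 3, 9)}
print(cv)
```

Every observation is predicted exactly once by a model that never saw it, so the average held-out loss estimates test error rather than training error.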
How many folds?
As k increases (K-fold toward leave-one-out), computation increases; the bias of the CV estimate decreases, but its variance can increase.
Cross-Validation: Choosing K
For linear smoothers ŷ = S y, leave-one-out CV has a closed form — no refitting needed:
CV = (1/N) Σ_{i=1}^N [ (y_i − f̂(x_i)) / (1 − S_ii) ]²
where S_ii is the i'th diagonal element of S.
• GCV provides a computationally cheaper approximation, replacing each S_ii with trace(S)/N:
GCV = (1/N) Σ_{i=1}^N [ (y_i − f̂(x_i)) / (1 − trace(S)/N) ]²
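The leave-one-out shortcut is exact for linear smoothers. This sketch (OLS as the smoother; the design, coefficients, and seed are my assumptions) checks the closed form against explicit refitting.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 40, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1.0, N)

S = X @ np.linalg.inv(X.T @ X) @ X.T          # smoother matrix: yhat = S y
yhat = S @ y

# Shortcut: one fit, rescale each residual by 1 - S_ii
cv_shortcut = float(np.mean(((y - yhat) / (1 - np.diag(S))) ** 2))

# Explicit leave-one-out: N separate refits
errs = []
for i in range(N):
    keep = np.arange(N) != i
    coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    errs.append(float(y[i] - X[i] @ coef) ** 2)
cv_explicit = float(np.mean(errs))

print(cv_shortcut, cv_explicit)
```

The two numbers agree to machine precision, which is why LOOCV is essentially free for linear smoothers.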
Bootstrap: Main Concept
"The bootstrap is a computer-based method of statistical inference that can answer many real statistical questions without formulas."
(An Introduction to the Bootstrap, Efron and Tibshirani, 1993)
How does it work?
Bootstrap:
Error Estimation with Err(1)
A CV-inspired improvement on Err_boot:
Êrr^(1) = (1/N) Σ_{i=1}^N (1/|C^(−i)|) Σ_{b ∈ C^(−i)} L(y_i, f̂*ᵇ(x_i))
where C^(−i) is the set of bootstrap samples b that do not contain observation i.
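A sketch of the Err(1) estimator (illustrative only; the linear learner, the toy data, and B are my assumptions): each point is scored only by bootstrap fits whose sample never contained it.

```python
import numpy as np

rng = np.random.default_rng(6)
N, B = 50, 200
x = rng.uniform(-1, 1, N)
y = x**2 + rng.normal(0, 0.2, N)

fits, contains = [], []
for _ in range(B):
    idx = rng.integers(0, N, N)               # bootstrap sample: N draws with replacement
    fits.append(np.polyfit(x[idx], y[idx], 1))
    contains.append(np.isin(np.arange(N), idx))

per_point = []
for i in range(N):
    # C^{-i}: bootstrap samples b that do NOT contain observation i
    losses = [(y[i] - np.polyval(c, x[i])) ** 2
              for c, has in zip(fits, contains) if not has[i]]
    if losses:                                # |C^{-i}| > 0 (true w.h.p. for B = 200)
        per_point.append(float(np.mean(losses)))

err1 = float(np.mean(per_point))
print(err1)
```

With B = 200, each point is left out of roughly 0.368·B ≈ 74 samples, so every observation gets assessed by fits that never saw it.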
Bootstrap:
Error Estimation with Err(.632)
An improvement on Err(1) in light-fitting cases.
Probability of z_i NOT being chosen when 1 point is uniformly sampled from Z: 1 − 1/N
Probability of z_i NOT being chosen when Z is sampled N times: (1 − 1/N)^N
Probability of z_i being chosen AT LEAST once when Z is sampled N times: 1 − (1 − 1/N)^N ≈ 1 − e^(−1) ≈ 0.632
Êrr^(.632) = err + .632 (Êrr^(1) − err) = .368 · err + .632 · Êrr^(1)
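The 0.632 constant can be checked numerically with a quick stdlib sketch (N is arbitrary):

```python
import math

N = 100
p_at_least_once = 1 - (1 - 1 / N) ** N    # P(z_i appears in a bootstrap sample of size N)
print(round(p_at_least_once, 3))          # approaches 1 - 1/e ≈ 0.632 as N grows
```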
Bootstrap:
Error Estimation with Err(.632+)
An improvement on Err(.632) by adaptively
accounting for overfitting