Lecture 31-36

Model Assessment and Selection

1
Goal

• Model Selection

• Model Assessment

2
A Regression Problem
• y = f(x) + noise
• Can we learn f from this data?
• Let's consider three methods...

3
Linear Regression

4
Quadratic Regression

5
Joining the dots

6
Which is best?

• Why not choose the method with the best fit to the data?

"How well are you going to predict future data drawn from the same distribution?"

7
Model Selection and Assessment

• Model Selection: estimating the performance of different models in order to choose the best one (the one with the minimum test error)

• Model Assessment: having chosen a model, estimating its prediction error on new data

8
Why Errors

• Why do we want to study errors?

• In a data-rich situation, split the data into three parts:
  Train | Validation | Test
  (validation set: model selection; test set: model assessment)

• But that's not usually the case

9
Overall Motivation
• Errors
 Measurement of errors (loss functions)
 Decomposing test error into bias & variance

• Estimating the true error
 Estimating in-sample error (analytically): AIC, BIC, MDL, SRM with VC
 Estimating extra-sample error (efficient sample reuse): cross-validation & bootstrapping

10
Measuring Errors: Loss Functions
• Typical regression loss functions
 Squared error: L(y, \hat{f}(x)) = (y - \hat{f}(x))^2
 Absolute error: L(y, \hat{f}(x)) = |y - \hat{f}(x)|

11
Measuring Errors: Loss Functions
• Typical classification loss functions
 0-1 loss: L(G, \hat{G}(X)) = I(G \neq \hat{G}(X))
 Log-likelihood (cross-entropy loss / deviance): L(G, \hat{p}(X)) = -2 \log \hat{p}_G(X)

12
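As a minimal sketch (not from the lecture), these loss functions can be written in NumPy; the function names below are my own:

import numpy as np

def squared_error(y, f_hat):
    # L(y, f(x)) = (y - f(x))^2
    return (y - f_hat) ** 2

def absolute_error(y, f_hat):
    # L(y, f(x)) = |y - f(x)|
    return np.abs(y - f_hat)

def zero_one_loss(g, g_hat):
    # 1 if the predicted class differs from the true class, else 0
    return (g != g_hat).astype(float)

def deviance_loss(g, p_hat):
    # -2 * log-likelihood of the true class under the predicted class probabilities
    # g: integer class labels, p_hat: array of shape (n_samples, n_classes)
    return -2.0 * np.log(p_hat[np.arange(len(g)), g])

# tiny usage example
y, f_hat = np.array([1.0, 2.0]), np.array([0.5, 2.5])
print(squared_error(y, f_hat).mean(), absolute_error(y, f_hat).mean())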
The Goal: Low Test Error
• We want to minimize the generalization (test) error:

  Err = E[L(Y, \hat{f}(X))]

• But all we really know is the training error:

  \overline{err} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}(x_i))

• And this is a bad estimate of the test error

13
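A small simulation (my own sketch, assuming a synthetic sin-shaped f and Gaussian noise) illustrating why the training error is a bad estimate of the test error for fits of increasing complexity (roughly: linear, quadratic, and a near-interpolating high-degree fit like "joining the dots"):

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)                      # assumed "true" f for illustration
x_train = rng.uniform(0, 3, 20)
y_train = f(x_train) + rng.normal(0, 0.3, 20)    # y = f(x) + noise
x_test = rng.uniform(0, 3, 1000)
y_test = f(x_test) + rng.normal(0, 0.3, 1000)

for degree in (1, 2, 15):                        # linear, quadratic, near "joining the dots"
    coefs = np.polyfit(x_train, y_train, degree)
    err_train = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    err_test = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {degree:2d}: train MSE {err_train:.3f}, test MSE {err_test:.3f}")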

Bias, Variance & Complexity

Training error can always be reduced by increasing model complexity, but this risks over-fitting.

[Figure: typical behaviour of training and test error as a function of model complexity]

14
Decomposing Test Error
Model: Y = f(X) + \varepsilon, with E(\varepsilon) = 0 and Var(\varepsilon) = \sigma_\varepsilon^2

For squared-error loss, at an input point x_0:

  Err(x_0) = E[(Y - \hat{f}(x_0))^2 | X = x_0]
           = \sigma_\varepsilon^2 + [E\hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E\hat{f}(x_0)]^2
           = Irreducible error + Bias^2 + Variance

 Irreducible error: variance of the target Y around its true mean
 Bias^2: squared deviation of the average estimate from the true function's mean
 Variance: expected squared deviation of our estimate around its mean

15
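A Monte Carlo sketch of this decomposition (my own, under an assumed true function, noise level and fitting method): repeatedly redraw training sets, refit, and measure bias and variance of the prediction at a fixed point x0.

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * x)          # assumed true function
sigma = 0.3                          # assumed noise standard deviation
x0, degree, n, reps = 1.5, 3, 30, 2000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 3, n)                         # redraw a training set
    y = f(x) + rng.normal(0, sigma, n)
    preds[r] = np.polyval(np.polyfit(x, y, degree), x0)

irreducible = sigma ** 2
bias_sq = (preds.mean() - f(x0)) ** 2                # [E f_hat(x0) - f(x0)]^2
variance = preds.var()                               # E[(f_hat(x0) - E f_hat(x0))^2]
print(irreducible, bias_sq, variance, irreducible + bias_sq + variance)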
Further Bias Decomposition
• For linear models (e.g. ridge regression), the bias can be further decomposed into an Average Model Bias term and an Average Estimation Bias term:

  E_{x_0}[f(x_0) - E\hat{f}_\alpha(x_0)]^2 = E_{x_0}[f(x_0) - \beta_*^T x_0]^2 + E_{x_0}[\beta_*^T x_0 - E\hat{f}_\alpha(x_0)]^2

• \beta_* is the best-fitting linear approximation:

  \beta_* = \arg\min_\beta E(f(X) - \beta^T X)^2

• For standard linear regression, the estimation bias is zero

16
Graphical representation of bias & variance

[Figure: schematic of model space. The truth lies outside the hypothesis space of basic linear regression; the closest fit in population (obtained if \varepsilon = 0) differs from the truth (model bias). The closest fit given our observed realization scatters around it (estimation variance). A shrunken fit in the regularized model space (ridge regression) adds estimation bias while reducing variance.]

17
Bias & Variance Decomposition Examples
• kNN regression (k neighbours):

  Err(x_0) = \sigma_\varepsilon^2 + [f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})]^2 + \frac{\sigma_\varepsilon^2}{k}

• Linear regression (p parameters, linear weights on y), averaging over the training set:

  \frac{1}{N}\sum_{i} Err(x_i) = \sigma_\varepsilon^2 + \frac{1}{N}\sum_{i} [f(x_i) - E\hat{f}(x_i)]^2 + \frac{p}{N}\sigma_\varepsilon^2

18
Simulated Example of Bias-Variance Decomposition

[Figure: simulated prediction error, squared bias and variance curves versus model complexity]

• Regression with squared-error loss: prediction error = Bias^2 + Variance (plus irreducible error)
• The bias-variance behaviour is different for classification with 0-1 loss than for squared-error loss: estimation errors on the right side of the decision boundary don't hurt!

19
Optimism of the Training Error Rate
• Typically, training error < true error, because the same data are used both to fit the method and to assess its error:

  \overline{err} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}(x_i))  <  Err = E[L(Y, \hat{f}(X))]

  i.e. the training error is overly optimistic

20
Estimating Test Error

• Can we estimate the discrepancy between \overline{err} and Err (the extra-sample error)?

• Err_in --- the in-sample error: the expectation is taken over N new responses drawn at each training input x_i:

  Err_{in} = \frac{1}{N} \sum_{i=1}^{N} E_{Y^{new}}[L(Y_i^{new}, \hat{f}(x_i))]

21
Optimism
Adjustment for the optimism of the training error.

• Define the optimism as the difference between the in-sample error and the expected training error. Summary: for squared error, 0-1 and other loss functions,

  op \equiv Err_{in} - E_y[\overline{err}] = \frac{2}{N} \sum_{i=1}^{N} Cov(\hat{y}_i, y_i)

  so that  Err_{in} = E_y[\overline{err}] + \frac{2}{N} \sum_{i=1}^{N} Cov(\hat{y}_i, y_i)

• For a linear fit with d independent inputs/basis functions:

  Err_{in} = E_y[\overline{err}] + 2 \frac{d}{N} \sigma_\varepsilon^2

  → optimism grows linearly with the number of inputs d
  → optimism decreases as the training sample size N increases

22
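A simulation sketch (my own setup, not from the lecture) checking the covariance formula for an ordinary least-squares fit: the optimism (2/N) Σ Cov(ŷ_i, y_i) should come out close to 2dσ²/N.

import numpy as np

rng = np.random.default_rng(2)
N, d, sigma = 50, 5, 1.0
X = rng.normal(size=(N, d))
beta = rng.normal(size=d)
f_true = X @ beta
H = X @ np.linalg.inv(X.T @ X) @ X.T              # hat matrix of the linear fit

reps = 5000
Y = f_true + sigma * rng.normal(size=(reps, N))   # many realizations of y at the same x_i
Y_hat = Y @ H.T                                   # OLS fitted values for each realization

# sum_i Cov(y_hat_i, y_i), estimated across realizations
cov_sum = np.mean((Y_hat - Y_hat.mean(0)) * (Y - Y.mean(0)), axis=0).sum()
print("(2/N) * sum_i Cov:", 2 * cov_sum / N)
print("2 * d * sigma^2 / N:", 2 * d * sigma**2 / N)   # should roughly agree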
Ways to Estimate Prediction Error

• In-sample error estimates:


 AIC
 BIC
 MDL
 SRM

• Extra-sample error estimates:


 Cross-Validation
• Leave-one-out
• K-fold
 Bootstrap

23
Estimates of In-Sample Prediction Error
• General form of the in-sample estimate:

  \widehat{Err}_{in} = \overline{err} + \hat{op}

• For a linear fit with d parameters:

  C_p = \overline{err} + 2 \frac{d}{N} \hat{\sigma}_\varepsilon^2,  the so-called C_p statistic

24
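A sketch of the C_p computation for least-squares fits (my own helper names; σ̂² is assumed to come from a low-bias, e.g. full, model):

import numpy as np

def cp_statistic(X, y, sigma2_hat):
    # C_p = err_bar + 2 * (d / N) * sigma2_hat for a linear least-squares fit
    N, d = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    err_bar = np.mean((y - X @ beta_hat) ** 2)     # training error
    return err_bar + 2.0 * d / N * sigma2_hat

# usage sketch: sigma2_hat estimated from the residuals of the full model
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)); y = X[:, 0] + rng.normal(0, 0.5, 100)
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_hat = resid @ resid / (100 - 4)
print(cp_statistic(X[:, :2], y, sigma2_hat), cp_statistic(X, y, sigma2_hat))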
AIC & BIC

Similarly: Akaike Information Criterion (AIC)

  AIC = -\frac{2}{N} \cdot loglik + 2 \frac{d}{N}

Bayesian Information Criterion (BIC)

  BIC = -2 \cdot loglik + (\log N) \cdot d

25
AIC & BIC

Equivalently (up to scaling), written as criteria to maximize:

  AIC = LL(Data | MLE params) - (# of parameters)

  BIC = LL(Data | MLE params) - \frac{\log N}{2} (# of parameters)

26
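A sketch computing AIC and BIC from a Gaussian log-likelihood, using the scaling of the earlier slide (my own helper names; taking d as the number of regression coefficients is an assumption here):

import numpy as np

def gaussian_loglik(y, y_hat, sigma2):
    # log-likelihood of y under N(y_hat, sigma2)
    N = len(y)
    return -0.5 * N * np.log(2 * np.pi * sigma2) - 0.5 * np.sum((y - y_hat) ** 2) / sigma2

def aic_bic(y, y_hat, d):
    # d = number of fitted parameters; sigma2 estimated by maximum likelihood
    N = len(y)
    sigma2 = np.mean((y - y_hat) ** 2)
    ll = gaussian_loglik(y, y_hat, sigma2)
    aic = -2.0 / N * ll + 2.0 * d / N          # AIC as on the previous slide
    bic = -2.0 * ll + np.log(N) * d            # BIC as on the previous slide
    return aic, bic

# usage: compare a 2-parameter and a 4-parameter linear fit on the same data
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4)); y = X[:, 0] - X[:, 1] + rng.normal(0, 0.5, 200)
for d in (2, 4):
    beta = np.linalg.lstsq(X[:, :d], y, rcond=None)[0]
    print(d, aic_bic(y, X[:, :d] @ beta, d))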
MDL (Minimum Description Length)
• Regularity ~ Compressibility
• Learning ~ Finding regularities

[Diagram: input samples in R^n are passed to the learning model, which outputs predictions in R^1; these are compared (=?) with the real classes in R^1 produced by the real model, giving the error]

27
MDL (Minimum Description Length)
• Regularity ~ Compressibility
• Learning ~ Finding regularities

  length = -\log Pr(y | \theta, M, X) - \log Pr(\theta | M)

 First term: length of transmitting the discrepancy given the model, under the optimal coding for that model
 Second term: description length of the model itself under optimal coding

• MDL principle: choose the model with the minimum description length

• Equivalent to maximizing the posterior: Pr(y | \theta, M, X) \cdot Pr(\theta | M)

28
SRM with VC (Vapnik-Chervonenkis) Dimension
• Vapnik showed that, with probability 1 - \eta,

  Err_{true} \le Err_{train} + \frac{\varepsilon}{2}\left(1 + \sqrt{1 + \frac{4\,Err_{train}}{\varepsilon}}\right)

  where  \varepsilon = a_1 \frac{h\,(\log(a_2 N / h) + 1) - \log(\eta/4)}{N}

  h = VC dimension (a measure of the power of the class of functions f); the bound loosens as h increases

• Structural Risk Minimization: a method of selecting a class F from a family of nested classes by minimizing this bound

29
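A sketch evaluating the Vapnik bound numerically. The slide leaves a1 and a2 unspecified; the values a1 = 4, a2 = 2 used below are commonly quoted worst-case choices and are an assumption here:

import numpy as np

def vc_error_bound(err_train, N, h, eta, a1=4.0, a2=2.0):
    # With probability 1 - eta:
    #   Err_true <= Err_train + (eps/2) * (1 + sqrt(1 + 4 * Err_train / eps))
    # where eps = a1 * (h * (log(a2 * N / h) + 1) - log(eta / 4)) / N.
    # a1, a2 are unspecified constants on the slide; 4 and 2 are an assumed worst-case choice.
    eps = a1 * (h * (np.log(a2 * N / h) + 1.0) - np.log(eta / 4.0)) / N
    return err_train + 0.5 * eps * (1.0 + np.sqrt(1.0 + 4.0 * err_train / eps))

# the bound loosens as the VC dimension h grows relative to N
for h in (5, 20, 100):
    print(h, vc_error_bound(err_train=0.10, N=1000, h=h, eta=0.05))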
Err_in Estimation

• All of these criteria trade off the fit to the data against the model complexity:

  AIC = \overline{err} + 2 \frac{d}{N} \hat{\sigma}_\varepsilon^2

  BIC = -2 \cdot loglik + (\log N) \cdot d

  MDL: length = -\log Pr(y | \theta, M, X) - \log Pr(\theta | M)

  VC: Err_{true} \le Err_{train} + \frac{\varepsilon}{2}\left(1 + \sqrt{1 + \frac{4\,Err_{train}}{\varepsilon}}\right)

30
Estimation of
Extra-Sample Err

• Cross Validation

• Bootstrap

31
Cross-Validation

[Diagram: K-fold split of the data; each fold in turn serves as the test set while the remaining folds are used for training]

  CV(\alpha) = \frac{1}{N} \sum_{i=1}^{N} L\left(y_i, \hat{f}^{-\kappa(i)}(x_i, \alpha)\right)

  where \kappa(i) is the fold containing observation i and \hat{f}^{-\kappa(i)} is fitted with that fold removed.

32
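A from-scratch K-fold cross-validation sketch (my own function names), using polynomial regression with squared-error loss as an assumed fitting method:

import numpy as np

def k_fold_cv_mse(x, y, degree, K=10, seed=0):
    # CV = (1/N) * sum_i L(y_i, f^{-k(i)}(x_i)) with squared-error loss
    N = len(x)
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, K)
    sq_err = np.empty(N)
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)   # fit with fold k(i) removed
        sq_err[test_idx] = (y[test_idx] - np.polyval(coefs, x[test_idx])) ** 2
    return sq_err.mean()

# usage: choose a polynomial degree by 10-fold CV
rng = np.random.default_rng(5)
x = rng.uniform(0, 3, 60); y = np.sin(2 * x) + rng.normal(0, 0.3, 60)
print({d: round(k_fold_cv_mse(x, y, d), 3) for d in (1, 2, 3, 8)})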
How many folds?

Moving from small k (k-fold) towards k = N (leave-one-out), i.e. as k increases:
 computation increases
 bias of the error estimate decreases
 variance increases (variance decreases for smaller k)

33
Cross-Validation: Choosing K

Popular choices for K: 5, 10, or N (leave-one-out)

34


Generalized Cross-Validation

• LOOCV can be computationally expensive for linear fitting with large N

• Linear fitting: \hat{y} = S y  (S is a smoother matrix)

• For linear fitting under squared-error loss, LOOCV has the closed form

  \frac{1}{N} \sum_{i=1}^{N} \left[ y_i - \hat{f}^{-i}(x_i) \right]^2 = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{y_i - \hat{f}(x_i)}{1 - S_{ii}} \right]^2

  where S_{ii} is the i-th diagonal element of S.

• GCV provides a computationally cheaper approximation, replacing S_{ii} by trace(S)/N:

  GCV = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{y_i - \hat{f}(x_i)}{1 - trace(S)/N} \right]^2

35
Bootstrap: Main Concept
"The bootstrap is a computer-based method of statistical inference that can answer many real statistical questions without formulas"
(An Introduction to the Bootstrap, Efron and Tibshirani, 1993)

Step 1: Draw samples with replacement from the data

Step 2: Calculate the statistic of interest on each bootstrap sample

36
How Does It Work?

• We want the sampling distribution of the sample mean \bar{x}

• In practice we cannot afford to draw a large number of independent random samples from the population; in simple cases the theory tells us the sampling distribution

• Bootstrap idea: the sample stands for the population, and the distribution of \bar{x} over many resamples stands for its sampling distribution

37
Bootstrap: Error Estimation with Err_boot

• Bootstrap estimate of the variance of a statistic S(Z):

  \widehat{Var}[S(Z)] = \frac{1}{B-1} \sum_{b=1}^{B} \left( S(Z^{*b}) - \bar{S}^* \right)^2,  where  \bar{S}^* = \frac{1}{B} \sum_{b=1}^{B} S(Z^{*b})

  This estimates Var_{\hat{F}}[S(Z)], which depends on the unknown true distribution F.

• A straightforward application of the bootstrap to error prediction:

  \widehat{Err}_{boot} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L\left(y_i, \hat{f}^{*b}(x_i)\right)

38
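A sketch of Err_boot (my own names), again using polynomial regression with squared-error loss: each bootstrap fit is evaluated on the original data, which is why the estimate tends to be optimistic.

import numpy as np

def err_boot(x, y, degree, B=200, seed=0):
    # Err_boot = (1/B)(1/N) * sum_b sum_i L(y_i, f^{*b}(x_i))
    # f^{*b} is fit on a bootstrap sample drawn with replacement from the data.
    rng = np.random.default_rng(seed)
    N = len(x)
    losses = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, N, N)                           # sample N points with replacement
        coefs = np.polyfit(x[idx], y[idx], degree)            # fit on bootstrap sample Z*b
        losses[b] = np.mean((y - np.polyval(coefs, x)) ** 2)  # evaluate on the ORIGINAL data
    return losses.mean()

rng = np.random.default_rng(7)
x = rng.uniform(0, 3, 50); y = np.sin(2 * x) + rng.normal(0, 0.3, 50)
print(err_boot(x, y, degree=3))   # optimistic: training points also appear in the fitting samples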
Bootstrap: Error Estimation with Err^(1)

A CV-inspired improvement on Err_boot (the leave-one-out bootstrap): each point is evaluated only with the bootstrap fits whose samples do not contain it

  \widehat{Err}^{(1)} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} L\left(y_i, \hat{f}^{*b}(x_i)\right)

  where C^{-i} is the set of indices of the bootstrap samples that do not contain observation i.

39
Bootstrap: Error Estimation with Err^(.632)
An improvement on Err^(1) in light-fitting cases:

  \widehat{Err}^{(.632)} = 0.368 \cdot \overline{err} + 0.632 \cdot \widehat{Err}^{(1)}

Why 0.632? For N data points Z = (z_1, ..., z_N):
 Probability of z_i NOT being chosen when 1 point is uniformly sampled from Z: 1 - 1/N
 Probability of z_i NOT being chosen when Z is sampled N times: (1 - 1/N)^N
 Probability of z_i being chosen AT LEAST once when Z is sampled N times: 1 - (1 - 1/N)^N \approx 1 - e^{-1} \approx 0.632

Equivalently:  \widehat{Err}^{(.632)} = \overline{err} + 0.632 (\widehat{Err}^{(1)} - \overline{err}) = 0.368 \cdot \overline{err} + 0.632 \cdot \widehat{Err}^{(1)}

40
Bootstrap: Error Estimation with Err^(.632+)
An improvement on Err^(.632) that adaptively accounts for overfitting:

• Depending on the amount of overfitting, the best error estimate is as little as Err^(.632), or as much as Err^(1), or something in between

• Err^(.632+) is like Err^(.632) with adaptive weights, with Err^(1) weighted at least .632

• Err^(.632+) adaptively mixes the training error and the leave-one-out bootstrap error using the relative overfitting rate R

41
Bootstrap: Error Estimation with Err^(.632+)

\widehat{Err}^{(.632+)} ranges from \widehat{Err}^{(.632)} when there is minimal overfitting (R = 0) to \widehat{Err}^{(1)} when there is maximal overfitting (R = 1)

42
Cross-Validation & Bootstrap
• Why bother with cross-validation and the bootstrap when analytical estimates are known?

1) AIC, BIC, MDL and SRM all require knowledge of d (the effective number of parameters, or the VC dimension), which is difficult to obtain in most situations.

2) The bootstrap and cross-validation give results similar to the above, but are also applicable in more complex situations.

3) Estimating the noise variance requires a roughly working model; cross-validation and the bootstrap work well even if the model is far from correct.

43
Conclusion
• Test error plays a crucial role in model selection
• AIC, BIC and SRM with VC have the advantage that you only need the training error
• If the VC dimension is known, SRM is a good method for model selection – it requires much less computation than CV or the bootstrap, but is wildly conservative
• Methods like CV and the bootstrap give tighter error bounds, but may have more variance
• Asymptotically, AIC and leave-one-out CV should be the same
• Asymptotically, BIC and a carefully chosen k-fold CV should be the same
• BIC is what you want if you want the best structure instead of the best predictor
• The bootstrap has much wider applicability than just estimating prediction error

44
