
CHAPTER TWO

Regression with penalization: subset selection

Module Code: EC4308


Module title: Machine Learning and Economic Forecasting
Instructor: Vu Thanh Hai
Reference: ISLR 6.1
Acknowledgement: I adapt Dr. Denis Tkachenko’s notes with
some minor modifications. Thanks to Denis!
Lecture outline

▪ Introduction and motivation


▪ Best subset selection
▪ Forward stepwise selection
▪ Backward stepwise selection
▪ Concluding thoughts: pros and cons.
Introduction and motivation
▪ The linear regression model that you have learned:
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥1,𝑖 + 𝛽2 𝑥2,𝑖 + ⋯ + 𝛽𝑝 𝑥𝑝,𝑖 + 𝜀𝑖

▪ Very commonly used to model 𝐸(𝑌 ∣ 𝑋) - recall that this is what we are after
in predictive modelling under squared loss.
▪ Can describe nonlinear but still additive relationships: the model only needs to be
linear in the coefficients (the 𝛽's).
▪ Linearity yields advantages in inference, and in many applications delivers
surprisingly good approximations.
Motivation: is OLS good enough?
▪ We typically fit such models with OLS; why do we need to look for alternatives?
▪ 2 key concerns: prediction accuracy and model interpretability.
▪ OLS will have low bias if the true relationship is approximately linear.
▪ If 𝑃 << 𝑁 (the "typical" metrics textbook setting), OLS also has low variance, and
there is little need to do anything.
Motivation: prediction accuracy
▪ When 𝑁 is not much larger than 𝑃 - OLS predictions will have high variance
→ overfit → poor OOS prediction.
▪ When 𝑃 > 𝑁 (or 𝑃 >> 𝑁 ), no unique solution: OLS variance is infinite →
cannot use this method at all.
▪ Our goal is to get a linear combination of 𝑋 to get a good forecast of 𝑌.
▪ We want to reduce the variance by somehow reducing model complexity, with a minor
cost in terms of bias - constrain coefficients on some 𝑋's to 0 or shrink them.
Motivation: Interpretability
▪ Oftentimes 𝑋 's entering regression may be irrelevant.

▪ OLS will almost never set coefficients exactly to 0 .

▪ By removing potentially many such variables (setting their 𝛽 's to 0 ), we enhance


model interpretability.

▪ Sacrifice "small details" to get the "big picture".

▪ Caution: "interpretation" here need not mean causal effect! Rather, which variables
can be important predictors.
Introduction: Three important approaches
▪ Subset selection: find a (hopefully small) subset of the X's that yields good
performance. Fit OLS on this subset.

▪ Shrinkage: fit a model using all 𝑃 variables, but shrink the coefficients towards 0
relative to OLS (some can go to exactly 0).

▪ Dimension reduction: attempt to "summarize" the 𝑃 predictors with 𝑀 < 𝑃 linear
combinations of the variables (e.g., principal components).

▪ We will talk about the first approach today.


Subset Selection
▪ We try to find a possibly sparse subset of 𝑋 's to forecast 𝑌 well: find 𝑆 << 𝑁
predictors needed for a high quality forecast.
▪ Dropping a variable = setting its coefficient to zero. A form of penalty called
the 𝐿0 penalty.
▪ Bias/variance tradeoff: leave out a variable highly correlated to signal → bias;
put in too many X's → high variance.
▪ In principle, want to try out all possible combinations and choose the one
with the best OOS performance.
Example: the ‘Credit’ data set
▪ Data from the ISLR package on credit card balances and 10 predictors, 𝑁 = 400.
ID: Identification
Income: Income in $10,000s
Limit: Credit limit
Rating: Credit rating
Cards: Number of credit cards
Age: Age in years
Education: Number of years of education
Gender: A factor with levels Male and Female
Student: A factor with levels No and Yes indicating whether the individual was a student
Married: A factor with levels No and Yes indicating whether the individual was married
Ethnicity: A factor with levels African American, Asian, and Caucasian indicating the individual's ethnicity
Balance: Average credit card balance in $
Best Subset Selection
▪ Let 𝑝̄ ≤ 𝑝 be the maximum model size.
▪ Goal: for each 𝑘 ∈ {0, 1, 2, …, 𝑝̄}, find the subset of size 𝑘 that gives the best 𝑅².
Then, pick the overall best model.

▪ For 𝑘 = 0, 1, …, 𝑝̄:
• Fit all models that contain exactly 𝑘 predictors. There are (𝑝 choose 𝑘) such models.
If 𝑘 = 0, the forecast is the unconditional mean.

• Pick the best (e.g., highest 𝑅²) among these models, and denote it by ℳ𝑘.

▪ Optimize over ℳ0, …, ℳ𝑝̄ using cross-validation (or other criteria like AIC or BIC).
A minimal code sketch follows below.
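The procedure above can be written compactly; below is a minimal sketch, assuming X is an (n, p) NumPy array and y a length-n array. The function name best_subset and the use of scikit-learn's LinearRegression are illustrative choices, not part of the lecture or of ISLR.

```python
# Minimal sketch of best subset selection. Names are illustrative only.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression


def best_subset(X, y, p_bar=None):
    """For each size k = 0, ..., p_bar, return (best column indices, in-sample R^2)."""
    n, p = X.shape
    p_bar = p if p_bar is None else p_bar
    results = {0: ((), 0.0)}  # k = 0: the forecast is the unconditional mean of y
    for k in range(1, p_bar + 1):
        best_cols, best_r2 = None, -np.inf
        for cols in combinations(range(p), k):  # all C(p, k) candidate models
            Xk = X[:, list(cols)]
            r2 = LinearRegression().fit(Xk, y).score(Xk, y)
            if r2 > best_r2:
                best_cols, best_r2 = cols, r2
        results[k] = (best_cols, best_r2)
    return results  # then choose among M_0, ..., M_p_bar by CV, AIC, or BIC
```

Even this toy version makes the combinatorial cost visible: for 𝑝 = 20 the inner loops already fit 2²⁰ − 1 ≈ one million models.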
Best Subset Selection
▪ The above direct approach is called all subsets or best subsets regression
▪ However, we often can't examine all possible models, since there are 2^𝑝 of them; for
example, when 𝑝 = 40 there are over a trillion models!
▪ The prediction is highly unstable: the subsets of variables in ℳ10 and ℳ11 can be very
different from each other → high variance (the best subset of size 3, ℳ3, need not include
any of the variables in the best subset of size 2, ℳ2).
▪ If P is large, the chance of finding models that work great in training data, but not so
much OOS increases – overfitting.
▪ Estimates fluctuate across different samples, so does the best model choice
▪ Nice feature: not confined just to OLS, can do, e.g., logit regression, just replace 𝑅2 by
deviance (−2log(L)).
Best Subset Selection
▪ But the more predictors, the lower the in-sample MSE. With large 𝑘, we might have too
flexible a model; with small 𝑘, we have a rigid model. Both do badly out-of-sample.
▪ So which 𝑘 is the best? In other words, out of the best models for every 𝑘, which model is
the best?
▪ We need the model with the lowest OOS MSE ('test error').
▪ We can use information criteria (IC) to indirectly estimate the OOS MSE. IC adjust the
in-sample MSE to penalize potential overfitting: the more regressors we add, the higher
the penalty.
▪ OR: We can directly estimate the OOS MSE using the validation set or CV.
Mallow’s 𝐶𝑝 and adjusted 𝑅²: A refresher

▪ Mallow's 𝐶𝑝:
𝐶𝑝 = (1/𝑛) (RSS + 2𝑘𝜎̂²)
where 𝑘 is the total number of parameters used and 𝜎̂² is an estimate of the variance of the
error 𝜀 associated with each response measurement.

▪ For a least squares model with 𝑘 variables, the adjusted 𝑅² statistic is calculated as
Adjusted 𝑅² = 1 − [RSS/(𝑛 − 𝑘 − 1)] / [TSS/(𝑛 − 1)]

▪ Maximizing the adjusted 𝑅² is equivalent to minimizing RSS/(𝑛 − 𝑘 − 1). While RSS always
decreases as the number of variables in the model increases, RSS/(𝑛 − 𝑘 − 1) may increase
or decrease, due to the presence of 𝑘 in the denominator. A small numerical sketch follows below.
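As a small numerical companion to the two formulas above, here is a sketch; the helper names are hypothetical, and rss, tss, k, n, and sigma2_hat are assumed to be computed elsewhere (e.g. sigma2_hat from the full model).

```python
# Sketch of Mallow's C_p and adjusted R^2 exactly as written above.
def mallows_cp(rss, k, n, sigma2_hat):
    return (rss + 2 * k * sigma2_hat) / n


def adjusted_r2(rss, tss, k, n):
    # maximizing this is equivalent to minimizing rss / (n - k - 1)
    return 1.0 - (rss / (n - k - 1)) / (tss / (n - 1))
```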
AIC: A refresher
▪ AIC (Akaike Information Criterion) is an (approximately) unbiased estimate of the expected
Kullback-Leibler divergence between the true data-generating distribution and the fitted
model (if all the assumptions hold).
▪ The 𝐴𝐼𝐶 criterion is defined for a large class of models fit by maximum
likelihood:
AIC = −2log 𝐿 + 2 ⋅ 𝑘
where L is the maximized value of the likelihood function for the estimated model.
▪ In the linear regression model, it equals:
𝐴𝐼𝐶 = (1 / (𝑁𝜎̂²)) (𝑅𝑆𝑆 + 2𝑘𝜎̂²)
▪ AIC is used in different model types, not just linear.
BIC: a refresher
▪ BIC (the Schwarz Bayesian Information Criterion) is inversely related to the posterior
probability of the model under a uniform prior over models.
▪ The 𝐵𝐼𝐶 criterion is defined for a large class of models fit by maximum likelihood:
BIC = −2 log 𝐿 + log(𝑁) ⋅ 𝑘
where 𝐿 is the maximized value of the likelihood function for the estimated model.
▪ In the linear regression model, it equals:
𝐵𝐼𝐶 = (1 / (𝑁𝜎̂²)) (𝑅𝑆𝑆 + log(𝑁) 𝑘 𝜎̂²)
A sketch of both criteria follows below.
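Below is a sketch of both criteria in the linear-regression forms shown on this and the previous slide; the helper names are illustrative, and rss, k, n, sigma2_hat are assumed given.

```python
import numpy as np


def aic_linear(rss, k, n, sigma2_hat):
    # (1 / (N * sigma^2)) * (RSS + 2 * k * sigma^2)
    return (rss + 2 * k * sigma2_hat) / (n * sigma2_hat)


def bic_linear(rss, k, n, sigma2_hat):
    # (1 / (N * sigma^2)) * (RSS + log(N) * k * sigma^2)
    return (rss + np.log(n) * k * sigma2_hat) / (n * sigma2_hat)
```

Since log(𝑁) exceeds 2 once 𝑁 > e² ≈ 7.4, BIC penalizes additional regressors more heavily than AIC and therefore tends to select smaller models.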
Estimating 𝜎̂² when applying IC
▪ How do we find 𝜎̂² to compute AIC and BIC?
▪ First approach: use the 𝜎̂² estimated directly from each candidate model.
▪ When 𝑃 << 𝑁: use the 𝜎̂² estimated from the full model (all predictors used
→ low-bias model; suggested in the textbook).
▪ When 𝑃/𝑁 is not small, try the following iterative procedure (a sketch follows below):
▪ Use 𝜎0² = 𝑠𝑌² (the sample variance of 𝑌), select the best model based on AIC/BIC,
and call it 𝑀𝑘0.
▪ Take 𝜎1² = the error-variance estimate 𝜎̂² from 𝑀𝑘0, select the best model by IC again,
and call it 𝑀𝑘1. Iterate until convergence (often two steps are enough).
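A minimal sketch of this iterative idea, assuming candidate_models maps each size 𝑘 to the column indices of the best size-𝑘 model (e.g. produced by best subset or forward selection); all function and variable names are illustrative, not from the lecture.

```python
import numpy as np


def _rss(X, y, cols):
    """Residual sum of squares from an OLS fit of y on an intercept plus X[:, cols]."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid)


def iterative_ic_selection(X, y, candidate_models, max_iter=10):
    n = len(y)
    sigma2 = float(np.var(y, ddof=1))   # sigma_0^2 = sample variance of Y
    chosen = None
    for _ in range(max_iter):
        # AIC in its linear-regression form, using the current sigma^2 estimate
        scores = {k: (_rss(X, y, cols) + 2 * k * sigma2) / (n * sigma2)
                  for k, cols in candidate_models.items()}
        new_choice = min(scores, key=scores.get)
        if new_choice == chosen:        # selection stopped changing (often 2 steps suffice)
            break
        chosen = new_choice
        cols = candidate_models[chosen]
        # re-estimate the error variance from the currently selected model
        sigma2 = _rss(X, y, cols) / (n - len(cols) - 1)
    return chosen, sigma2
```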
Best subset selection for ‘Credit’

Source: James et al. (2014), ISLR, Springer


For each possible model containing a subset of the ten predictors in the Credit data set, the 𝑅𝑆𝑆
and 𝑅² are displayed. The red frontier tracks the best model for a given number of predictors,
according to 𝑅𝑆𝑆 and 𝑅².
Best subset selection for ‘Credit’; 𝑝̄ = 8
𝑘 𝟏 𝟐 𝟑 𝟒 𝟓 𝟔 𝟕 𝟖
Variable

Income ∗ ∗ ∗ ∗ ∗ ∗ ∗

Limit ∗ ∗ ∗ ∗ ∗

Rating ∗ ∗ ∗ ∗ ∗ ∗ ∗

Cards ∗ ∗ ∗ ∗ ∗

Age ∗ ∗ ∗

Education ∗

Female ∗

Student ∗ ∗ ∗ ∗ ∗ ∗

Married ∗

Asian ∗ ∗

Caucasian ∗
Best subset optimal model for 'Credit’

Criterion     𝑘    Test MSE
AIC           6    10155.78
BIC           4    10307.72
10-fold CV    6    10155.78

▪ NB : Alternative variance estimation for AIC/BIC doesn't affect the selection.


Stepwise Selection
▪ For computational reasons, best subset selection cannot be applied with very
large 𝑝. Why not?
▪ Best subset selection may also suffer from statistical problems when 𝑝 is large: the
larger the search space, the higher the chance of finding models that look good on the
training data, even though they might not have any predictive power on future data.
▪ Thus, an enormous search space can lead to overfitting and high variance of
the coefficient estimates.
▪ For both of these reasons, stepwise methods, which explore a far more
restricted set of models, are attractive alternatives to best subset selection.
Forward Stepwise Selection: algorithm
1. Let ℳ0 denote the null model, which contains no predictors.

2. For 𝑘 = 0, … , 𝑝ҧ − 1 :

▪ Consider all 𝑝 − 𝑘 models that augment the predictors in ℳ𝑘 with one additional
predictor.
▪ Choose the best among these 𝑝 − 𝑘 models, and call it ℳ𝑘+1 . Here best is defined as
having smallest RSS or highest 𝑅2 .

3. Select a single best model from among ℳ0, …, ℳ𝑝̄ using cross-validated prediction
error, 𝐶𝑝, AIC, BIC, or adjusted 𝑅². A minimal sketch of step 2 follows below.
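A minimal sketch of step 2 of the algorithm above, assuming X is an (n, p) NumPy array; the function name forward_stepwise is illustrative, and the final choice among ℳ0, …, ℳ𝑝̄ by CV or an information criterion is left to a separate step.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def forward_stepwise(X, y, p_bar=None):
    """Return {k: column indices of M_k} for k = 0, ..., p_bar."""
    n, p = X.shape
    p_bar = p if p_bar is None else p_bar
    active, remaining = [], list(range(p))
    path = {0: []}                                   # M_0: the null model
    for k in range(p_bar):
        best_j, best_r2 = None, -np.inf
        for j in remaining:                          # the p - k candidate additions
            cols = active + [j]
            r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
            if r2 > best_r2:
                best_j, best_r2 = j, r2
        active.append(best_j)
        remaining.remove(best_j)
        path[k + 1] = active.copy()                  # M_{k+1}
    return path
```

For 𝑝 = 20 this fits 1 + 20 + 19 + ⋯ + 1 = 211 models, matching the count on the next slide.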
Forward Stepwise Selection
▪ FSS involves fitting one null model and (𝑝 − 𝑘) models for each iteration of
𝑘 = 0, 1, …, (𝑝̄ − 1). If 𝑝̄ = 𝑝, this gives 1 + 𝑝(𝑝 + 1)/2 models. E.g., for 𝑝 = 20 we
have 211 models (vs. 2²⁰ = 1,048,576 using best subset).

▪ Observe: the models are nested - ℳ𝑘+1 contains all the predictors of ℳ𝑘. This is not
the case in best subset selection; thus forward stepwise may not always find the best
possible model.

▪ Forward stepwise selection can be applied in high-dimensional scenarios where 𝑛 < 𝑝,
provided 𝑝̄ < 𝑛.

▪ Still not confined just to OLS, can do, e.g., logit regression, just replace 𝑅2 by
deviance (−2log(L)).
Forward subset selection for ‘Credit’; 𝑝̄ = 8
𝑘 𝟏 𝟐 𝟑 𝟒 𝟓 𝟔 𝟕 𝟖
Variable

Income ∗ ∗ ∗ ∗ ∗ ∗ ∗
Limit ∗ ∗ ∗ ∗ ∗
Rating ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
Cards ∗ ∗ ∗ ∗
Age ∗ ∗ ∗
Education
Female ∗
Student ∗ ∗ ∗ ∗ ∗ ∗
Married
Asian ∗ ∗
Caucasian
Forward sel. optimal model for 'Credit'

Criterion     𝑘    Test MSE
AIC           6    10155.78
BIC           5    10281.33
10-fold CV    5    10281.33

▪ NB : Alternative variance estimation for AIC/BIC doesn't affect the selection.


Backward Stepwise Selection (General)
▪ Let ℳ𝑝ҧ denote the full model, which contains all 𝑝ҧ predictors.
▪ For 𝑘 = 𝑝̄, 𝑝̄ − 1, …, 1:
▪ Consider all 𝑘 models that contain all but one of the predictors in ℳ𝑘 , for a total of
𝑘 − 1 predictors.
▪ Choose the best among these 𝑘 models, and call it ℳ𝑘−1 . Here, best is defined as
having smallest RSS or highest 𝑅2 (or maximum log likelihood or lowest deviance
depending on the estimation.)

▪ Select a single best model from among ℳ0, …, ℳ𝑝̄ using cross-validated prediction
error, 𝐶𝑝, AIC, BIC, or adjusted 𝑅².
Backward Stepwise Selection (OLS using t-stat)
▪ Let ℳ𝑝̄ denote the full model, which contains all 𝑝̄ predictors.
▪ For 𝑘 = 𝑝̄, 𝑝̄ − 1, …, 1:
▪ Fit the model with 𝑘 predictors.
▪ Identify the least useful predictor (smallest “classical” t-stat), drop it, and
call the resulting model ℳ𝑘−1.

▪ Select a single best model from among ℳ0, …, ℳ𝑝̄ using cross-validated prediction
error, 𝐶𝑝, AIC, BIC, or adjusted 𝑅². A minimal sketch of this variant follows below.
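A minimal sketch of this t-statistic variant using statsmodels, assuming X is a pandas DataFrame of predictors and y the response; the function name is illustrative.

```python
import statsmodels.api as sm


def backward_by_tstat(X, y):
    """Return {k: list of predictor names in M_k} for k = p, p-1, ..., 0."""
    cols = list(X.columns)
    path = {len(cols): cols.copy()}                  # M_p: the full model
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
        tvals = fit.tvalues.drop("const")            # ignore the intercept
        weakest = tvals.abs().idxmin()               # least useful predictor
        cols.remove(weakest)
        path[len(cols)] = cols.copy()                # M_{k-1}
    return path    # then choose among M_0, ..., M_p by CV or an IC
```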
Backward Stepwise Selection
▪ Instead of fitting 2^𝑝 models, we need only 𝑝 + 1 fits for OLS (using the t-stat
shortcut above) and 1 + 𝑝(𝑝 + 1)/2 for logit regression (why?).
▪ Statistical cost: we constrain the search to reduce variance, but perhaps incur
more bias.

▪ Can't be used when 𝑃 > 𝑁

▪ Still not confined just to OLS, can do, e.g., logit regression.
Backward subset selection for ‘Credit’; 𝑝̄ = 8
𝑘 𝟏 𝟐 𝟑 𝟒 𝟓 𝟔 𝟕 𝟖
Variable

Income ∗ ∗ ∗ ∗ ∗ ∗ ∗
Limit ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
Rating ∗ ∗ ∗ ∗
Cards ∗ ∗ ∗ ∗ ∗
Age ∗ ∗ ∗
Education
Female ∗
Student ∗ ∗ ∗ ∗ ∗ ∗
Married
Asian ∗ ∗
Caucasian
Backward-selection optimal model for 'Credit'

Criterion     𝑘    Test MSE
AIC           6    10155.78
BIC           5    10307.72
10-fold CV    5    10155.78

▪ NB : Alternative variance estimation for AIC/BIC doesn't affect the selection.


High-dimensional setting (𝑃 >> 𝑁)
▪ Standard AIC and BIC are not well suited for selection - they will typically
overfit.

▪ BIC can sometimes be fine if 𝑃 is proportional to 𝑁.

▪ Chen and Chen (2008) propose the extended BIC, formally justified for the high-
dimensional setting: replace log(𝑁) with log(𝑁) + 2 log(𝑃). A small sketch follows below.

▪ Wang (2012) shows that forward selection using the above BIC is consistent in
ultra high-dimensional settings under some assumptions (normal
homoscedastic linear regression plus some regularity conditions).
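A small sketch of the extended BIC described above, written in the −2 log 𝐿 form; the function name is illustrative.

```python
import numpy as np


def extended_bic(loglik, k, n, p):
    # Standard BIC penalizes each parameter by log(n); the extended BIC of
    # Chen and Chen (2008), as stated above, replaces log(n) with log(n) + 2*log(p).
    return -2.0 * loglik + k * (np.log(n) + 2.0 * np.log(p))
```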
Final comments
▪ Nice feature: end up with a single model that is easy to work with and discuss. Most
likely not the true model.
▪ Very interpretable, but careful about claiming causality!
▪ Can do this type of selection with essentially any criterion function (e.g., log-
likelihood for binary choice).
▪ Many flavors of selection methods, not all have theoretical justification. Need to
figure out what works when.
▪ Standard approaches get very slow very fast. New techniques push the envelope
quite a bit though.
▪ Stability with respect to the data sample could be an issue - a starkly different model may
obtain if another sample is considered.
