
CHAPTER TWO

Regression with penalization: subset selection

Module Code: EC4308


Module title: Machine Learning and Economic Forecasting
Instructor: Vu Thanh Hai
Reference: ISLR 6.1
Acknowledgement: I adapt Dr. Denis Tkachenko’s notes with
some minor modifications. Thanks to Denis!
Lecture outline

▪ Introduction and motivation


▪ Best subset selection
▪ Forward stepwise selection
▪ Backward stepwise selection
▪ Concluding thoughts: pros and cons.
Introduction and motivation
▪ The linear regression model that you have learned:
𝑦𝑖 = 𝛽0 + 𝛽1 𝑥1,𝑖 + 𝛽2 𝑥2,𝑖 + ⋯ + 𝛽𝑝 𝑥𝑝,𝑖 + 𝜀𝑖

▪ Very commonly used to model 𝐸(𝑌 ∣ 𝑋) - recall that this is what we are after
in predictive modelling under squared loss.
▪ Can describe nonlinear but still additive relationships: the model only needs to be
linear in the coefficients (the 𝛽's).
▪ Linearity yields advantages in inference, and in many applications delivers
surprisingly good approximations.
Motivation: is OLS good enough?
▪ We typically fit such models with OLS; why do we need to look for alternatives?
▪ 2 key concerns: prediction accuracy and model interpretability.
▪ OLS will have low bias if the true relationship is approximately linear.
▪ If 𝑃 << 𝑁 (the "typical" metrics textbook setting), OLS also has low variance, and
there is little need to do anything.
Motivation: prediction accuracy
▪ When 𝑁 is not much larger than 𝑃 - OLS predictions will have high variance
→ overfit → poor OOS prediction.
▪ When 𝑃 > 𝑁 (or 𝑃 >> 𝑁 ), no unique solution: OLS variance is infinite →
cannot use this method at all.
▪ Our goal is to get a linear combination of 𝑋 to get a good forecast of 𝑌.
▪ We want to reduce the variance by somehow reducing model complexity, with a minor
cost in terms of bias - constrain coefficients on some 𝑋's to 0 or shrink them.
Motivation: Interpretability
▪ Oftentimes 𝑋 's entering regression may be irrelevant.

▪ OLS will almost never set coefficients exactly to 0 .

▪ By removing potentially many such variables (setting their 𝛽 's to 0 ), we enhance


model interpretability.

▪ Sacrifice "small details" to get the "big picture".

▪ Caution: "interpretation" here need not mean causal effect! Rather, which variables
can be important predictors.
Introduction: Three important approaches
▪ Subset selection: find a (hopefully small) subset of the X's that yields good
performance. Fit OLS on this subset.

▪ Shrinkage: fit a model using all 𝑃 variables, but shrink the coefficients towards 0
relative to OLS (some can go to exactly 0).

▪ Dimension reduction: attempt to "summarize" the 𝑃 predictors with 𝑀 < 𝑃 linear
combinations of the variables (e.g., principal components).

▪ We will talk about the first approach today.


Subset Selection
▪ We try to find a possibly sparse subset of 𝑋 's to forecast 𝑌 well: find 𝑆 << 𝑁
predictors needed for a high quality forecast.
▪ Dropping a variable = setting its coefficient to zero. A form of penalty called
the 𝐿0 penalty.
▪ Bias/variance tradeoff: leave out a variable highly correlated to signal → bias;
put in too many X's → high variance.
▪ In principle, want to try out all possible combinations and choose the one
with the best OOS performance.
Example: the ‘Credit’ data set
▪ Data from the ISLR package on credit card balances and 10 predictors, 𝑁 = 400.
ID: Identification
Income: Income in $10,000s
Limit: Credit limit
Rating: Credit rating
Cards: Number of credit cards
Age: Age in years
Education: Number of years of education
Gender: A factor with levels Male and Female
Student: A factor with levels No and Yes indicating whether the individual was a student
Married: A factor with levels No and Yes indicating whether the individual was married
Ethnicity: A factor with levels African American, Asian, and Caucasian indicating the individual's ethnicity
Balance: Average credit card balance in $
Best Subset Selection
▪ Let 𝑝̄ ≤ 𝑝 be the maximum model size.
▪ Goal: for each 𝑘 ∈ {0, 1, 2, …, 𝑝̄}, find the subset of size 𝑘 that gives the best 𝑅².
Then, pick the overall best model.

▪ For 𝑘 = 0, 1, …, 𝑝̄:
• Fit all models that contain exactly 𝑘 predictors. There are (𝑝 choose 𝑘) such models.
If 𝑘 = 0, the forecast is the unconditional mean.

• Pick the best (e.g., highest 𝑅²) among these models, and denote it by ℳ𝑘.

▪ Optimize over ℳ0, …, ℳ𝑝̄ using cross-validation (or other criteria like AIC or BIC).
A minimal code sketch follows below.
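The procedure above can be written compactly; below is a minimal sketch, assuming X is an (n, p) NumPy array and y a length-n array. The function name best_subset and the use of scikit-learn's LinearRegression are illustrative choices, not part of the lecture or of ISLR.

```python
# Minimal sketch of best subset selection. Names are illustrative only.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression


def best_subset(X, y, p_bar=None):
    """For each size k = 0, ..., p_bar, return (best column indices, in-sample R^2)."""
    n, p = X.shape
    p_bar = p if p_bar is None else p_bar
    results = {0: ((), 0.0)}  # k = 0: the forecast is the unconditional mean of y
    for k in range(1, p_bar + 1):
        best_cols, best_r2 = None, -np.inf
        for cols in combinations(range(p), k):  # all C(p, k) candidate models
            Xk = X[:, list(cols)]
            r2 = LinearRegression().fit(Xk, y).score(Xk, y)
            if r2 > best_r2:
                best_cols, best_r2 = cols, r2
        results[k] = (best_cols, best_r2)
    return results  # then choose among M_0, ..., M_p_bar by CV, AIC, or BIC
```

Even this toy version makes the combinatorial cost visible: for 𝑝 = 20 the inner loops already fit 2²⁰ − 1 ≈ one million models.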
Best Subset Selection
▪ The above direct approach is called all subsets or best subsets regression
▪ However, we often can't examine all possible models, since there are 2^𝑝 of them; for
example, when 𝑝 = 40 there are over a trillion models!
▪ The prediction is highly unstable: the subsets of variables in ℳ10 and ℳ11 can be very
different from each other → high variance (the best subset of size 3, ℳ3, need not include
any of the variables in the best subset of size 2, ℳ2).
▪ If P is large, the chance of finding models that work great in training data, but not so
much OOS increases – overfitting.
▪ Estimates fluctuate across different samples, so does the best model choice
▪ Nice feature: not confined just to OLS, can do, e.g., logit regression, just replace 𝑅2 by
deviance (−2log(L)).
Best Subset Selection
▪ But the more predictors, the lower the in-sample MSE. With large 𝑘, we might have too
flexible a model; with small 𝑘, we have a rigid model. Both do badly out-of-sample.
▪ So which 𝑘 is the best? In other words, out of the best models for every 𝑘, which model is
the best?
▪ We need the model with the lowest OOS MSE ('test error').
▪ We can use information criteria (IC) to indirectly estimate the OOS MSE. IC adjust the
in-sample MSE to penalize potential overfitting: the more regressors we add, the higher
the penalty.
▪ OR: We can directly estimate the OOS MSE using the validation set or CV.
Mallow’s 𝐶𝑝 and adjusted 𝑅²: A refresher

▪ Mallow's 𝐶𝑝:
𝐶𝑝 = (1/𝑛) (RSS + 2𝑘𝜎̂²)
where 𝑘 is the total number of parameters used and 𝜎̂² is an estimate of the variance of the
error 𝜀 associated with each response measurement.

▪ For a least squares model with 𝑘 variables, the adjusted 𝑅² statistic is calculated as
Adjusted 𝑅² = 1 − [RSS/(𝑛 − 𝑘 − 1)] / [TSS/(𝑛 − 1)]

▪ Maximizing the adjusted 𝑅² is equivalent to minimizing RSS/(𝑛 − 𝑘 − 1). While RSS always
decreases as the number of variables in the model increases, RSS/(𝑛 − 𝑘 − 1) may increase
or decrease, due to the presence of 𝑘 in the denominator. A small numerical sketch follows below.
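As a small numerical companion to the two formulas above, here is a sketch; the helper names are hypothetical, and rss, tss, k, n, and sigma2_hat are assumed to be computed elsewhere (e.g. sigma2_hat from the full model).

```python
# Sketch of Mallow's C_p and adjusted R^2 exactly as written above.
def mallows_cp(rss, k, n, sigma2_hat):
    return (rss + 2 * k * sigma2_hat) / n


def adjusted_r2(rss, tss, k, n):
    # maximizing this is equivalent to minimizing rss / (n - k - 1)
    return 1.0 - (rss / (n - k - 1)) / (tss / (n - 1))
```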
AIC: A refresher
▪ AIC (Akaike Information Criterion) is an (approximately) unbiased estimate of the expected
Kullback-Leibler divergence between the true data-generating distribution and the fitted
model (if all the assumptions hold).
▪ The 𝐴𝐼𝐶 criterion is defined for a large class of models fit by maximum
likelihood:
AIC = −2log 𝐿 + 2 ⋅ 𝑘
where L is the maximized value of the likelihood function for the estimated model.
▪ In the linear regression model, it equals:
𝐴𝐼𝐶 = (1 / (𝑁𝜎̂²)) (𝑅𝑆𝑆 + 2𝑘𝜎̂²)
▪ AIC is used in different model types, not just linear.
BIC: a refresher
▪ BIC (the Schwarz Bayesian Information Criterion) is inversely related to the posterior
probability of the model under a uniform prior over models.
▪ The 𝐵𝐼𝐶 criterion is defined for a large class of models fit by maximum likelihood:
BIC = −2 log 𝐿 + log(𝑁) ⋅ 𝑘
where 𝐿 is the maximized value of the likelihood function for the estimated model.
▪ In the linear regression model, it equals:
𝐵𝐼𝐶 = (1 / (𝑁𝜎̂²)) (𝑅𝑆𝑆 + log(𝑁) 𝑘 𝜎̂²)
A sketch of both criteria follows below.
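Below is a sketch of both criteria in the linear-regression forms shown on this and the previous slide; the helper names are illustrative, and rss, k, n, sigma2_hat are assumed given.

```python
import numpy as np


def aic_linear(rss, k, n, sigma2_hat):
    # (1 / (N * sigma^2)) * (RSS + 2 * k * sigma^2)
    return (rss + 2 * k * sigma2_hat) / (n * sigma2_hat)


def bic_linear(rss, k, n, sigma2_hat):
    # (1 / (N * sigma^2)) * (RSS + log(N) * k * sigma^2)
    return (rss + np.log(n) * k * sigma2_hat) / (n * sigma2_hat)
```

Since log(𝑁) exceeds 2 once 𝑁 > e² ≈ 7.4, BIC penalizes additional regressors more heavily than AIC and therefore tends to select smaller models.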
Estimating 𝜎̂² when applying IC
▪ How do we find 𝜎̂² to compute AIC and BIC?
▪ First approach: use the 𝜎̂² estimated directly from each candidate model.
▪ When 𝑃 << 𝑁: use the 𝜎̂² estimated from the full model (all predictors used
→ low-bias model; suggested in the textbook).
▪ When 𝑃/𝑁 is not small, try the following iterative procedure (a sketch follows below):
▪ Use 𝜎0² = 𝑠𝑌² (the sample variance of 𝑌), select the best model based on AIC/BIC,
and call it 𝑀𝑘0.
▪ Take 𝜎1² = the error-variance estimate 𝜎̂² from 𝑀𝑘0, select the best model by IC again,
and call it 𝑀𝑘1. Iterate until convergence (often two steps are enough).
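A minimal sketch of this iterative idea, assuming candidate_models maps each size 𝑘 to the column indices of the best size-𝑘 model (e.g. produced by best subset or forward selection); all function and variable names are illustrative, not from the lecture.

```python
import numpy as np


def _rss(X, y, cols):
    """Residual sum of squares from an OLS fit of y on an intercept plus X[:, cols]."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid)


def iterative_ic_selection(X, y, candidate_models, max_iter=10):
    n = len(y)
    sigma2 = float(np.var(y, ddof=1))   # sigma_0^2 = sample variance of Y
    chosen = None
    for _ in range(max_iter):
        # AIC in its linear-regression form, using the current sigma^2 estimate
        scores = {k: (_rss(X, y, cols) + 2 * k * sigma2) / (n * sigma2)
                  for k, cols in candidate_models.items()}
        new_choice = min(scores, key=scores.get)
        if new_choice == chosen:        # selection stopped changing (often 2 steps suffice)
            break
        chosen = new_choice
        cols = candidate_models[chosen]
        # re-estimate the error variance from the currently selected model
        sigma2 = _rss(X, y, cols) / (n - len(cols) - 1)
    return chosen, sigma2
```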
Best subset selection for ‘Credit’

Source: James et al. (2014), ISLR, Springer


For each possible model containing a subset of the ten predictors in the Credit data set, the 𝑅𝑆𝑆
and 𝑅² are displayed. The red frontier tracks the best model for a given number of predictors,
according to 𝑅𝑆𝑆 and 𝑅².
Best subset selection for ‘Credit’; 𝑝̄ = 8
𝑘 𝟏 𝟐 𝟑 𝟒 𝟓 𝟔 𝟕 𝟖
Variable

Income ∗ ∗ ∗ ∗ ∗ ∗ ∗

Limit ∗ ∗ ∗ ∗ ∗

Rating ∗ ∗ ∗ ∗ ∗ ∗ ∗

Cards ∗ ∗ ∗ ∗ ∗

Age ∗ ∗ ∗

Education ∗

Female ∗

Student ∗ ∗ ∗ ∗ ∗ ∗

Married ∗

Asian ∗ ∗

Caucasian ∗
Best subset optimal model for 'Credit’

Criterion     𝑘    Test MSE
AIC           6    10155.78
BIC           4    10307.72
10-fold CV    6    10155.78

▪ NB : Alternative variance estimation for AIC/BIC doesn't affect the selection.


Stepwise Selection
▪ For computational reasons, best subset selection cannot be applied with very
large 𝑝. Why not?
▪ Best subset selection may also suffer from statistical problems when 𝑝 is large: the
larger the search space, the higher the chance of finding models that look good on the
training data, even though they might not have any predictive power on future data.
▪ Thus, an enormous search space can lead to overfitting and high variance of
the coefficient estimates.
▪ For both of these reasons, stepwise methods, which explore a far more
restricted set of models, are attractive alternatives to best subset selection.
Forward Stepwise Selection: algorithm
1. Let ℳ0 denote the null model, which contains no predictors.

2. For 𝑘 = 0, … , 𝑝ҧ − 1 :

▪ Consider all 𝑝 − 𝑘 models that augment the predictors in ℳ𝑘 with one additional
predictor.
▪ Choose the best among these 𝑝 − 𝑘 models, and call it ℳ𝑘+1 . Here best is defined as
having smallest RSS or highest 𝑅2 .

3. Select a single best model from among ℳ0, …, ℳ𝑝̄ using cross-validated prediction
error, 𝐶𝑝, AIC, BIC, or adjusted 𝑅². A minimal sketch of step 2 follows below.
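A minimal sketch of step 2 of the algorithm above, assuming X is an (n, p) NumPy array; the function name forward_stepwise is illustrative, and the final choice among ℳ0, …, ℳ𝑝̄ by CV or an information criterion is left to a separate step.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def forward_stepwise(X, y, p_bar=None):
    """Return {k: column indices of M_k} for k = 0, ..., p_bar."""
    n, p = X.shape
    p_bar = p if p_bar is None else p_bar
    active, remaining = [], list(range(p))
    path = {0: []}                                   # M_0: the null model
    for k in range(p_bar):
        best_j, best_r2 = None, -np.inf
        for j in remaining:                          # the p - k candidate additions
            cols = active + [j]
            r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
            if r2 > best_r2:
                best_j, best_r2 = j, r2
        active.append(best_j)
        remaining.remove(best_j)
        path[k + 1] = active.copy()                  # M_{k+1}
    return path
```

For 𝑝 = 20 this fits 1 + 20 + 19 + ⋯ + 1 = 211 models, matching the count on the next slide.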
Forward Stepwise Selection
▪ FSS involves fitting one null model and (𝑝 − 𝑘) models for each iteration of
𝑘 = 0, 1, …, (𝑝̄ − 1). If 𝑝̄ = 𝑝, this gives 1 + 𝑝(𝑝 + 1)/2 models. E.g., for 𝑝 = 20 we
have 211 models (vs. 2²⁰ = 1,048,576 using best subset).

▪ Observe: the models are nested - ℳ𝑘+1 contains all the predictors of ℳ𝑘. This is not
the case in best subset selection; thus forward stepwise may not always find the best
possible model.

▪ Forward stepwise selection can be applied in high-dimensional scenarios where 𝑛 < 𝑝,
provided 𝑝̄ < 𝑛.

▪ Still not confined just to OLS, can do, e.g., logit regression, just replace 𝑅2 by
deviance (−2log(L)).
Forward subset selection for ‘Credit’; 𝑝̄ = 8
𝑘 𝟏 𝟐 𝟑 𝟒 𝟓 𝟔 𝟕 𝟖
Variable

Income ∗ ∗ ∗ ∗ ∗ ∗ ∗
Limit ∗ ∗ ∗ ∗ ∗
Rating ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
Cards ∗ ∗ ∗ ∗
Age ∗ ∗ ∗
Education
Female ∗
Student ∗ ∗ ∗ ∗ ∗ ∗
Married
Asian ∗ ∗
Caucasian
Forward sel. optimal model for 'Credit'

Criterion     𝑘    Test MSE
AIC           6    10155.78
BIC           5    10281.33
10-fold CV    5    10281.33

▪ NB : Alternative variance estimation for AIC/BIC doesn't affect the selection.


Backward Stepwise Selection (General)
▪ Let ℳ𝑝ҧ denote the full model, which contains all 𝑝ҧ predictors.
▪ For 𝑘 = 𝑝̄, 𝑝̄ − 1, …, 1:
▪ Consider all 𝑘 models that contain all but one of the predictors in ℳ𝑘 , for a total of
𝑘 − 1 predictors.
▪ Choose the best among these 𝑘 models, and call it ℳ𝑘−1 . Here, best is defined as
having smallest RSS or highest 𝑅2 (or maximum log likelihood or lowest deviance
depending on the estimation.)

▪ Select a single best model from among ℳ0, …, ℳ𝑝̄ using cross-validated prediction
error, 𝐶𝑝, AIC, BIC, or adjusted 𝑅².
Backward Stepwise Selection (OLS using t-stat)
▪ Let ℳ𝑝̄ denote the full model, which contains all 𝑝̄ predictors.
▪ For 𝑘 = 𝑝̄, 𝑝̄ − 1, …, 1:
▪ Fit the model with 𝑘 predictors.
▪ Identify the least useful predictor (smallest “classical” t-stat), drop it, and
call the resulting model ℳ𝑘−1.

▪ Select a single best model from among ℳ0, …, ℳ𝑝̄ using cross-validated prediction
error, 𝐶𝑝, AIC, BIC, or adjusted 𝑅². A minimal sketch of this variant follows below.
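A minimal sketch of this t-statistic variant using statsmodels, assuming X is a pandas DataFrame of predictors and y the response; the function name is illustrative.

```python
import statsmodels.api as sm


def backward_by_tstat(X, y):
    """Return {k: list of predictor names in M_k} for k = p, p-1, ..., 0."""
    cols = list(X.columns)
    path = {len(cols): cols.copy()}                  # M_p: the full model
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
        tvals = fit.tvalues.drop("const")            # ignore the intercept
        weakest = tvals.abs().idxmin()               # least useful predictor
        cols.remove(weakest)
        path[len(cols)] = cols.copy()                # M_{k-1}
    return path    # then choose among M_0, ..., M_p by CV or an IC
```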
Backward Stepwise Selection
▪ Instead of fitting 2^𝑝 models, we need only 𝑝 + 1 fits for OLS (using the t-stat
shortcut above) and 1 + 𝑝(𝑝 + 1)/2 for logit regression (why?).
▪ Statistical cost: we constrain the search to reduce variance, but perhaps incur
more bias.

▪ Can't be used when 𝑃 > 𝑁

▪ Still not confined just to OLS, can do, e.g., logit regression.
Backward subset selection for ‘Credit’; 𝑝̄ = 8
𝑘 𝟏 𝟐 𝟑 𝟒 𝟓 𝟔 𝟕 𝟖
Variable

Income ∗ ∗ ∗ ∗ ∗ ∗ ∗
Limit ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
Rating ∗ ∗ ∗ ∗
Cards ∗ ∗ ∗ ∗ ∗
Age ∗ ∗ ∗
Education
Female ∗
Student ∗ ∗ ∗ ∗ ∗ ∗
Married
Asian ∗ ∗
Caucasian
Backward-selection optimal model for 'Credit'

Criterion     𝑘    Test MSE
AIC           6    10155.78
BIC           5    10307.72
10-fold CV    5    10155.78

▪ NB : Alternative variance estimation for AIC/BIC doesn't affect the selection.


High-dimensional setting (𝑃 >> 𝑁)
▪ Standard AIC and BIC are not well suited for selection - they will typically
overfit.

▪ BIC can sometimes be fine if 𝑃 is proportional to 𝑁.

▪ Chen and Chen (2008) propose the extended BIC, formally justified for the high-
dimensional setting: replace log(𝑁) with log(𝑁) + 2 log(𝑃). A small sketch follows below.

▪ Wang (2012) shows that forward selection using the above BIC is consistent in
ultra high-dimensional settings under some assumptions (normal
homoscedastic linear regression plus some regularity conditions).
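A small sketch of the extended BIC described above, written in the −2 log 𝐿 form; the function name is illustrative.

```python
import numpy as np


def extended_bic(loglik, k, n, p):
    # Standard BIC penalizes each parameter by log(n); the extended BIC of
    # Chen and Chen (2008), as stated above, replaces log(n) with log(n) + 2*log(p).
    return -2.0 * loglik + k * (np.log(n) + 2.0 * np.log(p))
```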
Final comments
▪ Nice feature: end up with a single model that is easy to work with and discuss. Most
likely not the true model.
▪ Very interpretable, but careful about claiming causality!
▪ Can do this type of selection with essentially any criterion function (e.g., log-
likelihood for binary choice).
▪ Many flavors of selection methods, not all have theoretical justification. Need to
figure out what works when.
▪ Standard approaches get very slow very fast. New techniques push the envelope
quite a bit though.
▪ Stability with respect to the data sample could be an issue - a starkly different model may
obtain if another sample is considered.
