Chapter 2
▪ Linear models are very commonly used to model 𝐸(𝑌 ∣ 𝑋) - recall that this is what we are after
in predictive modelling under squared loss.
▪ They can describe nonlinear but still additive relationships; the model only needs to be linear in
the coefficients (𝛽's).
▪ Linearity yields advantages in inference and, in many applications, delivers
surprisingly good approximations.
Motivation: is OLS good enough?
▪ We typically fit such models with OLS, so why do we need to look for
alternatives?
▪ Two key concerns: prediction accuracy and model interpretability.
▪ OLS has low bias if the true relationship is approximately linear.
▪ If 𝑃 << 𝑁 (the "typical" metrics textbook setting), OLS also has low variance:
no need to do anything.
Motivation: prediction accuracy
▪ When 𝑁 is not much larger than 𝑃, OLS predictions will have high variance
→ overfit → poor OOS prediction.
▪ When 𝑃 > 𝑁 (or 𝑃 >> 𝑁), there is no unique solution: OLS variance is infinite →
cannot use this method at all.
▪ Our goal is to find a linear combination of the 𝑋's that gives a good forecast of 𝑌.
▪ We want to reduce the variance by somehow reducing model complexity, at a minor cost
in terms of bias: constrain the coefficients on some 𝑋's to 0, or shrink them.
Motivation: Interpretability
▪ Oftentimes some of the 𝑋's entering the regression are irrelevant.
▪ Caution: "interpretation" here need not mean causal effects! Rather, it is about which variables
are important predictors.
Introduction: Three important approaches
▪ Subset selection: find a (hopefully small) subset of the X's that yields good
performance. Fit OLS on this subset.
▪ For 𝑘 = 0, 1, …, p̄:
• Fit all models that contain exactly 𝑘 predictors. There are (p̄ choose 𝑘) such models. If 𝑘 = 0, the
forecast is the unconditional mean.
• Pick the best (e.g., highest 𝑅²) among these models, and denote it by ℳ_𝑘.
▪ Optimize over ℳ₀, …, ℳ_p̄ using cross-validation (or other criteria like AIC
or BIC); a code sketch follows below.
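▪ A minimal Python sketch of this exhaustive search, assuming a generic numpy design matrix X (n × p) and response y; the function names and simulated data are illustrative, not part of the slides:

```python
# Best subset selection: for each size k, fit all (p choose k) OLS models and
# keep the one with the smallest RSS (equivalently, the highest R^2 for fixed k).
import itertools
import numpy as np

def fit_rss(X, y, cols):
    """OLS fit on the columns in `cols` (plus an intercept); returns the RSS."""
    Z = np.column_stack([np.ones(len(y)), X[:, cols]]) if cols else np.ones((len(y), 1))
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid)

def best_subset(X, y):
    """For each size k, return the predictor set M_k with the smallest RSS."""
    n, p = X.shape
    best = {0: ((), fit_rss(X, y, ()))}          # k = 0: unconditional mean
    for k in range(1, p + 1):
        candidates = ((cols, fit_rss(X, y, list(cols)))
                      for cols in itertools.combinations(range(p), k))
        best[k] = min(candidates, key=lambda c: c[1])   # lowest RSS = highest R^2
    return best

# Example with simulated data (2^p model fits in total, so keep p small):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X[:, 0] - 2 * X[:, 2] + rng.normal(size=200)
for k, (cols, rss) in best_subset(X, y).items():
    print(k, cols, round(rss, 1))
```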
Best Subset Selection
▪ The above direct approach is called all subsets or best subsets regression
▪ However, we often can't examine all possible models, since there are 2^p of them; for
example, when 𝑝 = 40 there are 2^40 ≈ 10^12 (over a trillion) models!
▪ The prediction is highly unstable: the subsets of variables in ℳ₁₀ and ℳ₁₁ can be very
different from each other → high variance (the best subset of size 3 need not include any of
the variables in the best subset of size 2).
▪ If P is large, the chance of finding models that work great in the training data but poorly
OOS increases: overfitting.
▪ Estimates fluctuate across different samples, and so does the choice of the best model.
▪ Nice feature: not confined just to OLS, can do, e.g., logit regression, just replace 𝑅2 by
deviance (−2log(L)).
Best Subset Selection
▪ But the more predictors, the lower the in-sample MSE. With large 𝑘, we might have too
flexible a model; with small 𝑘, we have a rigid model. Both do badly out-of-sample.
▪ So which 𝑘 is the best? In other words, out of the best models for every 𝑘, which model is
the best?
▪ We need the model with the lowest OOS MSE ('test error').
▪ We can use information criteria (IC) to indirectly estimate the OOS MSE: ICs adjust the
in-sample MSE to penalize potential overfitting. The more regressors we add, the higher the penalty.
▪ Or we can directly estimate the OOS MSE using a validation set or CV (see the sketch below).
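▪ A minimal sketch of the direct route: estimate the test error of a few candidate models on a held-out validation set. The candidate models, data, and function names below are purely illustrative:

```python
# Validation-set estimate of OOS MSE for candidate models given as column-index lists.
import numpy as np

def val_mse(X, y, cols, train, val):
    """Fit OLS (with intercept) on the training rows; return MSE on the validation rows."""
    def design(rows):
        return np.column_stack([np.ones(len(rows)), X[np.ix_(rows, cols)]])
    beta, *_ = np.linalg.lstsq(design(train), y[train], rcond=None)
    pred = design(val) @ beta
    return float(np.mean((y[val] - pred) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 2 * X[:, 1] + rng.normal(size=300)
idx = rng.permutation(300)
train, val = idx[:200], idx[200:]          # simple train/validation split

# Hypothetical candidate models M_1, ..., M_3 (e.g., the best model of each size):
models = {1: [1], 2: [1, 3], 3: [1, 3, 4]}
for k, cols in models.items():
    print(k, cols, round(val_mse(X, y, cols, train, val), 3))
```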
Mallow's 𝐶ₚ and adjusted 𝑅²: A refresher
▪ Mallow's 𝐶ₚ:
𝐶ₚ = (RSS + 2𝑘σ̂²) / 𝑛
▪ where 𝑘 is the number of parameters used and σ̂² is an estimate of the variance of the error 𝜖 associated with
each response measurement.
▪ For a least squares model with 𝑘 variables, the adjusted 𝑅² statistic is calculated as
Adjusted 𝑅² = 1 − [RSS/(𝑛 − 𝑘 − 1)] / [TSS/(𝑛 − 1)]
▪ Maximizing the adjusted 𝑅² is equivalent to minimizing RSS/(𝑛 − 𝑘 − 1). While RSS always
decreases as the number of variables in the model increases, RSS/(𝑛 − 𝑘 − 1) may increase or decrease,
due to the presence of 𝑘 in the denominator.
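▪ The two criteria written as plain functions of RSS, TSS, 𝑛 and 𝑘 (a sketch; the argument names are illustrative, and σ̂² must be supplied, e.g. from the full model):

```python
def mallows_cp(rss, n, k, sigma2_hat):
    # C_p = (RSS + 2 k sigma2_hat) / n
    return (rss + 2 * k * sigma2_hat) / n

def adjusted_r2(rss, tss, n, k):
    # Adjusted R^2 = 1 - [RSS / (n - k - 1)] / [TSS / (n - 1)]
    return 1 - (rss / (n - k - 1)) / (tss / (n - 1))
```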
AIC: A refresher
▪ AIC (Akaike Information Criterion) is an unbiased estimate of the Kullback-
Leibler divergence between the model distribution and the forecast
distribution (if all the assumptions hold).
▪ The 𝐴𝐼𝐶 criterion is defined for a large class of models fit by maximum
likelihood:
AIC = −2log 𝐿 + 2 ⋅ 𝑘
where L is the maximized value of the likelihood function for the estimated model.
▪ In the linear regression model, it equals:
𝐴𝐼𝐶 = (𝑅𝑆𝑆 + 2𝑘σ̂²) / (𝑁σ̂²)
▪ AIC is used in different model types, not just linear.
BIC: a refresher
▪ BIC (the Schwarz Bayesian Information Criterion) is, approximately, a decreasing
transformation of the posterior probability of the model under a uniform prior over models:
minimizing BIC is (approximately) equivalent to maximizing that posterior probability.
▪ The 𝐵𝐼𝐶 criterion is defined for a large class of models fit by maximum
likelihood:
BIC = −2log 𝐿 + log(N) ⋅ 𝑘
where L is the maximized value of the likelihood function for the estimated model.
▪ In the linear regression model, it equals:
𝐵𝐼𝐶 = (𝑅𝑆𝑆 + log(𝑁) 𝑘 σ̂²) / (𝑁σ̂²)
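▪ The Gaussian linear-regression versions of AIC and BIC above, written as small helper functions (a sketch; σ̂² is a plug-in estimate of the error variance, discussed next):

```python
import math

def aic_linear(rss, n, k, sigma2_hat):
    # AIC = (RSS + 2 k sigma2_hat) / (n sigma2_hat)
    return (rss + 2 * k * sigma2_hat) / (n * sigma2_hat)

def bic_linear(rss, n, k, sigma2_hat):
    # BIC = (RSS + log(n) k sigma2_hat) / (n sigma2_hat)
    return (rss + math.log(n) * k * sigma2_hat) / (n * sigma2_hat)

# Given RSS for the best model of each size k, pick the size with the lowest BIC, e.g.:
# k_star = min(rss_by_k, key=lambda k: bic_linear(rss_by_k[k], n, k, sigma2_hat))
```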
Estimating σ̂² when applying IC
▪ How do we find σ̂² to compute AIC and BIC?
▪ First approach: use the σ̂² estimated directly from each candidate model.
▪ When 𝑃 << 𝑁: use the σ̂² estimated from the full model (all predictors used
→ low-bias model; suggested in the textbook).
▪ When 𝑃/𝑁 is not small, try an iterative procedure (see the sketch below):
▪ Start with σ₀² = s²_Y (sample variance of 𝑌), select the best model based on AIC/BIC,
and call it 𝑀_{𝑘₀}.
▪ Set σ₁² equal to the error-variance estimate σ̂² from 𝑀_{𝑘₀}, select the best model on the IC
again, and call it 𝑀_{𝑘₁}. Iterate until the selection no longer changes (often two steps are enough).
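▪ A sketch of this iterative procedure, assuming a user-supplied helper best_model_given_sigma2(sigma2) that returns the IC-best model (its columns, RSS and size 𝑘) when that σ² is plugged into AIC/BIC; the helper is hypothetical, not defined on these slides:

```python
import numpy as np

def iterate_sigma2(y, best_model_given_sigma2, max_iter=10):
    """Alternate between plugging sigma^2 into the IC and re-estimating it from the chosen model."""
    n = len(y)
    sigma2 = float(np.var(y, ddof=1))        # sigma_0^2 = sample variance of Y
    selected = None
    for _ in range(max_iter):
        cols, rss, k = best_model_given_sigma2(sigma2)
        if cols == selected:                 # model choice stopped changing -> converged
            break
        selected = cols
        sigma2 = rss / (n - k - 1)           # re-estimate sigma^2 from the chosen model
    return selected, sigma2
```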
Best subset selection for 'Credit'
▪ [Table: for each subset size 𝑘, the best-subset predictors among Income, Limit, Rating, Cards, Age,
Education, Female, Student, Married, Asian and Caucasian are marked with ∗; Income, Rating and Student
appear in most of the best subsets, while Education, Female, Married and Caucasian appear only rarely.]
Best subset selection for 'Credit'; p̄ = 8
Best subset optimal model for 'Credit’
Forward Stepwise Selection
1. Let ℳ₀ denote the null model, which contains no predictors.
2. For 𝑘 = 0, …, p̄ − 1:
▪ Consider all 𝑝 − 𝑘 models that augment the predictors in ℳ_𝑘 with one additional
predictor.
▪ Choose the best among these 𝑝 − 𝑘 models, and call it ℳ_{𝑘+1}. Here best is defined as
having the smallest RSS or highest 𝑅².
3. Select a single best model from among ℳ₀, …, ℳ_p̄ using cross-validation or an IC.
▪ Still not confined just to OLS; can do, e.g., logit regression, just replace 𝑅² by
deviance (−2log(L)). A code sketch of the OLS version follows below.
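▪ A minimal sketch of this greedy search for OLS with the RSS criterion, assuming numpy arrays X (n × p) and y; names are illustrative:

```python
# Forward stepwise selection: at each step add the single predictor that
# reduces the RSS the most, producing a nested sequence M_0, M_1, ..., M_p_bar.
import numpy as np

def rss_ols(X, y, cols):
    """RSS of the OLS fit (with intercept) on the columns in `cols`."""
    Z = np.column_stack([np.ones(len(y)), X[:, cols]]) if cols else np.ones((len(y), 1))
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

def forward_stepwise(X, y, p_bar=None):
    """Return the nested sequence M_0, ..., M_p_bar as lists of column indices."""
    n, p = X.shape
    p_bar = p if p_bar is None else p_bar
    current, path = [], [[]]                 # M_0 is the null model
    for _ in range(p_bar):
        remaining = [j for j in range(p) if j not in current]
        best_j = min(remaining, key=lambda j: rss_ols(X, y, current + [j]))
        current = current + [best_j]
        path.append(current)
    return path
```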
Forward subset selection for 'Credit'; p̄ = 8
▪ Variables in the forward-selected model of each size 𝑘 (each ℳ_𝑘 nests ℳ_{𝑘−1}):
• 𝑘 = 1: Rating
• 𝑘 = 2: + Income
• 𝑘 = 3: + Student
• 𝑘 = 4: + Limit
• 𝑘 = 5: + Cards
• 𝑘 = 6: + Age
• 𝑘 = 7: + Asian
• 𝑘 = 8: + Female
• Education, Married, and Caucasian are never selected.
Forward sel. optimal model for 'Credit'
▪ Still not confined just to OLS, can do, e.g., logit regression.
Backward Stepwise Selection
▪ Works in the opposite direction: start with the model containing all predictors and remove one
predictor at a time, at each step dropping the one whose removal leaves the smallest RSS (highest 𝑅²).
Backward subset selection for 'Credit'; p̄ = 8
▪ Variables in the backward-selected model of each size 𝑘 (each ℳ_𝑘 is nested in ℳ_{𝑘+1}):
• 𝑘 = 8: Income, Limit, Rating, Cards, Age, Female, Student, Asian
• 𝑘 = 7: drop Female
• 𝑘 = 6: drop Asian
• 𝑘 = 5: drop Age
• 𝑘 = 4: drop Rating
• 𝑘 = 3: drop Cards
• 𝑘 = 2: drop Student
• 𝑘 = 1: drop Income (only Limit remains)
• Education, Married, and Caucasian do not appear in any of the displayed models.
Backward-selection optimal model for 'Credit'
▪ Chen and Chen (2008) propose the extended BIC, formally justified for high-
dimensional settings: replace log(𝑁) with log(𝑁) + 2log(𝑃).
▪ Wang (2012) shows that forward selection using this extended BIC is consistent in
ultra-high-dimensional settings under some assumptions (normal
homoscedastic linear regression plus some regularity conditions). A sketch of the penalty follows below.
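▪ A one-line sketch of the extended-BIC penalty (loglik denotes the maximized log-likelihood; names are illustrative):

```python
import math

def ebic(loglik, k, n, p):
    # Extended BIC: -2 log L + [log(N) + 2 log(P)] * k
    return -2 * loglik + (math.log(n) + 2 * math.log(p)) * k
```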
Final comments
▪ Nice feature: end up with a single model that is easy to work with and discuss. Most
likely not the true model.
▪ Very interpretable, but careful about claiming causality!
▪ Can do this type of selection with essentially any criterion function (e.g., log-
likelihood for binary choice).
▪ Many flavors of selection methods, not all have theoretical justification. Need to
figure out what works when.
▪ Standard approaches get very slow very fast. New techniques push the envelope
quite a bit though.
▪ Stability w.r.t. the data sample could be an issue - a starkly different model may be
obtained if another sample is considered.