Elements of Statistical Learning II - Ch.3 Linear Regression - Notes
3 Linear Models
Case study
Note: For a linear regression (LinReg, Lasso, Ridge) case study see "Linear Regression case study
(sklearn).pdf"
- Covers: categorical data (one-hot encoding), scaling of numerical data (needed to interpret
coefficients correctly), fat tails in the target (take the log), multicollinearity/correlated features
and their effect on coefficient interpretation (unstable coefficients), determining the stability of
coefficients (cross-validation), and interpretation of coefficients: the effect/relationship of Xi on Y
given that the other X's remain constant (conditional dependence, as opposed to the marginal
dependence you get from e.g. the plain correlation of Xi and Y). A sketch of such a pipeline
follows below.
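A minimal sketch of that kind of preprocessing/modelling pipeline in scikit-learn. The toy DataFrame, column names and target here are hypothetical placeholders, not taken from the case study itself:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate

# Hypothetical toy data standing in for the case-study dataset
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "city": rng.choice(["A", "B", "C"], size=n),       # categorical -> one-hot
    "age": rng.normal(40, 10, size=n),                 # numerical -> standardize
    "income": rng.lognormal(10, 0.5, size=n),          # numerical -> standardize
})
df["target"] = np.exp(0.02 * df["age"] + rng.normal(0, 0.5, size=n))  # fat-tailed target

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("num", StandardScaler(), ["age", "income"]),      # scaling makes coefficients comparable
])

# Model the log of the fat-tailed target, predict back on the original scale
model = TransformedTargetRegressor(
    regressor=make_pipeline(preprocess, Ridge(alpha=1.0)),
    func=np.log1p, inverse_func=np.expm1,
)

# Cross-validation: estimate performance and check how stable the coefficients are
cv = cross_validate(model, df[["city", "age", "income"]], df["target"],
                    cv=5, return_estimator=True)
coefs = pd.DataFrame([est.regressor_[-1].coef_ for est in cv["estimator"]])
print(coefs.std())   # large std across folds => unstable coefficients (often collinearity)
```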
Linear Regression
Linear Regression Assumptions
- Broadly, linear regression's assumptions matter for inference rather than for prediction (apart
from obvious ones like linearity in the inputs). See the link below.
- https://stats.stackexchange.com/questions/486672/why-dont-linear-regression-assumptions-matter-in-machine-learning
- No multicollinearity / orthogonal variables
- No heteroskedasticity (the variance of the residuals is the same for any value of X)
- Gaussian errors / for any fixed value of X, y is normally distributed
- Linearity
- Independence: No autocorrelation/Independence in the errors
- https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/R/R5_Correlation-Regression/R5_Correlation-Regression7.html
- https://www.statisticssolutions.com/assumptions-of-linear-regression/ (includes multicollinearity)
- http://people.duke.edu/~rnau/testing.htm
- These violations affect the coefficients (and their interpretation) and their p-values (statistical
significance), but do not affect predictive power (see the book quote below; a quick diagnostics
sketch follows after the links).
- More info on multicollinearity:
o https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
o Includes "Do I have to fix multicollinearity?" (short answer: no if the model is only for
predictive power, maybe if it is also for inference) and "How to deal with multicollinearity"
The fact that some or all predictor variables are correlated among
themselves does not, in general, inhibit our ability to obtain a good fit
nor does it tend to affect inferences about mean responses or
predictions of new observations. —Applied Linear Statistical Models,
p289, 4th Edition.
- http://people.duke.edu/~rnau/regnotes.htm
https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html#sphx-glr-auto-examples-inspection-plot-linear-model-coefficient-interpretation-py
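A hedged sketch of how one might check some of these assumptions in practice using statsmodels (the data here is a synthetic placeholder, not from any of the linked pages): residuals vs fitted values for heteroskedasticity, the Durbin-Watson statistic for autocorrelation, and variance inflation factors (VIF) for multicollinearity.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical design matrix X (n x p) and target y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=200)   # deliberately collinear column
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=200)

Xc = sm.add_constant(X)
fit = sm.OLS(y, Xc).fit()

# Heteroskedasticity: does the residual spread grow with the fitted values?
print("corr(|resid|, fitted):", np.corrcoef(np.abs(fit.resid), fit.fittedvalues)[0, 1])

# Autocorrelation of errors: Durbin-Watson close to 2 means little autocorrelation
print("Durbin-Watson:", durbin_watson(fit.resid))

# Multicollinearity: VIF well above 10 is a common warning threshold
for i in range(1, Xc.shape[1]):
    print(f"VIF x{i}:", variance_inflation_factor(Xc, i))
```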
Shrinkage methods
- Lasso, Ridge
Purpose
- Ridge regression and lasso regression are designed to deal with situations in which the
candidate independent variables are highly correlated with each other and/or their number is
large relative to the sample size (i.e. overfitting).
o One way to think of it: the OLS solution is β = (XᵀX)^-1 Xᵀy, so the coefficients depend on
inverting the (scaled) covariance matrix of the inputs, which is ill-conditioned when the inputs
are strongly correlated. Ridge solves β = (XᵀX + λI)^-1 Xᵀy instead, so shrinking the betas
effectively stabilizes that inversion. (This is only approximate, since for shrinkage > 0 the
solution no longer involves the plain covariance matrix exactly, but the intuition carries over;
see the numerical sketch below.)
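A small sketch illustrating this point with toy data (my own example, not from the book): with two nearly identical features, the individual OLS coefficients are ill-determined, while ridge spreads the effect across both.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two almost identical (highly correlated) features; the true effect is 1.0 on each
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + 0.01 * rng.normal(size=300)          # near-duplicate of x1
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=300)

# OLS: individual coefficients blow up / become unstable (only their sum is well determined)
print("OLS:  ", LinearRegression().fit(X, y).coef_)

# Ridge: shrinkage splits the effect roughly evenly across the correlated features
print("Ridge:", Ridge(alpha=10.0).fit(X, y).coef_)
```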
Predictive power
- Ridge/L2 usually predicts at least as well as L1/Lasso, but try both, or elastic net
- https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when
- https://stats.stackexchange.com/questions/331782/if-only-prediction-is-of-interest-why-use-lasso-over-ridge/331809#331809
o Intuition: with collinear variables, ridge keeps both and just shrinks them, whereas lasso
typically "randomly" kicks one out. Having more variables to rely on gives more
"diversification". It does depend on the true distribution of the regression coefficients, though:
if only a small fraction of coefficients are truly nonzero, lasso can perform better (see the
comparison sketch below).
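A hedged sketch of the "just try both (or elastic net)" advice, comparing cross-validated ridge, lasso and elastic net; the synthetic data-generating process is an arbitrary assumption of mine, and which model wins depends entirely on the true coefficient structure.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score

# Synthetic data with many correlated features and only a few truly informative ones
X, y = make_regression(n_samples=300, n_features=50, n_informative=5,
                       effective_rank=10, noise=5.0, random_state=0)

models = {
    "ridge": RidgeCV(alphas=np.logspace(-3, 3, 13)),
    "lasso": LassoCV(cv=5, random_state=0),
    "enet":  ElasticNetCV(cv=5, l1_ratio=[0.1, 0.5, 0.9], random_state=0),
}

# Compare out-of-sample R^2 across the three penalties
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:5s} R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```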
Differences
- L1 is useful for feature selection; L2, on the other hand, is useful when you have
collinear/codependent features.
o https://explained.ai/regularization/L1vsL2.html
- L2 shrinks large coefficients more and small coefficients less, whereas L1 shrinks all
coefficients by the same absolute amount (think of β^2 vs abs(β) and their gradients). This is
illustrated in the sketch after this list.
o From a Bayesian perspective, L2/Ridge corresponds to a Gaussian (normal) prior on the
coefficients, whereas L1/Lasso corresponds to a Laplacian (double-exponential) prior
- Generally, when you have many small/medium sized effects you should
go with ridge. If you have only a few variables with a medium/large
effect, go with lasso. Hastie, Tibshirani, Friedman (ESLII)
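To make the "L2 shrinks large coefficients more, L1 shrinks everything by the same amount" point concrete, here is a small numeric sketch. It assumes an orthonormal design, where the ridge and lasso solutions reduce to the simple closed forms used below (proportional shrinkage vs soft-thresholding):

```python
import numpy as np

# OLS coefficients from a hypothetical orthonormal-design fit
beta_ols = np.array([5.0, 1.0, 0.2])
lam = 0.5

# Ridge with orthonormal inputs: proportional shrinkage beta / (1 + lam)
beta_ridge = beta_ols / (1.0 + lam)

# Lasso with orthonormal inputs: soft-thresholding, same absolute shrinkage for every
# coefficient, and exact zeros for small ones (this is the feature-selection effect)
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

print("OLS:  ", beta_ols)      # [5.    1.    0.2  ]
print("Ridge:", beta_ridge)    # [3.333 0.667 0.133] -- big coefficients lose more in absolute terms
print("Lasso:", beta_lasso)    # [4.5   0.5   0.   ] -- constant shrinkage, small one hits zero
```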
Elastic Net
https://www.quora.com/What-is-the-advantage-of-combining-L2-and-L1-regularizations
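A brief sketch of elastic net combining both penalties; l1_ratio controls the mix (closer to 0 behaves like ridge, 1 is pure lasso) and the data here is again a synthetic placeholder of my own.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Placeholder data: correlated features, only a few truly informative coefficients
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       effective_rank=8, noise=3.0, random_state=0)

# Cross-validation picks both the L1/L2 mix (l1_ratio) and the overall strength (alpha)
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0], cv=5, random_state=0).fit(X, y)
print("chosen l1_ratio:", enet.l1_ratio_)
print("chosen alpha:   ", enet.alpha_)
print("nonzero coefs:  ", (enet.coef_ != 0).sum())
```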