Elements of Statistical Learning II - Ch.3 Linear Regression - Notes


Ch. 3 Linear Models

Case study
Note: For a linear regression (LinReg, Lasso, Ridge) case study see "Linear Regression case study (sklearn).pdf"

- Covers: categorical data (one-hot encoding); scaling of numerical data (so coefficients can be interpreted/compared correctly); fat tails in the target (take the log); multicollinearity/correlated features and their effect on coefficient interpretation (unstable coefficients); determining the stability of coefficients (cross-validation); and interpretation of coefficients as the effect/relationship of Xi on Y given the other X's remain constant (conditional dependence, as opposed to the independence/marginal dependence measured by e.g. the correlation of Xi and Y). A rough pipeline sketch follows below.
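A minimal sketch of that workflow, assuming a generic DataFrame df and made-up column names (the actual case study is in the PDF referenced above):

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate

categorical = ["city", "contract_type"]      # hypothetical categorical columns
numerical = ["age", "income", "tenure"]      # hypothetical numeric columns

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numerical),    # scale so coefficients are comparable
])

# log-transform the fat-tailed target, fit a regularised linear model
model = TransformedTargetRegressor(
    regressor=make_pipeline(preprocess, RidgeCV(alphas=np.logspace(-3, 3, 13))),
    func=np.log1p, inverse_func=np.expm1,
)

# refit on several folds and inspect the spread of the coefficients
# to judge their stability under correlated features
cv = cross_validate(model, df[categorical + numerical], df["target"],
                    cv=5, return_estimator=True)
coefs = pd.DataFrame([est.regressor_[-1].coef_ for est in cv["estimator"]])
print(coefs.std())   # large std across folds = unstable coefficients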

Linear Regression
Linear Regression Assumptions

Effect of Linear Reg assumptions - Inference vs Prediction (Stats vs ML)

- Broadly, linear regression's assumptions affect inference rather than prediction (apart from obvious ones like linearity in the inputs). See the link below
- https://stats.stackexchange.com/questions/486672/why-dont-linear-regression-assumptions-matter-in-machine-learning

Linear Regression assumptions list

- No multicollinearity / orthogonal variables
- No heteroskedasticity (the variance of the residuals is the same for any value of X)
- Gaussian errors (for any fixed value of X, Y is normally distributed)
- Linearity
- Independence: no autocorrelation / independence of the errors

Diagnostic for Linear regression assumptions

- https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/R/R5_Correlation-Regression/R5_Correlation-Regression7.html
- https://www.statisticssolutions.com/assumptions-of-linear-regression/ (includes multicollinearity)
- http://people.duke.edu/~rnau/testing.htm
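A quick sketch of the usual visual checks covered in those links, assuming a design matrix X and target y are already defined:

import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

fit = sm.OLS(y, sm.add_constant(X)).fit()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Residuals vs fitted: curvature suggests non-linearity, a funnel shape heteroskedasticity
axes[0].scatter(fit.fittedvalues, fit.resid, alpha=0.5)
axes[0].axhline(0, color="grey")
axes[0].set(xlabel="fitted values", ylabel="residuals")
# Q-Q plot: departures from the line suggest non-Gaussian errors
sm.qqplot(fit.resid, line="45", fit=True, ax=axes[1])
plt.show()

# Durbin-Watson near 2 suggests no first-order autocorrelation in the errors
print(durbin_watson(fit.resid))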

Effect of these assumptions

- Mainly on the p-values of the coefficients (whether they are significant)
o These come from statistical tests, which rely on some of these assumptions (i.e. linear regression's assumptions)
- The assumptions that do affect prediction are intuitive / generic to supervised learning in general
o 1) linearity (i.e. the choice of relationship/model between X and Y); 2) independence among the (time series of) errors; 3) when using MSE, handling unwanted outliers, otherwise the fit is pulled heavily towards them (Gaussian-errors assumption: MSE is the MLE for Gaussian errors/target)
Multicollinearity assumption

- Affects the coefficients (and their interpretation) and their p-values (statistical significance). Does not affect predictive power (see the book quote below)
- More info on multicollinearity:
o https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
o Includes “Do I have to fix Multicollinearity” (short answer – if just for predictive
power no, if also for inference maybe), “How to deal with Multicollinearity”
 The fact that some or all predictor variables are correlated among
themselves does not, in general, inhibit our ability to obtain a good fit
nor does it tend to affect inferences about mean responses or
predictions of new observations. —Applied Linear Statistical Models,
p289, 4th Edition.
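A common numerical check from the links above is the variance inflation factor (VIF); a rough sketch, assuming a DataFrame X of numeric predictors (VIF above roughly 5-10 is the usual warning sign):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)                 # VIFs are computed with an intercept included
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif.drop("const").sort_values(ascending=False))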

Interpreting inference in linear regression and other areas of caution

- http://people.duke.edu/~rnau/regnotes.htm

Interpretation of coefficients of linear models (and Common pitfalls)

https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html#sphx-glr-auto-examples-inspection-plot-linear-model-coefficient-interpretation-py
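In the spirit of that example, a small sketch: fit the model on standardised features so the coefficients share one scale, and remember each coefficient is the effect of Xi holding the other features fixed (conditional dependence). Assumes a feature DataFrame X and target y:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
coefs = pd.Series(pipe[-1].coef_, index=X.columns)
print(coefs.sort_values())   # comparable only because features were put on one scale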

Subset/Stepwise regression Selection


- Best subset selection (all possible combinations of the K features) (equivalent to L0 regularization)
- Forward/backward stepwise selection (a greedy sketch follows below)
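Exhaustive best-subset search over all 2^K combinations quickly becomes infeasible; a sketch of the greedy forward/backward alternative with scikit-learn, assuming X and y are defined:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=5,    # arbitrary budget for illustration; "auto" also works
    direction="forward",       # "backward" for backward elimination
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support())       # boolean mask of the selected features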

Shrinkage methods
- Lasso, Ridge

Purpose

- Ridge regression and lasso regression are designed to deal with situations in which the candidate independent variables are highly correlated with each other and/or their number is large relative to the sample size (i.e. overfitting), but those methods are beyond the scope of this discussion.
o One way to think about it: the OLS solution beta_hat = (X^T X)^(-1) X^T y involves inverting the (scaled) covariance matrix of the inputs, which is unstable when the inputs are highly correlated; with shrinkage > 0, ridge instead solves beta_hat = (X^T X + lambda*I)^(-1) X^T y, which stabilises the inverse and shrinks the betas (a numerical sketch follows below).
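A small numerical sketch of that point (synthetic data, arbitrary shrinkage value): with nearly collinear inputs the OLS coefficients tend to blow up and offset each other, while adding lambda*I before inverting shrinks and stabilises them.

import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.001 * rng.normal(size=n)         # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

lam = 10.0
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print("OLS:  ", beta_ols)      # typically large, offsetting values
print("ridge:", beta_ridge)    # shrunk towards similar, smaller values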

Lasso vs Ridge (vs Elastic Net)

Predictive power
- Ridge/L2 usually better than L1/Lasso – but try both or elastic net
- https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when
- https://stats.stackexchange.com/questions/331782/if-only-prediction-is-of-interest-why-use-lasso-over-ridge/331809#331809
o Intuition: when you have collinear variables, ridge will keep both and just shrink both, whereas lasso will typically "randomly" kick one out. By having more variables to rely on you have more "diversification". But it does depend on the true distribution of the regression coefficients: if only a small fraction of the coefficients are truly nonzero, lasso can perform better (a comparison sketch follows below)
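A hedged sketch of "try both, or elastic net": compare the three by cross-validated R^2 on whatever (X, y) is at hand; which one wins depends on how sparse the true coefficients are.

import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

models = {
    "ridge": RidgeCV(alphas=np.logspace(-3, 3, 13)),
    "lasso": LassoCV(cv=5),
    "elastic net": ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5),
}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
    print(f"{name:>12}: {scores.mean():.3f} +/- {scores.std():.3f}")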

Differences

- L1 is useful for feature selection; L2, on the other hand, is useful when you have collinear/codependent features.
o https://explained.ai/regularization/L1vsL2.html
- L2 shrinks large coefficients more and small coefficients less, whereas L1 shrinks all coefficients by the same amount (just think of the penalties β^2 vs abs(β))
o From a Bayesian perspective, L2/Ridge assumes a Gaussian (normal) prior distribution on the coefficients whereas L1/Lasso assumes a Laplace prior

- Generally, when you have many small/medium sized effects you should
go with ridge. If you have only a few variables with a medium/large
effect, go with lasso. Hastie, Tibshirani, Friedman (ESLII)

Elastic Net

https://www.quora.com/What-is-the-advantage-of-combining-L2-and-L1-regularizations

Elastic net (that is, L1 + L2 regularization) is definitely "worth it" in the following situations, as noted in the paper first disclosing the algorithm, Zou, Hui; Hastie, Trevor (2005). "Regularization and Variable Selection via the Elastic Net":

- In the p>n case, the lasso selects at most n variables before it saturates, because of the nature of the convex optimization problem. This seems to be a limiting feature for a variable selection method. Moreover, the lasso is not well defined unless the bound on the L1-norm of the coefficients is smaller than a certain value.
- If there is a group of variables among which the pairwise correlations are very high, then the lasso tends to select only one variable from the group and does not care which one is selected. [This is partially rectified by the group LASSO algorithm, although the user must then identify the group.]
- For usual n>p situations, if there are high correlations between predictors, it has been empirically observed that the prediction performance of the lasso is dominated by ridge regression.
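A small illustration of the grouping point above (synthetic data, arbitrary penalty values): with two nearly identical predictors, lasso tends to zero one of them out while elastic net keeps both with similar coefficients.

import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.001 * rng.normal(size=n)         # almost a duplicate of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.5 * rng.normal(size=n)

print("lasso:      ", Lasso(alpha=0.5).fit(X, y).coef_)        # usually one coefficient near zero
print("elastic net:", ElasticNet(alpha=0.5, l1_ratio=0.3).fit(X, y).coef_)  # weight spread across both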

L1 vs L2 regularisation generally (e.g. NN)

- At a high level both do the same thing – they reduce the effective degrees of freedom
- However, the way the weights drop is different: in L2 regularization the weight reduction is multiplicative and proportional to the value of the weight, so it is faster for large weights and decelerates as the weights get smaller. In L1 regularization, on the other hand, the weights are reduced by a fixed amount in every iteration, irrespective of the value of the weight (see the sketch below).
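A sketch of those two update rules for plain gradient descent, with illustrative learning rate and penalty values:

import numpy as np

def step(w, grad_loss, lr=0.1, lam=0.01, penalty="l2"):
    """One gradient-descent step on (data loss + penalty) for a weight vector w."""
    if penalty == "l2":
        reg_grad = 2 * lam * w          # proportional to the weight -> multiplicative decay
    else:  # "l1"
        reg_grad = lam * np.sign(w)     # same magnitude for every nonzero weight -> fixed decay
    return w - lr * (grad_loss + reg_grad)

# Ignore the data-loss gradient to isolate the decay behaviour:
w = np.array([5.0, 0.05])
print(step(w, np.zeros(2), penalty="l2"))   # large weight shrinks a lot, small one barely
print(step(w, np.zeros(2), penalty="l1"))   # both shrink by the same 0.001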
