Notes_Lecture 13_Regularization_LASSO and RIDGE Regression

The document discusses LASSO and Ridge regression as methods for statistical learning, highlighting the trade-off between prediction accuracy and model interpretability. Ridge regression shrinks coefficients towards zero to reduce variance, while LASSO can set some coefficients to zero, aiding in variable selection. Both methods utilize a tuning parameter to control the strength of their respective penalties.

Lecture 18(b): LASSO and Ridge regression


Foundations of Data Science:

Algorithms and Mathematical Foundations

Mihai Cucuringu
[email protected]

CDT in Mathematics of Random Systems


University of Oxford

28 September, 2023

Overview

Ridge regression

LASSO
The Trade-Off Between Prediction Accuracy and
Model Interpretability
▶ linear regression: fairly inflexible
▶ splines: considerably more flexible (can fit a much wider range of
possible shapes to estimate f )
Inference:
▶ linear model: easy to understand the relationship between Y and
X1 , X2 , . . . , Xp
Very flexible approaches (splines, SVM, etc.)
▶ can lead to complicated estimates of f
▶ hard to understand how any individual predictor is associated
with the response (less interpretable)

Example: LASSO
▶ less flexible
▶ linear model + sparsity of [β0 , β1 , . . . , βp ]
▶ more interpretable; only a small subset of predictors matter
Flexibility vs. Interpretability

[Figure (ISLR Figure 2.7): a representation of the trade-off between flexibility and interpretability, using different statistical learning methods. Interpretability (vertical axis, high to low) against flexibility (horizontal axis, low to high): Subset Selection and Lasso sit at high interpretability and low flexibility, followed by Least Squares, Generalized Additive Models, Trees, Bagging/Boosting, and Support Vector Machines at low interpretability and high flexibility. In general, as the flexibility of a method increases, its interpretability decreases.]
R²
▶ also called the coefficient of determination
▶ pronounced "R squared"
▶ gives the proportion of the variance in the dependent variable
that is predictable from the independent variable(s)

$$R^2 = \frac{TSS - RSS}{TSS}$$
where
$$RSS = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2, \qquad TSS = \sum_{i} (y_i - \bar{y})^2$$
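A quick numerical check of this decomposition (a minimal sketch, not from the slides; the toy data are simulated purely for illustration):

set.seed(1)
n = 100
x = rnorm(n)
y = 1 + 2*x + rnorm(n)          # toy linear relationship plus noise
fit = lm(y ~ x)
rss = sum(residuals(fit)^2)     # residual sum of squares
tss = sum((y - mean(y))^2)      # total sum of squares
(tss - rss)/tss                 # equals summary(fit)$r.squared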
Variable selection
Which predictors are associated with the response? (in order to fit a
single model involving only those d predictors)
▶ Note: R² always increases as you add more variables to the model
▶ adjusted R²: $1 - \frac{RSS/(n-p-1)}{TSS/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}$
▶ Mallow's $C_p = \frac{1}{n}\,(RSS + 2p\hat{\sigma}^2)$
▶ Akaike Information Criterion: $AIC = \frac{1}{n\hat{\sigma}^2}\,(RSS + 2p\hat{\sigma}^2)$
Cannot consider all $2^p$ models...
▶ Best Subset Selection: fit a separate least squares regression for
each possible k -combination of the p predictors, and select the
best one
▶ Forward selection: start with the null model and keep adding
predictors one by one
▶ Backward selection: start with all variables in the model, and
remove the variable with the largest p-value
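A minimal sketch of stepwise selection in R, assuming a hypothetical data frame dat whose response column is named y; note that step() ranks candidate variables by AIC rather than by p-values:

null.fit = lm(y ~ 1, data = dat)                 # intercept-only (null) model
full.fit = lm(y ~ ., data = dat)                 # model with all predictors
fwd = step(null.fit, scope = formula(full.fit), direction = "forward")
bwd = step(full.fit, direction = "backward")     # backward elimination by AIC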
Prediction Accuracy
$$\mathrm{MSE} = E\big[(h(x^*) - \bar{h}(x^*))^2\big] + \big[f(x^*) - \bar{h}(x^*)\big]^2 + \mathrm{Var}[\epsilon]$$
x*: new data point, f: ground truth, h: our estimator
$$\mathrm{MSE} = \mathrm{Var}[h(x^*)] + \mathrm{Bias}\big(h(x^*)\big)^2 + \mathrm{Var}[\epsilon]$$

▶ if true relationship is ≈ linear, the OLS will have low bias


▶ if n >> p: OLS also has low variance, and performs well on Xtest
▶ if n ∼ p: OLS has high variability, leads to overfitting/poor
predictions on Xtest
▶ if n < p: OLS estimate is no longer unique!
Today:
▶ by shrinking the estimated coefficients, we can often substantially
reduce the variance at the cost of a negligible increase in bias
▶ can lead to substantial improvements in the accuracy with which
we can predict the response for Xtest
Model Interpretability

▶ some or most of the variables used in a multiple linear regression


may not be associated with the response

▶ excluding them from the fit leads to a model that is more easily
interpreted

Shrinkage/Regularization:
▶ by setting the corresponding coefficient estimates to zero — we
can obtain a model that is more easily interpreted

▶ approach for automatically performing feature/variable selection


and thus excluding irrelevant variables from a multiple regression
model
Variable selection

▶ Subset Selection: identify a subset of p predictors that best relate


to the response, and perform OLS on them

▶ Shrinkage/Regularization: fit a model involving all p predictors,


but the estimated coefficients are shrunken towards zero, or end
up even equal to zero

▶ Dimensionality Reduction: first project the p predictors into a


d-dimensional subspace, with d < p. The d linear combinations,
or projections are subsequently used as predictors in OLS
(principal component regression PCR)
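A minimal sketch of principal component regression (not from the slides; assumes a predictor matrix X and response y are available, and d is chosen by the user):

d = 5                                    # number of components to keep (assumed)
pc = prcomp(X, center = TRUE, scale. = TRUE)
Z = pc$x[, 1:d]                          # n x d matrix of principal component scores
pcr.fit = lm(y ~ Z)                      # OLS on the d projections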
Shrinkage Methods

▶ fit a model containing all p predictors using a technique that


constrains or regularizes the coefficient estimates, or
equivalently, that shrinks the coefficient estimates towards zero

▶ shrinking the coefficient estimates can significantly reduce their


variance

▶ the two best-known techniques for shrinking the regression


coefficients towards zero are
▶ ridge regression
▶ lasso regression

See Section 6.2 in the ISLR textbook.


Regularization penalty

Idea: impose an ℓq penalty on the vector of beta coefficients, to promote shrinking them towards zero.

[Figure: contours of the ℓq penalty for several values of q.]

Credit: Peter Gerstoft
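Written out (a standard formulation, stated here for reference rather than taken from the slide), the penalized criterion is
$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \; ||y - X\beta||_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q,$$
with q = 2 giving ridge regression and q = 1 giving the LASSO.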


Ridge Regression
Recall: OLS estimates β0, β1, . . . , βp such that it minimizes
$$RSS = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$$
Ridge regression shrinks β1, . . . , βp towards zero. Given a response vector $y \in \mathbb{R}^n$ and a predictor matrix $X \in \mathbb{R}^{n \times p}$,
$$\hat{\beta}^{(\mathrm{ridge})} = \arg\min_{\beta \in \mathbb{R}^p} \underbrace{\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2}_{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$
$$= \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
$$= \arg\min_{\beta \in \mathbb{R}^p} \underbrace{||y - X\beta||_2^2}_{\text{Loss}} + \underbrace{\lambda\,||\beta||_2^2}_{\text{Penalty}}$$
$$\hat{\beta}^{(\mathrm{ridge})} = \arg\min_{\beta \in \mathbb{R}^p} \underbrace{||y - X\beta||_2^2}_{\text{Loss}} + \underbrace{\lambda\,||\beta||_2^2}_{\text{Penalty}}$$
$$\hat{\beta}^{(\mathrm{ridge})} = (X^T X + \lambda I)^{-1} X^T y$$

Here λ ≥ 0 is a tuning parameter


▶ controls the strength of the penalty term

▶ λ = 0 recovers the linear regression estimate

▶ λ = ∞ leads to β̂ (ridge) = 0

▶ λ ∈ (0, ∞) trades off two ideas: fitting a linear model of y on X


versus shrinking the coefficients
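The closed-form solution above is easy to check numerically (a minimal sketch, assuming a centred predictor matrix X and response y with no intercept term):

p = ncol(X)
lambda = 1                                # tuning parameter (arbitrary value)
# ridge estimate from the closed form (X'X + lambda*I)^{-1} X'y
b.ridge = solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
b.ols = solve(crossprod(X), crossprod(X, y))   # lambda = 0 recovers OLS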
Experimental setup
Given fixed covariates xᵢ ∈ Rᵖ, i = 1, . . . , n
We observe:
▶ yᵢ = f(xᵢ) + εᵢ, i = 1, . . . , n
▶ for a linear model f(xᵢ) = xᵢᵀβ
▶ εᵢ ∈ R
▶ E[εᵢ] = 0
▶ Var[εᵢ] = σ²
▶ Cov(εᵢ, εⱼ) = 0 for i ≠ j
Experimental setup
▶ n = 50, p = 30, and σ² = 1
▶ The true model is linear with
▶ 10 large coefficients (between 0.5 and 1) and
▶ 20 small ones (between 0 and 0.3)
▶ Histogram of true coefficients

Source: R. Tibshirani
▶ the linear regression fit yields:


▶ Squared bias ≈ 0.006
▶ Variance ≈ 0.627
▶ Pred. error ≈ 1 + 0.006 + 0.627 ≈ 1.633
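One way to reproduce this kind of setup (a sketch; the exact coefficient values used on the slide are not given, so they are drawn at random within the stated ranges):

set.seed(0)
n = 50; p = 30; sigma = 1
beta = c(runif(10, 0.5, 1), runif(20, 0, 0.3))   # 10 large, 20 small coefficients
X = matrix(rnorm(n * p), n, p)
y = drop(X %*% beta + sigma * rnorm(n))
fit.ols = lm(y ~ X + 0)                          # OLS fit, no intercept
# estimating squared bias / variance as on the slide would require repeating
# this over many simulated data sets (a Monte Carlo loop)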
Improved prediction via shrinking

                 Linear Regression        Ridge Reg. (at its best)
Squared bias     ≈ 0.006                  ≈ 0.077
Variance         ≈ 0.627                  ≈ 0.403
Pred. error      ≈ 1 + 0.006 + 0.627      ≈ 1 + 0.077 + 0.403
                 ≈ 1.633                  ≈ 1.48
Ridge regression in R

The function lm.ridge in the package MASS:

▶ lambdas = seq(0, 25, length = 100)

▶ aa = lm.ridge(y ~ x + 0, lambda = lambdas)

▶ b.ridge = coef(aa)

▶ fit.ridge = b.ridge %*% t(x)

The glmnet function/package is also available in R.
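For instance (a sketch, not from the slides), ridge regression in glmnet corresponds to alpha = 0; note that glmnet standardizes the predictors and scales lambda differently from lm.ridge, so the two fits are not directly comparable:

library(glmnet)
ridge.fit = glmnet(x, y, alpha = 0)      # alpha = 0: ridge penalty; alpha = 1: LASSO
coef(ridge.fit)                          # coefficients along glmnet's lambda grid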


Bias and variance of ridge regression

$$\hat{\beta}^{(\mathrm{ridge})} = \arg\min_{\beta \in \mathbb{R}^p} \underbrace{||y - X\beta||_2^2}_{\text{Loss}} + \underbrace{\lambda\,||\beta||_2^2}_{\text{Penalty}}$$

Bias and variance:


▶ not as simple to derive for ridge regression as they are for linear
regression
▶ but closed-form expressions are still possible
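For reference, the standard closed-form expressions under the fixed-design model $y = X\beta + \epsilon$ with $\mathrm{Var}[\epsilon] = \sigma^2 I$ (stated here as a supplement rather than taken from the slide) are
$$E[\hat{\beta}^{(\mathrm{ridge})}] = (X^T X + \lambda I)^{-1} X^T X\,\beta, \qquad \mathrm{Var}[\hat{\beta}^{(\mathrm{ridge})}] = \sigma^2 (X^T X + \lambda I)^{-1} X^T X\,(X^T X + \lambda I)^{-1},$$
so the bias is $-\lambda (X^T X + \lambda I)^{-1}\beta$, which grows in magnitude with λ while the variance shrinks.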

The general trend is:


▶ The bias increases as λ increases
▶ The variance decreases as λ increases
Bias and variance of ridge regression [figure]

Mean squared error (MSE), bias and variance [figure]
Recap: ridge regression
▶ minimizes the usual regression criterion plus a penalty term on
the squared ℓ2 norm of the coefficient vector
▶ shrinks the coefficients towards zero
▶ introduces some bias
▶ but can greatly reduce the variance
▶ overall, it results in a better mean-squared error
▶ the amount of shrinkage is controlled by λ
▶ performs particularly well when there is a subset of true
coefficients that are small or even zero
▶ not as great when all of the true coefficients are moderately large
(can still outperform OLS over a pretty narrow range of (small) λ
values)
▶ does NOT set coefficients to zero exactly, and therefore cannot
perform variable selection in the linear model
LASSO
Recall OLS estimates β0, β1, . . . , βp such that it minimizes
$$RSS = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$$
LASSO sets some of the coefficients β1, . . . , βp to zero. Given a response vector $y \in \mathbb{R}^n$ and a predictor matrix $X \in \mathbb{R}^{n \times p}$,
$$\hat{\beta}^{(\mathrm{lasso})} = \arg\min_{\beta \in \mathbb{R}^p} \underbrace{\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2}_{RSS} + \underbrace{\lambda \sum_{j=1}^{p} |\beta_j|}_{\text{Penalty}}$$
$$= \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
$$= \arg\min_{\beta \in \mathbb{R}^p} \underbrace{||y - X\beta||_2^2}_{\text{Loss}} + \underbrace{\lambda\,||\beta||_1}_{\text{Penalty}}$$
$$\arg\min_{\beta \in \mathbb{R}^p} \underbrace{||y - X\beta||_2^2}_{\text{Loss}} + \underbrace{\lambda\,||\beta||_1}_{\text{Penalty}}$$

• The tuning parameter λ controls the strength of the penalty, and (like
ridge regression), we get
▶ β̂ (lasso) = the usual OLS estimator, whenever λ = 0
▶ β̂ (lasso) = 0, whenever λ = ∞
For λ ∈ (0, ∞), we are balancing the trade-offs:
▶ fitting a linear model of y on X
▶ shrinking the coefficients; but the nature of the l1 penalty causes
some coefficients to be shrunken to zero exactly
LASSO (vs. Ridge):
▶ LASSO performs variable selection in the linear model
▶ has no closed-form solution (various optimization techniques are
employed)
▶ as λ increases, more coefficients are set to zero (fewer variables are selected), and among the nonzero coefficients, more shrinkage is applied
Ridge: coefficient paths [figure]

LASSO: coefficient paths [figure]
Fitting LASSO models in R with the glmnet package
▶ Lasso and Elastic-Net Regularized Generalized Linear Models
▶ fits a wide variety of models (linear models, generalized linear
models, multinomial models) with LASSO penalties
▶ the syntax is fairly straightforward, though it differs from lm in that
it requires you to form your own design matrix:
fit = glmnet(X, y)
▶ the package also allows you to conveniently carry out
cross-validation:
cvfit = cv.glmnet(X, y); plot(cvfit);
▶ prediction with cross validation. Example:
X = matrix(rnorm(100*20), 100, 20)
y = rnorm(100)
cv.fit = cv.glmnet(X, y)
yhat = predict(cv.fit, newx=X[1:5,])
coef(cv.fit)
coef(cv.fit, s = "lambda.min")
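To reproduce coefficient-path plots like the ones on the two previous slides (a sketch reusing the X and y from the example above):

lasso.fit = glmnet(X, y)                          # alpha = 1 (LASSO) is the default
plot(lasso.fit, xvar = "lambda", label = TRUE)    # one curve per coefficient
ridge.fit = glmnet(X, y, alpha = 0)
plot(ridge.fit, xvar = "lambda", label = TRUE)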
Elastic net - the best of both worlds
Elastic Net combines the penalties of Ridge and LASSO.

$$\hat{\beta}^{(\mathrm{elastic\ net})} = \arg\min_{\beta \in \mathbb{R}^p} \underbrace{||y - X\beta||_2^2}_{\text{Loss}} + \underbrace{\lambda_1 ||\beta||_1}_{\text{Penalty}} + \underbrace{\lambda_2 ||\beta||_2^2}_{\text{Penalty}}$$

Addresses several shortcomings of LASSO:


▶ for n < p (more covariates/features than samples) LASSO can
select only n covariates (even if more are truly associated with
the response)
▶ it tends to select only one covariate from any set of highly
correlated covariates
▶ for n > p, if the covariates are strongly correlated, Ridge tends to
perform better
Elastic Net:
▶ highly correlated covariates will tend to have similar regression
coefficients (desirable grouping effect)
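In glmnet, an elastic-net fit is obtained by setting alpha strictly between 0 and 1 (a sketch; note that glmnet parameterizes the penalty with a single lambda and a mixing weight alpha rather than separate λ1 and λ2):

enet.cv = cv.glmnet(X, y, alpha = 0.5)   # equal mix of l1 and squared l2 penalties
coef(enet.cv, s = "lambda.min")          # coefficients at the CV-selected lambda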
Simpson’s paradox - beware!
A phenomenon in statistics in which trends that appear when a dataset is separated into groups are reversed when the data are aggregated.

▶ can be resolved when confounding variables and causal relations


are appropriately addressed in the statistical modeling
▶ an example of the misleading results that the misuse of statistics can generate
Source: Wiki
