Notes_Lecture 13_Regularization_LASSO and RIDGE Regression
Mihai Cucuringu
[email protected]
28 September, 2023
Overview
Ridge regression
LASSO
The Trade-Off Between Prediction Accuracy and Model Interpretability
▶ linear regression: fairly inflexible
▶ splines: considerably more flexible (can fit a much wider range of possible shapes to estimate f)
Inference:
▶ linear model: easy to understand the relationship between Y and X1, X2, ..., Xp
Very flexible approaches (splines, SVM, etc.):
▶ can lead to very complicated estimates of f
▶ hard to understand how any individual predictor is associated with the response (less interpretable)
Example: LASSO
▶ less flexible
▶ linear model + sparsity of [β0, β1, ..., βp]
▶ more interpretable; only a small subset of predictors matter
Flexibility vs. Interpretability

Figure: A representation of the trade-off between flexibility and interpretability, using different statistical learning methods (ISLR Figure 2.7). In general, as the flexibility of a method increases, its interpretability decreases. The methods shown range from subset selection and the lasso (low flexibility, high interpretability) through least squares to bagging and boosting (high flexibility, low interpretability).
R²
▶ also called the coefficient of determination
▶ pronounced "R squared"
▶ gives the proportion of the variance in the dependent variable that is predictable from the independent variable(s)
R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}}

where

\text{RSS} = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2, \qquad \text{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2
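A minimal R sketch (illustrative, not from the lecture) showing how R² is recovered from RSS and TSS for a fitted lm model:

set.seed(1)
x = rnorm(100)
y = 2 + 3 * x + rnorm(100)
fit = lm(y ~ x)
rss = sum(residuals(fit)^2)    # residual sum of squares
tss = sum((y - mean(y))^2)     # total sum of squares
(tss - rss) / tss              # matches summary(fit)$r.squared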
Variable selection
Which predictors are associated with the response? (in order to fit a
single model involving only those d predictors)
▶ Note: R² always increases as you add more variables to the model
▶ adjusted R² = 1 − [RSS/(n−p−1)] / [TSS/(n−1)] = 1 − (1 − R²)(n−1)/(n−p−1)
x^*: new data point, f: ground truth, h: our estimator

\text{MSE} = \text{Var}[h(x^*)] + \text{Bias}(h(x^*))^2 + \text{Var}[\epsilon]
▶ excluding the predictors that are not associated with the response from the fit leads to a model that is more easily interpreted
Shrinkage/Regularization:
▶ by setting the corresponding coefficient estimates to zero, we can obtain a model that is more easily interpreted
\hat{\beta}^{(\text{ridge})} = \arg\min_{\beta \in \mathbb{R}^p} \; \underbrace{\|y - X\beta\|_2^2}_{\text{Loss}} + \underbrace{\lambda \|\beta\|_2^2}_{\text{Penalty}}

\hat{\beta}^{(\text{ridge})} = (X^T X + \lambda I)^{-1} X^T y
▶ λ = ∞ leads to β̂ (ridge) = 0
Noise assumptions (for the linear model y = Xβ + ε):
▶ ε_i ∈ R
▶ E[ε_i] = 0
▶ Var[ε_i] = σ²
▶ Cov(ε_i, ε_j) = 0 for i ≠ j
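A quick numerical check of the closed-form ridge solution above (a minimal sketch with illustrative names, not part of the lecture code):

set.seed(1)
n = 50; p = 5; lambda = 2
X = matrix(rnorm(n * p), n, p)
y = X %*% rnorm(p) + rnorm(n)
# (X^T X + lambda I)^{-1} X^T y; in practice X and y are typically centered/standardized first
beta.ridge = solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)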
Experimental setup
▶ n = 50, p = 30, and σ² = 1
▶ The true model is linear with
▶ 10 large coefficients (between 0.5 and 1) and
▶ 20 small ones (between 0 and 0.3)
▶ Histogram of true coefficients
Source: R. Tibshirani
▶ b.ridge = coef(aa)   (extract the ridge coefficient estimates; aa denotes the fitted ridge model from the lecture code)
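A minimal sketch of this setup, assuming glmnet with alpha = 0 for the ridge fit (the lecture's own code may differ; all names below are illustrative):

library(glmnet)
set.seed(1)
n = 50; p = 30; sigma = 1
beta.true = c(runif(10, 0.5, 1), runif(20, 0, 0.3))   # 10 large and 20 small true coefficients
X = matrix(rnorm(n * p), n, p)
y = X %*% beta.true + sigma * rnorm(n)
hist(beta.true)                  # histogram of the true coefficients
aa = glmnet(X, y, alpha = 0)     # ridge fit over a grid of lambda values
b.ridge = coef(aa)               # (p+1) x n_lambda matrix: one column of coefficients per lambda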
LASSO

\hat{\beta}^{(\text{lasso})} = \arg\min_{\beta \in \mathbb{R}^p} \; \underbrace{\|y - X\beta\|_2^2}_{\text{Loss}} + \underbrace{\lambda \|\beta\|_1}_{\text{Penalty}}
The tuning parameter λ controls the strength of the penalty, and (as in ridge regression) we get:
▶ β̂ (lasso) = the usual OLS estimator, whenever λ = 0
▶ β̂ (lasso) = 0, whenever λ = ∞
For λ ∈ (0, ∞), we are balancing the trade-offs:
▶ fitting a linear model of y on X
▶ shrinking the coefficients; but the nature of the l1 penalty causes
some coefficients to be shrunken to zero exactly
LASSO (vs. Ridge):
▶ LASSO performs variable selection in the linear model
▶ has no closed-form solution; various optimization techniques are employed (a small coordinate-descent sketch follows this list)
▶ as λ increases, more coefficients are set to zero (fewer variables are selected), and among the nonzero coefficients, more shrinkage is applied
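To make the "no closed-form solution" point concrete, here is a minimal coordinate-descent sketch for the lasso objective above (purely illustrative; glmnet's implementation is far more refined), reusing the simulated X and y from the ridge experiment:

soft.threshold = function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

lasso.cd = function(X, y, lambda, n.iter = 100) {
  p = ncol(X)
  beta = rep(0, p)
  for (it in 1:n.iter) {
    for (j in 1:p) {
      r = y - X[, -j, drop = FALSE] %*% beta[-j]   # partial residual, leaving predictor j out
      # minimize ||r - X[,j] b||_2^2 + lambda |b| over b: a soft-thresholded least-squares coefficient
      beta[j] = soft.threshold(sum(X[, j] * r), lambda / 2) / sum(X[, j]^2)
    }
  }
  beta
}

b.lasso = lasso.cd(X, y, lambda = 10)   # many entries are exactly zero once lambda is large enough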
Ridge: coefficient paths
LASSO: coefficient paths
Fitting LASSO models in R with the glmnet package
▶ Lasso and Elastic-Net Regularized Generalized Linear Models
▶ fits a wide variety of models (linear models, generalized linear
models, multinomial models) with LASSO penalties
▶ the syntax is fairly straightforward, though it differs from lm in that
it requires you to form your own design matrix:
fit = glmnet(X, y)
▶ the package also allows you to conveniently carry out
cross-validation:
cvfit = cv.glmnet(X, y); plot(cvfit);
▶ prediction with cross-validation. Example:
library(glmnet)
X = matrix(rnorm(100*20), 100, 20)        # design matrix: 100 observations, 20 predictors
y = rnorm(100)
cv.fit = cv.glmnet(X, y)                  # 10-fold cross-validation over a grid of lambda values
yhat = predict(cv.fit, newx = X[1:5,])    # predictions at the default s = "lambda.1se"
coef(cv.fit)
coef(cv.fit, s = "lambda.min")            # coefficients at the lambda minimizing the CV error
Elastic net - the best of both worlds
Elastic Net combines the penalties of Ridge and LASSO.
\hat{\beta}^{(\text{elastic net})} = \arg\min_{\beta \in \mathbb{R}^p} \; \underbrace{\|y - X\beta\|_2^2}_{\text{Loss}} + \underbrace{\lambda_1 \|\beta\|_1}_{\text{Penalty}} + \underbrace{\lambda_2 \|\beta\|_2^2}_{\text{Penalty}}
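In glmnet, the same idea is exposed through a single λ and a mixing parameter alpha in [0, 1] (alpha = 1 gives the LASSO, alpha = 0 gives ridge); a minimal sketch, reusing X and y from the earlier example:

library(glmnet)
# glmnet's penalty: lambda * ( (1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1 )
fit.enet = glmnet(X, y, alpha = 0.5)       # an even mix of the l1 and l2 penalties
cv.enet = cv.glmnet(X, y, alpha = 0.5)     # cross-validate lambda at this fixed alpha
coef(cv.enet, s = "lambda.min")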