Lecture 8a: Regularization
The idea of regularization revolves around modifying the loss function $\mathcal{L}$; in particular, we add a regularization term that penalizes some specified properties of the model parameters:

$$\mathcal{L}_{\text{reg}}(\boldsymbol{\beta}) = \mathcal{L}(\boldsymbol{\beta}) + \lambda R(\boldsymbol{\beta}),$$

where $\lambda$ is a scalar that gives the weight (or importance) of the regularization term. Fitting the model using the modified loss function $\mathcal{L}_{\text{reg}}$ results in model parameters with desirable properties (specified by $R$).
Alternatively, we can choose a regularization term that penalizes the squares of the parameter magnitudes. Then our regularized (ridge) loss function is:

$$\mathcal{L}_{\text{ridge}}(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \boldsymbol{\beta}^{\top}\boldsymbol{x}_i\right|^2 + \lambda \sum_{j=1}^{J} \beta_j^2$$
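As a concrete illustration, here is a minimal NumPy sketch of these regularized losses. The function names are our own, and for simplicity the penalty is applied to every coefficient (in practice the intercept is usually left unpenalized):

```python
import numpy as np

def mse(X, y, beta):
    """Mean squared error of the linear model y ≈ X @ beta."""
    residuals = y - X @ beta
    return np.mean(residuals ** 2)

def ridge_loss(X, y, beta, lam):
    """MSE plus the L2 (ridge) penalty: lam * sum(beta_j^2)."""
    return mse(X, y, beta) + lam * np.sum(beta ** 2)

def lasso_loss(X, y, beta, lam):
    """MSE plus the L1 (LASSO) penalty: lam * sum(|beta_j|)."""
    return mse(X, y, beta) + lam * np.sum(np.abs(beta))
```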
In both ridge and LASSO regression, we see that the larger our choice of the regularization parameter $\lambda$, the more heavily we penalize large values in $\boldsymbol{\beta}$:
• If $\lambda$ is close to zero, we recover the MSE, i.e., ridge and LASSO regression are just ordinary regression.
• If $\lambda$ is sufficiently large, the MSE term in the regularized loss function will be insignificant and the regularization term will force $\hat{\boldsymbol{\beta}}_{\text{ridge}}$ and $\hat{\boldsymbol{\beta}}_{\text{LASSO}}$ to be close to zero.
To avoid ad-hoc choices, we should select $\lambda$ using cross-validation.
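A minimal sketch of this selection, assuming scikit-learn is available; the synthetic data and the candidate grid of $\lambda$ values are purely illustrative (scikit-learn calls the regularization parameter `alpha`):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical data; X, y would come from your problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)

lambdas = np.logspace(-4, 2, 20)   # candidate regularization strengths
cv_mse = []
for lam in lambdas:
    model = Ridge(alpha=lam)       # sklearn's alpha plays the role of lambda
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())  # average validation MSE across folds

best_lam = lambdas[int(np.argmin(cv_mse))]
print(f"lambda chosen by 5-fold CV: {best_lam:.4g}")
```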
For reference, the LASSO regularized loss penalizes the absolute values of the parameters,

$$\mathcal{L}_{\text{LASSO}}(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \boldsymbol{\beta}^{\top}\boldsymbol{x}_i\right|^2 + \lambda \sum_{j=1}^{J} |\beta_j|,$$

and the penalty can equivalently be viewed as the constraint $\sum_{j=1}^{J} |\hat{\beta}_j| = C$ on the parameters.

[Figure: contours of the MSE in the $(\beta_1, \beta_2)$ plane together with the constraint region of size $C$; the regularized estimate lies where an MSE contour meets the constraint.]
The Geometry of Regularization (LASSO)
[Figure: contours of the MSE in the $(\beta_1, \beta_2)$ plane together with the diamond-shaped LASSO constraint region $\sum_j |\hat{\beta}_j| \le C$; the contours often first touch the constraint at a corner on one of the axes, where a coefficient is exactly zero.]
Ridge estimator: the ridge estimator is where the constraint and the loss intersect. The values of the coefficients decrease as $\lambda$ increases, but they are not nullified.

Lasso estimator: the Lasso estimator tends to zero out parameters, as the OLS loss can easily intersect the constraint on one of the axes. The values of the coefficients decrease as $\lambda$ increases, and are nullified quickly.
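The following sketch (our own illustration on synthetic data, not from the slides) makes the contrast concrete: as $\lambda$ grows, ridge coefficients shrink smoothly toward zero while LASSO coefficients become exactly zero:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Illustrative data: only two of five predictors truly matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.5, size=200)

for lam in [0.01, 1.0, 100.0]:
    ridge_coef = Ridge(alpha=lam).fit(X, y).coef_
    lasso_coef = Lasso(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:g}")
    print("  ridge:", np.round(ridge_coef, 3))  # shrinks, never exactly zero
    print("  lasso:", np.round(lasso_coef, 3))  # zeros out coefficients
```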
Ridge regularization with only validation: step by step
1. Split the data into train, validation, and test sets.
2. For each candidate value of the regularization parameter $\lambda$:
   A. Determine the $\hat{\boldsymbol{\beta}}_{\lambda}$ that minimizes the regularized loss $\mathcal{L}_{\text{reg}}$ using the train data. This is done using a solver.
   B. Record the loss (MSE) on the validation data.
3. Select the $\lambda^*$ that minimizes the loss on the validation data.
4. Refit the model using both train and validation data, {train ∪ validation}, resulting in $\hat{\boldsymbol{\beta}}_{\lambda^*}$.
5. Report the MSE or $R^2$ on the test data given $\hat{\boldsymbol{\beta}}_{\lambda^*}$.
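A minimal sketch of this procedure, assuming scikit-learn; the dataset, split proportions, and $\lambda$ grid are illustrative stand-ins:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the lecture's dataset.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=300)

# 1. Split the data into train, validation, and test sets.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# 2. For each candidate lambda, fit on train (the solver minimizes L_reg)
#    and record the validation MSE.
lambdas = np.logspace(-3, 2, 10)
val_mse = [mean_squared_error(y_val,
                              Ridge(alpha=lam).fit(X_train, y_train).predict(X_val))
           for lam in lambdas]

# 3. Select the lambda that minimizes the validation loss.
best_lam = lambdas[int(np.argmin(val_mse))]

# 4. Refit on train + validation with the chosen lambda.
final = Ridge(alpha=best_lam).fit(np.vstack([X_train, X_val]),
                                  np.concatenate([y_train, y_val]))

# 5. Report MSE (or R^2) on the test set.
print("test MSE:", mean_squared_error(y_test, final.predict(X_test)))
print("test R^2:", final.score(X_test, y_test))
```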
Since LASSO regression tends to produce zero estimates for a number of the model parameters (we say that LASSO solutions are sparse), we consider LASSO to be a method for variable selection.
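For instance, a short sketch (synthetic data; the setup is our own, not from the lecture) showing how LASSO performs variable selection by zeroing out coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative setup: 10 predictors, only 3 with nonzero true effect.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
true_beta = np.array([4.0, 0, 0, -3.0, 0, 0, 0, 1.5, 0, 0])
y = X @ true_beta + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # indices of nonzero coefficients
print("selected predictors:", selected)  # the sparse solution keeps only a few
```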
Many prefer using LASSO for variable selection (as well as for suppressing extreme parameter values) rather than stepwise selection, as LASSO avoids the statistical problems that arise in stepwise selection.
Question: What are the pros and cons of the two approaches?