Objectives
Outline
1 Motivation for Regularization
2 Ridge Regression
3 LASSO
4 Comparison
5 Wrap-Up
Challenges
1 Interpretability
  - OLS cannot distinguish variables with little or no influence
  - These variables distract from the relevant regressors
2 Overfitting
  - OLS works well when the number of observations n is much larger than the number of predictors p, i.e. n ≫ p
  - If n ≈ p, overfitting results in low accuracy on unseen observations
  - If n < p, the variance of the estimates is infinite and OLS fails (see the sketch after this list)
  - As a remedy, one can identify only the relevant variables by feature selection
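A tiny simulation (not part of the original slides; all names are illustrative) shows this failure mode: with more predictors than observations, lm() cannot estimate all coefficients and returns NA for the surplus ones.

set.seed(1)
n <- 20; p <- 50
X <- matrix(rnorm(n * p), nrow=n)   # more predictors than observations
y <- rnorm(n)

fit <- lm(y ~ X)            # rank-deficient OLS fit
sum(is.na(coef(fit)))       # number of coefficients lm() could not estimate (> 0)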
Motivation for Regularization
Fitting techniques as alternatives to OLS
- Subset selection
  - Pick only a subset of all p variables which is assumed to be relevant
  - Estimate the model with least squares using this reduced set of variables
- Dimension reduction
  - Project the p predictors onto a d-dimensional subspace with d < p
  - These d features are then used to fit a linear model by least squares
- Shrinkage methods, also known as regularization
  - Fit a model with all p variables
  - However, some coefficients are shrunken towards zero
  - This has the effect of reducing the variance
Ridge regression
- Imposes a penalty on the size of the coefficients to reduce the variance of the estimates

$$\beta^{\text{ridge}} = \min_{\beta} \; \underbrace{\sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2}_{\text{RSS}} + \underbrace{\lambda \sum_{j=1}^{p} \beta_j^2}_{\text{shrinkage penalty}}$$
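Ridge regression can be fit with the same glmnet interface that the LASSO examples below use; a minimal sketch, assuming x.train is a numeric predictor matrix and y.train the corresponding response (as in the later slides):

library(glmnet)   # provides glmnet() and cv.glmnet()

# alpha = 0 selects the ridge penalty (alpha = 1 would give the LASSO)
lm.ridge <- glmnet(x.train, y.train, alpha=0)

# coefficient paths across the lambda sequence, plotted against log(lambda)
plot(lm.ridge, xvar="lambda")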
[Figure: ridge coefficient paths (e.g. Years, Walks) against log(Lambda); cross-validated mean squared error against log(Lambda)]
- The mean squared error first remains fairly constant and then rises sharply
Ridge regression
- In contrast to OLS, whose estimates are scale equivariant, the ridge coefficients can change substantially when scaling a variable x_j because of the penalty term
- It is therefore best to use the following approach
  1 Scale the variables via

$$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}$$
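In practice this standardization rarely has to be done by hand: glmnet() standardizes the predictors internally by default (argument standardize=TRUE) and reports the coefficients on the original scale. A minimal sketch of the manual scaling, assuming x.train is a numeric matrix:

# divide each column by its population standard deviation, as in the formula above
x.scaled <- scale(x.train,
                  center = FALSE,
                  scale  = apply(x.train, 2, function(x) sqrt(mean((x - mean(x))^2))))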
[Figure: mean squared error components as a function of λ]
- Squared bias (in black), variance (blue), and error on the test set (red)
- The dashed line marks the minimum possible mean squared error
Pros and Cons
Advantages
- Ridge regression can reduce the variance (at the price of an increased bias)
  → works best in situations where the OLS estimates have high variance
- Can improve predictive performance
- Works even in situations where p > n
- Mathematically simple computations (a closed-form solution exists)
Disadvantages
- Ridge regression is not able to shrink coefficients to exactly zero
- As a result, it cannot perform variable selection
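The closed-form solution mentioned above can be written down directly; a minimal sketch, assuming X is the standardized predictor matrix without an intercept column and y the centered response (the intercept is typically left unpenalized):

# ridge estimate: beta = (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
}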
LASSO
Least Absolute Shrinkage and Selection Operator (LASSO)
- Ridge regression always includes all p variables, whereas the LASSO performs variable selection
- The LASSO only changes the shrinkage penalty

$$\beta^{\text{LASSO}} = \min_{\beta} \; \underbrace{\sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2}_{\text{RSS}} + \underbrace{\lambda \sum_{j=1}^{p} |\beta_j|}_{\text{shrinkage penalty}}$$
LASSO in R
- Implemented in glmnet(x, y, alpha=1) as part of the glmnet package

library(glmnet)
lm.lasso <- glmnet(x.train, y.train, alpha=1)
[Figure: LASSO coefficient paths against log(Lambda)]
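The coefficient path figure can presumably be reproduced with the plot method for glmnet objects:

# coefficient paths of the LASSO fit against log(lambda);
# label=TRUE annotates each path with the variable index
plot(lm.lasso, xvar="lambda", label=TRUE)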
Parameter Tuning
- cv.glmnet(x, y, alpha=1) determines the optimal λ via cross-validation by minimizing the cross-validated mean squared error

set.seed(0)
cv.lasso <- cv.glmnet(x.train, y.train, alpha=1)

- Optimal λ and the corresponding coefficients ("." marks a removed variable)
cv.lasso$lambda.min
## [1] 2.143503
head(coef(cv.lasso, s="lambda.min"))
## 6 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 189.7212235
## AtBat -1.9921887
## Hits 6.6124279
## HmRun 0.6674432
## Runs .
## RBI .
Parameter Tuning
- Total number of variables (including the intercept)
nrow(coef(cv.lasso))
## [1] 20
- Omitted variables
dimnames(coef(cv.lasso, s="lambda.min"))[[1]][which(
coef(cv.lasso, s="lambda.min") == 0)]
## [1] "Runs" "RBI" "CAtBat" "CHits"
- Included variables
dimnames(coef(cv.lasso, s="lambda.min"))[[1]][which(
coef(cv.lasso, s="lambda.min") != 0)]
## [1] "(Intercept)" "AtBat" "Hits" "HmRun"
## [6] "Years" "CHmRun" "CRuns" "CRBI"
## [11] "LeagueN" "DivisionW" "PutOuts" "Assists"
## [16] "NewLeagueN"
Parameter Tuning
- plot(cv.model) compares the mean squared error across λ
plot(cv.lasso)
[Figure: cross-validated mean squared error against log(Lambda)]
- The mean squared error first remains fairly constant and then rises sharply
- The top axis denotes the number of variables included in the model
LASSO in R
- predict(model, newx=x, s=lambda) makes predictions for new data x and a specific λ

pred.lasso <- predict(cv.lasso, newx=x.test, s="lambda.min")

- Mean absolute percentage error (MAPE) of the LASSO
mean(abs((y.test - pred.lasso)/y.test))
## [1] 0.6328225
- For comparison, the error of ridge regression
## [1] 0.6811053
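The ridge benchmark itself is not shown on the slides; a minimal sketch of how it could be computed, assuming the same training/test split and the same cv.glmnet workflow as for the LASSO:

set.seed(0)
cv.ridge <- cv.glmnet(x.train, y.train, alpha=0)          # alpha = 0: ridge penalty
pred.ridge <- predict(cv.ridge, newx=x.test, s="lambda.min")
mean(abs((y.test - pred.ridge)/y.test))                   # MAPE of ridge regression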
Problem Formulation
Both ridge regression and the LASSO can equivalently be rewritten as constrained optimization problems:

$$\beta^{\text{ridge}} = \min_{\beta} \; \underbrace{\sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2}_{\text{RSS}} \quad \text{s.t.} \quad \sum_{j=1}^{p} \beta_j^2 \leq \theta$$

$$\beta^{\text{LASSO}} = \min_{\beta} \; \underbrace{\sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2}_{\text{RSS}} \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \leq \theta$$
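The link between these constrained forms and the penalized criteria used on the earlier slides can be made explicit with a Lagrangian argument (a sketch, not part of the original slides). For the LASSO, the constrained problem has the Lagrangian

$$\mathcal{L}(\beta, \lambda) = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 + \lambda \Bigl( \sum_{j=1}^{p} |\beta_j| - \theta \Bigr),$$

so for every budget θ there is a λ ≥ 0 for which minimizing the penalized criterion yields the same coefficients as the constrained problem; the same argument applies to ridge regression with the squared penalty.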
Variable Selection with LASSO
Comparison of the previous constraints

[Figure: two panels (ridge regression, LASSO) in the (β1, β2) plane showing the OLS estimate β_OLS together with the constrained solutions β_ridge and β_LASSO]
mean(abs((y.test - pred.ols)/y.test))
## [1] 0.6352089

- Comparison of the out-of-sample MAPE

  OLS     Ridge regression     LASSO
  0.64    0.68                 0.63
Elastic Net
- Elastic net generalizes the ideas of both LASSO and ridge regression
- Combination of both penalties

$$\beta^{\text{ElasticNet}} = \min_{\beta} \; \text{RSS} + \lambda \Bigl( \underbrace{(1 - \alpha)\, \|\beta\|_2^2 / 2}_{L_2\text{-penalty}} + \underbrace{\alpha\, \|\beta\|_1}_{L_1\text{-penalty}} \Bigr)$$
Elastic Net in R
Example
- Test the elastic net with a sequence of values for α
- Report the cross-validated mean squared error on the training data

set.seed(0)
alpha <- seq(from=0, to=1, by=0.1)
# fit a cross-validated elastic net for each value of alpha
en <- lapply(alpha, function(a) cv.glmnet(x.train, y.train, alpha=a))
# extract the cross-validated MSE at the optimal lambda of each fit
en.mse <- unlist(lapply(en, function(i) i$cvm[which(i$lambda == i$lambda.min)]))
plot(alpha, en.mse, ylab="Mean Squared Error", pch=16)
[Figure: cross-validated mean squared error against alpha]
Elastic Net in R
Example (continued)
- Report the out-of-sample mean absolute percentage error (MAPE); a code sketch follows the figure
[Figure: out-of-sample MAPE against alpha]
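The code behind this figure is not shown; a minimal sketch under the same assumptions as before (x.test and y.test hold the test data, en is the list of cv.glmnet fits from the previous slide):

# out-of-sample MAPE for each cross-validated elastic net fit
en.mape <- unlist(lapply(en, function(fit) {
  pred <- predict(fit, newx=x.test, s="lambda.min")
  mean(abs((y.test - pred)/y.test))
}))
plot(alpha, en.mape, ylab="MAPE", pch=16)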
Summary
Regularization methods
- Regularization methods bring advantages beyond OLS
- Cross-validation chooses the tuning parameter λ
- The LASSO performs variable selection
- Neither ridge regression nor the LASSO dominates the other
- Cross-validation finds the best approach for a given dataset
Outlook
- In practice, λ is often adjusted by a rule of thumb to obtain better results
- Recent research has developed several variants and improvements
- Spike-and-slab regression can be a viable alternative for inference
Further Readings
Package glmnet
- glmnet tutorial: http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
- glmnet webinar: http://web.stanford.edu/~hastie/TALKS/glmnet_webinar.pdf
  → see Hastie's website for data and scripts
Background on methods
- Talk on elastic net: http://web.stanford.edu/~hastie/TALKS/enet_talk.pdf
- Section 6.2 in the book "An Introduction to Statistical Learning"
Applications
- Especially healthcare analytics, but also sports
  → e.g. Groll, Schauberger & Tutz (2015): Prediction of major international soccer tournaments based on team-specific regularized Poisson regression: An application to the FIFA World Cup 2014. In: Journal of Quantitative Analysis in Sports, 10:2, pp. 97–115.