
Regularization Methods

Business Analytics Practice


Winter Term 2015/16
Stefan Feuerriegel
Today’s Lecture

Objectives

1 Avoiding overfitting and improving model interpretability with the help of regularization methods
2 Understanding both ridge regression and the LASSO
3 Applying these methods for variable selection

Outline

1 Motivation for Regularization

2 Ridge Regression

3 LASSO

4 Comparison

5 Wrap-Up



Motivation for Regularization
- Linear models are frequently favorable due to their interpretability and
  often good predictive performance
- Yet, ordinary least squares (OLS) estimation faces challenges

Challenges
1 Interpretability
  - OLS cannot distinguish variables with little or no influence
  - These variables distract from the relevant regressors
2 Overfitting
  - OLS works well when the number of observations n is much bigger than the
    number of predictors p, i.e. n >> p
  - If n ≈ p, overfitting results in low accuracy on unseen observations
  - If n < p, the variance of the estimates is infinite and OLS fails (see the sketch below)
  - As a remedy, one can identify only relevant variables by feature selection
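
A minimal illustration with simulated data (not from the lecture): once p exceeds n, lm() cannot estimate all coefficients and reports NA for the surplus ones.

# Hedged sketch: OLS breaks down when there are more predictors than observations
set.seed(1)
n <- 20; p <- 50
X <- matrix(rnorm(n * p), nrow=n)
y <- rnorm(n)
fit <- lm(y ~ X)
sum(is.na(coef(fit))) # number of coefficients lm() could not estimate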
Motivation for Regularization
Fitting techniques as alternatives to OLS
- Subset selection
  - Pick only a subset of all p variables that is assumed to be relevant
  - Estimate the model with least squares using this reduced set of variables
- Dimension reduction
  - Project the p predictors into a d-dimensional subspace with d < p
  - These d features are used to fit a linear model by least squares
- Shrinkage methods, also named regularization
  - Fit the model with all p variables
  - However, some coefficients are shrunk towards zero
  - This has the effect of reducing variance



Regularization Methods
- Fit linear models with least squares but impose constraints on the
  coefficients
- Instead of hard constraints, alternative formulations add a penalty term to the OLS objective
- Best known are ridge regression and the LASSO (least absolute shrinkage
  and selection operator)
- Ridge regression can shrink parameters close to zero
- LASSO models can shrink some parameters exactly to zero
  → This performs implicit variable selection





Ridge Regression
OLS estimation
- Recall the OLS technique to estimate β = [β_0, β_1, ..., β_p]^T
- It minimizes the residual sum of squares (RSS)

  \hat{\beta}^{OLS} = \arg\min_{\beta} \text{RSS} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2

Ridge regression
- Imposes a penalty on the size of the coefficients to reduce the
  variance of the estimates

  \hat{\beta}^{ridge} = \arg\min_{\beta} \underbrace{\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2}_{\text{RSS}} + \underbrace{\lambda \sum_{j=1}^{p} \beta_j^2}_{\text{shrinkage penalty}}
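
For intuition, a minimal sketch that is not part of the original slides: for a fixed λ, the ridge estimate has a closed-form solution. The sketch assumes standardized predictors and a centered response (so the intercept remains unpenalized); note that glmnet minimizes RSS/(2n) plus the penalty, so its λ values are on a different scale.

# Hedged sketch: closed-form ridge estimate for a single lambda
ridge_coef <- function(X, y, lambda) {
  Xs <- scale(X)     # standardize the predictors
  yc <- y - mean(y)  # center the response; the intercept is then mean(y)
  solve(t(Xs) %*% Xs + lambda * diag(ncol(Xs)), t(Xs) %*% yc)
}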



Tuning Parameter
- The tuning parameter λ > 0 controls the relative impact of the penalty
- The penalty λ \sum_{j=1}^{p} \beta_j^2 has the effect of shrinking the β_j towards zero
- If λ ≈ 0, the penalty term has no effect (similar to OLS)
- The choice of λ is critical → it is determined separately via cross-validation



Ridge Regression in R
- Predicting salaries of U.S. baseball players based on game statistics
- Loading the data Hitters
library(ISLR) # Hitters is located inside ISLR
data(Hitters)
Hitters <- na.omit(Hitters) # salary can be missing
- Loading the package glmnet, which implements ridge regression
library(glmnet)
- The main function glmnet(x, y, alpha=0) requires the dependent
  variable y and the regressors x
- The function only processes numerical input, so categorical
  variables need to be transformed via model.matrix(...)



Ridge Regression in R
- Prepare the variables
set.seed(0)

# drop the 1st column with the intercept (glmnet already adds one)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary
train_idx <- sample(nrow(x), size=0.9*nrow(x))

x.train <- x[train_idx, ]
x.test <- x[-train_idx, ]
y.train <- y[train_idx]
y.test <- y[-train_idx]
- Call ridge regression; glmnet automatically tests a sequence of λ values
lm.ridge <- glmnet(x.train, y.train, alpha=0)



Ridge Regression in R
- coef(...) retrieves the coefficients belonging to each λ
dim(coef(lm.ridge))
## [1] 20 100
→ here: 100 models with different λ, each with 20 coefficients
- For example, the 50th model is as follows
lm.ridge$lambda[50] # tested lambda value
## [1] 2581.857
head(coef(lm.ridge)[,50]) # estimated coefficients
## (Intercept) AtBat Hits HmRun Runs
## 211.76123020 0.08903326 0.37913073 1.21041548 0.64115228
## RBI
## 0.59834311



Ridge Regression in R
- plot(model, xvar="lambda") investigates the influence of λ
  on the estimated coefficients for all variables
plot(lm.ridge, xvar="lambda")

[Figure: ridge coefficient paths vs. log(Lambda); the top axis shows 19 non-zero coefficients throughout]

- The bottom axis gives ln λ, the top axis the number of non-zero coefficients


Plot: Lambda vs. Coefficients

- Manual effort is necessary for a pretty format (a possible approach is sketched below)
- As λ increases, the coefficients shrink towards zero
- All other variables are shown in gray

[Figure: coefficients vs. λ (0 to 20000); labeled paths for Division, League, New League, Years, and Walks; all other variables in gray]
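
One possible way to build such a manually formatted plot is sketched below; this is not part of the original slides, and the highlighted variable names are assumed to match the dummy coding that model.matrix() produces for the Hitters data.

# Hedged sketch: custom coefficient-path plot with a few highlighted variables
beta <- as.matrix(coef(lm.ridge))[-1, ]  # drop the intercept row; rows = variables
lambdas <- lm.ridge$lambda
highlight <- c("DivisionW", "LeagueN", "NewLeagueN", "Years", "Walks")  # assumed names
palette <- c("red", "blue", "darkgreen", "purple", "orange")
cols <- rep("gray80", nrow(beta))
cols[match(highlight, rownames(beta))] <- palette
matplot(lambdas, t(beta), type="l", lty=1, col=cols,
        xlab="Lambda", ylab="Coefficients")
legend("bottomright", legend=highlight, lty=1, col=palette, bty="n")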



Parameter Tuning
- The optimal λ is determined via cross-validation by minimizing the mean
  squared prediction error
- Usage is cv.glmnet(x, y, alpha=0)
cv.ridge <- cv.glmnet(x.train, y.train, alpha=0)
- The optimal λ and the corresponding coefficients
cv.ridge$lambda.min
## [1] 29.68508
head(coef(cv.ridge, s="lambda.min"))
## 6 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 109.4192279
## AtBat -0.6764771
## Hits 2.5974777
## HmRun -0.7058689
## Runs 1.8565943
## RBI 0.3434801
Parameter Tuning
- plot(cv.model) compares the mean squared error across λ
plot(cv.ridge)

[Figure: cross-validated mean squared error vs. log(Lambda), shown with error bars; the top axis shows 19 non-zero coefficients throughout]

- The mean squared error first remains fairly constant and then rises sharply



Ridge Regression in R
- predict(model, newx=x, s=lambda) makes predictions for
  new data x and a specific λ
pred.ridge <- predict(cv.ridge, newx=x.test, s="lambda.min")
head(cbind(pred.ridge, y.test))
## 1 y.test
## -Alan Ashby 390.1766 475.000
## -Andre Dawson 1094.5741 500.000
## -Andre Thornton 798.5886 1100.000
## -Alan Trammell 893.8298 517.143
## -Barry Bonds 518.9105 100.000
## -Bob Dernier 353.4100 708.333
- Mean absolute percentage error (MAPE)
mean(abs((y.test - pred.ridge)/y.test))
## [1] 0.6811053
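
Written out, the error measure computed by this line is

  \text{MAPE} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left| \frac{y_i - \hat{y}_i}{y_i} \right|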



Scaling of Estimates
OLS estimation
- Recall: least squares estimates are scale equivariant
- Multiplying x_j by a constant c ⇒ scaling of β_j by a factor 1/c

Ridge regression
- In contrast, the coefficients in ridge regression can change substantially
  when scaling a variable x_j due to the penalty term
- It is best to use the following approach
  1 Scale the variables via

    \tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}

    which divides by the standard deviation of x_j
  2 Estimate the coefficients of the ridge regression
- glmnet scales accordingly, but returns coefficients on the original scale (see the sketch below)
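
A minimal sketch of this standardization (not from the slides); it mirrors what glmnet does internally before back-transforming the coefficients.

# Hedged sketch: divide each predictor by its population standard deviation
pop.sd <- apply(x.train, 2, function(col) sqrt(mean((col - mean(col))^2)))
x.tilde <- sweep(x.train, 2, pop.sd, "/")

# glmnet standardizes internally by default (standardize=TRUE) and returns the
# coefficients on the original scale
lm.ridge.std <- glmnet(x.tilde, y.train, alpha=0, standardize=FALSE)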
Bias-Variance Trade-Off
- Ridge regression benefits from the bias-variance trade-off
- As λ increases, the flexibility of the ridge regression fit decreases
  → This decreases the variance but increases the bias

[Figure: mean squared error components as a function of λ on a log scale from 10^-1 to 10^3]

- Squared bias (in black), variance (blue), and error on the test set (red)
- The dashed line is the minimum possible mean squared error
Pros and Cons
Advantages
- Ridge regression can reduce the variance (at the cost of an increased bias)
  → works best in situations where the OLS estimates have high variance
- Can improve predictive performance
- Works even in situations where p > n
- Mathematically simple computations

Disadvantages
- Ridge regression is not able to shrink coefficients to exactly zero
- As a result, it cannot perform variable selection

⇒ Alternative: the Least Absolute Shrinkage and Selection Operator (LASSO)



LASSO
Least Absolute Shrinkage and Selection Operator (LASSO)
- Ridge regression always includes all p variables, but the LASSO performs
  variable selection
- The LASSO only changes the shrinkage penalty

  \hat{\beta}^{LASSO} = \arg\min_{\beta} \underbrace{\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2}_{\text{RSS}} + \underbrace{\lambda \sum_{j=1}^{p} |\beta_j|}_{\text{shrinkage penalty}}

- Here, the LASSO uses the L_1-norm \|\beta\|_1 = \sum_{j} |\beta_j|
- This penalty allows coefficients to shrink towards exactly zero (see the sketch below)
- The LASSO usually results in sparse models, which are easier to interpret
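
To see where the exact zeros come from, here is a small sketch that is not part of the original slides: for an orthonormal design matrix and the penalty λ Σ |β_j|, the LASSO solution is the soft-thresholded OLS estimate.

# Hedged sketch: soft-thresholding, assuming an orthonormal design matrix
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda/2, 0)
}
soft_threshold(c(-3, -0.2, 0.1, 2.5), lambda=1)
## [1] -2.5  0.0  0.0  2.0   (coefficients below the threshold become exactly zero)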

LASSO in R
- Implemented in glmnet(x, y, alpha=1) as part of the
  glmnet package
lm.lasso <- glmnet(x.train, y.train, alpha=1)
  Note: the different value for alpha
- plot(...) shows how λ changes the estimated coefficients
plot(lm.lasso, xvar="lambda")

[Figure: LASSO coefficient paths vs. log(Lambda); the top axis (17, 17, 12, 6) shows the number of non-zero coefficients shrinking as λ grows]
Parameter Tuning
- cv.glmnet(x, y, alpha=1) determines the optimal λ via cross-validation
  by minimizing the mean squared prediction error
set.seed(0)
cv.lasso <- cv.glmnet(x.train, y.train, alpha=1)
- The optimal λ and the corresponding coefficients ("." marks a removed variable)
cv.lasso$lambda.min
## [1] 2.143503
head(coef(cv.lasso, s="lambda.min"))
## 6 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 189.7212235
## AtBat -1.9921887
## Hits 6.6124279
## HmRun 0.6674432
## Runs .
## RBI .
Parameter Tuning
- Total variables
nrow(coef(cv.lasso))
## [1] 20
- Omitted variables
dimnames(coef(cv.lasso, s="lambda.min"))[[1]][which(
coef(cv.lasso, s="lambda.min") == 0)]
## [1] "Runs" "RBI" "CAtBat" "CHits"
- Included variables
dimnames(coef(cv.lasso, s="lambda.min"))[[1]][which(
coef(cv.lasso, s="lambda.min") != 0)]
## [1] "(Intercept)" "AtBat" "Hits" "HmRun"
## [6] "Years" "CHmRun" "CRuns" "CRBI"
## [11] "LeagueN" "DivisionW" "PutOuts" "Assists"
## [16] "NewLeagueN"

Parameter Tuning
- plot(cv.model) compares the mean squared error across λ
plot(cv.lasso)

[Figure: cross-validated mean squared error vs. log(Lambda) for the LASSO, shown with error bars]

- The mean squared error first remains fairly constant and then rises sharply
- The top axis denotes the number of included model variables
LASSO in R
- predict(model, newx=x, s=lambda) makes predictions for
  new data x and a specific λ
pred.lasso <- predict(cv.lasso, newx=x.test, s="lambda.min")
- Mean absolute percentage error (MAPE) of the LASSO
mean(abs((y.test - pred.lasso)/y.test))
## [1] 0.6328225
- For comparison, the error of ridge regression
## [1] 0.6811053

Problem Formulation
Both ridge regression and the LASSO can be rewritten as constrained optimization problems:

  \hat{\beta}^{ridge} = \arg\min_{\beta} \underbrace{\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2}_{\text{RSS}} \quad \text{s.t.} \quad \sum_{j=1}^{p} \beta_j^2 \le \theta

  \hat{\beta}^{LASSO} = \arg\min_{\beta} \underbrace{\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2}_{\text{RSS}} \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le \theta

Outlook: both ridge regression and LASSO have Bayesian formulations

Variable Selection with LASSO
Comparison of the previous constraints (shown for p = 2)

  Ridge regression: \beta_1^2 + \beta_2^2 \le \theta          LASSO: |\beta_1| + |\beta_2| \le \theta

[Figure: RSS contours (red) around \hat{\beta}^{OLS} with the constraint regions (blue) in 2 dimensions: a disk for ridge regression and a diamond for the LASSO; the solutions \hat{\beta}^{ridge} and \hat{\beta}^{LASSO} lie where the contours first touch the constraint region]

- The objective function RSS is shown as contours in red
- The constraints (blue) are shown in 2 dimensions
- The intersection occurs at β_1 = 0 for the LASSO, which sets that coefficient exactly to zero
Case Study
Example
- Comparison to the OLS estimator

lm.ols <- lm(y.train ~ x.train)

# Workaround as predict.lm only accepts a data.frame
pred.ols <- predict(lm.ols, data.frame(x.train=I(x.test)))

mean(abs((y.test - pred.ols)/y.test))
## [1] 0.6352089

- Comparison of the MAPE values

  OLS     Ridge regression     LASSO
  0.64    0.68                 0.63

- Here: the LASSO can outperform OLS with fewer predictors

Elastic Net
- The elastic net generalizes the ideas of both the LASSO and ridge regression
- It combines both penalties

  \hat{\beta}^{ElasticNet} = \arg\min_{\beta} \; \text{RSS} + \lambda \Big[ \underbrace{(1 - \alpha)\, \|\beta\|_2^2 / 2}_{L_2\text{-penalty}} + \underbrace{\alpha\, \|\beta\|_1}_{L_1\text{-penalty}} \Big]

- The L_1-penalty helps to generate a sparse model
- The L_2-penalty overcomes a strict selection
- The parameter α controls the numerical stability
- α = 0.5 tends to handle correlated variables as groups
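
As a minimal usage sketch (not shown on the slide): glmnet fits the elastic net for a fixed mixing parameter α, and cv.glmnet then tunes only λ, so α itself must be chosen separately, as done on the next slide.

# Hedged sketch: elastic net with an equal mix of the L1 and L2 penalties
lm.en <- glmnet(x.train, y.train, alpha=0.5)
cv.en <- cv.glmnet(x.train, y.train, alpha=0.5)  # cross-validates lambda for this fixed alpha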

Elastic Net in R
Example
- Test the elastic net with a sequence of values for α
- Report the cross-validated mean squared error on the training data

set.seed(0)
alpha <- seq(from=0, to=1, by=0.1)
en <- lapply(alpha, function(a)
  cv.glmnet(x.train, y.train, alpha=a))
en.mse <- unlist(lapply(en, function(i)
  i$cvm[which(i$lambda == i$lambda.min)]))
plot(alpha, en.mse, ylab="Mean Squared Error", pch=16)

[Figure: cross-validated mean squared error plotted against α from 0 to 1]
Elastic Net in R
Example (continued)
- Report the out-of-sample mean absolute percentage error (MAPE)

en.mape <- unlist(lapply(en, function(i) {
  pred <- predict(i, newx=x.test, s="lambda.min")
  mean(abs((y.test - pred)/y.test))
}))
plot(alpha, en.mape, ylab="MAPE", pch=16)

[Figure: out-of-sample MAPE plotted against α from 0 to 1; y-axis from 0.64 to 0.68]
Summary
Regularization methods
- Regularization methods bring advantages beyond OLS
- Cross-validation chooses the tuning parameter λ
- The LASSO performs variable selection
- Neither ridge regression nor the LASSO dominates the other
- Cross-validation finds the best approach for a given dataset

Outlook
- In practice, λ is sometimes scaled by a rule of thumb to get better results
- Research has lately developed several variants and improvements
- Spike-and-slab regression can be a viable alternative for inference

Further Readings
Package glmnet
- glmnet tutorial: http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
- glmnet webinar: http://web.stanford.edu/~hastie/TALKS/glmnet_webinar.pdf
  → see Hastie's website for data and scripts

Background on methods
- Talk on the elastic net: http://web.stanford.edu/~hastie/TALKS/enet_talk.pdf
- Section 6.2 in the book "An Introduction to Statistical Learning"

Applications
- Especially healthcare analytics, but also sports
  → e.g. Groll, Schauberger & Tutz (2015): Prediction of major international soccer tournaments based on team-specific regularized Poisson regression: An application to the FIFA World Cup 2014. In: Journal of Quantitative Analysis in Sports, 10:2, pp. 97–115.
