
Assignment 6: Linear Model Selection

SDS293 - Machine Learning

Due: 1 November 2017 by 11:59pm

Conceptual Exercises
6.8.2 (p. 259 ISLR)
For each of the following, indicate whether the method is more or less flexible than least
squares. Describe how each method’s trade-off between bias and variance impacts its prediction
accuracy. Justify your answers.

(a) The lasso

Solution: The lasso places a budget constraint on the least squares coefficients and is therefore
less flexible than least squares. The lasso will have improved prediction accuracy when its
increase in bias is less than its decrease in variance.
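
For reference, the budget-constraint form of the lasso (cf. ISLR Section 6.2.2) makes the "budget" explicit:

$$\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j|\le s$$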

(b) Ridge regression

Solution: For the same reason as above, this method is also less flexible. Ridge regression
will have improved prediction accuracy when its increase in bias is less than its decrease in
variance.
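
The analogous constrained form for ridge regression simply swaps the ℓ1 budget for an ℓ2 budget:

$$\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2\le s$$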

(c) Non-linear methods (PCR and PLS)

Solution: Non-linear methods are more flexible and will give improved prediction accuracy
when their increase in variance is less than their decrease in bias.
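
All three answers rest on the bias-variance decomposition of the expected test MSE (cf. ISLR Section 2.2.2):

$$E\big[(y_0-\hat f(x_0))^2\big]=\operatorname{Var}\big(\hat f(x_0)\big)+\big[\operatorname{Bias}\big(\hat f(x_0)\big)\big]^2+\operatorname{Var}(\varepsilon)$$

A restriction on flexibility pays off only when the reduction in variance outweighs the increase in squared bias, and the reverse holds for more flexible methods.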

6.8.5 (p. 261)


Ridge regression tends to give similar coefficient values to correlated variables, whereas the lasso
may give quite different coefficient values to correlated variables. We will now explore this property
in a very simple setting.

Suppose that n = 2, p = 2, x11 = x12, and x21 = x22. Furthermore, suppose that y1 + y2 = 0,
x11 + x21 = 0, and x12 + x22 = 0, so that the estimate for the intercept in a least squares, ridge
regression, or lasso model is zero: β̂0 = 0.

(a) Write out the ridge regression optimization problem in this setting.

Solution: In general, the ridge regression optimization problem is:

min [ Σ_{i=1}^{n} (yi − β̂0 − Σ_{j=1}^{p} β̂j xij)² + λ Σ_{j=1}^{p} β̂j² ]

In this case, β̂0 = 0 and n = p = 2, so the optimization simplifies to:

min [ (y1 − β̂1x11 − β̂2x12)² + (y2 − β̂1x21 − β̂2x22)² + λ(β̂1² + β̂2²) ]

(b) Argue that in this setting, the ridge coefficient estimates satisfy β̂1 = β̂2 .

Solution: We know the following: x11 = x12 , so we’ll call that x1 , and x21 = x22 , so we’ll
call that x2 . Plugging this into the above, we get:
min [ (y1 − β̂1x1 − β̂2x1)² + (y2 − β̂1x2 − β̂2x2)² + λ(β̂1² + β̂2²) ]

Taking the partial derivatives of the above with respect to β̂1 and β̂2 and setting them equal
to 0 gives the point at which the function is minimized. Doing this, we find:

β̂1(x1² + x2² + λ) + β̂2(x1² + x2²) − y1x1 − y2x2 = 0

and

β̂1(x1² + x2²) + β̂2(x1² + x2² + λ) − y1x1 − y2x2 = 0

Since the right-hand sides of both equations are identical, we can set the two left-hand sides
equal to one another:

β̂1(x1² + x2² + λ) + β̂2(x1² + x2²) − y1x1 − y2x2 = β̂1(x1² + x2²) + β̂2(x1² + x2² + λ) − y1x1 − y2x2

and then cancel the common terms step by step:

β̂1(x1² + x2²) + β̂1λ + β̂2(x1² + x2²) = β̂1(x1² + x2²) + β̂2(x1² + x2²) + β̂2λ
β̂1λ + β̂2(x1² + x2²) = β̂2(x1² + x2²) + β̂2λ
β̂1λ = β̂2λ
Thus, β̂1 = β̂2 (for λ > 0).
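
As a quick numerical check, the ridge solution can be computed directly from the normal equations (XᵀX + λI)β̂ = Xᵀy for a toy data set satisfying the conditions above; the specific numbers and λ below are assumptions chosen for illustration, and both coefficients come out equal:

# Toy data with x11 = x12, x21 = x22, y1 + y2 = 0, x11 + x21 = 0
X = matrix(c(1, -1,    # column 1: x11, x21
             1, -1),   # column 2: x12, x22
           nrow = 2)
y = c(2, -2)
lambda = 1

# Ridge estimate (intercept is zero here): solve (X'X + lambda*I) beta = X'y
beta_ridge = solve(t(X) %*% X + lambda * diag(2), t(X) %*% y)
beta_ridge  # both entries are identical, as argued above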

(c) Write out the lasso optimization problem in this setting.

Solution:
min [ (y1 − β̂1x11 − β̂2x12)² + (y2 − β̂1x21 − β̂2x22)² + λ(|β̂1| + |β̂2|) ]

(d) Argue that in this setting, the lasso coefficients β̂1 and β̂2 are not unique – in other words,
there are many possible solutions to the optimization problem in (c). Describe these solutions.

Solution: One way to demonstrate that these solutions are not unique is to make a geometric
argument. To make things easier, we’ll use the alternate form of the lasso constraint that we
saw in class, namely: |β̂1| + |β̂2| ≤ s. Plotted in the (β̂1, β̂2) plane, this constraint takes the
familiar shape of a diamond centered at the origin (0, 0).

Next we’ll consider the squared-error part of the objective (the RSS), namely:

(y1 − β̂1x11 − β̂2x12)² + (y2 − β̂1x21 − β̂2x22)²

Using the facts we were given (x12 = x11, x22 = x21, y2 = −y1, and x21 = −x11), the RSS
simplifies to:

min [ 2(y1 − (β̂1 + β̂2)x11)² ]

This is minimized whenever β̂1 + β̂2 = y1/x11, which defines a line parallel to one edge of the
lasso diamond, β̂1 + β̂2 = s.

The contours of the RSS are therefore lines of constant β̂1 + β̂2, parallel to that edge. When the
budget s binds, the lowest contour that reaches the constraint region meets it along the entire
edge β̂1 + β̂2 = s, so every point on that edge solves the lasso optimization problem!

A similar argument holds for the opposite edge of the diamond, defined by β̂1 + β̂2 = −s.

Thus, the Lasso coefficients are not unique. The general form of solution can be given by
two line segments:

β̂1 + β̂2 = s; β̂1 ≥ 0; β̂2 ≥ 0 and β̂1 + β̂2 = −s; β̂1 ≤ 0; β̂2 ≤ 0
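
As a quick numerical sanity check, the lasso objective can be evaluated along one edge of the diamond for a toy data set satisfying the conditions above; the numbers, the value of λ, and the budget s below are assumptions chosen for illustration. The objective is constant along the whole edge:

# Toy data with x11 = x12, x21 = x22, y1 + y2 = 0, x11 + x21 = 0
x11 = 1; x21 = -1
y1  = 2; y2  = -2
lambda = 1

# Lasso objective in this setting (intercept is zero)
lasso_obj = function(b1, b2) {
  (y1 - b1 * x11 - b2 * x11)^2 +
    (y2 - b1 * x21 - b2 * x21)^2 +
    lambda * (abs(b1) + abs(b2))
}

# Evaluate along the edge b1 + b2 = s with b1, b2 >= 0
s = 1.75
b1_vals = seq(0, s, length.out = 5)
sapply(b1_vals, function(b1) lasso_obj(b1, s - b1))
# Every point on the edge gives the same objective value,
# so no single (b1, b2) on the edge is preferred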

Applied Exercises
6.8.9 (p. 263 ISLR)
In this exercise, we will predict the number of applications received using the other variables in the
College data set. For consistency, please use set.seed(11) before beginning.

(a) Split the data set into a training set and a test set.

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report
the test error obtained.

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error
obtained, along with the number of non-zero coefficient estimates.

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test
error obtained, along with the value of M selected by cross-validation.

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test
error obtained, along with the value of M selected by cross-validation.

(g) Comment on the results you obtained. How accurately can we predict the number of college
applications received? Is there much difference among the test errors resulting from these five
approaches?

A6 Applied Solutions
6.8.9 (a)

library(ISLR)
library(dplyr)

Check to make sure we don’t have any null values

sum(is.na(College))

## [1] 0

Split the data set into a training set and a test set.

set.seed(1)
train = College %>%
  sample_frac(0.5)

test = College %>%
  setdiff(train)

6.8.9 (b)

Fit a linear model using least squares on the training set, and report the test error obtained.

lm_fit = lm(Apps~., data = train)

lm_pred = predict(lm_fit, test)
mean((test[, "Apps"] - lm_pred)^2)

## [1] 1108531

6.8.9 (c)

Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error
obtained.

library(glmnet)
# Build model matrices for
# test and training data
train_mat = model.matrix(Apps~., data = train)
test_mat = model.matrix(Apps~., data = test)

# Find best lambda using cross-validation;
# alpha = 0 --> use ridge regression
grid = 10 ^ seq(4, -2, length=100)
mod_ridge = cv.glmnet(train_mat, train[, "Apps"], alpha = 0, lambda = grid, thresh = 1e-12)

lambda_best_ridge = mod_ridge$lambda.min

# Predict on test data, report error
ridge_pred = predict(mod_ridge, newx = test_mat, s = lambda_best_ridge)
mean((test[, "Apps"] - ridge_pred)^2)

## [1] 1108512

6.8.9 (d)

Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along
with the number of non-zero coefficient estimates.

# Find best lambda using cross-validation;
# alpha = 1 --> use lasso
mod_lasso = cv.glmnet(train_mat, train[, "Apps"], alpha = 1, lambda = grid, thresh = 1e-12)
lambda_best_lasso = mod_lasso$lambda.min

# Predict on test data, report error
lasso_pred = predict(mod_lasso, newx = test_mat, s = lambda_best_lasso)
mean((test[, "Apps"] - lasso_pred)^2)

## [1] 1028718

predict(mod_lasso, newx = test_mat, s = lambda_best_lasso, type="coefficients")

## 19 x 1 sparse Matrix of class "dgCMatrix"
##                            1
## (Intercept) -4.248125e+02
## (Intercept) .
## PrivateYes -4.955003e+02
## Accept 1.540306e+00
## Enroll -3.900157e-01
## Top10perc 4.779689e+01
## Top25perc -7.926581e+00
## F.Undergrad -9.846932e-03
## P.Undergrad .
## Outstate -5.231286e-02
## Room.Board 1.880308e-01
## Books 1.265938e-03
## Personal .
## PhD -4.137294e+00
## Terminal -3.184316e+00
## S.F.Ratio .
## perc.alumni -2.181304e+00
## Expend 3.193679e-02
## Grad.Rate 2.877667e+00
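
Parts (e) and (f) are not shown above. A minimal sketch using the pls package, assuming the same train/test split, might look like the following; the number of components M would be read off the cross-validation output, so no test errors are reported here:

library(pls)

# 6.8.9 (e): PCR with M chosen by cross-validation
pcr_fit = pcr(Apps~., data = train, scale = TRUE, validation = "CV")
validationplot(pcr_fit, val.type = "MSEP")        # pick M at the lowest CV error
# pcr_pred = predict(pcr_fit, test, ncomp = M)    # M = value chosen above (placeholder)
# mean((test[, "Apps"] - pcr_pred)^2)

# 6.8.9 (f): PLS with M chosen by cross-validation
pls_fit = plsr(Apps~., data = train, scale = TRUE, validation = "CV")
validationplot(pls_fit, val.type = "MSEP")
# pls_pred = predict(pls_fit, test, ncomp = M)
# mean((test[, "Apps"] - pls_pred)^2)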

6.8.9 (g)

Results for OLS, ridge, and lasso are comparable. The lasso shrinks the P.Undergrad, Personal, and S.F.Ratio
coefficients to zero and shrinks the coefficients of the other variables. Below are the test R² values for all three models.

test_avg = mean(test[, "Apps"])

lm_test_r2 = 1 - mean((test[, "Apps"] - lm_pred)^2) /mean((test[, "Apps"] - test_avg)^2)

ridge_test_r2 = 1 - mean((test[, "Apps"] - ridge_pred)^2) /mean((test[, "Apps"] - test_avg)^2)

lasso_test_r2 = 1 - mean((test[, "Apps"] - lasso_pred)^2) /mean((test[, "Apps"] - test_avg)^2)

barplot(c(lm_test_r2,
ridge_test_r2,
lasso_test_r2),
ylim=c(0,1),
names.arg = c("OLS", "Ridge", "Lasso"),
main = "Test R-squared")
abline(h = 0.9, col = "red")

[Figure: barplot of test R-squared for OLS, Ridge, and Lasso; all three bars lie above the 0.9 reference line.]


Since the test R² values for all three models are above 0.90, they all predict the number of college applications
received with high accuracy.
