q6-5 Solution (Ridge and Lasso)
Conceptual Exercises
6.8.2 (p. 259 ISLR)
For each of the following, indicate whether the method is more or less flexible than least
squares. Describe how each method's trade-off between bias and variance impacts its prediction
accuracy. Justify your answers.
(a) The lasso, relative to least squares.
Solution: The lasso puts a budget constraint on the least squares coefficients. It is therefore
less flexible than least squares. The lasso will have improved prediction accuracy when its
increase in bias is less than its decrease in variance.
(b) Ridge regression, relative to least squares.
Solution: For the same reason as above, this method is also less flexible. Ridge regression
will have improved prediction accuracy when its increase in bias is less than its decrease in
variance.
(c) Non-linear methods, relative to least squares.
Solution: Non-linear methods are more flexible and will give improved prediction accuracy
when their increase in variance is less than their decrease in bias.
Suppose that n = 2, p = 2, x11 = x12, x21 = x22. Furthermore, suppose that y1 + y2 = 0 and
x11 + x21 = 0 and x12 + x22 = 0, so that the estimate for the intercept in a least squares, ridge
regression, or lasso model is zero: β̂0 = 0.
(a) Write out the ridge regression optimization problem in this setting.
(b) Argue that in this setting, the ridge coefficient estimates satisfy β̂1 = β̂2 .
Solution: We know the following: x11 = x12, so we'll call that x1, and x21 = x22, so we'll
call that x2. Substituting these into the ridge regression objective (the RSS plus λ times the
sum of the squared coefficients, with β̂0 = 0), we get:
\[
\min_{\hat\beta_1,\hat\beta_2}\ \left[ (y_1 - \hat\beta_1 x_1 - \hat\beta_2 x_1)^2 + (y_2 - \hat\beta_1 x_2 - \hat\beta_2 x_2)^2 + \lambda(\hat\beta_1^2 + \hat\beta_2^2) \right]
\]
Taking the partial derivatives of the above with respect to β̂1 and β̂2 and setting them equal
to 0 will give us the point at which the function is minimized. Doing this, we find:
\[
\hat\beta_1(x_1^2 + x_2^2 + \lambda) + \hat\beta_2(x_1^2 + x_2^2) - y_1 x_1 - y_2 x_2 = 0
\]
and
\[
\hat\beta_1(x_1^2 + x_2^2) + \hat\beta_2(x_1^2 + x_2^2 + \lambda) - y_1 x_1 - y_2 x_2 = 0
\]
Since the right-hand side of both equations is identical, we can set the two left-hand sides
equal to one another:
\[
\hat\beta_1(x_1^2 + x_2^2 + \lambda) + \hat\beta_2(x_1^2 + x_2^2) - y_1 x_1 - y_2 x_2 = \hat\beta_1(x_1^2 + x_2^2) + \hat\beta_2(x_1^2 + x_2^2 + \lambda) - y_1 x_1 - y_2 x_2
\]
\[
\hat\beta_1(x_1^2 + x_2^2) + \hat\beta_1\lambda + \hat\beta_2(x_1^2 + x_2^2) = \hat\beta_1(x_1^2 + x_2^2) + \hat\beta_2(x_1^2 + x_2^2) + \hat\beta_2\lambda
\]
\[
\hat\beta_1\lambda + \hat\beta_2(x_1^2 + x_2^2) = \hat\beta_2(x_1^2 + x_2^2) + \hat\beta_2\lambda
\]
\[
\hat\beta_1\lambda = \hat\beta_2\lambda
\]
Thus, β̂1 = β̂2 .
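As a quick numerical check of this result, the two stationarity equations can be solved directly
in R with hypothetical values (x1 = 1, x2 = −1, y1 = 3, y2 = −3, λ = 0.5, chosen to satisfy the
conditions of the problem; none of these numbers come from the exercise itself):
# Stationarity equations in matrix form: A %*% beta = b
x1 = 1; x2 = -1; y1 = 3; y2 = -3; lambda = 0.5
S = x1^2 + x2^2
A = matrix(c(S + lambda, S,
             S, S + lambda), nrow = 2, byrow = TRUE)
b = rep(y1 * x1 + y2 * x2, 2)
solve(A, b)  # both ridge coefficients come out equal (here, 4/3)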
(c) Write out the lasso optimization problem in this setting.
Solution:
\[
\min_{\hat\beta_1,\hat\beta_2}\ \left[ (y_1 - \hat\beta_1 x_{11} - \hat\beta_2 x_{12})^2 + (y_2 - \hat\beta_1 x_{21} - \hat\beta_2 x_{22})^2 + \lambda(|\hat\beta_1| + |\hat\beta_2|) \right]
\]
(d) Argue that in this setting, the lasso coefficients β̂1 and β̂2 are not unique – in other words,
there are many possible solutions to the optimization problem in (c). Describe these solutions.
Solution: One way to demonstrate that these solutions are not unique is to make a geometric
argument. To make things easier, we'll use the alternate form of the lasso constraint that we
saw in class, namely: |β̂1| + |β̂2| ≤ s. If we were to plot this constraint, it takes the
familiar shape of a diamond centered at the origin (0, 0). Subject to this constraint, the lasso
minimizes the least squares objective
\[
(y_1 - \hat\beta_1 x_{11} - \hat\beta_2 x_{12})^2 + (y_2 - \hat\beta_1 x_{21} - \hat\beta_2 x_{22})^2
\]
Using the facts we were given regarding the equivalence of many of the variables, we can
simplify down to the following optimization:
\[
\min_{\hat\beta_1,\hat\beta_2}\ \left[\, 2\left( y_1 - (\hat\beta_1 + \hat\beta_2)\,x_{11} \right)^2 \right]
\]
This optimization problem is minimized whenever β̂1 + β̂2 = y1/x11, which defines a line parallel
to one edge of the lasso diamond, β̂1 + β̂2 = s.
The contours of the objective are themselves lines of the form β̂1 + β̂2 = c, parallel to that
edge, so the lowest contour that meets the constraint region touches the lasso diamond along
the entire edge β̂1 + β̂2 = s rather than at a single point. As a result, every point on the edge
β̂1 + β̂2 = s is a solution to the lasso optimization problem!
A similar argument holds for the opposite edge of the lasso diamond, defined by β̂1 + β̂2 = −s.
Thus, the lasso coefficients are not unique. The general form of the solution is given by
two line segments:
\[
\hat\beta_1 + \hat\beta_2 = s,\ \hat\beta_1 \ge 0,\ \hat\beta_2 \ge 0 \qquad\text{and}\qquad \hat\beta_1 + \hat\beta_2 = -s,\ \hat\beta_1 \le 0,\ \hat\beta_2 \le 0
\]
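To make the non-uniqueness concrete, here is a small R check using hypothetical values
x11 = x12 = 1, x21 = x22 = −1, y1 = 3, y2 = −3, and λ = 1 (my own numbers, chosen to satisfy the
problem's conditions). In the penalized form of the lasso, the optimum here occurs along
β̂1 + β̂2 = y1/x11 − λ/(4x11²) = 2.75, and every point on that segment with β̂1, β̂2 ≥ 0 attains
the same objective value:
# Penalized lasso objective under the hypothetical values above
lasso_obj = function(b1, b2, lambda = 1) {
  (3 - b1 - b2)^2 + (-3 + b1 + b2)^2 + lambda * (abs(b1) + abs(b2))
}
# Sweep along the segment b1 + b2 = 2.75 with b1, b2 >= 0:
# every point returns the same minimal value, 2.875
sapply(seq(0, 2.75, by = 0.55), function(b1) lasso_obj(b1, 2.75 - b1))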
Applied Exercises
6.8.9 (p. 263 ISLR)
In this exercise, we will predict the number of applications received using the other variables in the
College data set. For consistency, please use set.seed(11) before beginning.
(a) Split the data set into a training set and a test set.
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report
the test error obtained.
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error
obtained, along with the number of non-zero coefficient estimates.
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test
error obtained, along with the value of M selected by cross-validation.
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test
error obtained, along with the value of M selected by cross-validation.
(g) Comment on the results you obtained. How accurately can we predict the number of college
applications received? Is there much difference among the test errors resulting from these five
approaches?
A6 Applied Solutions
6.8.9 (a)
library(ISLR)
library(dplyr)
sum(is.na(College))
## [1] 0
Split the data set into a training set and a test set.
set.seed(1)
# Split the rows 50/50: half for training, the remaining rows for testing
train = College %>%
  sample_frac(0.5)
test = College %>%
  setdiff(train)
6.8.9 (b)
Fit a linear model using least squares on the training set, and report the test error obtained.
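A minimal sketch of this step using the usual lm workflow (the names lm_fit and lm_pred are my
own, not the original code):
# Least squares fit on the training set
lm_fit = lm(Apps ~ ., data = train)
# Test MSE
lm_pred = predict(lm_fit, test)
mean((test$Apps - lm_pred)^2)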
## [1] 1108531
6.8.9 (c)
Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error
obtained.
library(glmnet)
# Build model matrices for
# test and training data
train_mat = model.matrix(Apps ~ ., data = train)
test_mat = model.matrix(Apps ~ ., data = test)
# Cross-validate ridge regression (alpha = 0) to choose lambda
mod_ridge = cv.glmnet(train_mat, train$Apps, alpha = 0)
lambda_best_ridge = mod_ridge$lambda.min
# Test MSE at the selected lambda
ridge_pred = predict(mod_ridge, s = lambda_best_ridge, newx = test_mat)
mean((test$Apps - ridge_pred)^2)
## [1] 1108512
6.8.9 (d)
Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along
with the number of non-zero coefficient estimates.
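A sketch of the corresponding lasso fit (alpha = 1 in glmnet, reusing the model matrices above;
mod_lasso and lasso_pred are assumed names):
# Cross-validate the lasso (alpha = 1) to choose lambda
mod_lasso = cv.glmnet(train_mat, train$Apps, alpha = 1)
lambda_best_lasso = mod_lasso$lambda.min
# Test MSE at the selected lambda
lasso_pred = predict(mod_lasso, s = lambda_best_lasso, newx = test_mat)
mean((test$Apps - lasso_pred)^2)
# Non-zero coefficient estimates
predict(mod_lasso, s = lambda_best_lasso, type = "coefficients")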
## [1] 1028718
6.8.9 (g)
Results for OLS, lasso, and ridge are comparable. The lasso shrinks the P.Undergrad, Personal, and S.F.Ratio
coefficients to zero and shrinks the coefficients of the other variables. Below are the test R² values for all
three models.
2
test_avg = mean(test[, "Apps"])
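# The R-squared values plotted below come from each model's test-set
# predictions; this is a sketch assuming the lm_pred, ridge_pred, and
# lasso_pred objects from the fits above (assumed names)
lm_test_r2 = 1 - sum((test$Apps - lm_pred)^2) / sum((test$Apps - test_avg)^2)
ridge_test_r2 = 1 - sum((test$Apps - ridge_pred)^2) / sum((test$Apps - test_avg)^2)
lasso_test_r2 = 1 - sum((test$Apps - lasso_pred)^2) / sum((test$Apps - test_avg)^2)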
barplot(c(lm_test_r2,
ridge_test_r2,
lasso_test_r2),
ylim=c(0,1),
names.arg = c("OLS", "Ridge", "Lasso"),
main = "Test R-squared")
abline(h = 0.9, col = "red")
[Figure: "Test R-squared" barplot comparing the OLS, Ridge, and Lasso test R² values, with a horizontal reference line at 0.9.]