Ridge Regression and LASSO
Submitted by:
Student-ID:
Advisor:
Prof. Dr. Dirk Neumann
Contents
1 Introduction
2 Ordinary Least Squares
3 Ridge Regression
3.1 Bias-Variance Trade-Off
3.2 Mathematical Formulation
3.3 Significance of Lambda
3.4 Shrinkage
4 Lasso
4.1 Mathematical Formulation
4.2 Properties of Lambda
4.3 Shrinkage: Soft Thresholding
5 Comparison between Ridge and Lasso
6 Conclusion
A References
B List of Figures
C List of Tables
1 Introduction
Linear regression is a supervised learning model used for prediction and for explaining the
relationship between a dependent and an independent variable [5]. Ordinary Least Squares
(OLS) is the most commonly used method for fitting the linear regression model. Generally, the
main purpose of a model is to use some data in order to predict future observations. Prediction
accuracy and model complexity are two important concepts to be taken into account when
performing prediction or when explaining relationships between variables [1]. The OLS method
performs poorly with respect to both criteria. It suffers from several statistical and numerical
problems, such as multicollinearity, more predictors than observations, and high variance, which
lead to unstable coefficient estimates, low predictive power and results that are difficult to
interpret.
Regularization techniques such as lasso and ridge regression overcome some of the problems
of OLS. Ridge regression shrinks the coefficient estimates towards zero, reducing the variance at
the cost of a slightly increased bias and thereby improving the overall accuracy of the model.
However, when the number of predictors is high, ridge regression results
in models that are difficult to interpret. Another regularization technique called lasso improves
both accuracy and model interpretability by selecting which coefficients to shrink and by shrinking
some of them exactly to zero [7].
This paper provides an introduction to these two techniques. It is structured as follows. As
both techniques extend the basic OLS model, Section 2 starts with a brief theoretical
background of OLS and discusses some of its limitations. To overcome some of the limitations of
OLS, we introduce ridge regression in Section 3. We start by motivating ridge regression with the
bias-variance trade-off, then introduce its mathematical formulation and finally discuss how this
method performs shrinkage. As an improvement of the ridge technique, we present the lasso in
Section 4. As in the case of ridge, we cover its mathematical formulation and the shrinkage
method. After understanding the basic principles of both methods, we look at them comparatively
through a geometrical and Bayesian lens in Section 5. Finally, we conclude in Section 6.
2 Ordinary Least Squares

We consider a linear regression model of the form
\[
  y = \beta_0 + \sum_{j=1}^{N} \beta_j x_j + \varepsilon, \qquad (1)
\]
where ε is the error term and β_0 the intercept. We predict the value of y based on the value of
the x_j by computing
\[
  \hat{y} = \hat\beta_0 + \sum_{j=1}^{N} \hat\beta_j x_j, \qquad (2)
\]
where the hat symbol indicates the estimated coefficients. In order to fit this model, we typically
use a method called least squares or Ordinary Least Squares (OLS). The idea behind OLS is to
find the parameters β̂_j such that the regression line is as close as possible to the data points
(x_i, y_i) [8]. The OLS method estimates the coefficients β_0, β_1, …, β_N by minimizing a quantity
known as the Residual Sum of Squares (RSS), defined as
\[
  \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
               = \sum_{i=1}^{n} \Big[ y_i - \Big( \hat\beta_0 + \sum_{j=1}^{N} \hat\beta_j x_{ij} \Big) \Big]^2, \qquad (3)
\]
where y_i is the observed value and ŷ_i is the predicted value for the i-th observation.
Differentiating the equation above with
respect to β̂ j and solving for β̂ j yields the OLS coefficients, which we denote β̂OLS . Hence, the
OLS estimates are
\[
  \hat\beta_{OLS} = \operatorname*{argmin}_{\beta} \; \sum_{i=1}^{n} \Big[ y_i - \Big( \beta_0 + \sum_{j=1}^{N} \beta_j x_{ij} \Big) \Big]^2. \qquad (4)
\]
To gain more insight into how OLS works, we express the coefficient estimates in matrix notation.
In this case, the regression model is y = Xβ + u, where y is the output vector of n observations,
β is a (p + 1) × 1 coefficient vector and X is an n × (p + 1) matrix with 1's in the first column
and one input vector in each row. The RSS expression from Equation (3) then becomes
\[
  \mathrm{RSS} = (y - X\beta)^{T}(y - X\beta). \qquad (5)
\]
By differentiating this quantity and performing some calculations, we obtain the OLS coefficient
estimates in matrix form
\[
  \hat\beta_{OLS} = (X^{T}X)^{-1} X^{T} y, \qquad (6)
\]
which requires the matrix X^T X to be invertible. When X^T X is singular, for example because the
predictors are collinear or the model contains a large number of predictors, the coefficient
estimates become highly variable or cannot be computed at all [4]. Further reasons why OLS needs
to be modified are the accuracy and
the interpretability of the model. OLS estimates are often difficult to interpret when the model
contains many predictors. In this situation, OLS exhibits low bias but a large variance, which
negatively affects the prediction accuracy [9].
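As an illustration of Equation (6), the following minimal sketch computes the OLS estimates from simulated data via the closed form; the sample size, coefficients and noise level are hypothetical values chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n observations, p predictors (hypothetical example values).
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ beta_true + rng.normal(scale=0.5, size=n)

# Add a column of ones for the intercept, as in the matrix formulation y = X beta + u.
X1 = np.column_stack([np.ones(n), X])

# OLS closed form from Equation (6): beta_hat = (X^T X)^{-1} X^T y,
# computed by solving the normal equations rather than inverting explicitly.
beta_ols = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(beta_ols)  # approximately [1.0, 2.0, -1.0, 0.5]
```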
This raises the need for alternatives to the OLS technique: a method that selects only the
relevant variables to include in the model and reduces the variance in such a way that the model
is accurate and easy to understand. One class of such techniques is known as shrinkage or
regularization [7]. These methods shrink the coefficients towards zero, reduce the variance and
determine which coefficients are driven to zero. Two regularization
techniques are ridge regression and lasso [5].
3 Ridge Regression
Ridge regression solves some of the shortcomings of linear regression. Ridge regression is an
extension of the OLS method with an additional constraint. The OLS estimates are unconstrained
and may have a large magnitude, and therefore a large variance. In ridge regression, a penalty is
applied to the coefficients so that they are shrunk towards zero, which also reduces the variance
and hence the prediction error. Similar to the OLS approach, we choose
the ridge coefficients to minimize a penalized residual sum of squares (RSS). As opposed to OLS,
ridge regression provides biased estimators which have a low variance [4].
3.1 Bias-Variance Trade-Off

Bias and variance are also related to model complexity. The issue of model complexity (a high
number of predictors) is usually dealt with by dividing the data into a training set and a
validation (or test) set and by estimating the coefficients from the training set [1]. When a
model contains a large number of parameters, its complexity increases, which increases the
variance and decreases the bias. Conversely, when the model complexity decreases, the variance
decreases at the cost of increased bias. We choose the appropriate model complexity such that
bias is traded off against variance in a way that reduces the error on the test set [3]. The
over-fitting phenomenon is often a consequence of model complexity. Over-fitting means
that a model performs well on the training set but poorly on the test set. Therefore, the relation
between variance and bias strongly influences over-fitting and under-fitting. Figure 1 depicts how
the variance and the bias change as the model complexity is varied.
Figure 1: The Bias-Variance Trade-off Achieved with Ridge Regression: Influence on Prediction
Error and Model Complexity (adapted from [3]). [The figure plots the prediction error on the test
sample and on the training sample against model complexity.]
When the model complexity is high, we have a large error on the test set, and the predictions
exhibit a large variance. If the model complexity is low, under-fitting occurs, resulting in large
bias. Typically, the model with the best predictive capability achieves a balance between bias and
variance [1]. Ridge regression reaches such a trade-off between bias and variance. In the following
sections, we show that it produces biased estimates with a lower variance and that it works well
in situations with a large number of predictors. This addresses the problem of variability in
OLS, which exhibits low bias but high variance [3].
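To make the variance reduction concrete, the following sketch repeatedly simulates data with strongly correlated predictors and compares the spread of the OLS and ridge estimates of one coefficient; the simulation settings are hypothetical, and the ridge solution (X^T X + λI)^{-1} X^T y used here is derived in the next subsection.

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 50, 10.0
beta_true = np.array([1.0, 1.0])

ols_est, ridge_est = [], []
for _ in range(500):
    # Two highly correlated predictors (a hypothetical, nearly collinear design).
    x1 = rng.normal(size=n)
    x2 = x1 + 0.1 * rng.normal(size=n)
    X = np.column_stack([x1, x2])
    y = X @ beta_true + rng.normal(size=n)
    ols_est.append(np.linalg.solve(X.T @ X, X.T @ y)[0])
    ridge_est.append(np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)[0])

# The ridge estimates of beta_1 vary far less across simulations than the OLS
# estimates, at the price of being biased towards zero.
print(np.var(ols_est), np.mean(ols_est))
print(np.var(ridge_est), np.mean(ridge_est))
```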
3.2 Mathematical Formulation

The ridge coefficients minimize a penalized residual sum of squares,
\[
  \hat\beta_{ridge} = \operatorname*{argmin}_{\beta} \; \sum_{i=1}^{n} \Big[ y_i - \Big( \beta_0 + \sum_{j=1}^{N} \beta_j x_{ij} \Big) \Big]^2 + \lambda \sum_{j=1}^{N} \beta_j^2,
\]
where the term λ ∑_{j=1}^{N} β_j² is known as a "shrinkage penalty", λ is the tuning parameter which we
discuss in the following section, and ∑_{j=1}^{N} β_j² is the square of the norm of the vector β. This
norm is known as the l2 norm and is defined as ||β||₂ = √(∑_{j=1}^{N} β_j²). In other words, the ridge
coefficients β̂_ridge minimize a penalized RSS, and because the penalty is given by the l2 norm, we
call it an L2 penalty [9].
Before finding the parameters β̂ridge , we consider two important assumptions for ridge regression.
Firstly, the intercept is not penalized. Secondly, the predictors need to be standardized. In
contrast to OLS estimates, where multiplying a predictor by a constant simply scales the
corresponding coefficient inversely proportionally, the ridge coefficients can change drastically
when the predictors are rescaled. Hence, for each x_ij from the training data, we subtract its
mean and then divide by the
standard deviation [3]. Therefore, standardizing the inputs yields an intercept and predictors
given by
\[
  \hat\beta_0 = \bar{y} = \sum_{i=1}^{n} \frac{y_i}{n}, \qquad (10)
\]
\[
  x_{ij} \;\leftarrow\; x_{ij} \Big/ \sqrt{\frac{1}{n} \sum_{i=1}^{n} \big( x_{ij} - \bar{x}_j \big)^2 }. \qquad (11)
\]
In matrix notation, the ridge problem can then be written as
\[
  \hat\beta_{ridge} = \operatorname*{argmin}_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2. \qquad (12)
\]
Differentiating this expression with respect to β and setting the derivative to zero yields the
closed-form solution
\[
  \hat\beta_{ridge} = (X^{T}X + \lambda I)^{-1} X^{T} y, \qquad (13)
\]
where I is the identity matrix. This shows that, as opposed to OLS, ridge regression always
provides a unique solution, because the quantity (X^T X + λI) is always invertible even when the
matrix X^T X is singular. This argument was the starting point of ridge regression [4].
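The closed form in Equation (13) translates directly into code. The following minimal sketch, with hypothetical simulated data, also illustrates that the ridge solution exists even when X^T X is singular and the OLS solution does not.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimates via Equation (13): (X^T X + lambda I)^{-1} X^T y.

    Assumes X holds standardized predictors (no intercept column)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
n = 20
x = rng.normal(size=n)
# Duplicate column: X^T X is singular, so the OLS inverse does not exist,
# but the ridge system is still solvable for any lambda > 0.
X = np.column_stack([x, x])
y = 3.0 * x + rng.normal(size=n)

print(ridge_closed_form(X, y, lam=1.0))  # two roughly equal coefficients near 1.5
```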
Another equivalent formulation of ridge regression is obtained by minimizing a constrained
version of the RSS
\[
  \hat\beta_{ridge} = \operatorname*{argmin}_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \big( \beta_0 + \sum_{j=1}^{N} \beta_j x_{ij} \big) \Big)^2, \qquad (14)
\]
subject to
\[
  \sum_{j=1}^{N} \beta_j^2 \le t, \qquad (15)
\]
where t is a shrinkage factor, t > 0. Hence, we obtain the ridge coefficients by minimizing the
RSS subject to a constraint given by an L2 penalty [9].
The amount of shrinkage can also be expressed through the effective degrees of freedom of the
ridge fit,
\[
  df(\lambda) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda},
\]
where the d_j are the singular values of the matrix X (discussed in the following section), whose
squares are the eigenvalues of the matrix X^T X. This is a decreasing function of λ. When λ = 0,
meaning no penalization, df(λ) = p; when λ → ∞, then df(λ) → 0 as a consequence of the
parameters being heavily penalized (or constrained). Thus, the more shrinkage is applied, the
lower the degrees of freedom [3]. Degrees of freedom are important in calculating some model
selection criteria for estimating λ [2].
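As a small sketch (assuming X is centered and has full column rank p), the effective degrees of freedom can be computed directly from the singular values of X:

```python
import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom of the ridge fit: sum_j d_j^2 / (d_j^2 + lambda)."""
    d = np.linalg.svd(X, compute_uv=False)  # singular values of X
    return np.sum(d**2 / (d**2 + lam))

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
print(ridge_df(X, lam=0.0))   # equals p = 5 when there is no penalization
print(ridge_df(X, lam=1e6))   # approaches 0 as lambda grows
```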
3.4 Shrinkage
In this section, we investigate further the nature of the shrinkage performed by ridge regression.
Ridge regression shrinks all coefficients by a proportional amount controlled by the parameter λ,
and it keeps all of the variables in the model [5]. As mentioned previously, the ridge estimator
is a biased estimator of β, as opposed to the OLS estimator, which is unbiased. An interesting
case arises when we consider an orthonormal design matrix X (orthonormal inputs), that is, a
matrix whose columns, viewed as vectors, are orthogonal and of unit length. In this case, the
relationship between β̂_ridge and β̂_OLS becomes
\[
  \hat\beta_{ridge} = \frac{\hat\beta_{OLS}}{1 + \lambda}, \qquad (18)
\]
which shows that the ridge coefficients are derived from a scaled version of the OLS coefficients
[3]. This relationship further illustrates the main characteristic of ridge regression, which is
shrinkage. Ridge regression always shrinks the coefficients towards zero, reducing the variance
but introducing additional bias.
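Equation (18) is easy to verify numerically. In the sketch below, an orthonormal design is built with a QR decomposition (a hypothetical construction chosen only for the example) and the two estimators are compared.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 50, 3, 2.0

# Orthonormal design: Q has orthonormal columns, so Q^T Q = I.
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
y = Q @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=n)

beta_ols = Q.T @ y                                               # (Q^T Q)^{-1} Q^T y = Q^T y
beta_ridge = np.linalg.solve(Q.T @ Q + lam * np.eye(p), Q.T @ y)

print(np.allclose(beta_ridge, beta_ols / (1 + lam)))  # True, as in Equation (18)
```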
We now turn to the relationship between ridge regression and principal components anal-
ysis (PCA). PCA refers to a method of explaining the variance-covariance structure of linear
combinations of variables [9]. We use the Singular Value Decomposition (SVD) of the matrix X,
which is an (N × p) matrix, to gain more insight into how ridge regression performs shrinkage.
We express matrix X as
X = UDV T , (19)
where
• U is an (n × p) orthogonal matrix whose columns span the column space of X,
• V is a (p × p) orthogonal matrix whose columns span the row space of X,
• D is a diagonal matrix with dimension (p × p) and diagonal elements d_j, such that D = diag(d_j).
The values d₁ ≥ d₂ ≥ … ≥ d_p ≥ 0 are the singular values of the matrix X; X is a singular matrix
if one or more of the d_j are zero. By substituting the expression for X from Equation (19) into
Equation (13), and after some rearrangement, the ridge fit is given by
\[
  X\hat\beta_{ridge} = \sum_{j=1}^{p} u_j \, \frac{d_j^2}{d_j^2 + \lambda} \, u_j^{T} y, \qquad (20)
\]
where u j are the columns of matrix U. This expression displays the relation between ridge
regression and principal component analysis (PCA). SVD serves as a way to express the principal
components of the matrix X, and these are in fact the columns of matrix V. The main idea behind
PCA is that the first principal component is the direction with the largest variance of the data
and the last principal component is the direction with the smallest variance [3]. Ridge regression
computes the coordinates of y with respect to the orthonormal basis U and then shrinks them by
the factor d_j²/(d_j² + λ). Since λ ≥ 0, this factor is at most 1. Hence ridge regression shrinks
most strongly the directions that correspond to small singular values d_j, that is, the
low-variance principal component directions. More precisely, ridge regression shrinks the
low-variance directions the most, but it keeps all principal component directions in the model
rather than discarding any of them [9].
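The shrinkage factors d_j²/(d_j² + λ) from Equation (20) can be inspected directly. In this hypothetical sketch, the predictors are given very different variances so that the factors are close to one for the high-variance directions and noticeably smaller for the low-variance one.

```python
import numpy as np

rng = np.random.default_rng(5)
n, lam = 200, 5.0

# Predictors with very different variances, so the singular values d_j differ strongly.
X = rng.normal(size=(n, 3)) * np.array([10.0, 1.0, 0.1])
X = X - X.mean(axis=0)                   # center the inputs

d = np.linalg.svd(X, compute_uv=False)   # singular values d_1 >= d_2 >= d_3
shrinkage = d**2 / (d**2 + lam)
print(shrinkage)  # near 1 for the high-variance direction, much smaller for the low-variance one
```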
4 Lasso
Lasso, or "Least Absolute Shrinkage and Selection Operator", is another regularization method
with two additional features to ridge regression. Unlike ridge regression, it shrinks some
4 Lasso 8
coefficients exactly to zero. This property is known as sparsity. In addition, lasso shrinks some
specific coefficients. Lasso has the property of selecting variables from a large set, property known
as variable selection. Therefore, lasso performs regularization and variable selection [7].
4.1 Mathematical Formulation

The lasso coefficients minimize a penalized residual sum of squares of the form
\[
  \hat\beta_{lasso} = \operatorname*{argmin}_{\beta} \; \sum_{i=1}^{n} \Big[ y_i - \Big( \beta_0 + \sum_{j=1}^{N} \beta_j x_{ij} \Big) \Big]^2 + \lambda \sum_{j=1}^{N} \lvert \beta_j \rvert,
\]
where, as before, λ represents the shrinkage parameter. The term ∑_{j=1}^{N} |β_j| is called the
shrinkage penalty and is given by the l1 norm of the vector β, defined as ||β||₁ = ∑_{j=1}^{N} |β_j|;
therefore, we call it an L1 penalty. Thus, the main difference between ridge regression and lasso
is that lasso uses an L1 penalty, whereas ridge regression uses an L2 penalty. The effect of the
L1 penalty, in contrast to the L2 penalty, is that some coefficients are shrunk exactly to zero [9].
As in the case of ridge regression, the predictors are standardized and the intercept is left
out of the model, being estimated as β̂_0 = ȳ. We express the lasso estimate as the solution of a
constrained optimization problem,
\[
  \hat\beta_{lasso} = \operatorname*{argmin}_{\beta} \; \sum_{i=1}^{n} \Big[ y_i - \Big( \beta_0 + \sum_{j=1}^{N} \beta_j x_{ij} \Big) \Big]^2 \qquad (23)
\]
subject to
\[
  \sum_{j=1}^{N} \lvert \beta_j \rvert \le t, \qquad (24)
\]
where t > 0 is a shrinkage factor. There is a one-to-one correspondence between the parameters λ
and t; more precisely, t determines the amount of shrinkage that is applied to the parameters.
This is in fact a quadratic programming problem, and several algorithms are available for solving
it [7], for example the least angle regression (LAR) algorithm.
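In practice, one rarely solves the quadratic program by hand. As a hedged illustration, the sketch below uses scikit-learn's coordinate-descent implementation (assuming that library is available; its `alpha` argument plays the role of λ up to scikit-learn's 1/(2n) scaling of the RSS) on simulated data with a sparse true coefficient vector.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]      # only two predictors are truly relevant
y = X @ beta_true + rng.normal(size=n)

lasso = Lasso(alpha=0.5)         # alpha acts as the penalty parameter lambda (rescaled by 1/(2n))
lasso.fit(X, y)
print(lasso.coef_)               # most coefficients are exactly zero (sparsity)
```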
To gain more insight into the properties of the lasso estimates, we look at the matrix algebra
formulation. The lasso coefficients are the solution to an L1-penalized problem; this solution is
guaranteed to be unique when, for example, the matrix X^T X has full rank. As opposed to ridge
regression, the coefficients β̂_lasso have no closed form, because the L1 penalty involves absolute
values and is therefore not differentiable at zero. The solutions of the lasso problem are
nonlinear in the y_i because of the non-smooth nature of the constraint [9].
4.3 Shrinkage: Soft Thresholding

In the orthonormal design case, the lasso estimates have an explicit form,
\[
  \hat\beta_{lasso,j} = \operatorname{sign}\big( \hat\beta_{OLS,j} \big) \big( \lvert \hat\beta_{OLS,j} \rvert - \gamma \big)_{+}, \qquad (26)
\]
where γ is a constant determined from the equation ∑_{j=1}^{N} |β̂_{OLS,j}| = t [9] and (·)₊ denotes
the positive part in Equation (26). This type of estimator is known as a "soft-threshold"
estimator. Soft-thresholding means that coefficients smaller than γ in absolute value are shrunk
exactly to zero, while coefficients larger than γ are shrunk towards zero by the amount γ. This
shows that lasso performs a kind of continuous variable selection [9].
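A minimal sketch of the soft-threshold operator described above (a standalone function illustrating Equation (26), not a full lasso algorithm):

```python
import numpy as np

def soft_threshold(beta_ols, gamma):
    """Soft-thresholding: sign(b) * max(|b| - gamma, 0).

    Coefficients with |b| <= gamma are set exactly to zero; larger
    coefficients are shrunk towards zero by the amount gamma."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - gamma, 0.0)

beta_ols = np.array([2.5, -0.3, 0.8, -1.7])
print(soft_threshold(beta_ols, gamma=1.0))  # the two small coefficients become exactly zero
```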
5 Comparison between Ridge and Lasso

A first comparison of the two methods is geometrical. Figure 2 shows the elliptical contours of
the RSS: each ellipse is a set of coefficient values for which the RSS is constant. Both
regression methods find the point where the elliptical contour intersects the constraint region.
Ridge regression has a circular constraint region defined by β₁² + β₂² ≤ t, while lasso has a
diamond-shaped constraint region given by |β₁| + |β₂| ≤ t. In the case of lasso, if
the contour intersects the diamond at a corner, then one of the coefficients β_j is exactly equal
to 0. In contrast, for ridge regression, the contour will generally not meet the circular
constraint region exactly on an axis, which shows that the ridge coefficients will typically not
be exactly zero [5]. This illustrates in a graphical manner the sparsity property of the lasso.
Figure 2: Lasso and Ridge Geometrical Interpretation: Contours of the Errors, represented by
elliptical contours, and Constraint Functions for ridge (β₁² + β₂² ≤ t) and lasso (|β₁| + |β₂| ≤ t).
[Left panel: lasso estimate; right panel: ridge estimate.]
A second comparison is Bayesian. Consider the regression model
\[
  y_i = \sum_{j=1}^{N} X_{ij} \beta_j + \varepsilon_i, \qquad (27)
\]
where ε_i are the errors, which are independent and drawn from a normal distribution, y_i is the
dependent variable, X_ij are the independent variables and β_j the regression coefficients.
We derive the ridge and lasso estimates from the linear regression model, with an additional
assumption, that there is a prior distribution for β j [3]. For ridge regression, the prior distribution
of the β_j and the distribution of y given β are assumed to be Gaussian,
\[
  \beta_j \sim N(0, \tau^2), \qquad y_i \mid \beta \sim N\Big( \beta_0 + \sum_{j=1}^{N} X_{ij}\beta_j, \; \sigma^2 \Big),
\]
where σ² is the error variance and both σ and τ are assumed to be known. The relation between λ
and τ is λ = σ²/τ². If we multiply this prior distribution by the likelihood function, then the
resulting distribution
is called the "posterior distribution". We assume that βridge have a prior Gaussian distribution with
zero mean; in addition, λ is a function of the standard deviation. Then, the ridge coefficients are
the mean and the mode of the posterior distribution since for the Gaussian distribution the mode
and mean coincide [1].
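As a brief sketch of why this holds, assuming the Gaussian likelihood and prior stated above, the negative log-posterior is, up to an additive constant,
\[
  -\log p(\beta \mid y) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{N} X_{ij}\beta_j \Big)^2 + \frac{1}{2\tau^2} \sum_{j=1}^{N} \beta_j^2 + \text{const}.
\]
Multiplying by 2σ² does not change the minimizer and yields exactly the penalized RSS of ridge regression with λ = σ²/τ², which is the relation stated above.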
For the lasso case, we consider a double exponential (or Laplacian) prior distribution which has
the form
\[
  p(\beta_j) = \frac{1}{2\tau} \exp\big( -\lvert \beta_j \rvert / \tau \big), \qquad (30)
\]
where τ = 1/λ, or, equivalently, p(β_j) = (λ/2) exp(−λ|β_j|) [6]. Hence, the lasso penalty is
proportional to the negative log-density of the double exponential distribution. We derive the lasso
estimate as the posterior mode with an independent double-exponential prior. However, this is
not the posterior mean as in the ridge regression case [7]. The shape of the Laplacian prior
explains the sparsity property of the lasso: the Laplacian density has a sharp peak at zero, which
favors coefficients that are shrunk exactly to zero [3], as opposed to the bell-shaped Gaussian
distribution.
We conclude this section by a generalized form of both ridge and lasso estimates in the Bayesian
interpretation, considering the criterion
\[
  \tilde\beta = \operatorname*{argmin}_{\beta} \Bigg\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert^{q} \Bigg\}. \qquad (31)
\]
The term |β_j|^q reflects the prior distribution of the parameters, and the quantity ∑_{j=1}^{p} |β_j|^q
defines the contours of this prior, also referred to as the L_q norm. The case q = 1 corresponds
to the lasso, with its Laplacian prior, and q = 2 corresponds to ridge regression [5]. Figure 3
displays the contours of the shrinkage (or regularization) term.
Figure 3: Contours of the Penalization Term for Lasso (left) and Ridge (right) (adapted
from [5]).
5.3 Discussion
In this section, we briefly present the advantages and disadvantages of both methods. A compara-
tive overview of the two methods can be observed in Table 1.
Both ridge and lasso shrink the OLS estimates by a certain amount, by penalizing the Residual
Sum of Squares (RSS). Lasso measures the shrinkage by ∑_{j=1}^{N} |β_j|, while ridge by ∑_{j=1}^{N} β_j².
Thus, the two methods apply different kinds of penalties to the OLS objective. Ridge performs
shrinkage in a proportional manner, while lasso applies a type of shrinkage called soft
thresholding, which shrinks coefficients by a fixed amount. In the case of an orthonormal design,
the ridge and lasso estimates are simple functions of the OLS estimates, given by Equation (18) and
Equation (26), respectively. Figure 4 depicts this relationship graphically: it displays the ridge
and lasso estimates plotted against the OLS estimates, with the OLS estimates themselves shown as
a reference line.

Figure 4: Ridge and Lasso Estimates in the Orthonormal Case (adapted from [5]).

We can see that both the ridge and lasso estimates are functions of the OLS estimates.
For the lasso, the parameter λ controls how many variables are included in the model, while ridge
keeps all the predictors and therefore does not produce a parsimonious model (one with as few
predictors as possible). Both methods can outperform OLS because both achieve a reduction in
variance at the cost of an increase in bias. Lasso coefficient estimates are more interpretable,
as a consequence of the variable selection feature. However, in terms of prediction error (MSE),
neither method dominates the other, and we cannot determine in general which performs better.
According to [5],
lasso performs well in the setting when very few of the predictors have high coefficients and the
rest very low coefficients. Ridge regression performs well when there are many predictors to
explain the output and each of the coefficients associated with the predictors has approximately
the same size. The choice of these methods depends on the particular data set at hand [3].
Nevertheless, the lasso method is more popular due to its variable selection property.
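Since λ is usually chosen by cross-validation, the following hedged sketch shows one way to compare the two methods on a given data set, using scikit-learn's RidgeCV and LassoCV (assuming that library; the data and grid of penalty values are illustrative).

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(7)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta_true = np.concatenate([np.array([4.0, -3.0, 2.0]), np.zeros(p - 3)])  # sparse truth
y = X @ beta_true + rng.normal(size=n)

# Both estimators select their penalty parameter by cross-validation.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)

print("ridge penalty:", ridge.alpha_)
print("lasso penalty:", lasso.alpha_)
print("nonzero lasso coefficients:", np.sum(lasso.coef_ != 0))  # close to the number of true signals
```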
However, lasso exhibits certain limitations. In the case when the number of predictors is much
higher than the number of observations, lasso does not perform well. Some of the problems in
this situation include the fact that lasso selects at most n variables, and that for the lasso to be
defined we need to specify a bound on the L1 norm. Furthermore, when there are high correlations
between variables within a group, the lasso tends to select only one variable from the group
rather than the whole group [10]. Lastly, in [7] it is demonstrated that when n > p and the
predictors are highly correlated, the prediction performance of the lasso deteriorates and ridge
regression outperforms it.
Extensions to improve the lasso method have also been developed recently. One of these
techniques is the elastic net [10] [3], which combines ridge regression shrinkage and lasso
variable selection property and has the additional advantage of grouping correlated variables.
Other extensions include bridge regression, which generalizes the penalty to arbitrary L_q norms,
and the non-negative garrote [7].
6 Conclusion
In this paper, we explained the shrinkage methods ridge regression and lasso. These methods are
used, for example, in forecasting applications, where we try to predict future observations from
past data, and they can achieve better prediction accuracy than OLS. Both ridge and lasso shrink
the value
of the coefficients towards zero. Shrinkage solves some of the problems of OLS estimates, such
as multicollinearity, and helps reduce problems associated with complex models, by avoiding
over-fitting. The methods minimize a penalized residual sum of squares with different penalties.
The amount of shrinkage is controlled in both methods by a parameter usually chosen by cross-
validation. Ridge regression applies an amount of shrinkage which brings the coefficients towards
zero. This reduces the variance, at the cost of increased bias and improves the accuracy of the
model. Lasso soft-thresholds some coefficients exactly to zero, which yields sparse models. In
addition, lasso performs variable selection and provides models that are easier to interpret. The
choice of which method to use depends on the data set. However, lasso performs poorly in certain
settings, for example when the number of predictors is much higher than the number of
observations or when selecting grouped variables. There are several algorithms available for
computing the lasso
estimates, as well as several extensions to ridge and lasso, including the elastic net, bridge
regression, and the non-negative garrote. We can conclude that ridge and lasso are two powerful
tools for regression, with
ridge outperforming OLS and the elastic net outperforming lasso.
A References
[1] C. M. Bishop. Pattern Recognition and Machine Learning. New York: Springer, 2006. ISBN: 0387310738.
[2] I. E. Frank and J. H. Friedman. A Statistical View of Some Chemometrics Regression Tools. In: Technometrics, Vol. 35, No. 2 (May 1993), p. 109. ISSN: 00401706.
[3] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York: Springer, 2001. ISBN: 978-0-387-84858-7.
[4] A. E. Hoerl and R. W. Kennard. Ridge Regression: Biased Estimation for Nonorthogonal Problems. In: Technometrics, Vol. 42, No. 1 (Feb. 2000), p. 80. ISSN: 00401706.
[5] G. James et al. An Introduction to Statistical Learning. Vol. 103. Springer Texts in Statistics. New York, NY: Springer New York, 2013. ISBN: 978-1-4614-7137-0.
[6] T. Park and G. Casella. The Bayesian Lasso. In: Journal of the American Statistical Association, Vol. 103, No. 482 (June 2008), pp. 681-686. ISSN: 0162-1459, 1537-274X.
[7] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. In: Journal of the Royal Statistical Society. Series B (Methodological), Vol. 58, No. 1 (1996), pp. 267-288. ISSN: 0035-9246.
[8] J. Wooldridge. Introductory Econometrics: A Modern Approach. Cengage Learning, Sept. 26, 2012. 910 pp. ISBN: 1111531048.
[9] X. Yan and X. Su. Linear Regression Analysis: Theory and Computing. Singapore; Hackensack, N.J.: World Scientific Pub. Co., 2009. ISBN: 9789812834119.
[10] H. Zou and T. Hastie. Regularization and Variable Selection via the Elastic Net. In: Journal of the Royal Statistical Society. Series B (Statistical Methodology), Vol. 67, No. 2 (2005), pp. 301-320.
[11] H. Zou, T. Hastie, and R. Tibshirani. On the Degrees of Freedom of the Lasso. In: The Annals of Statistics, Vol. 35, No. 5 (Oct. 2007), pp. 2173-2192. ISSN: 0090-5364.
B List of Figures
1 The Bias-Variance Trade-off Achieved with Ridge Regression: Influence on Prediction Error and Model Complexity (adapted from [3]).
2 Lasso and Ridge Geometrical Interpretation: Contours of the Errors, represented by elliptical contours, and Constraint Functions for ridge (β₁² + β₂² ≤ t) and lasso (|β₁| + |β₂| ≤ t).
3 Contours of the Penalization Term for Lasso (left) and Ridge (right) (adapted from [5]).
4 Ridge and Lasso Estimates in the Orthonormal Case (adapted from [5]).
C List of Tables
1 Comparison between ridge regression and lasso.