
CS 7265 BIG DATA ANALYTICS

REGULARIZATION ON LINEAR MODEL

* Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington

Mingon Kang, Ph.D


Computer Science, Kennesaw State University
Problems in Linear Regression
 More predictors than the number of samples
 Not unusual to see such data. E.g., microarray data

 Ill-conditioned matrix
 LS estimates depend on (X′X)⁻¹; when X is ill-conditioned, X′X may be
singular or nearly singular
 In this case, small changes to the elements of X lead to
large changes in (X′X)⁻¹
Ill-conditioned Matrices
   x + y = 2                x + y = 2
   x + 1.001y = 2           x + 1.001y = 2.001

 The solution is x = 2, y = 0 on the left and x = 1, y = 1 on
the right. The coefficient matrix is called ill-conditioned
because a small change in the constant coefficients
results in a large change in the solution.
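The effect can be reproduced numerically. The following is a minimal sketch, assuming NumPy is available (not part of the slides), using the two right-hand sides above:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.001]])          # coefficient matrix shared by both systems

b_left = np.array([2.0, 2.0])         # right-hand side of the left system
b_right = np.array([2.0, 2.001])      # right-hand side of the right system

print(np.linalg.solve(A, b_left))     # approximately [2. 0.]
print(np.linalg.solve(A, b_right))    # approximately [1. 1.]
print(np.linalg.cond(A))              # large condition number signals ill-conditioning
```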
Solutions
 The estimates can sometimes be improved by trading a little bias
to reduce the variance of the predicted values
 Determine a smaller subset of predictors that
exhibit the strongest effects (Feature Selection)
 This gives the big picture while sacrificing some of the small
details
Motivation
 If two or more independent variables are highly
correlated:
 The intercept is estimated well, but what about the coefficients?

Reference: http://web.as.uky.edu/statistics/users/pbreheny/603/2-20.pdf
Motivation
 This happens because x1 and x2 are highly
correlated.
 RSS(40, -38) = 21.7 (our estimate) is very close to
RSS(1, 1) = 22.6 (the truth)
 An effective way of dealing with this problem is
through penalization:
 Instead of minimizing RSS only, we consider an
additional term in the regression form…

Reference: http://web.as.uky.edu/statistics/users/pbreheny/603/2-20.pdf
Ridge Regression
 Ridge Regression Model
   Minimize   Σ_{i=1}^{n} (y_i − X_i b)²
   s.t.       Σ_{j=1}^{p} b_j² ≤ c
Ridge Regression
 Why does this help?
 Smaller coefficients make the fit less sensitive to changes in the variables.

Reference: http://web.as.uky.edu/statistics/users/pbreheny/603/2-20.pdf
Ridge Regression
 Lagrange Multiplier
 A strategy for finding the local maxima or minima of a function
subject to equality/inequality constraints

Minimizing
   f(x)   s.t.   g(x) ≤ C
is equivalent to minimizing
   f(x) + λg(x),
where λ is positive.
Lagrange Multiplier
 Example
 Find the extrema of the function F(x, y) = 2y + x
subject to the constraint 0 = g(x, y) = y² + xy − 1

 Set the derivatives of H(x, y, z) = 2y + x + z(y² + xy − 1) to zero

Test with http://www.math.uri.edu/~bkaskosz/flashmo/graph3d2/
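A quick way to check this example is to solve the stationarity conditions symbolically. The sketch below assumes SymPy is available (it is not part of the slides):

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
H = 2*y + x + z*(y**2 + x*y - 1)

# Set all partial derivatives of H to zero and solve the resulting system
solutions = sp.solve([sp.diff(H, x), sp.diff(H, y), sp.diff(H, z)], [x, y, z], dict=True)
for s in solutions:
    print(s, 'F =', (2*y + x).subs(s))
# Stationary points: (x, y) = (0, 1) with F = 2 and (x, y) = (0, -1) with F = -2
```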


Ridge Regression
 Ridge Regression Model
   Minimize   Σ_{i=1}^{n} (y_i − X_i b)² + λ‖b‖²,

 where ‖b‖ is the L-2 norm of b (Euclidean distance)

 P-norm (p ≥ 1)
   ‖x‖_p := ( Σ_{i=1}^{n} |x_i|^p )^{1/p}
 p = 1, Manhattan norm (L-1 norm); p = 2, Euclidean norm; p = ∞, maximum norm
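For concreteness, the norms named above can be computed with NumPy; this is a small illustrative sketch with one example vector:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])
print(np.linalg.norm(x, 1))       # L-1 (Manhattan) norm: |3| + |-4| + |1| = 8
print(np.linalg.norm(x, 2))       # L-2 (Euclidean) norm: sqrt(9 + 16 + 1) ≈ 5.10
print(np.linalg.norm(x, np.inf))  # maximum norm: max(|3|, |-4|, |1|) = 4
```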
Optimization
 H(b, λ) = (y − Xb)′(y − Xb) + λb′b
         = y′y − 2b′X′y + b′X′Xb + λb′b

 ∂H(b, λ)/∂b = −2X′y + 2X′Xb + 2λb = 0
 (X′X + λI)b = X′y
 b = (X′X + λI)⁻¹X′y

 X′X + λI is always invertible (for λ > 0), so this always gives a unique
solution, b
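The closed-form solution translates directly into code. The minimal sketch below assumes NumPy and uses synthetic data; the names X, y, and lam are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

lam = 1.0
# Ridge estimate: b = (X'X + lam*I)^(-1) X'y
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# As lam -> 0 the estimate approaches OLS; as lam grows it shrinks toward zero
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.linalg.norm(b_ridge), np.linalg.norm(b_ols))
```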
Ridge Regression
 Similar to the ordinary least squares solution,
but with the addition of a “ridge”
regularization
 As λ → 0, b_ridge → b_OLS
 As λ → ∞, b_ridge → 0

 Applying the ridge regression penalty has the effect of
shrinking the estimates toward zero
 It introduces bias but reduces the variance of the estimate
LASSO
 LASSO: Least Absolute Shrinkage and Selection Operator
 L-1 norm penalization
 Many coefficients are shrunk all the way to zero, so the
solution is called sparse

   Minimize   Σ_{i=1}^{n} (y_i − X_i b)²
   s.t.       Σ_{j=1}^{p} |b_j| ≤ c

 where X_i (∈ ℝ^{1×p}) represents the i-th row vector of X

Ref: http://statweb.stanford.edu/~tibs/lasso.html
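As an illustration of the sparsity property, the hedged sketch below uses scikit-learn's Lasso on synthetic data; the data and the alpha value are illustrative only, not from the slides:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.standard_normal((n, p))
true_b = np.zeros(p)
true_b[:3] = [2.0, -1.5, 1.0]              # only the first three predictors matter
y = X @ true_b + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.sum(lasso.coef_ != 0), 'non-zero coefficients out of', p)  # most are exactly zero
```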
LASSO
 The absolute value in the penalty is not
differentiable, so there is no general closed-form solution
 In the special case of an orthonormal design matrix,
X′X = I, a closed-form solution is available
 b_j^lasso = S(b_j^OLS, λ), where S is the soft-thresholding operator
defined as:
              z − λ   if z > λ
   S(z, λ) =  0       if |z| ≤ λ
              z + λ   if z < −λ
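The operator translates directly into a few lines of code. The sketch below is a vectorized NumPy form that is equivalent to the piecewise definition above:

```python
import numpy as np

def soft_threshold(z, lam):
    """Return S(z, lam): shrink z toward zero by lam, and set |z| <= lam to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

print(soft_threshold(np.array([3.0, 0.5, -2.0]), 1.0))   # [ 2.  0. -1.]
```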
Soft-thresholding operator
 Proximal mapping of the L-1 norm:

   b* = argmin_b Σ_{i=1}^{n} (y_i − X_i b)² + λ‖b‖₁

   ∇‖y − Xb‖² + ∂(λ‖b‖₁) ∋ 0   ↔   −X′y + X′Xb + λ∂‖b‖₁ ∋ 0

 The L-1 norm is separable, so we can consider each of its
components separately. Let's examine first the case where
b_i ≠ 0. Then ∂|b_i| = sign(b_i), and (using the orthonormal design, X_i′X_i = 1)
the optimum b_i* is obtained from
   −X_i′y + X_i′X_i b_i + λ sign(b_i) = 0
   b_i = X_i′y − λ sign(b_i)
Soft-thresholding operator
 In the case where b_i = 0, the subdifferential of the
L-1 norm is the interval [−1, 1] and the optimality
condition is
   −X_i′y + λ[−1, 1] ∋ 0
   X_i′y ∈ [−λ, λ]
   |X_i′y| ≤ λ

Absolute value function (left), and its sub-differential as a function of x (right)
Geometry of ridge vs lasso
 A geometric illustration of why lasso results in
sparsity. Lasso (left): coefficients tend to be exactly zero
Reference
 http://web.as.uky.edu/statistics/users/pbreheny/603/2-20.pdf
 https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/model_selection.pdf
 http://statweb.stanford.edu/~tibs/sta305files/Rudyregularization.pdf
 https://onlinecourses.science.psu.edu/stat857/node/155
Elastic Net
 Hui Zou and Trevor Hastie in 2005
 Regularizes by combining the L-1 and L-2 norm penalties of
the Lasso and Ridge
 Overcomes a limitation of Lasso:
 When p is large, n is small, and several
variables are highly correlated with each other, LASSO
tends to select only one variable among them (the others
become zero).
Elastic Net
 The elastic net method includes the LASSO and
ridge regression

   b* = argmin_b Σ_{i=1}^{n} (y_i − X_i b)² + λ₁‖b‖₁ + λ₂‖b‖²

 If λ1 = λ, λ2 = 0, it is LASSO
 If λ1 = 0, λ2 = λ, it is Ridge regression
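A hedged sketch with scikit-learn's ElasticNet is shown below; its alpha and l1_ratio parameters mix the L-1 and L-2 penalties, though its internal scaling differs slightly from the λ₁, λ₂ notation above. The data are synthetic and illustrative only:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(n)   # two highly correlated predictors
y = X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(n)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_[:2])   # both correlated predictors tend to receive non-zero weight
```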
Group LASSO
 Yuan and Lin in 2006
 When pre-defined groups of covariates are given,
they are allowed to be selected together

   b* = argmin_b Σ_{i=1}^{n} (y_i − X_i b)² + λ Σ_{j=1}^{J} ‖b_j‖_{K_j}

 where ‖φ‖_K := (φ′Kφ)^{1/2},  φ ∈ ℝ^d,  K ∈ ℝ^{d×d}
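One common building block of group lasso solvers is group-wise (block) soft-thresholding. The sketch below assumes the special case K_j = I and is illustrative only, not a complete solver; the variable names and example groups are made up:

```python
import numpy as np

def group_soft_threshold(b, groups, lam):
    """Shrink each group of coefficients toward zero; zero out groups with small norm."""
    out = b.copy()
    for idx in groups:
        norm = np.linalg.norm(b[idx])
        out[idx] = 0.0 if norm <= lam else (1.0 - lam / norm) * b[idx]
    return out

b = np.array([3.0, 4.0, 0.2, -0.1])
groups = [np.array([0, 1]), np.array([2, 3])]
print(group_soft_threshold(b, groups, lam=1.0))   # the second group is removed entirely
```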
Interaction effects
 Linear models assume that the variables act independently
 When considering the relationship among two or more
variables, we need a term that describes their simultaneous
influence (a small sketch follows at the end of this slide):

   y = β₀ + β₁x₁ + β₂x₂ + β₁₂x₁x₂ + e

 Kernel function
 Converts a non-linear problem into a linear one in a higher-dimensional space
 https://www.youtube.com/watch?v=9NrALgHFwTo
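The interaction model above can be fit by adding the x₁x₂ column to the design matrix. The sketch below assumes scikit-learn and uses synthetic data; the names and coefficient values are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))                         # columns x1 and x2
y = 1.0 + 2.0*X[:, 0] - 1.0*X[:, 1] + 0.5*X[:, 0]*X[:, 1] + 0.1*rng.standard_normal(100)

# Expand the design matrix to [x1, x2, x1*x2] and fit an ordinary linear model
X_int = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(X)
model = LinearRegression().fit(X_int, y)
print(model.intercept_, model.coef_)                      # roughly 1.0 and [2.0, -1.0, 0.5]
```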
References
 Elastic net
 H. Zou and T. Hastie, “Regularization and variable
selection via the elastic net”, J.R. Statist. Soc., 2005
 https://web.stanford.edu/~hastie/Papers/B67.2%20(2005)%20301-320%20Zou%20&%20Hastie.pdf
 Group Lasso
 M. Yuan and Y. Lin, “Model selection and estimation in
regression with grouped variables”, J.R. Statist. Soc.,
2006
 http://pages.stat.wisc.edu/~myuan/papers/glasso.final.pdf
