CS 7265 Big Data Analytics: Regularization on Linear Models
Mingon Kang, Ph.D., Computer Science, Kennesaw State University
* Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington
Ill-conditioned matrix
Least-squares (LS) estimates depend on (X′X)^{-1}, and with an ill-conditioned matrix X this computation breaks down: X′X may be singular or nearly singular.
In this case, small changes to the elements of X lead to large changes in (X′X)^{-1}.
Ill-conditioned Matrices
System 1:  x + y = 2,   x + 1.001y = 2
System 2:  x + y = 2,   x + 1.001y = 2.001
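Solving both systems makes the sensitivity concrete: the first gives (x, y) = (2, 0) while the slightly perturbed one gives (1, 1). A minimal NumPy check (illustrative only, not from the slides):

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.001]])
b1 = np.array([2.0, 2.0])       # right-hand side of the first system
b2 = np.array([2.0, 2.001])     # tiny perturbation of the right-hand side

print(np.linalg.solve(A, b1))   # [2. 0.]
print(np.linalg.solve(A, b2))   # [1. 1.]
print(np.linalg.cond(A))        # roughly 4000: the matrix is ill-conditioned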
Reference: http://web.as.uky.edu/statistics/users/pbreheny/603/2-20.pdf
Motivation
This happens because x1 and x2 are highly correlated:
RSS(40, -38) = 21.7 (our estimate) is very close to
RSS(1, 1) = 22.6 (the truth)
An effective way of dealing with this problem is penalization:
instead of minimizing the RSS alone, we add a penalty term to the regression objective.
Reference: http://web.as.uky.edu/statistics/users/pbreheny/603/2-20.pdf
Ridge Regression
Ridge Regression Model
min_b Σ_{i=1}^n (y_i − X_i b)^2 + λ Σ_{j=1}^p b_j^2
Reference: http://web.as.uky.edu/statistics/users/pbreheny/603/2-20.pdf
Ridge Regression
Lagrange Multiplier
A strategy for finding the local maxima or minima of a function
subject to equality/inequality constraints
Minimizing
f(x)  s.t.  g(x) ≤ C
is equivalent to minimizing
f(x) + λg(x),
where λ ≥ 0 is the Lagrange multiplier.
Lagrange Multiplier
Example
Find the extrema of the function F(x, y) = 2y + x
subject to the constraint 0 = g(x, y) = y^2 + xy − 1.
Form H(x, y, z) = 2y + x + z(y^2 + xy − 1) and set its partial derivatives to zero.
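Working this out (a standard Lagrange-multiplier calculation):

∂H/∂x = 1 + zy = 0
∂H/∂y = 2 + z(2y + x) = 0
∂H/∂z = y^2 + xy − 1 = 0

The first equation gives z = −1/y; substituting into the second gives 2 − (2y + x)/y = 0, hence x = 0. The constraint then forces y^2 = 1, so y = ±1, and the extrema are F(0, 1) = 2 and F(0, −1) = −2.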
P-norm (p ≥ 1)
‖x‖_p := (Σ_{i=1}^n |x_i|^p)^{1/p}
p=1, Manhattan norm (L-1 norm); p=2, Euclidean norm; p=∞,
maximum norm
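As a quick illustration (assuming NumPy; the vector is arbitrary):

import numpy as np

x = np.array([3.0, -4.0, 1.0])

l1   = np.sum(np.abs(x))        # Manhattan norm: 8.0
l2   = np.sqrt(np.sum(x**2))    # Euclidean norm: about 5.10
linf = np.max(np.abs(x))        # maximum norm: 4.0

# np.linalg.norm computes the same values with ord = 1, 2, np.inf
print(l1, l2, linf)
print(np.linalg.norm(x, 1), np.linalg.norm(x, 2), np.linalg.norm(x, np.inf))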
Optimization
H(b, λ) = (y − Xb)′(y − Xb) + λb′b
        = y′y − 2b′X′y + b′X′Xb + λb′b
∂H(b, λ)/∂b = −2X′y + 2X′Xb + 2λb = 0
(X′X + λI)b = X′y
b = (X′X + λI)^{-1} X′y
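A minimal NumPy sketch of this closed form (the data are synthetic and purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # design matrix
true_b = np.array([1.0, 0.5, 0.0, -2.0, 3.0])
y = X @ true_b + rng.normal(scale=0.1, size=100)

lam = 1.0                                           # regularization strength λ
p = X.shape[1]

# Ridge estimate: b = (X'X + λI)^{-1} X'y
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# OLS estimate for comparison: b = (X'X)^{-1} X'y
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

print(b_ridge)
print(b_ols)

np.linalg.solve is used rather than forming the inverse explicitly, which is numerically safer, and adding λI makes the system better conditioned, which is exactly the ill-conditioning issue discussed earlier.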
LASSO
min_b Σ_{i=1}^n (y_i − X_i b)^2   s.t.   Σ_{j=1}^p |b_j| ≤ c,
where X_i (∈ ℝ^{1×p}) represents the i-th row vector of X
Ref: http://statweb.stanford.edu/~tibs/lasso.html
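In practice LASSO is fit with an iterative solver; a minimal sketch with scikit-learn's Lasso (note that scikit-learn scales the RSS term by 1/(2n), so its alpha is not numerically identical to the λ or c above):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# only the first three coefficients are truly non-zero
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1)      # alpha plays the role of the penalty weight
model.fit(X, y)
print(model.coef_)            # most of the remaining coefficients are shrunk exactly to zero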
LASSO
The absolute value in the penalty is not differentiable, so LASSO lacks a closed-form solution in general.
In the special case of an orthonormal design matrix, X′X = I, a closed-form solution can be obtained:
b_j^lasso = S(b_j^OLS, λ), where S is the soft-thresholding operator defined as
S(z, λ) = z − λ   if z > λ
          0       if |z| ≤ λ
          z + λ   if z < −λ
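A direct NumPy translation of S(z, λ), applied elementwise to a vector of OLS coefficients (the numbers are illustrative):

import numpy as np

def soft_threshold(z, lam):
    # S(z, λ): shrink toward zero by λ and clip to zero inside [−λ, λ]
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

b_ols = np.array([2.5, -0.3, 0.8, -1.7])
print(soft_threshold(b_ols, 1.0))    # [ 1.5  0.   0.  -0.7]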
Soft-thresholding operator
The proximal mapping of the L1-norm: prox_{λ‖·‖_1}(z) = argmin_x { (1/2)‖x − z‖_2^2 + λ‖x‖_1 } = S(z, λ), applied elementwise
Elastic Net
Combines the two penalties: minimize (y − Xb)′(y − Xb) + λ1 Σ_{j=1}^p |b_j| + λ2 Σ_{j=1}^p b_j^2
If λ1 = λ, λ2 = 0, it is LASSO
If λ1 = 0, λ2 = λ, it is Ridge regression
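scikit-learn's ElasticNet expresses the same penalty through alpha and l1_ratio rather than λ1 and λ2 (its internal scaling differs from the formulation above); a minimal sketch:

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# l1_ratio = 1.0 recovers LASSO, l1_ratio = 0.0 an L2-only (ridge-like) penalty
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)
print(model.coef_)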
Group LASSO
Proposed by Yuan and Lin in 2006
When pre-defined groups of covariates are given, the covariates in a group are selected or dropped together
min_β Σ_{i=1}^n (y_i − Σ_{j=1}^J X_{ij} β_j)^2 + λ Σ_{j=1}^J √(p_j) ‖β_j‖_2, where β_j is the coefficient vector of group j and p_j its size
Example: y = β0 + β1 x1 + β2 x2 + β12 x1 x2 + e
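A small sketch of the group penalty itself (assuming NumPy; the grouping is illustrative, not taken from the slides):

import numpy as np

def group_lasso_penalty(beta, groups, lam):
    # λ Σ_j sqrt(p_j) ‖β_j‖_2 over predefined, non-overlapping groups of indices
    return lam * sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups)

beta = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
groups = [[0, 1], [2, 3, 4]]       # two illustrative groups of coefficients
print(group_lasso_penalty(beta, groups, lam=0.5))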
Kernel function
Converts a non-linear relationship into a linear one in a higher-dimensional space
https://www.youtube.com/watch?v=9NrALgHFwTo
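A tiny NumPy and scikit-learn illustration of the idea (synthetic data): labels defined by a circle are not linearly separable in (x1, x2), but they are after the explicit quadratic feature map z = (x1^2, x2^2); a kernel function lets a model work in such a space without constructing it explicitly.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 0.5).astype(int)     # circular decision boundary

Z = X**2                 # explicit feature map: (x1, x2) -> (x1^2, x2^2)

print(LogisticRegression().fit(X, y).score(X, y))   # well below 1: no separating line exists
print(LogisticRegression().fit(Z, y).score(Z, y))   # close to 1: the boundary is linear in z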
References
Elastic net
H. Zou and T. Hastie, “Regularization and variable selection via the elastic net”, J. R. Statist. Soc. B, 2005.
https://web.stanford.edu/~hastie/Papers/B67.2%20(2005)%20301-320%20Zou%20&%20Hastie.pdf
Group Lasso
M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables”, J. R. Statist. Soc. B, 2006.
http://pages.stat.wisc.edu/~myuan/papers/glasso.final.pdf