
Ridge Regression

Ridge regression addresses issues that arise when there are many predictors or the predictor matrix is ill-conditioned. It places an L2 penalty on the regression coefficients to reduce their variance: the coefficients are shrunk toward zero, which can improve prediction. Ridge estimates are obtained by minimizing the residual sum of squares plus a penalty term proportional to the squared L2 norm of the coefficients. Geometrically, the ridge estimate is the point where a residual-sum-of-squares contour touches the constraint circle. Cross-validation is often used to select the ridge parameter lambda, which controls the amount of shrinkage.


Published on STAT 897D (https://onlinecourses.science.psu.edu/stat857)

5.1 Ridge Regression

Motivation: too many predictors
It is not unusual to see the number of input variables greatly exceed the number of observations, e.g. microarray data analysis, environmental pollution studies.
With many predictors, fitting the full model without penalization will result in large prediction intervals, and the LS regression estimator may not uniquely exist.
Motivation: ill-conditioned X
Because the LS estimates depend upon \((X'X)^{-1}\), we would have problems in computing \(\hat{\beta}_{LS}\) if \(X'X\) were singular or nearly singular.
In those cases, small changes to the elements of \(X\) lead to large changes in \((X'X)^{-1}\).
The least squares estimator \(\hat{\beta}_{LS}\) may provide a good fit to the training data, but it will not fit sufficiently well to the test data.

Ridge Regression
One way out of this situation is to abandon the requirement of an unbiased estimator.
We assume only that X's and Y have been centered, so that we have no need for a constant term in the regression:
X is an n by p matrix with centered columns,
Y is a centered n-vector.
Hoerl and Kennard (1970) proposed that potential instability in the LS estimator

\(\hat{\beta} = (X'X)^{-1}X'Y\)

could be improved by adding a small constant value \(\lambda\) to the diagonal entries of the matrix \(X'X\) before taking its inverse. The result is the ridge regression estimator

\(\hat{\beta}_{ridge} = (X'X + \lambda I_p)^{-1}X'Y\)

Ridge regression places a particular form of constraint on the parameters (\(\beta\)'s): \(\hat{\beta}_{ridge}\) is chosen to minimize the penalized sum of squares:

\(\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,\)

which is equivalent to minimization of \(\sum_{i=1}^{n}\big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2\) subject to, for some \(c > 0\), \(\sum_{j=1}^{p} \beta_j^2 < c\), i.e. constraining the sum of the squared coefficients.

Therefore, ridge regression puts further constraints on the parameters, the \(\beta_j\)'s, in the linear model. Instead of just minimizing the residual sum of squares, we also have a penalty term on the \(\beta\)'s. This penalty term is \(\lambda\) (a pre-chosen constant) times the squared norm of the \(\beta\) vector. This means that if the \(\beta_j\)'s take on large values, the optimization function is penalized. We would prefer smaller \(\beta_j\)'s, or \(\beta_j\)'s close to zero, to keep the penalty term small.
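
To make the estimator concrete, here is a minimal sketch in Python/NumPy; the data, the helper name ridge_estimate, and the value lam=1.0 are illustrative assumptions, not part of the original lesson.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Closed-form ridge estimator: (X'X + lam * I)^{-1} X'Y.

    Assumes the columns of X and the vector y are already centered,
    as the lesson requires (so no intercept is fit)."""
    p = X.shape[1]
    # Solve the penalized normal equations rather than inverting explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Illustrative centered data, purely for demonstration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X -= X.mean(axis=0)
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=50)
y -= y.mean()

print(ridge_estimate(X, y, lam=1.0))
```

Note that lam=0 recovers the ordinary least squares solution, while larger values of lam pull the printed coefficients toward zero.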

Geometric Interpretation of Ridge Regression:

The ellipses correspond to the contours of the residual sum of squares (RSS): the inner ellipse has smaller RSS, and RSS is minimized at the ordinary least squares (OLS) estimates.
For \(p = 2\), the constraint in ridge regression corresponds to a circle, \(\sum_{j=1}^{2} \beta_j^2 < c\).
We are trying to minimize the ellipse size and the circle simultaneously in ridge regression. The ridge estimate is given by the point at which the ellipse and the circle touch.
There is a trade-off between the penalty term and RSS. A large \(\beta\) might give you a better residual sum of squares, but then it will push the penalty term higher. This is why you might actually prefer smaller \(\beta\)'s with a worse residual sum of squares. From an optimization perspective, the penalty term is equivalent to a constraint on the \(\beta\)'s. The function is still the residual sum of squares, but now you constrain the norm of the \(\beta_j\)'s to be smaller than some constant c. There is a correspondence between \(\lambda\) and c. The larger \(\lambda\) is, the more you prefer the \(\beta_j\)'s close to zero. In the extreme case when \(\lambda = 0\), you would simply be doing normal linear regression. At the other extreme, as \(\lambda\) approaches infinity, you set all the \(\beta\)'s to zero.

Properties of Ridge Estimator:
\(\hat{\beta}_{ls}\) is an unbiased estimator of \(\beta\); \(\hat{\beta}_{ridge}\) is a biased estimator of \(\beta\).
For orthogonal covariates, \(X'X = nI_p\), \(\hat{\beta}_{ridge} = \frac{n}{n+\lambda}\hat{\beta}_{ls}\). Hence, in this case, the ridge estimator always produces shrinkage towards 0. \(\lambda\) controls the amount of shrinkage.
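
As a quick numerical check of this identity, the sketch below constructs a design with exactly orthogonal columns rescaled so that \(X'X = nI_p\); the data and the value of \(\lambda\) are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 100, 4, 5.0
# Orthonormal columns from a QR decomposition, rescaled so X'X = n * I_p.
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = np.sqrt(n) * Q
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Ridge equals the LS estimate shrunk by the factor n / (n + lambda).
print(np.allclose(beta_ridge, n / (n + lam) * beta_ls))  # True
```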
An important concept in shrinkage is the "effective" degrees of freedom associated with a set of parameters. In a ridge regression setting:
1. If we choose \(\lambda = 0\), we have p parameters (since there is no penalization).
2. If \(\lambda\) is large, the parameters are heavily constrained and the degrees of freedom will effectively be lower, tending to 0 as \(\lambda \to \infty\).
The effective degrees of freedom associated with \(\beta_1, \beta_2, \ldots, \beta_p\) is defined as

\(df(\lambda) = tr\big(X(X'X + \lambda I_p)^{-1}X'\big) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda},\)

where \(d_j\) are the singular values of \(X\). Notice that \(\lambda = 0\), which corresponds to no shrinkage, gives \(df(\lambda) = p\) (as long as \(X'X\) is non-singular), as we would expect.
There is a 1:1 mapping between \(\lambda\) and the degrees of freedom, so in practice one may simply pick the effective degrees of freedom that one would like associated with the fit, and solve for \(\lambda\).
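
Both directions of this mapping are easy to compute. The sketch below is an illustrative assumption: the helper lambda_for_df inverts \(df(\lambda)\) by simple bisection, exploiting the fact that \(df\) is monotone decreasing in \(\lambda\).

```python
import numpy as np

def df_ridge(lam, d):
    """Effective degrees of freedom: sum_j d_j^2 / (d_j^2 + lambda)."""
    return np.sum(d**2 / (d**2 + lam))

def lambda_for_df(target_df, d, hi=1e8, iters=200):
    """Solve df(lambda) = target_df by bisection (df decreases in lambda)."""
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if df_ridge(mid, d) > target_df:
            lo = mid  # still too many df: penalize more
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 6))
d = np.linalg.svd(X, compute_uv=False)  # singular values of X

print(df_ridge(0.0, d))        # equals p (= 6) since X'X is nonsingular here
lam = lambda_for_df(3.0, d)
print(lam, df_ridge(lam, d))   # df(lam) is approximately 3.0
```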
As an alternative to a user-chosen \(\lambda\), cross-validation is often used in choosing \(\lambda\): we select the \(\lambda\) that yields the smallest cross-validation prediction error.
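
For instance, a minimal sketch with scikit-learn's RidgeCV (not used in the original lesson; scikit-learn calls the penalty alpha rather than \(\lambda\), and the data and grid below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=80)

# Search a grid of candidate penalties; the one with the smallest
# cross-validation prediction error is selected.
model = RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5).fit(X, y)
print(model.alpha_)  # the selected penalty
print(model.coef_)
```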
The intercept \(\beta_0\) has been left out of the penalty term because \(Y\) has been centered. Penalization of the intercept would make the procedure depend on the origin chosen for \(Y\).

Since the ridge estimator is linear, it is straightforward to calculate the variance-covariance matrix

\(var(\hat{\beta}_{ridge}) = \sigma^2 (X'X + \lambda I_p)^{-1} X'X (X'X + \lambda I_p)^{-1}.\)
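
A direct transcription of this formula, as a sketch (\(\sigma^2\) is treated as known here purely for illustration):

```python
import numpy as np

def ridge_cov(X, lam, sigma2=1.0):
    """Sandwich form: sigma^2 (X'X + lam I)^{-1} X'X (X'X + lam I)^{-1}."""
    p = X.shape[1]
    A_inv = np.linalg.inv(X.T @ X + lam * np.eye(p))
    return sigma2 * A_inv @ (X.T @ X) @ A_inv

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 3))
print(np.diag(ridge_cov(X, lam=2.0)))  # variances of the ridge coefficients
```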

A Bayesian Formulation
Consider the linear regression model with normal errors:

\(Y_i = \sum_{j=1}^{p} X_{ij}\beta_j + \epsilon_i,\)

where \(\epsilon_i\) are i.i.d. normal errors with mean 0 and known variance \(\sigma^2\).

Since \(\lambda\) is applied to the squared norm of the \(\beta\) vector, people often standardize all of the covariates to make them have a similar scale. Assume \(\beta_j\) has the prior distribution \(\beta_j \sim_{iid} N(0, \sigma^2/\lambda)\). A large value of \(\lambda\) corresponds to a prior that is more tightly concentrated around zero, and hence leads to greater shrinkage towards zero.

The posterior is \(\beta|Y \sim N\big(\hat{\beta},\; \sigma^2(X'X + \lambda I_p)^{-1}X'X(X'X + \lambda I_p)^{-1}\big)\), where \(\hat{\beta} = \hat{\beta}_{ridge} = (X'X + \lambda I_p)^{-1}X'Y\), confirming that the posterior mean (and mode) of the Bayesian linear model corresponds to the ridge regression estimator.
Whereas the least squares solutions \(\hat{\beta}_{ls} = (X'X)^{-1}X'Y\) are unbiased if the model is correctly specified, ridge solutions are biased, \(E(\hat{\beta}_{ridge}) \neq \beta\). However, at the cost of bias, ridge regression reduces the variance, and thus might reduce the mean squared error (MSE):

\(MSE = Bias^2 + Variance.\)
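
A small simulation sketch of this trade-off; the true \(\beta\), noise level, and \(\lambda\) below are arbitrary choices, and whether ridge comes out ahead depends on them.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam, sigma = 30, 10, 10.0, 2.0
beta = np.ones(p)
X = rng.normal(size=(n, p))
A = X.T @ X

ls_err, ridge_err = [], []
for _ in range(2000):
    y = X @ beta + sigma * rng.normal(size=n)
    ls_err.append(np.sum((np.linalg.solve(A, X.T @ y) - beta) ** 2))
    ridge_err.append(
        np.sum((np.linalg.solve(A + lam * np.eye(p), X.T @ y) - beta) ** 2))

# Ridge is biased, yet its total estimation error can be smaller than
# that of unbiased least squares when variance dominates.
print(np.mean(ls_err), np.mean(ridge_err))
```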

More Geometric Interpretations (optional)
Inputs are centered first.
Consider the fitted response

\(\hat{y} = X\hat{\beta}_{ridge} = X(X'X + \lambda I)^{-1}X'Y = UD(D^2 + \lambda I)^{-1}DU'Y = \sum_{j=1}^{p} \textbf{u}_j \frac{d_j^2}{d_j^2 + \lambda} \textbf{u}_j'Y,\)

where \(\textbf{u}_j\) are the normalized principal components of X.
Ridge regression shrinks the coordinates with respect to the orthonormal basis formed by the principal components.
Coordinates with respect to principal components with smaller variance are shrunk more.
Instead of using \(X = (X_1, X_2, \ldots, X_p)\) as predicting variables, use the new input matrix \(\tilde{X} = UD\).
Then for the new inputs:

\(\hat{\beta}_{ridge,j} = \frac{d_j^2}{d_j^2 + \lambda}\hat{\beta}_{ls,j}, \qquad Var(\hat{\beta}_j) = \frac{\sigma^2}{d_j^2},\)

where \(\sigma^2\) is the variance of the error term \(\epsilon\) in the linear model.


The shrinkage factor given by ridge regression is:

\(\frac{d_j^2}{d_j^2 + \lambda}.\)

We saw this in the previous formula. The larger \(\lambda\) is, the more the projection is shrunk in the direction of \(\textbf{u}_j\). Coordinates with respect to the principal components with a smaller variance are shrunk more.
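
A short sketch verifying these shrinkage factors via the SVD; the centered data and \(\lambda\) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 4))
X -= X.mean(axis=0)  # inputs are centered first
y = rng.normal(size=50)
lam = 3.0

U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)  # one shrinkage factor per principal component
print(shrink)  # smaller d_j (lower-variance direction) -> more shrinkage

# Fitted values as shrunken projections onto the principal components u_j...
y_hat = U @ (shrink * (U.T @ y))
# ...agree with the direct ridge fit X (X'X + lam I)^{-1} X'y.
direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
print(np.allclose(y_hat, direct))  # True
```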
Let's take a look at this geometrically [1].
This interpretation will become convenient when we compare it to principal components regression, where instead of doing shrinkage, we either shrink a direction all the way to zero or we don't shrink it at all. We will see this in the "Dimension Reduction Methods" lesson.
Source URL: https://onlinecourses.science.psu.edu/stat857/node/155
Links:
[1] Shrinkage animation: https://onlinecourses.science.psu.edu/stat857/sites/onlinecourses.science.psu.edu.stat857/files/lesson02/shrinkage_viewlet_swf.html
