Simple Linear Regression and Correlation
Objectives of the topic:
Fitting simple linear regression models to data.
Understanding the method of least squares and how it is used to estimate regression model parameters.
Assessing the adequacy of the regression model.
Testing hypotheses and constructing confidence intervals on regression model parameters.
Predicting future values and constructing prediction intervals.
Applying the correlation model.
Dr Muhammad Al Salamah, Industrial Engineering, KFUPM
Regression models
Regression models are used to establish a relationship between two or more variables.

Observation   Hydrocarbon level x (%)   Purity y (%)
1             0.99                      90.01
2             1.02                      89.05
3             1.15                      91.43
4             1.29                      93.74
5             1.46                      96.73
6             1.36                      94.45
7             0.87                      87.59
8             1.23                      91.77
9             1.55                      99.42
10            1.40                      93.65
11            1.19                      93.54
12            1.15                      92.52
13            0.98                      90.56
14            1.01                      89.54
15            1.11                      89.85
16            1.20                      90.39
17            1.26                      93.25
18            1.32                      93.41
19            1.43                      94.98
20            0.95                      87.33
When the scatter plot of two variables shows a linear trend, we say the two variables have a linear relationship.
The linear relationship between the mean of a random variable Y and x is given as
$E(Y \mid x) = \mu_{Y|x} = \beta_0 + \beta_1 x$
Since Y is a random variable, we can write
$Y = \beta_0 + \beta_1 x + \varepsilon$
The variable $\varepsilon$ is the random error, which has a mean of 0 and a variance of $\sigma^2$.
It follows that
$E(Y \mid x) = E(\beta_0 + \beta_1 x + \varepsilon) = \beta_0 + \beta_1 x + E(\varepsilon) = \beta_0 + \beta_1 x$
$V(Y \mid x) = V(\beta_0 + \beta_1 x + \varepsilon) = V(\beta_0 + \beta_1 x) + V(\varepsilon) = 0 + \sigma^2 = \sigma^2$
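Simple linear regression
For simple linear regression, there is a single regressor or predictor variable x and a dependent or response variable Y.
Suppose that we have n pairs of observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
The method of least squares is used to estimate the parameters $\beta_0$ and $\beta_1$ by minimizing the sum of the squares of the vertical deviations from the straight line:
$L = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
The least squares estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ must satisfy
$\left. \frac{\partial L}{\partial \beta_0} \right|_{\hat{\beta}_0, \hat{\beta}_1} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$
$\left. \frac{\partial L}{\partial \beta_1} \right|_{\hat{\beta}_0, \hat{\beta}_1} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) x_i = 0$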
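Simplifying the last two equations leads to the normal equations:
$n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$
$\hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i x_i$
The least squares estimates of the intercept and the slope of the regression line are
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
and
$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i - \frac{\left( \sum_{i=1}^{n} y_i \right) \left( \sum_{i=1}^{n} x_i \right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}}$
The fitted regression line is
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$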
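For each pair of observations, the following relation holds:
$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i, \quad i = 1, \ldots, n$
where $e_i = y_i - \hat{y}_i$ is the residual.
The least squares estimate of the slope can be written as
$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$
where $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}$ and $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)}{n}$.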
Example
See Example 11-1 in the textbook.
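The least squares formulas can be sketched in Python for the hydrocarbon level and purity data tabulated under "Regression models". This is only an illustration and not part of the original slides; the function name `least_squares_fit` and the variable names are assumptions chosen for readability.

```python
# Hydrocarbon level x (%) and purity y (%) for the 20 tabulated observations.
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]


def least_squares_fit(x, y):
    """Return (b0, b1, s_xx, s_xy) for the fitted line y = b0 + b1 * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # Corrected sums of squares and cross products.
    s_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    b1 = s_xy / s_xx                   # slope estimate
    b0 = sum_y / n - b1 * sum_x / n    # intercept estimate
    return b0, b1, s_xx, s_xy


b0, b1, s_xx, s_xy = least_squares_fit(x, y)
print(f"S_xx = {s_xx:.3f}, S_xy = {s_xy:.3f}")
print(f"fitted line: y = {b0:.3f} + {b1:.3f} x")
```

The printed line should agree, up to rounding, with the fitted model $\hat{y} = 74.283 + 14.947x$ quoted for these data in a later example.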
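Estimating $\sigma^2$
The residuals can be used to estimate the variance of the error.
The error sum of squares is given by
$SS_E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
An unbiased estimator of $\sigma^2$ is $\hat{\sigma}^2 = \frac{SS_E}{n - 2}$. For computation, $SS_E = SS_T - \hat{\beta}_1 S_{xy}$, where $SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left( \sum_{i=1}^{n} y_i \right)^2}{n}$.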
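Example
A simple linear regression model is assumed to adequately establish the relationship between compressive strength x and intrinsic permeability y of concrete mixes.
A sample of n = 14 was taken, and it was found that
$\sum y_i = 572$, $\sum y_i^2 = 23{,}530$, $\sum x_i = 43$, $\sum x_i^2 = 157.42$, $\sum x_i y_i = 1{,}697.8$
Calculate the least squares estimates of the slope and the intercept of the regression line.
Estimate $\sigma^2$.
Predict the value of y for x = 4.3.
For x = 3.7 and y = 46.1, compute the value of the residual.

$S_{xx} = 157.42 - \frac{43^2}{14} = 25.35$
$SS_T = 23{,}530 - \frac{(572)^2}{14} = 159.71$
$S_{xy} = 1{,}697.8 - \frac{(572)(43)}{14} = -59.06$
$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{-59.06}{25.35} = -2.33$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{572}{14} - (-2.33) \frac{43}{14} = 48.01$
$SS_E = SS_T - \hat{\beta}_1 S_{xy} = 159.71 - (-2.33)(-59.06) = 22.11$
$\hat{\sigma}^2 = \frac{SS_E}{n - 2} = \frac{22.11}{12} = 1.84$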
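The four parts of the concrete example can be worked from the summary sums alone. The sketch below is an illustration under that assumption; `fit_from_sums` and the dictionary keys are hypothetical names, not something taken from the slides or the textbook.

```python
def fit_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Least squares fit and variance estimate from summary sums only."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    ss_t = sum_y2 - sum_y ** 2 / n
    b1 = s_xy / s_xx
    b0 = sum_y / n - b1 * sum_x / n
    ss_e = ss_t - b1 * s_xy        # error sum of squares
    sigma2_hat = ss_e / (n - 2)    # estimate of the error variance
    return {"s_xx": s_xx, "s_xy": s_xy, "ss_t": ss_t,
            "b0": b0, "b1": b1, "ss_e": ss_e, "sigma2_hat": sigma2_hat}


# Sums given in the example (n = 14 concrete mixes).
fit = fit_from_sums(n=14, sum_x=43, sum_y=572,
                    sum_x2=157.42, sum_y2=23530, sum_xy=1697.8)

# Predicted permeability at x = 4.3 and residual for the point (3.7, 46.1).
y_hat_43 = fit["b0"] + fit["b1"] * 4.3
residual = 46.1 - (fit["b0"] + fit["b1"] * 3.7)

print(f"slope = {fit['b1']:.2f}, intercept = {fit['b0']:.2f}")
print(f"sigma^2 estimate = {fit['sigma2_hat']:.2f}")
print(f"prediction at x = 4.3: {y_hat_43:.2f}")
print(f"residual at (3.7, 46.1): {residual:.2f}")
```

Because the code carries full precision, the last digits can differ slightly from hand calculations that round intermediate values.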
Properties of the least squares estimators
The values of the intercept and the slope of the regression line depend on the observed values of the response variable y, which is a random variable with mean $\beta_0 + \beta_1 x$ and variance $\sigma^2$.
Thus, the least squares estimators are in turn random variables.
It has been shown that
$E(\hat{\beta}_0) = \beta_0$
$E(\hat{\beta}_1) = \beta_1$
$V(\hat{\beta}_0) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right]$
$V(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}$
The covariance of the intercept and slope estimators has been shown to equal
$\mathrm{cov}(\hat{\beta}_0, \hat{\beta}_1) = -\frac{\sigma^2 \bar{x}}{S_{xx}}$
The estimated standard error of the slope and the estimated standard error of the intercept are
$se(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{S_{xx}}}$
$se(\hat{\beta}_0) = \sqrt{\hat{\sigma}^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right]}$
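As a small numerical illustration of the two standard error formulas, the sketch below plugs in the hydrocarbon and purity summary values quoted in later examples of these slides (n = 20, $\bar{x}$ = 1.196, $S_{xx}$ = 0.681, $\hat{\sigma}^2$ = 1.180). The function name `standard_errors` is an assumption made for this example.

```python
import math


def standard_errors(n, x_bar, s_xx, sigma2_hat):
    """Return (se_b0, se_b1), the estimated standard errors of intercept and slope."""
    se_b1 = math.sqrt(sigma2_hat / s_xx)
    se_b0 = math.sqrt(sigma2_hat * (1 / n + x_bar ** 2 / s_xx))
    return se_b0, se_b1


# Summary values for the hydrocarbon level and purity data.
se_b0, se_b1 = standard_errors(n=20, x_bar=1.196, s_xx=0.681, sigma2_hat=1.180)
print(f"se(b0) = {se_b0:.3f}, se(b1) = {se_b1:.3f}")
```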
Hypothesis tests
Consider the hypothesis on the regression line slope:
$H_0: \beta_1 = \beta_{1,0}$
$H_1: \beta_1 \neq \beta_{1,0}$
When $H_0$ is true, the statistic
$T_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\hat{\sigma}^2 / S_{xx}}}$
follows the t distribution with n - 2 degrees of freedom.
The null hypothesis is rejected if $|t_0| > t_{\alpha/2, n-2}$.
The test statistic $T_0$ can be written as
$T_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{se(\hat{\beta}_1)}$
The hypothesis on the intercept can be written as
$H_0: \beta_0 = \beta_{0,0}$
$H_1: \beta_0 \neq \beta_{0,0}$
The test statistic for this hypothesis is
$T_0 = \frac{\hat{\beta}_0 - \beta_{0,0}}{\sqrt{\hat{\sigma}^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right]}} = \frac{\hat{\beta}_0 - \beta_{0,0}}{se(\hat{\beta}_0)}$
The null hypothesis is rejected if $|t_0| > t_{\alpha/2, n-2}$.
Significance of regression
A special hypothesis is given by
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
Failure to reject $H_0$ is equivalent to concluding that there is no linear relationship between x and Y.
If $H_0$ is rejected, this implies that x is of value in explaining the variability in Y.
Example
See Example 11-2 in the textbook.
The regression model for the hydrocarbon level x (%) and purity y (%) data, the 20 observations tabulated under "Regression models", has been found to be:
$\hat{y} = 74.283 + 14.947 x$
Test the hypothesis
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
for $\alpha = 0.05$. In the calculations, maintain 3 decimal places.
We compute the value of the test statistic:
$t_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\hat{\sigma}^2 / S_{xx}}} = \frac{14.947 - 0}{\sqrt{1.180 / 0.681}} = 11.355$
From the t distribution tables, $t_{0.025,18} = 2.101$.
Since $|t_0| > t_{0.025,18}$, $H_0$ should be rejected and we conclude that the slope of the regression line is different from zero.
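The test on the slope can be sketched in a few lines of Python. The sketch assumes a two-sided test at $\alpha = 0.05$ and takes the critical value $t_{0.025,18} = 2.101$ from the t tables (the same value quoted elsewhere in these slides) rather than computing it; `slope_t_statistic` is a hypothetical helper name.

```python
import math


def slope_t_statistic(b1, beta1_null, sigma2_hat, s_xx):
    """t statistic for H0: beta1 = beta1_null against a two-sided alternative."""
    return (b1 - beta1_null) / math.sqrt(sigma2_hat / s_xx)


# Values from the hydrocarbon level and purity example.
t0 = slope_t_statistic(b1=14.947, beta1_null=0.0, sigma2_hat=1.180, s_xx=0.681)

# Assumed table value t_{0.025,18} for a two-sided test at alpha = 0.05.
t_crit = 2.101
reject_h0 = abs(t0) > t_crit

print(f"t0 = {t0:.3f}, reject H0: {reject_h0}")
```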
Example
A simple linear regression model is assumed to adequately establish the relationship between compressive strength x and intrinsic permeability y of concrete mixes.
A sample of n = 14 was taken, and it was found that
$\hat{\beta}_1 = -2.33$, $\hat{\beta}_0 = 48.01$, $S_{xx} = 25.35$, $\hat{\sigma}^2 = 1.84$
Test for significance of regression for $\alpha = 0.05$.
Estimate the variance of the estimated slope.

We need to test the hypothesis
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
The value of the test statistic is
$t_0 = \frac{-2.33 - 0}{\sqrt{1.84 / 25.35}} = -8.65$
Since $t_0 = -8.65 < -t_{0.025,12} = -2.179$, $H_0$ must be rejected and we conclude that x significantly explains the variability in y.
The estimated variance of the regression slope is
$\hat{V}(\hat{\beta}_1) = \frac{\hat{\sigma}^2}{S_{xx}} = \frac{1.84}{25.35} = 0.0726$
Analysis of variance
The method of analysis of variance can be used to test for significance of regression.
The total variability in Y is partitioned into meaningful components, which is the basis of the test.
The analysis of variance identity is stated as
$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
$SS_T = SS_R + SS_E$
[Figure: the deviation $(y_i - \bar{y})$ of an observation $(x_i, y_i)$ split into $(\hat{y}_i - \bar{y})$ and $(y_i - \hat{y}_i)$ about the fitted line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$]
$SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ is the variation accounted for by the regression line, and it is called the regression sum of squares.
$SS_E$ is the residual variation left unexplained by the regression line, and it is called the error sum of squares.
$SS_T$ is the total corrected sum of squares of y.
$SS_T$ has n - 1 degrees of freedom, $SS_R$ has 1, and $SS_E$ has n - 2.
Dividing each sum of squares by its degrees of freedom gives what is called a mean square. Thus, $SS_R / 1 = MS_R$ and $SS_E / (n - 2) = MS_E$.
Define the random variable
$F_0 = \frac{SS_R / 1}{SS_E / (n - 2)} = \frac{MS_R}{MS_E}$
Consider the null hypothesis
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
If $H_0$ is true, then $F_0$ follows the F distribution with 1 and n - 2 degrees of freedom.
Thus, $H_0$ should be rejected if $f_0 > f_{\alpha, 1, n-2}$.
The computations are organized in the analysis of variance table:

Source of variation   Sum of squares   Degrees of freedom   Mean square   F0
Regression            SS_R             1                    MS_R          MS_R / MS_E
Error                 SS_E             n - 2                MS_E
Total                 SS_T             n - 1
Example
See Example 11-3 in the textbook.
Test the significance of regression using the analysis of variance.
The analysis of variance table:

Source of variation   Sum of squares   Degrees of freedom   Mean square   f0
Regression            152.13           1                    152.13        128.86
Error                 21.25            18                   1.18
Total                 173.38           19
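The entries of the analysis of variance table follow mechanically from the two sums of squares and the sample size. The sketch below shows that bookkeeping; `anova_f` is a hypothetical function name, and the critical value $f_{\alpha,1,n-2}$ is deliberately left out because the slides do not quote it.

```python
def anova_f(ss_r, ss_e, n):
    """Return (ss_t, ms_r, ms_e, f0) for simple linear regression."""
    ss_t = ss_r + ss_e        # analysis of variance identity
    ms_r = ss_r / 1           # regression has 1 degree of freedom
    ms_e = ss_e / (n - 2)     # error has n - 2 degrees of freedom
    f0 = ms_r / ms_e
    return ss_t, ms_r, ms_e, f0


# Sums of squares from the table above (n = 20 observations).
ss_t, ms_r, ms_e, f0 = anova_f(ss_r=152.13, ss_e=21.25, n=20)
print(f"SS_T = {ss_t:.2f}, MS_R = {ms_r:.2f}, MS_E = {ms_e:.2f}, f0 = {f0:.2f}")
```

In simple linear regression the F statistic equals the square of the t statistic for the slope, so $f_0$ here should be close to $11.355^2$, with small differences due to rounding of the inputs.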
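Confidence intervals on the slope and intercept
It has been shown that the two random variables
$\frac{\hat{\beta}_1 - \beta_1}{\sqrt{\hat{\sigma}^2 / S_{xx}}}$ and $\frac{\hat{\beta}_0 - \beta_0}{\sqrt{\hat{\sigma}^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right]}}$
follow the t distribution with n - 2 degrees of freedom.
A $100(1 - \alpha)\%$ confidence interval on the slope in simple linear regression is
$\hat{\beta}_1 - t_{\alpha/2, n-2} \sqrt{\frac{\hat{\sigma}^2}{S_{xx}}} \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2, n-2} \sqrt{\frac{\hat{\sigma}^2}{S_{xx}}}$
A $100(1 - \alpha)\%$ confidence interval on the intercept in simple linear regression is
$\hat{\beta}_0 - t_{\alpha/2, n-2} \sqrt{\hat{\sigma}^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right]} \le \beta_0 \le \hat{\beta}_0 + t_{\alpha/2, n-2} \sqrt{\hat{\sigma}^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right]}$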
Example
See Example 11-4 in the textbook.
It has been calculated that the 95% confidence interval on the slope is
$12.181 \le \beta_1 \le 17.713$
Note: maintain 3 decimal places all the way in the calculations.
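The interval on the slope can be reproduced from the hydrocarbon and purity summary values used in these slides ($\hat{\beta}_1$ = 14.947, $\hat{\beta}_0$ = 74.283, $\hat{\sigma}^2$ = 1.180, $S_{xx}$ = 0.681, n = 20, $\bar{x}$ = 1.196). The sketch assumes a 95% level with the table value $t_{0.025,18} = 2.101$; the function names are hypothetical.

```python
import math


def slope_ci(b1, sigma2_hat, s_xx, t_crit):
    """Two-sided confidence interval on the slope."""
    half_width = t_crit * math.sqrt(sigma2_hat / s_xx)
    return b1 - half_width, b1 + half_width


def intercept_ci(b0, sigma2_hat, n, x_bar, s_xx, t_crit):
    """Two-sided confidence interval on the intercept."""
    half_width = t_crit * math.sqrt(sigma2_hat * (1 / n + x_bar ** 2 / s_xx))
    return b0 - half_width, b0 + half_width


t_crit = 2.101  # assumed table value t_{0.025,18}
slope_lo, slope_hi = slope_ci(14.947, 1.180, 0.681, t_crit)
int_lo, int_hi = intercept_ci(74.283, 1.180, 20, 1.196, 0.681, t_crit)

print(f"slope:     {slope_lo:.3f} <= beta1 <= {slope_hi:.3f}")
print(f"intercept: {int_lo:.3f} <= beta0 <= {int_hi:.3f}")
```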
Confidence interval on the mean response
At a specific value $x_0$, the confidence interval on $E(Y \mid x_0) = \mu_{Y|x_0}$ is called the confidence interval about the regression line.
Since $E(Y \mid x_0) = \mu_{Y|x_0} = \beta_0 + \beta_1 x_0$, an unbiased point estimator of the mean response at $x_0$ would be
$\hat{\mu}_{Y|x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0$
The variance of the estimated mean response is
$V(\hat{\mu}_{Y|x_0}) = \sigma^2 \left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right]$
Notice that $\hat{\mu}_{Y|x_0}$ is normally distributed because the estimators of the slope and the intercept are both normal.
Therefore, the random variable
$\frac{\hat{\mu}_{Y|x_0} - \mu_{Y|x_0}}{\sqrt{\hat{\sigma}^2 \left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right]}}$
has a t distribution with n - 2 degrees of freedom.
A $100(1 - \alpha)\%$ confidence interval on the mean response at $x = x_0$ is
$\hat{\mu}_{Y|x_0} - t_{\alpha/2, n-2} \sqrt{\hat{\sigma}^2 \left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right]} \le \mu_{Y|x_0} \le \hat{\mu}_{Y|x_0} + t_{\alpha/2, n-2} \sqrt{\hat{\sigma}^2 \left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right]}$
Example
See Example 11-5 in the textbook.
It is given that
$n = 20$, $\bar{x} = 1.196$, $\hat{\beta}_0 = 74.283$, $\hat{\beta}_1 = 14.947$, $S_{xx} = 0.681$, $\hat{\sigma}^2 = 1.180$
We are asked to construct a 95% confidence interval on the mean response for $x_0 = 1$.
The estimated mean response is
$\hat{\mu}_{Y|x_0} = 74.283 + 14.947(1) = 89.23$
From the t distribution tables, $t_{0.025,18} = 2.101$.
The confidence interval is
$89.23 - 2.101 \sqrt{1.18 \left[ \frac{1}{20} + \frac{(1 - 1.196)^2}{0.681} \right]} \le \mu_{Y|x_0} \le 89.23 + 2.101 \sqrt{1.18 \left[ \frac{1}{20} + \frac{(1 - 1.196)^2}{0.681} \right]}$
$88.486 \le \mu_{Y|x_0} \le 89.974$
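The interval on the mean response can be checked numerically with a short Python sketch. The function name `mean_response_ci` is an assumption, and the critical value is taken from the t tables as in the example rather than computed.

```python
import math


def mean_response_ci(b0, b1, x0, n, x_bar, s_xx, sigma2_hat, t_crit):
    """Confidence interval on the mean response E(Y | x0)."""
    mu_hat = b0 + b1 * x0
    # Variance of the estimated mean response, with sigma^2 replaced by its estimate.
    var_mu_hat = sigma2_hat * (1 / n + (x0 - x_bar) ** 2 / s_xx)
    half_width = t_crit * math.sqrt(var_mu_hat)
    return mu_hat - half_width, mu_hat + half_width


ci_lo, ci_hi = mean_response_ci(b0=74.283, b1=14.947, x0=1.0, n=20,
                                x_bar=1.196, s_xx=0.681, sigma2_hat=1.18,
                                t_crit=2.101)
print(f"{ci_lo:.3f} <= mean response at x0 = 1 <= {ci_hi:.3f}")
```

The interval is narrowest at $x_0 = \bar{x}$ and widens as $x_0$ moves away from $\bar{x}$, because of the $(x_0 - \bar{x})^2$ term.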
Prediction of new observations
Regression models are used to predict new observations Y corresponding to a specific value of the regressor variable x.
If $x_0$ is the value of the regressor, then
$\hat{Y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$
is the point estimator of the new value of the response $Y_0$.
The error in the prediction is given by
$e_{\hat{p}} = Y_0 - \hat{Y}_0$
and its variance is
$V(e_{\hat{p}}) = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right]$
If the estimated variance is used, then the random variable
$\frac{Y_0 - \hat{Y}_0}{\sqrt{\hat{\sigma}^2 \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right]}}$
has a t distribution with n - 2 degrees of freedom.
A $100(1 - \alpha)\%$ prediction interval on the future observation $Y_0$ at the value $x_0$ is
$\hat{y}_0 - t_{\alpha/2, n-2} \sqrt{\hat{\sigma}^2 \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right]} \le Y_0 \le \hat{y}_0 + t_{\alpha/2, n-2} \sqrt{\hat{\sigma}^2 \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right]}$
Example
See Example 11-6 in the textbook.
It is given that
$n = 20$, $\bar{x} = 1.196$, $\hat{\beta}_0 = 74.283$, $\hat{\beta}_1 = 14.947$, $S_{xx} = 0.681$, $\hat{\sigma}^2 = 1.180$
We are asked to construct a 95% prediction interval on the response $Y_0$ for $x_0 = 1$.
The estimated response is
$\hat{y}_0 = 74.283 + 14.947(1) = 89.23$
From the t distribution tables, $t_{0.025,18} = 2.101$.
The prediction interval is
$89.23 - 2.101 \sqrt{1.18 \left[ 1 + \frac{1}{20} + \frac{(1 - 1.196)^2}{0.681} \right]} \le Y_0 \le 89.23 + 2.101 \sqrt{1.18 \left[ 1 + \frac{1}{20} + \frac{(1 - 1.196)^2}{0.681} \right]}$
$86.829 \le Y_0 \le 91.631$
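The only change from the interval on the mean response is the extra 1 inside the bracket, which accounts for the variability of the new observation itself. A minimal sketch, with `prediction_interval` as an assumed function name and the critical value again read from the t tables:

```python
import math


def prediction_interval(b0, b1, x0, n, x_bar, s_xx, sigma2_hat, t_crit):
    """Interval for a single future observation Y0 at x0."""
    y0_hat = b0 + b1 * x0
    # Variance of the prediction error, with sigma^2 replaced by its estimate.
    var_pred = sigma2_hat * (1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)
    half_width = t_crit * math.sqrt(var_pred)
    return y0_hat - half_width, y0_hat + half_width


pi_lo, pi_hi = prediction_interval(b0=74.283, b1=14.947, x0=1.0, n=20,
                                   x_bar=1.196, s_xx=0.681, sigma2_hat=1.18,
                                   t_crit=2.101)
print(f"{pi_lo:.3f} <= Y0 at x0 = 1 <= {pi_hi:.3f}")
```

For the same $x_0$ and level, this interval is always wider than the confidence interval on the mean response.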
Adequacy of the regression model
The assumptions of regression must be checked before the regression model can be used to provide meaningful information.
The assumptions of regression are:
Errors are uncorrelated
Errors have mean zero
Errors have constant variance
Errors are normally distributed
The order of the regression model must also be checked.
Residual analysis
Residuals are the differences between the actual observations and the fitted values:
$e_i = y_i - \hat{y}_i$
Residual analysis is used to check the assumption that the errors are approximately normal with constant variance and whether additional terms in the model will be useful.
To check the normality of the residuals, the normal probability plot is used.
The standardized residuals are used more than the actual residuals.
The standardized residual $d_i$ is computed as
$d_i = \frac{e_i}{\sqrt{\hat{\sigma}^2}}$
Approximately 95% of the standardized residuals should fall in the interval between -2 and +2 if the errors are normal.
Other plots are helpful, such as:
Residuals in time sequence
Residuals vs fitted values
Residuals vs the regressor x
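The standardized residuals and the rough "95% within ±2" check can be sketched as follows for the hydrocarbon level and purity data. The plots themselves are omitted; the function name `standardized_residuals` and the variable `share_within_2` are assumptions made for this illustration.

```python
import math

# Hydrocarbon level x (%) and purity y (%) for the 20 tabulated observations.
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]


def standardized_residuals(x, y):
    """Fit the line by least squares and return (residuals, standardized residuals)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sigma2_hat = sum(e * e for e in residuals) / (n - 2)
    d = [e / math.sqrt(sigma2_hat) for e in residuals]
    return residuals, d


residuals, d = standardized_residuals(x, y)
share_within_2 = sum(1 for di in d if abs(di) < 2) / len(d)
print(f"share of standardized residuals in (-2, +2): {share_within_2:.2f}")
```

A share well below 0.95, or a few very large $|d_i|$ values, would suggest outliers or a departure from normality worth inspecting in the plots listed above.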
Example
See Example 11-7 in the textbook.
Coefficient of determination
A widely used measure of regression model adequacy is the coefficient of determination, $R^2$.
The coefficient of determination is
$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}$
From the analysis of variance identity, the value of $R^2$ is between 0 and 1.
The value of $R^2$ tells about the amount of variability in the data explained or accounted for by the regression model.
For example, when $R^2 = 0.877$, it is said that the model accounts for 87.7% of the variability in the data.
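As a quick check of the two equivalent forms of $R^2$, the sketch below uses the sums of squares from the analysis of variance table for the hydrocarbon and purity data (152.13, 21.25 and 173.38). The function names are hypothetical.

```python
def r_squared_from_regression(ss_r, ss_t):
    """R^2 as the share of total variability explained by the model."""
    return ss_r / ss_t


def r_squared_from_error(ss_e, ss_t):
    """R^2 as one minus the unexplained share of total variability."""
    return 1 - ss_e / ss_t


r2_a = r_squared_from_regression(ss_r=152.13, ss_t=173.38)
r2_b = r_squared_from_error(ss_e=21.25, ss_t=173.38)
print(f"R^2 = {r2_a:.3f} (from SS_R), {r2_b:.3f} (from SS_E)")
```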
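Correlation
In some applications of regression, both X and Y are random variables.
Hence, it is assumed that the observations $(X_i, Y_i)$ are jointly distributed random variables, with a distribution function f(x, y).
It is assumed that f(x, y) is a bivariate normal function.
The random variable Y has mean $\mu_Y$ and variance $\sigma_Y^2$; X has mean $\mu_X$ and variance $\sigma_X^2$.
The correlation coefficient between Y and X is defined as
$\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$
where $\sigma_{XY}$ is the covariance between X and Y.
The conditional distribution of Y given X = x is normal:
$f_{Y|x}(y) = \frac{1}{\sqrt{2\pi} \, \sigma_{Y|x}} \exp\left[ -\frac{1}{2} \left( \frac{y - \beta_0 - \beta_1 x}{\sigma_{Y|x}} \right)^2 \right]$
where
$\beta_0 = \mu_Y - \mu_X \rho \frac{\sigma_Y}{\sigma_X}$
$\beta_1 = \frac{\sigma_Y}{\sigma_X} \rho$
The variance is given by
$\sigma_{Y|x}^2 = \sigma_Y^2 (1 - \rho^2)$
Thus, the conditional distribution of Y given X = x is normal with
$E(Y \mid x) = \beta_0 + \beta_1 x$
$V(Y \mid x) = \sigma_{Y|x}^2$
The estimator of the slope $\beta_1$ is
$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{S_{XY}}{S_{XX}}$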
The estimator of $\rho$ is the sample correlation coefficient:
$R = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} = \frac{S_{XY}}{\sqrt{S_{XX} \, SS_T}}$
It can be written that
$R^2 = \hat{\beta}_1^2 \frac{S_{XX}}{SS_T} = \frac{\hat{\beta}_1 S_{XY}}{SS_T} = \frac{SS_R}{SS_T}$
Hence, the coefficient of determination $R^2$ is simply the square of the sample correlation coefficient between Y and X.
Significance of correlation
The hypothesis on the significance of the correlation is stated as
$H_0: \rho = 0$
$H_1: \rho \neq 0$
The random variable
$T_0 = \frac{R \sqrt{n - 2}}{\sqrt{1 - R^2}}$
follows the t distribution with n - 2 degrees of freedom if $H_0$ is true.
Therefore, $H_0$ should be rejected if $|t_0| > t_{\alpha/2, n-2}$.
Example
See Example 11-8 in the textbook.
The industrial engineer is investigating the relationship between the pull strength of a wire bond and the wire length.

Observation   Pull strength y   Wire length x
1             9.95              2
2             24.45             8
3             31.75             11
4             35.00             10
5             25.02             8
6             16.86             4
7             14.38             2
8             9.60              2
9             24.35             9
10            27.50             8
11            17.08             4
12            37.00             11
13            41.95             12
14            11.66             2
15            21.65             4
16            17.89             4
17            69.00             20
18            10.30             1
19            34.93             10
20            46.59             15
21            44.88             15
22            54.12             16
23            56.63             17
24            22.13             6
25            21.15             5

From the data, the following are computed:
$S_{xx} = 698.56$, $S_{xy} = 2027.71$, $SS_T = 6105.94$
The sample correlation coefficient is
$r = \frac{S_{xy}}{\sqrt{S_{xx} \, SS_T}} = \frac{2027.71}{\sqrt{(698.56)(6105.94)}} = 0.9818$
The value of the test statistic for $H_0: \rho = 0$ is
$t_0 = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.9818 \sqrt{25 - 2}}{\sqrt{1 - 0.9818^2}} = 24.80$
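The sample correlation coefficient and its test statistic follow directly from the three summary quantities in the example. In the sketch below, `correlation_t_test` is a hypothetical name, and the critical value $t_{0.025,23} = 2.069$ is a standard t-table value for a two-sided test at $\alpha = 0.05$; it does not appear in the slides and is an assumption of this illustration.

```python
import math


def correlation_t_test(s_xx, s_xy, ss_t, n):
    """Return (r, t0): sample correlation and the t statistic for H0: rho = 0."""
    r = s_xy / math.sqrt(s_xx * ss_t)
    t0 = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return r, t0


# Summary quantities for the wire bond data (n = 25).
r, t0 = correlation_t_test(s_xx=698.56, s_xy=2027.71, ss_t=6105.94, n=25)

# Assumed table value t_{0.025,23}; not quoted in the slides.
t_crit = 2.069
reject_h0 = abs(t0) > t_crit

print(f"r = {r:.4f}, t0 = {t0:.2f}, reject H0: {reject_h0}")
```

Using the unrounded r matters here: rounding r to 0.98 before computing $t_0$ gives a noticeably smaller statistic, because $1 - r^2$ is close to zero.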