Linear Regression Analysis: Module - II
MODULE – II
Lecture - 5
Consider the linear regression model in centered form

$$y_i = \beta_0^* + \beta_1 (x_i - \bar{x}) + \varepsilon_i,$$

where $\beta_0^* = \beta_0 + \beta_1 \bar{x}$. The least squares estimators of $\beta_0^*$ and $\beta_1$ are $b_0^* = \bar{y}$ and $b_1 = \dfrac{s_{xy}}{s_{xx}}$, respectively.
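For reference, these estimators follow directly from minimizing the residual sum of squares in the centered model; a brief sketch:

$$\frac{\partial}{\partial \beta_0^*}\sum_{i=1}^{n}\left[y_i - \beta_0^* - \beta_1(x_i-\bar{x})\right]^2 = 0 \;\Rightarrow\; b_0^* = \bar{y},$$

since $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$, and

$$\frac{\partial}{\partial \beta_1}\sum_{i=1}^{n}\left[y_i - \beta_0^* - \beta_1(x_i-\bar{x})\right]^2 = 0 \;\Rightarrow\; b_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} = \frac{s_{xy}}{s_{xx}}.$$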
Using the results

$$E(b_0^*) = \beta_0^*, \qquad E(b_1) = \beta_1, \qquad \operatorname{Var}(b_0^*) = \frac{\sigma^2}{n}, \qquad \operatorname{Var}(b_1) = \frac{\sigma^2}{s_{xx}},$$

we have, when $\sigma^2$ is known,

$$\frac{b_0^* - \beta_0^*}{\sqrt{\sigma^2/n}} \sim N(0,1)$$

and

$$\frac{b_1 - \beta_1}{\sqrt{\sigma^2/s_{xx}}} \sim N(0,1).$$
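For illustration, when $\sigma^2$ is known these pivotal quantities yield the standard $100(1-\alpha)\%$ confidence interval for $\beta_1$,

$$b_1 - z_{\alpha/2}\sqrt{\frac{\sigma^2}{s_{xx}}} \;\le\; \beta_1 \;\le\; b_1 + z_{\alpha/2}\sqrt{\frac{\sigma^2}{s_{xx}}},$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ point of $N(0,1)$; the interval for $\beta_0^*$ follows analogously using $\operatorname{Var}(b_0^*) = \sigma^2/n$.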
Moreover, both statistics are independently distributed. Thus

$$\left(\frac{b_0^* - \beta_0^*}{\sqrt{\sigma^2/n}}\right)^2 \sim \chi_1^2$$

and

$$\left(\frac{b_1 - \beta_1}{\sqrt{\sigma^2/s_{xx}}}\right)^2 \sim \chi_1^2$$

are also independently distributed because $b_0^*$ and $b_1$ are independently distributed. Consequently, the sum of these two statistics is distributed as

$$\frac{n(b_0^* - \beta_0^*)^2}{\sigma^2} + \frac{s_{xx}(b_1 - \beta_1)^2}{\sigma^2} \sim \chi_2^2.$$
Since

$$\frac{SS_{res}}{\sigma^2} \sim \chi_{n-2}^2$$

independently of $b_0^*$ and $b_1$, the ratio of these two $\chi^2$ statistics, each divided by its degrees of freedom, follows an $F$ distribution. Substituting $b_0^* = b_0 + b_1 \bar{x}$ and $\beta_0^* = \beta_0 + \beta_1 \bar{x}$, we get

$$\frac{n-2}{2}\,\frac{Q_f}{SS_{res}} \sim F_{2,\,n-2},$$
where

$$Q_f = n(b_0 - \beta_0)^2 + 2\sum_{i=1}^{n} x_i\,(b_0 - \beta_0)(b_1 - \beta_1) + \sum_{i=1}^{n} x_i^2\,(b_1 - \beta_1)^2.$$
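To see where $Q_f$ comes from, note that the $\chi_2^2$ statistic above equals $\left[n(b_0^* - \beta_0^*)^2 + s_{xx}(b_1 - \beta_1)^2\right]/\sigma^2$; substituting $b_0^* - \beta_0^* = (b_0 - \beta_0) + \bar{x}(b_1 - \beta_1)$ and using $n\bar{x} = \sum_{i} x_i$ and $n\bar{x}^2 + s_{xx} = \sum_{i} x_i^2$ gives

$$n\left[(b_0 - \beta_0) + \bar{x}(b_1 - \beta_1)\right]^2 + s_{xx}(b_1 - \beta_1)^2 = n(b_0 - \beta_0)^2 + 2\sum_{i=1}^{n} x_i\,(b_0 - \beta_0)(b_1 - \beta_1) + \sum_{i=1}^{n} x_i^2\,(b_1 - \beta_1)^2 = Q_f.$$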
Since

$$P\left[\frac{n-2}{2}\,\frac{Q_f}{SS_{res}} \le F_{2,\,n-2;\,\alpha}\right] = 1 - \alpha$$

holds true for all values of $\beta_0$ and $\beta_1$, the $100(1-\alpha)\%$ confidence region for $\beta_0$ and $\beta_1$ is

$$\frac{n-2}{2}\,\frac{Q_f}{SS_{res}} \le F_{2,\,n-2;\,\alpha}.$$
This confidence region is an ellipse which contains $\beta_0$ and $\beta_1$ simultaneously with probability $100(1-\alpha)\%$.
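As a minimal numerical sketch of how this region can be used in practice (the data below are hypothetical, and NumPy/SciPy are assumed to be available), the following checks whether a candidate pair $(\beta_0, \beta_1)$ falls inside the joint ellipse:

```python
import numpy as np
from scipy import stats

# Hypothetical data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
n = len(x)

# Least squares estimates b0 and b1.
sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = sxy / sxx
b0 = y.mean() - b1 * x.mean()

# Residual sum of squares SS_res.
ss_res = np.sum((y - (b0 + b1 * x)) ** 2)

def in_joint_region(beta0, beta1, alpha=0.05):
    """True if (beta0, beta1) lies inside the 100(1 - alpha)% joint ellipse."""
    qf = (n * (b0 - beta0) ** 2
          + 2.0 * np.sum(x) * (b0 - beta0) * (b1 - beta1)
          + np.sum(x ** 2) * (b1 - beta1) ** 2)
    return (n - 2) / 2.0 * qf / ss_res <= stats.f.ppf(1 - alpha, 2, n - 2)

print(in_joint_region(0.0, 2.0))
```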
Analysis of variance
The technique of analysis of variance is usually used for testing hypotheses about the equality of more than one parameter, such as population means or slope parameters. It is more meaningful in the case of the multiple regression model, where there is more than one slope parameter. The technique is discussed and illustrated here to introduce the basic concepts and fundamentals that will be used in the next module to develop the analysis of variance for the multiple linear regression model, where there are more than two explanatory variables.
A test statistic for testing H 0 : β1 = 0 can also be formulated using the analysis of variance technique as follows.
Consider

$$\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\left[(y_i - \bar{y}) - (\hat{y}_i - \bar{y})\right]^2 = \sum_{i=1}^{n}(y_i - \bar{y})^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 - 2\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{y}).$$
Further, consider

$$\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{y}) = \sum_{i=1}^{n}(y_i - \bar{y})\, b_1 (x_i - \bar{x}) = b_1^2 \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2,$$

using $\hat{y}_i - \bar{y} = b_1(x_i - \bar{x})$ and $\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x}) = s_{xy} = b_1 s_{xx}$.
Thus we have

$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2.$$
The term $\sum_{i=1}^{n}(y_i - \bar{y})^2$ is called the sum of squares about the mean, or the corrected sum of squares of $y$ (i.e., $SS_{corrected}$), or the total sum of squares, denoted by $s_{yy}$.
The term $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ describes the deviation of the observations from the predicted values, viz., the residual sum of squares

$$SS_{res} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2,$$
n
whereas the term ∑ ( yˆ − y )
i =1
i
2
describes the proportion of variability explained by regression,
n
=
SS reg ∑ ( yˆ − y ) .
i =1
i
2
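A quick numerical check of this decomposition (hypothetical data; NumPy assumed):

```python
import numpy as np

# Hypothetical data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.8, 4.1, 5.9, 8.2, 9.8])

# Fitted values from the least squares line.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

s_yy = np.sum((y - y.mean()) ** 2)        # total (corrected) sum of squares
ss_res = np.sum((y - y_hat) ** 2)         # residual sum of squares
ss_reg = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares

# s_yy = SS_res + SS_reg should hold up to floating point rounding.
print(np.isclose(s_yy, ss_res + ss_reg))
```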
If all observations $y_i$ lie on a straight line, then $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = 0$, and thus $SS_{corrected} = SS_{reg}$.
Note that $SS_{reg}$ is completely determined by $b_1$ and so has only one degree of freedom. The total sum of squares $s_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2$ has $(n-1)$ degrees of freedom due to the constraint $\sum_{i=1}^{n}(y_i - \bar{y}) = 0$, and $SS_{res}$ has $(n-2)$ degrees of freedom.
If the errors are normally distributed, then under $H_0: \beta_1 = 0$ the sums of squares $SS_{reg}$ and $SS_{res}$, each divided by $\sigma^2$, are independently distributed as $\chi^2_{df}$ with their respective degrees of freedom.
The mean square due to regression is

$$MS_{reg} = \frac{SS_{reg}}{1}$$

and the mean square due to residuals is

$$MSE = \frac{SS_{res}}{n-2}.$$
The test statistic for testing $H_0: \beta_1 = 0$ is

$$F_0 = \frac{MS_{reg}}{MSE}.$$
If $H_0: \beta_1 = 0$ is true, then $MS_{reg}$ and $MSE$ are independently distributed, and thus $F_0 \sim F_{1,\,n-2}$. The decision rule is to reject $H_0$ at level $\alpha$ whenever $F_0 > F_{1,\,n-2;\,\alpha}$.
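A minimal sketch of the resulting ANOVA $F$ test (hypothetical data; NumPy and SciPy assumed):

```python
import numpy as np
from scipy import stats

# Hypothetical data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.3, 3.1, 4.9, 6.2, 6.8, 8.9, 9.6])
n = len(x)

# Least squares fit and fitted values.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ms_reg = np.sum((y_hat - y.mean()) ** 2) / 1.0  # MS_reg, 1 degree of freedom
mse = np.sum((y - y_hat) ** 2) / (n - 2)        # MSE, n - 2 degrees of freedom

f0 = ms_reg / mse
p_value = stats.f.sf(f0, 1, n - 2)  # P(F_{1, n-2} > f0)
print(f0, p_value)  # reject H0: beta1 = 0 when p_value < alpha
```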
Some other forms of $SS_{reg}$, $SS_{res}$ and $s_{yy}$ can be derived as follows. The sample correlation coefficient may be written as

$$r_{xy} = \frac{s_{xy}}{\sqrt{s_{xx}\, s_{yy}}}.$$

Moreover, we have

$$b_1 = \frac{s_{xy}}{s_{xx}} = r_{xy} \sqrt{\frac{s_{yy}}{s_{xx}}}.$$
Then

$$SS_{res} = \sum_{i=1}^{n}\left[(y_i - \bar{y}) - b_1 (x_i - \bar{x})\right]^2 = s_{yy} - b_1^2\, s_{xx} = s_{yy} - \frac{(s_{xy})^2}{s_{xx}}.$$
Also, $SS_{corrected} = s_{yy}$, and

$$SS_{reg} = s_{yy} - SS_{res} = \frac{(s_{xy})^2}{s_{xx}} = b_1^2\, s_{xx} = b_1 s_{xy}.$$
It can be noted that a fitted model can be said to be good when the residuals are small. Since $SS_{res}$ is based on the residuals, a measure of the quality of the fitted model can be based on $SS_{res}$. When an intercept term is present in the model, a measure of goodness of fit of the model is given by

$$R^2 = 1 - \frac{SS_{res}}{s_{yy}} = \frac{SS_{reg}}{s_{yy}}.$$
This is known as the coefficient of determination. The measure is based on how much of the variation in the $y$'s, as measured by $s_{yy}$, is explained by $SS_{reg}$, and how much of the unexplained part is contained in $SS_{res}$. The ratio $SS_{reg}/s_{yy}$ describes the proportion of variability explained by the regression relative to the total variability of $y$, while the ratio $SS_{res}/s_{yy}$ describes the proportion of variability not explained by the regression.
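It also follows from the identities above that, in the simple linear regression model, the coefficient of determination is the square of the sample correlation coefficient: combining $SS_{reg} = (s_{xy})^2/s_{xx}$ with $r_{xy} = s_{xy}/\sqrt{s_{xx}\,s_{yy}}$ gives

$$R^2 = \frac{SS_{reg}}{s_{yy}} = \frac{(s_{xy})^2}{s_{xx}\, s_{yy}} = r_{xy}^2.$$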