
CHAPTER 1

SIMPLE LINEAR REGRESSION


Regression analysis is often used to determine the relationship between two or
more quantitative (numerical) variables. For example, a businessperson may
want to know whether the volume of sales for a given month is related to the amount
the firm spent on advertising. Educators are interested in determining whether the
number of hours a student studies is related to the student's score on an exam.
Medical researchers are interested in questions such as "Is caffeine related to heart
damage?" or "Is there a relationship between age and blood pressure?"
These are only a few of the many questions that can be answered by using the
techniques of correlation and regression analysis. Correlation is a statistical method
used to determine whether a linear relationship exists between two variables.
Regression is a statistical method used to describe the nature of the relationship
between the variables – that is, whether it is positive or negative, linear or non-linear.

1.1 SCATTER DIAGRAM


Researchers often collect data on two numerical or quantitative variables to see
whether a relationship exists between them. The variable that can be
controlled or manipulated is called the independent variable, while the variable that
cannot be controlled or manipulated is called the dependent variable.
The independent and dependent variables can be plotted on a graph called a scatter
plot (or scatter diagram): a plot of all the ordered pairs of bivariate data on a coordinate
axis system. The input variable x is plotted on the horizontal axis, and the output
variable y is plotted on the vertical axis. The scatter plot is a visual way to describe
the nature of the relationship between the independent and dependent variables.

Example:
In a study involving children’s fear related to being hospitalized, the age and the
score each child made on the Child Medical Fear Scale (CMFS) are given in the
table below. Construct a scatter diagram for this data.

Age (x)    8   9   9  10  11   9   8   9   8  11   7   6   6
CMFS (y)  31  25  40  27  35  29  25  34  44  19  28  47  42

Age (x)    8   9   8  12  15  13  10  10  11   8   9   8
CMFS (y)   7  35  16  12  23  26  26  20  40  30  38  29

After the plot is drawn, it can be used to determine which type of relationship, if any,
exists. The scatter diagram reveals a more or less strong tendency rather than a
precise linear relationship; a fitted line would represent the nature of the relationship
on average. It is clear from the plot that there is a negative relationship between the
two variables: as age increases, the fear level tends to decrease.
[Scatter Diagram for Children's Fear: CMFS score (y-axis, roughly 5 to 50) plotted against age (x-axis, 5 to 15), showing a downward trend.]

From a scatter diagram, we can see whether a linear correlation appears to exist
between variable x and variable y. To know how strong the correlation is, we need
a method to measure it.
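For illustration, a scatter diagram like the one above can be drawn with a few lines of code. The following is a minimal sketch in Python, assuming the matplotlib library is available; the variable names ages and fear_scores are ours, not part of the original study.

import matplotlib.pyplot as plt

# Age and Child Medical Fear Scale (CMFS) scores from the example above
ages = [8, 9, 9, 10, 11, 9, 8, 9, 8, 11, 7, 6, 6,
        8, 9, 8, 12, 15, 13, 10, 10, 11, 8, 9, 8]
fear_scores = [31, 25, 40, 27, 35, 29, 25, 34, 44, 19, 28, 47, 42,
               7, 35, 16, 12, 23, 26, 26, 20, 40, 30, 38, 29]

plt.scatter(ages, fear_scores)   # one point per (age, CMFS) pair
plt.xlabel("Age")
plt.ylabel("CMFS")
plt.title("Scatter Diagram for Children's Fear")
plt.show()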

1.2 COEFFICIENT OF LINEAR CORRELATION


Linear correlation measures the strength of a linear relationship between two
variables. A simple relationship can be positive or negative. A positive relationship
exists when both variables increase or decrease at the same time. For instance, a
person's height and weight are positively related: the taller the person is, generally,
the more the person weighs.
In a negative relationship, as one variable increases, the other variable decreases,
and vice versa. For example, if one measures the strength of people over 60 years of
age, one will find that as age increases, strength generally decreases. However, when
one variable increases and there is no definite or obvious shift in the other variable,
there is no correlation.
Notes:
1. Perfect positive correlation: all the points lie along a line with a positive slope.
2. Perfect negative correlation: all the points lie along a line with a negative slope.
3. The points lie along a horizontal or vertical line: no correlation.
4. The points exhibit some other non-linear pattern: no linear relationship, no
correlation.

The coefficient of linear correlation measures the strength of the linear relationship
between two variables. The population correlation coefficient is denoted by ρ, whereas
the sample correlation coefficient is denoted by r. The range of the correlation
coefficient is from −1 to +1.
If there is a strong positive linear relationship between the variables, the value of r
will be close to +1. If there is a strong negative linear relationship between the
variables, the value of r will be close to −1. Typically, if |r| ≥ 0.7, the
correlation is said to be strong. When there is no linear relationship between the
variables, or only a weak relationship, the value of r will be close to 0. Typically, if
0.1 ≤ |r| ≤ 0.3, the correlation is said to be weak.

Notation for correlation:

Population: ρ, with −1 ≤ ρ ≤ +1        Sample: r, with −1 ≤ r ≤ +1

There are several ways to compute the value of the correlation coefficient. The
Pearson product moment correlation coefficient is calculated as follows (here s_x and
s_y are the sample standard deviations of x and y):

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n-1)\, s_x s_y} = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[\, n\left(\sum x^2\right) - \left(\sum x\right)^2 \right]\left[\, n\left(\sum y^2\right) - \left(\sum y\right)^2 \right]}}$$

where

$$SS_{xx} = \text{"sum of squares for } x\text{"} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$$

$$SS_{yy} = \text{"sum of squares for } y\text{"} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}$$

$$SS_{xy} = \text{"sum of cross products of } x \text{ and } y\text{"} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}$$

Example:
Below are the data for the weight (in thousands of pounds) x and the gasoline mileage
(miles per gallon) y for ten different automobiles. Find the linear correlation
coefficient between the two variables.
x     y    x²      y²     xy
2.5   40   6.25    1600   100.0
3.0   43   9.00    1849   129.0
4.0   30   16.00   900    120.0
3.5   35   12.25   1225   122.5
2.7   42   7.29    1764   113.4
4.5   19   20.25   361    85.5
3.8   32   14.44   1024   121.6
2.9   39   8.41    1521   113.1
5.0   15   25.00   225    75.0
2.2   14   4.84    196    30.8
Σx = 34.1   Σy = 309   Σx² = 123.73   Σy² = 10665   Σxy = 1010.9

Calculation for r:

$$SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 123.73 - \frac{(34.1)^2}{10} = 7.449$$

$$SS_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 10665 - \frac{(309)^2}{10} = 1116.9$$

$$SS_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} = 1010.9 - \frac{(34.1)(309)}{10} = -42.79$$

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = \frac{-42.79}{\sqrt{(7.449)(1116.9)}} = -0.47$$
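The same computation is easy to script. Below is a minimal sketch in Python using only the standard library; the variable names are ours.

import math

# Weight (thousands of pounds) and gasoline mileage (mpg) from the example above
x = [2.5, 3.0, 4.0, 3.5, 2.7, 4.5, 3.8, 2.9, 5.0, 2.2]
y = [40, 43, 30, 35, 42, 19, 32, 39, 15, 14]
n = len(x)

ss_xx = sum(xi**2 for xi in x) - sum(x)**2 / n
ss_yy = sum(yi**2 for yi in y) - sum(y)**2 / n
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(round(r, 2))   # -0.47, matching the hand calculation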
Since the value of r is computed from sample data, there are two possibilities
when r is not equal to zero: either the value of r is large enough to
conclude that there is a significant linear relationship between the variables, or the
value of r is due to chance and should therefore be treated as zero. To make this
decision, researchers may use the hypothesis-testing procedure.
In hypothesis testing, one of the following is true:

H0: ρ = 0    There is no correlation between x and y in the population
H1: ρ ≠ 0    There is significant correlation between x and y in the population

If both variables are normally distributed, the value of the test statistic t is calculated
as below:

$$t = r\sqrt{\frac{n-2}{1-r^2}} \sim t_{n-2}$$

Although the hypothesis above is two-tailed, some problems involve the claim that the
correlation is either positive or negative, and in such cases the hypothesis test is one-tailed.
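For instance, the coefficient r = −0.47 computed above can be tested for significance as follows; this sketch assumes the SciPy library is available for the critical value.

import math
from scipy import stats

r, n = -0.47, 10

# Test statistic for H0: rho = 0
t = r * math.sqrt((n - 2) / (1 - r**2))    # about -1.51

# Two-tailed critical value at alpha = 0.05 with n - 2 = 8 degrees of freedom
t_crit = stats.t.ppf(1 - 0.025, n - 2)     # about 2.306

print(abs(t) > t_crit)   # False: fail to reject H0 at the 5% level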

1.3 SIMPLE LINEAR REGRESSION MODEL


Regression analysis deals with finding the best relationship between the
dependent variable y and the independent variable x, quantifying the strength of that
relationship, and using methods that allow prediction of the response values
(y) given values of the predictor x. The value of the y variable can only be determined if
the values of the independent variables (denoted by x1, x2, ..., xk, where k is the number of
independent variables) are known.

Regression analysis is used to determine the mathematical relationship between
these variables through a linear equation termed the regression model. A simple
linear regression model involves only one independent variable, that is, k = 1. Meanwhile,
multiple linear regression is employed for cases involving more than one
independent variable (k > 1).
The simple linear regression model assumes an exact linear relationship between the
expected or average value of y, the dependent variable, and x, the independent or
predictor variable:

$$E[y_i] = \beta_0 + \beta_1 x_i$$

Actual observed values of y differ from the expected value by an unexplained or
random error ε_i, that is:

$$y_i - E(y_i) = \varepsilon_i$$
$$y_i = E(y_i) + \varepsilon_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

Thus, the population simple linear regression model is given by:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where the intercept β0 and the slope β1 are unknown constants and εi is a random
error component. The errors are assumed to have mean zero and unknown variance σ².
Additionally, we usually assume that the errors are uncorrelated, meaning that the
value of one error does not depend on the value of any other error.

[Figure 2.1 Picturing the Simple Linear Regression Model: observed values scattered about the regression line, with the error ε shown as the vertical distance between an observed value and the corresponding predicted value on the line.]

The parameters β0 and β1 are usually called regression coefficients. These
coefficients have a simple and often useful interpretation. The slope β1 is the change
in the mean of the distribution of y produced by a unit change in x. If the range of the
data on x includes x = 0, then the intercept β0 is the mean of the distribution of the
response y when x = 0. If the range of x does not include zero, then β0 has no
practical interpretation.
It is convenient to view the explanatory variable x as controlled by the data analyst and
measured with negligible error, while the response y is a random variable. That is,
there is a probability distribution for y at each possible value of x. The mean of this
distribution is

$$E(y \mid x) = \mu_{y|x} = E(\beta_0 + \beta_1 x + \varepsilon) = \beta_0 + \beta_1 x$$

and the variance of y given any value of x is

$$\operatorname{Var}(y \mid x) = \operatorname{Var}(\beta_0 + \beta_1 x + \varepsilon) = \operatorname{Var}(\varepsilon) = \sigma^2$$

Thus, the mean of y is a linear function of x, although the variance of y does not
depend on the value of x. Furthermore, because the errors are uncorrelated, the
responses are also uncorrelated.

Assumptions of the Simple Regression Model

1. The relationship between x and y is a straight-line relationship.
2. The values of the independent variable x are assumed fixed (not random); the
randomness in the values of y comes only from the error term εi.
3. The errors εi are normally distributed with mean 0 and variance σ², and are
uncorrelated in successive observations. That is:
(i) E(εi) = 0
(ii) Var(εi) = E(εi²) = σ²  (homoscedasticity assumption)
(iii) Cov(εi, εj) = E(εi εj) = 0 for i ≠ j  (no serial correlation assumption)

1.3.1 VISUALIZING THE FITTED REGRESSION LINE


Given a sample of n bivariate data points (x1, y1), (x2, y2), ..., (xn, yn), we want to
estimate the true relationship between y and x in the population. Graphically, we
would like to draw a straight line that seems to fit the data as well as possible.
From the diagram below, which line do you think gives the best fit to the data?

[Diagram: two candidate lines, A and B, drawn through the same scatter of data points.]
Estimation of a simple linear regression relationship involves finding estimated or
predicted values of the intercept and slope of the linear regression line. The method
of least squares produces a line that is certain to appear satisfactory, since it goes
through the centre of the data, (x̄, ȳ), and does so at a sensible angle. Although
the line passes through the data, very few of the data points actually lie on the line.

The estimated linear regression line is:

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

where ŷi is the value of y lying on the fitted regression line for a given value of x,
that is, the value of y as estimated by the regression model. The difference between
the observed yi and the predicted ŷi is called the residual, denoted by εi, that is:

$$\varepsilon_i = y_i - \hat{y}_i$$

If the values of β̂0 and β̂1 are well chosen, then all of the residuals will be small.
Some of the residuals will be negative and some will be positive, so that Σεi can be
small even for a poorly fitting line. It is mathematically convenient to work with
Σεi² instead. A line that fits the data well will result in a minimum value of Σεi².

1.3.2 ESTIMATION: THE METHOD OF LEAST SQUARES


The method of least squares estimates β0 and β1 so that the sum of the squares of the
differences between the observations yi and the straight line is a minimum. That is,
it finds the constants β0 and β1 such that

$$S(\beta_0, \beta_1) = \text{Sum of Squared Residuals} = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2$$

is as small as possible.
Differentiating with respect to β0 and β1 to minimize SSRes,

$$\frac{\partial S}{\partial \beta_0} = -2\sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right) = 0 \quad \text{and} \quad \frac{\partial S}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i \left( y_i - \beta_0 - \beta_1 x_i \right) = 0$$

which can be simplified into the so-called "normal equations":

$$\sum_{i=1}^{n} y_i = n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i \quad \text{and} \quad \sum_{i=1}^{n} x_i y_i = \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2$$

Solving the normal equations simultaneously for β0 and β1, we obtain the regression
line estimators:

$$\hat{\beta}_1 = \frac{\displaystyle\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\displaystyle\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} = \frac{SS_{xy}}{SS_{xx}} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

Note: The least squares line is the line that actually minimizes Σεi² with respect to
β0 and β1, and it is sometimes called the line of best fit.

Line of Best Fit Equation: The equation is determined by β̂0 and β̂1, where β̂0 and
β̂1 are the values that satisfy the least squares criterion.

After obtaining the least-squares fit, a number of interesting questions come to mind:
1. How well does this equation fit the data?
2. Is the model likely to be useful as a predictor?
3. Are any of the basic assumptions (such as constant variance and uncorrelated
errors) violated, and if so, how serious is this?
All of these issues must be investigated before the model is finally adopted for use.
As noted previously, the residuals play a key role in evaluating model adequacy.
Residuals can be viewed as realizations of the model errors εi. Thus, to check the
constant variance and uncorrelated errors assumptions, the residuals must look like
a random sample from a distribution with these properties. These issues are discussed
further in Chapter 3, where the use of residuals in model adequacy checking is explored.

Example:
A recent article measured the job satisfaction of subjects with a 14-question survey.
The data below represent the job satisfaction scores, y, and the salaries, x, for a
sample of similar individuals. (a) Find the equation of the line of best fit and show
the diagram.

Data:
x   3   7   6   6  10  12  12  12  13  13  14  15
y  33  38  24  61  52  45  29  65  82  63  50  79

Solution:
x     y    x²     y²     xy
3     33   9      1089   99
7     38   49     1444   266
6     24   36     576    144
6     61   36     3721   366
10    52   100    2704   520
12    45   144    2025   540
12    29   144    841    348
12    65   144    4225   780
13    82   169    6724   1066
13    63   169    3969   819
14    50   196    2500   700
15    79   225    6241   1185
Σx = 123   Σy = 621   Σx² = 1421   Σy² = 36059   Σxy = 6833

$$SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 1421 - \frac{(123)^2}{12} = 160.25$$

$$SS_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} = 6833 - \frac{(123)(621)}{12} = 467.75$$

$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{467.75}{160.25} = 2.919$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{\sum y - \hat{\beta}_1 \sum x}{n} = \frac{621 - 2.919(123)}{12} = 21.83$$

Thus, the equation of the line of best fit is: ŷi = 21.83 + 2.919xi

[Diagram: the fitted line ŷ = 21.83 + 2.919x, crossing the y-axis at 21.83 and the x-axis at approximately −7.48.]
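The fit can be reproduced with a short script. The following is a minimal sketch in Python using only the standard library; the helper name fit_line is ours.

def fit_line(x, y):
    # Return (b0, b1) for the least-squares line y-hat = b0 + b1*x
    n = len(x)
    ss_xx = sum(xi**2 for xi in x) - sum(x)**2 / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    b1 = ss_xy / ss_xx
    b0 = (sum(y) - b1 * sum(x)) / n   # equivalently y-bar minus b1 times x-bar
    return b0, b1

# Job satisfaction example: salaries x and satisfaction scores y
x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]
b0, b1 = fit_line(x, y)
print(round(b0, 2), round(b1, 3))   # 21.83 2.919, as computed by hand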
1.3.3 ESTIMATION OF ERROR VARIANCE
In addition to estimating β0 and β1, an estimate of the error variance σ² is needed
for hypothesis testing on the parameters and for constructing interval estimates related
to the model. In a regression model, for every x value there is a different value of
y. Consequently, the random error ε takes on a different value for each x. The
measure of spread of these different values of ε is given by its standard deviation,
σε. The standard deviation of errors tells us how widely the errors, and hence the
y values, are spread around the regression line.
Recall the assumptions for the random error ε:
1. The errors for each observation are independent of each other.
2. The errors for each x have mean equal to zero, that is, E(ε) = 0 for each x.
3. For any given x, the errors are normally distributed.
4. For each x, the population errors have the same standard deviation, σε.

Note that σε denotes the standard deviation of the errors in the population and is
usually unknown. In such cases, it is estimated from the sample, starting with the
error (residual) sum of squares:

$$SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 = \sum_{i=1}^{n} \left[ y_i - \left( \hat{\beta}_0 + \hat{\beta}_1 x_i \right) \right]^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 - \hat{\beta}_1 SS_{xy} = SST - \hat{\beta}_1 SS_{xy}$$

The residual sum of squares has n − 2 degrees of freedom, because two degrees of
freedom are associated with the estimates β̂0 and β̂1 involved in obtaining ŷi. It can
be shown that the expected value of SSE is E(SSE) = (n − 2)σ², so an unbiased
estimator of σ² is given by:

$$\hat{\sigma}^2 = \frac{SSE}{n-2} = MSE$$

The quantity MSE is called the residual mean square. The square root of σ̂² is
sometimes called the standard error of regression, and it has the same units as the
response variable y.

Because σ̂² depends on the residual sum of squares, any violation of the
assumptions on the model errors or any misspecification of the model form may
seriously damage the usefulness of σ̂² as an estimate of σ². Because σ̂² is computed
from the regression model residuals, we say that it is a model-dependent estimate
of σ².
Example 1:
The weights and heights of a sample of 8 women are shown below. (a) Estimate the
regression line of weights and heights. (b) Compute the error variance.

Heights, x (inches) 65 65 62 67 69 65 61 67
Weight, y (pounds) 105 125 110 120 140 135 95 130

Solutions:
x    y     x²     xy     y²
65   105   4225   6825   11025
65   125   4225   8125   15625
62   110   3844   6820   12100
67   120   4489   8040   14400
69   140   4761   9660   19600
65   135   4225   8775   18225
61   95    3721   5795   9025
67   130   4489   8710   16900
Σx = 521   Σy = 960   Σx² = 33979   Σxy = 62750   Σy² = 116900

$$SS_{xx} = 33979 - \frac{(521)^2}{8} = 48.875 \qquad SS_{xy} = 62750 - \frac{(521)(960)}{8} = 230.0$$

$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{230.0}{48.875} = 4.706$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{\sum y - \hat{\beta}_1 \sum x}{8} = \frac{960 - (4.706)(521)}{8} = -186.478$$

The regression line of weight on height is: ŷ = −186.478 + 4.706x

From the calculation above, SSxy = 230.0 and β̂1 = 4.706. Now we need to
compute SSyy = SST:

$$SST = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 116900 - \frac{(960)^2}{8} = 1700$$

Hence, the estimated error variance for the sample is:

$$\hat{\sigma}^2 = \frac{SST - \hat{\beta}_1 SS_{xy}}{n-2} = \frac{1700 - (4.706)(230.0)}{8-2} = 102.94$$
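As a check, the whole example can be computed in a few lines. This is a sketch in Python using only the standard library; the variable names are ours.

# Heights (inches) and weights (pounds) of the 8 women in Example 1
x = [65, 65, 62, 67, 69, 65, 61, 67]
y = [105, 125, 110, 120, 140, 135, 95, 130]
n = len(x)

ss_xx = sum(xi**2 for xi in x) - sum(x)**2 / n                      # 48.875
ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n  # 230.0
sst = sum(yi**2 for yi in y) - sum(y)**2 / n                        # 1700.0

b1 = ss_xy / ss_xx               # 4.706
sse = sst - b1 * ss_xy
mse = sse / (n - 2)              # the estimated error variance
print(round(b1, 3), round(mse, 2))   # 4.706 102.94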

Warning/Caution
Calculations involving SSyy, SSxy and β̂1 must be made with at least six decimal
places (sometimes ten!) to avoid substantial error in the calculation of SSE.
Example 2: The Rocket Propellant Data
A rocket motor is manufactured by bonding an igniter propellant and a sustainer
propellant together inside a metal housing. The shear strength of the bond between
the two types of propellant is an important quality characteristic. It is suspected that
shear strength is related to the age in weeks of the batch of sustainer propellant.
Twenty observations on shear strength and the age of the corresponding batch of
propellant have been collected and are shown in Table 2.1. The scatter diagram,
shown below, suggests that there is a strong statistical relationship between shear
strength (in psi) and propellant age (in weeks), and the tentative assumption of the
straight-line model y = β0 + β1x + ε appears to be reasonable.

Table 2.1 Data for Example 2

Obs, i   Strength (psi), yi   Age (weeks), xi
1        2158.70              15.50
2        1678.15              23.75
3        2316.00              8.00
4        2061.30              17.00
5        2207.50              5.50
6        1708.30              19.00
7        1784.70              24.00
8        2575.00              2.50
9        2357.90              7.50
10       2256.70              11.00
11       2165.20              13.00
12       2399.55              3.75
13       1779.80              25.00
14       2336.75              9.75
15       1765.30              22.00
16       2053.50              18.00
17       2414.40              6.00
18       2200.50              12.50
19       2654.20              2.00
20       1753.70              21.50

[Scatter diagram of shear strength (psi, y-axis, 1600 to 2700) versus age of propellant (weeks, x-axis, 0 to 26), showing a strong downward linear trend.]

To estimate the model parameters, first calculate

$$SS_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 4677.6875 - \frac{71422.5625}{20} = 1106.56$$

$$SS_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 528492.6375 - \frac{(267.25)(42627.15)}{20} = -41112.65$$

Therefore, the estimated coefficients are:

$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{-41112.6544}{1106.5594} = -37.15$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 2131.3575 - (-37.1536)(13.3625) = 2627.82$$

The least-squares fit (regression model) is given as: ŷ = 2627.82 − 37.15x

The slope -37.15 is interpreted as the average weekly decrease in propellant shear
strength due to the age of the propellant. Since the lower limit of the x’s is near the
origin, the intercept 2627.82 represents the shear strength in a batch of propellant
immediately following manufacture.
Table: Data, Fitted Values, and Residuals

Observed Value, yi   Fitted Value, ŷi   Residual, εi
2158.70              2051.94            106.76
1678.15              1745.42            -67.27
2316.00              2330.59            -14.59
2061.30              1996.21            65.09
2207.50              2423.48            -215.98
1708.30              1921.90            -213.60
1784.70              1736.14            48.56
2575.00              2534.94            40.06
2357.90              2349.17            8.73
2256.70              2219.13            37.57
2165.20              2144.83            20.37
2399.55              2488.50            -88.95
1779.80              1698.98            80.82
2336.75              2265.58            71.17
1765.30              1810.44            -45.14
2053.50              1959.06            94.44
2414.40              2404.90            9.50
2200.50              2163.40            37.10
2654.20              2553.52            100.68
1753.70              1829.02            -75.32
Σyi = 42627.15       Σŷi = 42627.15     Σεi = 0.00

To estimate σ², first find SST and SSE:

$$SST = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 92547433.45 - \frac{(42627.15)^2}{20} = 1693737.60$$

$$SSE = SST - \hat{\beta}_1 SS_{xy} = 1693737.60 - (-37.15)(-41112.65) = 166402.65$$

Therefore, the estimate of σ² is computed as:

$$\hat{\sigma}^2 = \frac{SSE}{n-2} = \frac{166402.65}{18} = 9244.59$$

1.4 PROPERTIES OF OLS ESTIMATES


1.4.1 STANDARD ERROR OF OLS ESTIMATES
Note that

$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum_{i=1}^{n} y_i x_i - \frac{\left(\sum y_i\right)\left(\sum x_i\right)}{n}}{SS_{xx}} = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{SS_{xx}} = \sum_{i=1}^{n} c_i y_i, \quad \text{where } c_i = \frac{x_i - \bar{x}}{SS_{xx}}$$

$$\operatorname{Var}\left(\hat{\beta}_1\right) = \operatorname{Var}\left(\sum_{i=1}^{n} c_i y_i\right) = \sum_{i=1}^{n} c_i^2 \operatorname{Var}(y_i) = \sigma^2 \sum_{i=1}^{n} c_i^2 = \sigma^2 \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{SS_{xx}^2} = \frac{\sigma^2}{SS_{xx}}$$

$$\operatorname{Var}\left(\hat{\beta}_0\right) = \operatorname{Var}\left(\bar{y} - \hat{\beta}_1 \bar{x}\right) = \operatorname{Var}(\bar{y}) + \bar{x}^2 \operatorname{Var}\left(\hat{\beta}_1\right) - 2\bar{x}\operatorname{Cov}\left(\bar{y}, \hat{\beta}_1\right)$$

Since Cov(ȳ, β̂1) = 0, this reduces to

$$\operatorname{Var}\left(\hat{\beta}_0\right) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) + \bar{x}^2 \frac{\sigma^2}{SS_{xx}} = \frac{1}{n^2}\left(n\sigma^2\right) + \bar{x}^2 \frac{\sigma^2}{SS_{xx}} = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}} \right)$$

There are several other useful properties of the least-squares fit:

1) Σ(yi − ŷi) = Σεi = 0
2) Σyi = Σŷi
3) Σxiεi = 0
4) Σŷiεi = 0
5) The least-squares regression line always passes through the centroid, (x̄, ȳ), of
the data.

The quality of the least-squares estimators β̂0 and β̂1 is summarized by the Gauss-
Markov theorem, which states that for the regression model with the assumptions
E(ε) = 0, Var(ε) = σ² and Cov(εi, εj) = 0 for i ≠ j, the least-squares estimators are unbiased
and have minimum variance when compared with all other unbiased estimators that
are linear combinations of the yi. It is often said that the least-squares estimators are best
linear unbiased estimators (BLUE), where "best" implies minimum variance.
As we want our estimators to be precise, the variance of the estimators needs to be
low. The variances of β̂0 and β̂1 are affected by:
1) The variance of the error term, σ². If the error term has small variance, the
estimators will be more precise.
2) The amount of variation in the x variable. If there is a lot of variation in the x
variable, the estimators will be more precise.
3) The sample size, n. The larger the sample size, the more precise the estimators.
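The formula Var(β̂1) = σ²/SSxx can be checked empirically. The following is a small simulation sketch in Python assuming NumPy; all parameter values are arbitrary illustrations, not from the chapter's examples.

import numpy as np

rng = np.random.default_rng(1)
b0_true, b1_true, sigma = 5.0, 2.0, 3.0   # arbitrary illustrative values
x = np.linspace(0, 10, 25)                # fixed design points
ss_xx = np.sum((x - x.mean())**2)

# Refit the model on many simulated samples and record the slope estimates
slopes = []
for _ in range(20000):
    y = b0_true + b1_true * x + rng.normal(0, sigma, size=x.size)
    b1_hat, _ = np.polyfit(x, y, 1)
    slopes.append(b1_hat)

print(np.var(slopes))     # empirical variance of the slope estimates
print(sigma**2 / ss_xx)   # theoretical value: the two should nearly agree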

1.4.2 HYPOTHESIS TESTING OF REGRESSION COEFFICIENTS


One of the main purposes of determining a regression line is to find the true value of
the slope β1 of the population regression line. However, the population regression
line can only be estimated using sample data, so inferences on the
population regression line are made based on the sample regression line. Recall that
the slope β̂1 of a sample regression line is a point estimator of the slope β1 of the
population regression line. Different sample regression lines estimated from
different samples taken from the same population produce different values of β̂1. Thus, β̂1 is
a random variable and possesses a probability or sampling distribution.

Since the errors εi are NID(0, σ²), the observations yi are NID(β0 + β1xi, σ²). Now
β̂1 is a linear combination of the observations, so β̂1 is normally distributed with
mean β1 and variance σ²/SSxx. Therefore, to test the hypothesis that the slope of the
regression line equals a constant, say b1:

H0: β1 = b1  vs.  H1: β1 ≠ b1

the test statistic is:

$$Z = \frac{\hat{\beta}_1 - b_1}{\sqrt{\sigma^2 / SS_{xx}}} \sim N(0, 1)$$

However, σ² is typically unknown and is estimated from the sample by MSE. The
test statistic is then a t, given by:

$$t = \frac{\hat{\beta}_1 - b_1}{\sqrt{MSE / SS_{xx}}} \sim t_{n-2}$$

The null hypothesis is rejected if |t| > t_{α/2, n−2}. Note from the above that the test is a
two-tailed test. However, depending on the application, underlying theory may
require the analyst to perform a one-tailed test.
Similarly, the test statistic for testing H0: β0 = b0 is given by:

$$t = \frac{\hat{\beta}_0 - b_0}{\sqrt{MSE\left( \dfrac{1}{n} + \dfrac{\bar{x}^2}{SS_{xx}} \right)}} \sim t_{n-2}$$

The significance of the estimated regression model is tested based on the following
hypotheses:

H0: β1 = 0  vs.  H1: β1 ≠ 0

Failing to reject H0: β1 = 0 implies that there is no linear relationship between x and
y. This situation is illustrated in the two figures below. Note that this may imply
either that x is of little value in explaining the variation in y, so that the best
estimator of y for any x is ŷ = ȳ (a), or that the true relationship between x and y is
not linear (b). Therefore, failing to reject H0: β1 = 0 is equivalent to saying that there
is no linear relationship between y and x.

[Two plots of y against x, situations where the hypothesis is not rejected: (a) a patternless scatter, best estimated by ŷ = ȳ; (b) a clearly non-linear pattern.]

Alternatively, if H0: β1 = 0 is rejected, this implies that x is of value (important) in
explaining the variability in y. However, rejecting H0: β1 = 0 could mean either that the
straight-line model is adequate (a), or that even though there is a linear effect of x,
better results could be obtained with higher-order polynomial terms in x (b).

[Two plots of y against x, situations where the hypothesis is rejected: (a) a linear trend well described by a straight line; (b) a trend with curvature, better described by adding polynomial terms.]
The confidence interval for β1 is constructed around its point estimate β̂1. The
(1 − α)100% confidence interval for β1 is given by:

$$\hat{\beta}_1 - t_{\alpha/2, n-2}\, se\left(\hat{\beta}_1\right) \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2, n-2}\, se\left(\hat{\beta}_1\right)$$

This confidence interval has the usual "frequency" interpretation. That is, if we were
to take repeated samples of the same size at the same x levels and construct, for
example, a 95% confidence interval on the slope for each sample, then 95% of
those intervals would contain the true value of β1. Note that if the confidence interval
does not contain zero, this indicates that β̂1 is significantly different from zero.
Meanwhile, the (1 − α)100% confidence interval for σ² is given by:

$$\frac{(n-2)\,MSE}{\chi^2_{\alpha/2,\, n-2}} \le \sigma^2 \le \frac{(n-2)\,MSE}{\chi^2_{1-\alpha/2,\, n-2}}$$

Example: Rocket Propellant

The estimate of the slope is β̂1 = −37.15 and the estimate of σ² is
MSE = σ̂² = 9244.59. The standard error of the slope is calculated as:

$$se\left(\hat{\beta}_1\right) = \sqrt{\frac{MSE}{SS_{xx}}} = \sqrt{\frac{9244.59}{1106.56}} = 2.89$$

Therefore, the test statistic is:

$$t = \frac{\hat{\beta}_1}{se\left(\hat{\beta}_1\right)} = \frac{-37.15}{2.89} = -12.85$$

If α = 0.05 is chosen, the critical value of t is t0.025,18 = 2.101. Since |t| > 2.101, the
null hypothesis H0: β1 = 0 is rejected and we conclude that there is a linear relationship
between shear strength and the age of the propellant.
The 95% confidence interval on the slope is calculated as:

$$\hat{\beta}_1 - t_{0.025,18}\, se\left(\hat{\beta}_1\right) \le \beta_1 \le \hat{\beta}_1 + t_{0.025,18}\, se\left(\hat{\beta}_1\right)$$
$$-37.15 - (2.101)(2.89) \le \beta_1 \le -37.15 + (2.101)(2.89)$$
$$-43.22 \le \beta_1 \le -31.08$$

In other words, 95% of such intervals will include the true value of the slope. If a
different value for α were chosen, the width of the resulting confidence interval would
be different. In general, the larger the confidence level (1 − α) is, the wider
the confidence interval.

The 95% confidence interval on σ² is calculated as follows:

$$\frac{(n-2)\,MSE}{\chi^2_{0.025,\,18}} \le \sigma^2 \le \frac{(n-2)\,MSE}{\chi^2_{0.975,\,18}}$$
$$\frac{18(9244.59)}{31.5} \le \sigma^2 \le \frac{18(9244.59)}{8.23}$$
$$5282.65 \le \sigma^2 \le 20219.03$$
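These calculations are easy to script. Below is a sketch in Python assuming SciPy is available for the t and chi-square critical values; the summary numbers are those of the rocket propellant example.

import math
from scipy import stats

# Summary quantities from the rocket propellant example
b1, mse, ss_xx, n = -37.15, 9244.59, 1106.56, 20

se_b1 = math.sqrt(mse / ss_xx)        # 2.89
t_stat = b1 / se_b1                   # -12.85
t_crit = stats.t.ppf(0.975, n - 2)    # 2.101
print(abs(t_stat) > t_crit)           # True: reject H0: beta1 = 0

# 95% confidence interval for the slope
print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)    # about -43.22 and -31.08

# 95% confidence interval for sigma^2 (exact chi-square quantiles, so the
# endpoints differ slightly from the text's hand-rounded 5282.65 and 20219.03)
print((n - 2) * mse / stats.chi2.ppf(0.975, n - 2),
      (n - 2) * mse / stats.chi2.ppf(0.025, n - 2))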

1.5 COEFFICIENT OF DETERMINATION


How good is your regression model? It is often useful to compute a number that
summarizes how well the OLS regression line fits the data. That is, how well does
the x variable explain the variation that occurs in the y variable? The coefficient of
determination, R², of the model is a measure that can answer these questions. The
measure is calculated as:

$$R^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \hat{\beta}_1 \frac{SS_{xy}}{SS_{yy}} = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

Since SST measures the variability in y without considering the effect of the
independent variable x, and SSE measures the variability in y remaining after x
has been considered, R² is often called the proportion (%) of variation in y that can
be explained by the independent variable x. Since 0 ≤ SSE ≤ SST, it follows that 0 ≤ R² ≤ 1.
Values of R² that are close to 1 imply that most of the variability in y can be
explained by the regression model.
For the regression model for the rocket propellant data, we have

$$R^2 = \frac{SSR}{SST} = \frac{1527334.95}{1693737.60} = 0.9018$$
that is 90.18% of the variability in strength is accounted for by the regression model.
The statistic R 2 should be used with caution, since it is always possible to make R 2
large by adding enough terms to the model.

The higher the value of R 2 is, the better the regression model. However, R 2 does
not measure the appropriateness of the linear model, for R 2 will often be large even
though y and x are non-linearly related. In addition, although R 2 is large, this does
not necessarily imply that the regression model will be an accurate predictor.
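As a quick check, R² for the rocket propellant model follows directly from the sums of squares computed earlier; a short Python snippet:

# Sums of squares for the rocket propellant model
ssr, sst = 1527334.95, 1693737.60
sse = sst - ssr

print(round(ssr / sst, 4))       # 0.9018
print(round(1 - sse / sst, 4))   # the same value via the second formula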

1.6 ESTIMATING MEAN RESPONSE AND PREDICTION

A major use of a regression model is to estimate the mean value of y, or mean
response E(y), for a particular value of the independent variable x. That is, we are
attempting to estimate the mean of a very large number of responses at that
particular x-value. For example, someone might wish to estimate the mean shear
strength of the propellant bond in a rocket motor made from a batch of sustainer
propellant that is 10 weeks old.
Let x0 be the value of the x variable for which we wish to estimate the mean response,
say E(y | x0) = ymr. It is assumed that x0 is within the range of the original data used
to fit the model, i.e. x0 lies within the x-space. An unbiased point estimator of
E(y | x0) = ymr is calculated as:

$$\hat{E}(y \mid x_0) = \hat{\mu}_{y|x_0} = \hat{y}_{mr} = \hat{\beta}_0 + \hat{\beta}_1 x_0$$

The variance of μ̂_{y|x0}, or ŷmr, is given as:

$$\operatorname{Var}\left(\hat{\mu}_{y|x_0}\right) = \operatorname{Var}\left(\hat{\beta}_0 + \hat{\beta}_1 x_0\right) = \operatorname{Var}\left[\bar{y} + \hat{\beta}_1 (x_0 - \bar{x})\right] = \frac{\sigma^2}{n} + \frac{\sigma^2 (x_0 - \bar{x})^2}{SS_{xx}} = \sigma^2 \left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}} \right]$$

using the fact that Cov(ȳ, β̂1) = 0. Thus the sampling distribution of the standardized
estimator is

$$\frac{\hat{\mu}_{y|x_0} - E(y \mid x_0)}{\sqrt{MSE\left[ \dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{SS_{xx}} \right]}} \sim t_{n-2}$$

Consequently, a (1 − α)100% confidence interval on the mean response at the point
x = x0 is

$$\hat{y}_{mr} - t_{\alpha/2,\, n-2} \sqrt{MSE\left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}} \right]} \le y_{mr} \le \hat{y}_{mr} + t_{\alpha/2,\, n-2} \sqrt{MSE\left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}} \right]}$$

Note that the width of the confidence interval for E(y | x0) is a function of x0. The
interval width is a minimum for x0 = x̄ and widens as |x0 − x̄| increases. Intuitively
this is reasonable, as we would expect our best estimates of y to be made at x values
near the centre of the data, and the precision of estimation to deteriorate as we move
toward the boundary of the x-space. (Refer to the plot at the end of the chapter!)

Now, let us use the estimated regression model to predict a particular y value for
a given value of x. Unlike the case of the mean response, here we are predicting
the outcome of a single observation at the value x0. If x0 is the value of the explanatory
variable of interest, the point estimate of the new/predicted value is given by:

$$\hat{y}_{npr} = \hat{\beta}_0 + \hat{\beta}_1 x_0$$

which is the same as the point estimate of the mean response, ŷmr. The variance
associated with the prediction error, however, is given by

$$\operatorname{Var}\left(y_{npr} - \hat{y}_{npr}\right) = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}} \right]$$

Thus, the (1 − α)100% prediction interval on a future observation at the point x = x0
is

$$\hat{y}_{npr} - t_{\alpha/2,\, n-2} \sqrt{MSE\left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}} \right]} \le y_{npr} \le \hat{y}_{npr} + t_{\alpha/2,\, n-2} \sqrt{MSE\left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}} \right]}$$

Many regression textbooks state that one should never use a regression model to
extrapolate beyond the range of the original data; in other words, never use the
prediction equation beyond the boundary of the x-space.

It is clear from the two expressions above that:

1) the confidence interval for the mean response is always narrower than the
prediction interval for a new observation;
2) the widths of both the confidence and prediction intervals become smaller as n increases.

Thus, in theory, a more precise estimate of the mean value of y, or prediction of a
new value of y, can be obtained by selecting a large sample.

Example
Consider finding a 95% confidence interval on E(y | x0) for the rocket propellant
data. The confidence interval is calculated as:

$$\hat{\mu}_{y|x_0} - (2.101)\sqrt{9244.59\left[ \frac{1}{20} + \frac{(x_0 - 13.3625)^2}{1106.56} \right]} \le E(y \mid x_0) \le \hat{\mu}_{y|x_0} + (2.101)\sqrt{9244.59\left[ \frac{1}{20} + \frac{(x_0 - 13.3625)^2}{1106.56} \right]}$$

For example, if x0 = x̄ = 13.3625, then μ̂_{y|x0} = ŷmr = 2131.40, and the confidence
interval becomes

$$2086.23 \le E(y \mid 13.3625) \le 2176.57$$

Meanwhile, if x0 = 10, then μ̂_{y|x0} = 2256.32, with a 95% confidence interval of

$$2206.75 \le E(y \mid 10) \le 2305.89$$

The table below contains the 95% confidence limits on E(y | x0) for several other values
of x0. Note that the width of the confidence interval increases as |x0 − x̄| increases.

Confidence Limits on E(y | x0) for Several Values of x0

x0            Lower Confidence Limit   Upper Confidence Limit   Width of Interval
3             2438.919                 2593.821                 154.902
6             2341.360                 2468.481                 127.121
9             2241.104                 2345.836                 104.732
12            2136.098                 2227.942                 91.844
x̄ = 13.3625   2086.230                 2176.571                 90.341
15            2024.318                 2116.822                 92.504
18            1905.890                 2012.351                 106.461
21            1782.928                 1912.412                 129.484
24            1657.395                 1815.045                 157.650

Suppose now we try to predict the propellant shear strength in a motor made from
a batch of sustainer propellant that is 10 weeks old. The predicted value would be:

$$\hat{y}_{npr}\big|_{10} = 2627.82 - 37.15(10) = 2256.32$$

with a 95% prediction interval of:

$$2256.32 - (2.101)\sqrt{9244.59\left[ 1 + \frac{1}{20} + \frac{(10 - 13.3625)^2}{1106.56} \right]} \le y_{npr} \le 2256.32 + (2.101)\sqrt{9244.59\left[ 1 + \frac{1}{20} + \frac{(10 - 13.3625)^2}{1106.56} \right]}$$

$$2048.32 \le y_{npr}\big|_{10} \le 2464.32$$

Comparing the mean response with the predicted value for x = 10, it is clear that the
prediction interval is wider than the corresponding confidence interval. The reason is
that the prediction interval depends on both the error from the fitted model and the
error associated with a future observation.
How can we interpret the 95% prediction interval? If we repeat the study of obtaining
a regression data set many times, each time forming a 95% prediction interval at a
particular value x = x0, then approximately 95% of the prediction intervals will
contain the corresponding true value of y.
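Both intervals can be computed with a short script. This is a sketch in Python assuming SciPy; the summary quantities are those of the rocket propellant example, and the function name interval is ours.

import math
from scipy import stats

# Summary quantities from the rocket propellant example
b0, b1, mse, ss_xx, x_bar, n = 2627.82, -37.15, 9244.59, 1106.56, 13.3625, 20

def interval(x0, prediction=False, alpha=0.05):
    # CI on the mean response at x0, or a prediction interval if prediction=True
    y_hat = b0 + b1 * x0
    extra = 1.0 if prediction else 0.0   # the PI adds the variance of a new observation
    half = stats.t.ppf(1 - alpha / 2, n - 2) * math.sqrt(
        mse * (extra + 1 / n + (x0 - x_bar)**2 / ss_xx))
    return y_hat - half, y_hat + half

print(interval(10))                    # about (2206.75, 2305.89): CI for the mean response
print(interval(10, prediction=True))   # about (2048.32, 2464.32): prediction interval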
