Chapter 1
Example:
In a study involving children’s fear related to being hospitalized, the age and the
score each child made on the Child Medical Fear Scale (CMFS) are given in the
table below. Construct a scatter diagram for this data.
Age (x) 8 9 9 10 11 9 8 9 8 11 7 6 6
CMFS (y) 31 25 40 27 35 29 25 34 44 19 28 47 42
Age (x) 8 9 8 12 15 13 10 10 11 8 9 8
CMFS (y) 7 35 16 12 23 26 26 20 40 30 38 29
After the plot is drawn, it can be used to determine which type of relationship, if any, exists between the two variables. The scatter diagram reveals a general tendency rather than a precise linear relationship; a line drawn through the points represents the nature of the relationship on average. It is clear from the plot that there is a negative relationship between the two variables, since as age increases, the fear level tends to decrease.
[Figure: "Scatter Diagram for Children's Fear", plotting CMFS score (y-axis, 5 to 50) against Age (x-axis, 5 to 15).]
From a scatter diagram, we can see whether there appears to be a linear correlation between variable x and variable y. Since we also want to know how strong the correlation is, we need a method to measure it.
The coefficient of linear correlation measures the strength of the linear relationship between two variables. The population correlation coefficient is denoted by $\rho$, whereas the sample correlation coefficient is denoted by r. The correlation coefficient ranges from -1 to +1.
If there is a strong positive linear relationship between the variables, the value of r will be close to +1. If there is a strong negative linear relationship between the variables, the value of r will be close to -1. Typically, if $|r| \ge 0.7$, the correlation is said to be strong. When there is no linear relationship between the variables, or only a weak relationship, the value of r will be close to 0. Typically, if $0.1 \le |r| \le 0.3$, the correlation is said to be weak.
There are several ways to compute the value of the correlation coefficient. The
Pearson’s Product Moment Correlation Coefficient is calculated as follows:

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n-1)\, s_x s_y} = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}} $$

where $s_x$ and $s_y$ are the sample standard deviations of x and y, and

$$ SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}, \qquad SS_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}, \qquad SS_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} $$
Example:
Below are the data for the weight (in thousands of pounds) x and the gasoline mileage
(miles per gallon) y for ten different automobiles. Find the linear correlation
coefficient between the two variables.
x      y      x²       y²       xy
2.5    40     6.25     1600     100.0
3.0    43     9.00     1849     129.0
4.0    30    16.00      900     120.0
3.5    35    12.25     1225     122.5
2.7    42     7.29     1764     113.4
4.5    19    20.25      361      85.5
3.8    32    14.44     1024     121.6
2.9    39     8.41     1521     113.1
5.0    15    25.00      225      75.0
2.2    14     4.84      196      30.8
Σx = 34.1   Σy = 309   Σx² = 123.73   Σy² = 10665   Σxy = 1010.9
Calculation for r:

$$ SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 123.73 - \frac{(34.1)^2}{10} = 7.449 $$

$$ SS_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 10665 - \frac{(309)^2}{10} = 1116.9 $$

$$ SS_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} = 1010.9 - \frac{(34.1)(309)}{10} = -42.79 $$

$$ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = \frac{-42.79}{\sqrt{(7.449)(1116.9)}} = -0.47 $$
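The hand calculation above can be verified with a short script. The following is a minimal Python sketch (not part of the original notes; variable names are illustrative) that reproduces r for the automobile data from the raw sums:

```python
# Minimal sketch: Pearson correlation from raw sums, plain Python only.
import math

x = [2.5, 3.0, 4.0, 3.5, 2.7, 4.5, 3.8, 2.9, 5.0, 2.2]  # weight (1000 lb)
y = [40, 43, 30, 35, 42, 19, 32, 39, 15, 14]             # mileage (mpg)
n = len(x)

ss_xx = sum(v**2 for v in x) - sum(x)**2 / n
ss_yy = sum(v**2 for v in y) - sum(y)**2 / n
ss_xy = sum(a*b for a, b in zip(x, y)) - sum(x)*sum(y) / n

r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(round(r, 2))   # -0.47, matching the hand calculation
```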
Since the value of r is computed from sample data, there are two possibilities when r is not equal to zero: either the value of r is large enough to conclude that there is a significant linear relationship between the variables, or the value of r is due to chance and should therefore be treated as zero. To make this decision, the researcher may use a hypothesis testing procedure.
In hypothesis testing, the null and alternative hypotheses are:

$H_0: \rho = 0$ (there is no correlation between x and y in the population)
$H_1: \rho \neq 0$ (there is significant correlation between x and y in the population)
If both variables are normally distributed, the value of the test statistic t is calculated as below:

$$ t = r\sqrt{\frac{n-2}{1-r^2}} \sim t_{n-2} $$
Although the hypothesis above is two-tailed, some problems involve the claim that the correlation is either positive or negative, and such problems call for a one-tailed test.
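As an illustration, the test can be carried out for the automobile example above (r = -0.47, n = 10). This is a sketch, assuming SciPy is available for the t-distribution quantile:

```python
# Sketch: t test of H0: rho = 0 using t = r*sqrt((n-2)/(1-r^2)).
import math
from scipy import stats

r, n, alpha = -0.47, 10, 0.05
t = r * math.sqrt((n - 2) / (1 - r**2))
t_crit = stats.t.ppf(1 - alpha/2, df=n - 2)
print(round(t, 3), round(t_crit, 3))  # t ~ -1.506, critical value ~ 2.306
print(abs(t) > t_crit)                # False: fail to reject H0 at the 5% level
```

Since |t| = 1.51 < 2.306, we fail to reject the null hypothesis: this sample does not establish a significant linear correlation.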
The simple linear regression model assumes an exact linear relationship between the expected (average) value of y, the dependent variable, and x, the independent or predictor variable:

$$ E(y_i) = \beta_0 + \beta_1 x_i $$
[Figure: A fitted regression line with one observed value, its predicted value on the line, and the error between them.]
It is convenient to view the explanatory x as controlled by the data analyst and
measured with negligible error, while the response y is a random variable. That is,
there is a probability distribution for y at each possible value for x. The mean of this
distribution is
$$ E(y \mid x) = \mu_{y|x} = E(\beta_0 + \beta_1 x + \varepsilon) = \beta_0 + \beta_1 x $$
The variance of y given any value of x is
$$ \operatorname{Var}(y \mid x) = \operatorname{Var}(\beta_0 + \beta_1 x + \varepsilon) = \operatorname{Var}(\varepsilon) = \sigma^2 $$
Thus, the mean of y is a linear function of x although the variance of y does not
depend on the value of x. Furthermore, because the errors are uncorrelated, the
responses are also uncorrelated.
Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line. The method of least squares produces a line that is certain to appear satisfactory, since it goes through the centre of the data, $(\bar{x}, \bar{y})$, and does so at a sensible angle. Although the line passes through the data, very few of the data points actually lie on the line.

If the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ are well chosen, all of the residuals $e_i = y_i - \hat{y}_i$ will be small. Because some of the residuals are negative and some are positive, $\sum e_i$ can be small even for a poorly fitting line; it is mathematically convenient to work with $\sum e_i^2$ instead. A line that fits the data well is one for which $\sum e_i^2$ is as small as possible.
Differentiating $S = SS_{Res} = \sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2$ with respect to $\beta_0$ and $\beta_1$ and setting the derivatives to zero to minimize $SS_{Res}$,

$$ \frac{\partial S}{\partial \beta_0} = -2\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right) = 0 \quad\text{and}\quad \frac{\partial S}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i\left(y_i - \beta_0 - \beta_1 x_i\right) = 0 $$

which gives the normal equations:

$$ \sum_{i=1}^{n} y_i = n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i \quad\text{and}\quad \sum_{i=1}^{n} x_i y_i = \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 $$
Solving the normal equations simultaneously for $\beta_0$ and $\beta_1$, we obtain the regression line estimators:

$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \quad\text{and}\quad \hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \dfrac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} = \frac{SS_{xy}}{SS_{xx}} $$

Note: the least squares line is the line that actually minimizes $\sum_{i=1}^{n} e_i^2$.
Line of Best Fit Equation: The equation is determined by $\hat{\beta}_0$ and $\hat{\beta}_1$, where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the values that satisfy the least squares criterion.
After obtaining the least-squares fit, a number of interesting questions come to mind:
1. How well does this equation fit the data?
2. Is the model likely to be useful as a predictor?
3. Are any of the basic assumptions (such as constant variance and uncorrelated
errors) violated, and if so, how serious is this?
All of these issues must be investigated before the model is finally adopted for use.
As noted previously, the residuals play a key role in evaluating model adequacy.
Residuals can be viewed as realizations of the model errors $\varepsilon_i$. Thus, to check the constant variance and uncorrelated errors assumptions, the residuals must look like a random sample from a distribution with these properties. This is discussed further in Chapter 3, where the use of residuals in model adequacy checking is explored.
Example:
A recent article measured the job satisfaction of subjects with a 14-question survey. The data below represent the job satisfaction scores, y, and the salaries, x, for a sample of similar individuals. (a) Find the equation of the line of best fit and show the diagram.
Data:
x 3 7 6 6 10 12 12 12 13 13 14 15
y 33 38 24 61 52 45 29 65 82 63 50 79
Solution:
x y x² y² xy
3 33 9 1089 99
7 38 49 1444 266
6 24 36 576 144
6 61 36 3721 366
10 52 100 2704 520
12 45 144 2025 540
12 29 144 841 348
12 65 144 4225 780
13 82 169 6724 1066
13 63 169 3969 819
14 50 196 2500 700
15 79 225 6241 1185
Σx = 123   Σy = 621   Σx² = 1421   Σy² = 36059   Σxy = 6833
$$ SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 1421 - \frac{(123)^2}{12} = 160.25 $$

$$ SS_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} = 6833 - \frac{(123)(621)}{12} = 467.75 $$

$$ \hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{467.75}{160.25} = 2.919 $$

$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = \frac{\sum y - \hat{\beta}_1 \sum x}{n} = \frac{621 - 2.919(123)}{12} = 21.83 $$
Thus, the equation of the line of best fit is: $\hat{y}_i = 21.83 + 2.919 x_i$
[Figure: Plot of the fitted line $\hat{y} = 21.83 + 2.919x$, crossing the y-axis at 21.83 and the x-axis at -7.48.]
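For completeness, the fitted coefficients can be reproduced with a short script. The following is a minimal Python sketch (illustrative, not from the original notes):

```python
# Minimal sketch: least-squares slope and intercept from the raw data.
x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]
n = len(x)

ss_xx = sum(v**2 for v in x) - sum(x)**2 / n
ss_xy = sum(a*b for a, b in zip(x, y)) - sum(x)*sum(y) / n

b1 = ss_xy / ss_xx             # slope:     ~2.919
b0 = sum(y)/n - b1 * sum(x)/n  # intercept: ~21.83
print(f"y_hat = {b0:.2f} + {b1:.3f} x")
```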
1.3.3 ESTIMATION OF ERROR VARIANCE
In addition to estimating $\beta_0$ and $\beta_1$, an estimate of the error variance $\sigma^2$ is needed for hypothesis testing on the parameters and to construct interval estimates related to the model. In a regression model, for every x value there is a different value of y. Consequently, the random error $\varepsilon$ will take on a different value for each x. The measure of spread of these different values of $\varepsilon$ is given by its standard deviation, $\sigma$. The standard deviation of errors tells us how widely the errors, and hence the y values, are spread around the regression line.

Recall the assumptions for the random error $\varepsilon$:
1. The errors for each observation are independent of each other.
2. The errors for each x have mean equal to zero, that is, $E(\varepsilon) = 0$ for each x.
3. For any given x, the errors are normally distributed.
4. For each x, the population errors have the same standard deviation, $\sigma$.

Note that $\sigma$ denotes the standard deviation of errors in the population and is usually unknown. In such cases, it is estimated by the standard deviation of errors in the sample, $s_e$, with the formula below:
$$ s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{n-2}} $$
The residual sum of squares has n - 2 degrees of freedom, because two degrees of freedom are associated with the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ involved in obtaining $\hat{y}_i$. It can be shown that the expected value of SSE is $E(SSE) = (n-2)\sigma^2$, so an unbiased estimator of $\sigma^2$ is given by:

$$ \hat{\sigma}^2 = \frac{SSE}{n-2} = MSE $$

The quantity MSE is called the residual mean square. The square root of $\hat{\sigma}^2$ is sometimes called the standard error of regression, and it has the same units as the response variable y.
Example:
The table below gives the heights and weights of eight individuals. Fit the least-squares line and estimate the error variance.

Heights, x (inches) 65 65 62 67 69 65 61 67
Weight, y (pounds) 105 125 110 120 140 135 95 130
Solutions:
x y x² xy y²
65 105 4225 6825 11025
65 125 4225 8125 15625
62 110 3844 6820 12100
67 120 4489 8040 14400
69 140 4761 9660 19600
65 135 4225 8775 18225
61 95 3721 5795 9025
67 130 4489 8710 16900
Σx = 521   Σy = 960   Σx² = 33979   Σxy = 62750   Σy² = 116900
$$ SS_{xx} = 33979 - \frac{(521)^2}{8} = 48.875 \qquad SS_{xy} = 62750 - \frac{(521)(960)}{8} = 230.0 $$

$$ \hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{230.0}{48.875} = 4.706 $$

$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = \frac{\sum y - \hat{\beta}_1 \sum x}{n} = \frac{960 - (4.706)(521)}{8} = -186.478 $$
From the previous example, $SS_{xy} = 230.0$ and $\hat{\beta}_1 = 4.706$. Now, we need to compute $SS_{yy} = SST$:

$$ SST = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 116900 - \frac{(960)^2}{8} = 1700 $$

Hence, the estimated error variance for the sample is:

$$ \hat{\sigma}^2 = \frac{SST - \hat{\beta}_1 SS_{xy}}{n-2} = \frac{1700 - (4.706)(230.0)}{8-2} = 102.94 $$
Warning/Caution
Calculations involving $SS_{yy}$, $SS_{xy}$ and $\hat{\beta}_1$ must be made with at least six decimal places (could be ten!) to avoid substantial rounding error in the calculation of SSE.
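One way to sidestep the rounding issue is to let the machine carry full precision. A Python sketch for the height/weight example above:

```python
# Sketch: error-variance estimate carried at full precision throughout.
x = [65, 65, 62, 67, 69, 65, 61, 67]
y = [105, 125, 110, 120, 140, 135, 95, 130]
n = len(x)

ss_xx = sum(v**2 for v in x) - sum(x)**2 / n
ss_xy = sum(a*b for a, b in zip(x, y)) - sum(x)*sum(y) / n
sst   = sum(v**2 for v in y) - sum(y)**2 / n

b1  = ss_xy / ss_xx                 # ~4.705882, kept unrounded
mse = (sst - b1 * ss_xy) / (n - 2)
print(round(mse, 2))                # 102.94, matching the hand calculation
```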
Example 2: The Rocket Propellant Data
A rocket motor is manufactured by bonding an igniter propellant and a sustainer
propellant together inside a metal housing. The shear strength of the bond between
the two types of propellant is an important quality characteristic. It is suspected that
shear strength is related to the age in weeks of the batch of sustainer propellant.
Twenty observations on shear strength and the age of the corresponding batch of
propellant have been collected and are shown in Table 2. The scatter diagram, shown
in Figure 2, suggests that there is a strong statistical relationship between shear
strength (in psi) and propellant age (in weeks), and the tentative assumption of the
straight-line model y = 0 + 1 x + appears to be reasonable.
[Figure 2: Scatter diagram of shear strength (psi, 1600 to 2700) against age of propellant (weeks, 0 to 26), showing a clear negative linear trend.]
$$ SS_{xy} = \sum_{i=1}^{n} y_i x_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n} = 528492.6375 - \frac{(42627.15)(267.25)}{20} = -41112.65 $$

Therefore, the estimated coefficients are:

$$ \hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{-41112.65}{1106.56} = -37.15 \qquad\text{and}\qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = 2131.3575 - (-37.15)(13.3625) = 2627.82 $$
The slope -37.15 is interpreted as the average weekly decrease in propellant shear
strength due to the age of the propellant. Since the lower limit of the x’s is near the
origin, the intercept 2627.82 represents the shear strength in a batch of propellant
immediately following manufacture.
Table: Data, Fitted Values, and Residuals
Observed Value, $y_i$   Fitted Value, $\hat{y}_i$   Residual, $e_i$
2158.70 2051.94 106.76
1678.15 1745.42 -67.27
2316.00 2330.59 -14.59
2061.30 1996.21 65.09
2207.50 2423.48 -215.98
1708.30 1921.90 -213.60
1784.70 1736.14 48.56
2575.00 2534.94 40.06
2357.90 2349.17 8.73
2256.70 2219.13 37.57
2165.20 2144.83 20.37
2399.55 2488.50 -88.95
1799.80 1698.98 80.82
2336.75 2265.58 71.17
1765.30 1810.44 -45.14
2053.50 1959.06 94.44
2414.40 2404.90 9.50
2200.50 2163.40 37.10
2654.20 2553.52 100.68
1753.70 1829.02 -75.32
$\sum y_i = 42627.15$   $\sum \hat{y}_i = 42627.15$   $\sum e_i = 0.00$
$$ SSE = SST - \hat{\beta}_1 SS_{xy} = 1693737.60 - (-37.15)(-41112.65) = 166402.65 $$

Therefore, the estimate of $\sigma^2$ is computed as: $\hat{\sigma}^2 = \dfrac{SSE}{n-2} = \dfrac{166402.65}{18} = 9244.59$
Note that

$$ \hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})\, y_i}{SS_{xx}} = \sum_{i=1}^{n} c_i y_i \qquad\text{where } c_i = \frac{x_i - \bar{x}}{SS_{xx}}, $$

so $\hat{\beta}_1$ is a linear combination of the observations $y_i$, with variance

$$ \operatorname{Var}\left(\hat{\beta}_1\right) = \sum_{i=1}^{n} c_i^2 \operatorname{Var}(y_i) = \frac{\sigma^2}{SS_{xx}}. $$

For the intercept,

$$ \operatorname{Var}\left(\hat{\beta}_0\right) = \operatorname{Var}\left(\bar{y} - \hat{\beta}_1\bar{x}\right) = \operatorname{Var}(\bar{y}) + \bar{x}^2\operatorname{Var}\left(\hat{\beta}_1\right) - 2\bar{x}\operatorname{Cov}\left(\bar{y}, \hat{\beta}_1\right) $$

and since $\operatorname{Cov}\left(\bar{y}, \hat{\beta}_1\right) = 0$,

$$ \operatorname{Var}\left(\hat{\beta}_0\right) = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(y_i) + \bar{x}^2\frac{\sigma^2}{SS_{xx}} = \frac{\sigma^2}{n} + \bar{x}^2\frac{\sigma^2}{SS_{xx}} = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}}\right). $$
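The variance formula for the slope can also be checked empirically. The following Python simulation sketch (the parameters are arbitrary illustrative choices, not from the notes) draws many samples from a known model and compares the empirical variance of $\hat{\beta}_1$ with $\sigma^2 / SS_{xx}$:

```python
# Sketch: Monte Carlo check of Var(beta1_hat) = sigma^2 / SS_xx.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([2.5, 3.0, 4.0, 3.5, 2.7, 4.5, 3.8, 2.9, 5.0, 2.2])
b0, b1, sigma = 10.0, 2.0, 1.5          # arbitrary "true" parameters
ss_xx = np.sum((x - x.mean())**2)

slopes = []
for _ in range(20000):
    y = b0 + b1 * x + rng.normal(0, sigma, size=x.size)
    slopes.append(np.sum((x - x.mean()) * y) / ss_xx)  # beta1_hat

print(np.var(slopes))    # empirical variance of the slope estimates
print(sigma**2 / ss_xx)  # theoretical value: the two agree closely
```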
The quality of the least-squares estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ is summarized by the Gauss-Markov theorem, which states that for the regression model with the assumptions $E(\varepsilon) = 0$, $\operatorname{Var}(\varepsilon) = \sigma^2$ and $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$, the least-squares estimators are unbiased and have minimum variance when compared with all other unbiased estimators that are linear combinations of the $y_i$. It is often said that the least-squares estimators are best linear unbiased estimators (BLUE), where best implies minimum variance.
As we want our estimators to be precise, the variance of the estimators needs to be low. The variances of $\hat{\beta}_0$ and $\hat{\beta}_1$ are affected by:
1) The variance of the error term, $\sigma^2$. If the error term has small variance, the estimators will be more precise.
2) The amount of variation in the x variable. If there is a lot of variation in the x variable, the estimators will be more precise.
3) The sample size, n. The larger the sample size, the more precise the estimators.
Since the errors $\varepsilon_i$ are $NID(0, \sigma^2)$, the observations $y_i$ are $NID(\beta_0 + \beta_1 x_i, \sigma^2)$. Now $\hat{\beta}_1$ is a linear combination of the observations, so $\hat{\beta}_1$ is normally distributed with mean $\beta_1$ and variance $\sigma^2 / SS_{xx}$. Therefore, to test the hypothesis that the slope equals a constant $b_1$,

$$ H_0: \beta_1 = b_1 \quad\text{vs.}\quad H_1: \beta_1 \neq b_1 $$

the test statistic is:

$$ Z = \frac{\hat{\beta}_1 - b_1}{\sqrt{\sigma^2 / SS_{xx}}} \sim N(0, 1). $$
However, $\sigma^2$ is typically unknown and is estimated from the sample by MSE. The test statistic is then a t, given by:

$$ t = \frac{\hat{\beta}_1 - b_1}{\sqrt{MSE / SS_{xx}}} \sim t_{n-2} $$

The null hypothesis is rejected if $|t| > t_{\alpha/2,\, n-2}$. Note from the above that the test is a two-tailed test. However, depending on the application, the underlying theory may require the analyst to perform a one-tailed test.
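As an illustration, the test of $H_0: \beta_1 = 0$ for the rocket propellant data can be scripted from the summary statistics quoted in this chapter. A sketch assuming SciPy:

```python
# Sketch: slope t test from summary statistics (rocket propellant data).
import math
from scipy import stats

b1_hat, mse, ss_xx, n = -37.15, 9244.59, 1106.56, 20
t = (b1_hat - 0) / math.sqrt(mse / ss_xx)
t_crit = stats.t.ppf(0.975, df=n - 2)
print(round(t, 2), round(t_crit, 3))  # t ~ -12.85, critical value ~ 2.101
print(abs(t) > t_crit)                # True: reject H0, the slope is significant
```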
Similarly, the test statistic for testing $H_0: \beta_0 = b_0$ is given by:

$$ t = \frac{\hat{\beta}_0 - b_0}{\sqrt{MSE\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{SS_{xx}}\right)}} \sim t_{n-2} $$
The significance of the estimated regression model is tested based on the following hypotheses:

$$ H_0: \beta_1 = 0 \quad\text{vs.}\quad H_1: \beta_1 \neq 0 $$

Failing to reject $H_0: \beta_1 = 0$ implies that there is no linear relationship between x and y. This situation is illustrated in the two figures below. Note that this may imply either that x is of little value in explaining the variation in y and that the best estimator of y for any x is $\hat{y} = \bar{y}$ (a), or that the true relationship between x and y is not linear (b). Therefore, failing to reject $H_0: \beta_1 = 0$ is equivalent to saying that there is no linear relationship between y and x.
[Figures (a) and (b): situations where the hypothesis $H_0: \beta_1 = 0$ is not rejected]

[Figures (a) and (b): situations where the hypothesis $H_0: \beta_1 = 0$ is rejected]
The confidence interval for $\beta_1$ is constructed around its point estimate, $\hat{\beta}_1$. The $(1-\alpha)100\%$ confidence interval for $\beta_1$ is given by:

$$ \hat{\beta}_1 - t_{\alpha/2,\,n-2}\, se\left(\hat{\beta}_1\right) \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2,\,n-2}\, se\left(\hat{\beta}_1\right) $$

This confidence interval has the usual "frequency" interpretation. That is, if we were to take repeated samples of the same size at the same x levels and construct, for example, a 95% confidence interval on the slope for each sample, then 95% of those intervals will contain the true value of $\beta_1$. Note that if the confidence interval does not contain zero, this indicates that $\hat{\beta}_1$ is significantly different from zero.
Meanwhile, the $(1-\alpha)100\%$ confidence interval for $\sigma^2$ is given by:

$$ \frac{(n-2)\,MSE}{\chi^2_{\alpha/2,\,n-2}} \le \sigma^2 \le \frac{(n-2)\,MSE}{\chi^2_{1-\alpha/2,\,n-2}} $$
Example: For the rocket propellant data, the standard error of the slope is

$$ se\left(\hat{\beta}_1\right) = \sqrt{\frac{MSE}{SS_{xx}}} = \sqrt{\frac{9244.59}{1106.56}} = 2.89 $$

so the 95% confidence interval for $\beta_1$ is

$$ \hat{\beta}_1 - t_{0.025,18}\, se\left(\hat{\beta}_1\right) \le \beta_1 \le \hat{\beta}_1 + t_{0.025,18}\, se\left(\hat{\beta}_1\right) $$
$$ -37.15 - (2.101)(2.89) \le \beta_1 \le -37.15 + (2.101)(2.89) $$
$$ -43.22 \le \beta_1 \le -31.08 $$
In other words, 95% of such intervals will include the true value of the slope. If a different value of $\alpha$ were chosen, the width of the resulting confidence interval would be different. In general, the larger the confidence level $(1-\alpha)$ is, the wider the confidence interval.
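The same interval can be reproduced from the summary statistics. A sketch assuming SciPy:

```python
# Sketch: 95% confidence interval for the slope from summary statistics.
import math
from scipy import stats

b1_hat, mse, ss_xx, n = -37.15, 9244.59, 1106.56, 20
se = math.sqrt(mse / ss_xx)       # ~2.89
t = stats.t.ppf(0.975, df=n - 2)  # ~2.101
print(round(b1_hat - t*se, 2), round(b1_hat + t*se, 2))  # (-43.22, -31.08)
```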
The 95% confidence interval on $\sigma^2$ is calculated as follows:

$$ \frac{(n-2)\,MSE}{\chi^2_{\alpha/2,\,n-2}} \le \sigma^2 \le \frac{(n-2)\,MSE}{\chi^2_{1-\alpha/2,\,n-2}} $$
$$ \frac{18(9244.59)}{\chi^2_{0.025,18}} \le \sigma^2 \le \frac{18(9244.59)}{\chi^2_{0.975,18}} $$
$$ \frac{18(9244.59)}{31.5} \le \sigma^2 \le \frac{18(9244.59)}{8.23} $$
$$ 5282.65 \le \sigma^2 \le 20219.03 $$
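A sketch of the same computation using SciPy's chi-square quantiles rather than table look-ups (the small differences from the values above come from rounding the tabled quantiles):

```python
# Sketch: 95% confidence interval for sigma^2 via chi-square quantiles.
from scipy import stats

mse, n = 9244.59, 20
lo = (n - 2) * mse / stats.chi2.ppf(0.975, df=n - 2)  # divide by upper quantile
hi = (n - 2) * mse / stats.chi2.ppf(0.025, df=n - 2)  # divide by lower quantile
print(round(lo, 1), round(hi, 1))  # ~ (5278.2, 20217.2)
```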
The coefficient of determination, $R^2$, is defined as:

$$ R^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{y}\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} = \frac{\hat{\beta}_1 SS_{xy}}{SS_{yy}} = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} $$
Since SST is a measure of the variability in y without considering the effect of the independent variable x, and SSE is a measure of the variability in y remaining after x has been considered, $R^2$ is often called the proportion (%) of variation in y that can be explained by the independent variable x. Since $0 \le SSE \le SST$, it follows that $0 \le R^2 \le 1$. Values of $R^2$ that are close to 1 imply that most of the variability in y can be explained by the regression model.
For the regression model for the rocket propellant data, we have

$$ R^2 = \frac{SSR}{SST} = \frac{1527334.95}{1693737.60} = 0.9018 $$

that is, 90.18% of the variability in strength is accounted for by the regression model. The statistic $R^2$ should be used with caution, since it is always possible to make $R^2$ large by adding enough terms to the model.
The higher the value of $R^2$, the better the regression model appears to fit. However, $R^2$ does not measure the appropriateness of the linear model, for $R^2$ will often be large even though y and x are non-linearly related. In addition, a large $R^2$ does not necessarily imply that the regression model will be an accurate predictor.
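For instance, $R^2$ for the rocket propellant model can be obtained either as SSR/SST or as 1 - SSE/SST. The short sketch below uses the sums of squares quoted above:

```python
# Sketch: R^2 from the rocket propellant sums of squares, two equivalent ways.
sst, ssr, sse = 1693737.60, 1527334.95, 166402.65
print(round(ssr / sst, 4))      # 0.9018
print(round(1 - sse / sst, 4))  # 0.9018, the same value
```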
1.6 ESTIMATING MEAN RESPONSE AND PREDICTION
A major use of a regression model is to estimate the mean value of y, or mean response $E(y)$, for a particular value of the independent variable x. That is, we are estimating the mean of a very large number of responses at that particular x-value. For example, someone might wish to estimate the mean shear strength of the propellant bond in a rocket motor made from a batch of sustainer propellant that is 10 weeks old.
Let $x_0$ be the value of the x variable for which we wish to estimate the mean response, say $E(y \mid x_0) = \mu_{y|x_0}$. Assume that $x_0$ is within the range of the original data used to fit the model, i.e., $x_0$ lies within the x-space. An unbiased point estimator of $E(y \mid x_0)$ is

$$ \hat{y}_{mr} = \hat{\beta}_0 + \hat{\beta}_1 x_0 $$

with variance

$$ \operatorname{Var}\left(\hat{y}_{mr}\right) = \operatorname{Var}\left(\hat{\beta}_0 + \hat{\beta}_1 x_0\right) = \operatorname{Var}\left(\bar{y} + \hat{\beta}_1(x_0 - \bar{x})\right) = \frac{\sigma^2}{n} + \frac{\sigma^2\left(x_0 - \bar{x}\right)^2}{SS_{xx}} = \sigma^2\left(\frac{1}{n} + \frac{\left(x_0 - \bar{x}\right)^2}{SS_{xx}}\right) $$
Note that $\operatorname{Cov}\left(\bar{y}, \hat{\beta}_1\right) = 0$; thus the sampling distribution of

$$ \frac{\hat{y}_{mr} - E(y \mid x_0)}{\sqrt{MSE\left(\dfrac{1}{n} + \dfrac{\left(x_0 - \bar{x}\right)^2}{SS_{xx}}\right)}} \sim t_{n-2} $$
Consequently, a $(1-\alpha)100\%$ confidence interval on the mean response at the point $x = x_0$ is

$$ \hat{y}_{mr} - t_{\alpha/2,\,n-2}\sqrt{MSE\left(\frac{1}{n} + \frac{\left(x_0 - \bar{x}\right)^2}{SS_{xx}}\right)} \le \mu_{y|x_0} \le \hat{y}_{mr} + t_{\alpha/2,\,n-2}\sqrt{MSE\left(\frac{1}{n} + \frac{\left(x_0 - \bar{x}\right)^2}{SS_{xx}}\right)} $$
Note that the width of the confidence interval for $E(y \mid x_0)$ is a function of $x_0$. The interval width is a minimum at $x_0 = \bar{x}$ and widens as $|x_0 - \bar{x}|$ increases. Intuitively this is reasonable, as we would expect our best estimates of y to be made at x values near the centre of the data, and the precision of estimation to deteriorate as we move toward the boundary of the x-space. (Refer to the plot at the end of the chapter!)
Now, let us use the estimated regression model to predict a particular y value for
a given value of x. Unlike the case for the mean response, here we are predicting
the outcome from a single value of x. If x0 is the value of explanatory variable of
interest, the point estimate of the new/predicted value is given by:
$$ \hat{y}_{npr} = \hat{\beta}_0 + \hat{\beta}_1 x_0 $$
which is the same as the mean response estimate, $\hat{y}_{mr}$. The variance of the predicted value is given by

$$ \operatorname{Var}\left(\hat{y}_{npr}\right) = \sigma^2\left(1 + \frac{1}{n} + \frac{\left(x_0 - \bar{x}\right)^2}{SS_{xx}}\right) $$
Thus, the $(1-\alpha)100\%$ prediction interval on a future observation at the point $x = x_0$ is

$$ \hat{y}_{npr} - t_{\alpha/2,\,n-2}\sqrt{MSE\left(1 + \frac{1}{n} + \frac{\left(x_0 - \bar{x}\right)^2}{SS_{xx}}\right)} \le y_{npr} \le \hat{y}_{npr} + t_{\alpha/2,\,n-2}\sqrt{MSE\left(1 + \frac{1}{n} + \frac{\left(x_0 - \bar{x}\right)^2}{SS_{xx}}\right)} $$
Many regression textbooks state that one should never use a regression model to extrapolate beyond the range of the original data; in other words, never use the prediction equation beyond the boundary of the x-space.
Example
Consider finding a 95% confidence interval on $E(y \mid x_0)$ for the rocket propellant data. For any $x_0$ within the range of the data, the interval is

$$ \hat{y}_{mr} \pm (2.101)\sqrt{9244.59\left(\frac{1}{20} + \frac{\left(x_0 - 13.3625\right)^2}{1106.56}\right)} $$
Suppose now we wish to predict the propellant shear strength of a motor made from a batch of sustainer propellant that is 10 weeks old. The predicted value is:

$$ \hat{y}_{npr} = 2627.82 - 37.15(10) = 2256.32 $$

and the 95% prediction interval is

$$ 2256.32 - (2.101)\sqrt{9244.59\left(1 + \frac{1}{20} + \frac{(10 - 13.3625)^2}{1106.56}\right)} \le y_{npr} \le 2256.32 + (2.101)\sqrt{9244.59\left(1 + \frac{1}{20} + \frac{(10 - 13.3625)^2}{1106.56}\right)} $$
$$ 2048.32 \le y_{npr} \le 2464.32 $$
Comparing the interval for the mean response with the prediction interval at $x = 10$, it is clear that the prediction interval is wider than the corresponding confidence interval, because the prediction interval depends on both the error from the fitted model and the error associated with a future observation.
How can we interpret the 95% prediction interval? If we repeated the study many times, each time obtaining a regression data set and forming a 95% prediction interval at a particular value $x = x_0$, then approximately 95% of those prediction intervals would contain the corresponding true value of y.
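Both intervals for the rocket propellant example can be reproduced from the summary statistics quoted in this chapter. A Python sketch (assuming SciPy):

```python
# Sketch: 95% mean-response interval and prediction interval at x0 = 10.
import math
from scipy import stats

b0, b1 = 2627.82, -37.15
mse, ss_xx, x_bar, n = 9244.59, 1106.56, 13.3625, 20
x0 = 10

y_hat = b0 + b1 * x0              # 2256.32
t = stats.t.ppf(0.975, df=n - 2)  # ~2.101
core = 1/n + (x0 - x_bar)**2 / ss_xx

half_mr  = t * math.sqrt(mse * core)        # mean-response half-width
half_npr = t * math.sqrt(mse * (1 + core))  # prediction half-width (wider)
print(y_hat - half_mr,  y_hat + half_mr)    # ~ (2206.75, 2305.89)
print(y_hat - half_npr, y_hat + half_npr)   # ~ (2048.32, 2464.32)
```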