
Chapter 11:
SIMPLE LINEAR REGRESSION AND CORRELATION
Chapter outline
11.1 Empirical Models.
11.2 Simple Linear Regression.
11.3 Properties of the Least Squares Estimators.
11.4 Hypothesis Tests in Simple Linear Regression.
11.8 Correlation.
Learning Objectives
After careful study of this chapter, you should be able
to do the following:
1. Use simple linear regression for building empirical
models for engineering and scientific data.
2. Understand how the method of least squares is used
to estimate the parameters in a linear
regression model.
3. Analyze residuals to determine whether the
regression model is an adequate fit to the data or
whether any underlying assumptions are violated.
Learning Objectives
4. Test statistical hypotheses and construct
confidence intervals on regression model
parameters.
5. Use the regression model to predict a future
observation and construct an appropriate
prediction interval on the future observation.
6. Apply the correlation model.
7. Use simple transformations to achieve a linear
regression model.
11.1 Empirical Models
• Regression analysis is the process of building
mathematical models or functions that can
describe, predict, or control one variable in terms
of one or more other variables.
• Many problems in engineering and science
involve exploring the relationships between two
or more variables.
• Regression analysis is a statistical technique
that is very useful for these types of problems.
11.1 Empirical Models
Example 1:
Suppose that a car rental company that offers
hybrid vehicles charts its revenue as shown below.
How best could we predict the company's revenue
for the year 2016?

Year, x:                             1996   2001   2006   2011   2016
Yearly revenue, y (millions of $):    5.2    8.9   11.7   16.8      ?
11.1 Empirical Models
Suppose that we plot these points and try to draw a
line through them that fits. Note that there are
several ways in which this might be done (see the
graphs below). Each would give a different estimate
of the company's total revenue for 2016.
11.1 Empirical Models
 Based on the scatter diagram, it is probably reasonable
to assume that the mean of the random variable Y is
related to x by the following straight-line relationship:
E(Y | x) = μ_{Y|x} = β₀ + β₁x
where the slope β₁ and intercept β₀ of the line are called
regression coefficients.
 The simple linear regression model is given by
Y = β₀ + β₁x + ε
where ε is the random error term.
11.1 Empirical Models
 We think of the regression model as an empirical
model.
 Suppose that the mean and variance of ε are 0
and σ², respectively. Then
E(Y | x) = E(β₀ + β₁x + ε) = β₀ + β₁x + E(ε) = β₀ + β₁x
 The variance of Y given x is
V(Y | x) = V(β₀ + β₁x + ε) = V(β₀ + β₁x) + V(ε) = 0 + σ² = σ²
11.1 Empirical Models
The true regression model is a line of mean values:
μ_{Y|x} = β₀ + β₁x
where β₁ can be interpreted as the change in the mean of Y
for a unit change in x.
• Also, the variability of Y at a particular value of x is
determined by the error variance, σ².
• This implies there is a distribution of Y-values at each x
and that the variance of this distribution is the same at
each x.
11.1 Empirical Models
To determine the equation of the line that "best" fits the
data, we note that for each data point there will be a
deviation, or error, between the y-value at that point
and the y-value of the point on the line that is directly above
or below it.
Those deviations, in the example, y₁ - 5.2, y₂ - 8.9,
y₃ - 11.7, and y₄ - 16.8, will be positive or negative,
depending on the location of the line. (Here x = 1, 2, 3, 4
codes the years 1996, 2001, 2006, 2011, and y₁, …, y₄ are
the heights of the line at those x-values.)
11.1 Empirical Models
[Figure: the revenue data with a candidate line and its y-deviations]
11.2 Simple Linear Regression
 We wish to fit these data points with a line
y = β₀ + β₁x
that uses values of β₀ and β₁ that, somehow, minimize
the deviations in order to have a good fit.
 One way of minimizing the deviations is based on the
least-squares assumption: choose the line that minimizes
the sum of the squares of the y-deviations.
11.2 Simple Linear Regression
Note that squaring each y-deviation gives us a sum of
nonnegative terms. Were we to simply add the deviations,
positive and negative deviations would cancel each
other out.
 Using the least-squares assumption with the yearly
revenue data, we want to minimize
(y₁ - 5.2)² + (y₂ - 8.9)² + (y₃ - 11.7)² + (y₄ - 16.8)²
11.2 Simple Linear Regression
Also, since the points (1, y₁), (2, y₂), (3, y₃), and (4, y₄)
must be solutions of y = β₁x + β₀, it follows that
y₁ = β₁(1) + β₀ = β₁ + β₀
y₂ = β₁(2) + β₀ = 2β₁ + β₀
y₃ = β₁(3) + β₀ = 3β₁ + β₀
y₄ = β₁(4) + β₀ = 4β₁ + β₀
 Substituting these expressions for each yᵢ in the previous
sum, we now have a function of two variables:
L(β₁, β₀) = (β₁ + β₀ - 5.2)² + (2β₁ + β₀ - 8.9)²
          + (3β₁ + β₀ - 11.7)² + (4β₁ + β₀ - 16.8)²
11.2 Simple Linear Regression
Thus, to find the regression line for the given set of
data, we must find the values of β₀ and β₁ that minimize
the function L given by the sum above.
We first find ∂L/∂β₀ and ∂L/∂β₁:
∂L/∂β₀ = 2(β₁ + β₀ - 5.2) + 2(2β₁ + β₀ - 8.9) + 2(3β₁ + β₀ - 11.7)
       + 2(4β₁ + β₀ - 16.8) = 20β₁ + 8β₀ - 85.2
∂L/∂β₁ = 2(β₁ + β₀ - 5.2)(1) + 2(2β₁ + β₀ - 8.9)(2)
       + 2(3β₁ + β₀ - 11.7)(3) + 2(4β₁ + β₀ - 16.8)(4)
       = 60β₁ + 20β₀ - 250.6
11.2 Simple Linear Regression
We set the derivatives equal to 0 and solve the resulting
system:
20β₁ + 8β₀ - 85.2 = 0
60β₁ + 20β₀ - 250.6 = 0
It can be shown that the solution to this system is
β₀ = 1.25, β₁ = 3.76
We leave it to the student to complete the D-test
(second-derivative test) to verify that (β₁, β₀) = (3.76, 1.25)
does, in fact, yield a minimum of L.
11.2 Simple Linear Regression
There is no need to compute L(3.76, 1.25). The values
of β₁ and β₀ are all we need to determine
y = β₁x + β₀. The regression line is
y = 3.76x + 1.25
The graph of this "best-fit"
regression line together
with the data points is
shown below.
Compare it to the
graphs before.
11.2 Simple Linear Regression
Now we can use the regression equation to predict the car
rental company's yearly revenue in 2016 (x = 5):
y = 3.76(5) + 1.25 = 20.05, or about $20.05 million.
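As a quick numerical check (a sketch, not part of the original example), the same line can be recovered with NumPy, whose polyfit minimizes the same sum of squared deviations L:

```python
# Sketch: refit the revenue data numerically (x = 1..4 codes 1996..2011).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.2, 8.9, 11.7, 16.8])

# polyfit with deg=1 minimizes the least-squares criterion L above.
beta1, beta0 = np.polyfit(x, y, deg=1)
print(beta1, beta0)           # 3.76 1.25
print(beta1 * 5 + beta0)      # predicted 2016 revenue: 20.05
```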
The case of simple linear regression considers a single
regressor or predictor x and a dependent or response
variable Y.
 The expected value of Y at each level of x is
E(Y | x) = β₀ + β₁x
 We assume that each observation, Y, can be described by
the model
Y = β₀ + β₁x + ε
11.2 Simple Linear Regression
• Suppose that we have n pairs of observations (x₁, y₁),
(x₂, y₂), …, (xₙ, yₙ).
 Figure: Deviations of the data from the estimated
regression model.
11.2 Simple Linear Regression
• The method of least squares is used to estimate the
parameters β₀ and β₁ by minimizing the sum of the
squares of the vertical deviations in the figure.
 Figure: Deviations of the data from the estimated
regression model.
11.2 Simple Linear Regression
• The n observations in the sample can be expressed as
yᵢ = β₀ + β₁xᵢ + εᵢ,  i = 1, …, n
• The sum of the squares of the deviations of the
observations from the true regression line is
L = Σ_{i=1}^n εᵢ² = Σ_{i=1}^n (yᵢ - β₀ - β₁xᵢ)²
• The least squares estimators of β₀ and β₁, say β̂₀ and β̂₁,
must satisfy
∂L/∂β₀ |_{β̂₀,β̂₁} = -2 Σ_{i=1}^n (yᵢ - β̂₀ - β̂₁xᵢ) = 0
∂L/∂β₁ |_{β̂₀,β̂₁} = -2 Σ_{i=1}^n (yᵢ - β̂₀ - β̂₁xᵢ) xᵢ = 0
11.2 Simple Linear Regression
Simplifying these two equations yields
n β̂₀ + β̂₁ Σ_{i=1}^n xᵢ = Σ_{i=1}^n yᵢ
β̂₀ Σ_{i=1}^n xᵢ + β̂₁ Σ_{i=1}^n xᵢ² = Σ_{i=1}^n xᵢyᵢ        (*)
 Equations (*) are called the least squares normal equations.
The solution to the normal equations results in the least
squares estimators β̂₀ and β̂₁.
11.2 Simple Linear Regression
Definition: The least squares estimates of the intercept
and slope in the simple linear regression model are
β̂₁ = [Σ_{i=1}^n xᵢyᵢ - (Σ_{i=1}^n xᵢ)(Σ_{i=1}^n yᵢ)/n] / [Σ_{i=1}^n xᵢ² - (Σ_{i=1}^n xᵢ)²/n]
   = (Σ_{i=1}^n xᵢyᵢ - n x̄ ȳ) / (Σ_{i=1}^n xᵢ² - n x̄²) = S_xy / S_xx        (***)
β̂₀ = ȳ - β̂₁ x̄,  where ȳ = (1/n) Σ_{i=1}^n yᵢ and x̄ = (1/n) Σ_{i=1}^n xᵢ
The fitted or estimated regression line is therefore
ŷ = β̂₀ + β̂₁x
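As an illustration, a minimal Python sketch of these formulas (the function name least_squares_fit is ours, not from the text):

```python
# Sketch: least squares estimates via the S_xy / S_xx formulas above.
import numpy as np

def least_squares_fit(x, y):
    """Return (beta0_hat, beta1_hat) for the regression of y on x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    s_xx = np.sum(x * x) - np.sum(x) ** 2 / n          # S_xx
    s_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # S_xy
    beta1 = s_xy / s_xx
    beta0 = y.mean() - beta1 * x.mean()
    return beta0, beta1
```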
11.2 Simple Linear Regression
 Note that each pair of observations satisfies the
relationship
yᵢ = β̂₀ + β̂₁xᵢ + eᵢ,  i = 1, …, n
where eᵢ = yᵢ - ŷᵢ is called the residual. The residual
describes the error in the fit of the model to the ith
observation yᵢ. Later in this chapter we will use the
residuals to provide information about the adequacy of the
fitted model.
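Continuing the sketch above (for any data arrays x and y), the residuals fall out directly from the fitted line:

```python
# Sketch: fitted values and residuals e_i = y_i - y_hat_i.
beta0, beta1 = least_squares_fit(x, y)
y_hat = beta0 + beta1 * np.asarray(x, dtype=float)  # fitted values
e = np.asarray(y, dtype=float) - y_hat              # residuals
```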
11.2 Simple Linear Regression
Example 2: Fit a simple linear regression model to the
oxygen purity data in the table (oxygen purity y versus
hydrocarbon level x).
[Table of oxygen purity data not reproduced in this text.]
11.2 Simple Linear Regression
Solution:
The following quantities may be computed:
n = 20;  Σ_{i=1}^20 xᵢ = 23.92;  Σ_{i=1}^20 yᵢ = 1,843.21
x̄ = 1.196;  ȳ = 92.1605
Σ_{i=1}^20 yᵢ² = 170,044.5321;  Σ_{i=1}^20 xᵢ² = 29.2892
Σ_{i=1}^20 xᵢyᵢ = 2,214.6566
11.2 Simple Linear Regression
Solution (continued):
S_xx = Σ_{i=1}^20 xᵢ² - (Σ_{i=1}^20 xᵢ)²/20 = 29.2892 - (23.92)²/20 = 0.68088
S_xy = Σ_{i=1}^20 xᵢyᵢ - (Σ_{i=1}^20 xᵢ)(Σ_{i=1}^20 yᵢ)/20
     = 2,214.6566 - (23.92)(1,843.21)/20 = 10.17744
11.2 Simple Linear Regression
Solution (continued):
Therefore, the least squares estimates of the slope and
intercept are
β̂₁ = S_xy / S_xx = 14.94748;  β̂₀ = ȳ - β̂₁x̄ = 74.28331
The fitted simple linear regression model (with the
coefficients reported to three decimal places) is
ŷ = 74.283 + 14.947x
This model is plotted in Fig. 11.2, along with the sample data.
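The hand computation can be reproduced from the reported summary statistics alone; a sketch:

```python
# Sketch: oxygen purity fit from the summary statistics reported above.
n = 20
sum_x, sum_y = 23.92, 1843.21
sum_x2, sum_xy = 29.2892, 2214.6566

s_xx = sum_x2 - sum_x ** 2 / n            # 0.68088
s_xy = sum_xy - sum_x * sum_y / n         # 10.17744
beta1 = s_xy / s_xx                       # 14.94748
beta0 = sum_y / n - beta1 * (sum_x / n)   # 74.28331
```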
11.2 Simple Linear Regression
 Figure 11.2: Scatter plot of oxygen purity y versus
hydrocarbon level x and the fitted regression model
ŷ = 74.283 + 14.947x
11.2 Simple Linear Regression
Practical Interpretation: Using the regression model, we
would predict an oxygen purity of ŷ = 89.23% when the
hydrocarbon level is x = 1%. The 89.23% purity may be
interpreted as an estimate of the true population mean
purity when x = 1%, or as an estimate of a new observation
when x = 1%. These estimates are, of course, subject to
error; that is, it is unlikely that a future observation on purity
would be exactly 89.23% when the hydrocarbon level is
1%. In subsequent sections, we will see how to use
confidence intervals and prediction intervals to describe the
error in estimation from a regression model.
11.2 Simple Linear Regression
Example 3:
To study the relationship between ticket prices and the
number of passengers on each flight, researchers examined
11 commercial flights, giving the following data table.
Find the regression line for the number of passengers in
terms of ticket price.

Cost (1000$), x:  4.28  4.08  4.17  4.48  4.30  4.82  4.70  5.11  5.13  5.64  5.56
Passengers, y:      61    63    69    70    74    76    81    86    91    95    97

Answer: ŷ = -24.53 + 21.67x
11.2 Simple Linear Regression
 Estimating σ²
The error sum of squares is
SS_E = Σ_{i=1}^n eᵢ² = Σ_{i=1}^n (yᵢ - ŷᵢ)²
 An unbiased estimator of σ² is
σ̂² = SS_E / (n - 2)        (**)
 where SS_E can be easily computed using
SS_E = SS_T - β̂₁ S_xy;  SS_T = Σ_{i=1}^n (yᵢ - ȳ)² = Σ_{i=1}^n yᵢ² - n ȳ²
11.3 Properties of the Least Squares
Estimators
2
•  
Slope Properties: E 1 1 ; V 1    S xx

• Intercept Properties
 2

 
  ; V 
E  0 0
   
0
2 1 x
 

 n S xx 
 In simple linear regression the estimated standard error of
the slope and intercept are

ˆ 2  1 x 2

 
se ˆ1 
S xx
 
ˆ
and se  0  ˆ  
2

 n S

xx 
respectively, where is computed from
2
 **
11.4 Hypothesis Tests in Simple Linear
Regression
11.4.1. Use of t-Tests:
 Suppose we wish to test
H₀: β₁ = β₁,₀  versus  H₁: β₁ ≠ β₁,₀
An appropriate test statistic is
t₀ = (β̂₁ - β₁,₀) / se(β̂₁) = (β̂₁ - β₁,₀) / √(σ̂²/S_xx)
Reject the null hypothesis if |t₀| > t_{α/2,n-2}.
 Similarly, to test
H₀: β₀ = β₀,₀  versus  H₁: β₀ ≠ β₀,₀
use the test statistic
t₀ = (β̂₀ - β₀,₀) / se(β̂₀) = (β̂₀ - β₀,₀) / √(σ̂² [1/n + x̄²/S_xx])
and reject H₀ if |t₀| > t_{α/2,n-2}.
11.4 Hypothesis Tests in Simple Linear
Regression
An important special case of the preceding hypotheses is
H₀: β₁ = 0  versus  H₁: β₁ ≠ 0
These hypotheses relate to the significance of
regression.
Failure to reject H₀ is equivalent to concluding
that there is no linear relationship between x and Y.
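A sketch of the significance-of-regression t-test, assuming the helper functions from the earlier sketches and SciPy for the critical value:

```python
# Sketch: t-test of H0: beta1 = 0 (significance of regression).
from scipy import stats

def significance_of_regression(x, y, alpha=0.05):
    n = len(x)
    _, beta1 = least_squares_fit(x, y)
    se_b1, _ = standard_errors(x, y)
    t0 = beta1 / se_b1
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t0, t_crit, abs(t0) > t_crit  # True in last slot => reject H0
```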
11.4 Hypothesis Tests in Simple Linear
Regression
Figure 1:
The hypothesis
H 0 : 1 0
is not rejected.

Figure 2:
The hypothesis
H 0 : 1 0
is rejected.
11.4 Hypothesis Tests in Simple Linear
Regression
Example 4: Test for significance of regression using the
model for the oxygen purity data from the table:
a/ β₁, at α = 0.01
b/ β₀, at α = 0.01
11.4 Hypothesis Tests in Simple Linear
Regression
Solution:
a/ The hypotheses are H₀: β₁ = 0 versus H₁: β₁ ≠ 0
We have β̂₁ = 14.947; n = 20; S_xx = 0.68088; σ̂² = 1.18
Test statistic:
t₀ = β̂₁ / se(β̂₁) = β̂₁ / √(σ̂²/S_xx) = 14.947 / √(1.18/0.68088) = 11.35
Because t₀ = 11.35 > t_{0.005,18} = 2.88, we reject H₀.
11.4 Hypothesis Tests in Simple Linear
Regression
Solution:
b/ The hypotheses are H₀: β₀ = 0 versus H₁: β₀ ≠ 0
Test statistic: t₀ = 46.62
Because t₀ = 46.62 > t_{0.005,18} = 2.88, we reject H₀.
11.4 Hypothesis Tests in Simple Linear
Regression
11.4.2. Analysis of Variance Approach to Test Significance of Regression
Suppose we wish to test
H₀: β₁ = 0  versus  H₁: β₁ ≠ 0
Test for significance of regression:
F₀ = (SS_R / 1) / (SS_E / (n - 2)) = MS_R / MS_E
where
SS_R = Σ_{i=1}^n (ŷᵢ - ȳ)²;  SS_E = Σ_{i=1}^n (yᵢ - ŷᵢ)²
SS_T = Σ_{i=1}^n (yᵢ - ȳ)² = SS_R + SS_E
Reject H₀ if F₀ > F_{α,1,n-2}, where F_{α,1,n-2} is the upper
percentage point of the F distribution (see Appendix VI).
11.4 Hypothesis Tests in Simple Linear
Regression
ANOVA table

Source of     Sum of            Degrees of   Mean Square          F₀
Variation     Squares           Freedom
Regression    SS_R = β̂₁ S_xy   1            MS_R = SS_R / 1      F₀ = MS_R / MS_E
Error         SS_E              n - 2        MS_E = SS_E / (n-2)
Total         SS_T              n - 1
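A sketch assembling these ANOVA quantities in Python (function name ours; least_squares_fit is from the earlier sketch):

```python
# Sketch: ANOVA F-test for significance of regression.
import numpy as np
from scipy import stats

def anova_f_test(x, y, alpha=0.05):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    _, beta1 = least_squares_fit(x, y)
    s_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    ss_t = np.sum((y - y.mean()) ** 2)    # total sum of squares
    ss_r = beta1 * s_xy                   # regression sum of squares
    ss_e = ss_t - ss_r                    # error sum of squares
    f0 = (ss_r / 1) / (ss_e / (n - 2))
    f_crit = stats.f.ppf(1 - alpha, 1, n - 2)
    return f0, f_crit, f0 > f_crit        # True in last slot => reject H0
```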
11.4 Hypothesis Tests in Simple Linear
Regression
Example 5: We will use the analysis of variance approach to test for
significance of regression using the oxygen purity data model from Example 2.
Recall that
SS_T = 173.38;  β̂₁ = 14.947;  S_xy = 10.17744;  n = 20
The regression sum of squares is
SS_R = β̂₁ S_xy = (14.947)(10.17744) = 152.13
and the error sum of squares is
SS_E = SS_T - SS_R = 21.25
The test statistic is F₀ = MS_R / MS_E = 128.86, for which we find
that the P-value ≈ 1.23 × 10⁻⁹, so we conclude that β₁ is not zero.
11.4 Hypothesis Tests in Simple Linear
Regression
Note that:
-The analysis of variance procedure for testing for
significance of regression is equivalent to the t-test.
That is, either procedure will lead to the same
conclusions.
-The t-test is somewhat more flexible in that it would
allow testing against a one-sided alternative
hypothesis, while the F-test is restricted to a two-
sided alternative.
11.8 Correlation
We assume that the joint distribution of Xᵢ and Yᵢ is the
bivariate normal distribution presented in Chapter 5, where
μ_Y and σ_Y² are the mean and variance of Y, μ_X and σ_X² are
the mean and variance of X, and ρ is the correlation
coefficient between Y and X. Recall that the correlation
coefficient is defined as
ρ = σ_XY / (σ_X σ_Y)
where σ_XY is the covariance between Y and X.
 The conditional distribution of Y for a given value of X = x is
f_{Y|x}(y) = [1 / (√(2π) σ_{Y|x})] exp{ -(1/2) [(y - β₀ - β₁x) / σ_{Y|x}]² }
where β₀ = μ_Y - μ_X ρ (σ_Y/σ_X) and β₁ = ρ (σ_Y/σ_X)
11.8 Correlation
It is possible to draw inferences about the correlation
coefficient ρ in this model. The estimator of ρ is the sample
correlation coefficient
R = Σ_{i=1}^n (Xᵢ - X̄)(Yᵢ - Ȳ) / [Σ_{i=1}^n (Xᵢ - X̄)² Σ_{i=1}^n (Yᵢ - Ȳ)²]^(1/2)
  = S_XY / √(S_XX · SS_T)
 Note that β̂₁ = √(SS_T / S_XX) · R
 We may also write:
R² = β̂₁² S_XX / SS_T = β̂₁ S_XY / SS_T = SS_R / SS_T
11.8 Correlation
Properties:
 -1 ≤ R ≤ 1
 R > 0: positive correlation
 R < 0: negative correlation
 R = 0: no (linear) correlation
11.8 Correlation
Case 1:
It is often useful to test the hypotheses
H₀: ρ = 0 (there is no linear relationship)
H₁: ρ ≠ 0 (there is a linear relationship)
Test statistic:
t₀ = R √(n - 2) / √(1 - R²)
Reject H₀ if |t₀| > t_{α/2,n-2}.
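A sketch of this test (np.corrcoef computes the sample correlation coefficient r):

```python
# Sketch: t-test of H0: rho = 0.
import numpy as np
from scipy import stats

def corr_t_test(x, y, alpha=0.05):
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    t0 = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t0, t_crit, abs(t0) > t_crit
```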
11.8 Correlation
Case 2:
It is often useful to test the hypotheses
H₀: ρ = ρ₀
H₁: ρ ≠ ρ₀
Test statistic:
z₀ = [arctanh(R) - arctanh(ρ₀)] √(n - 3)
where
tanh(u) = (eᵘ - e⁻ᵘ) / (eᵘ + e⁻ᵘ)
Reject H₀ if |z₀| > z_{α/2}.
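A sketch using NumPy's arctanh for the Fisher z-transform:

```python
# Sketch: z-test of H0: rho = rho0 via the Fisher z-transform.
import numpy as np
from scipy import stats

def corr_z_test(x, y, rho0, alpha=0.05):
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    z0 = (np.arctanh(r) - np.arctanh(rho0)) * np.sqrt(n - 3)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return z0, z_crit, abs(z0) > z_crit
```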
11.8 Correlation
 The approximate 100(1 - α)% confidence interval for ρ is
tanh(arctanh r - z_{α/2}/√(n - 3)) ≤ ρ ≤ tanh(arctanh r + z_{α/2}/√(n - 3))
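A sketch of the interval, inverting the z-transform with tanh:

```python
# Sketch: approximate 100(1 - alpha)% confidence interval for rho.
import numpy as np
from scipy import stats

def corr_confidence_interval(r, n, alpha=0.05):
    z = stats.norm.ppf(1 - alpha / 2)
    half = z / np.sqrt(n - 3)
    return np.tanh(np.arctanh(r) - half), np.tanh(np.arctanh(r) + half)
```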
Example: Use the given data to find the equation of the
regression line and the value of the linear correlation
coefficient r, and test the hypothesis that ρ = 0 using α = 0.05
and α = 0.1.
a/
x:  2   4   5   6
y:  7  11  13  20

b/
Cost:    9   2   3   4   2   5   9  10
Number: 85  52  55  68  67  86  83  73

You might also like