Week08 Correlation and Regression

SIS 1037Y(1)
Analytical techniques for Information Systems
Lecture 10
Correlation and Regression
Topics
 Review and Preview
 Correlation
 Regression
 Prediction Intervals and Variation
 Multiple Regression
 Nonlinear Regression

SIS 1037Y 2 2020/2021


Review
 Previously we looked at methods for making
inferences from two samples.
 We considered two dependent samples, with each
value of one sample somehow paired with a value
from the other sample, and we illustrated the use of
hypothesis tests for claims about the population of
differences.
 We also illustrated the construction of confidence-
interval estimates of the mean of all such differences.
 We now consider paired sample data, but the
objective is fundamentally different from that seen
before.

SIS 1037Y 3 2020/2021


Preview
 Methods for determining whether a correlation, or
association, between two variables exists and whether the
correlation is linear.
 For linear correlations, we can identify an equation that
best fits the data and we can use that equation to predict
the value of one variable given the value of the other
variable.
 Methods for analyzing differences between predicted
values and actual values.
 In addition, we consider methods for identifying linear
equations for correlations among three or more variables.
 We conclude with some basic methods for developing a
mathematical model that can be used to describe
nonlinear correlations between two variables.

SIS 1037Y 4 2020/2021


Topics
 Review and Preview
 Correlation
 Regression
 Prediction Intervals and Variation
 Multiple Regression
 Nonlinear Regression

SIS 1037Y 5 2020/2021


Key Concept
 The linear correlation coefficient, r, is a number
that measures how well paired sample data fit a
straight-line pattern when graphed.
 Using paired sample data (sometimes called bivariate
data), we find the value of r (usually using
technology), then we use that value to conclude that
there is (or is not) a linear correlation between the
two variables.
 We consider only linear relationships, which means
that when graphed, the points approximate a
straight-line pattern.
 We then discuss methods of hypothesis testing for
correlation.
SIS 1037Y 6 2020/2021
Part 1: Basic Concepts of Correlation
 Definition
 A correlation exists between two variables
when the values of one variable are somehow
associated with the values of the other.
 A linear correlation exists between two
variables when there is a correlation and the
plotted points of paired data result in a
pattern that can be approximated by a
straight line.

SIS 1037Y 7 2020/2021


Exploring the Data
 We can often see a relationship between two
variables by constructing a scatterplot.
 The following slides show scatterplots with different
characteristics.

SIS 1037Y 8 2020/2021


Scatterplots of Paired Data

SIS 1037Y 9 2020/2021


Scatterplots of Paired Data

SIS 1037Y 10 2020/2021


Scatter Plot Example

Volume per day    Cost per day
23                125
26                140
29                146
33                160
38                167
42                170
50                188
55                195
60                200

[Scatterplot: Cost per Day vs. Production Volume, with Cost per Day on the y-axis and Volume per Day on the x-axis]

SIS 1037Y 11 2020/2021


Covariance and Correlation:
Covariance and correlation measure linear association between
two variables, say X and Y.

Covariance:

Population parameter:   σ_XY = Σ (Yi − μY)(Xi − μX) / N

The population parameter describes linear association between
X and Y for the population.

Estimator/sample statistic:   s_XY = Σ (Yi − Ȳ)(Xi − X̄) / (n − 1)

The sample statistic or estimator is used with sample data to
estimate the linear association between X and Y for the population.
Covariance
1. No causal effect is implied
2. Create deviations for Y and deviations for X for each
observation.
3. Form the products of these deviations.
4. The graph that follows illustrates these deviations.
   a. In Quadrant I, the products of deviations are positive.
   b. In Quadrant II, the products of deviations are negative.
   c. Covariance: on average, are the products of
deviations positive or negative?
5. Covariance is not widely used, because the units are
often confusing.

SIS 1037Y 13 2020/2021


Used Car Prices - Sample of 10 Cars

[Scatterplot: Price (y) vs. Mileage (x), divided into four quadrants at the means (X̄, Ȳ):
Quadrant I: (Xi − X̄) > 0 and (Yi − Ȳ) > 0;
Quadrant II: (Xi − X̄) < 0 and (Yi − Ȳ) > 0;
Quadrant III: (Xi − X̄) < 0 and (Yi − Ȳ) < 0;
Quadrant IV: (Xi − X̄) > 0 and (Yi − Ȳ) < 0]
SIS 1037Y 14 2020/2021
Interpreting Covariance
 Covariance between two variables:
cov(X,Y) > 0: X and Y tend to move in the same direction
cov(X,Y) < 0: X and Y tend to move in opposite directions
cov(X,Y) = 0: X and Y have no linear association (uncorrelated;
note that zero covariance does not by itself imply independence)

 The covariance has a major flaw:
 It is not possible to determine the relative strength of the
relationship from the size of the covariance

SIS 1037Y 15 2020/2021
Smoking and Lung Capacity
 Suppose, for example, we wanted to investigate the
relationship between cigarette smoking and lung
capacity
 We might ask a group of people about their smoking
habits, and measure their lung capacities

SIS 1037Y 16 2020/2021


Smoking and Lung Capacity

Cigarettes (X) Lung Capacity (Y)


0 45
5 42
10 33
15 31
20 29

SIS 1037Y 17 2020/2021


Smoking and Lung Capacity
 We can easily enter these data and produce a scatterplot.

[Scatterplot: Lung Capacity (y) vs. Smoking (x) for the five sample points]

SIS 1037Y 18 2020/2021


Smoking and Lung Capacity
 We can see easily from the graph that as smoking
goes up, lung capacity tends to go down.
 The two variables covary in opposite directions.
 We now examine two statistics, covariance and
correlation, for quantifying how variables covary.

SIS 1037Y 19 2020/2021


Covariance
 When two variables covary in opposite directions, as
smoking and lung capacity do, values tend to be on
opposite sides of the group mean. That is, when
smoking is above its group mean, lung capacity tends
to be below its group mean.
 Consequently, by averaging the product of deviation
scores, we can obtain a measure of how the variables
vary together.

SIS 1037Y 20 2020/2021


Calculating Covariance

Cigarettes (X) | dX  | dY | dX·dY | Lung Capacity (Y)
0              | −10 | +9 | −90   | 45
5              | −5  | +6 | −30   | 42
10             | 0   | −3 | 0     | 33
15             | +5  | −5 | −25   | 31
20             | +10 | −7 | −70   | 29
Sum = 50       |     |    | −215  | Sum = 180

SIS 1037Y 21 2020/2021
Calculating Covariance
 So we obtain

s_XY = (1/4)(−215) = −53.75

SIS 1037Y 22 2020/2021
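The computation above can be sketched in a few lines of Python; this is a minimal illustration using the five cigarettes / lung-capacity pairs from the slides:

```python
# Sample covariance via deviation products, using the
# cigarettes / lung-capacity data from the slides.
x = [0, 5, 10, 15, 20]          # cigarettes (X)
y = [45, 42, 33, 31, 29]        # lung capacity (Y)

n = len(x)
x_bar = sum(x) / n              # 10
y_bar = sum(y) / n              # 36

# Sum of the products of deviations, divided by n - 1
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
print(s_xy)                     # -53.75
```

The negative sign confirms that the two variables move in opposite directions.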


Example Covariance

x | y | x − x̄ | y − ȳ | (x − x̄)(y − ȳ)
0 | 3 | −3    | 0     | 0
2 | 2 | −1    | −1    | 1
3 | 4 | 0     | 1     | 0
4 | 0 | 1     | −3    | −3
6 | 6 | 3     | 3     | 9
x̄ = 3, ȳ = 3, Σ(x − x̄)(y − ȳ) = 7

cov(x, y) = Σ(xi − x̄)(yi − ȳ) / (n − 1) = 7 / 4 = 1.75

What does this number tell us?

SIS 1037Y 23 2020/2021
Problem with Covariance:
 The value obtained by covariance is dependent on the
size of the data’s standard deviations: if large, the value
will be greater than if small… even if the relationship
between x and y is exactly the same in the large versus
small standard deviation datasets.

SIS 1037Y 24 2020/2021


Example of how the covariance value depends on variance

High variance data                    Low variance data
Subject  x    y    x err · y err      x   y   x err · y err
1        101  100  2500               54  53  9
2        81   80   900                53  52  4
3        61   60   100                52  51  1
4        51   50   0                  51  50  0
5        41   40   100                50  49  1
6        21   20   900                49  48  4
7        1    0    2500               48  47  9
Mean     51   50                      51  50

Sum of x err · y err: 7000            Sum of x err · y err: 28
Covariance: 1166.67                   Covariance: 4.67

SIS 1037Y 25 2020/2021
Solution: Pearson's r
 Covariance alone does not tell us the strength of the relationship
 Solution: standardise this measure

 Pearson's r standardises the covariance value: it divides the
covariance by the product of the standard deviations of X and Y:

r_xy = cov(x, y) / (s_x · s_y)

SIS 1037Y 26 2020/2021
Pearson's r continued

cov(x, y) = Σ (xi − x̄)(yi − ȳ) / (n − 1)

r_xy = Σ (xi − x̄)(yi − ȳ) / [(n − 1) · s_x · s_y]

r_xy = Σ Z_xi · Z_yi / (n − 1)
SIS 1037Y 27 2020/2021
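A minimal sketch of this standardisation, reusing the small five-point data set from the covariance example (`statistics.stdev` computes the sample standard deviation with n − 1):

```python
import statistics

# Pearson's r = cov(x, y) / (s_x * s_y), using the small
# five-point example from the earlier covariance slide.
x = [0, 2, 3, 4, 6]
y = [3, 2, 4, 0, 6]

n = len(x)
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)  # 1.75

r = cov_xy / (statistics.stdev(x) * statistics.stdev(y))
print(round(r, 2))   # 0.35 -- same sign as the covariance, but unit-free
```

Unlike the covariance, r stays the same if x or y is rescaled (e.g. measured in different units).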
Coefficient of Correlation: r
 Measures the relative strength of the linear relationship
between two numerical variables
 The correlation coefficient is also known as the product-
moment coefficient of correlation or Pearson's
correlation.

SIS 1037Y 2020/2021


28
Requirements for Linear Correlation
1. The sample of paired (x, y) data is a simple random
sample of quantitative data.
2. Visual examination of the scatterplot must confirm
that the points approximate a straight-line pattern.
3. Any outliers must be removed if they are known to
be errors. The effects of any other outliers should
be considered by calculating r with and without the
outliers included.

SIS 1037Y 29 2020/2021


Notation for the
Linear Correlation Coefficient
n      number of pairs of sample data
Σ      denotes the addition of the items indicated
Σx     sum of all x-values
Σx²    each x-value is squared, then those squares are added
(Σx)²  the x-values are added, then that total is squared
Σxy    each x-value is multiplied by its corresponding y-value, then those products are summed
r      linear correlation coefficient for sample data
ρ      linear correlation coefficient for a population of paired data

SIS 1037Y 30 2020/2021


Formula
The linear correlation coefficient r measures the strength of a
linear relationship between the paired values in a sample.
Here are the formulas:

r = Σ(x − x̄)(y − ȳ) / √( [Σ(x − x̄)²] · [Σ(y − ȳ)²] )

r = (n·Σxy − Σx·Σy) / √( [n(Σx²) − (Σx)²] · [n(Σy²) − (Σy)²] )

r = ( Σxy/n − x̄·ȳ ) / √( (Σx²/n − x̄²) · (Σy²/n − ȳ²) )
SIS 1037Y 31 2020/2021
Correlation:
Correlation measures the degree of linear association between
two variables, say X and Y. There are no units – dividing
covariance by the standard deviations eliminates units.
Correlation is a pure number. The range is from -1 to +1. If the
correlation coefficient is -1, it means perfect negative linear
association; +1 means perfect positive linear association.

Population parameter:   ρ_XY = Cov(X, Y) / (σ_X · σ_Y)

Estimator/sample statistic:   r_XY = s_XY / (s_X · s_Y)
The sample statistic or estimator is used with sample data to
estimate the linear association between X and Y for the population.
Computing a Correlation

Cigarettes (X) | X²  | XY   | Y²   | Lung Capacity (Y)
0              | 0   | 0    | 2025 | 45
5              | 25  | 210  | 1764 | 42
10             | 100 | 330  | 1089 | 33
15             | 225 | 465  | 961  | 31
20             | 400 | 580  | 841  | 29
Σ = 50         | 750 | 1585 | 6680 | Σ = 180

SIS 1037Y 33 2020/2021
Computing a Correlation

r_xy = [ (5)(1585) − (50)(180) ] / √( [(5)(750) − 50²] · [(5)(6680) − 180²] )

     = (7925 − 9000) / √( (3750 − 2500)(33400 − 32400) )

     = −1075 / √(1250 · 1000) = −0.9615
SIS 1037Y 34 2020/2021
Conclusion

r_xy = −0.96

 r_xy = −0.96 indicates a very strong negative linear
association: in this sample, greater smoking exposure goes
with diminished lung capacity

 Greater smoking exposure implies greater likelihood
of lung damage

SIS 1037Y 35 2020/2021
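The value −0.9615 can be reproduced with the computational formula for r; a minimal sketch:

```python
import math

# Linear correlation coefficient via the computational formula
# r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2)),
# using the cigarettes / lung-capacity data.
x = [0, 5, 10, 15, 20]
y = [45, 42, 33, 31, 29]
n = len(x)

sx, sy = sum(x), sum(y)                      # 50, 180
sxx = sum(v * v for v in x)                  # 750
syy = sum(v * v for v in y)                  # 6680
sxy = sum(a * b for a, b in zip(x, y))       # 1585

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(round(r, 4))    # -0.9615
```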


Calculation Example: Tree Height vs. Trunk Diameter

Height y | Trunk diameter x | xy   | y²   | x²
35       | 8                | 280  | 1225 | 64
49       | 9                | 441  | 2401 | 81
27       | 7                | 189  | 729  | 49
33       | 6                | 198  | 1089 | 36
60       | 13               | 780  | 3600 | 169
21       | 7                | 147  | 441  | 49
45       | 11               | 495  | 2025 | 121
51       | 12               | 612  | 2601 | 144
Σy = 321 | Σx = 73          | Σxy = 3142 | Σy² = 14111 | Σx² = 713

SIS 1037Y 36 2020/2021


Calculation Example (continued)

r = [n·Σxy − (Σx)(Σy)] / √( [n(Σx²) − (Σx)²] · [n(Σy²) − (Σy)²] )

  = [8(3142) − (73)(321)] / √( [8(713) − (73)²] · [8(14111) − (321)²] )

  = 0.886

[Scatterplot: Tree Height, y against Trunk Diameter, x, with the fitted line]

r = 0.886 → relatively strong positive linear association between x and y

SIS 1037Y 37 2020/2021
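The same computational formula reproduces r directly from the column sums in the table; a quick check:

```python
import math

# Correlation for the tree height / trunk diameter data,
# computed from the column sums on the slide.
n = 8
sum_x, sum_y = 73, 321
sum_xy, sum_x2, sum_y2 = 3142, 713, 14111

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
print(round(r, 3))   # 0.886
```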


Interpreting r
 Using the table of critical values: if the absolute
value of the computed r exceeds the value in the
table, conclude that there is a linear correlation.
Otherwise, there is not sufficient evidence to
support the conclusion of a linear correlation.
 Using Software: If the computed P-value is less
than or equal to the significance level, conclude
that there is a linear correlation. Otherwise, there
is not sufficient evidence to support the
conclusion of a linear correlation.

SIS 1037Y 38 2020/2021


[Table: critical values of the linear correlation coefficient r]

SIS 1037Y 39 2020/2021
Caution
 Know that the methods here apply to a linear
correlation.
 If we conclude that there does not appear to
be linear correlation, know that it is possible
that there might be some other association
that is not linear.

SIS 1037Y 40 2020/2021


Properties of the Linear Correlation
Coefficient r
1. – 1 ≤ r ≤ 1
2. If all values of either variable are converted to a
different scale, the value of r does not change.
3. The value of r is not affected by the choice of x and
y. Interchange all x- and y-values and the value of r
will not change.
4. r measures strength of a linear relationship.
5. r is very sensitive to outliers, which can dramatically
affect the value of r.

SIS 1037Y 41 2020/2021


Example

The paired shoe / height data from five males are listed
below. Use a computer or a calculator to find the value of
the correlation coefficient r.

SIS 1037Y 42 2020/2021


Example - Continued
Requirement Check: The data are a simple random sample
of quantitative data, the plotted points appear to roughly
approximate a straight-line pattern, and there are no
outliers.

SIS 1037Y 43 2020/2021


Is There a Linear Correlation?

 Using computer/calculator, we find that for


the shoe and height example, r = 0.591, P-
Value = 0.294.
 We now proceed to interpret its meaning.
 Our goal is to decide whether or not there
appears to be a linear correlation between
shoe print lengths and heights of people.
 We can base our interpretation on a P-value or
a critical value from our table.

SIS 1037Y 44 2020/2021


Interpreting the Linear Correlation
Coefficient r
 Using computer software:
 If the P-value is less than the level of significance,
conclude there is a linear correlation.
 Using technology, our example gives a P-value
of 0.294.
 Because that P-value is not less than the
significance level of 0.05, we conclude there is
not sufficient evidence to support the conclusion
that there is a linear correlation between shoe
print length and heights of people.
SIS 1037Y 45 2020/2021
Interpreting the Linear
Correlation Coefficient r
Using Table:
Our table yields r = 0.878 for five pairs of data and a 0.05
level of significance. Since our correlation was r = 0.591, we
conclude there is not sufficient evidence to support the
claim of a linear correlation.

SIS 1037Y 46 2020/2021


Interpreting r: Explained Variation
 The value of r² is the proportion of the
variation in y that is explained by the linear
relationship between x and y.

SIS 1037Y 47 2020/2021


Example
 We found previously for the shoe and height
example that r = 0.591.
 With r = 0.591, we get r² = 0.349.
 We conclude that about 34.9% of the variation
in height can be explained by the linear
relationship between lengths of shoe prints
and heights.

SIS 1037Y 48 2020/2021


Common Errors Involving Correlation
 1. Causation: It is wrong to conclude that correlation
implies causality.
 2. Averages: Averages suppress individual variation
and may inflate the correlation coefficient.
 3. Linearity: There may be some relationship
between x and y even when there is no linear
correlation.

SIS 1037Y 49 2020/2021


Part 2: Formal Hypothesis Test
 Formal Hypothesis Test
 We wish to determine whether there is a
significant linear correlation between two
variables.
 Notation:
 n = number of pairs of sample data
 r = linear correlation coefficient for a
sample of paired data
 ρ = linear correlation coefficient for a
population of paired data

SIS 1037Y 50 2020/2021
Hypothesis Test for Correlation
Requirements
1. The sample of paired (x, y) data is a simple random
sample of quantitative data.
2. Visual examination of the scatterplot must confirm
that the points approximate a straight-line pattern.
3. The outliers must be removed if they are known to be
errors. The effects of any other outliers should be
considered by calculating r with and without the
outliers included.

SIS 1037Y 51 2020/2021


Hypothesis Test for Correlation
Hypotheses
H0: ρ = 0 (There is no linear correlation.)
H1: ρ ≠ 0 (There is a linear correlation.)

Test Statistic: r
Critical Values: Refer to Table.

P-values: Refer to technology.

SIS 1037Y 52 2020/2021


Hypothesis Test for Correlation
 If | r | > critical value from our table, reject the null
hypothesis and conclude that there is sufficient
evidence to support the claim of a linear correlation.

 If | r | ≤ critical value from our table, fail to reject the


null hypothesis and conclude that there is not
sufficient evidence to support the claim of a linear
correlation.

SIS 1037Y 53 2020/2021


Example
 We found previously for the shoe and height
example that r = 0.591.
 Conduct a formal hypothesis test of the claim
that there is a linear correlation between the
two variables.
 Use a 0.05 significance level.

SIS 1037Y 54 2020/2021


Example - Continued
We test the claim:
H0: ρ = 0 (There is no linear correlation)
H1: ρ ≠ 0 (There is a linear correlation)

With the test statistic r = 0.591 from the earlier example. The
critical values of r = ± 0.878 are found in the table with n = 5
and α = 0.05.

We fail to reject the null and conclude there is not sufficient


evidence to support the claim of a linear correlation.
SIS 1037Y 55 2020/2021
P-Value Method for a Hypothesis Test
for Linear Correlation
The test statistic is below; use n − 2 degrees
of freedom.

t = r / √( (1 − r²) / (n − 2) )

P-values can be found using software or a t-table.
SIS 1037Y 56 2020/2021
Example
Continuing the same example, we calculate the test statistic:

t = r / √( (1 − r²) / (n − 2) )
  = 0.591 / √( (1 − 0.591²) / (5 − 2) )
  = 1.269

The t-table shows that this test statistic yields a P-value
greater than 0.20. The exact P-value is 0.2937.

SIS 1037Y 57 2020/2021
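A quick check of the arithmetic above:

```python
import math

# t test statistic for the correlation,
# t = r / sqrt((1 - r^2) / (n - 2)),
# for the shoe-print example with r = 0.591 and n = 5.
r, n = 0.591, 5
t = r / math.sqrt((1 - r**2) / (n - 2))
print(round(t, 3))   # 1.269
```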


Example - Continued
 Because the P-value of 0.2937 is greater than the
significance level of 0.05, we fail to reject the null
hypothesis.
 We conclude there is not sufficient evidence to
support the claim of a linear correlation between
shoe print length and heights.

SIS 1037Y 58 2020/2021


One-Tailed Tests
One-tailed tests can occur with a claim of a positive linear
correlation or a claim of a negative linear correlation. In such
cases, the hypotheses are:

Claim of positive correlation: H0: ρ = 0, H1: ρ > 0
Claim of negative correlation: H0: ρ = 0, H1: ρ < 0

For these one-tailed tests, the P-value method can be
used as seen earlier.

SIS 1037Y 59 2020/2021


Topics
 Review and Preview
 Correlation
 Regression
 Prediction Intervals and Variation
 Multiple Regression
 Nonlinear Regression

SIS 1037Y 60 2020/2021


Key Concept
In Part 1 of this section we find the equation of the straight line
that best fits the paired sample data. That equation
algebraically describes the relationship between two variables.
The best-fitting straight line is called a regression line and its
equation is called the regression equation.
In Part 2, we discuss marginal change, influential points, and
residual plots as tools for analyzing correlation and regression
results.

SIS 1037Y 61 2020/2021


Part 1: Basic Concepts of Regression
Regression

The regression equation expresses a relationship between


x (called the explanatory variable, predictor variable or
independent variable), and ŷ (called the response variable
or dependent variable).

 The typical equation of a straight line y = mx + b is
expressed in the form ŷ = b0 + b1x, where b0 is the
y-intercept and b1 is the slope.

SIS 1037Y 62 2020/2021


Definitions

Regression Equation:

Given a collection of paired sample data, the regression


line (or line of best fit, or least-squares line) is the straight
line that “best” fits the scatterplot of data.

The regression equation ŷ = b0 + b1x algebraically describes


the regression line.

SIS 1037Y 63 2020/2021


Notation for
Regression Equation

Population Sample
Parameter Statistic
y-Intercept of
β0 b0
regression equation
Slope of regression
β1 b1
equation
Equation of the
y = β0 + β1x ŷ = b0 + b1x
regression line

SIS 1037Y 64 2020/2021


Requirements
1. The sample of paired (x, y) data is a random sample of
quantitative data.

2. Visual examination of the scatterplot shows that the


points approximate a straight-line pattern.

3. Any outliers must be removed if they are known to be


errors. Consider the effects of any outliers that are not
known errors.

SIS 1037Y 65 2020/2021


Formulas for b1 and b0

Slope:        b1 = r · (s_y / s_x)

y-intercept:  b0 = ȳ − b1 · x̄

Use technology to compute these values.

SIS 1037Y 66 2020/2021


Formulas for b1 and b0

The least-squares normal equations are:

n·b0 + b1·Σxi = Σyi
b0·Σxi + b1·Σxi² = Σxi·yi

Solving the system of equations yields:

b1 = [n·Σxi·yi − (Σxi)(Σyi)] / [n·Σxi² − (Σxi)²]

b0 = (Σyi − b1·Σxi) / n = ȳ − b1·x̄

SIS 1037Y 67 2020/2021
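As a sketch, these solved formulas can be applied directly to column sums. Here they are illustrated with the tree height / trunk diameter sums from the earlier correlation example (an assumption for illustration only; the slide itself fits the shoe-print data):

```python
# Least-squares slope and intercept from column sums:
# b1 = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2),  b0 = y_bar - b1*x_bar.
# Sums below are from the tree height / trunk diameter table.
n = 8
sum_x, sum_y, sum_xy, sum_x2 = 73, 321, 3142, 713

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
b0 = sum_y / n - b1 * sum_x / n      # equivalently y_bar - b1*x_bar
print(round(b1, 3), round(b0, 3))    # 4.541 -1.315
```

So for the tree data the fitted line would be ŷ ≈ −1.315 + 4.541x.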


Example
Let us return to the example seen earlier. We would like to
use the explanatory variable, x, shoe print length, to
predict the response variable, y, height.

The data are listed below:

SIS 1037Y 68 2020/2021


Example - Continued
Requirement Check:

1. The data are assumed to be a simple random sample.

2. The scatterplot showed a roughly straight-line pattern.

3. There are no outliers.

Technology can be used for finding the equation of a


regression line.

SIS 1037Y 69 2020/2021


Example – using technology

SIS 1037Y 70 2020/2021


Example - Continued
All these technologies, and manual calculation, show that
the regression equation can be expressed as:

yˆ  125  1.73x
Now we use the formulas to determine the regression
equation.

SIS 1037Y 71 2020/2021


Example
Recall from the previous section that r = 0.591269.

Technology can be used to find the values of the sample means and
sample standard deviations used below.

b1 = r · (s_y / s_x) = 0.591269 × (4.87391 / 1.66823) = 1.72745

b0 = ȳ − b1·x̄ = 177.3 − (1.72745)(30.04) = 125.40740


(These are the same coefficients found using technology)

SIS 1037Y 72 2020/2021
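These two lines can be checked directly, using the sample statistics quoted on the slide:

```python
# Slope and intercept from r and the sample standard deviations,
# using the values quoted for the five shoe-print pairs.
r = 0.591269
s_y, s_x = 4.87391, 1.66823
y_bar, x_bar = 177.3, 30.04

b1 = r * s_y / s_x            # approx. 1.72745
b0 = y_bar - b1 * x_bar       # approx. 125.407
print(b1, b0)
```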


Example
Graph the regression equation on a scatterplot:
yˆ  125  1.73x

SIS 1037Y 73 2020/2021


Using the Regression
Equation for Predictions
1. Use the regression equation for predictions only if the
graph of the regression line on the scatterplot confirms
that the regression line fits the points reasonably well.
2. Use the regression equation for predictions only if the
linear correlation coefficient r indicates that there is a linear
correlation between the two variables (as described
earlier).

SIS 1037Y 74 2020/2021


Using the Regression
Equation for Predictions
3. Use the regression line for predictions only if the data do
not go much beyond the scope of the available sample
data. (Predicting too far beyond the scope of the available
sample data is called extrapolation, and it could result in
bad predictions.)
4. If the regression equation does not appear to be useful for
making predictions, the best predicted value of a variable is
its sample mean.

SIS 1037Y 75 2020/2021


Strategy for Predicting Values of y

SIS 1037Y 76 2020/2021


Using the Regression
Equation for Predictions
If the regression equation is not a good model, the best
predicted value of y is simply ȳ, the mean of the y values.
Remember, this strategy applies to linear patterns of points
in a scatterplot.
If the scatterplot shows a pattern that is not a straight-line
pattern, other methods apply.
See following sections.

SIS 1037Y 77 2020/2021


Example
Use the 5 pairs of shoe print lengths and heights to predict
the height of a person with a shoe print length of 29 cm.

The regression line does not fit the points well. The
correlation is r = 0.591, which suggests there is not a linear
correlation (the P-value was 0.294).

The best predicted height is simply the mean of the sample


heights:

y  177.3 cm
SIS 1037Y 78 2020/2021
Example
Use the 40 pairs of shoe print lengths from the given Data
Set to predict the height of a person with a shoe print
length of 29 cm.

Now, the regression line does fit the points well, and the
correlation of r = 0.813 suggests that there is a linear
correlation (the P-value is 0.000).

SIS 1037Y 79 2020/2021


Example - Continued
Using technology we obtain the regression equation and
scatterplot:

SIS 1037Y 80 2020/2021


Example - Continued
The given shoe length of 29 cm is not beyond the scope of
the available data, so substitute in 29 cm into the
regression model:

yˆ  80.9  3.22 x
 80.9  3.22  29 
 174.3 cm

A person with a shoe length of 29 cm is predicted to be


174.3 cm tall.
SIS 1037Y 81 2020/2021
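The substitution above is a one-line computation:

```python
# Predicted height for a 29 cm shoe print, using the regression
# equation from the 40-pair data set: y-hat = 80.9 + 3.22x.
b0, b1 = 80.9, 3.22
x = 29
y_hat = b0 + b1 * x
print(round(y_hat, 1))   # 174.3
```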
Part 2: Beyond the Basics of
Regression
Definition:
In working with two variables related by a regression
equation, the marginal change in a variable is the amount
that it changes when the other variable changes by exactly
one unit.
The slope b1 in the regression equation represents the
marginal change in y that occurs when x changes by one
unit.

SIS 1037Y 82 2020/2021


Example

For the 40 pairs of shoe print lengths and heights, the


regression equation was:

yˆ  80.9  3.22 x
The slope of 3.22 tells us that if we increase shoe print
length by 1 cm, the predicted height of a person increases by
3.22 cm.

SIS 1037Y 83 2020/2021


Definition

In a scatterplot, an outlier is a point lying far away from the


other data points.

Paired sample data may include one or more influential


points, which are points that strongly affect the graph of the
regression line.

SIS 1037Y 84 2020/2021


Example
For the 40 pairs of shoe prints and heights, observe what
happens if we include this additional data point:
x = 35 cm and y = 25 cm

SIS 1037Y 85 2020/2021


Example - Continued
The additional point is an influential point because the graph
of the regression line changed considerably.

The additional point is also an outlier because it is far from


the other points.

SIS 1037Y 86 2020/2021


Definition
For a pair of sample x and y values, the residual is the
difference between the observed sample value of y and the
y-value that is predicted by using the regression equation.

That is:

residual  observed y  predicted y  y  yˆ

SIS 1037Y 87 2020/2021
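Residuals can be computed directly; a minimal sketch using the line ŷ = 3 + 2x that appears in the variation example later, with a few illustrative points (the third point is made up for illustration):

```python
# Residuals: observed y minus predicted y, for each sample point,
# using the line y-hat = 3 + 2x.
b0, b1 = 3, 2
data = [(5, 19), (5, 13), (2, 6)]   # (x, y) pairs

residuals = [y - (b0 + b1 * x) for x, y in data]
print(residuals)   # [6, 0, -1]
```

A point exactly on the regression line, such as (5, 13), has a residual of 0.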


Residuals

SIS 1037Y 88 2020/2021


Definition
A straight line satisfies the least-squares property if the sum
of the squares of the residuals is the smallest sum possible.
A residual plot is a scatterplot of the (x, y) values
after each of the y-coordinate values has been
replaced by the residual value y – ŷ (where ŷ denotes
the predicted value of y).
That is, a residual plot is a graph of the points (x, y –
ŷ).

SIS 1037Y 89 2020/2021


Residual Plot Analysis
When analyzing a residual plot, look for a pattern in the way
the points are configured, and use these criteria:

The residual plot should not have any obvious patterns (not
even a straight line pattern). This confirms that the
scatterplot of the sample data is a straight-line pattern.

The residual plot should not become thicker (or thinner)


when viewed from left to right. This confirms the
requirement that for different fixed values of x, the
distributions of the corresponding y values all have the
same standard deviation.

SIS 1037Y 90 2020/2021


Example
The shoe print and height data are used to generate the
following residual plot:

SIS 1037Y 91 2020/2021


Example - Continued
The residual plot becomes thicker, which suggests that the
requirement of equal standard deviations is violated.

SIS 1037Y 92 2020/2021


Example - Continued
On the following slides are three residual plots.

Observe what is good or bad about the individual regression


models.

SIS 1037Y 93 2020/2021


Example - Continued
Regression model is a good model:

SIS 1037Y 94 2020/2021


Example - Continued
Distinct pattern: sample data may not follow a straight-line
pattern.

SIS 1037Y 95 2020/2021


Example - Continued
Residual plot becoming thicker: equal standard deviations
violated.

SIS 1037Y 96 2020/2021


Complete Regression Analysis

1. Construct a scatterplot and verify that the pattern of the


points is approximately a straight-line pattern without
outliers. (If there are outliers, consider their effects by
comparing results that include the outliers to results that
exclude the outliers.)
2. Construct a residual plot and verify that there is no
pattern (other than a straight-line pattern) and also
verify that the residual plot does not become thicker (or
thinner).

SIS 1037Y 97 2020/2021


Complete Regression Analysis

3. Use a histogram and/or normal quantile plot to confirm


that the values of the residuals have a distribution that is
approximately normal.

4. Consider any effects of a pattern over time.

SIS 1037Y 98 2020/2021


Correlation and Regression
Review and Preview
Correlation
Regression
Prediction Intervals and Variation
Multiple Regression
Nonlinear Regression

SIS 1037Y 99 2020/2021


Key Concept
In this part we look at a method for constructing a prediction
interval, which is an interval estimate of a predicted value of y.
(Interval estimates of parameters are confidence intervals,
but interval estimates of variables are called prediction
intervals.)

SIS 1037Y 100 2020/2021


Requirements
For each fixed value of x, the corresponding sample values of
y are normally distributed about the regression line, with the
same variance.

SIS 1037Y 101 2020/2021


Formulas
For a fixed and known x0, the prediction interval for an
individual y value is:

ŷ − E < y < ŷ + E

with margin of error:

E = t_{α/2} · s_e · √( 1 + 1/n + n(x0 − x̄)² / [n(Σx²) − (Σx)²] )
SIS 1037Y 102 2020/2021


Formulas
The standard error estimate is:

s_e = √( Σ(y − ŷ)² / (n − 2) )

(It is suggested to use technology to get prediction intervals.)

SIS 1037Y 103 2020/2021
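The margin-of-error formula can be sketched as a small function. The function name and the sums passed in below are hypothetical illustrative values, not the course data; the point is the shape of the formula and the fact that the interval widens as x0 moves away from x̄:

```python
import math

# Margin of error for a prediction interval (sketch; symbols follow
# the formula above: t_crit = t_{alpha/2}, se = standard error estimate).
def margin_of_error(t_crit, se, n, x0, x_bar, sum_x, sum_x2):
    return t_crit * se * math.sqrt(
        1 + 1 / n + n * (x0 - x_bar) ** 2 / (n * sum_x2 - sum_x ** 2))

# Illustrative (made-up) sums: the interval widens as x0 moves
# away from the mean of the x-values.
e_center = margin_of_error(2.024, 5.94, 40, 30.0, 30.0, 1200, 36500)
e_far    = margin_of_error(2.024, 5.94, 40, 40.0, 30.0, 1200, 36500)
print(e_center < e_far)   # True
```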


Example
If we use the 40 pairs of shoe lengths and heights, construct a
95% prediction interval for the height, given that the shoe print
length is 29.0 cm.

Recall:
ŷ = 80.9 + 3.22x
x0 = 29.0
s_e = 5.943761
ŷ = 174.3
t_{α/2} = 2.024 (Table A-3, df = 38, 0.05 area in two tails)

SIS 1037Y 104 2020/2021


Example - Continued
The 95% prediction interval is 162 cm < y < 186 cm.

This is a large range of values, so a single shoe print doesn't give us
very good information about someone's height.

SIS 1037Y 105 2020/2021


Explained and Unexplained Variation
Assume the following:

•There is sufficient evidence of a


linear correlation.
•The equation of the line is
ŷ = 3 + 2x
•The mean of the y-values is 9.
•One of the pairs of sample data is x
= 5 and y = 19.
•The point (5,13) is on the
regression line.

SIS 1037Y 106 2020/2021


Explained and Unexplained Variation

The figure shows (5, 13) lies on the regression line, but (5, 19) does not.

We arrive at:

Total deviation (from ȳ = 9) of the point (5, 19):       y − ȳ = 19 − 9 = 10
Explained deviation (from ȳ = 9) of the point (5, 19):   ŷ − ȳ = 13 − 9 = 4
Unexplained deviation (from ȳ = 9) of the point (5, 19): y − ŷ = 19 − 13 = 6

SIS 1037Y 107 2020/2021


Relationships
(total deviation) = (explained deviation) + (unexplained deviation)

(y − ȳ) = (ŷ − ȳ) + (y − ŷ)

(total variation) = (explained variation) + (unexplained variation)

Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)²

SIS 1037Y 108 2020/2021
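For a least-squares fit, the variation decomposition holds exactly. A minimal sketch that verifies it on a small made-up data set:

```python
# Check total variation = explained + unexplained for a
# least-squares fit on a small illustrative data set.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 7]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept
b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / \
     sum((a - x_bar) ** 2 for a in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * a for a in x]

total       = sum((b - y_bar) ** 2 for b in y)
explained   = sum((h - y_bar) ** 2 for h in y_hat)
unexplained = sum((b - h) ** 2 for b, h in zip(y, y_hat))
print(abs(total - (explained + unexplained)) < 1e-9)   # True
```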


Definition
The coefficient of determination is the amount of the
variation in y that is explained by the regression line.

r² = explained variation / total variation

 The value of r² is the proportion of the variation in y that is
explained by the linear relationship between x and y.

SIS 1037Y 109 2020/2021


Example
If we use the 40 paired shoe lengths and heights from the
data, we find the linear correlation coefficient to be r =
0.813.

Then, the coefficient of determination is

r² = 0.813² = 0.661

We conclude that 66.1% of the total variation in height can


be explained by shoe print length, and the other 33.9%
cannot be explained by shoe print length.

SIS 1037Y 110 2020/2021


Correlation and Regression
Review and Preview
Correlation
Regression
Prediction Intervals and Variation
Multiple Regression
Nonlinear Regression



Key Concept

This part presents a method for analyzing a linear relationship involving more than two variables. We focus on these key elements:
1. Finding the multiple regression equation.
2. The values of the adjusted R² and the P-value as measures of how well the multiple regression equation fits the sample data.



Part 1: Basic Concepts of a
Multiple Regression Equation
Definition
A multiple regression equation expresses a linear relationship between a response variable y and two or more predictor variables (x1, x2, x3, ..., xk).

The general form of the multiple regression equation obtained from sample data is

ŷ = b0 + b1x1 + b2x2 + ... + bkxk



Notation
ŷ = b0 + b1x1 + b2x2 + ... + bkxk
(General form of the multiple regression equation)

n = sample size
k = number of predictor variables
ŷ = predicted value of y
x1, x2, x3, ..., xk = the predictor variables
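The coefficients b0, b1, ..., bk are found by least squares; statistical packages do this internally. A minimal sketch with NumPy (the helper name is ours):

```python
import numpy as np

def multiple_regression(X, y):
    """Coefficients b0, b1, ..., bk for ŷ = b0 + b1·x1 + ... + bk·xk.

    X : (n, k) array-like of predictor values
    y : (n,) response values
    """
    X, y = np.asarray(X, float), np.asarray(y, float)
    design = np.column_stack([np.ones(len(y)), X])   # prepend intercept column
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs
```

On data generated exactly from ŷ = 1 + 2x₁ + 3x₂, the fit recovers those coefficients.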



Example
The given table includes a random sample of heights of mothers, fathers, and their daughters (based on data from the National Health and Nutrition Examination Survey).

Find the multiple regression equation in which the response (y) variable is the height of a daughter and the predictor (x) variables are the height of the mother and the height of the father.



Example
The results from a statistical package are shown here:



Definition
• The multiple coefficient of determination, R², is a measure of how well the multiple regression equation fits the sample data. A perfect fit would result in R² = 1.

• The adjusted coefficient of determination is the multiple coefficient of determination R² modified to account for the number of variables and the sample size.



Adjusted Coefficient of Determination

Adjusted R² = 1 − (1 − R²)(n − 1) / [n − (k + 1)]

where n = sample size
      k = number of predictor (x) variables
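The formula translates directly into code; a one-line sketch (the function name is ours):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R² = 1 - (1 - R²)(n - 1) / [n - (k + 1)]."""
    return 1 - (1 - r2) * (n - 1) / (n - (k + 1))
```

Because the adjustment penalizes extra predictors, adjusted R² is never larger than R² itself.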



Example

The preceding technology display shows the adjusted coefficient of determination as R-Sq(adj) = 63.7%.

When we compare this multiple regression equation to others, it is better to use the adjusted R² of 63.7%.



P-Value
The P-value is a measure of the overall significance of the
multiple regression equation.
The displayed technology P-value of 0.000 is small, indicating
that the multiple regression equation has good overall
significance and is usable for predictions.
That is, it makes sense to predict the heights of daughters based
on heights of mothers and fathers.

The value of 0.000 results from a test of the null hypothesis that
β1 = β2 = 0, and rejection of this hypothesis indicates the equation
is effective in predicting the heights of daughters.
Finding the Best Multiple
Regression Equation

1. Use common sense and practical considerations to include or exclude variables.

2. Consider the P-value. Select an equation having overall significance, as determined by the P-value found in the computer display.



Finding the Best Multiple
Regression Equation
3. Consider equations with high values of adjusted R², and try to include only a few variables.
• Select an equation with this property: if an additional predictor variable is included, the value of adjusted R² does not increase by a substantial amount.
• For a given number of predictor (x) variables, select the equation with the largest value of adjusted R².
• In weeding out predictor (x) variables that don't have much of an effect on the response (y) variable, it might be helpful to find the linear correlation coefficient r for each pair of variables being considered.
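Guideline 3 can be explored by computing adjusted R² for every subset of candidate predictors. An illustrative sketch, not the lecture's procedure (the names and data below are made up):

```python
import itertools
import numpy as np

def rank_subsets(X, y, names):
    """Rank every subset of predictor columns by adjusted R².

    X : (n, p) array-like of candidate predictors; names : their labels.
    Returns [(subset_names, adjusted_R2), ...] sorted best-first.
    """
    X, y = np.asarray(X, float), np.asarray(y, float)
    n = len(y)
    out = []
    for k in range(1, X.shape[1] + 1):
        for cols in itertools.combinations(range(X.shape[1]), k):
            design = np.column_stack([np.ones(n), X[:, list(cols)]])
            b, *_ = np.linalg.lstsq(design, y, rcond=None)
            resid = y - design @ b
            r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
            adj = 1 - (1 - r2) * (n - 1) / (n - (k + 1))   # adjusted R²
            out.append(([names[c] for c in cols], adj))
    return sorted(out, key=lambda pair: pair[1], reverse=True)
```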



Example
One of the data sets includes the age, foot length, shoe print length, shoe size, and height for each of 40 different subjects.

Using those sample data, find the regression equation that is best for predicting height.

The table on the next slide includes key results from the combinations of predictor variables.



Example - Continued



Example - Continued
Using critical thinking and statistical analysis:
1. Delete the variable age.
2. Delete the variable shoe size, because it is really a rounded form of
foot length.
3. For the remaining variables of foot length and shoe print length,
select foot length because its adjusted R2 of 0.7014 is greater than
0.6520 for shoe print length.
4. Although it appears that foot length alone is best, we note that
criminals usually wear shoes, so shoe print lengths are more likely to
be found than foot lengths.





Part 2: Dummy Variables and
Logistic Equations
Many applications involve a dichotomous variable, which has only two
possible discrete values (such as male/female or dead/alive).
A common procedure is to represent the two possible discrete values by
0 and 1, where 0 represents "failure" and 1 represents "success".
A dichotomous variable with the two values 0 and 1 is called a
dummy variable.
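Once coded as 0/1, a dummy variable enters the regression as just another predictor column. A sketch (every number and name below is made up for illustration, not the lecture's data):

```python
import numpy as np

def fit_with_dummy(mother, father, sex, height):
    """Least-squares fit of height on mother, father, and a 0/1 sex dummy."""
    design = np.column_stack([np.ones(len(height)), mother, father, sex])
    b, *_ = np.linalg.lstsq(design, np.asarray(height, float), rcond=None)
    return b

def predict(b, mother, father, sex):
    """Predicted height for one child; sex is 0 (female) or 1 (male)."""
    return float(b @ np.array([1.0, mother, father, sex]))
```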



Example
The data in the table also include the dummy variable of sex (coded as
0 = female and 1 = male).

Given that a mother is 63 inches tall and a father is 69 inches tall,
find the regression equation and use it to predict the height of a
daughter and a son.



Example



Example - Continued
Using technology, we get the regression equation:

We substitute 0 for sex, 63 for the mother's height, and 69 for the
father's height, and predict the daughter will be 62.8 inches tall.

We substitute 1 for sex, 63 for the mother's height, and 69 for the
father's height, and predict the son will be 67 inches tall.



Logistic Regression

We can use the methods of this section if the dummy variable is the
predictor variable.

If the dummy variable is the response variable, we need to use a method
known as logistic regression.

As the name implies, logistic regression involves the use of natural
logarithms.
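For orientation only, the logistic model relates a 0/1 response to a predictor through the natural-log odds, ln(p / (1 − p)) = b0 + b1x. A sketch of the two transforms involved (estimating b0 and b1 is beyond this lecture):

```python
import math

def logistic(b0, b1, x):
    """P(y = 1) under the model ln(p / (1 - p)) = b0 + b1*x."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def logit(p):
    """Log odds: the natural-log transform that linearizes the model."""
    return math.log(p / (1 - p))
```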



Topics
 Review and Preview
 Correlation
 Regression
 Prediction Intervals and Variation
 Multiple Regression
 Nonlinear Regression



Key Concept

Whereas all preceding sections dealt with linear relationships, this
section is a brief introduction to methods for finding some nonlinear
functions that fit sample data.

We focus only on technology for finding these mathematical models.



Three Basic Rules for Identifying a
Good Mathematical Model
Look for a pattern in the graph: Examine the graph of the plotted
points and compare the basic pattern to the examples shown.

Find and compare values of R²: Select functions that result in larger
values of R², because such larger values correspond to functions that
better fit the observed points.

Think: Use common sense. Don't use a model that leads to predicted
values known to be totally unrealistic.
Generic Models
Linear:       y = a + bx
Quadratic:    y = ax² + bx + c
Logarithmic:  y = a + b ln x
Exponential:  y = ab^x
Power:        y = ax^b
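Linear and quadratic models can be fitted with polynomial least squares, and the exponential and power models become linear after taking logarithms. A sketch assuming NumPy (function names are ours; fitting in log space is a common shortcut, not the only estimation method):

```python
import numpy as np

def fit_exponential(x, y):
    """Fit y = a·b^x by least squares on ln y = ln a + x·ln b."""
    slope, intercept = np.polyfit(np.asarray(x, float), np.log(y), 1)
    return np.exp(intercept), np.exp(slope)       # a, b

def fit_power(x, y):
    """Fit y = a·x^b by least squares on ln y = ln a + b·ln x."""
    slope, intercept = np.polyfit(np.log(np.asarray(x, float)), np.log(y), 1)
    return np.exp(intercept), slope               # a, b
```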



Example
The table lists the population of the U.S. for different years.

Find a mathematical model for the population size, then predict the
size of the U.S. population in the year 2020.



Example - Continued
Using technology for the following analysis:



Example - Continued
Based on its R² value of 0.9992, the quadratic model appears to be
best.

We substitute x = 12 (for the year 2020) into the equation to arrive at
a population estimate of 337 million in the year 2020:

y = 2.77(12)² − 6.00(12) + 10.01 ≈ 337


Summary



Questions?
Comments?
