Correlation and Regression - October 25 - 2022

Uploaded by

Rezwana Sultana

CHAPTER

Correlation and Regression


by
Md Hasinur Rahaman Khan, PhD
Professor of Applied Statistics
Institute of Statistical Research and Training (ISRT)
University of Dhaka, Dhaka 1000
E-mail: [email protected]
Outline

10-1 Scatter Plots and Correlation


10-2 Regression
10-3 Coefficient of Determination and Standard
Error of the Estimate
10-4 Multiple Regression
Introduction
• Correlation is a statistical method used
to determine whether a linear relationship
between variables exists.

• Regression is a statistical method used
to describe the nature of the relationship
between variables, that is, positive or
negative, linear or nonlinear.

Bluman Chapter 10 3
Introduction
• The purpose of this chapter is to answer
these questions statistically:
1. Are two or more variables related?
2. If so, what is the strength of the
relationship?
3. What type of relationship exists?
4. What kind of predictions can be
made from the relationship?

Introduction
1. Are two or more variables related?
2. If so, what is the strength of the
relationship?
To answer these two questions, statisticians use
the correlation coefficient, a numerical
measure to determine whether two or more
variables are related and to determine the
strength of the relationship between or among
the variables.

Introduction
3. What type of relationship exists?
There are two types of relationships: simple and
multiple.
In a simple relationship, there are two variables:
an independent variable (predictor variable)
and a dependent variable (response variable).
In a multiple relationship, there are two or more
independent variables that are used to predict
one dependent variable.
Introduction
4. What kind of predictions can be made
from the relationship?
Predictions are made daily in all areas. Examples
include weather forecasting, stock market
analyses, sales predictions, crop predictions,
gasoline price predictions, and sports predictions.
Some predictions are more accurate than others,
due to the strength of the relationship. That is, the
stronger the relationship is between variables, the
more accurate the prediction is.
10.1 Scatter Plots and Correlation
• A scatter plot is a graph of the ordered
pairs (x, y) of numbers consisting of the
independent variable x and the
dependent variable y.

Example 10-1: Car Rental Companies
Construct a scatter plot for the data shown for car rental
companies in the United States for a recent year.

Step 1: Draw and label the x and y axes.


Step 2: Plot each point on the graph.

Example 10-1: Car Rental Companies

[Figure: scatter plot of the car rental data, showing a positive relationship]

Example 10-2: Absences/Final Grades
Construct a scatter plot for the data obtained in a study on
the number of absences and the final grades of seven
randomly selected students from a statistics class.

Step 1: Draw and label the x and y axes.


Step 2: Plot each point on the graph.
Example 10-2: Absences/Final Grades

[Figure: scatter plot of absences and final grades, showing a negative relationship]

Example 10-3: Age and Wealth
A researcher wishes to see if there is a relationship
between the ages and net worth of the wealthiest people
in America. The data for a specific year are shown.

Step 1: Draw and label the x and y axes.


Step 2: Plot each point on the graph.
Example 10-3: Age and Wealth

[Figure: scatter plot of age and net worth, showing a very weak relationship]

Correlation
• The correlation coefficient computed from
the sample data measures the strength and
direction of a linear relationship between two
variables.
• There are several types of correlation
coefficients. The one explained in this section
is called the Pearson product moment
correlation coefficient (PPMC).
• The symbol for the sample correlation
coefficient is r. The symbol for the population
correlation coefficient is ρ.
Correlation
• The range of the correlation coefficient is from
-1 to +1.
• If there is a strong positive linear
relationship between the variables, the value
of r will be close to +1.
• If there is a strong negative linear
relationship between the variables, the value
of r will be close to -1.

Correlation Coefficient
The formula for the correlation coefficient is

$$r = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}}$$

where n is the number of data pairs.

Rounding Rule: Round to three decimal places.

Example 10-4: Car Rental Companies
Compute the correlation coefficient for the data in
Example 10–1.

Company   Cars x          Income y         xy        x²         y²
          (in 10,000s)    (in billions)
A         63.0            7.0              441.00    3969.00    49.00
B         29.0            3.9              113.10     841.00    15.21
C         20.8            2.1               43.68     432.64     4.41
D         19.1            2.8               53.48     364.81     7.84
E         13.4            1.4               18.76     179.56     1.96
F          8.5            1.5               12.75      72.25     2.25
Σ         153.8           18.7             682.77    5859.26    80.67

Example 10-4: Car Rental Companies
Compute the correlation coefficient for the data in
Example 10–1.
Σx = 153.8, Σy = 18.7, Σxy = 682.77, Σx² = 5859.26,
Σy² = 80.67, n = 6

$$r = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}}$$

$$r = \frac{(6)(682.77) - (153.8)(18.7)}{\sqrt{\left[(6)(5859.26) - (153.8)^2\right]\left[(6)(80.67) - (18.7)^2\right]}}$$

r = 0.982 (strong positive relationship)

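The calculation above can be checked with a short script. This is a minimal sketch, not part of the original slides: the function name pearson_r and the variable names are illustrative assumptions, and the data are the six car rental pairs from the Example 10-4 table.

```python
from math import sqrt

# Data from the Example 10-4 table: cars (in 10,000s) and income (in billions).
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
income = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]


def pearson_r(x, y):
    """Pearson correlation coefficient via the sums-based formula (hypothetical helper)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(cars, income)
print(round(r, 3))  # 0.982
```

Rounding to three decimal places follows the rounding rule stated for the correlation coefficient.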
10.2 Regression
• If the value of the correlation coefficient is
significant, the next step is to determine
the equation of the regression line,
which is the data’s line of best fit.

Regression
• Best fit means that the sum of the
squares of the vertical distances from
each point to the line is at a minimum.

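Regression Line y′ = a + bx

$$a = \frac{\left(\sum y\right)\left(\sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2}$$

$$b = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2}$$

where
a = y′ intercept
b = the slope of the line.
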
Example 10-9: Car Rental Companies
Find the equation of the regression line for the data in
Example 10–4, and graph the line on the scatter plot.
Σx = 153.8, Σy = 18.7, Σxy = 682.77, Σx² = 5859.26,
Σy² = 80.67, n = 6

$$a = \frac{\left(\sum y\right)\left(\sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2} = \frac{(18.7)(5859.26) - (153.8)(682.77)}{6(5859.26) - (153.8)^2} = 0.396$$

$$b = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2} = \frac{6(682.77) - (153.8)(18.7)}{6(5859.26) - (153.8)^2} = 0.106$$

y′ = a + bx → y′ = 0.396 + 0.106x

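As a cross-check on the intercept and slope, the same sums can be computed directly from the raw data. This is a minimal sketch under the assumption that the six car rental pairs from Example 10-4 are the full data set; regression_line is a hypothetical helper name, not something defined in the slides.

```python
# Data from Example 10-4: cars (in 10,000s) and income (in billions).
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
income = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]


def regression_line(x, y):
    """Return (a, b) for y' = a + b*x using the sums-based formulas (hypothetical helper)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    denominator = n * sum_x2 - sum_x ** 2  # shared by both formulas
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return a, b


a, b = regression_line(cars, income)
print(round(a, 3), round(b, 3))  # 0.396 0.106
```

Both formulas share the same denominator, so it is computed once and reused.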
Example 10-9: Car Rental Companies
Find two points to sketch the graph of the regression line.

Use any x values between 10 and 60. For example, let x
equal 15 and 40. Substitute in the equation and find the
corresponding y′ value.

For x = 15:
y′ = 0.396 + 0.106x
   = 0.396 + 0.106(15)
   = 1.986

For x = 40:
y′ = 0.396 + 0.106x
   = 0.396 + 0.106(40)
   = 4.636

Plot (15, 1.986) and (40, 4.636), and sketch the resulting
line.

Example 10-9: Car Rental Companies

[Figure: scatter plot of the car rental data with the regression line y′ = 0.396 + 0.106x drawn through (15, 1.986) and (40, 4.636)]

Example 10-11: Car Rental Companies
Use the equation of the regression line to predict the
income of a car rental agency that has 200,000
automobiles.

x = 20 corresponds to 200,000 automobiles.

y′ = 0.396 + 0.106x
   = 0.396 + 0.106(20)
   = 2.516

Hence, when a rental agency has 200,000 automobiles, its
revenue will be approximately $2.516 billion.

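The prediction step, including the unit conversion from automobiles to the x scale of 10,000s, can be sketched as follows. The function name predict_income and its default coefficients (the rounded values a = 0.396 and b = 0.106 from Example 10-9) are illustrative assumptions.

```python
def predict_income(automobiles, a=0.396, b=0.106):
    """Predicted income in billions of dollars for a given fleet size (hypothetical helper).

    x is measured in units of 10,000 automobiles, as in the example data.
    """
    x = automobiles / 10_000
    return a + b * x


prediction = predict_income(200_000)
print(round(prediction, 3))  # 2.516
```

The data used to fit the line cover roughly 85,000 to 630,000 automobiles (x from 8.5 to 63.0), so 200,000 lies inside the observed range.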
Regression
• The magnitude of the change in one variable
when the other variable changes exactly 1 unit
is called a marginal change. The value of
slope b of the regression line equation
represents the marginal change.
• For valid predictions, the value of the
correlation coefficient must be significant.
• When r is not significantly different from 0, the
best predictor of y is the mean of the data
values of y.

Assumptions for Valid Predictions
1. For any specific value of the independent
variable x, the value of the dependent variable
y must be normally distributed about the
regression line.

Assumptions for Valid Predictions
2. The standard deviation of each of the
dependent variables must be the same for
each value of the independent variable.

Extrapolations (Future Predictions)
• Extrapolation, or making predictions beyond
the bounds of the data, must be interpreted
cautiously.
• Remember that when predictions are made,
they are based on present conditions or on the
premise that present trends will continue. This
assumption may or may not prove true in the
future.

Procedure Table
Finding the Correlation Coefficient and the Regression Line
Equation
Step 1 Make a table with columns for x, y, xy, x², and y².

Step 2 Find the values of xy, x², and y². Place them in the
appropriate columns and sum each column.

Step 3 Substitute in the formula to find the value of r.

Step 4 When r is significant, substitute in the formulas to
find the values of a and b for the regression line
equation y′ = a + bx.
10.3 Coefficient of Determination
and Standard Error of the Estimate
• The total variation $\sum (y - \bar{y})^2$ is the
sum of the squares of the vertical
distances each point is from the mean.
• The total variation can be divided into two
parts: that which is attributed to the
relationship of x and y, and that which is
due to chance.

Variation
• The variation obtained from the
relationship (i.e., from the predicted y′
values) is $\sum (y' - \bar{y})^2$ and is called the
explained variation.
• Variation due to chance, found by
$\sum (y - y')^2$, is called the unexplained
variation. This variation cannot be
attributed to the relationship.

Coefficient of Determination
• The coefficient of determination is the
ratio of the explained variation to the total
variation.
• The symbol for the coefficient of
determination is r².
• $$r^2 = \frac{\text{explained variation}}{\text{total variation}}$$
• Another way to arrive at the value for r²
is to square the correlation coefficient.
Coefficient of Nondetermination
• The coefficient of nondetermination is
a measure of the unexplained variation.
• The formula for the coefficient of
nondetermination is 1.00 - r².

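To illustrate how the total variation splits into explained and unexplained parts, the sketch below recomputes all three for the car rental data from Example 10-4 and forms r² as their ratio. The variable names are illustrative assumptions, and the line is refitted at full precision rather than using the rounded a and b.

```python
# Data from Example 10-4: cars (in 10,000s) and income (in billions).
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
income = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]

n = len(cars)
sum_x, sum_y = sum(cars), sum(income)
sum_xy = sum(x * y for x, y in zip(cars, income))
sum_x2 = sum(x * x for x in cars)

# Least-squares intercept and slope at full precision.
denominator = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
b = (n * sum_xy - sum_x * sum_y) / denominator

mean_y = sum_y / n
predicted = [a + b * x for x in cars]

explained = sum((p - mean_y) ** 2 for p in predicted)
unexplained = sum((y - p) ** 2 for y, p in zip(income, predicted))
total = sum((y - mean_y) ** 2 for y in income)

r_squared = explained / total
print(round(r_squared, 3))      # 0.964  coefficient of determination
print(round(1 - r_squared, 3))  # 0.036  coefficient of nondetermination
```

For a least-squares line, explained plus unexplained variation equals the total variation, which is why 1.00 - r² measures the unexplained share.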
Standard Error of the Estimate
• The standard error of the estimate,
denoted by s_est, is the standard deviation
of the observed y values about the
predicted y′ values. The formula for the
standard error of the estimate is:

$$s_{est} = \sqrt{\frac{\sum (y - y')^2}{n - 2}}$$

Chapter 10
Correlation and Regression

Section 10-3
Example 10-12
Page #570

Example 10-12: Copy Machine Costs
A researcher collects the following data and determines
that there is a significant relationship between the age of a
copy machine and its monthly maintenance cost. The
regression equation is y′ = 55.57 + 8.13x. Find the
standard error of the estimate.

Example 10-12: Copy Machine Costs
Machine   Age x (years)   Monthly cost y   y′        y - y′    (y - y′)²
A         1               62               63.70     -1.70      2.89
B         2               78               71.83      6.17     38.0689
C         3               70               79.96     -9.96     99.2016
D         4               90               88.09      1.91      3.6481
E         4               93               88.09      4.91     24.1081
F         6               103              104.35    -1.35      1.8225
                                                     Σ(y - y′)² = 169.7392

The predicted values come from y′ = 55.57 + 8.13x:
y′ = 55.57 + 8.13(1) = 63.70
y′ = 55.57 + 8.13(2) = 71.83
y′ = 55.57 + 8.13(3) = 79.96
y′ = 55.57 + 8.13(4) = 88.09
y′ = 55.57 + 8.13(6) = 104.35

$$s_{est} = \sqrt{\frac{\sum (y - y')^2}{n - 2}} = \sqrt{\frac{169.7392}{4}} = 6.51$$

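The table and the final square root can be reproduced in a few lines. This is a minimal sketch using the copy machine data and the regression equation given in Example 10-12; the function name standard_error_of_estimate is an illustrative assumption.

```python
from math import sqrt

# Data from Example 10-12: machine age (years) and monthly maintenance cost.
ages = [1, 2, 3, 4, 4, 6]
costs = [62, 78, 70, 90, 93, 103]


def standard_error_of_estimate(x, y, a, b):
    """s_est for the line y' = a + b*x (hypothetical helper)."""
    residual_sum_of_squares = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return sqrt(residual_sum_of_squares / (len(x) - 2))


s_est = standard_error_of_estimate(ages, costs, a=55.57, b=8.13)
print(round(s_est, 2))  # 6.51
```

The divisor is n - 2 rather than n because two quantities, a and b, are estimated from the data.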
10.4 Multiple Regression (Optional)
In multiple regression, there are several
independent variables and one dependent
variable, and the equation is

$$y' = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$

where
$x_1, x_2, \ldots, x_k$ = independent variables.

Assumptions for Multiple Regression
1. normality assumption—for any specific value of the
independent variable, the values of the y variable are
normally distributed.
2. equal-variance assumption—the variances (or
standard deviations) for the y variables are the same
for each value of the independent variable.
3. linearity assumption—there is a linear relationship
between the dependent variable and the independent
variables.
4. nonmulticollinearity assumption—the independent
variables are not correlated.
5. independence assumption—the values for the y
variables are independent.
Multiple Correlation Coefficient
• In multiple regression, as in simple
regression, the strength of the
relationship between the independent
variables and the dependent variable is
measured by a correlation coefficient.
• This multiple correlation coefficient is
symbolized by R.

Multiple Correlation Coefficient
The formula for R is

$$R = \sqrt{\frac{r_{yx_1}^2 + r_{yx_2}^2 - 2\, r_{yx_1} \cdot r_{yx_2} \cdot r_{x_1 x_2}}{1 - r_{x_1 x_2}^2}}$$

where
$r_{yx_1}$ = correlation coefficient for y and x₁
$r_{yx_2}$ = correlation coefficient for y and x₂
$r_{x_1 x_2}$ = correlation coefficient for x₁ and x₂

Example 10-15: State Board Scores
A nursing instructor wishes to see whether a student’s
grade point average and age are related to the student’s
score on the state board nursing examination. She
selects five students and obtains the following data.
Find the value of R.

Example 10-15: State Board Scores
The values of the correlation coefficients are

$r_{yx_1} = 0.845$
$r_{yx_2} = 0.791$
$r_{x_1 x_2} = 0.371$

Example 10-15: State Board Scores
$$R = \sqrt{\frac{r_{yx_1}^2 + r_{yx_2}^2 - 2\, r_{yx_1} \cdot r_{yx_2} \cdot r_{x_1 x_2}}{1 - r_{x_1 x_2}^2}}$$

$$R = \sqrt{\frac{(0.845)^2 + (0.791)^2 - 2(0.845)(0.791)(0.371)}{1 - (0.371)^2}}$$

R = 0.989
Hence, the correlation between a student’s grade point
average and age with the student’s score on the nursing
state board examination is 0.989. In this case, there is a
strong relationship among the variables; the value of R is
close to 1.00.
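The substitution above is easy to mistype by hand, so a short check may help. This is a minimal sketch for the two-predictor case only; the function name multiple_r and its parameter names are illustrative assumptions that mirror the three pairwise correlation coefficients in the formula.

```python
from math import sqrt


def multiple_r(r_yx1, r_yx2, r_x1x2):
    """Multiple correlation coefficient R for two independent variables (hypothetical helper)."""
    numerator = r_yx1 ** 2 + r_yx2 ** 2 - 2 * r_yx1 * r_yx2 * r_x1x2
    denominator = 1 - r_x1x2 ** 2
    return sqrt(numerator / denominator)


# Pairwise correlations given in Example 10-15.
R = multiple_r(r_yx1=0.845, r_yx2=0.791, r_x1x2=0.371)
print(round(R, 3))  # 0.989
```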
F Test for Significance of R
The formula for the F test is

$$F = \frac{R^2 / k}{\left(1 - R^2\right) / \left(n - k - 1\right)}$$

where
n = the number of data groups
k = the number of independent variables.
d.f.N. = n - k
d.f.D. = n - k - 1
Example 10-16: State Board Scores
Test the significance of the R obtained in Example 10–15
at α = 0.05.
Here R² = (0.989)² = 0.978, n = 5, and k = 2.

$$F = \frac{R^2 / k}{\left(1 - R^2\right) / \left(n - k - 1\right)}$$

$$F = \frac{0.978 / 2}{(1 - 0.978) / (5 - 2 - 1)} = 44.45$$

The critical value obtained from Table H with α = 0.05,
d.f.N. = 3, and d.f.D. = 2 is 19.16. Hence, the decision is
to reject the null hypothesis and conclude that there is a
significant relationship among the student’s GPA, age,
and score on the nursing state board examination.

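The test value and the decision can be sketched in code. The function name f_statistic is an illustrative assumption; the critical value 19.16 is taken as given from the example rather than looked up programmatically.

```python
def f_statistic(r_squared, n, k):
    """F test value for the significance of R (hypothetical helper).

    n is the number of data groups and k the number of independent variables.
    """
    return (r_squared / k) / ((1 - r_squared) / (n - k - 1))


# Values from Example 10-16: R squared = 0.978, five students, two predictors.
F = f_statistic(0.978, n=5, k=2)
critical_value = 19.16  # quoted in the example for alpha = 0.05

print(round(F, 2))         # 44.45
print(F > critical_value)  # True, so the null hypothesis is rejected
```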
Adjusted R²
The formula for the adjusted R² is

$$R_{adj}^2 = 1 - \frac{\left(1 - R^2\right)\left(n - 1\right)}{n - k - 1}$$

Example 10-17: State Board Scores
Calculate the adjusted R² for the data in Example 10–16.
The value for R is 0.989.

$$R_{adj}^2 = 1 - \frac{\left(1 - R^2\right)\left(n - 1\right)}{n - k - 1}$$

$$R_{adj}^2 = 1 - \frac{\left(1 - 0.989^2\right)\left(5 - 1\right)}{5 - 2 - 1} = 0.956$$

In this case, when the number of data pairs and the
number of independent variables are accounted for, the
adjusted multiple coefficient of determination is 0.956.

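A short check of the adjusted R² arithmetic follows. This is a minimal sketch; the function name adjusted_r_squared is an illustrative assumption, and it takes R (not R²) as input to match how the example states its value.

```python
def adjusted_r_squared(R, n, k):
    """Adjusted R squared from the multiple correlation coefficient R (hypothetical helper).

    n is the number of data groups and k the number of independent variables.
    """
    return 1 - (1 - R ** 2) * (n - 1) / (n - k - 1)


# Values from Example 10-17: R = 0.989, five students, two predictors.
r2_adj = adjusted_r_squared(0.989, n=5, k=2)
print(round(r2_adj, 3))  # 0.956
```

The adjusted value (0.956) is smaller than R² (0.978) because the sample is small relative to the number of independent variables.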
