Topic 7.1_Correlation and Simple Linear Regression
Topic 7: Correlation, Simple Linear Regression, and Model Building
The basic idea of correlation analysis is to report the association between two
variables. The usual first step is to plot the data in a scatter diagram.
Example
Copier Sales of America sells copiers to businesses of all sizes throughout the United
States and Canada. Ms. Marcy Bancer was recently promoted to the position of
national sales manager. At the upcoming sales meeting, the sales representatives
from all over the country will be in attendance. She would like to impress upon
them the importance of making that extra sales call each day. She decides to gather
some information on the relationship between the number of sales calls and the
number of copiers sold.
She selected a random sample of 10 sales representatives and determined the
number of sales calls they made last month and the number of copiers they sold.
The sales information is reported in the following table.
What observations can you make about the relationship between the number of sales
calls and the number of copiers sold? Develop a scatter diagram to display the
information.
Sales Representative    Number of Sales Calls (X)    Number of Copiers Sold (Y)
Tom Keller              20                           30
Jeff Hall               40                           60
Brian Virost            20                           40
Greg Fish               30                           60
Susan Welch             10                           30
Carlos Ramirez          10                           40
Rich Niles              20                           40
Mike Kiel               20                           50
Mark Reynolds           20                           30
Soni Jones              30                           70
Based on the information in the table, Ms. Bancer suspects there is a relationship
between the number of sales calls made in a month and the number of copiers
sold.
Soni Jones sold the most copiers last month, and she was one of three
representatives making 30 or more sales calls.
Susan Welch and Carlos Ramirez made only 10 calls last month. Ms. Welch
was tied for the lowest number of copiers sold (30) among the sampled representatives.
The implication is that the number of copiers sold is related to the number of sales
calls made. As the number of sales calls increases, it appears the number of
copiers sold also increases.
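One quick way to check this impression before drawing anything is to average the copiers sold at each level of sales calls. The sketch below is only an illustration: the variable and function names are made up for this example, and the data are the ten pairs from the table.

```python
from collections import defaultdict

# (sales calls X, copiers sold Y) for the ten sampled representatives.
data = [(20, 30), (40, 60), (20, 40), (30, 60), (10, 30),
        (10, 40), (20, 40), (20, 50), (20, 30), (30, 70)]

def mean_sold_by_calls(pairs):
    """Average copiers sold for each distinct number of sales calls."""
    groups = defaultdict(list)
    for calls, sold in pairs:
        groups[calls].append(sold)
    return {calls: sum(v) / len(v) for calls, v in sorted(groups.items())}

averages = mean_sold_by_calls(data)
for calls, avg in averages.items():
    print(f"{calls} calls: {avg:.1f} copiers sold on average")
```

The averages rise from 35 (10 calls) to 38 (20 calls) to 65 (30 calls). The single representative with 40 calls sold 60, so the pattern is upward but not perfectly regular.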
Common practice when drawing the scatter diagram:
Independent variable (number of sales calls) – horizontal or X-axis
Dependent variable (number of copiers sold) – vertical or Y-axis
Independent variable – the variable that provides the basis for estimation.
It is the predictor variable; here, the number of sales calls.
Dependent variable – the variable that is being predicted or estimated;
here, the number of copiers sold.
The scatter diagram shows graphically that the sales representatives who make
more calls tend to sell more copiers.
Note that while there appears to be a positive relationship between the two
variables, not all the points fall on a line.
In the following section you will measure the strength and direction of this
relationship between two variables by determining the coefficient of correlation.
7.2 The Coefficient of Correlation (r)
Coefficient of Correlation – describes the strength of the relationship between
two sets of interval-scaled or ratio-scaled variables.
ρ Population coefficient of correlation
r Sample coefficient of correlation
$$-1.00 \le r \le +1.00$$
A correlation coefficient of -1 or +1 indicates perfect correlation.
If there is absolutely no linear relationship between the two sets of variables,
Pearson's r is zero.
A coefficient of correlation r close to 0 (say 0.08) shows that the linear
relationship is quite weak. The same conclusion is drawn if r = -0.08.
Coefficients of -0.91 and +0.91 have equal strength; both indicate very strong
correlation between the two variables. Thus the strength of the correlation does
not depend on the direction (either + or -).
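A small numerical check can make this point concrete. The sketch below computes r for the copier data and again after flipping the sign of every Y value. The mirrored series is hypothetical and serves only to reverse the direction of the relationship. The function uses a form of the sample correlation in which the (n - 1) factors cancel, and its name is an illustrative choice.

```python
from math import sqrt

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # X
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]    # Y

def pearson_r(xs, ys):
    """Sample correlation from sums of deviation products and squares."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r_up = pearson_r(calls, sold)
# Hypothetical mirrored Y values: same scatter, opposite direction.
r_down = pearson_r(calls, [-y for y in sold])
print(round(r_up, 3), round(r_down, 3))
```

The two coefficients have the same magnitude and opposite signs, so they describe relationships of equal strength.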
In Excel: =CORREL(x, y), where x and y are the two data ranges.
Coefficient of Correlation (r) – A measure of the strength of the linear
relationship between two variables.
The sample coefficient of correlation is identified by the lower-case letter (r).
It shows the direction and strength of the linear (straight line) relationship
between two variables.
Sales Representative    Calls (X)    Sales (Y)    $X - \bar{X}$    $Y - \bar{Y}$    $(X - \bar{X})(Y - \bar{Y})$
Tom Keller              20           30           -2               -15              30
Jeff Hall               40           60           18               15               270
Brian Virost            20           40           -2               -5               10
Greg Fish               30           60           8                15               120
Susan Welch             10           30           -12              -15              180
Carlos Ramirez          10           40           -12              -5               60
Rich Niles              20           40           -2               -5               10
Mike Kiel               20           50           -2               5                -10
Mark Reynolds           20           30           -2               -15              30
Soni Jones              30           70           8                25               200
$\bar{X} = 22$, $\bar{Y} = 45$, $\sum (X - \bar{X})(Y - \bar{Y}) = 900$
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, S_X S_Y} = \frac{900}{(10 - 1)(9.189)(14.337)} = 0.759$$
The coefficient is positive, which confirms our reasoning based on the scatter diagram.
At 0.759 it is fairly close to 1, so the association is strong.
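The same calculation can be reproduced in a few lines of code. This is a minimal sketch, not a prescribed method: the function names are chosen for this example, and the standard deviations use the n - 1 divisor to match the formula for r.

```python
from math import sqrt

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # X
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]    # Y

def sample_std(values):
    """Sample standard deviation (divisor n - 1)."""
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def pearson_r(xs, ys):
    """r = sum of deviation products / ((n - 1) * S_X * S_Y)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sum_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sum_products / ((n - 1) * sample_std(xs) * sample_std(ys))

s_x, s_y = sample_std(calls), sample_std(sold)
r = pearson_r(calls, sold)
print(f"S_X = {s_x:.3f}, S_Y = {s_y:.3f}, r = {r:.3f}")
```

The printed standard deviations should agree with the 9.189 and 14.337 used in the hand calculation.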
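Positive Correlation
[Figure: scatter diagram divided into four quadrants around the point of means, numbered 1 (upper left), 2 (upper right), 3 (lower left), and 4 (lower right). Source: Clare Morris: Quantitative Approaches in Business Studies, 6/e © Clare Morris 2003]
In quadrant [2], both $(X - \bar{X})$ and $(Y - \bar{Y})$ are positive, so their product is positive (+ × + = +),
while in quadrant [3] both are negative, so their product is again positive (− × − = +).
With positive correlation most points fall in these two quadrants. The products
$(X - \bar{X})(Y - \bar{Y})$ will therefore nearly all be positive, as will the sum
$\sum (X - \bar{X})(Y - \bar{Y})$ over all the points.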
Negative Correlation
[Figure: scatter diagram with quadrants 1 to 4]
In quadrant [1], $(X - \bar{X})$ is negative and $(Y - \bar{Y})$ is positive (− × + = −), while in
quadrant [4], $(X - \bar{X})$ is positive and $(Y - \bar{Y})$ is negative (+ × − = −). With negative
correlation most points fall in these two quadrants, so the products are nearly all negative, as is their sum.
No Linear Relationship – Zero Correlation
[Figure: scatter diagram with quadrants 1 to 4]
For no correlation, the points are fairly uniformly scattered throughout all four
quadrants, so the products $(X - \bar{X})(Y - \bar{Y})$ will be fairly evenly balanced between
positive and negative.
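To connect the quadrant argument to the copier data, the sketch below counts how many of the ten representatives fall in each quadrant around the point of means. It assumes the numbering used in these notes (1 upper left, 2 upper right, 3 lower left, 4 lower right). The function name is illustrative.

```python
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # X
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]    # Y

def quadrant_counts(xs, ys):
    """Count points by the signs of their deviations from the two means."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    for x, y in zip(xs, ys):
        dx, dy = x - x_bar, y - y_bar
        if dx == 0 or dy == 0:
            continue              # on a dividing line: product is zero
        if dx < 0 and dy > 0:
            counts[1] += 1        # upper left, product negative
        elif dx > 0 and dy > 0:
            counts[2] += 1        # upper right, product positive
        elif dx < 0 and dy < 0:
            counts[3] += 1        # lower left, product positive
        else:
            counts[4] += 1        # lower right, product negative
    return counts

counts = quadrant_counts(calls, sold)
print(counts)
```

Nine of the ten points land in quadrants 2 and 3, where the products are positive, which is why the sum of products (900) is positive for this sample.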
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, S_X S_Y} = \frac{\sum XY - n\bar{X}\bar{Y}}{(n - 1)\, S_X S_Y}$$
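As a check that the two numerators agree, the computational form can be worked through with the copier data. The value of $\sum XY$ below is computed from the ten (X, Y) pairs in the table. The other figures are the means already reported.

```latex
\begin{align*}
\sum (X - \bar{X})(Y - \bar{Y})
  &= \sum XY - \bar{X}\sum Y - \bar{Y}\sum X + n\bar{X}\bar{Y}
   = \sum XY - n\bar{X}\bar{Y} \\
\sum XY &= 20(30) + 40(60) + \cdots + 30(70) = 10800 \\
n\bar{X}\bar{Y} &= 10(22)(45) = 9900 \\
\sum XY - n\bar{X}\bar{Y} &= 10800 - 9900 = 900
\end{align*}
```

The result, 900, matches the sum of the deviation products, so either numerator gives r = 0.759.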
Which factor (X1 or X2) has a higher correlation with Annual Salary (Y)?
        Y           X1          X2
Y       1
X1      0.813164    1
X2      0.962216    0.924995    1
7.3 Simple Linear Regression
In this section we wish to develop an equation to express the linear relationship
between two variables.
The technique used to develop the equation and provide the estimates is called
regression analysis.
Regression Equation – An equation that expresses the linear relationship between
two variables.
The scatter diagram is reproduced with a line drawn with a ruler through the dots
to illustrate that a straight line would probably fit the data.
Judgment is eliminated by determining the regression line with a mathematical
method called the least squares principle. This method gives us the “best-fitting”
line.
Least Squares Principle – Determining a regression equation by minimizing
the sum of the squares of the vertical distances between the actual Y values and
the predicted values $\hat{y}$.
$$\hat{y} = a + bx$$
Where
$\hat{y}$, read "y hat", is the predicted value of the Y variable for a selected X value.
b is the slope of the line, or the average change in $\hat{y}$ for each change of one
unit (either increase or decrease) in the independent variable x.
a is the Y-intercept:
$$a = \bar{Y} - b\bar{X}$$
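The least squares principle can be illustrated numerically: fit the line, then show that moving either coefficient away from the fitted value increases the sum of squared vertical distances. In the sketch below the slope is computed as the sum of deviation products divided by the sum of squared X deviations, which is algebraically the same as r times S_Y / S_X. The function names and the size of the nudges are arbitrary choices for this example.

```python
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # X
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]    # Y

def sse(a, b, xs, ys):
    """Sum of squared vertical distances between actual y and a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def least_squares(xs, ys):
    """Closed-form least squares intercept and slope for one predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

a, b = least_squares(calls, sold)
best = sse(a, b, calls, sold)

# Nudge the intercept or the slope and recompute the sum of squares.
nudged = [sse(a + da, b + db, calls, sold)
          for da, db in [(1, 0), (-1, 0), (0, 0.1), (0, -0.1)]]
print(round(best, 2), [round(v, 2) for v in nudged])
```

Every nudged line has a larger sum of squares than the fitted line, which is what "best-fitting" means under this principle.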
Use the least squares method to determine a linear equation to express the relationship
between the two variables. What is the expected number of copiers sold by a
representative who made 20 calls?
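$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, S_X S_Y} = \frac{900}{(10 - 1)(9.189)(14.337)} = 0.759$$
$$b = r\,\frac{S_Y}{S_X} = 0.759 \left(\frac{14.337}{9.189}\right) = 1.1842$$
$$a = \bar{Y} - b\bar{X} = 45 - (1.1842)(22) = 18.9476$$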
Thus, the regression equation is $\hat{y} = 18.9476 + 1.1842x$, and it can be shown on the
scatter diagram.
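The coefficients can be reproduced with Python's standard `statistics` module, following the same route as the hand calculation (r first, then b, then a). This is a sketch for checking the arithmetic. The variable names are illustrative.

```python
from statistics import mean, stdev

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]   # X
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]    # Y

n = len(calls)
x_bar, y_bar = mean(calls), mean(sold)
s_x, s_y = stdev(calls), stdev(sold)   # sample standard deviations (n - 1)

# r, then slope b = r * S_Y / S_X, then intercept a = Y-bar - b * X-bar.
r = (sum((x - x_bar) * (y - y_bar) for x, y in zip(calls, sold))
     / ((n - 1) * s_x * s_y))
b = r * s_y / s_x
a = y_bar - b * x_bar
print(f"r = {r:.3f}, b = {b:.4f}, a = {a:.4f}")
```

At full precision the intercept comes out as 18.9474 rather than 18.9476. The small difference arises because the hand calculation rounds b to 1.1842 before computing a.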
7.3.2 Drawing the Line of Regression
$$\hat{y} = 18.9476 + 1.1842x$$
For a representative who made 20 calls (x = 20):
$$\hat{y} = 18.9476 + 1.1842(20) = 42.6316$$
The a value of 18.9476 is the point where the equation crosses the Y-axis. A
literal translation is that if no sales calls are made, that is, X = 0, 18.9476 copiers
will be sold.
The b value of 1.1842 means that for each additional sales call made the sales
representative can expect to increase the number of copiers sold by about 1.2.
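The interpretation of a and b, and the points needed to draw the line, can be checked with a small helper. The sketch below uses the rounded coefficients from these notes. The function name and the choice of 10 and 40 calls as the two plotting points (the smallest and largest X in the sample) are assumptions made for this example.

```python
A = 18.9476   # intercept a from the notes (rounded)
B = 1.1842    # slope b from the notes (rounded)

def predict_copiers(calls):
    """Predicted copiers sold for a given number of sales calls."""
    return A + B * calls

at_zero = predict_copiers(0)                             # equals the intercept
at_20 = predict_copiers(20)                              # the worked example
per_extra_call = predict_copiers(21) - predict_copiers(20)

# Two points are enough to draw a straight line on the scatter diagram.
line_ends = [(x, round(predict_copiers(x), 4)) for x in (10, 40)]

print(round(at_zero, 4), round(at_20, 4), round(per_extra_call, 4))
print(line_ends)
```

Plotting the two points in `line_ends` and joining them with a ruler gives the regression line over the range of the sample data.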