Correlation and Regression
Examples
Marketers are interested in questions such as: are sales related to advertising? Is there a
relationship between a person’s age and his or her purchasing power? These are just some of the
questions that can be answered by using correlation and regression analysis.
[Figure: scatter diagrams of Y against X, including the panel "C - Scatter Diagram"]
The range of the correlation coefficient is from -1 to +1, written as −1 ≤ r ≤ +1.
a). If r = +1, there is a perfect positive linear relationship between the variables.
b). If r = −1, there is a perfect negative linear relationship between the variables.
c). If r is close to 0, the points are widely dispersed and the variables are uncorrelated, that
is, there is no linear relationship between the variables.
\[ r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\left\{ \left( \sum X_i^2 - n\bar{X}^2 \right) \left( \sum Y_i^2 - n\bar{Y}^2 \right) \right\}^{1/2}} \]
Where
n is the number of data pairs
Or
\[ r = \frac{n\sum xy - \sum x \sum y}{\left[ \left\{ n\sum x^2 - (\sum x)^2 \right\} \left\{ n\sum y^2 - (\sum y)^2 \right\} \right]^{1/2}} \]
Example
The following data refers to the amount of money spent by 10 customers who visited a
supermarket in a certain year and their social class index.
Calculate
i). Correlation coefficient
ii). Coefficient of determination
Solution
When you are faced with a mathematical or statistical problem that has a formula, check the
parameters you require. In our case for the Correlation coefficient, we need what is in the table
below
x       y       xy        x²        y²
57      113     6441      3249      12769
54      111     5994      2916      12321
49      107     5243      2401      11449
42      103     4326      1764      10609
38      100     3800      1444      10000
32      96      3072      1024      9216
30      94      2820      900       8836
24      84      2016      576       7056
20      74      1480      400       5476
18      76      1368      324       5776
Total:
364     958     36560     14998     93508
Make sure you have the summations in the last row of the table, as shown above.
Note that the table contains only the quantities needed to calculate the value of r.
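Substitute the values from the table above into the formula.
What is the size (n) of the sample we are using? There are 10 customers, so n = 10.
\[ r = \frac{(10 \times 36560) - (364 \times 958)}{\left[ \left\{ 10 \times 14998 - 364^2 \right\} \left\{ 10 \times 93508 - 958^2 \right\} \right]^{1/2}} \]
Dividing the numerator and the denominator by 10,
\[ r = \frac{1688.8}{\{1748.4 \times 1731.6\}^{1/2}} = \frac{16888}{17399.797} = 0.9706 \]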
There is a strong positive relationship between the amount of money spent in the supermarket
and the social class index.
The coefficient of determination is r² = (0.9706)² = 0.942.
This means that 94.2% of the variation of the social class (dependent variable) can be
explained by the variation of the amount of money spent in the supermarket every year
(independent variable), and 5.8% is determined by other factors.
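The calculation above can be checked with a short script. This is a minimal sketch in plain Python using the x and y columns of the table; the function name `pearson_r` and the variable names are illustrative choices, not part of the original notes.

```python
import math

# x and y columns from the worked example table
x = [57, 54, 49, 42, 38, 32, 30, 24, 20, 18]
y = [113, 111, 107, 103, 100, 96, 94, 84, 74, 76]

def pearson_r(xs, ys):
    """Correlation coefficient using the sums form of the formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

r = pearson_r(x, y)
r_squared = r ** 2  # coefficient of determination

print(f"r = {r:.4f}")            # r = 0.9706
print(f"r^2 = {r_squared:.3f}")  # r^2 = 0.942
```

The script rebuilds the same five sums as the table, so any slip in a hand-calculated column would show up as a different value of r.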
Example
The table below shows the marks of students for Business Statistics I (Stats 1) and Business
Statistics II (Stats 2). Find R
Stats 1   Rank   Stats 2   Rank   d     d²
80        2      80        2      0     0
60        4      50        5      -1    1
65        3      60        3      0     0
50        5      55        4      1     1
35        6      45        6      0     0
30        7      30        7      0     0
90        1      95        1      0     0
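The rank and difference columns suggest that R here is Spearman's rank correlation coefficient, R = 1 − 6∑d² / (n(n² − 1)); that reading is an assumption, since the notes do not state the formula. Under that assumption, a minimal Python sketch of the calculation is given below. The helper names `ranks_descending` and `spearman_r` are hypothetical.

```python
# Marks from the table above, one pair per student
stats1 = [80, 60, 65, 50, 35, 30, 90]
stats2 = [80, 50, 60, 55, 45, 30, 95]

def ranks_descending(values):
    """Rank 1 goes to the highest mark. Assumes no tied marks, as in this data."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_r(xs, ys):
    """Spearman's rank correlation (assumed to be the R asked for)."""
    n = len(xs)
    rank_x = ranks_descending(xs)
    rank_y = ranks_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

R = spearman_r(stats1, stats2)
print(round(R, 4))  # 0.9643
```

With ∑d² = 2 and n = 7, this gives R = 1 − 12/336, which is close to +1 and indicates that the two sets of marks rank the students almost identically.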
Regression analysis is a statistical procedure that can be used to develop a mathematical equation
showing how variables are related.
5.3.1 Simple Linear Regression Model
A single variable is used to predict another variable on the assumption of a linear relationship:
Y = a + bX
Where
Y , is the dependent or response variable
X , is the independent or explanatory or regressor variable.
a , represents the Y-intercept
b , the slope of the regression line and indicates the amount of change of dependent
variable for a unit change in the independent variable.
5.3.2 Determination of the Regression Line Equation
In algebra, as we considered the topic on graphs, the equation of a line is usually given as
y=mx +b , where m is the slope of the line and b is the y intercept.
The equation of the regression line is written as Y =a+bX . There are several methods for
finding the regression line but we consider one method.
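\[ b = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2} \]
\[ a = \bar{y} - b\bar{x} \quad \text{or} \quad a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} \]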
Example
i). Find the equation of the regression line for the data below which is obtained in the study of
age and blood pressure
Subject   Age (x)   Pressure (y)   xy        x²        y²
A         43        128            5,504     1,849     16,384
B         48        120            5,760     2,304     14,400
C         56        135            7,560     3,136     18,225
D         61        143            8,723     3,721     20,449
E         67        141            9,447     4,489     19,881
F         70        152            10,640    4,900     23,104
Total     345       819            47,634    20,399    112,443
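Thus ∑x = 345, ∑y = 819, ∑xy = 47,634, ∑x² = 20,399, ∑y² = 112,443 and n = 6
\[ b = \frac{6(47634) - (345)(819)}{6(20399) - (345)^2} = \frac{3249}{3369} = 0.96438 \]
\[ a = \frac{819}{6} - 0.96438 \times \frac{345}{6} = 81.048 \]
The regression equation is therefore Y = 81.048 + 0.96438X.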
The regression equation can be used to estimate the pressure given the age.
ii). Find the blood pressure for a person who is aged 50 years. This means that the value of
x = 50, so Y = 81.048 + 0.96438(50) = 129.27 (to two decimal places).
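The slope, intercept and prediction for this example can be reproduced with a few lines of Python. This is a sketch only, using the age and pressure columns from the table; the function name `fit_line` is an illustrative choice.

```python
# Age (x) and blood pressure (y) for subjects A to F
ages = [43, 48, 56, 61, 67, 70]
pressures = [128, 120, 135, 143, 141, 152]

def fit_line(xs, ys):
    """Return (a, b) for the regression line Y = a + bX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(p * q for p, q in zip(xs, ys))
    sum_x2 = sum(p * p for p in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

a, b = fit_line(ages, pressures)
predicted = a + b * 50  # estimated pressure at age 50

print(f"Y = {a:.3f} + {b:.5f}X")          # Y = 81.048 + 0.96438X
print(f"Pressure at 50: {predicted:.2f}")  # Pressure at 50: 129.27
```

Note that the slope b is computed first, because the intercept a depends on it.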
Another method used to determine the regression equation is the method of least squares,
considered in the next part.
The fitted line should pass through the points of the scatter diagram in such a manner that the
sum of the squares of the vertical deviations of these points from the line will be minimum.
Since some deviations are negative and others positive, we eliminate the signs by squaring each
deviation, then use the two normal equations to work out the values of a and b.
We have the normal equations
∑ y=na+b ∑ x
∑ xy=a ∑ x+b ∑ x²
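As a check on consistency, the two normal equations can be solved for a and b by substitution. The short derivation below is a sketch of that step; it arrives at the same slope and intercept formulas used for the regression line earlier in these notes.

```latex
\begin{aligned}
\sum y &= na + b\sum x
  &&\Rightarrow\quad a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} = \bar{y} - b\bar{x} \\
\sum xy &= a\sum x + b\sum x^2
  &&\Rightarrow\quad \sum xy = \frac{\sum x \sum y}{n} - b\,\frac{(\sum x)^2}{n} + b\sum x^2 \\
n\sum xy - \sum x \sum y &= b\left\{ n\sum x^2 - \left(\sum x\right)^2 \right\}
  &&\Rightarrow\quad b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}
\end{aligned}
```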
Example
Apply the method of least squares to fit a straight line relationship (Regression of Y on X) for the
following points
Solution
Use the normal equations and find x² and xy
x        y        x²       xy       y²
-2.4     -5.0     5.76     12.0     25.0
-0.8     -1.5     0.64     1.2      2.25
0.3      2.5      0.09     0.75     6.25
1.9      6.4      3.61     12.16    40.96
3.2      11.0     10.24    35.2     121.0
Total:
2.2      13.4     20.34    61.31    195.46
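With the totals in the last row, the two normal equations become 13.4 = 5a + 2.2b and 61.31 = 2.2a + 20.34b. The sketch below solves this pair in Python using Cramer's rule, which is one of several ways to solve two simultaneous equations and is chosen here only for brevity. The variable names are illustrative.

```python
# The five (x, y) points from the table above
xs = [-2.4, -0.8, 0.3, 1.9, 3.2]
ys = [-5.0, -1.5, 2.5, 6.4, 11.0]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(v * v for v in xs)
sum_xy = sum(p * q for p, q in zip(xs, ys))

# Normal equations:
#   sum_y  = n * a     + b * sum_x
#   sum_xy = a * sum_x + b * sum_x2
# Solved for a and b with Cramer's rule.
det = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / det
b = (n * sum_xy - sum_x * sum_y) / det

print(f"Y = {a:.3f} + {b:.3f}X")  # Y = 1.421 + 2.861X
```

Substituting the fitted a and b back into the first normal equation should return ∑y = 13.4, which is a quick way to confirm the solution.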
Age (x)                    18   26   39   48   53   58
Amount of Money ($) (y)    16   12   9    5    6    2
It is important to record both the numerical value and the time period associated with each
measurement. This information is then used to construct a time series plot or a run chart, with the
measurements on the vertical axis and time on the horizontal axis.
1. Trend (T)
Long term pattern of development of data or the course which the data has followed over a
considerable period (several years)
Influenced by changes in technology, population, wealth, value, etc
Moving Average:
\[ MA(L) = \frac{Y_1 + Y_2 + \dots + Y_L}{L} \]
The moving average is then centered on the middle value of the time series
Example
Given the data on a factory output, calculate the 5-day moving average.
Week      Day          Output (Y)   5 Day Total   5 Day MA - Trend (T)
Week 1    Monday       80
          Tuesday      104
          Wednesday    94           460           92.00
          Thursday     120          462           92.40
          Friday       62           468           93.60
Week 2    Monday       82           471           94.20
          Tuesday      110          476           95.20
          Wednesday    97           478           95.60
          Thursday     125          480           96.00
          Friday       64           486           97.20
Week 3    Monday       84           489           97.80
          Tuesday      116          494           98.80
          Wednesday    100          496           99.20
          Thursday     130
          Friday       66
Note:
On the 4th column, the totals are obtained as follows
o 460 = 80 + 104 + 94 + 120 + 62
o 462 = 104 + 94 + 120 + 62 + 82
o 468 = 94 + 120 + 62 + 82 + 110, etc
On the 5th column, the trend or moving averages are obtained as follows
o 92 = 460 / 5
o 92.40 = 462 / 5, etc
Can you explain why we are using 5 and not any other number?
It is because we are working with a 5 day moving average for our data.
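The 5-day totals and moving averages in the table can be generated with a short Python sketch. The function name `moving_average` and its `window` parameter are illustrative; the first result corresponds to Wednesday of Week 1, the middle day of the first five.

```python
# Daily output for Weeks 1 to 3, Monday to Friday
output = [80, 104, 94, 120, 62,
          82, 110, 97, 125, 64,
          84, 116, 100, 130, 66]

def moving_average(values, window):
    """Return the window totals and the corresponding moving averages."""
    totals = [
        sum(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]
    averages = [t / window for t in totals]
    return totals, averages

totals, trend = moving_average(output, 5)

print(totals)  # [460, 462, 468, 471, 476, 478, 480, 486, 489, 494, 496]
print(trend)   # [92.0, 92.4, 93.6, 94.2, 95.2, 95.6, 96.0, 97.2, 97.8, 98.8, 99.2]
```

Fifteen observations with a window of 5 give 11 averages, which is why the first two and last two days in the table have no trend value.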