4. Correlation and Regression Analysis
Types of Correlation:
i. Positive or negative
ii. Simple or multiple
iii. Linear or non-linear
The following are the important methods of ascertaining simple linear correlation:
i. Scatter diagram
ii. Karl Pearson's coefficient of correlation
iii. Spearman's rank correlation coefficient
[Scatter diagrams: high degree of positive correlation, high degree of negative correlation, and no correlation]
Interpretation of r:
The values of the correlation coefficient lie between -1 and +1.
r = +1: perfect positive correlation; r = -1: perfect negative correlation.
r close to +1: high degree of positive correlation; r close to -1: high degree of negative correlation.
r = 0: no linear correlation (r = 0 does not rule out a non-linear relationship between the variables).
Example: Find Karl Pearson’s correlation coefficient between the sales and expenses from
the data given below and interpret its value:
Firm:              1   2   3   4   5   6   7   8   9   10
Sales (Lakhs):     50  50  55  60  65  65  65  60  60  50
Expenses (Lakhs):  11  13  14  16  16  15  15  14  13  13
Solution:
Sales (X)  X − X̄  (X − X̄)²  Expenses (Y)  Y − Ȳ  (Y − Ȳ)²  (X − X̄)(Y − Ȳ)
50         −8      64        11            −3     9         +24
50         −8      64        13            −1     1         +8
55         −3      9         14            0      0         0
60         +2      4         16            +2     4         +4
65         +7      49        16            +2     4         +14
65         +7      49        15            +1     1         +7
65         +7      49        15            +1     1         +7
60         +2      4         14            0      0         0
60         +2      4         13            −1     1         −2
50         −8      64        13            −1     1         +8
∑X = 580   0       360       ∑Y = 140      0      22        70

X̄ = ∑X/n = 580/10 = 58,  Ȳ = ∑Y/n = 140/10 = 14

r = ∑(X − X̄)(Y − Ȳ) / (√∑(X − X̄)² · √∑(Y − Ȳ)²) = 70 / (√360 · √22) ≈ 0.79,

which indicates a strong linear relationship between the two variables with a positive slope.
Hence, there is a high degree of positive correlation between the two variables, i.e., as the
value of sales goes up, the expenses also go up.
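For readers who want to reproduce the calculation, the following Python sketch computes Karl Pearson's r from the deviations, exactly as in the table above (the function name pearson_r is illustrative and not part of the original notes):

```python
# Hedged sketch: Pearson's correlation coefficient computed from scratch for
# the sales/expenses data above (function and variable names are illustrative).
import math

def pearson_r(x, y):
    """Karl Pearson's r = sum of cross-deviations / sqrt(Sxx * Syy)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

sales    = [50, 50, 55, 60, 65, 65, 65, 60, 60, 50]
expenses = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]
print(round(pearson_r(sales, expenses), 3))   # ~0.787, a high positive correlation
```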
Example: Find Karl Pearson’s correlation coefficient between the sales and expenses from
the data given below and interpret its value:
Advertising Expenses (Lakhs):  10  12  15  23  20
Sales (Lakhs):                 14  17  23  25  21
ANS: r = +0.865
Example: Show that the coefficient of correlation lies between -1 and +1.
Ans:
r = ∑(Xᵢ − X̄)(Yᵢ − Ȳ) / (√∑(Xᵢ − X̄)² · √∑(Yᵢ − Ȳ)²)

Let aᵢ = (Xᵢ − X̄)/√∑(Xᵢ − X̄)²  and  bᵢ = (Yᵢ − Ȳ)/√∑(Yᵢ − Ȳ)²,

so that ∑aᵢ² = 1, ∑bᵢ² = 1 and ∑aᵢbᵢ = r.

Now ∑(aᵢ ± bᵢ)² ≥ 0, i.e. ∑aᵢ² + ∑bᵢ² ± 2∑aᵢbᵢ ≥ 0,

i.e. 2 ± 2r ≥ 0, which gives −1 ≤ r ≤ +1.
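The bound can also be checked numerically. A minimal sketch (the simulated data are arbitrary; statistics.correlation requires Python 3.10 or later):

```python
# Sanity check that r never leaves [-1, +1] on arbitrary data
# (illustrative only; statistics.correlation needs Python 3.10+).
import random
from statistics import correlation

random.seed(1)
for _ in range(1000):
    x = [random.uniform(-10, 10) for _ in range(20)]
    y = [random.uniform(-10, 10) for _ in range(20)]
    assert -1.0 <= correlation(x, y) <= 1.0
print("all simulated r values lie within [-1, +1]")
```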
Example: Two managers are asked to rank a group of employees in order of potential for
eventually becoming top managers. The rankings are as follows:
Employee:              A   B  C  D  E  F  G  H  I  J
Ranking by Manager 1:  10  2  1  4  3  6  5  8  7  9
Ranking by Manager 2:  9   4  2  3  1  5  6  8  7  10
Compute the coefficient of rank correlation and comment on the value.
Solution:
Employee  Rank by Manager 1 (R₁)  Rank by Manager 2 (R₂)  d² = (R₁ − R₂)²
A 10 9 1
B 2 4 4
C 1 2 1
D 4 3 1
E 3 1 4
F 6 5 1
G 5 6 1
H 8 8 0
I 7 7 0
J 9 10 1
N = 10, ∑d² = 14
r_s = 1 − 6∑d² / (N(N² − 1)) = 1 − (6 × 14) / (10 × 99) = 1 − 84/990 ≈ 0.915
Thus we find that there is a high degree of positive correlation in the ranks assigned by the
two managers.
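A small Python sketch of the same calculation, applying the rank-correlation formula r_s = 1 − 6∑d²/(N(N² − 1)) to the managers' rankings (the function name spearman_rs is illustrative):

```python
# Hedged sketch: Spearman's rank correlation for the two managers' rankings
# above (names are illustrative; no ties are present in this data).
def spearman_rs(rank1, rank2):
    n = len(rank1)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank1, rank2))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

manager1 = [10, 2, 1, 4, 3, 6, 5, 8, 7, 9]
manager2 = [9, 4, 2, 3, 1, 5, 6, 8, 7, 10]
print(round(spearman_rs(manager1, manager2), 3))   # ~0.915
```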
Example: Compute the rank correlation coefficient for the following data of two tests given to
candidates for a critical job and comment on the value.
Preliminary test:  92  89  87  86  83  77  71  63  53  50
Final test:        86  83  91  77  68  85  52  82  37  57
Solution:
Preliminary test  Rank (R₁)  Final test  Rank (R₂)  d² = (R₁ − R₂)²
92 10 86 9 1
89 9 83 7 4
87 8 91 10 4
86 7 77 5 4
83 6 68 4 4
77 5 85 8 9
71 4 52 2 4
63 3 82 6 9
53 2 37 1 1
50 1 57 3 4
N = 10, ∑d² = 44
r_s = 1 − 6∑d² / (N(N² − 1)) = 1 − (6 × 44) / (10 × 99) = 1 − 264/990 ≈ 0.733
Thus we find that there is a high degree of positive correlation between the preliminary test and
the final test.

Interpretation of r_s: the rank correlation coefficient r_s ranges from −1 to +1.
+1: perfect positive correlation (as one variable increases, the other increases proportionally).
−1: perfect negative correlation (as one variable increases, the other decreases proportionally).
0.8 to 1.0: strong positive correlation.
Tie in Ranks:
An adjustment of the above formula is made when ranks are tied: for each group of m tied
observations, the quantity (m³ − m)/12 is added to ∑d². The formula becomes

r_s = 1 − 6{∑d² + (m³ − m)/12 + (m³ − m)/12 + …} / (N(N² − 1))
Example: An examination of eight applicants for a post was conducted by a firm. From the
marks obtained by the applicants in the Bangla and English papers, compute the rank
correlation coefficient:
Applicant:         A   B   C   D   E   F   G   H
Marks in Bangla:   15  20  28  12  40  60  20  80
Marks in English:  40  30  50  30  20  10  30  60
Solution:
Marks in Bangla  Rank (R₁)  Marks in English  Rank (R₂)  d² = (R₁ − R₂)²
15               2          40                6          16
20               3.5        30                4          0.25
28               5          50                7          4
12               1          30                4          9
40               6          20                2          16
60               7          10                1          36
20               3.5        30                4          0.25
80               8          60                8          0
N = 8                                                    ∑d² = 81.5

In Bangla the mark 20 is tied twice (m = 2) and in English the mark 30 is tied three times (m = 3), so

r_s = 1 − 6{81.5 + (2³ − 2)/12 + (3³ − 3)/12} / (8(8² − 1)) = 1 − 6(81.5 + 0.5 + 2)/504 = 1 − 504/504 = 0.

Thus there is no rank correlation between the marks obtained in the two papers.
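The tie-adjusted calculation can be sketched in Python as follows; the helper names are illustrative, and the average-rank and (m³ − m)/12 corrections mirror the formula above:

```python
# Hedged sketch: rank correlation with the tie correction used above.
# Marks are converted to average ranks, and (m^3 - m)/12 is added to
# sum(d^2) for every group of m tied values (helper names are illustrative).
from collections import Counter

def average_ranks(values):
    """Assign ranks 1..n (smallest value = rank 1), averaging over ties."""
    order = sorted(values)
    first = {}                        # value -> first (1-based) position in sorted order
    for pos, v in enumerate(order, start=1):
        first.setdefault(v, pos)
    counts = Counter(values)
    # a tied group starting at position p with m members gets rank p + (m - 1)/2
    return [first[v] + (counts[v] - 1) / 2 for v in values]

def spearman_with_ties(x, y):
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = sum((m ** 3 - m) / 12 for m in Counter(x).values() if m > 1)
    correction += sum((m ** 3 - m) / 12 for m in Counter(y).values() if m > 1)
    return 1 - 6 * (d2 + correction) / (n * (n ** 2 - 1))

bangla  = [15, 20, 28, 12, 40, 60, 20, 80]
english = [40, 30, 50, 30, 20, 10, 30, 60]
print(round(spearman_with_ties(bangla, english), 3))   # ~0.0 for these marks
```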
Regression Analysis
The regression analysis is a technique of studying the dependence of one variable (called
dependent variable), on one or more variables (called independent variables), with a view to
estimating or predicting the average value of the dependent variable in terms of the known or
fixed values of the independent variables.
[Scatter plots: a deterministic relationship and a non-deterministic relationship]
The fitted straight line is Ŷ = a + bX, where Ŷ denotes the predicted value of Y, a is the
intercept and b is the slope of the straight line. In regression terminology, b is the regression
coefficient of Y on X. This straight line is called the fitted line of Y on X.

In practice, the observed value of Y will almost always deviate from this expectation. If this
discrepancy is denoted by the quantity e, then

Y = a + bX + e.
The least-Squares Method:
The least-squares method is a technique for minimizing the sum of the squares of the
differences between the observed values and estimated values of the dependent variable.
That is, the least-squares line is the line that minimizes

SSE = ∑e² = ∑(Y − Ŷ)² = ∑(Y − a − bX)².

To minimize SSE with respect to a and b, from calculus we know that the partial derivatives
of SSE with respect to a and b must be 0. Then

∂SSE/∂a = −2∑(Y − a − bX) = 0
∂SSE/∂b = −2∑X(Y − a − bX) = 0,

which gives the normal equations

∑Y = na + b∑X
∑XY = a∑X + b∑X².

Solving these two equations,

b = [n∑XY − ∑X∑Y] / [n∑X² − (∑X)²]  and  a = Ȳ − bX̄.
Example: The following table shows the distance to the transmitter (X) and the corresponding
wireless signal strength (Y).

Distance to transmitter (m):    13    1     17    19    14    15    15    8     13    3
Wireless signal strength (dB):  34.4  38.4  30.4  29.7  30.1  33.9  32.8  35.2  34.9  36.8
i. Find the regression line of Y on X.
ii. Predict what the signal strength would be if the distance was 10 meters.
Solution:

n = 10, ∑X = 118, ∑Y = 336.6, ∑XY = 3834.3, ∑X² = 1708
X̄ = ∑X/n = 118/10 = 11.8,  Ȳ = ∑Y/n = 336.6/10 = 33.66

b = [n∑XY − ∑X∑Y] / [n∑X² − (∑X)²] = [10(3834.3) − 118(336.6)] / [10(1708) − (118)²] = −1375.8/3156 ≈ −0.436
a = Ȳ − bX̄ = 33.66 − (−0.436)(11.8) ≈ 38.80

i. The regression line of Y on X is Ŷ = 38.80 − 0.436X.
ii. At a distance of X = 10 m, the predicted signal strength is Ŷ = 38.80 − 0.436(10) ≈ 34.44 dB.
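A Python sketch of the least-squares fit for this example, using the normal-equation formulas for b and a derived above (the function name fit_line is illustrative):

```python
# Hedged sketch: fit the least-squares line Y-hat = a + bX for the
# distance/signal-strength data above and predict at X = 10 m.
def fit_line(x, y):
    """Return (a, b) from the normal equations of simple linear regression."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n          # a = Y-bar - b * X-bar
    return a, b

distance = [13, 1, 17, 19, 14, 15, 15, 8, 13, 3]
signal   = [34.4, 38.4, 30.4, 29.7, 30.1, 33.9, 32.8, 35.2, 34.9, 36.8]
a, b = fit_line(distance, signal)
print(round(a, 2), round(b, 3))            # intercept ~38.8, slope ~-0.436
print(round(a + b * 10, 2))                # predicted strength at 10 m, ~34.44 dB
```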
The total variation of Y about its mean, SST = ∑(Y − Ȳ)², can be split into an explained part,
SSR = ∑(Ŷ − Ȳ)², and an unexplained part, SSE, so that SST = SSR + SSE.

In a deterministic relationship SSE = 0, i.e. for a perfectly fitting estimated line SST = SSR and
hence SSR/SST = 1. In the worst case SSR = 0, i.e. SSE = SST and hence SSR/SST = 0.

So the ratio SSR/SST evaluates how good the estimated regression line is; values of this ratio
closer to 1 imply a better-fitting estimated line. Thus the ratio SSR/SST is known as the
coefficient of determination:

R² = SSR/SST = ∑(Ŷ − Ȳ)² / ∑(Y − Ȳ)².
R² is a non-negative quantity and its limits are 0 ≤ R² ≤ 1. Verbally, R² measures the
percentage of the total variation in the dependent variable explained by the regression model.
Example: The following table shows the hardness (X) and tensile strength (Y) of 5
samples of metal:
X 146 152 158 164 170
Y 75 78 77 89 82
Find the regression line of Y on X. Is this linear model adequate for the given data set? Justify
your answer.
Solution:

For the given data, n = 5 and

∑X = 790, ∑Y = 401, ∑X² = 125180, ∑Y² = 32283, ∑XY = 63508,
X̄ = 790/5 = 158,  Ȳ = 401/5 = 80.2.

Sxx = ∑X² − (∑X)²/n = 125180 − (790)²/5 = 360
Syy = ∑Y² − (∑Y)²/n = 32283 − (401)²/5 = 122.8
Sxy = ∑XY − (∑X)(∑Y)/n = 63508 − (790)(401)/5 = 150

b = Sxy/Sxx = 150/360 ≈ 0.417,  a = Ȳ − bX̄ = 80.2 − (150/360)(158) ≈ 14.37.

The regression line of Y on X is Ŷ = 14.37 + 0.417X.

To judge whether the linear model is adequate, compute the coefficient of determination:

R² = SSR/SST = (Sxy)² / (Sxx · Syy) = (150)² / (360 × 122.8) ≈ 0.51.

Only about 51% of the total variation in tensile strength is explained by the straight-line
model, so the linear fit is at best moderately adequate for this data set.
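A Python sketch that reproduces both the fitted line and R² for this example (helper names are illustrative):

```python
# Hedged sketch: regression line and coefficient of determination
# R^2 = SSR/SST for the hardness/tensile-strength data above.
def fit_and_r_squared(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    y_hat = [a + b * xi for xi in x]
    ssr = sum((yh - mean_y) ** 2 for yh in y_hat)   # explained variation
    sst = sum((yi - mean_y) ** 2 for yi in y)       # total variation
    return a, b, ssr / sst

hardness = [146, 152, 158, 164, 170]
tensile  = [75, 78, 77, 89, 82]
a, b, r2 = fit_and_r_squared(hardness, tensile)
print(round(a, 2), round(b, 3), round(r2, 3))   # ~14.37, ~0.417, R^2 ~0.51
```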
Example: A test was made with different doses of nitrogen (X) on a rice field to observe the
rice production. The following data were recorded:
Nitrogen dose (X):    0   1   2   3   4   5
Rice production (Y):  15  25  40  55  52  43
Fit a second-degree polynomial to the data.
Solution:
N-Dose (X)  R-Production (Y)  X²  X⁴  X³  XY  X²Y
0 15 0 0 0 0 0
1 25 1 1 1 25 25
2 40 4 16 8 80 160
3 55 9 81 27 165 495
4 52 16 256 64 208 832
5 43 25 625 125 215 1075
Total: ∑X = 15, ∑Y = 230, ∑X² = 55, ∑X⁴ = 979, ∑X³ = 225, ∑XY = 693, ∑X²Y = 2587

For the second-degree polynomial Ŷ = a + bX + cX², the normal equations are

∑Y = na + b∑X + c∑X²       ⇒  230  = 6a + 15b + 55c
∑XY = a∑X + b∑X² + c∑X³    ⇒  693  = 15a + 55b + 225c
∑X²Y = a∑X² + b∑X³ + c∑X⁴  ⇒  2587 = 55a + 225b + 979c
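The three normal equations can be solved by hand or, as in the hedged Python sketch below, with a small linear solve (numpy is assumed to be available; variable names are illustrative). The script prints the fitted coefficients a, b and c of the second-degree polynomial:

```python
# Hedged sketch: solve the three normal equations of the second-degree
# polynomial Y-hat = a + bX + cX^2 for the nitrogen/rice data above.
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([15, 25, 40, 55, 52, 43], dtype=float)
n = len(x)

# Normal equations written as M @ [a, b, c] = v
M = np.array([
    [n,            x.sum(),        (x**2).sum()],
    [x.sum(),      (x**2).sum(),   (x**3).sum()],
    [(x**2).sum(), (x**3).sum(),   (x**4).sum()],
])
v = np.array([y.sum(), (x * y).sum(), (x**2 * y).sum()])

a, b, c = np.linalg.solve(M, v)
print(round(a, 2), round(b, 2), round(c, 2))   # fitted coefficients a, b, c
```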