Correlation Notes
Suppose, for example, that we want to measure the degree of association between rainfall and the
yield of rice. Are they positively related, i.e., are high values of rainfall associated with high
values of yield? Are they negatively related? Or is there no relationship between them at all? If
higher values of one variable are associated with higher values of the other, and lower values of
one are accompanied by lower values of the other (in other words, the two variables move in the
same direction), there is said to be positive or direct correlation between the variables. If, on
the other hand, higher values of one variable are associated with lower values of the other (i.e.,
the two variables move in opposite directions), the correlation between the variables is said to be
negative or inverse. For example, investment is likely to be negatively correlated with the rate of
interest. Similarly, if the number of pests in a crop increases, the yield will decrease.
TYPES OF CORRELATION:
According to the direction of change in the variables, there are two types of correlation:
1. Positive Correlation
2. Negative Correlation
1. Positive Correlation
Correlation between two variables is said to be positive if the values of the variables change
in the same direction, i.e., if the values of one variable increase (or decrease), the values of
the other variable also increase (or decrease). Some examples of positive correlation are the
correlation between:
i. Heights and weights of a group of persons;
ii. Household income and expenditure;
iii. Amount of rainfall and yield of crops; and
iv. Expenditure on advertising and sales revenue.
2. Negative Correlation
Correlation between two variables is said to be negative if the values of the variables change in
opposite directions, i.e., if the values of one variable increase (or decrease), the values of the
other variable decrease (or increase). Some examples of negative correlation are the correlation
between:
i. Volume and pressure of a perfect gas (at constant temperature);
ii. Price and demand of goods;
iii. Literacy and poverty in a country; and
iv. Time spent on watching TV and marks obtained by students in examination.
MEASUREMENT OF CORRELATION:
The following methods are used to measure simple correlation between two variables:
1) Scatter Diagram
2) Karl Pearson’s Coefficient of Correlation
3) Spearman’s Rank Correlation Coefficient
1. SCATTER DIAGRAM: A scatter diagram is a graphical tool for judging whether a dependent
variable and an independent variable are likely to be correlated. It does not give the exact
relationship between the two variables, but it indicates whether they are correlated or not.
Let (x_i, y_i), i = 1, 2, ..., n, be the bivariate data. If the values of the variable Y are
plotted against the corresponding values of the variable X in the XY plane, the resulting diagram
of dots is called a scatter diagram or dot diagram. Note that a scatter diagram is not suitable
for a large number of observations.
When there is a positive correlation between the variables, the dots on the scatter diagram run
from the lower left-hand corner to the upper right-hand corner. In case of perfect positive
correlation, all the dots lie on a rising straight line.
When a negative correlation exists between the variables, the dots on the scatter diagram run
from the upper left-hand corner to the lower right-hand corner. In case of perfect negative
correlation, all the dots lie on a falling straight line.
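The direction in which the dots run can be seen even in a very rough plot. The following is a minimal sketch, not a substitute for a proper plotting tool: `ascii_scatter` is a hypothetical helper written for these notes, and the rainfall and yield figures are made-up illustrative values.

```python
def ascii_scatter(xs, ys, width=21, height=11):
    """Return a text scatter diagram: '*' marks each (x, y) pair, '.' is empty."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [["." for _ in range(width)] for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column/row index; "or 1" guards a zero range.
        col = round((x - x_min) / ((x_max - x_min) or 1) * (width - 1))
        row = round((y - y_min) / ((y_max - y_min) or 1) * (height - 1))
        # The first text line is the top of the plot, so flip the row index.
        grid[height - 1 - row][col] = "*"
    return "\n".join("".join(line) for line in grid)


rainfall = [10, 20, 30, 40, 50]     # hypothetical X values
crop_yield = [12, 25, 31, 44, 52]   # hypothetical Y values
print(ascii_scatter(rainfall, crop_yield))
```

With positively correlated data the stars climb from the lower left to the upper right; reversing one of the lists makes them fall from the upper left to the lower right.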
2. Karl Pearson’s Coefficient of Correlation:
A scatter diagram tells us whether variables are correlated or not, but it does not indicate the
extent to which they are correlated. The coefficient of correlation gives a precise measure of that
extent: it measures the intensity or degree of linear relationship between two variables. It was
given by the British biometrician Karl Pearson (1857-1936).
If X and Y are two random variables, then the correlation coefficient between X and Y is denoted
by r and defined as
\[ r_{xy} = \operatorname{corr}(x, y) = \frac{\operatorname{Cov}(x, y)}{\sqrt{V(x)\,V(y)}} \qquad (1) \]
Here corr(x, y) denotes the correlation coefficient between the two variables X and Y, and
Cov(x, y), the covariance between X and Y, is defined as
\[ \operatorname{Cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]
V(x), the variance of X, is defined as
\[ V(x) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Similarly, V(y), the variance of Y, is defined as
\[ V(y) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 \]
where \bar{x} and \bar{y} are the means of X and Y, and n is the number of paired observations.
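The definitions above translate almost line by line into code. This is a minimal sketch using the 1/n (population) forms given in these notes; the function names and the small data set are illustrative assumptions, not part of the original material.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Cov(x, y) = (1/n) * sum of (x_i - x_bar) * (y_i - y_bar)."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / len(xs)


def variance(xs):
    """V(x) is the covariance of a variable with itself."""
    return covariance(xs, xs)


def correlation(xs, ys):
    """Equation (1): Cov(x, y) / sqrt(V(x) * V(y))."""
    return covariance(xs, ys) / (variance(xs) * variance(ys)) ** 0.5


x = [1, 2, 3, 4, 5]   # illustrative data, not taken from the notes
y = [2, 4, 5, 4, 5]
print(covariance(x, y), variance(x), variance(y), correlation(x, y))
```

Because the same 1/n factor appears in the covariance and in both variances, it cancels in the ratio, which is why the later forms of r can drop it.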
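Then the correlation coefficient r may be written as
\[ r_{xy} = \operatorname{corr}(x, y) = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)\left(\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right)}} \qquad (2) \]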
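Karl Pearson’s correlation coefficient r is also called the product moment correlation coefficient.
The expression in equation (2) can be simplified into various forms. Some of them are
\[ r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)\left(\sum_{i=1}^{n}(y_i - \bar{y})^2\right)}} \qquad (3) \]
or
\[ r_{xy} = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}}{\sqrt{\left\{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2\right\}\left\{\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2\right\}}} \qquad (4) \]
or
\[ r_{xy} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sqrt{\left\{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2\right\}\left\{\sum_{i=1}^{n} y_i^2 - n\,\bar{y}^2\right\}}} \qquad (5) \]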
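That the different forms give the same number can be checked numerically. The sketch below implements the deviation form (3) and the raw-sums form (5) separately and compares them; the function names and data are illustrative assumptions.

```python
from math import sqrt


def r_deviation_form(xs, ys):
    """Equation (3): built from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def r_raw_sums_form(xs, ys):
    """Equation (5): built from raw sums of x*y, x^2 and y^2."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    numerator = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    denominator = sqrt(
        (sum(x * x for x in xs) - n * x_bar ** 2)
        * (sum(y * y for y in ys) - n * y_bar ** 2)
    )
    return numerator / denominator


x = [2, 4, 6, 8, 11]   # illustrative data
y = [1, 3, 2, 7, 9]
print(r_deviation_form(x, y), r_raw_sums_form(x, y))
```

The two results agree up to floating-point rounding. Form (5) is the convenient one for hand calculation because it needs only the column totals of a table of X, Y, XY, X² and Y².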
Assumptions for Correlation Coefficient
1. Assumption of Linearity
The variables whose correlation coefficient is computed must be linearly related. The linearity
of the relationship can be checked with a scatter diagram.
2. Assumption of Normality
Both variables under study should follow a Normal distribution. They should not be skewed in
either the positive or the negative direction.
3. Assumption of Cause and Effect Relationship
There should be a cause and effect relationship between the two variables, for example, heights
and weights of children, demand and supply of goods, etc. When there is no cause and effect
relationship between the variables, the correlation coefficient should be zero. If it is non-zero,
the correlation is termed chance correlation or spurious correlation. Examples are the
correlation coefficient between:
i. Weight and income of a person over periods of time; and
ii. Rainfall and literacy in a state over periods of time.
PROPERTIES OF CORRELATION COEFFICIENT
i. Coefficient of Correlation lies between -1 and +1:
The coefficient of correlation cannot take a value less than -1 or more than +1.
Symbolically,
\[ -1 \le r \le +1 \quad \text{or} \quad |r| \le 1 \]
ii. Coefficient of Correlation is independent of Change of Origin:
This property reveals that if we subtract any constant from all the values of X and any constant
from all the values of Y, it will not affect the coefficient of correlation.
iii. Coefficient of Correlation possesses the property of symmetry:
The degree of relationship between two variables is symmetric, as shown below:
\[ r_{xy} = r_{yx} \]
iv. Coefficient of Correlation is independent of Change of Scale:
This property reveals that if we divide or multiply all the values of X and Y by positive
constants, it will not affect the coefficient of correlation.
v. Coefficient of correlation measures only linear correlation between X and Y.
vi. If two variables X and Y are independent, the coefficient of correlation between them will
be zero. The converse need not hold: a zero coefficient rules out only a linear relationship.
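Several of these properties are easy to confirm numerically. The sketch below is an illustration under stated assumptions: `pearson_r` is a helper defined here from the deviation form of r, and the data and the constants used for shifting and rescaling are arbitrary choices.

```python
from math import sqrt


def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [3, 7, 8, 12, 15]      # illustrative values
y = [10, 14, 13, 20, 24]
r = pearson_r(x, y)

# Change of origin: subtract a constant from X and another from Y.
r_origin = pearson_r([v - 5 for v in x], [v - 12 for v in y])
# Change of scale: divide X and multiply Y by positive constants.
r_scale = pearson_r([v / 10 for v in x], [v * 3 for v in y])
# Symmetry: swap the roles of X and Y.
r_swapped = pearson_r(y, x)
# Only linear association is measured: Y = X^2 on X values symmetric
# about zero gives r = 0 although Y is completely determined by X.
u = [-2, -1, 0, 1, 2]
r_quadratic = pearson_r(u, [v * v for v in u])

print(r, r_origin, r_scale, r_swapped, r_quadratic)
```

The first four values coincide up to rounding, and the last one is zero, which shows why a zero coefficient does not by itself establish independence.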
As correlation measures the degree of linear relationship, different values of the coefficient of
correlation can be interpreted as below:
Value of correlation coefficient    Correlation is
+1                                  Perfect positive correlation
−1                                  Perfect negative correlation
0                                   No correlation
0 to +0.25                          Weak positive correlation
+0.75 to +1                         Strong positive correlation
−0.25 to 0                          Weak negative correlation
−0.75 to −1                         Strong negative correlation
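The table can be applied mechanically once r is known. The following is a small sketch; `describe_correlation` is a hypothetical helper, and since the table gives no label for values of |r| between 0.25 and 0.75, the function reports that band as unlabelled instead of inventing a name for it.

```python
def describe_correlation(r):
    """Label r using only the bands listed in the interpretation table."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")
    if r == 1:
        return "perfect positive correlation"
    if r == -1:
        return "perfect negative correlation"
    if r == 0:
        return "no correlation"
    sign = "positive" if r > 0 else "negative"
    size = abs(r)
    if size <= 0.25:
        return f"weak {sign} correlation"
    if size >= 0.75:
        return f"strong {sign} correlation"
    # Assumption: the table gives no label for 0.25 < |r| < 0.75.
    return f"{sign} correlation (band not labelled in the table)"


for value in (1, 0.7804, 0.1, 0, -0.5, -0.9):
    print(value, "->", describe_correlation(value))
```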
by the variation of the other variable. The coefficient of determination is used to compare two
correlation coefficients.
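The significance of r is tested with the statistic
\[ t = \frac{|r|}{\sqrt{\dfrac{1 - r^2}{n - 2}}} \]
which has (n − 2) degrees of freedom.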
Problem: Compute Pearson’s coefficient of correlation between plant height (cm) and yield
(kg) as per the data given below:
X       Y       XY       X²       Y²
39      47      1833     1521     2209
65      53      3445     4225     2809
62      58      3596     3844     3364
90      86      7740     8100     7396
82      62      5084     6724     3844
75      68      5100     5625     4624
25      60      1500     625      3600
98      91      8918     9604     8281
36      51      1836     1296     2601
78      84      6552     6084     7056
∑X = 650, ∑Y = 660, ∑XY = 45604, ∑X² = 47648, ∑Y² = 45784
Solution:
H0: The correlation coefficient r is not significant.
H1: The correlation coefficient r is significant.
Level of significance: 5%
From the data, n = 10, ∑X = 650, ∑Y = 660, ∑XY = 45604, ∑X² = 47648 and ∑Y² = 45784.
\[ r_{xy} = \frac{\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}}{\sqrt{\left\{\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}\right\}\left\{\sum y_i^2 - \dfrac{(\sum y_i)^2}{n}\right\}}} \]
\[ r_{xy} = \frac{45604 - \dfrac{(650)(660)}{10}}{\sqrt{47648 - \dfrac{(650)^2}{10}}\;\sqrt{45784 - \dfrac{(660)^2}{10}}} = \frac{45604 - 42900}{(73.47)(47.16)} = 0.7804 \]
This means that there is a high positive correlation between plant height and yield.
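\[ t = \frac{|r|}{\sqrt{\dfrac{1 - r^2}{n - 2}}} \quad \text{with } (n - 2) \text{ d.f.} \]
\[ t = \frac{0.7804}{\sqrt{\dfrac{1 - (0.7804)^2}{10 - 2}}} = 3.530 \]
t_tab = t(10 − 2 d.f., 5% level of significance) = 2.306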
Inference
Since t > t_tab, we reject the null hypothesis.
∴ The correlation coefficient r is significant, i.e., there is a relation between plant height
and yield.
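The whole calculation for the plant height and yield data can be reproduced from the two raw columns. This is a sketch for cross-checking the hand computation; the variable names are illustrative, and the tabulated value 2.306 is hard-coded from the solution above, not looked up by the code.

```python
from math import sqrt

heights = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]   # X, plant height (cm)
yields = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]    # Y, yield (kg)
n = len(heights)

# Column totals of the working table.
sum_x, sum_y = sum(heights), sum(yields)
sum_xy = sum(x * y for x, y in zip(heights, yields))
sum_x2 = sum(x * x for x in heights)
sum_y2 = sum(y * y for y in yields)

# Pearson's r from the totals, then the t statistic with n - 2 d.f.
r = (sum_xy - sum_x * sum_y / n) / sqrt(
    (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
)
t = abs(r) / sqrt((1 - r ** 2) / (n - 2))

T_TABLE = 2.306   # tabulated t for 8 d.f. at the 5% level, as quoted in the solution
print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(r, 4), round(t, 3), t > T_TABLE)
```

Keeping r unrounded until the final step avoids carrying rounding error into t.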
Problem: Calculate the correlation coefficient of the given data:
x     50     51      52      53      54
y     3.1    3.2     3.3     3.4     3.5
Solution:
Here n = 5
x     50      51      52      53      54
y     3.1     3.2     3.3     3.4     3.5
xy    155     163.2   171.6   180.2   189
x²    2500    2601    2704    2809    2916
y²    9.61    10.24   10.89   11.56   12.25
By substituting all the values in formula, we get r = 1. This shows a perfect positive
correlation between x and y.
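Problem: Calculate the correlation coefficient of the given data:
x     12     15     18     21     27
y     2      4      6      8      12
Solution:
Here n = 5
x     12     15     18     21     27
y     2      4      6      8      12
xy    24     60     108    168    324
x²    144    225    324    441    729
y²    4      16     36     64     144
From the table, ∑x = 93, ∑y = 32, ∑xy = 684, ∑x² = 1863 and ∑y² = 264, so that
\[ r = \frac{684 - \dfrac{(93)(32)}{5}}{\sqrt{1863 - \dfrac{(93)^2}{5}}\;\sqrt{264 - \dfrac{(32)^2}{5}}} = \frac{88.8}{\sqrt{(133.2)(59.2)}} = \frac{88.8}{88.8} = 1 \]
We have r = 1. This shows a perfect positive correlation between x and y.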
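Hand-built tables of products and squares are a common source of arithmetic slips, so it is worth recomputing the rows and r directly from the x and y values. The following is a minimal sketch with illustrative variable names.

```python
from math import sqrt

x = [12, 15, 18, 21, 27]
y = [2, 4, 6, 8, 12]
n = len(x)

xy_row = [a * b for a, b in zip(x, y)]   # the "xy" row of the table
x2_row = [a * a for a in x]              # the "x²" row
y2_row = [b * b for b in y]              # the "y²" row

numerator = sum(xy_row) - sum(x) * sum(y) / n
denominator = sqrt(
    (sum(x2_row) - sum(x) ** 2 / n) * (sum(y2_row) - sum(y) ** 2 / n)
)
r = numerator / denominator

print(xy_row, x2_row, y2_row)
print(round(r, 4))
```

Comparing the printed rows with the table entry by entry localises any mistake before it propagates into r.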