Correlation
Introduction
Bivariate Distribution: A distribution in which each unit of a series assumes two values
Multivariate Distribution: A distribution in which each unit of a series assumes more than
one value.
Correlation is a statistical tool that studies the relationship between two variables.
The presence of correlation between two variables X and Y simply means that when the
value of one variable is found to change in one direction, the value of the other variable is
found to change either in the same direction (i.e., positive change) or in the opposite
direction (i.e., negative change), but in a definite way.
It is an analysis of the covariance between two variables.
It helps to measure the magnitude and direction of the relationship between two variables.
Types of correlation
o Positive and negative correlation:
When the variables move in the same direction, they are said to be positively correlated.
If they move in opposite directions, they are said to be negatively correlated (e.g., price and demand of a commodity; sale of woollen garments and day temperature).
o Linear and non-linear correlation:
If a unit change in one variable is accompanied by a constant change in the other variable over the entire range of values, the correlation between the variables is linear.
If, corresponding to a unit change in one variable, the other variable changes not at a constant rate but at a fluctuating rate, the correlation is said to be non-linear or curvilinear.
A causal relationship often produces correlation, but the reverse is not true, i.e., correlation does not imply causation. For example, ice-cream sales and sunglasses sales could be positively correlated, but neither causes the other.
Correlation analysis fails to reflect the cause-and-effect relationship between the
variables. It only tells the degree of association.
In a bivariate distribution, if the variables have a cause-and-effect relationship, they are bound to vary in sympathy with each other.
Correlation only implies co-variation.
Reasons for a high degree of correlation
o Mutual dependence
o Both the variables being influenced by the same external factors
o Pure chance
A high value of r is neither necessary nor sufficient for a causal relationship between
X and Y.
Not necessary, because r can be close to 0 even when X and Y have a causal relationship. This is possible if the relationship between X and Y is non-linear, since r measures only straight-line relationships, e.g., $Y = X^2$ with X values symmetric about 0.
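The point about $Y = X^2$ can be checked numerically. The following is a minimal sketch, assuming illustrative X values chosen symmetric about 0 (they are not from the notes); `pearson_r` is a helper name introduced here for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative data (an assumption): X symmetric about 0, Y determined exactly by X.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_nonlinear = pearson_r(xs, ys)
print(r_nonlinear)  # essentially zero, although Y depends entirely on X
```

Here Y is a deterministic function of X, yet r vanishes because the positive and negative deviations of X cancel in the covariance.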
Degrees of Correlation
A scatter diagram helps to have a visual or graphical idea about the nature of association
between two variables.
For example, if two variables, X and Y are plotted along the X-axis and Y-axis respectively in
the x-y plane of a graph sheet, the resultant diagram of dots is known as scatter diagram.
The various possible situations are:
[Figure: scatter diagrams of the possible situations, including panels titled "Positive Non-linear relation" and "Negative Non-linear relation".]
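The covariance between X and Y is

$$\operatorname{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{N} = \frac{\sum xy}{N} = \frac{\sum XY}{N} - \bar{X}\,\bar{Y}$$

where $x = X_i - \bar{X}$ and $y = Y_i - \bar{Y}$ denote deviations from the means. The correlation coefficient r is

$$r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{N}}{\sqrt{\dfrac{\sum (X_i - \bar{X})^2}{N}} \cdot \sqrt{\dfrac{\sum (Y_i - \bar{Y})^2}{N}}} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \, \sum (Y_i - \bar{Y})^2}}$$

$$r = \frac{\dfrac{N \sum XY - \sum X \sum Y}{N}}{\sqrt{\sum X^2 - \dfrac{(\sum X)^2}{N}} \cdot \sqrt{\sum Y^2 - \dfrac{(\sum Y)^2}{N}}}$$

$$r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{\left[N \sum X^2 - (\sum X)^2\right]\left[N \sum Y^2 - (\sum Y)^2\right]}}$$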
Where,
N = number of observations
The sign of Cov(X, Y) gives the sign of r, since the standard deviations are always positive.
For example,
The correlation between rainfall and the yield of the vegetable sown is to be determined from the given data:
Rainfall (mm)   12    9    8   10   11   13    7
Yield (kg)      14    8    6    9   11   12    3
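Solution:
Here N = 7, $\bar{X} = 10$ and $\bar{Y} = 9$, so $\sum (X_i - \bar{X})(Y_i - \bar{Y}) = 46$, $\sum (X_i - \bar{X})^2 = 28$ and $\sum (Y_i - \bar{Y})^2 = 84$.
r = 46 / √(28 × 84) = 0.949, i.e., a high positive correlation between rainfall and plant yield.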
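The arithmetic of this example can be reproduced with a short script. This is a minimal sketch using the raw-sums form of the formula, r = (NΣXY − ΣXΣY) / √([NΣX² − (ΣX)²][NΣY² − (ΣY)²]); the variable names are chosen here for illustration.

```python
import math

rainfall = [12, 9, 8, 10, 11, 13, 7]   # X, in mm (data from the example)
yield_kg = [14, 8, 6, 9, 11, 12, 3]    # Y, in kg (data from the example)
n = len(rainfall)

# Raw sums needed by the formula.
sum_x = sum(rainfall)
sum_y = sum(yield_kg)
sum_xy = sum(x * y for x, y in zip(rainfall, yield_kg))
sum_x2 = sum(x * x for x in rainfall)
sum_y2 = sum(y * y for y in yield_kg)

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(r, 3))  # 0.949
```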
The value of r does not depend on which of the two variables under study is labelled X and which is labelled Y, i.e., it does not depend on which variable is treated as dependent or independent ($r_{xy} = r_{yx}$).
Limit of correlation coefficient: The correlation coefficient ranges between −1 and +1 (−1 ≤ r ≤ 1).
r = 1 if and only if all $(X_i, Y_i)$ pairs lie on a straight line with positive slope, and r = −1 if and only if all $(X_i, Y_i)$ pairs lie on a straight line with negative slope. In both cases, all the points in the scatter are collinear and the correlation is perfect.
If r = 0 the two variables are uncorrelated. There is no linear relation between them.
However, other types of relation may be there.
The correlation coefficient is independent of change of origin and scale, i.e., if X and Y are transformed into new variables U [U = (X − a)/h] and V [V = (Y − b)/k] by changing the origin and scale (with h, k > 0), then the correlation coefficient between X and Y is the same as the correlation coefficient between U and V.
Corollary: If X and Y are random variables and a, b, c, d are any numbers, provided only that a ≠ 0 and c ≠ 0, then r(aX + b, cY + d) = [ac / (|a||c|)] r(X, Y). In other words, the magnitude of r is unchanged, but its sign is affected by the signs of a and c: if a and c have different signs, the sign of r changes.
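Both properties can be checked numerically. The sketch below reuses the rainfall and yield figures from the worked example; the constants a, h, b, k and the linear transforms are arbitrary illustrative choices, and `pearson_r` is a helper name introduced here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [12, 9, 8, 10, 11, 13, 7]
ys = [14, 8, 6, 9, 11, 12, 3]
r_xy = pearson_r(xs, ys)

# Change of origin and scale with positive h and k: U = (X - a)/h, V = (Y - b)/k.
a, h, b, k = 10, 2, 9, 3   # arbitrary illustrative constants
us = [(x - a) / h for x in xs]
vs = [(y - b) / k for y in ys]
r_uv = pearson_r(us, vs)

# Scale factors of opposite sign (here -2 and +3) flip the sign of r.
r_flipped = pearson_r([-2 * x + 5 for x in xs], [3 * y - 1 for y in ys])

print(r_xy, r_uv, r_flipped)
```

The first two printed values agree up to rounding error, and the third has the same magnitude with the opposite sign.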
Two independent variables are uncorrelated, but the converse is not true. A zero coefficient of correlation only implies the absence of a "linear" relationship between them.
Interpretation of r
$$\text{S.E.}(r) = \frac{1 - r^2}{\sqrt{n}}$$
Probable error: P.E.(r) = 0.6745 × S.E.(r).
The factor 0.6745 is used because, in a normal distribution, 50% of the observations lie in the range µ ± 0.6745σ, where µ is the mean and σ is the standard deviation.
Use of Probable Error
o To determine the limits within which the population correlation coefficient may be
expected to lie [Limits for population correlation coefficient are r ± P.E (r) ].
o To test whether an observed value of the sample correlation coefficient indicates any correlation in the population:
If r < P.E.(r), the correlation is not significant.
If r > 6 P.E.(r), the correlation is definitely significant.
In other cases, the significance of r cannot be decided.
P.E. can be used only if the data are drawn from a normal population and the sample is drawn using random sampling.
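The probable-error rules can be sketched in code. This is a minimal illustration, assuming the usual definition P.E.(r) = 0.6745 × (1 − r²)/√n and the conventional threshold of 6 × P.E.(r); the function names, the use of |r| so that negative values are handled, and the sample values r = 0.8, n = 100 are all assumptions made here, not taken from the notes.

```python
import math

def standard_error(r, n):
    """S.E.(r) = (1 - r^2) / sqrt(n)."""
    return (1 - r ** 2) / math.sqrt(n)

def probable_error(r, n):
    """P.E.(r) = 0.6745 * S.E.(r)."""
    return 0.6745 * standard_error(r, n)

def classify(r, n):
    """Apply the rule of thumb: below P.E. -> not significant, above 6 P.E. -> significant."""
    pe = probable_error(r, n)
    if abs(r) < pe:
        return "not significant"
    if abs(r) > 6 * pe:
        return "definitely significant"
    return "inconclusive"

# Hypothetical sample values, for illustration only.
r, n = 0.8, 100
pe = probable_error(r, n)
lower, upper = r - pe, r + pe   # limits for the population correlation coefficient

print(round(pe, 4), round(lower, 4), round(upper, 4), classify(r, n))
```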
For small sample sizes, P.E. may lead to fallacious conclusions. In that case, a rigorous test of the significance of an observed sample correlation coefficient is provided by Student's t-test.
Student's t-test: The test statistic is given by
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
Under the null hypothesis ρ = 0, this t follows Student's t distribution with (n − 2) degrees of freedom.
Note –
The symbol for the population correlation coefficient is ρ, the Greek letter "rho."
ρ = population correlation coefficient (unknown)
r = sample correlation coefficient (known; calculated from sample data)
For example,
The correlation coefficient between infant mortality rate and mother's years of schooling is −0.12, based on a sample of 12 towns. Can we conclude that there is a negative correlation between the two variables?
Solution:
X = infant mortality rate (deaths per 1000 births)
Y = mother's schooling (years)
r = −0.12, n = 12
H0: ρ = 0
Ha: ρ < 0
Test Statistic
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.12\sqrt{12-2}}{\sqrt{1-(-0.12)^2}} = -0.382$$
At n − 2 = 12 − 2 = 10 degrees of freedom and the 5% level of significance, the one-tailed critical t value is 1.812.
Since −0.382 is not less than −1.812, we cannot reject the null hypothesis; the test statistic is not significant. We cannot conclude that there is a negative correlation between the two variables.
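The test in this example can be reproduced as follows. This is a minimal sketch using only the standard library; the critical value 1.812 is taken from the text rather than computed, and the function name is introduced here for illustration.

```python
import math

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = -0.12, 12
t = t_statistic(r, n)

# One-tailed 5% critical value at 10 degrees of freedom, as quoted in the text.
t_critical = 1.812

# Left-tailed test (Ha: rho < 0): reject H0 only if t falls below -t_critical.
reject_h0 = t < -t_critical

print(round(t, 3), reject_h0)  # -0.382 False
```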
For a fairly large bivariate distribution, the data may be summarized in the form of a two-way frequency table.
For each variable, the values are grouped into different classes.
If there are m classes for the X variable and n classes for the Y variable, then there will be m × n cells in the two-way frequency table.
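The formula for calculating r for a bivariate frequency table is given by
$$r_{xy} = r_{uv} = \frac{N \sum fuv - \sum fu \sum fv}{\sqrt{\left[N \sum fu^2 - (\sum fu)^2\right]\left[N \sum fv^2 - (\sum fv)^2\right]}}$$
where u and v are the step deviations of the class mid-values of X and Y, f is the corresponding frequency, and N is the total frequency.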
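As a rough illustration of the grouped calculation, the sketch below weights each cell's class mid-values by its frequency and applies the raw-sums form of the Pearson formula. The table, the mid-values and the step-deviation constants are hypothetical, and `grouped_r` is a name introduced here; the frequency-weighted formula is the standard one and is assumed rather than quoted from the notes.

```python
import math

# Hypothetical two-way frequency table (illustrative only, not from the notes).
x_mids = [5, 15, 25]   # class mid-values of X (m = 3 classes)
y_mids = [10, 30]      # class mid-values of Y (n = 2 classes)
freq = [               # freq[i][j] = frequency of the cell (x_mids[i], y_mids[j])
    [4, 1],
    [3, 3],
    [1, 5],
]

def grouped_r(x_vals, y_vals, table):
    """Pearson r from a two-way frequency table, weighting each cell by its frequency."""
    cells = [(x, y, table[i][j])
             for i, x in enumerate(x_vals)
             for j, y in enumerate(y_vals)]
    n = sum(f for _, _, f in cells)              # total frequency N
    s_x = sum(f * x for x, _, f in cells)
    s_y = sum(f * y for _, y, f in cells)
    s_xy = sum(f * x * y for x, y, f in cells)
    s_x2 = sum(f * x * x for x, _, f in cells)
    s_y2 = sum(f * y * y for _, y, f in cells)
    num = n * s_xy - s_x * s_y
    den = math.sqrt((n * s_x2 - s_x ** 2) * (n * s_y2 - s_y ** 2))
    return num / den

r_grouped = grouped_r(x_mids, y_mids, freq)

# Step deviations u = (x - a)/h and v = (y - b)/k leave r unchanged.
u_mids = [(x - 15) / 10 for x in x_mids]
v_mids = [(y - 10) / 20 for y in y_mids]
r_step = grouped_r(u_mids, v_mids, freq)

print(r_grouped, r_step)
```

Working with step deviations only simplifies the hand arithmetic; the two results agree up to rounding error.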
Where $x_i$ is the rank of the ith individual in character A, $y_i$ is the rank of the ith individual in character B, and n is the number of pairs. (Both series are ranked separately; the largest value gets the first rank, and so on.)
If there is a tie, give each tied value the average of the ranks they would otherwise have occupied and use the following formula:
The occurrence of ties causes no problem in the calculation of the Spearman correlation
coefficient when the Pearson formula is used with the ranks.
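The remark about ties can be illustrated by ranking each series with average ranks and then applying the Pearson formula to the ranks. This is a minimal sketch; the scores are hypothetical and the function names are introduced here for illustration.

```python
import math

def average_ranks(values):
    """Rank values so the largest gets rank 1; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + 1 + end + 1) / 2   # mean of ranks pos+1 .. end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman(xs, ys):
    """Spearman coefficient as the Pearson r of the (average) ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical scores with ties in both series.
a_scores = [50, 40, 40, 30, 20]
b_scores = [9, 8, 7, 7, 5]

print(average_ranks(a_scores))  # [1.0, 2.5, 2.5, 4.0, 5.0]
print(spearman(a_scores, b_scores))
```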
Where c is the number of pairs of concurrent deviations and m is the number of pairs of deviations. Also, m is one less than the number of pairs of observations.
The quantity inside the square root must be positive; otherwise r would be imaginary, which is not possible.
Thus, if (2c − m) is positive, we take the + sign both inside and outside the square root, and if (2c − m) is negative, we take the − sign both inside and outside the square root.
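The sign rule described above can be sketched in code, assuming the coefficient has the form r = ±√(±(2c − m)/m), which is what the description of c, m and the signs suggests. The function name is introduced here, and treating a "no change" step as non-concurrent is an assumption of this sketch; the first data set is the rainfall and yield example from these notes.

```python
import math

def concurrent_deviation_r(xs, ys):
    """Coefficient of concurrent deviations, r = +/- sqrt(+/- (2c - m) / m)."""
    def directions(values):
        # +1 for a rise, -1 for a fall, 0 for no change between successive values.
        return [(b > a) - (b < a) for a, b in zip(values, values[1:])]

    dx, dy = directions(xs), directions(ys)
    m = len(dx)                                        # number of pairs of deviations
    c = sum(1 for sx, sy in zip(dx, dy) if sx * sy > 0)  # concurrent pairs (same direction)
    ratio = (2 * c - m) / m
    sign = 1 if ratio >= 0 else -1                     # same sign inside and outside the root
    return sign * math.sqrt(sign * ratio)

rainfall = [12, 9, 8, 10, 11, 13, 7]
yield_kg = [14, 8, 6, 9, 11, 12, 3]
r_c = concurrent_deviation_r(rainfall, yield_kg)

print(r_c)  # 1.0: every successive change in yield has the same direction as in rainfall
```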
Coefficient of determination
It gives the percentage of variation in the dependent variable that is accounted for by the independent variable.
o Example: If $r^2$ is 0.72, it implies that, on the basis of the sample, 72% of the variation in one variable is accounted for by the variation of the other variable.
It gives the ratio of the explained variance to the total variance.
It is given by the square of the correlation coefficient.
It is always non-negative and does not tell us about the direction of relationship (+ve or -ve)
between the two series.
$$\text{Coefficient of determination} = r^2 = \frac{\text{Explained variance}}{\text{Total variance}}$$
Coefficient of non-determination ($K^2$): It is the ratio of unexplained variation to the total variation.
$$K^2 = 1 - r^2 = 1 - \frac{\text{Explained variance}}{\text{Total variance}}$$
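The ratio interpretation can be checked numerically. The sketch below assumes that the "explained" variation is the part captured by a least-squares straight line of Y on X (the notes do not define it, so this is an assumption), and reuses the rainfall and yield data from the earlier example.

```python
rainfall = [12, 9, 8, 10, 11, 13, 7]   # X
yield_kg = [14, 8, 6, 9, 11, 12, 3]    # Y
n = len(rainfall)

mean_x = sum(rainfall) / n
mean_y = sum(yield_kg) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rainfall, yield_kg))
sxx = sum((x - mean_x) ** 2 for x in rainfall)
syy = sum((y - mean_y) ** 2 for y in yield_kg)

# Square of the correlation coefficient.
r_squared = sxy ** 2 / (sxx * syy)

# Least-squares line of Y on X (assumed source of the "explained" variation).
slope = sxy / sxx
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * x for x in rainfall]

explained = sum((f - mean_y) ** 2 for f in fitted)   # variation explained by the line
total = syy                                          # total variation of Y
ratio = explained / total

k_squared = 1 - r_squared                            # coefficient of non-determination

print(round(r_squared, 2), round(ratio, 2), round(k_squared, 2))  # 0.9 0.9 0.1
```

Because both numerator and denominator are divided by the same N, the ratio of variations equals the ratio of variances.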