
Study Notes

Correlation

Introduction

 Bivariate Distribution: A distribution in which each unit of a series assumes two values.
 Multivariate Distribution: A distribution in which each unit of a series assumes more than
one value.
 Correlation is a statistical tool that studies the relationship between two variables.
 The presence of correlation between two variables X and Y simply means that when the
value of one variable is found to change in one direction, the value of the other variable is
found to change either in the same direction (i.e., positive change) or in the opposite
direction (i.e., negative change), but in a definite way.
 It is an analysis of the covariance between two variables.
 It helps to measure the magnitude and direction of relationship between two variables.
 Types of correlation
o Positive and negative correlation:
 When the variables move in the same direction, these variables are said to
be correlated positively and
 if they move in the opposite direction, they are said to be negatively
correlated (e.g., price and demand of a commodity, or sale of woollen
garments and day temperature)
o Linear and non-linear correlation:
 If a unit change in one variable produces a constant change in the other
variable over the entire range of values, the correlation between the
variables is linear
 If corresponding to a unit change in one variable, the other variable does not
change at a constant rate but at a fluctuating rate then the correlation is said
to be non-linear or curvilinear

Correlation and Causation

 Causation implies correlation, but the reverse is not true, i.e., correlation does not imply
causation. E.g., ice-cream sales and sunglasses sales could be positively correlated,
but there is no causation between them.
 Correlation analysis fails to reflect the cause-and-effect relationship between the
variables. It only tells the degree of association.
 In a bivariate distribution, if the variables have a cause-and-effect relationship, they are
bound to vary in sympathy with each other.
 Correlation only implies co-variation.
 Reasons of high degree of correlation
o Mutual dependence
o Both the variables being influenced by the same external factors
o Pure chance
 A high value of r is neither necessary nor sufficient for a causal relationship between
X and Y.


Not necessary, because r can be close to 0 even when X and Y have a causal relationship. This
is possible if the relationship between X and Y is non-linear, since r only measures straight-line
relationships, e.g., Y = X².

Not sufficient, because a high r may be due to spurious correlation:


 Chance correlation, e.g., increase in hippopotamus population and steel
production.
 X and Y may be affected by a third variable ("common response variable",
"confounding factor", "lurking variable") without being related to each other. E.g.,
ice-cream sales and sunglasses sales could be positively correlated, and a rise in
temperature may be the cause of such correlation.

Degrees of Correlation

Methods of studying correlation

1. Scatter diagram method


2. Karl Pearson’s coefficient of correlation (Covariance method)
3. Two-way frequency table (Bivariate correlation method)
4. Spearman Rank Order Correlation Method
5. Concurrent deviation method

Scatter Diagram Method

 A scatter diagram helps to form a visual or graphical idea of the nature of the association
between two variables.
 For example, if two variables X and Y are plotted along the X-axis and Y-axis respectively in
the x-y plane of a graph sheet, the resultant diagram of dots is known as a scatter diagram.
 The various possible situations are:

[Scatter diagrams omitted in source, illustrating: Perfect Positive Correlation, High Positive
Correlation, Low Positive Correlation, No Correlation, Non-linear/Curvilinear Correlation,
Low Negative Correlation, High Negative Correlation, Perfect Negative Correlation,
Positive Non-linear relation, Negative Non-linear relation]

Karl Pearson’s Coefficient of Correlation

 It is a mathematical method for measuring the intensity or magnitude of linear


relationship between two variables
 Suggested by Karl Pearson, a British biometrician and statistician
 Karl Pearson's correlation coefficient is also called the product moment correlation
coefficient.
 Karl Pearson's measure, known as the Pearsonian correlation coefficient between two
variable series X and Y, is denoted by 'r(X, Y)' or 'rxy' or 'r'
 It is a numerical measure of the linear relationship between two variables. When there is a
non-linear relation between X and Y, calculating Karl Pearson's coefficient of
correlation can be misleading.
 It can be defined as the ratio of the covariance between X and Y to the product of the
standard deviations of X and Y, i.e.,

r(X, Y) = Cov(X, Y) / (σX · σY)

Covariance between X and Y is defined as

Cov(X, Y) = Σ(Xi − X̄)(Yi − Ȳ) / N = ΣXY/N − X̄·Ȳ

where N = number of observations



 Alternatively, r is given as:

r = Σ(Xi − X̄)(Yi − Ȳ) / √[ Σ(Xi − X̄)² · Σ(Yi − Ȳ)² ]

or, in the computational (shortcut) form,

r = [ N·ΣXY − ΣX·ΣY ] / √[ N·ΣX² − (ΣX)² ] · √[ N·ΣY² − (ΣY)² ]

where N = number of observations.

 The sign of Cov(X, Y) gives the sign of r, as the standard deviations are always positive.

For example,

The correlation has to be determined between the rainfall and the yield of the vegetable
sown from the given data:
Rainfall (mm) 12 9 8 10 11 13 7
Yield (Kg) 14 8 6 9 11 12 3

Solution:

Rainfall (X)   Yield (Y)   XY     X²     Y²
12             14          168    144    196
9              8           72     81     64
8              6           48     64     36
10             9           90     100    81
11             11          121    121    121
13             12          156    169    144
7              3           21     49     9
Total: 70      63          676    728    651

r = (7 × 676 − 70 × 63) / √[(7 × 728 − 70²)(7 × 651 − 63²)]
  = 322 / √(196 × 588)
  ≈ 0.949, i.e., a high positive correlation between rainfall and plant yield.
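As a check, the computation above can be sketched in Python using the shortcut formula (the function and variable names here are illustrative, not from the source):

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's r via the shortcut (computational) formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    # r = [N·ΣXY − ΣX·ΣY] / √([N·ΣX² − (ΣX)²][N·ΣY² − (ΣY)²])
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

rainfall = [12, 9, 8, 10, 11, 13, 7]   # X values from the table above
yield_kg = [14, 8, 6, 9, 11, 12, 3]    # Y values from the table above
print(round(pearson_r(rainfall, yield_kg), 3))  # 0.949
```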

Properties of correlation coefficient

 The value of r is independent of the units in which X and Y are measured.


 A negative value of r indicates an inverse relation. A change in one variable is
associated with change in the other variable in the opposite direction.
 If r is positive the two variables move in the same direction.


 The value of r does not depend on which of the two variables under study is labelled X
and which is labelled Y, i.e.; it does not depend upon which variable is dependent /
independent {rxy = ryx}.
 Limit of correlation coefficient: The correlation coefficient value ranges between –1
and +1[ -1 ≤ r ≤1].

 r = 1 if and only if all ( X i , Y i) pairs lie on a straight line with positive slope and r = -1 if
and only if all ( X i , Y i ) pairs lie on a straight line with negative slope. In other words, all
the points in the scatter are collinear and the correlation is perfect.
 If r = 0 the two variables are uncorrelated. There is no linear relation between them.
However, other types of relation may be there.
 The correlation coefficient is independent of change of origin and scale, i.e., if X and Y
are transformed into new variables U [U = (X − a)/h] and V [V = (Y − b)/k] by changing the
origin and scale, then the correlation coefficient between X and Y is the same as the
correlation coefficient between U and V.
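A quick numerical sketch of this invariance property; the constants a = 10, h = 2, b = 9, k = 3 are arbitrary choices for illustration:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r from the definition: Cov(X, Y) / (sd(X) · sd(Y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sdx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sdy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sdx * sdy)

X = [12, 9, 8, 10, 11, 13, 7]
Y = [14, 8, 6, 9, 11, 12, 3]
U = [(x - 10) / 2 for x in X]   # change of origin a = 10, scale h = 2
V = [(y - 9) / 3 for y in Y]    # change of origin b = 9, scale k = 3
print(abs(pearson_r(X, Y) - pearson_r(U, V)) < 1e-9)  # True: r is unchanged
```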


Corollary: If X and Y are random variables and a, b, c, d are any numbers provided only that a ≠
0, c ≠ 0, then r(aX + b, cY + d) = [ac / |a||c|] · r(X, Y). In other words, r is affected only by a
change of sign: if a and c have different signs, the sign of r changes.


 Two independent variables are uncorrelated, but the reverse is not true. A zero
coefficient of correlation only implies the absence of a "linear" relationship between them.


 If variables X and Y are connected by the linear equation aX + bY + c = 0, then the
correlation between X and Y is +1 if the signs of a and b are opposite and −1 if the
signs of a and b are alike.
 The square of the sample correlation coefficient is equal to the coefficient of
determination resulting from fitting the simple regression model.
 r measures only linear relationships, e.g., for Y = X² over a range of X symmetric about 0, r = 0.
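This last point can be verified directly: for Y = X² on a symmetric range of X values, the covariance, and hence r, is exactly 0 (a small sketch with invented data):

```python
# r measures only linear association: a perfect non-linear relation can give r = 0
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]          # Y = X²
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
print(cov)  # 0.0 — so r = 0 despite a perfect (non-linear) relationship
```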

Assumptions underlying Karl Pearson’s Correlation Coefficient

 Each variable is affected by a large number of independent contributory causes of such a
nature as to produce a normal distribution
 The variables X and Y under study are linearly related.
 The forces operating on each of the variable series are not independent of each other
but are related in a causal fashion.

Interpretation of r

Correlation Coefficient, r Relationship between variables


r=1 Perfect positive correlation
r>0 Positive correlation
r=0 No correlation
r<0 Negative correlation
r = -1 Perfect negative correlation

 The reliability of the significance of the value of the correlation coefficient depends upon a
number of factors.
 One way of testing the significance of r is by finding the probable error,
which, in addition to the value of r, takes into account the size of the sample.
 Another, more useful measure for interpreting the value of r is the coefficient of
determination. It shows that the closeness of the relationship between two
variables, as determined by the correlation coefficient r, is not proportional to r.

Testing the Significance of r

 A hypothesis test of the "significance of the correlation coefficient" is performed to


decide whether the linear relationship in the sample data is strong enough to represent
the relationship in the population.
 Probable Error: The probable error of the correlation coefficient is an amount which, if
added to and subtracted from the mean correlation coefficient, produces limits within
which the chances are even that a coefficient of correlation from a series selected at
random will fall.

P.E.(r) = 0.6745 × S.E.(r)

S.E.(r) = (1 − r²) / √n


 The reason for taking 0.6745 is that in a normal distribution 50% of the observations lie in
the range µ ± 0.6745σ, where µ is the mean and σ is the standard deviation
 Use of Probable Error
o To determine the limits within which the population correlation coefficient may be
expected to lie [Limits for population correlation coefficient are r ± P.E (r) ].
o To test if an observed value of the sample correlation coefficient indicates any
correlation in the population
 If |r| < P.E.(r), the correlation is not significant
 If |r| > 6 × P.E.(r), the correlation is definitely significant
 In other cases, the significance of r cannot be determined
 P.E. can be used only if the data is drawn from a normal population
 The sample must be drawn using random sampling
 For small sample sizes, P.E. may lead to fallacious conclusions. In that case, a rigorous
test of the significance of an observed sample correlation coefficient is provided
by Student's t-test
 Student's t-test: The test statistic is given by

t = r √(n − 2) / √(1 − r²)

 This t is distributed as Student's t distribution with (n − 2) degrees of freedom.

Note –
The symbol for the population correlation coefficient is ρ, the Greek letter "rho."
ρ = population correlation coefficient (unknown)
r = sample correlation coefficient (known; calculated from sample data)

For example,

The correlation coefficient between infant mortality rate and mother’s year of schooling is -0.12
based on a sample of 12 towns. Can we conclude that there is a negative correlation between
the two variables?

Solution:

X = infant mortality rate (deaths/1000 births), Y = mother's years of schooling

r = −0.12, n = 12

H0: ρ = 0

Ha: ρ < 0


Test Statistic

t = r √(n − 2) / √(1 − r²) = −0.12 × √(12 − 2) / √(1 − (−0.12)²) ≈ −0.382

At n − 2 = 12 − 2 = 10 degrees of freedom and a 5% level of significance, the critical t value is 1.812.
Since −0.382 is not less than −1.812, we cannot reject the null hypothesis; the test statistic is
insignificant. We cannot conclude that there is a negative correlation between the two variables.
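A minimal sketch of this test in Python, using the figures from the example; the critical value 1.812 is read from a t-table, not computed here:

```python
from math import sqrt

r, n = -0.12, 12
# t = r·√(n − 2) / √(1 − r²), distributed as Student's t with n − 2 d.f.
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
print(round(t, 3))  # -0.382

# One-tailed test, Ha: rho < 0 — reject H0 only if t falls below -1.812
critical = -1.812   # 5% one-tailed critical value at 10 d.f. (from a t-table)
print(t < critical)  # False: we cannot reject H0
```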

Two-way frequency table

 For a fairly large bivariate distribution, the data may be summarized in form of a two-way
frequency table.
 For each variable the values are grouped into different classes.
 If there are m classes for X variable and n classes for Y variable then there will be m*n cells
in that two-way frequency table.
 The formula for calculating r for a bivariate frequency table is given by

r_xy = r_uv = [ N·Σfuv − (Σfu)(Σfv) ] / √[ N·Σfu² − (Σfu)² ] · √[ N·Σfv² − (Σfv)² ]

 where u = (x − a)/h and v = (y − b)/k, f denotes the cell frequency,
 and h and k are the widths of the x classes and y classes respectively, and a and b are constants.

Spearman’s Rank Order Correlation Method

 It was developed by the British psychologist C.E. Spearman.


 It is used when the variables under consideration are arranged in a serial order.
 Useful while dealing with qualitative characters.
 Non-parametric version of the Pearson product-moment correlation (Pearson correlation
coefficient).
 Spearman’s correlation is equivalent to calculating the Pearson correlation coefficient on the
ranked data.
 Measures strength and direction of monotonic relationship (in a monotonic relationship,
one variable increases, the other tends to either increase or decrease (not both) but not
necessarily at a constant rate) between two variables.


 One can run Spearman's rank correlation on a non-monotonic relationship to determine if
there is a monotonic component to the association.
 Spearman’s rank correlation coefficient can be used in some cases where there is a relation
whose direction is clear but which is nonlinear.
 Spearman’s correlation coefficient is not affected by extreme values. In this respect, it is
better than Karl Pearson’s correlation coefficient. Thus, if the data contains some extreme
values, Spearman’s correlation coefficient can be very useful.
 Assumption: Need two variables that are either ordinal, interval, ratio or continuous.
 It is a distribution free measure
 Its value lies between –1 and +1.
 Whether the order in which employees complete a test exercise is related to the number of
months they have been employed, or the correlation between a person's IQ and the number
of hours spent in front of the TV per week, are some example use cases.
 Example: To find out relationship between two variables, A say Intelligence and B say
Beauty, first we have to arrange a group of individuals in order of merit with respect to
proficiency in these two characteristics. Let X and Y denote the rank in the A and B
characteristics respectively. Considering no ties, the correlation between X and Y (known
as Spearman's rank correlation) is given by

r_s = 1 − [ 6 Σdᵢ² ] / [ n(n² − 1) ], where dᵢ = xᵢ − yᵢ

 where xᵢ is the rank of the ith individual in character A, yᵢ is the rank of the ith individual in
character B, and n is the number of pairs. (Both series are ranked separately; the largest value
gets the first rank, and so on.)
 If there is a tie, take the average of the ranks the tied values would otherwise have occupied
and use the following formula:

r_s = 1 − 6 [ Σdᵢ² + Σ(m³ − m)/12 ] / [ n(n² − 1) ]

where m is the number of items tied at a given rank (the correction term is summed over all
tied groups).


 The occurrence of ties causes no problem in the calculation of the Spearman correlation
coefficient when the Pearson formula is used directly with the ranks (each rank treated as a
paired score).


 The fundamental difference between the Pearson and Spearman correlation coefficients is
that the Pearson coefficient works with a linear relationship between the two variables
whereas the Spearman Coefficient works with monotonic relationships as well.
 One more difference is that Pearson works with raw data values of the variables whereas
Spearman works with rank-ordered variables.
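A short sketch illustrating both points: computing r_s by the rank-difference formula on tie-free data, and confirming it equals Pearson's r applied to the ranks. All data and function names here are invented for illustration:

```python
from math import sqrt

def ranks(values):
    """Rank 1 = largest value, as in the notes; ties get the average rank."""
    order = sorted(values, reverse=True)
    return [sum(i + 1 for i, v in enumerate(order) if v == x) / order.count(x)
            for x in values]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

A = [86, 71, 77, 68, 91]          # e.g. intelligence scores (invented)
B = [88, 65, 80, 70, 95]          # e.g. beauty ratings (invented)
rx, ry = ranks(A), ranks(B)
n = len(A)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rs = 1 - 6 * d2 / (n * (n * n - 1))      # rank-difference formula
print(rs)                                 # 0.9
print(abs(rs - pearson_r(rx, ry)) < 1e-9) # True: Pearson on ranks agrees
```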

Method of concurrent deviation

 A very casual method of determining correlation


 Used when precision is not required
 It is based on the principle that if the short-term fluctuations of two time series are
positively correlated, i.e., if their deviations are concurrent, their curves will move in the
same direction, indicating a positive relation between them
 Based on signs of deviations of the values of variables from its preceding value
o We put a + sign if value of variable is greater than preceding value
o We put a - sign if value of variable is less than preceding value
o We put a = sign if value of variable is same as preceding value
 The deviations are said to be concurrent if they have the same sign, i.e., both deviations are
positive, both are negative, or both are equal (=).
 The formula for calculating the correlation coefficient using this method is

r_c = ± √[ ± (2c − m) / m ]

 where c is the number of pairs of concurrent deviations and m is the number of pairs of
deviations; m is one less than the number of pairs of observations.
 The quantity inside the square root must be positive otherwise r will be imaginary which is
not possible.
 Thus, if (2c-m) is positive we take + sign in and outside the square root and if (2c-m) is
negative we take - sign in and outside the square root.
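The sign-counting procedure can be sketched as follows (the two series are toy data invented for illustration):

```python
from math import sqrt

def concurrent_deviation_r(x, y):
    """Concurrent-deviation coefficient r_c = ±√[±(2c − m)/m]."""
    # Sign of change from the preceding value: +1, -1, or 0 (no change)
    dx = [(b > a) - (b < a) for a, b in zip(x, x[1:])]
    dy = [(b > a) - (b < a) for a, b in zip(y, y[1:])]
    m = len(dx)                                   # one less than no. of pairs
    c = sum(1 for a, b in zip(dx, dy) if a == b)  # concurrent deviations
    inner = (2 * c - m) / m
    # Take the sign of (2c − m) both inside and outside the square root
    sign = 1 if inner >= 0 else -1
    return sign * sqrt(sign * inner)

x = [60, 62, 65, 63, 70, 72, 71]
y = [50, 53, 55, 54, 58, 61, 60]
print(concurrent_deviation_r(x, y))  # 1.0 — every deviation is concurrent
```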


Coefficient of determination

 It gives the percentage of variation in the dependent variable that is accounted for by the
independent variable.
o Example: If r² is 0.72, it implies that, on the basis of the sample, 72% of the variation
in one variable is accounted for by the variation in the other variable.
 It gives the ratio of the explained variance to the total variance.
 It is given by the square of the correlation coefficient.
 It is always non-negative and does not tell us about the direction of relationship (+ve or -ve)
between the two series.

Coefficient of determination = r² = Explained variance / Total variance

Coefficient of non-determination (K²): It is the ratio of unexplained variation to the total
variation

K² = 1 − r² = Unexplained variance / Total variance

Coefficient of Alienation (K): It is given by the square root of the coefficient of non-determination

K = ± √(1 − r²)
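For instance, with r ≈ 0.949 from the rainfall/yield example earlier, the three coefficients work out as follows (a small illustrative sketch):

```python
from math import sqrt

r = 0.949                      # from the rainfall/yield example
r2 = r ** 2                    # coefficient of determination
k2 = 1 - r2                    # coefficient of non-determination
k = sqrt(k2)                   # coefficient of alienation (taking the + root)
print(round(r2, 3), round(k2, 3), round(k, 3))
# about 90% of the variation in yield is accounted for by rainfall
```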

