Chapter 2 - Correlation and Regression
Correlation
If an increase in one variable is accompanied by an increase in the other variable, then the
variables are said to be positively correlated.
If an increase in one variable is accompanied by a decrease in the other variable, then the
variables are said to be negatively correlated.
Methods of studying correlation:
(i) Scatter diagram method
(ii) Karl Pearson’s correlation coefficient
(iii) Spearman’s rank correlation coefficient
In the scatter diagram method, the given data are plotted on a graph as dots, i.e., for each pair
of X and Y values we plot one point. Looking at the scatter of the points, we form an idea as to
whether the two variables are related or not. The more widely the plotted points scatter over
the chart, the weaker the relationship between the two variables. The nearer the points lie to a
line, the stronger the relationship. If the points lie in a haphazard manner, there is no apparent
relationship between the variables.
[Figure: scatter diagrams showing positive correlation, negative correlation, and no correlation]
Positive Correlation
• As the number of trees cut down increases, the probability of erosion increases.
• In archaeology, a more stable landform means more site visibility.
• As the temperature decreases, the speed at which molecules move decreases.
• As the speed of a wind turbine increases, the amount of electricity that is generated
increases.
• As the amount of moisture increases in an environment, the growth of mold spores
increases.
• As the amount of algae in a lake increases, the population of a certain species of algae-eating fish increases.
• As the percentage of salt in salty water increases, buoyancy increases.
• As you eat more antioxidants, your immune system improves.
Negative Correlation
• COX activity decreases with larger body sizes
No correlation
• Between height and IQ
[Figure: levels of correlation]
[Figure: example of negatively correlated variables]
Correlation is a statistical measure (expressed as a number) that describes the size and direction of
a relationship between two or more variables. A correlation between variables, however, does not
automatically mean that the change in one variable is the cause of the change in the values of the
other variable.
Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a
causal relationship between the two events. This is also referred to as cause and effect.
Theoretically, the difference between the two types of relationship is easy to identify: an
action or occurrence can cause another (e.g. smoking causes an increase in the risk of developing
lung cancer), or it can correlate with another (e.g. smoking is correlated with alcoholism, but it
does not cause alcoholism). In practice, however, it is much harder to establish cause and effect
than to establish correlation.
Karl Pearson’s correlation coefficient
This gives us a measure of correlation which indicates the degree of correlation in quantitative
terms. It is defined as
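$$ r(x, y) = r_{xy} = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y} = \frac{\frac{1}{n}\sum (x - \bar{x})(y - \bar{y})}{\sigma_x \sigma_y} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n \, \sigma_x \sigma_y} $$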
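Note:
(i) $\frac{1}{n}\sum (x - \bar{x})(y - \bar{y})$ is called the covariance between X and Y, written Cov(X, Y). In terms of raw sums, the coefficient can be computed as
$$ r_{xy} = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} $$
(ii) The correlation coefficient is independent of change of scale and origin of the variables X
and Y, i.e., if $U = \frac{X - a}{h}$, $V = \frac{Y - b}{k}$, where a, b, h, k are constants with h > 0, k > 0, then
r(X, Y) = r(U, V)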
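The definition and the invariance property in note (ii) can be checked numerically. The following is a minimal sketch, not part of the original notes: the function name `pearson_r`, the sample data, and the constants a, h, b, k are all hypothetical and chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed as Cov(X, Y) / (sigma_x * sigma_y), dividing by n throughout."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)

# Hypothetical data, for illustration only.
xs = [10, 20, 30, 40, 50]
ys = [12, 25, 29, 48, 51]

# Change of origin and scale: U = (X - a) / h, V = (Y - b) / k with h, k > 0.
a, h = 30, 10
b, k = 25, 5
us = [(x - a) / h for x in xs]
vs = [(y - b) / k for y in ys]

r_xy = pearson_r(xs, ys)
r_uv = pearson_r(us, vs)
print(round(r_xy, 4), round(r_uv, 4))  # the two printed values agree
```

Shifting and rescaling the data in this way is often used to make hand calculation easier, since r is unchanged.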
Examples
1. Find Karl Pearson’s correlation coefficient for the following heights (in inches) of fathers
(x) and their sons (y):
X: 65 66 67 67 68 69 70 72
Y: 67 68 65 68 72 72 69 71
Ans:
          x      y      x^2      y^2       xy
         65     67     4225     4489     4355
         66     68     4356     4624     4488
         67     65     4489     4225     4355
         67     68     4489     4624     4556
         68     72     4624     5184     4896
         69     72     4761     5184     4968
         70     69     4900     4761     4830
         72     71     5184     5041     5112
Total   544    552    37028    38132    37560
n = 8, $\sum x = 544$, $\sum y = 552$, $\sum x^2 = 37028$, $\sum y^2 = 38132$, $\sum xy = 37560$
$$ r_{xy} = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} = \frac{(8 \times 37560) - (544 \times 552)}{\sqrt{(8 \times 37028) - (544)^2}\,\sqrt{(8 \times 38132) - (552)^2}} = 0.603 $$
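The arithmetic of this example can be reproduced with a short script. This is only a sketch of the raw-sums formula used above; the function name `pearson_r_from_sums` is an illustrative choice, not something defined in the notes.

```python
import math

def pearson_r_from_sums(xs, ys):
    """Pearson's r from n, sum(x), sum(y), sum(x^2), sum(y^2) and sum(xy)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Heights of fathers (x) and sons (y) from Example 1.
fathers = [65, 66, 67, 67, 68, 69, 70, 72]
sons = [67, 68, 65, 68, 72, 72, 69, 71]

r = pearson_r_from_sums(fathers, sons)
print(round(r, 3))  # 0.603
```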
Ans:
          x      y      x^2      y^2       xy
         25     18      625      324      450
         30     20      900      400      600
         28     21      784      441      588
         29     16      841      256      464
         32     14     1024      196      448
         24     13      576      169      312
         36     22     1296      484      792
         28     15      784      225      420
         27     19      729      361      513
         21     12      441      144      252
n = 10, $\sum x = 280$, $\sum y = 170$, $\sum x^2 = 8000$, $\sum y^2 = 3000$, $\sum xy = 4839$
$$ r_{xy} = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} = \frac{(10 \times 4839) - (280 \times 170)}{\sqrt{(10 \times 8000) - (280)^2}\,\sqrt{(10 \times 3000) - (170)^2}} = 0.5955 $$
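As a cross-check, the same coefficient can be obtained from the deviation form of the definition, using deviations from the means instead of raw sums. The sketch below assumes the x and y columns of the table above; the function name `pearson_r_deviation` is illustrative.

```python
import math

def pearson_r_deviation(xs, ys):
    """Pearson's r as sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx = x - mean_x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dxs = [x - mean_x for x in xs]
    dys = [y - mean_y for y in ys]
    sum_dxdy = sum(dx * dy for dx, dy in zip(dxs, dys))
    sum_dx2 = sum(dx * dx for dx in dxs)
    sum_dy2 = sum(dy * dy for dy in dys)
    return sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)

# x and y columns of the table in this example.
xs = [25, 30, 28, 29, 32, 24, 36, 28, 27, 21]
ys = [18, 20, 21, 16, 14, 13, 22, 15, 19, 12]

r = pearson_r_deviation(xs, ys)
print(round(r, 4))  # 0.5955
```

Here the means are whole numbers (28 and 17), so the deviation form is also convenient for hand calculation.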
3. A computer, while calculating $r_{xy}$ from 25 pairs of observations, obtained the following
constants: [...]. A later check showed that two pairs of values, (6, 14) and (8, 6), were wrong,
while the correct values were (8, 12) and (6, 8). Obtain the correct value of the correlation coefficient.
Ans:
Correct value of $\sum x$ = 125 − 6 − 8 + 8 + 6 = 125
Correct value of $\sum x^2$ = 650 − 36 − 64 + 64 + 36 = 650
Correct value of $\sum y^2$ = 460 − 196 − 36 + 144 + 64 = 436
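The correction step follows one pattern: subtract the contribution of the wrong values from the recorded total and add the contribution of the correct ones. A minimal sketch, assuming only the three totals that appear in the answer above (125, 650, 460); the helper name `corrected_sum` is hypothetical.

```python
def corrected_sum(recorded_total, wrong_values, right_values, power=1):
    """Remove wrongly entered values from a recorded total and add the correct ones."""
    return (recorded_total
            - sum(v ** power for v in wrong_values)
            + sum(v ** power for v in right_values))

wrong_pairs = [(6, 14), (8, 6)]
right_pairs = [(8, 12), (6, 8)]

wrong_x = [x for x, _ in wrong_pairs]
wrong_y = [y for _, y in wrong_pairs]
right_x = [x for x, _ in right_pairs]
right_y = [y for _, y in right_pairs]

sum_x = corrected_sum(125, wrong_x, right_x)            # corrected sum of x
sum_x2 = corrected_sum(650, wrong_x, right_x, power=2)  # corrected sum of x^2
sum_y2 = corrected_sum(460, wrong_y, right_y, power=2)  # corrected sum of y^2
print(sum_x, sum_x2, sum_y2)  # 125 650 436
```

The same helper would apply to the sum of y, and the sum of xy can be corrected analogously by subtracting and adding the products of each pair.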
Rank Correlation
Sometimes we have to deal with problems in which data cannot be quantitatively measured but
qualitative measurement is possible. Here, we give ranks to the values in each series separately
and calculate Spearman’s rank correlation coefficient as
$$ \rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} $$
where d is the difference between the ranks of paired items in the two series and n is the number of pairs.
2. Spearman’s rank correlation coefficient and Karl Pearson’s correlation coefficient for
a given data, are usually different.
3. Spearman’s rank correlation coefficient has the same value as Karl Pearson’s correlation
coefficient between the ranks.
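Both remarks can be illustrated numerically. The sketch below uses hypothetical data with no tied values; the function names `ranks`, `spearman_rho` and `pearson_r` are illustrative, and rank 1 is assigned to the largest value (an assumption; ranking from the smallest gives the same coefficient).

```python
import math

def ranks(values):
    """Rank 1 = largest value. Assumes all values are distinct."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Spearman's rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

def pearson_r(xs, ys):
    """Pearson's r in deviation form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only.
xs = [3.1, 8.4, 5.0, 9.7, 1.2, 6.6]
ys = [10, 35, 90, 60, 5, 20]

rho = spearman_rho(xs, ys)
r_on_ranks = pearson_r(ranks(xs), ranks(ys))   # equals rho when there are no ties
r_on_values = pearson_r(xs, ys)                # generally differs from rho
print(round(rho, 4), round(r_on_ranks, 4), round(r_on_values, 4))
```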
Repeated ranks
If two or more individuals have the same value in a series, then each individual is given the
average of ranks. Then rank correlation coefficient is
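$$ \rho = 1 - \frac{6\left[\sum d^2 + \frac{1}{12}(m^3 - m) + \frac{1}{12}(m^3 - m) + \cdots\right]}{n(n^2 - 1)} $$
where m is the number of items whose ranks are equal, with one correction term $\frac{1}{12}(m^3 - m)$ for each group of tied values.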
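Assigning average ranks is the step that is easiest to get wrong by hand. The sketch below shows one way to do it, assuming rank 1 goes to the largest value; the function names `average_ranks` and `tie_sizes` and the sample scores are hypothetical.

```python
def average_ranks(values):
    """Rank 1 = largest value; tied values share the average of the ranks they occupy."""
    ordered = sorted(values, reverse=True)
    result = []
    for v in values:
        first = ordered.index(v) + 1   # first rank position occupied by v
        count = ordered.count(v)       # m, the number of items tied at v
        result.append(first + (count - 1) / 2)
    return result

def tie_sizes(values):
    """Sizes m of each group of equal values; untied values are skipped."""
    return [values.count(v) for v in sorted(set(values)) if values.count(v) > 1]

# Hypothetical scores, for illustration only.
scores = [50, 70, 70, 40, 70, 90, 40]
print(average_ranks(scores))  # [5.0, 3.0, 3.0, 6.5, 3.0, 1.0, 6.5]
print(tie_sizes(scores))      # [2, 3]
```

A quick sanity check is that the ranks always add up to n(n + 1)/2, here 28.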
Examples
1. Calculate the rank correlation coefficient between marks in the selection test (X) and the
proficiency test (Y) of 9 recruits.
Sl.No. 1 2 3 4 5 6 7 8 9
X: 10 15 12 17 13 16 24 14 22
Y: 30 42 45 46 33 34 40 35 39
Ans:
n = 9, $\sum d^2 = 72$
$$ \rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 72}{9(81 - 1)} = 0.4 $$
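The ranking step behind $\sum d^2 = 72$ can be reproduced as follows. This is a sketch that assumes rank 1 is given to the highest mark (ranking from the lowest gives the same result); the function name `ranks` is illustrative.

```python
def ranks(values):
    """Rank 1 = largest value. Assumes all values are distinct."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

# Marks of the 9 recruits from Example 1.
x_marks = [10, 15, 12, 17, 13, 16, 24, 14, 22]
y_marks = [30, 42, 45, 46, 33, 34, 40, 35, 39]

rank_x = ranks(x_marks)
rank_y = ranks(y_marks)
sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))

n = len(x_marks)
rho = 1 - 6 * sum_d2 / (n * (n * n - 1))

print(rank_x)          # [9, 5, 8, 3, 7, 4, 1, 6, 2]
print(rank_y)          # [9, 3, 2, 1, 8, 7, 4, 6, 5]
print(sum_d2)          # 72
print(round(rho, 3))   # 0.4
```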
2. Ten competitors in a music competition are ranked by three judges in the following order:
Competitor: 1 2 3 4 5 6 7 8 9 10
Judge A: 1 6 5 10 3 2 4 9 7 8
Judge B: 3 5 8 4 7 10 2 1 6 9
Judge C: 6 4 9 8 1 2 3 10 5 7
Using rank correlation coefficient, determine which pair of judges have common taste in music.
Ans:
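Let $d_1$, $d_2$ and $d_3$ denote the rank differences for the pairs of judges (A, B), (B, C) and (A, C) respectively.
$\sum d_1^2 = 200$, $\sum d_2^2 = 214$, $\sum d_3^2 = 60$
n = 10
$$ \rho_1 = 1 - \frac{6 \sum d_1^2}{n(n^2 - 1)} = 1 - \frac{6 \times 200}{10(100 - 1)} = -0.212 $$
$$ \rho_2 = 1 - \frac{6 \sum d_2^2}{n(n^2 - 1)} = 1 - \frac{6 \times 214}{10(100 - 1)} = -0.297 $$
$$ \rho_3 = 1 - \frac{6 \sum d_3^2}{n(n^2 - 1)} = 1 - \frac{6 \times 60}{10(100 - 1)} = 0.636 $$
Since $\rho_3$ is the largest, judges A and C have the nearest approach to common taste in music.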
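The three pairwise comparisons can be computed in one pass. A minimal sketch using the rankings given in the question; the dictionary layout and the function name `rank_correlation` are illustrative choices.

```python
from itertools import combinations

def rank_correlation(ranks_1, ranks_2):
    """Spearman's rho for two rankings of the same items (no ties)."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

judges = {
    "A": [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "B": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "C": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

# Compare every pair of judges: AB, AC, BC.
results = {}
for (name_1, r1), (name_2, r2) in combinations(judges.items(), 2):
    results[name_1 + name_2] = rank_correlation(r1, r2)

for pair, rho in results.items():
    print(pair, round(rho, 3))   # AB -0.212, AC 0.636, BC -0.297

closest = max(results, key=results.get)
print(closest)  # AC
```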
$\sum d^2 = 72$
In the X series,
75 is repeated 2 times (m = 2)
64 is repeated 3 times (m = 3)
In the Y series,
68 is repeated 2 times (m = 2)
With n = 10,
$$ \rho = 1 - \frac{6\left[\sum d^2 + \frac{1}{12}(m^3 - m) + \frac{1}{12}(m^3 - m) + \cdots\right]}{n(n^2 - 1)} = 1 - \frac{6\left[72 + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(3^3 - 3) + \frac{1}{12}(2^3 - 2)\right]}{10(10^2 - 1)} = 0.545 $$
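The tie-corrected formula can be evaluated directly from the quantities stated in this answer ($\sum d^2 = 72$, tie groups of size 2, 3 and 2, n = 10). The function name `tied_rank_correlation` is an illustrative choice.

```python
def tied_rank_correlation(sum_d2, tie_sizes, n):
    """Rank correlation with the (m^3 - m)/12 correction for each group of m tied items."""
    correction = sum((m ** 3 - m) / 12 for m in tie_sizes)
    return 1 - 6 * (sum_d2 + correction) / (n * (n * n - 1))

rho = tied_rank_correlation(sum_d2=72, tie_sizes=[2, 3, 2], n=10)
print(round(rho, 3))  # 0.545
```

The three correction terms contribute 0.5 + 2 + 0.5 = 3, so the bracket equals 75 and the result is 1 − 450/990.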
Exercises:
1. Calculate Karl Pearson’s Coefficient of correlation between price and supply of a commodity
from the following data:
Price (Rs.): 17 18 19 20 21 22 23 24 25 26
Supply (Kg): 38 37 38 33 32 33 34 29 26 23
2. Compute the coefficient of correlation between the corresponding values of x and y in the
following table:
X: 2 4 5 6 8 11
Y: 18 12 10 8 7 5
3. Calculate Karl Pearson’s correlation coefficient from the data:
Roll No. 1 2 3 4 5 6 7 8 9 10
Marks in Economics: 78 36 98 25 75 82 90 62 65 39
Marks in Maths: 84 51 91 60 68 62 86 58 53 47
4. Calculate the coefficient of correlation from the following data:
X: 9 8 7 6 5 4 3 2 1
Y: 15 16 14 13 11 12 10 8 9
5. Calculate the correlation coefficient between infant gestational age and birth weight from the
following table:
Infant ID: 1 2 3 4 5 6 7 8 9 10 11
Gest. age: 34.7 36.0 29.3 40.1 35.7 42.4 40.3 37.3 40.9 38.3 38.5
Birth weight: 1895 2030 1440 2835 3090 3827 3260 2690 3285 2920 3430
Infant ID: 12 13 14 15 16 17
Gest. age: 41.4 39.7 39.7 41.1 38.0 38.7
Birth weight: 3657 3685 3345 3260 2680 2005
8. The coefficient of rank correlation between the debenture prices and share prices of a company
was + 0.8. If the sum of the squares of the difference in ranks was 33, find the value of n.
9. If covariance between X and Y is 10 and the variance of X and Y are respectively 16 and 9, find
the coefficient of correlation
10. Calculate Karl Pearson’s correlation between X and Y from the following data:
N = 13, $\sum X = 117$, $\sum X^2 = 1313$, $\sum Y = 260$, $\sum Y^2 = 6580$, $\sum XY = 2827$
11. In two sets of variables X and Y with 50 observations each, the following data were observed:
Mean of X = 10, S.D. of X = 3
Mean of Y = 6, S.D. of Y = 2
Coefficient of correlation between X and Y is 0.3. However, on subsequent verification, it
was found that one value of X (=10) and Y (=6) were inaccurate and hence weeded out. With
the remaining 49 pairs of values, how is the original value of correlation coefficient affected?
Regression
Regression is the measure of the average relationship between two or more variables in
terms of the original units of the data. It provides a mechanism for predicting or
forecasting.
If two variables X and Y are correlated, the points of the scatter diagram will be more or less
concentrated around a curve, called the curve of regression. If this curve is a straight line,
then it is called the line of regression.
When there is a reasonable amount of scatter, we can draw two different regression
lines, depending upon which variable we consider to be the more accurately measured: the
regression line of Y on X and the regression line of X on Y.
The regression line of Y on X gives the most probable value of Y for given values of X.
The regression line of X on Y gives the most probable value of X for given values of Y.
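The equation of the line of regression of Y on X is
$$ y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x}) $$
where $b_{yx} = r\,\frac{\sigma_y}{\sigma_x}$ is the regression coefficient of y on x.
The equation of the line of regression of X on Y is
$$ x - \bar{x} = r\,\frac{\sigma_x}{\sigma_y}\,(y - \bar{y}) $$
where $b_{xy} = r\,\frac{\sigma_x}{\sigma_y}$ is the regression coefficient of x on y.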
Note:
1. A regression equation allows us to express the relationship
between two (or more) variables algebraically. It indicates the
nature of the relationship between two (or more) variables. In
particular, it indicates the extent to which you can predict some
variables by knowing others, or the extent to which some are
associated with others.
2. A regression line is a line drawn through the points on a scatterplot
to summarize the relationship between the variables being studied.
When it slopes down (from top left to bottom right), this indicates a
negative or inverse relationship between the variables; when it
slopes up (from bottom left to top right), a positive or direct
relationship is indicated.
3. $$ b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}, \qquad b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} $$
4. Both the regression lines pass through the point $(\bar{x}, \bar{y})$. Hence, by solving the two
regression equations, we can find the means of X and Y.
5. Both the regression coefficients will have the same sign; either both will be positive or both
will be negative.
6. The correlation coefficient is the geometric mean of the regression coefficients,
i.e., $r_{xy} = \pm\sqrt{b_{yx}\, b_{xy}}$
If both the regression coefficients are positive, r will be positive; if both the regression
coefficients are negative, r will be negative.
7. Regression coefficients are independent of the change of origin, but not of scale.
2. If $r = \pm 1$, then $\tan\theta = 0$, so $\theta = 0$ or $\pi$, where $\theta$ is the angle between the two regression lines; i.e., the two regression lines coincide.
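Notes 4 to 6 can be checked on a small data set. The sketch below is illustrative only: the data are hypothetical (chosen to be negatively correlated so that the sign rule is visible), and the function name `regression_coefficients` is not from the notes.

```python
import math

def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy) from the raw-sum formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = n * sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y
    b_yx = s_xy / (n * sum(x * x for x in xs) - sum_x ** 2)
    b_xy = s_xy / (n * sum(y * y for y in ys) - sum_y ** 2)
    return b_yx, b_xy

# Hypothetical data, for illustration only.
xs = [2, 4, 6, 8, 10]
ys = [9, 7, 8, 4, 2]
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)

b_yx, b_xy = regression_coefficients(xs, ys)

# Note 5: both coefficients carry the same sign.
same_sign = (b_yx > 0) == (b_xy > 0)

# Note 6: r is the geometric mean of the coefficients, with their common sign.
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

# Note 4: write the lines as y = a1 + b_yx*x and x = a2 + b_xy*y, then solve them
# simultaneously; the intersection should be the point of means.
a1 = mean_y - b_yx * mean_x
a2 = mean_x - b_xy * mean_y
x_cross = (a2 + b_xy * a1) / (1 - b_xy * b_yx)
y_cross = a1 + b_yx * x_cross

print(round(b_yx, 2), round(b_xy, 2), round(r, 4))
print(round(x_cross, 6), round(y_cross, 6), mean_x, mean_y)
```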
Examples
1. Find the correlation coefficient and the equations of the regression lines for the
following data:
X: 1 2 3 4 5
Y: 2 5 3 8 7
Ans:
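          x      y      x^2      y^2       xy
          1      2        1        4        2
          2      5        4       25       10
          3      3        9        9        9
          4      8       16       64       32
          5      7       25       49       35
Total    15     25       55      151       88
n = 5, $\sum x = 15$, $\sum y = 25$, $\sum x^2 = 55$, $\sum y^2 = 151$, $\sum xy = 88$
$$ r_{xy} = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} = \frac{65}{\sqrt{50}\,\sqrt{130}} = 0.8062 $$
$$ b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{65}{50} = 1.3 $$
$$ b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} = \frac{65}{130} = 0.5 $$
The means are $\bar{x} = 15/5 = 3$ and $\bar{y} = 25/5 = 5$.
The equation of the line of regression of Y on X is $y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})$:
y − 5 = 1.3 (x − 3)
y = 1.3x + 1.1
The equation of the line of regression of X on Y is $x - \bar{x} = r\,\frac{\sigma_x}{\sigma_y}\,(y - \bar{y})$:
x − 3 = 0.5 (y − 5)
x = 0.5y + 0.5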
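The coefficients in this example can be reproduced from the raw sums. A minimal sketch; the function name `regression_lines` and its return layout are illustrative assumptions.

```python
import math

def regression_lines(xs, ys):
    """Return r, (b_yx, intercept of Y on X), (b_xy, intercept of X on Y)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = n * sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y
    s_xx = n * sum(x * x for x in xs) - sum_x ** 2
    s_yy = n * sum(y * y for y in ys) - sum_y ** 2
    r = s_xy / math.sqrt(s_xx * s_yy)
    b_yx = s_xy / s_xx
    b_xy = s_xy / s_yy
    mean_x, mean_y = sum_x / n, sum_y / n
    # Each line passes through the point of means.
    y_on_x = (b_yx, mean_y - b_yx * mean_x)
    x_on_y = (b_xy, mean_x - b_xy * mean_y)
    return r, y_on_x, x_on_y

xs = [1, 2, 3, 4, 5]
ys = [2, 5, 3, 8, 7]

r, (b_yx, a_yx), (b_xy, a_xy) = regression_lines(xs, ys)
print(round(r, 4))                      # 0.8062
print(round(b_yx, 2), round(a_yx, 2))   # 1.3 1.1  ->  y = 1.3x + 1.1
print(round(b_xy, 2), round(a_xy, 2))   # 0.5 0.5  ->  x = 0.5y + 0.5
```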
2. Marks obtained by 10 students in Mathematics (x) and Statistics (y) are given below:
X: 60 34 40 50 45 40 22 43 42 64
Y: 75 32 33 40 45 33 12 30 34 51
Find the two regression lines. Also find y when x = 55
Ans:
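          x      y      x^2      y^2       xy
         60     75     3600     5625     4500
         34     32     1156     1024     1088
         40     33     1600     1089     1320
         50     40     2500     1600     2000
         45     45     2025     2025     2025
         40     33     1600     1089     1320
         22     12      484      144      264
         43     30     1849      900     1290
         42     34     1764     1156     1428
         64     51     4096     2601     3264
Total   440    385    20674    17253    18499
n = 10
$\bar{x} = \frac{\sum x}{n} = \frac{440}{10} = 44$
$\bar{y} = \frac{\sum y}{n} = \frac{385}{10} = 38.5$
$$ b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = 1.1865 $$
$$ b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} = 0.6414 $$
The equation of the line of regression of Y on X is $y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})$:
y − 38.5 = 1.1865 (x − 44)
y = 1.1865x − 13.706
The equation of the line of regression of X on Y is $x - \bar{x} = r\,\frac{\sigma_x}{\sigma_y}\,(y - \bar{y})$:
x − 44 = 0.6414 (y − 38.5)
x = 0.6414y + 19.3061
When x = 55, the line of Y on X gives y = 1.1865(55) − 13.706 ≈ 51.55.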
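The question also asks for y when x = 55, which uses the line of Y on X. The sketch below fits both lines and makes that prediction; the helper name `fit_line` is hypothetical. Because the script does not round the slopes before computing the intercepts, its intercepts may differ from the hand-computed ones in the third decimal place.

```python
def fit_line(us, vs):
    """Least-squares line of v on u: returns (slope, intercept)."""
    n = len(us)
    sum_u, sum_v = sum(us), sum(vs)
    slope = ((n * sum(u * v for u, v in zip(us, vs)) - sum_u * sum_v)
             / (n * sum(u * u for u in us) - sum_u ** 2))
    intercept = sum_v / n - slope * sum_u / n   # the line passes through the means
    return slope, intercept

maths = [60, 34, 40, 50, 45, 40, 22, 43, 42, 64]   # x
stats = [75, 32, 33, 40, 45, 33, 12, 30, 34, 51]   # y

b_yx, a_yx = fit_line(maths, stats)   # Y on X: y = a_yx + b_yx * x
b_xy, a_xy = fit_line(stats, maths)   # X on Y: x = a_xy + b_xy * y

y_at_55 = a_yx + b_yx * 55
print(round(b_yx, 4), round(b_xy, 4))   # 1.1865 0.6414
print(round(y_at_55, 2))                # 51.55
```

Note that predicting y from x uses the Y on X line; inverting the X on Y line instead would give a different answer unless r = ±1.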
3. For the following data, find the most likely price at Madras corresponding to the price 70
at Bombay and that at Bombay corresponding to the price 68 at Madras
Madras Bombay
Average price 65 67
S.D. of price 0.5 3.5
S.D. of the difference between the prices at Madras and Bombay is 3.1
Ans:
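Let X denote the price at Madras and Y denote the price at Bombay.
Given $\bar{x} = 65$, $\bar{y} = 67$, $\sigma_x = 0.5$, $\sigma_y = 3.5$, $\sigma_{x-y} = 3.1$
Since $\sigma_{x-y}^2 = \sigma_x^2 + \sigma_y^2 - 2r\sigma_x\sigma_y$, we have 9.61 = 0.25 + 12.25 − 3.5r, so r ≈ 0.83.
$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = 5.81$
$b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = 0.12$
Most likely price at Madras when the price at Bombay is 70: x = 65 + 0.12 (70 − 67) ≈ 65.36
Most likely price at Bombay when the price at Madras is 68: y = 67 + 5.81 (68 − 65) ≈ 84.43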
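The key step in this example is recovering r from the standard deviation of the difference, using the identity Var(X − Y) = Var(X) + Var(Y) − 2r·σx·σy. The sketch below works through it without intermediate rounding; the function name `r_from_difference_sd` is illustrative. With unrounded r ≈ 0.826 the script gives b_yx ≈ 5.78, whereas the value 5.81 quoted above corresponds to rounding r to 0.83 first.

```python
def r_from_difference_sd(sd_x, sd_y, sd_diff):
    """Solve sd_diff^2 = sd_x^2 + sd_y^2 - 2*r*sd_x*sd_y for r."""
    return (sd_x ** 2 + sd_y ** 2 - sd_diff ** 2) / (2 * sd_x * sd_y)

mean_x, sd_x = 65, 0.5    # Madras
mean_y, sd_y = 67, 3.5    # Bombay
sd_diff = 3.1             # S.D. of the price difference

r = r_from_difference_sd(sd_x, sd_y, sd_diff)
b_yx = r * sd_y / sd_x    # regression coefficient of Y (Bombay) on X (Madras)
b_xy = r * sd_x / sd_y    # regression coefficient of X (Madras) on Y (Bombay)

madras_when_bombay_70 = mean_x + b_xy * (70 - mean_y)
bombay_when_madras_68 = mean_y + b_yx * (68 - mean_x)

print(round(r, 4), round(b_yx, 2), round(b_xy, 2))
print(round(madras_when_bombay_70, 2), round(bombay_when_madras_68, 2))
```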
Solution:
(a) Since both the regression lines pass through the point $(\bar{x}, \bar{y})$
8 x − 10 y + 66 = 0
40 x − 18 y − 214 = 0
x = 35
y = 31
$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}} = 0.485 $$
$$ b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = 0.568 $$
Exercise:
1. From the following data, obtain the two regression equations
Sales: 91 97 108 121 67 124 51 73 111 57
Purchase: 71 75 69 97 70 91 39 61 80 47
2. Two variables gave the following data: $\bar{x} = 20$, $\bar{y} = 15$, $\sigma_x = 4$, $\sigma_y = 3$, $r = +0.7$. Obtain
the two regression equations and find the most likely value of Y when X = 24.
3. You are given the following information about advertising and sales:
4. The equations of the two lines of regression for a bivariate data set are Y = 10(X – 5) and
X = 2.5(Y – 14). Find the arithmetic means of X and Y as well as the coefficient of
correlation between X and Y.