Correlation and Regression Notes
In many situations, the outcome of a random experiment has two measurable
characteristics, that is, it yields two random variables 𝑋 and 𝑌. Often we are
interested in finding whether the two random variables are related to each
other. If they are related, we try to determine the nature of the relationship and
its degree (correlation). Assuming that there is some correlation between 𝑋
and 𝑌, we then try to find a formula expressing the relationship and use this formula
to predict the most likely value of one random variable corresponding to any given value of
the other random variable.
Scatter diagram:
Let (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) be pairs of values of 𝑋 and 𝑌. We plot the points with
these co-ordinates on graph paper. The figure consisting of
the plotted points is called a scatter diagram.
From the scatter diagram, we can form a fairly good idea of the relationship between 𝑋 and 𝑌.
If the points are dense or closely packed, we may conclude that 𝑋 and 𝑌 are correlated. On
the other hand, if the points are widely scattered throughout the graph paper, we may
conclude that 𝑋 and 𝑌 are either not correlated or poorly correlated.
Further, if the points in the scatter diagram appear to lie near a straight line, we assume
that the random variables have linear correlation. If they cluster around a well-defined
curve other than a straight line, the correlation is assumed to be non-linear.
For example:
1. The number of cigarettes smoked per day (X) and the chance of lung cancer (Y)
have a positive correlation, i.e., as one variable increases the other also increases.
2. The temperature of a city (X) and the number of sweaters sold (Y) have a negative
correlation, i.e., as one variable increases the other decreases and vice versa.
3. Rainfall in a region (X) and number of cars sold (Y) have no correlation.
In this section we will assume linear correlation between 𝑋 and 𝑌 and discuss how to
measure the degree of linear correlation.
Covariance:
\[ \operatorname{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \left(\frac{1}{n}\sum_{i=1}^{n} x_i y_i\right) - \bar{x}\,\bar{y} \]
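As a minimal sketch of the covariance formula, the two forms (mean product of deviations, and mean of products minus product of means) can be computed side by side. The function names `covariance` and `covariance_shortcut` are illustrative choices, not part of these notes; the data are the five pairs used in the first Pearson challenge.

```python
def covariance(xs, ys):
    """Population covariance: mean of the products of deviations from the means."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n


def covariance_shortcut(xs, ys):
    """Same quantity via (1/n) * sum(x_i * y_i) - x_bar * y_bar."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar


xs = [3, 5, 4, 6, 2]
ys = [3, 4, 5, 2, 6]
# Both calls should agree (about -1.4) up to floating-point rounding.
print(covariance(xs, ys))
print(covariance_shortcut(xs, ys))
```

Note that both functions divide by 𝑛 (not 𝑛 − 1), matching the definition used in these notes.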
If (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) are 𝑛 pairs of values of 𝑋 and 𝑌, then Karl Pearson’s
coefficient of correlation is given by
\[ r = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y} \]
where 𝜎𝑥 and 𝜎𝑦 are the standard deviations of 𝑋 and 𝑌.
Note that: −1 ≤ 𝑟 ≤ 1.
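The definition of 𝑟 translates almost directly into code. The following is a sketch only: the name `pearson_r` and the small data sets are made up for illustration, and the standard deviations are taken with divisor 𝑛 to match the covariance above.

```python
import math


def pearson_r(xs, ys):
    """Karl Pearson's r = cov(X, Y) / (sigma_x * sigma_y), using divisor n throughout."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("r is undefined when either variable is constant")
    return cov / (sigma_x * sigma_y)


# Illustrative data: points lying exactly on a rising line and on a falling line.
print(pearson_r([1, 2, 3, 4], [3, 5, 7, 9]))  # close to +1
print(pearson_r([1, 2, 3, 4], [9, 7, 5, 3]))  # close to -1
```

The two extreme cases illustrate the bound −1 ≤ 𝑟 ≤ 1: points lying exactly on a straight line give 𝑟 = ±1, with the sign following the slope.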
The following image shows the scatter plot and the value of 𝑟 for different types of data:
https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/1920px-Correlation_examples2.svg.png
1. If 𝑟 = 1 then there is perfect positive correlation between 𝑋 and 𝑌.
2. If 𝑟 = −1 then there is perfect negative correlation between 𝑋 and 𝑌.
3. If 𝑟 = 0 then there is no linear correlation between 𝑋 and 𝑌.
Challenge 01: Compute Karl Pearson’s coefficient of correlation between 𝑋 and 𝑌 where
𝑋 3 5 4 6 2
𝑌 3 4 5 2 6
Solution:
𝑥𝑖 𝑦𝑖 𝑥𝑖𝑦𝑖 𝑥𝑖² 𝑦𝑖²
3 3 9 9 9
5 4 20 25 16
4 5 20 16 25
6 2 12 36 4
2 6 12 4 36
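∑𝑥𝑖 = 20 ∑𝑦𝑖 = 20 ∑𝑥𝑖𝑦𝑖 = 73 ∑𝑥𝑖² = 90 ∑𝑦𝑖² = 90
\[ \therefore r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)}} = \frac{73 - 5 \times 4 \times 4}{\sqrt{(90 - 5 \times 4^2)(90 - 5 \times 4^2)}} = -\frac{7}{10} = -0.7. \]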
Challenge 02: Compute Karl Pearson’s coefficient of correlation between 𝑋 and 𝑌 where
𝑋 65 66 67 67 68 69 70 72
𝑌 67 68 65 68 72 72 69 71
Solution:
Here the means are 544/8 = 68 for 𝑋 and 552/8 = 69 for 𝑌. In the table, 𝑥 = 𝑥𝑖 − 68 and 𝑦 = 𝑦𝑖 − 69 denote the deviations from the means.
𝑥𝑖 𝑦𝑖 𝑥 𝑦 𝑥² 𝑦² 𝑥𝑦
65 67 −3 −2 9 4 6
66 68 −2 −1 4 1 2
67 65 −1 −4 1 16 4
67 68 −1 −1 1 1 1
68 72 0 3 0 9 0
69 72 1 3 1 9 3
70 69 2 0 4 0 0
72 71 4 2 16 4 8
∑𝑥𝑖 = 544 ∑𝑦𝑖 = 552 ∑𝑥² = 36 ∑𝑦² = 44 ∑𝑥𝑦 = 24
\[ r = \frac{\sum_{i=1}^{n} xy}{\sqrt{\left(\sum_{i=1}^{n} x^2\right)\left(\sum_{i=1}^{n} y^2\right)}} = \frac{24}{\sqrt{36 \times 44}} = 0.6030. \]
Challenge 03: Compute Karl Pearson’s coefficient of correlation between 𝑋 and 𝑌 where
𝑥𝑖 𝑦𝑖 𝑥𝑖𝑦𝑖 𝑥𝑖² 𝑦𝑖²
100 110 11000 10000 12100
102 100 10200 10404 10000
108 104 11232 11664 10816
111 108 11988 12321 11664
115 112 12880 13225 12544
116 116 13456 13456 13456
118 120 14160 13924 14400
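∑𝑥𝑖 = 770 ∑𝑦𝑖 = 770 ∑𝑥𝑖𝑦𝑖 = 84916 ∑𝑥𝑖² = 84994 ∑𝑦𝑖² = 84980
\[ \therefore r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)}} = \frac{84916 - 7 \times 110 \times 110}{\sqrt{(84994 - 7 \times 110^2)(84980 - 7 \times 110^2)}} = \frac{216}{\sqrt{294 \times 280}} = 0.7528. \]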
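The three results above can be cross-checked with a short script built on the sums formula used in the solutions. This is a sketch for verification only; the function name `pearson_r_from_sums` is an illustrative choice, and the data are copied from the three challenge tables.

```python
import math


def pearson_r_from_sums(xs, ys):
    """r = (sum(xy) - n*x_bar*y_bar) / sqrt((sum(x^2) - n*x_bar^2) * (sum(y^2) - n*y_bar^2))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    ss_x = sum(x * x for x in xs) - n * x_bar ** 2
    ss_y = sum(y * y for y in ys) - n * y_bar ** 2
    return numerator / math.sqrt(ss_x * ss_y)


challenges = {
    "Challenge 01": ([3, 5, 4, 6, 2], [3, 4, 5, 2, 6]),
    "Challenge 02": ([65, 66, 67, 67, 68, 69, 70, 72],
                     [67, 68, 65, 68, 72, 72, 69, 71]),
    "Challenge 03": ([100, 102, 108, 111, 115, 116, 118],
                     [110, 100, 104, 108, 112, 116, 120]),
}

# The printed values should agree with the hand-computed answers to 4 decimal places.
for name, (xs, ys) in challenges.items():
    print(name, round(pearson_r_from_sums(xs, ys), 4))
```

The deviation method used in Challenge 02 is the same formula with the means subtracted first, so one function covers all three solutions.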
If (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) are 𝑛 pairs of values of 𝑋 and 𝑌 and no value of 𝑋 or
of 𝑌 is repeated, then Spearman’s rank correlation coefficient is given by
\[ R = 1 - 6\left(\frac{\sum_{i=1}^{n} d_i^2}{n^3 - n}\right), \]
where 𝑑𝑖 = 𝑅1𝑖 − 𝑅2𝑖 and 𝑅1𝑖 and 𝑅2𝑖 are the ranks assigned to the values of 𝑋 and 𝑌
respectively.
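The rank formula can be sketched in a few lines, assuming no repeated values. The names `descending_ranks` and `spearman_no_ties` are illustrative, and the four data pairs are made up for the example; rank 1 is given to the largest value.

```python
def descending_ranks(values):
    """Rank 1 goes to the largest value; this sketch assumes no repeated values."""
    if len(set(values)) != len(values):
        raise ValueError("this sketch assumes no repeated values")
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


def spearman_no_ties(xs, ys):
    """R = 1 - 6 * sum(d_i^2) / (n^3 - n), where d_i is the difference of ranks."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two pairs of equal-length data")
    r1 = descending_ranks(xs)
    r2 = descending_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Illustrative data: ranks of X are 4, 3, 2, 1 and ranks of Y are 4, 2, 3, 1,
# so sum(d^2) = 2 and R = 1 - 12/60.
print(spearman_no_ties([10, 20, 30, 40], [1, 3, 2, 4]))
```

Ranking both variables in ascending order instead gives the same 𝑅, because only the differences between the two ranks enter the formula.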
Challenge 01: Compute Spearman’s rank correlation coefficient between 𝑋 and 𝑌 where
𝑋 53 98 95 81 75 61 59 55
𝑌 47 25 32 37 30 40 39 45
Solution:
We can assign ranks to the values of 𝑋 and 𝑌 either in the ascending order or in the
descending order. The following assignment is in descending order of the values.
𝑥𝑖 𝑦𝑖 𝑅1𝑖 𝑅2𝑖 𝑑𝑖² = (𝑅1𝑖 − 𝑅2𝑖)²
53 47 8 1 49
98 25 1 8 49
95 32 2 6 16
81 37 3 5 4
75 30 4 7 9
61 40 5 3 4
59 39 6 4 4
55 45 7 2 25
∑𝑑𝑖² = 160
\[ \therefore R = 1 - 6\left(\frac{\sum_{i=1}^{n} d_i^2}{n^3 - n}\right) = 1 - 6\left(\frac{160}{8^3 - 8}\right) = -0.9048. \]
Challenge 02: Compute Spearman’s rank correlation coefficient between 𝑋 and 𝑌 where
𝑋 105 110 112 108 111 116 120 104 115 125
𝑌 39 41 45 38 48 58 60 35 54 69
Solution:
𝑥𝑖 𝑦𝑖 𝑅1𝑖 𝑅2𝑖 𝑑𝑖² = (𝑅1𝑖 − 𝑅2𝑖)²
105 39 9 8 1
110 41 7 7 0
112 45 5 6 1
108 38 8 9 1
111 48 6 5 1
116 58 3 3 0
120 60 2 2 0
104 35 10 10 0
115 54 4 4 0
125 69 1 1 0
∑𝑑𝑖² = 4
\[ \therefore R = 1 - 6\left(\frac{\sum_{i=1}^{n} d_i^2}{n^3 - n}\right) = 1 - 6\left(\frac{4}{10^3 - 10}\right) = 0.9758. \]
If (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) are 𝑛 pairs of values of 𝑋 and 𝑌 and if 𝑥𝑗 is repeated
𝑚1 times, 𝑥𝑘 is repeated 𝑚2 times, … and if 𝑦𝑝 is repeated 𝑚3 times, 𝑦𝑞 is repeated
𝑚4 times and so on, then Spearman’s rank correlation coefficient is given by
\[ R = 1 - 6\left(\frac{\sum_{i=1}^{n} d_i^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \cdots}{n^3 - n}\right), \]
where 𝑑𝑖 = 𝑅1𝑖 − 𝑅2𝑖 and 𝑅1𝑖 and 𝑅2𝑖 are the ranks assigned to the values of 𝑋 and 𝑌
respectively. Repeated values each receive the average of the ranks they would otherwise occupy.
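The corrected formula can be sketched as follows, assuming that repeated values share the average of the positions they occupy in descending order. The names `average_ranks`, `tie_correction` and `spearman_with_ties` are illustrative, and the four data pairs are made up for the example.

```python
from collections import Counter


def average_ranks(values):
    """Descending ranks; tied values share the mean of the positions they occupy."""
    order = sorted(values, reverse=True)
    positions = {}
    for pos, v in enumerate(order, start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every group of m repeated values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)


def spearman_with_ties(xs, ys):
    """R = 1 - 6 * (sum(d_i^2) + tie corrections for X and Y) / (n^3 - n)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two pairs of equal-length data")
    r1 = average_ranks(xs)
    r2 = average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    adjusted = sum_d2 + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * adjusted / (n ** 3 - n)


# Illustrative data: the value 2 occurs twice in X, so both copies get rank 2.5.
# sum(d^2) = 0.5, correction = (2^3 - 2)/12 = 0.5, so R = 1 - 6 * 1.0 / 60.
print(spearman_with_ties([1, 2, 2, 3], [1, 2, 3, 4]))
```

When no value is repeated, `tie_correction` returns 0 and the function reduces to the uncorrected formula.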
Challenge 03: Compute Spearman’s rank correlation coefficient between 𝑋 and 𝑌 where
𝑋 32 55 49 60 43 37 43 49 10 20
𝑌 40 30 70 20 30 50 72 60 45 25
Solution:
𝑥𝑖 𝑦𝑖 𝑅1𝑖 𝑅2𝑖 𝑑𝑖² = (𝑅1𝑖 − 𝑅2𝑖)²
32 40 8 6 4
55 30 2 7.5 30.25
49 70 3.5 2 2.25
60 20 1 10 81
43 30 5.5 7.5 4
37 50 7 4 9
43 72 5.5 1 20.25
49 60 3.5 3 0.25
10 45 10 5 25
20 25 9 9 0
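∑𝑑𝑖² = 176
Here 49 and 43 are each repeated twice in 𝑋 and 30 is repeated twice in 𝑌, so 𝑚1 = 𝑚2 = 𝑚3 = 2.
\[ \therefore R = 1 - 6\left(\frac{\sum_{i=1}^{n} d_i^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \frac{1}{12}(m_3^3 - m_3)}{n^3 - n}\right) \]
\[ = 1 - 6\left(\frac{176 + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2)}{10^3 - 10}\right) \]
\[ = -0.076. \]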
Challenge 04: Compute Spearman’s rank correlation coefficient between 𝑋 and 𝑌 where
𝑋 105 110 112 108 111 116 120 104 116 125 116
𝑌 39 41 45 38 48 58 60 38 54 69 50
Solution:
𝑥𝑖 𝑦𝑖 𝑅1𝑖 𝑅2𝑖 𝑑𝑖² = (𝑅1𝑖 − 𝑅2𝑖)²
105 39 10 9 1
110 41 8 8 0
112 45 6 7 1
108 38 9 10.5 2.25
111 48 7 6 1
116 58 4 3 1
120 60 2 2 0
104 38 11 10.5 0.25
116 54 4 4 0
125 69 1 1 0
116 50 4 5 1
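∑𝑑𝑖² = 7.5
Here 116 is repeated three times in 𝑋 and 38 is repeated twice in 𝑌, so 𝑚1 = 3 and 𝑚2 = 2.
\[ \therefore R = 1 - 6\left(\frac{\sum_{i=1}^{n} d_i^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2)}{n^3 - n}\right) \]
\[ \therefore R = 1 - 6\left(\frac{7.5 + \frac{1}{12}(3^3 - 3) + \frac{1}{12}(2^3 - 2)}{11^3 - 11}\right) \]
\[ \therefore R = 0.9545. \]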
More challenges:
Challenge 01: Compute both Karl Pearson’s coefficient of correlation (𝑟) and Spearman’s
rank correlation coefficient (𝑅) between 𝑋 and 𝑌 where
𝑋 30 33 25 10 33 75 40 85 90 95
𝑌 68 65 80 85 70 30 55 18 15 10
Challenge 02: Compute both Karl Pearson’s coefficient of correlation (𝑟) and Spearman’s
rank correlation coefficient (𝑅) between 𝑋 and 𝑌 where
𝑋 100 98 85 92 90 84 88 90 93 95
𝑌 500 610 700 630 670 800 800 750 700 690
Challenge 03: Compute both Karl Pearson’s coefficient of correlation (𝑟) and Spearman’s
rank correlation coefficient (𝑅) between 𝑋 and 𝑌 where
𝑋 2 3 5 7 11 13 17 19
𝑌 53 47 43 41 37 31 29 23
Challenge 04: Compute both Karl Pearson’s coefficient of correlation (𝑟) and Spearman’s
rank correlation coefficient (𝑅) between 𝑋 and 𝑌 where
𝑋 57 42 38 42 45 42 44 40 46 44 43 40
𝑌 10 26 41 29 27 27 19 18 19 31 29 33
LINEAR REGRESSION:
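If two variables are correlated, we can draw the line that best fits the
scatter diagram and use it to estimate the value of one variable given the value of the other.
The regression coefficient of 𝑌 on 𝑋 is
\[ b_{YX} = \frac{\operatorname{Cov}(X, Y)}{\sigma_x^2} \]
and, symmetrically, the regression coefficient of 𝑋 on 𝑌 is
\[ b_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_y^2} \]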
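Properties:
1. 𝑟² = 𝑏𝑌𝑋 𝑏𝑋𝑌
Since −1 ≤ 𝑟 ≤ 1, we have 0 ≤ 𝑟² ≤ 1, so the product 𝑏𝑌𝑋 𝑏𝑋𝑌 is non-negative.
∴ 𝑏𝑌𝑋 and 𝑏𝑋𝑌 are either both positive or both negative (or both zero), and 𝑟 has the same sign as both.
2. \( b_{YX} = r \, \frac{\sigma_y}{\sigma_x} \)
3. \( b_{XY} = r \, \frac{\sigma_x}{\sigma_y} \)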
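These properties can be checked numerically. The sketch below assumes the usual definitions 𝑏𝑌𝑋 = Cov(𝑋, 𝑌)/𝜎𝑥² and 𝑏𝑋𝑌 = Cov(𝑋, 𝑌)/𝜎𝑦², reuses the seven data pairs from the third Pearson challenge, and uses the illustrative name `regression_summary`.

```python
import math


def regression_summary(xs, ys):
    """Return (b_yx, b_xy, r, sigma_x, sigma_y), all computed with divisor n."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var_x = sum((x - x_bar) ** 2 for x in xs) / n
    var_y = sum((y - y_bar) ** 2 for y in ys) / n
    b_yx = cov / var_x                  # regression coefficient of Y on X
    b_xy = cov / var_y                  # regression coefficient of X on Y
    r = cov / math.sqrt(var_x * var_y)  # Pearson's r
    return b_yx, b_xy, r, math.sqrt(var_x), math.sqrt(var_y)


xs = [100, 102, 108, 111, 115, 116, 118]
ys = [110, 100, 104, 108, 112, 116, 120]
b_yx, b_xy, r, sigma_x, sigma_y = regression_summary(xs, ys)

# Each line should print True, up to floating-point tolerance.
print(math.isclose(r ** 2, b_yx * b_xy))           # property 1
print(math.isclose(b_yx, r * sigma_y / sigma_x))   # regression of Y on X
print(math.isclose(b_xy, r * sigma_x / sigma_y))   # regression of X on Y
```

A data set with 𝜎𝑥 ≠ 𝜎𝑦 is used on purpose, so that the two ratios 𝜎𝑦/𝜎𝑥 and 𝜎𝑥/𝜎𝑦 are distinguishable in the check.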
a) Estimate the Gross Domestic Savings as a percentage of the GDP if the annual
growth of National Income is 15.5.
b) Estimate the Annual growth of National Income if the Gross Domestic Savings as a
percentage of the GDP is 26.35.
Solution:
𝑿 𝒀 𝑿² 𝒀² 𝑿𝒀
14 24 196 576 336
17 23 289 529 391
18 26 324 676 468
17 27 289 729 459
16 25 256 625 400
12 25 144 625 300
16 23 256 529 368
11 25 121 625 275
8 24 64 576 192
10 23 100 529 230
∑𝑿 = 139 ∑𝒀 = 245 ∑𝑿² = 2039 ∑𝒀² = 6019 ∑𝑿𝒀 = 3419
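Mean of X:
\[ \bar{X} = \frac{\sum X}{n} = \frac{139}{10} = 13.9 \]
Mean of Y:
\[ \bar{Y} = \frac{\sum Y}{n} = \frac{245}{10} = 24.5 \]
Covariance of X and Y:
\[ \operatorname{Cov}(X, Y) = \left(\frac{1}{n}\sum XY\right) - \bar{X}\bar{Y} = \frac{3419}{10} - (13.9 \times 24.5) = 1.35 \]
Variance of X:
\[ \sigma_x^2 = \frac{\sum X^2}{n} - \bar{X}^2 = \frac{2039}{10} - 13.9^2 = 10.69 \]
Variance of Y:
\[ \sigma_y^2 = \frac{\sum Y^2}{n} - \bar{Y}^2 = \frac{6019}{10} - 24.5^2 = 1.65 \]
Co-efficient of regression of Y on X:
\[ b_{YX} = \frac{\operatorname{Cov}(X, Y)}{\sigma_x^2} = \frac{1.35}{10.69} = 0.1263 \]
Co-efficient of regression of X on Y:
\[ b_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_y^2} = \frac{1.35}{1.65} = 0.8182 \]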
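The regression line of Y on X is \( \hat{y} = b_{YX}(x - \bar{X}) + \bar{Y} \), so
\[ \hat{y} = 0.1263x - 0.1263 \times 13.9 + 24.5 \]
\[ \therefore \hat{y} = 0.1263x + 22.7444 \]
Hence, when 𝑥 = 15.5, we get \( \hat{y} = 0.1263 \times 15.5 + 22.7444 = 24.7021 \).
The estimated value of the Gross Domestic Savings as a percentage of the GDP (Y)
when the annual growth of National Income (X) is 15.5 is 24.7021.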
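Both estimates can be reproduced with a short script. This is a sketch under two assumptions: the data are the ten (X, Y) pairs consistent with 𝑛 = 10 and the totals ∑X = 139, ∑Y = 245, and part (b) uses the regression line of X on Y in the analogous form x̂ = 𝑏𝑋𝑌 (𝑦 − Ȳ) + X̄. The function names are illustrative.

```python
def regression_coefficients(xs, ys):
    """Return (x_bar, y_bar, b_yx, b_xy) using covariance and variances with divisor n."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var_x = sum((x - x_bar) ** 2 for x in xs) / n
    var_y = sum((y - y_bar) ** 2 for y in ys) / n
    return x_bar, y_bar, cov / var_x, cov / var_y


def estimate_y(x, x_bar, y_bar, b_yx):
    """Regression line of Y on X: y_hat = y_bar + b_yx * (x - x_bar)."""
    return y_bar + b_yx * (x - x_bar)


def estimate_x(y, x_bar, y_bar, b_xy):
    """Regression line of X on Y: x_hat = x_bar + b_xy * (y - y_bar)."""
    return x_bar + b_xy * (y - y_bar)


# X: annual growth of National Income, Y: Gross Domestic Savings as % of GDP.
xs = [14, 17, 18, 17, 16, 12, 16, 11, 8, 10]
ys = [24, 23, 26, 27, 25, 25, 23, 25, 24, 23]
x_bar, y_bar, b_yx, b_xy = regression_coefficients(xs, ys)

y_hat = estimate_y(15.5, x_bar, y_bar, b_yx)   # part (a), about 24.70
x_hat = estimate_x(26.35, x_bar, y_bar, b_xy)  # part (b), about 15.41
print(round(y_hat, 4), round(x_hat, 4))
```

Note that the two parts use different lines: Y is estimated from the line of Y on X, and X from the line of X on Y. Solving the first line for 𝑥 would not, in general, give the same estimate.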
2𝑥 + 3𝑦 + 8 = 0
𝑥 + 2𝑦 − 5 = 0
Find
Solution:
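Take the second line as the regression line of Y on X and the first as the regression line of X on Y. (The opposite choice gives 𝑏𝑌𝑋 𝑏𝑋𝑌 = 4/3 > 1, which is impossible since 𝑟² ≤ 1.)
Re-writing the lines as,
\[ y = -\frac{1}{2}x + \frac{5}{2} \;\Rightarrow\; b_{YX} = -\frac{1}{2} \]
\[ x = -\frac{3}{2}y - 4 \;\Rightarrow\; b_{XY} = -\frac{3}{2} \]
Now,
\[ r^2 = b_{YX} b_{XY} = \left(-\frac{1}{2}\right)\left(-\frac{3}{2}\right) = \frac{3}{4} = 0.75 \]
\[ \therefore r = \pm\sqrt{0.75} = \pm 0.866 \]
Since 𝑏𝑌𝑋 < 0 and 𝑏𝑋𝑌 < 0, we have 𝑟 < 0.
\[ r = -0.866 \]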
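The procedure above can be sketched in code. The function below is a hypothetical helper, not part of these notes: it takes each line as coefficients (a, b, c) of a𝑥 + b𝑦 + c = 0, tries both ways of assigning the lines to "Y on X" and "X on Y", and keeps the assignment whose product 𝑏𝑌𝑋 𝑏𝑋𝑌 lies between 0 and 1. It assumes all the 𝑥 and 𝑦 coefficients are non-zero.

```python
import math


def correlation_from_regression_lines(line_a, line_b):
    """Each line is (a, b, c) for a*x + b*y + c = 0. Returns (b_yx, b_xy, r).

    Assumes non-zero x and y coefficients. If both assignments were valid,
    the first one tried would be returned.
    """
    for y_on_x, x_on_y in ((line_a, line_b), (line_b, line_a)):
        a1, b1, _ = y_on_x
        a2, b2, _ = x_on_y
        b_yx = -a1 / b1  # slope after solving the first line for y
        b_xy = -b2 / a2  # slope after solving the second line for x
        product = b_yx * b_xy
        if 0 <= product <= 1:
            # r takes the common sign of the two regression coefficients.
            r = math.copysign(math.sqrt(product), b_yx)
            return b_yx, b_xy, r
    raise ValueError("no assignment gives 0 <= b_yx * b_xy <= 1")


# The two lines from the worked example: 2x + 3y + 8 = 0 and x + 2y - 5 = 0.
b_yx, b_xy, r = correlation_from_regression_lines((2, 3, 8), (1, 2, -5))
print(b_yx, b_xy, round(r, 3))
```

For these two lines the first assignment gives a product of 4/3 and is rejected, so the function settles on 𝑏𝑌𝑋 = −1/2 and 𝑏𝑋𝑌 = −3/2, as in the hand solution.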
More Challenges:
1. The following table shows the age (in years) of 10 children and a quantitative
measure of their aggressive behaviour (measured on a scale of 0 to 10). Determine
the regression line of aggressive behaviour according to age. From that line
determine the value of aggressive behaviour that would correspond to a child of 7.2
years.
Age 6 6 6.7 7 7.4 7.9 8 8.2 8.5 8.9
Aggressive Behaviour 9 6 7 8 7 4 2 3 3 1
2. Find the means of X and Y, and the coefficient of correlation between them from
the following two equations of lines of regression. (Mention all possible cases)
4𝑋 − 5𝑌 + 33 = 0, 20𝑋 − 9𝑌 − 107 = 0