U4 m4
CORRELATION
The word correlation combines 'co' (together) and 'relation' (connection), and
refers to the connection between two quantities. Correlation is the statistical
tool used to measure the degree to which two variables are linearly related to
each other, i.e., to study the relationship between two variables.
If two quantities (X, Y) vary in such a way that a change in one variable
corresponds to a change in the other, then the variables X and Y are said to be
correlated.
Example: Price of crude oil and stock price of an oil producing company.
Price of commodity and amount of demand.
Years of Experience and Salary of Employees
Dividend and Premium of Shares
Population and National Income etc.
Types of Correlation:
(i) Positive Correlation:
If an increase in the value of one variable is accompanied by an increase in the
value of the other variable, and vice versa, the correlation is positive.
(Same direction)
Examples: The more time spent running on a treadmill, the more
calories burned.
Time spent on marketing and number of customers.
Temperature and ice cream sales.
(ii) Negative Correlation:
If an increase in the value of one variable is accompanied by a decrease in the
value of the other variable, and vice versa, the correlation is negative.
(Opposite direction)
Examples: The quantity of a commodity demanded and its price are
negatively correlated.
Tax and dividend.
(iii) No Correlation:
If a change in the value of one variable has no connection with a change in
the value of the other variable, the variables are uncorrelated.
Examples: Shoe size and salary.
Weight of a person and the colour of their hair.
Correlation coefficient: A Correlation coefficient is a numerical measure of
some type of correlation. It is a statistical measure of the strength of the
relationship between the two variables.
Properties of correlation coefficient
The coefficient of correlation, denoted by r, lies between -1 and +1.
When r is positive, the variables x and y increase or decrease together.
If r = +1 then there is a perfect positive correlation.
When r is negative the variables x and y move in opposite direction.
If r = -1 then there is a perfect negative correlation.
If r = 0 then the variables are uncorrelated.
Problems
Q1: Calculate the correlation coefficient for the following heights (in inches)
of fathers (X) and their sons (Y).
X: 65 66 67 67 68 69 70 72
Y: 67 68 65 68 72 72 69 71
Solution:
Formula (Karl Pearson's coefficient of correlation):
$$ r_{XY} = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}} $$
X     Y     X^2     Y^2     XY
65    67    4225    4489    4355
66    68    4356    4624    4488
67    65    4489    4225    4355
67    68    4489    4624    4556
68    72    4624    5184    4896
69    72    4761    5184    4968
70    69    4900    4761    4830
72    71    5184    5041    5112
Totals: ∑X = 544, ∑Y = 552, ∑X^2 = 37028, ∑Y^2 = 38132, ∑XY = 37560
Here n = 8.
Substituting the above values in the formula, we get
$$ r_{XY} = \frac{8(37560) - (544)(552)}{\sqrt{8(37028) - (544)^2}\,\sqrt{8(38132) - (552)^2}} = 0.603 $$
There is a positive correlation between x and y.
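The hand calculation above can be cross-checked with a short script. This is a minimal sketch that applies the same sums-based formula to the Q1 data; the function name `pearson_r` is chosen here for illustration and does not come from the notes.

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's coefficient computed from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Heights (in inches) of fathers and sons from Q1.
fathers = [65, 66, 67, 67, 68, 69, 70, 72]
sons = [67, 68, 65, 68, 72, 72, 69, 71]

r = pearson_r(fathers, sons)
print(round(r, 3))  # 0.603
```

The function would raise a ZeroDivisionError if either variable were constant, since the denominator is then zero; the sketch does not guard against that case.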
Q2. A computer, while calculating the correlation coefficient between x and
y from 25 pairs of observations, obtained the following sums:
n = 25, ∑x = 125, ∑y = 100, ∑x^2 = 650, ∑y^2 = 460, ∑xy = 508.
It was, however, later discovered at the time of checking that two pairs had
been copied down as (6,14) and (8,6) while the correct values
were (8,12) and (6,8). Obtain the correct value of the correlation
coefficient.
Solution:
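The corrected sums are obtained by removing the wrong pairs and adding the
correct ones:
$\sum x = 125 - 6 - 8 + 8 + 6 = 125$
$\sum y = 100 - 14 - 6 + 12 + 8 = 100$
$\sum x^2 = 650 - 6^2 - 8^2 + 8^2 + 6^2 = 650$
$\sum y^2 = 460 - 14^2 - 6^2 + 12^2 + 8^2 = 436$
$\sum xy = 508 - (6 \times 14) - (8 \times 6) + (8 \times 12) + (6 \times 8) = 520$
Therefore, the correct value of the correlation coefficient is
$$ r_{XY} = \frac{(25)(520) - (125)(100)}{\sqrt{(25)(650) - (125)^2}\,\sqrt{(25)(436) - (100)^2}} = 0.667 $$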
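The correction step can be sketched in code as well. The snippet below starts from the uncorrected sums that appear as the leading terms of each line in the solution (125, 100, 650, 460, 508), subtracts the contribution of the wrongly copied pairs and adds that of the correct pairs. The helper name `apply_pairs` and the dictionary keys are illustrative choices, not part of the notes.

```python
import math

n = 25
# Uncorrected sums, as used at the start of each line of the solution.
sums = {"x": 125, "y": 100, "x2": 650, "y2": 460, "xy": 508}

wrong_pairs = [(6, 14), (8, 6)]
right_pairs = [(8, 12), (6, 8)]

def apply_pairs(totals, pairs, sign):
    """Add (sign=+1) or remove (sign=-1) the contribution of each pair."""
    for x, y in pairs:
        totals["x"] += sign * x
        totals["y"] += sign * y
        totals["x2"] += sign * x * x
        totals["y2"] += sign * y * y
        totals["xy"] += sign * x * y

corrected = dict(sums)
apply_pairs(corrected, wrong_pairs, -1)
apply_pairs(corrected, right_pairs, +1)

numerator = n * corrected["xy"] - corrected["x"] * corrected["y"]
denominator = (math.sqrt(n * corrected["x2"] - corrected["x"] ** 2)
               * math.sqrt(n * corrected["y2"] - corrected["y"] ** 2))
r_corrected = numerator / denominator
print(round(r_corrected, 3))  # 0.667
```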
RANK CORRELATION
Rank correlation is a measure used to analyse qualitative data arranged in order
of merit with respect to two characteristics A and B.
In general, correlation analysis assumes that the values of the variables are
exactly measurable. In some situations, it may not be possible to give precise
values for the variables. In such cases we can use another measure of
correlation called rank correlation.
Let (x_i, y_i), i = 1, 2, 3, ..., n, be the ranks of n individuals in the group
for two characteristics A and B respectively. The correlation coefficient
between the x_i and y_i is called the rank correlation coefficient.
Spearman's Rank Correlation Coefficient
$$ \rho_{xy} = 1 - \frac{6\sum d_i^2}{n(n^2-1)} $$
where $d_i = x_i - y_i$ and n is the number of pairs of observations.
Types:
1. When ranks are given
2. When the ranks are not given
3. When equal ranks are given.
PROBLEMS
Q1. When ranks are given:
The following are the ranks obtained by 10 students in statistics and
mathematics. To what extent is knowledge of students in statistics related to
knowledge in mathematics?
Rank of Stats : 1 2 3 4 5 6 7 8 9 10
Rank of Maths :2 4 1 5 3 9 7 10 6 8
Solution:
Rank in           Rank in
Statistics (R1)   Mathematics (R2)   d = R1 - R2   d^2
1                 2                  -1            1
2                 4                  -2            4
3                 1                  2             4
4                 5                  -1            1
5                 3                  2             4
6                 9                  -3            9
7                 7                  0             0
8                 10                 -2            4
9                 6                  3             9
10                8                  2             4
∑d^2 = 40
$$ \rho_{xy} = 1 - \frac{6\sum d_i^2}{n(n^2-1)} = 1 - \frac{6 \times 40}{10(100-1)} = 0.76 $$
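As a check on the arithmetic, Spearman's formula for given ranks can be written in a few lines. This is a minimal sketch; the function name `spearman_from_ranks` is an illustrative choice, and it assumes the ranks contain no ties.

```python
def spearman_from_ranks(rank_a, rank_b):
    """Spearman's rho = 1 - 6*sum(d^2) / (n(n^2 - 1)) for untied ranks."""
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# Ranks of the 10 students from Q1.
stats_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
maths_rank = [2, 4, 1, 5, 3, 9, 7, 10, 6, 8]

rho = spearman_from_ranks(stats_rank, maths_rank)
print(round(rho, 2))  # 0.76
```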
Q2. When ranks are not given:
Calculate Spearman’s rank correlation for the following data.
X: 53 98 95 81 75 71 59 55
Y: 47 25 32 37 30 40 39 45
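Solution:
Rank the values of X and Y in descending order (rank 1 for the highest value),
and let R1 and R2 denote the ranks in X and Y respectively.
X     Y     R1    R2    d = R1 - R2   d^2
53    47    8     1     7             49
98    25    1     8     -7            49
95    32    2     6     -4            16
81    37    3     5     -2            4
75    30    4     7     -3            9
71    40    5     3     2             4
59    39    6     4     2             4
55    45    7     2     5             25
∑d^2 = 160
$$ \rho_{XY} = 1 - \frac{6\sum d_i^2}{n(n^2-1)} = 1 - \frac{6 \times 160}{8(64-1)} = -0.9048 $$
There is a very high negative correlation between X and Y.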
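When the ranks are not given, they first have to be derived from the data. The sketch below assigns rank 1 to the largest value (an assumption consistent with the ranking used in the equal-ranks problem of these notes) and then applies Spearman's formula. The function names are illustrative, and the ranking helper assumes that no value is repeated.

```python
def descending_ranks(values):
    """Rank 1 goes to the largest value; assumes no repeated values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Spearman's rho computed from raw, untied observations."""
    n = len(xs)
    rank_x, rank_y = descending_ranks(xs), descending_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# Data from Q2.
X = [53, 98, 95, 81, 75, 71, 59, 55]
Y = [47, 25, 32, 37, 30, 40, 39, 45]

print(descending_ranks(X))        # [8, 1, 2, 3, 4, 5, 6, 7]
print(descending_ranks(Y))        # [1, 8, 6, 5, 7, 3, 4, 2]
print(round(spearman_rho(X, Y), 4))  # -0.9048
```

Ranking both variables in ascending order instead would give the same coefficient, because every difference d keeps its magnitude.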
Equal Ranks:
Q3.Find the rank correlation coefficient for the following data
x 92 89 87 86 86 77 71 63 53 50
y 86 83 91 77 68 85 52 82 37 57
Solution:
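Let R1 and R2 denote the ranks in x and y respectively. The value 86 occurs
twice in x, occupying positions 4 and 5, so each occurrence receives the
average rank (4 + 5)/2 = 4.5.
x     y     R1     R2    d = R1 - R2   d^2
92    86    1      2     -1            1
89    83    2      4     -2            4
87    91    3      1     2             4
86    77    4.5    6     -1.5          2.25
86    68    4.5    7     -2.5          6.25
77    85    6      3     3             9
71    52    7      9     -2            4
63    82    8      5     3             9
53    37    9      10    -1            1
50    57    10     8     2             4
∑d^2 = 44.5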
For equal ranks the formula is
$$ \rho_{xy} = 1 - \frac{6\left[\sum d_i^2 + \frac{m(m^2-1)}{12} + \cdots\right]}{n(n^2-1)} $$
where d = R1 - R2 and m is the number of times an item is repeated.
Here n = 10 and the item 86 is repeated twice, i.e. m = 2.
$$ \rho_{xy} = 1 - \frac{6\left[44.5 + \frac{2(2^2-1)}{12}\right]}{10 \times 99} = 1 - \frac{6(44.5 + 0.5)}{990} = 1 - \frac{6 \times 45}{990} = 0.727 $$
There is a high positive correlation between x and y.
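The equal-ranks procedure can be sketched as follows: tied values share the mean of the positions they occupy, and a correction term m(m^2 - 1)/12 is added for every value repeated m times in either variable. The function names are illustrative and not taken from the notes.

```python
from collections import Counter

def average_ranks(values):
    """Descending ranks; tied values share the mean of their positions."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1   # first position occupied by v
        count = ordered.count(v)       # number of times v occurs
        ranks.append(first + (count - 1) / 2)
    return ranks

def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every value repeated m times."""
    return sum(m * (m * m - 1) / 12
               for m in Counter(values).values() if m > 1)

def spearman_with_ties(xs, ys):
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    correction = tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * (sum_d2 + correction) / (n * (n * n - 1))

# Data from Q3; 86 appears twice in x.
x = [92, 89, 87, 86, 86, 77, 71, 63, 53, 50]
y = [86, 83, 91, 77, 68, 85, 52, 82, 37, 57]

rho_ties = spearman_with_ties(x, y)
print(round(rho_ties, 3))  # 0.727
```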
REGRESSION
Regression is the measure of the average relationship between two or
more variables in terms of original units of data.
Example:
If sales and advertising expenditure are correlated, we can find the expected
amount of sales for a given advertising expenditure, or the expenditure needed
to attain a given amount of sales.
Lines of regression
There are two regression lines: the regression line of X on Y and the
regression line of Y on X.
The regression line of Y on X gives the most probable value of Y for a given
value of X, and the regression line of X on Y gives the most probable value
of X for a given value of Y.
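Formula:
Regression Equations:
(i) Equation of the line of regression of Y on X:
$$ y - \bar{y} = b_{yx}(x - \bar{x}), \quad \text{where } b_{yx} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} $$
(ii) Equation of the line of regression of X on Y:
$$ x - \bar{x} = b_{xy}(y - \bar{y}), \quad \text{where } b_{xy} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2} $$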
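Problem: From the following data, obtain the two regression equations and the
coefficient of correlation, and find the most likely marks in statistics when
the marks in economics are 30.
Marks in Economics (x):  25 28 35 32 31 36 29 38 34 32
Marks in Statistics (y): 43 46 49 41 36 32 31 30 33 39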
Solution:
x       y       x - x̄         y - ȳ         (x - x̄)^2   (y - ȳ)^2   (x - x̄)(y - ȳ)
                (= x - 32)    (= y - 38)
25      43      -7            5             49          25          -35
28      46      -4            8             16          64          -32
35      49      3             11            9           121         33
32      41      0             3             0           9           0
31      36      -1            -2            1           4           2
36      32      4             -6            16          36          -24
29      31      -3            -7            9           49          21
38      30      6             -8            36          64          -48
34      33      2             -5            4           25          -10
32      39      0             1             0           1           0
Total:
320     380     0             0             140         398         -93
Here x̄ = ∑x / n = 320 / 10 = 32 and ȳ = ∑y / n = 380 / 10 = 38.
Equation of the line of regression of Y on X:
$$ y - \bar{y} = b_{yx}(x - \bar{x}), \quad \text{where } b_{yx} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} = \frac{-93}{140} = -0.6643 $$
Therefore y - 38 = -0.6643(x - 32)
y = -0.6643x + 38 + 0.6643 * 32
y = -0.6643x + 59.257
Equation of the line of regression of X on Y:
$$ x - \bar{x} = b_{xy}(y - \bar{y}), \quad \text{where } b_{xy} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2} = \frac{-93}{398} = -0.2337 $$
Therefore x - 32 = -0.2337(y - 38)
x = -0.2337y + 40.88
Coefficient of correlation:
r^2 = byx * bxy
= (-0.6643) * (-0.2337)
= 0.1552
r = ± sqrt(0.1552)
= ± 0.394
Since both regression coefficients are negative, r takes the negative sign:
r = -0.394
Now we have to find the most likely marks in statistics (y) when the marks in
economics (x) are 30. We use the line of regression of y on x,
i.e. y = -0.6643x + 59.257
Putting x = 30, we get y = -0.6643(30) + 59.257 = 39.33
y = 39 (approx.)
The most likely marks in statistics (y) when the marks in economics (x) are 30
are therefore 39.
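The regression calculation can be verified with a short script. This is a sketch using the marks from the table above; the variable and function names are illustrative. Note that r is given the sign of the regression coefficients, which are both negative here.

```python
import math

economics = [25, 28, 35, 32, 31, 36, 29, 38, 34, 32]    # x
stats_marks = [43, 46, 49, 41, 36, 32, 31, 30, 33, 39]  # y

n = len(economics)
x_bar = sum(economics) / n
y_bar = sum(stats_marks) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(economics, stats_marks))
s_xx = sum((x - x_bar) ** 2 for x in economics)
s_yy = sum((y - y_bar) ** 2 for y in stats_marks)

b_yx = s_xy / s_xx   # slope of the regression line of y on x
b_xy = s_xy / s_yy   # slope of the regression line of x on y

# r^2 = b_yx * b_xy, and r carries the common sign of the two slopes.
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

def predict_y(x):
    """Most likely y for a given x, from the line of y on x."""
    return y_bar + b_yx * (x - x_bar)

print(round(b_yx, 4), round(b_xy, 4))  # -0.6643 -0.2337
print(round(r, 3))                     # -0.394
print(round(predict_y(30), 2))         # 39.33
```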
CURVE FITTING
The first one is a rough method and in the second method evaluation of
constants may vary. So we adopt another method called the method of least
squares which gives a unique set of values to the constants in the equation of
fitting curves.
METHOD OF LEAST SQUARES
The least squares method is a statistical procedure for finding the best fit to
a set of data points by minimizing the sum of the squares of the residuals
(the deviations of the points from the fitted curve).
TYPES OF CURVE
1. Fitting of a straight line : y = a + bx
y→Dependent Variable
x→ Independent Variable
a,b →Constants
The normal equations are
$\sum y = na + b\sum x$
$\sum xy = a\sum x + b\sum x^2$
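The two normal equations can be solved directly for a and b. The sketch below eliminates a to obtain b and then back-substitutes. The function name `fit_line` is illustrative, and the data are hypothetical points lying exactly on y = 1 + 2x, chosen only so that the result is easy to check.

```python
def fit_line(xs, ys):
    """Solve the normal equations for a and b in y = a + bx."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Eliminating a from
    #   sum_y  = n*a     + b*sum_x
    #   sum_xy = a*sum_x + b*sum_x2
    # gives b, and a follows from the first equation.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

# Hypothetical data (not from the notes) lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = fit_line(xs, ys)
print(a, b)  # 1.0 2.0
```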
x     y      xy     x^2
3     168    504    9
7     120    840    49
9     72     648    81
10    73     730    100
Solution:
Let the straight line be y = a + bx ... (1)
The normal equations are
$\sum y = na + b\sum x$ ... (2) and
$\sum xy = a\sum x + b\sum x^2$ ... (3)
Since n = 6 (even), we take the origin to be 1973.5.
Year   y (sales)   x = year - 1973.5   x^2   xy
x    y    x^2   x^3   x^4   xy    x^2 y
1    2    1     1     1     2     2
2    6    4     8     16    12    24
3    7    9     27    81    21    63
4    8    16    64    256   32    128
Here n=9.
Year   y    x = year - 1980   x^2   x^3   x^4   xy    x^2 y
1979   85   -1                1     -1    1     -85   85
1980   82   0                 0     0     0     0     0
1981   75   1                 1     1     1     75    75
Q3. The prices of a commodity during 1993-98 are given below. Fit a parabola
$y = a + bx + cx^2$ to these data. Calculate the trend values. Estimate the
price of the commodity for the year 1999.
Year:  1993 1994 1995 1996 1997 1998
Price: 100  107  128  140  181  192
Solution:
The required parabola is $y = a + bx + cx^2$ ... (1)
The normal equations are
$\sum y = na + b\sum x + c\sum x^2$ ... (2)
$\sum xy = a\sum x + b\sum x^2 + c\sum x^3$ ... (3)
$\sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4$ ... (4)
Year    Price y   x = year - 1995.5   x^2     x^3       x^4       xy       x^2 y
1993    100       -2.5                6.25    -15.625   39.0625   -250     625
1994    107       -1.5                2.25    -3.375    5.0625    -160.5   240.75
1995    128       -0.5                0.25    -0.125    0.0625    -64      32
1996    140       0.5                 0.25    0.125     0.0625    70       35
1997    181       1.5                 2.25    3.375     5.0625    271.5    407.25
1998    192       2.5                 6.25    15.625    39.0625   480      1200
Total   848       0                   17.5    0         88.375    347      2540
With n = 6, the normal equations become
848 = 6a + 17.5c ... (5)
347 = 17.5b ... (6)
2540 = 17.5a + 88.375c ... (7)
Solving,
a = 136.12, b = 19.83, c = 1.786
Hence the required parabola is $y = 136.12 + 19.83x + 1.786x^2$.
The trend values are calculated in the table
For the year 1999, x = 1999 - 1995.5 = 3.5
Therefore, price in 1999 = 136.12 + (19.83 x 3.5) + (1.786 x 3.5 x 3.5)
= Rs. 227.40 (approx.)
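The parabola fit can be reproduced numerically. The sketch below builds the three normal equations from the Q3 data, solves them with a small Gaussian elimination routine (written out here so that no external library is assumed), and evaluates the fitted curve for each year and for 1999. The helper names `solve_linear`, `power_sum` and `trend` are illustrative.

```python
def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small system."""
    size = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, size))
        solution[row] = (aug[row][size] - tail) / aug[row][row]
    return solution

years = [1993, 1994, 1995, 1996, 1997, 1998]
prices = [100, 107, 128, 140, 181, 192]
origin = 1995.5                      # midpoint, since n = 6 is even
xs = [year - origin for year in years]

def power_sum(p, q=0):
    """Sum of x^p * y^q over the data."""
    return sum(x ** p * y ** q for x, y in zip(xs, prices))

n = len(xs)
normal_matrix = [[n,            power_sum(1), power_sum(2)],
                 [power_sum(1), power_sum(2), power_sum(3)],
                 [power_sum(2), power_sum(3), power_sum(4)]]
normal_rhs = [power_sum(0, 1), power_sum(1, 1), power_sum(2, 1)]

a, b, c = solve_linear(normal_matrix, normal_rhs)

def trend(year):
    """Fitted price for a given year."""
    x = year - origin
    return a + b * x + c * x * x

print(round(a, 2), round(b, 2), round(c, 3))
print([round(trend(year), 2) for year in years])  # trend values
print(round(trend(1999), 2))                      # estimate for 1999
```

Run on these data, the coefficients should agree with the hand-solved values to the precision shown in the notes, and the 1999 estimate should come out at roughly 227.4.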