U4 m4

The document covers the concepts of correlation and regression in statistics, defining correlation as the relationship between two variables and detailing types such as positive, negative, and no correlation. It also introduces the correlation coefficient, rank correlation, and regression equations, providing examples and problems for calculation. Additionally, it discusses curve fitting as a method for constructing a mathematical function that best fits a set of data points.

Uploaded by

akshithreddy849
Copyright
© © All Rights Reserved

SMTB1402-PROBABILITY & STATISTICS

UNIT 4 CORRELATION AND REGRESSION & CURVE FITTING

CORRELATION
The word correlation combines ‘co’ (together) and ‘relation’ (connection): it refers to a
connection between two quantities. Correlation is the statistical tool used to measure the
degree to which two variables are linearly related to each other,
i.e., it is used to study the relationship between two variables.
If two quantities (X, Y) vary in such a way that a change in one variable corresponds to a
change in the other, then the variables X and Y are said to be correlated.
Example: Price of crude oil and stock price of an oil producing company.
Price of commodity and amount of demand.
Years of Experience and Salary of Employees
Dividend and Premium of Shares
Population and National Income etc.

Types of Correlation:
(i) Positive Correlation:
An increase in the value of one variable is accompanied by an increase in the
value of the other variable, and vice versa. (Same direction)
Examples: The more time spent running on a treadmill, the more
calories burned.
Time spent on marketing and number of customers.
Temperature and ice cream sales.
(ii) Negative Correlation:
An increase in the value of one variable is accompanied by a decrease in the
value of the other variable, and vice versa. (Opposite direction)
Examples: Quantity of a commodity demanded and its price are
negatively correlated.
Tax and dividend.
(iii) No Correlation:
A change in the value of one variable has no connection with a change in
the value of the other variable.
Examples: Shoe size and salary.
Weight of a person and colour of his hair.
Correlation coefficient: A correlation coefficient is a numerical measure of
the strength and direction of the relationship between two variables. It is
denoted by r.
Properties of correlation coefficient
The coefficient of correlation lies between -1 and +1.
When r is positive, the variables x and y increase or decrease together.
If r = +1 then there is a perfect positive correlation.
When r is negative, the variables x and y move in opposite directions.
If r = -1 then there is a perfect negative correlation.
If r = 0 then the variables are uncorrelated (there is no linear relationship).
Problems
Q1: Calculate the correlation coefficient for the following heights (in inches)
of fathers (X) and their sons (Y).
X: 65 66 67 67 68 69 70 72
Y: 67 68 65 68 72 72 69 71
Solution:
Formula (Karl Pearson's Coefficient of Correlation):
$$ r_{XY} = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}} $$

X       Y       X²       Y²       XY
65      67      4225     4489     4355
66      68      4356     4624     4488
67      65      4489     4225     4355
67      68      4489     4624     4556
68      72      4624     5184     4896
69      72      4761     5184     4968
70      69      4900     4761     4830
72      71      5184     5041     5112
∑X=544  ∑Y=552  ∑X²=37028  ∑Y²=38132  ∑XY=37560

Here n=8.
Substituting the above values in the formula, we get
$$ r_{XY} = \frac{8(37560) - (544)(552)}{\sqrt{8(37028) - (544)^2}\,\sqrt{8(38132) - (552)^2}} = \frac{192}{\sqrt{288}\,\sqrt{352}} = 0.603 $$
There is a positive correlation between x and y.
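
The calculation in Q1 can be checked numerically. The following is a minimal Python sketch, not part of the original notes; the function name pearson_r and the variable names are illustrative assumptions. It evaluates the raw-sum form of Karl Pearson's formula for the father and son heights.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Karl Pearson's coefficient computed from the raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)               # sum of X squared
    syy = sum(y * y for y in ys)               # sum of Y squared
    sxy = sum(x * y for x, y in zip(xs, ys))   # sum of XY
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))

heights_x = [65, 66, 67, 67, 68, 69, 70, 72]   # fathers
heights_y = [67, 68, 65, 68, 72, 72, 69, 71]   # sons
r = pearson_r(heights_x, heights_y)
print(round(r, 3))  # 0.603
```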
Q2. A computer, while calculating the correlation coefficient between x and
y from 25 pairs of observations, obtained the following values:

n=25, ∑x=125, ∑x²=650, ∑y=100, ∑y²=460, ∑xy=508.

It was, however, later discovered at the time of checking that two pairs
had been copied down as (6,14) and (8,6) while the correct values
were (8,12) and (6,8). Obtain the correct value of the correlation
coefficient.

Solution:
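The corrected sums are obtained by removing the wrong pairs and adding the correct ones:
∑x = 125 - 6 - 8 + 8 + 6 = 125
∑y = 100 - 14 - 6 + 12 + 8 = 100
∑x² = 650 - 6² - 8² + 8² + 6² = 650
∑y² = 460 - 14² - 6² + 12² + 8² = 436
∑xy = 508 - (6×14) - (8×6) + (8×12) + (6×8) = 520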
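Therefore, the correct value of the correlation coefficient is
$$ r_{XY} = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}} $$
$$ r_{XY} = \frac{(25)(520) - (125)(100)}{\sqrt{(25)(650) - (125)^2}\,\sqrt{(25)(436) - (100)^2}} = \frac{500}{25 \times 30} = 0.667 $$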
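
The correction step in Q2 amounts to subtracting the contribution of each miscopied pair from every sum and adding the contribution of the correct pair. A minimal Python sketch of that bookkeeping follows; the variable names are illustrative assumptions, not taken from the notes.

```python
from math import sqrt

# Sums as originally (wrongly) obtained from the 25 pairs
n = 25
sx, sxx, sy, syy, sxy = 125, 650, 100, 460, 508

wrong_pairs = [(6, 14), (8, 6)]
right_pairs = [(8, 12), (6, 8)]

# Remove the contribution of each miscopied pair
for x, y in wrong_pairs:
    sx -= x
    sy -= y
    sxx -= x * x
    syy -= y * y
    sxy -= x * y

# Add the contribution of each correct pair
for x, y in right_pairs:
    sx += x
    sy += y
    sxx += x * x
    syy += y * y
    sxy += x * y

r = (n * sxy - sx * sy) / (sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))
print(sx, sy, sxx, syy, sxy)  # 125 100 650 436 520
print(round(r, 3))            # 0.667
```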
RANK CORRELATION
Rank correlation measures the association between two characteristics A and B
when the individuals are arranged in order of merit (ranked) with respect to each
characteristic. It is suited to qualitative attributes that can be ranked but not
measured exactly.

Examples: Honesty, Beauty, Intelligence, etc.

The correlation coefficient discussed so far assumes that the values of the
variables are exactly measurable. In some situations it may not be possible to
give precise values for the variables. In such cases we can use another measure
of correlation called rank correlation.

Let (xi, yi), i = 1, 2, 3, …, n be the ranks of n individuals in the group for two
characteristics A and B respectively. The correlation coefficient between
xi and yi is called the rank correlation.
Spearman's Rank Correlation Coefficient
$$ \rho_{xy} = 1 - \frac{6\sum d_i^2}{n(n^2-1)} $$
where d_i = x_i - y_i and n is the number of pairs of observations.

Types:
1. When ranks are given
2. When the ranks are not given
3. When equal ranks are given.
PROBLEMS
Q1. When ranks are given:
The following are the ranks obtained by 10 students in statistics and
mathematics. To what extent is knowledge of students in statistics related to
knowledge in mathematics?

Rank of Stats : 1 2 3 4 5 6 7 8 9 10
Rank of Maths : 2 4 1 5 3 9 7 10 6 8

Solution:
Rank in            Rank in
Statistics (R1)    Mathematics (R2)    d = R1 - R2    d²
1                  2                   -1             1
2                  4                   -2             4
3                  1                    2             4
4                  5                   -1             1
5                  3                    2             4
6                  9                   -3             9
7                  7                    0             0
8                  10                  -2             4
9                  6                    3             9
10                 8                    2             4
                                                      ∑d² = 40

$$ \rho_{xy} = 1 - \frac{6\sum d_i^2}{n(n^2-1)} = 1 - \frac{6 \times 40}{10(100-1)} = 0.76 $$
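
As a check on the rank formula, here is a minimal Python sketch (the function name spearman_from_ranks is a hypothetical helper, not from the notes). It assumes the ranks are already given and that there are no tied ranks.

```python
def spearman_from_ranks(r1, r2):
    """Spearman's rho from two lists of ranks (no tied ranks assumed)."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

stats_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
maths_rank = [2, 4, 1, 5, 3, 9, 7, 10, 6, 8]
rho = spearman_from_ranks(stats_rank, maths_rank)
print(round(rho, 2))  # 0.76
```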
Q2. When ranks are not given:
Calculate Spearman’s rank correlation for the following data.
X: 53 98 95 81 75 71 59 55
Y: 47 25 32 37 30 40 39 45
Solution:

X     Y     Rank X (R1)   Rank Y (R2)   d = R1 - R2   d²
53    47    8             1              7            49
98    25    1             8             -7            49
95    32    2             6             -4            16
81    37    3             5             -2            4
75    30    4             7             -3            9
71    40    5             3              2            4
59    39    6             4              2            4
55    45    7             2              5            25
                                                      ∑d² = 160

$$ \rho_{XY} = 1 - \frac{6\sum d_i^2}{n(n^2-1)} = 1 - \frac{6 \times 160}{8(64-1)} = -0.9048 $$
There is very high negative correlation between X and Y.
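
When ranks are not given, they must first be assigned from the raw values. The sketch below is a minimal Python illustration under two stated assumptions: rank 1 goes to the largest value (as in the table above), and all values within each series are distinct. The helper name ranks_descending is hypothetical.

```python
def ranks_descending(values):
    """Rank 1 goes to the largest value; assumes all values are distinct."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

x = [53, 98, 95, 81, 75, 71, 59, 55]
y = [47, 25, 32, 37, 30, 40, 39, 45]

r1 = ranks_descending(x)
r2 = ranks_descending(y)
n = len(x)
sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
rho = 1 - 6 * sum_d2 / (n * (n * n - 1))

print(r1)             # [8, 1, 2, 3, 4, 5, 6, 7]
print(r2)             # [1, 8, 6, 5, 7, 3, 4, 2]
print(sum_d2)         # 160
print(round(rho, 4))  # -0.9048
```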

Equal Ranks:
Q3.Find the rank correlation coefficient for the following data

x 92 89 87 86 86 77 71 63 53 50
y 86 83 91 77 68 85 52 82 37 57
Solution:
Let R1 and R2 denote the ranks in x and y respectively.
x     y     R1     R2    d = R1 - R2    d²
92    86    1      2     -1             1.00
89    83    2      4     -2             4.00
87    91    3      1      2             4.00
86    77    4.5    6     -1.5           2.25
86    68    4.5    7     -2.5           6.25
77    85    6      3      3             9.00
71    52    7      9     -2             4.00
63    82    8      5      3             9.00
53    37    9      10    -1             1.00
50    57    10     8      2             4.00
                                        ∑d² = 44.50

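For equal (tied) ranks the formula becomes
$$ \rho_{xy} = 1 - \frac{6\left[\sum d_i^2 + \frac{m(m^2-1)}{12} + \cdots\right]}{n(n^2-1)} $$
where d = R1 - R2 and m is the number of times an item is repeated. One
correction term m(m²-1)/12 is added for each group of repeated items.
Here n = 10 and the item 86 is repeated twice, i.e. m = 2.
$$ \rho_{xy} = 1 - \frac{6\left[44.5 + \frac{2(2^2-1)}{12}\right]}{10 \times 99} = 1 - \frac{6(44.5 + 0.5)}{990} = 1 - \frac{6 \times 45}{990} = 0.727 $$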
There is high positive Correlation between x and y.
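
The tied-rank procedure can be sketched in Python as follows. This is an illustrative sketch, not the notes' own method of computation: the helper names average_ranks and tie_correction are assumptions, rank 1 is given to the largest value, and tied values share the mean of the ranks they would otherwise occupy.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 = largest value; tied values share the mean of their positions."""
    return [
        sum(v > x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]

def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every group of m repeated values."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values())

x = [92, 89, 87, 86, 86, 77, 71, 63, 53, 50]
y = [86, 83, 91, 77, 68, 85, 52, 82, 37, 57]

r1, r2 = average_ranks(x), average_ranks(y)
n = len(x)
sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
correction = tie_correction(x) + tie_correction(y)
rho = 1 - 6 * (sum_d2 + correction) / (n * (n * n - 1))

print(sum_d2, correction)  # 44.5 0.5
print(round(rho, 3))       # 0.727
```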
REGRESSION
Regression is the measure of the average relationship between two or
more variables in terms of original units of data.
Example:
If the sales and advertisement are correlated we can find out expected amount
of sales for a given advertising expenditure or the amount needed for attaining
the given amount of sales.
Lines of regression
We shall have two regression lines as the regression line of X on Y and the
regression line of Y on X.
The regression line of Y on X gives the most probable value of Y for given
values of X and the regression line of X on Y gives the most probable values
of X for given values of Y.
Formula:
Regression Equations:
(i) Equation of the line of regression of Y on X:
$$ y - \bar{y} = b_{yx}(x - \bar{x}), \qquad \text{where } b_{yx} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} $$
(ii) Equation of the line of regression of X on Y:
$$ x - \bar{x} = b_{xy}(y - \bar{y}), \qquad \text{where } b_{xy} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2} $$

Q4. From the following data, find
(i) The two regression equations
(ii) The coefficient of correlation between the marks in Economics
and Statistics
(iii) The most likely marks in Statistics when marks in Economics are
30
Marks in Economics (x):  25 28 35 32 31 36 29 38 34 32
Marks in Statistics (y): 43 46 49 41 36 32 31 30 33 39
Solution:
x      y      x - x̄      y - ȳ      (x - x̄)²   (y - ȳ)²   (x - x̄)(y - ȳ)
              = x - 32   = y - 38
25     43     -7          5          49         25         -35
28     46     -4          8          16         64         -32
35     49      3          11         9          121         33
32     41      0          3          0          9           0
31     36     -1         -2          1          4           2
36     32      4         -6          16         36         -24
29     31     -3         -7          9          49          21
38     30      6         -8          36         64         -48
34     33      2         -5          4          25         -10
32     39      0          1          0          1           0
320    380     0          0          140        398        -93
Here x̄ = ∑x / n = 320 / 10 = 32 and ȳ = ∑y / n = 380 / 10 = 38.
Equation of the line of regression of Y on X:
$$ y - \bar{y} = b_{yx}(x - \bar{x}), \qquad \text{where } b_{yx} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} = \frac{-93}{140} = -0.6643 $$
Therefore y - 38 = -0.6643(x - 32)
y = -0.6643x + 38 + 0.6643 × 32
y = -0.6643x + 59.2576
Equation of the line of regression of X on Y:
$$ x - \bar{x} = b_{xy}(y - \bar{y}), \qquad \text{where } b_{xy} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2} = \frac{-93}{398} = -0.2337 $$
Therefore x - 32 = -0.2337(y - 38)
x = -0.2337y + 40.88
Coefficient of correlation:
r² = byx × bxy
= (-0.6643) × (-0.2337)
= 0.1552
r = ±sqrt(0.1552) = ±0.394
The sign of r is the same as the sign of the regression coefficients. Since byx and
bxy are both negative, r = -0.394.
Now we have to find the most likely marks in Statistics (y) when the marks in
Economics (x) are 30. We use the line of regression of y on x,
i.e. y = -0.6643x + 59.2576.
Putting x = 30, we get y = 39.33,
i.e. y = 39 (approx.)
The most likely marks in Statistics (y) when the marks in Economics (x) are 30
are therefore about 39.
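
The whole of Q4 can be reproduced with a short Python sketch. The variable names are illustrative assumptions. Note that the sign of r is taken from the regression coefficients, which are both negative for these data, so the sketch returns r ≈ -0.394.

```python
from math import sqrt, copysign

econ = [25, 28, 35, 32, 31, 36, 29, 38, 34, 32]    # x
stats = [43, 46, 49, 41, 36, 32, 31, 30, 33, 39]   # y

n = len(econ)
x_bar = sum(econ) / n
y_bar = sum(stats) / n

# Sums of squared deviations and of cross products about the means
sxx = sum((x - x_bar) ** 2 for x in econ)
syy = sum((y - y_bar) ** 2 for y in stats)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(econ, stats))

b_yx = sxy / sxx   # slope of the regression line of y on x
b_xy = sxy / syy   # slope of the regression line of x on y

# r^2 = b_yx * b_xy; r carries the common sign of the two coefficients
r = copysign(sqrt(b_yx * b_xy), b_yx)

# Most likely y for x = 30, from the line of y on x
y_at_30 = y_bar + b_yx * (30 - x_bar)

print(round(b_yx, 4), round(b_xy, 4))  # -0.6643 -0.2337
print(round(r, 3))                     # -0.394
print(round(y_at_30, 2))               # 39.33
```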

CURVE FITTING

Curve fitting is the process of constructing a curve, or a mathematical function,
that has the best fit to a set of data points, possibly subject to constraints.
For example, suppose we measure the rainfall and the yield of fields in Tamilnadu and
represent those values by xi and yi, i = 1, 2, 3, …, n.
We would like to know whether there is any relation between x and y. Such an empirical
relation is written in the form of an equation y = f(x).
Most of the time we may not be able to get an exact relation but only an
approximate curve. If (xi, yi), i = 1, 2, 3, …, n are the n paired data plotted
on a graph sheet, it is possible to draw a number of smooth curves passing
through or near the points. The method of finding such an approximating curve is called
curve fitting.

METHODS OF CURVE FITTING
1. The Graphical Method
2. The Method of Group Averages
3. The Method of Least Squares

The first is a rough method, and in the second the values obtained for the
constants may vary. So we adopt the method of least squares, which gives a
unique set of values for the constants in the equation of the fitted curve.
METHOD OF LEAST SQUARES
The least squares method is a statistical procedure to find the best fit for a set of
data points by minimizing the sum of the squares of the residuals, i.e. of the
deviations of the observed points from the fitted curve.
TYPES OF CURVE
1. Fitting of a straight line: y = a + bx
y → Dependent variable
x → Independent variable
a, b → Constants
The normal equations are
∑y = na + b∑x
∑xy = a∑x + b∑x²

2. Fitting of a parabolic curve: y = a + bx + cx²
The normal equations are
∑y = na + b∑x + c∑x²
∑xy = a∑x + b∑x² + c∑x³
∑x²y = a∑x² + b∑x³ + c∑x⁴

1. FITTING OF A STRAIGHT LINE

Let y = a + bx …(1) be the straight line to be fitted to the given data.
Let (xi, yi), i = 1, 2, …, n be the n pairs of observations to which the straight
line (1) is to be fitted. We have to select the values of a and b that give the
best fit to the given data.
Working procedure:
• To fit the straight line y = a + bx:
• Substitute the observed set of n values in this equation.
• Form the normal equations for each constant, i.e.
∑y = na + b∑x
∑xy = a∑x + b∑x²
which are obtained by taking ∑ on both sides of y = a + bx, and by taking
∑ on both sides after multiplying both sides of equation (1) by x.
Remark: Summing a constant n times gives n times the constant, i.e. ∑a = na.
• Solve these normal equations as simultaneous equations in a and b.
• Substitute the values of a and b in y = a + bx, which is the required line of
best fit.
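
For reference, the same normal equations also follow from the least squares criterion itself. The derivation below is the standard calculus argument and is added here as a sketch; it is not part of the original notes. The symbol S for the sum of squared residuals is an assumed name.

```latex
S(a,b) = \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2

\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} \left( y_i - a - b x_i \right) = 0
\;\Longrightarrow\; \sum y = na + b \sum x

\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} x_i \left( y_i - a - b x_i \right) = 0
\;\Longrightarrow\; \sum xy = a \sum x + b \sum x^2
```

Setting both partial derivatives to zero gives exactly the two normal equations used in the working procedure.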
Q1. Fit a straight line of the form y = a + bx by using the method
of least squares.
x : 3 7 9 10
y : 168 120 72 73
Solution:
Let the straight line be y = a + bx …(1)
The normal equations are
∑y = na + b∑x …(2) and
∑xy = a∑x + b∑x² …(3)

x        y         xy          x²
3        168       504         9
7        120       840         49
9        72        648         81
10       73        730         100
∑x = 29  ∑y = 433  ∑xy = 2722  ∑x² = 239

Therefore (2) → 433 = 4a + 29b (here n = 4) …(4)
(3) → 2722 = 29a + 239b …(5)
Multiply equation (4) by 29 and equation (5) by 4:
116a + 841b = 12557
116a + 956b = 10888
Subtracting the second equation from the first, we get
-115b = 1669
b = -14.513
Substituting the value of b in equation (4), we get
4a + 29(-14.513) = 433
4a - 420.88 = 433
4a = 433 + 420.88
4a = 853.88
a = 213.47
Substituting the values of a and b in equation (1),
we get y = 213.47 - 14.513x, which is the required line of best fit.
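
The two normal equations of Q1 can also be solved programmatically. The following minimal Python sketch (the function name fit_line is a hypothetical helper) forms the sums and solves for a and b by elimination. Carrying b to full precision gives b ≈ -14.513 and a ≈ 213.47; rounding b to -14.5 before substituting would instead give a = 213.375.

```python
def fit_line(xs, ys):
    """Least squares line y = a + b*x from the two normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Eliminate a from:  sy = n*a + b*sx  and  sxy = a*sx + b*sxx
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

a, b = fit_line([3, 7, 9, 10], [168, 120, 72, 73])
print(round(a, 2), round(b, 3))  # 213.47 -14.513
```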

Q2. Fit a straight line y = a + bx, to the following data .

Year: 1991 1992 1993 1994 1995 1996 1997


Sales: 125 128 133 135 140 141 143
(in lakhs)
Solution:
Let the straight line be y = a + bx …(1)
The normal equations are
∑y = na + b∑x …(2) and
∑xy = a∑x + b∑x² …(3)
Taking the middle year 1994 as the origin, x = year - 1994.

Year     y (sales)   x = year - 1994   x²        xy
1991     125         -3                9         -375
1992     128         -2                4         -256
1993     133         -1                1         -133
1994     135          0                0          0
1995     140          1                1          140
1996     141          2                4          282
1997     143          3                9          429
Total    ∑y = 945    ∑x = 0            ∑x² = 28  ∑xy = 87

Here n = 7.
Substituting the above values in the normal equations, we get
945 = 7a + b(0) …(4)
87 = a(0) + b(28) …(5)
From (4), a = 945 / 7 = 135
From (5), b = 87 / 28 = 3.11
Therefore the straight line trend equation is y = 135 + 3.11x, where x = year - 1994.

Q3. Fit a straight line y = a + bx to the following data.


Year : 1971 1972 1973 1974 1975 1976
Profit : 83 92 71 90 169 191

Solution:
Let the straight line be y = a + bx …(1)
The normal equations are
∑y = na + b∑x …(2) and
∑xy = a∑x + b∑x² …(3)
Since n = 6 (even), we take the origin to be 1973.5, the midpoint of the two middle years.

Year     y (profit)   x = year - 1973.5   x²          xy
1971     83           -2.5                6.25        -207.5
1972     92           -1.5                2.25        -138.0
1973     71           -0.5                0.25        -35.5
1974     90            0.5                0.25         45.0
1975     169           1.5                2.25         253.5
1976     191           2.5                6.25         477.5
Total    ∑y = 696     ∑x = 0              ∑x² = 17.5  ∑xy = 395

The normal equations become
696 = 6a + b(0)
395 = a(0) + b(17.5)
Therefore, a = 696 / 6 = 116 and b = 395 / 17.5 = 22.57.
The straight line trend is given by y = 116 + 22.57x, where x = year - 1973.5.
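
Choosing the origin at the middle of the time period makes ∑x = 0, so the normal equations decouple into a = ∑y / n and b = ∑xy / ∑x². The Python sketch below illustrates this shortcut for both time series problems above; the function name fit_trend is a hypothetical helper, and it assumes equally spaced years.

```python
def fit_trend(years, values):
    """Straight line trend with origin at the mean year, so that sum(x) = 0."""
    n = len(years)
    origin = sum(years) / n                 # 1994 for 7 years, 1973.5 for 6 years
    xs = [yr - origin for yr in years]
    a = sum(values) / n                     # from  sum(y) = n*a  when sum(x) = 0
    b = sum(x * y for x, y in zip(xs, values)) / sum(x * x for x in xs)
    return origin, a, b

# Sales data, 1991 to 1997 (odd number of years)
origin2, a2, b2 = fit_trend(list(range(1991, 1998)), [125, 128, 133, 135, 140, 141, 143])
print(origin2, a2, round(b2, 2))  # 1994.0 135.0 3.11

# Profit data, 1971 to 1976 (even number of years)
origin3, a3, b3 = fit_trend(list(range(1971, 1977)), [83, 92, 71, 90, 169, 191])
print(origin3, a3, round(b3, 2))  # 1973.5 116.0 22.57
```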
FITTING OF A PARABOLA y = a + bx + cx²
Let (xi, yi), i = 1, 2, …, n be a set of observations of two variables x and y.
Let y = a + bx + cx² be the equation which best fits the given data.

The normal equations are
na + b∑x + c∑x² = ∑y …(1)
a∑x + b∑x² + c∑x³ = ∑xy …(2)
a∑x² + b∑x³ + c∑x⁴ = ∑x²y …(3)

They are obtained as follows:
(i) In y = a + bx + cx², take ∑ on both sides.
(ii) Multiply both sides by x and then take ∑ on both sides.
(iii) Multiply both sides by x² and then take ∑ on both sides.
Working Procedure:
• Form the normal equations
na + b∑x + c∑x² = ∑y
a∑x + b∑x² + c∑x³ = ∑xy
a∑x² + b∑x³ + c∑x⁴ = ∑x²y
• Solve these as simultaneous equations for a, b, c.
• Substitute the values of a, b, c in y = a + bx + cx², which is the required
parabola of best fit.

Q1. Fit a second degree parabola y = a + bx + cx² to the following data:

x: 1 2 3 4 5 6 7 8 9
y: 2 6 7 8 10 11 11 10 9
Solution:
Let the parabola be y = a + bx + cx² …(1)
whose normal equations are
∑y = na + b∑x + c∑x² …(2)
∑xy = a∑x + b∑x² + c∑x³ …(3)
∑x²y = a∑x² + b∑x³ + c∑x⁴ …(4)

x     y     x²    x³     x⁴      xy    x²y
1     2     1     1      1       2     2
2     6     4     8      16      12    24
3     7     9     27     81      21    63
4     8     16    64     256     32    128
5     10    25    125    625     50    250
6     11    36    216    1296    66    396
7     11    49    343    2401    77    539
8     10    64    512    4096    80    640
9     9     81    729    6561    81    729
∑x = 45   ∑y = 74   ∑x² = 285   ∑x³ = 2025   ∑x⁴ = 15333   ∑xy = 421   ∑x²y = 2771

Therefore (2) → 74 = 9a + 45b + 285c …(5) (since n = 9)
(3) → 421 = 45a + 285b + 2025c …(6)
(4) → 2771 = 285a + 2025b + 15333c …(7)
To eliminate a, subtract 5 × Eqn (5) from Eqn (6):
60b + 600c = 51 …(8)
Then subtract 285 × Eqn (6) from 45 × Eqn (7) and divide the result by 45:
220b + 2508c = 104.67 …(9)
To eliminate b, subtract 220 × Eqn (8) from 60 × Eqn (9):
→ 18480c = -4939.8
→ c = -0.2673
Substituting the value of c in Eqn (8):
60b = 51 - 600(-0.2673)
60b = 211.38
→ b = 3.523
Now, substituting the values of b and c in Eqn (5):
9a = 74 - 45(3.523) - 285(-0.2673) = -8.3545
a = -0.9283
Therefore, y = -0.9283 + 3.523x - 0.2673x² is the required parabola.
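
The three normal equations can also be solved directly. The sketch below is a minimal Python illustration that uses Cramer's rule on the 3×3 system; this is a substitute for the hand elimination shown above, chosen only because it is short, and the helper names det3 and fit_parabola are assumptions. Because no intermediate rounding is done, a comes out as about -0.9286 rather than -0.9283.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def fit_parabola(xs, ys):
    """Least squares parabola y = a + b*x + c*x^2 via the normal equations."""
    n = len(xs)
    s = lambda p: sum(x ** p for x in xs)   # sum of x^p
    lhs = [[n, s(1), s(2)],
           [s(1), s(2), s(3)],
           [s(2), s(3), s(4)]]
    rhs = [sum(ys),
           sum(x * y for x, y in zip(xs, ys)),
           sum(x * x * y for x, y in zip(xs, ys))]
    d = det3(lhs)
    coeffs = []
    for k in range(3):                      # Cramer's rule: replace column k by rhs
        m = [row[:] for row in lhs]
        for r in range(3):
            m[r][k] = rhs[r]
        coeffs.append(det3(m) / d)
    return tuple(coeffs)

a, b, c = fit_parabola([1, 2, 3, 4, 5, 6, 7, 8, 9], [2, 6, 7, 8, 10, 11, 11, 10, 9])
print(round(a, 4), round(b, 3), round(c, 4))  # -0.9286 3.523 -0.2673
```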

Q2. Fit a second degree polynomial equation to the following data.


Year(x): 1976 1977 1978 1979 1980 1981 1982 1983 1984
Sales(y): 50 65 70 85 82 75 65 90 95
(in lakhs)
Solution:
The second degree polynomial equation is
y = a + bx + cx² …(1)
The normal equations are
∑y = na + b∑x + c∑x² …(2)
∑xy = a∑x + b∑x² + c∑x³ …(3)
∑x²y = a∑x² + b∑x³ + c∑x⁴ …(4)

Here n = 9.
Year     y      x = year - 1980   x²     x³     x⁴     xy      x²y
1976     50     -4                16     -64    256    -200    800
1977     65     -3                9      -27    81     -195    585
1978     70     -2                4      -8     16     -140    280
1979     85     -1                1      -1     1      -85     85
1980     82      0                0       0     0       0      0
1981     75      1                1       1     1       75     75
1982     65      2                4       8     16      130    260
1983     90      3                9       27    81      270    810
1984     95      4                16      64    256     380    1520
Total    ∑y = 677   ∑x = 0   ∑x² = 60   ∑x³ = 0   ∑x⁴ = 708   ∑xy = 235   ∑x²y = 4415



Substituting the above values, we get
677 = 9a + 60c …(5)
235 = 60b …(6)
4415 = 60a + 708c …(7)
Solving, we get
a = 77.3506, b = 3.9167, c = -0.3193
Hence, the second degree polynomial trend equation is
y = 77.3506 + 3.9167x - 0.3193x², where x = year - 1980.
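
With the origin at the middle year, ∑x = ∑x³ = 0, so b is found on its own and only a 2×2 system remains for a and c. The Python sketch below illustrates this for the sales data; the variable names are assumptions, and the results agree with the worked solution to about three decimal places.

```python
years = list(range(1976, 1985))
sales = [50, 65, 70, 85, 82, 75, 65, 90, 95]

x = [yr - 1980 for yr in years]    # origin at the middle year
n = len(x)

s2 = sum(v ** 2 for v in x)        # sum of x^2
s4 = sum(v ** 4 for v in x)        # sum of x^4
sy = sum(sales)
sxy = sum(v * y for v, y in zip(x, sales))
sx2y = sum(v * v * y for v, y in zip(x, sales))

# Middle normal equation:  sxy = b * s2   (since sum(x) = sum(x^3) = 0)
b = sxy / s2

# Remaining 2x2 system:  sy = n*a + s2*c   and   sx2y = s2*a + s4*c
det = n * s4 - s2 * s2
a = (sy * s4 - s2 * sx2y) / det
c = (n * sx2y - s2 * sy) / det

print(round(a, 4), round(b, 4), round(c, 4))  # 77.3506 3.9167 -0.3193
```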

Q3. The prices of a commodity during 1993-98 are given below. Fit a parabola
y = a + bx + cx² to these data. Calculate the trend values. Estimate the price of
the commodity for the year 1999.
Year: 1993 1994 1995 1996 1997 1998
Price: 100 107 128 140 181 192
Solution:
The required parabola is y = a + bx + cx² …(1)
The normal equations are
∑y = na + b∑x + c∑x² …(2)
∑xy = a∑x + b∑x² + c∑x³ …(3)
∑x²y = a∑x² + b∑x³ + c∑x⁴ …(4)
Since n = 6 (even), we take the origin to be 1995.5.

Year     Price y   x = year - 1995.5   x²      x³         x⁴         xy       x²y
1993     100       -2.5                6.25    -15.625    39.0625    -250.0   625.00
1994     107       -1.5                2.25    -3.375     5.0625     -160.5   240.75
1995     128       -0.5                0.25    -0.125     0.0625     -64.0    32.00
1996     140        0.5                0.25     0.125     0.0625      70.0    35.00
1997     181        1.5                2.25     3.375     5.0625      271.5   407.25
1998     192        2.5                6.25     15.625    39.0625     480.0   1200.00
Total    ∑y = 848   ∑x = 0   ∑x² = 17.5   ∑x³ = 0   ∑x⁴ = 88.375   ∑xy = 347   ∑x²y = 2540

The normal equations are
848 = 6a + 17.5c …(5)
347 = 17.5b …(6)
2540 = 17.5a + 88.375c …(7)
Solving,
a = 136.12, b = 19.83, c = 1.786
Hence the required parabola is y = 136.12 + 19.83x + 1.786x².
The trend values are obtained by substituting the x value of each year into this equation.
For the year 1999, x = 1999 - 1995.5 = 3.5
Therefore, price in 1999 = 136.12 + (19.83 × 3.5) + (1.786 × 3.5 × 3.5)
= Rs. 227.40
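
The fit, the trend values and the 1999 estimate can be reproduced with a short Python sketch. The names below are illustrative assumptions. The trend values printed by the sketch are simply the fitted parabola evaluated at each year's x; they are computed here, not copied from the notes.

```python
years = [1993, 1994, 1995, 1996, 1997, 1998]
price = [100, 107, 128, 140, 181, 192]

origin = sum(years) / len(years)     # 1995.5, so that sum(x) = sum(x^3) = 0
x = [yr - origin for yr in years]
n = len(x)

s2 = sum(v ** 2 for v in x)
s4 = sum(v ** 4 for v in x)
sy = sum(price)
sxy = sum(v * y for v, y in zip(x, price))
sx2y = sum(v * v * y for v, y in zip(x, price))

b = sxy / s2                         # from  sxy = b * s2
det = n * s4 - s2 * s2
a = (sy * s4 - s2 * sx2y) / det      # from  sy = n*a + s2*c  and  sx2y = s2*a + s4*c
c = (n * sx2y - s2 * sy) / det

def trend(year):
    """Fitted price for a given calendar year."""
    t = year - origin
    return a + b * t + c * t * t

trend_values = [trend(yr) for yr in years]
forecast_1999 = trend(1999)

print(round(a, 3), round(b, 2), round(c, 3))   # 136.125 19.83 1.786
print([round(v, 2) for v in trend_values])
print(round(forecast_1999, 1))                 # 227.4
```

Because the fit includes the constant a, the trend values add up to ∑y = 848, which is a quick check on the arithmetic.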

(2)→4a + 32b + 16c = 56


(3)→32a + 290b + 159c = 505
(4)→16a + 159b + 94c = 276
Solving we get a = 0.6444, b = 1.661, c = 0.0169.
Therefore the required equation is
y = a + bx1 + cx2
y = 0.6444 + 1.661x1 + 0.0169x2.
