Correlation and Linear Regression
Table of Contents
1. Learning outcomes:
Correlation
Types of Correlation
Measurement of Correlation
Line of Best Fit
Correlation Coefficient
Assumptions for Correlation Coefficient
Properties of Correlation Coefficient
Regression
Line of Regression
Linear Regression
Moment Generating Function
Covariance
2. Introduction:
3. Correlation:
In a bivariate distribution, if a change in one variable appears to be
accompanied by a change in the other variable and vice versa, the two
variables are said to be correlated, and this relationship is called
correlation.
In other words, the tendency of the two variables to vary simultaneously
is called covariation.
When an increase (or decrease) in one variable produces no effect on the
other variable, the correlation is said to be zero or no correlation. For
example, the heights of students and the marks they obtain in a particular
subject have "zero correlation" or "no correlation".
Correlation is generally denoted by ρ (rho).
Merits:
The following are the merits of the scatter diagram:
1) The scatter diagram gives an idea at a glance about the existence or
absence of a relationship between two variables.
2) It also exhibits the type of correlation.
3) It clearly indicates the presence of perfect positive or perfect
negative correlation.
Demerits:
1) It does not give a definite measure of the degree of correlation.
2) When there are only a few observations, the use of the scatter diagram
is limited.
3) When the variation is only nominal, it fails to ascertain whether the
correlation is perfectly positive or negative.
Example 1: The local ice cream shop keeps track of how much ice cream
they sell versus the noon temperature on that day. Here are their figures
for the last 11 days:

Temperature (°C)   Ice Cream Sales
16.4°              $325
11.9°              $185
15.2°              $332
18.5°              $406
22.1°              $522
19.4°              $412
25.1°              $614
23.4°              $544
18.1°              $421
22.6°              $445
17.2°              $408
It is now easy to see that warmer weather leads to more sales, but
the relationship is not perfect.
We can also draw a "Line of Best Fit" (also called a "Trend Line") on our
scatter plot:
Try to have the line as close as possible to all points, and as many points
above the line as below.
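Drawing the trend line by eye, as above, is informal; the usual formal criterion is least squares. A minimal stdlib-only sketch, fitting a least-squares line to the ice-cream data of Example 1:

```python
# Least-squares "line of best fit" for the ice-cream data of Example 1
# (noon temperature in deg C vs. sales in $).

temps = [16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]

n = len(temps)
mean_x = sum(temps) / n
mean_y = sum(sales) / n

# slope b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2); intercept a = y_bar - b*x_bar
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temps, sales))
sxx = sum((x - mean_x) ** 2 for x in temps)
b = sxy / sxx
a = mean_y - b * mean_x

print(f"Trend line: sales ~ {a:.1f} + {b:.1f} * temperature")
```

The positive slope confirms what the scatter plot shows: warmer weather goes with higher sales.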
4. Correlation Coefficient:
To determine the intensity or degree of linear correlation between two
variables, Karl Pearson, the great British statistician, defined a
numerical measure called the Karl Pearson correlation coefficient, or
simply the correlation coefficient. It is generally denoted by the
symbol "r" and given by
Var(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad Var(Y) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2, \quad \text{and}

COV(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})

r = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}

After simplification, we may write

r = \frac{n\sum_{i=1}^{n}x_i y_i - \sum_{i=1}^{n}x_i \sum_{i=1}^{n}y_i}{\sqrt{n\sum_{i=1}^{n}x_i^2 - \left(\sum_{i=1}^{n}x_i\right)^2}\;\sqrt{n\sum_{i=1}^{n}y_i^2 - \left(\sum_{i=1}^{n}y_i\right)^2}}
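The definition of r and the simplified computational formula are algebraically identical, which can be verified numerically (a stdlib-only sketch; the data set is arbitrary, chosen only for illustration):

```python
import math

# Compare the definitional and computational formulas for Pearson's r
# on a small illustrative data set.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
n = len(x)

mx, my = sum(x) / n, sum(y) / n

# Definition: r = Cov(X, Y) / sqrt(Var(X) * Var(Y))
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
var_x = sum((a - mx) ** 2 for a in x) / n
var_y = sum((b - my) ** 2 for b in y) / n
r_def = cov / math.sqrt(var_x * var_y)

# Computational formula built from raw sums
sxy, sx, sy = sum(a * b for a, b in zip(x, y)), sum(x), sum(y)
sx2, sy2 = sum(a * a for a in x), sum(b * b for b in y)
r_comp = (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

print(r_def, r_comp)  # the two values agree
```

The computational form avoids computing the deviations from the means explicitly, which is convenient for hand calculation.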
4.1. Assumptions:
The correlation coefficient is based on the following assumptions:
1) In each of the correlated bivariate populations, a large number of
independent causes are operating so as to produce a normal
distribution.
2) The forces so operating are related in a causal way.
3) The relationship between the two variables is linear.
4.2. Properties:
1) It is rigidly defined.
2) The linear correlation coefficient r_{xy} is a pure number (or ratio)
and thus has no unit of measurement. By symmetry it is clear that
r_{xy} = r_{yx}.
3) It is based on all the observations.
4) The correlation coefficient is independent of change of origin and
change of scale.
5) It ranges from -1 to +1:
r = -1 when there is a perfect negative correlation.
r = 0 when there is no linear correlation.
r = +1 when there is a perfect positive correlation.
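Property 4 (invariance under change of origin and scale) is easy to check numerically. A small sketch, using arbitrary illustrative data and positive scale factors (a negative scale factor would flip the sign of r):

```python
import math

def pearson_r(x, y):
    # sample correlation: r = sum((x - x_bar)(y - y_bar)) / sqrt(sum sq * sum sq)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

x = [2, 4, 6, 8, 10]
y = [3, 7, 4, 9, 12]   # arbitrary illustrative values

r = pearson_r(x, y)
# change of origin and scale: u = (x - 5)/2, v = (y - 1)/3
u = [(a - 5) / 2 for a in x]
v = [(b - 1) / 3 for b in y]
print(r, pearson_r(u, v))  # identical
```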
Example 2: The numbers of study hours (X) and sleeping hours (Y) of five
students are given below. Find the correlation coefficient between X and Y.

X    Y    x_i - x̄   y_i - ȳ   (x_i - x̄)(y_i - ȳ)   (x_i - x̄)²   (y_i - ȳ)²
2    10   -4         2          -8                     16            4
4    9    -2         1          -2                     4             1
6    8     0         0           0                     0             0
8    7     2        -1          -2                     4             1
10   6     4        -2          -8                     16            4
Σ=30 Σ=40  0         0          Σ=-20                  Σ=40          Σ=10

\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{30}{5} = 6, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{40}{5} = 8

and

COV(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \frac{-20}{5} = -4

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{-20}{\sqrt{40 \times 10}} = -1
There is perfect negative correlation between the number of study hours
and the number of sleeping hours.
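The hand computation above can be replayed in a few lines (a stdlib-only sketch using the data of Example 2):

```python
import math

# Data from Example 2: X = study hours, Y = sleeping hours
x = [2, 4, 6, 8, 10]
y = [10, 9, 8, 7, 6]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n          # = -4
var_x = sum((a - mx) ** 2 for a in x) / n                        # = 8
var_y = sum((b - my) ** 2 for b in y) / n                        # = 2
r = cov / math.sqrt(var_x * var_y)
print(r)  # -1.0, a perfect negative correlation
```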
Example 3: Let the random variables X and Y have the joint p.d.f.

f(x, y) = x + y, \quad \text{for } 0 < x < 1 \text{ and } 0 < y < 1; \qquad 0, \text{ otherwise.}

From the marginal densities, E(X) = E(Y) = \frac{7}{12} and Var(X) = Var(Y) = \frac{11}{144}. Also,

E(XY) = \int_0^1\!\!\int_0^1 xy(x + y)\,dx\,dy = \frac{1}{3}, \qquad COV(X, Y) = \frac{1}{3} - \left(\frac{7}{12}\right)^2 = -\frac{1}{144}

Using all the above calculated values, we can find the correlation between
X and Y as:

r = \frac{-\frac{1}{144}}{\sqrt{\frac{11}{144} \cdot \frac{11}{144}}} = -\frac{1}{11}
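The value r = -1/11 can be spot-checked by computing the moments of f(x, y) = x + y with a simple midpoint-rule double integral (a numerical sketch, stdlib only; by the symmetry of f, E(Y) = E(X) and Var(Y) = Var(X)):

```python
# Midpoint-rule check of Example 3: f(x, y) = x + y on the unit square.
n = 400
h = 1.0 / n
pts = [(i + 0.5) * h for i in range(n)]

e_x = e_x2 = e_xy = 0.0
for x in pts:
    for y in pts:
        f = (x + y) * h * h          # density times cell area
        e_x += x * f                 # accumulates E(X)   -> 7/12
        e_x2 += x * x * f            # accumulates E(X^2) -> 5/12
        e_xy += x * y * f            # accumulates E(XY)  -> 1/3

var_x = e_x2 - e_x ** 2              # -> 11/144; Var(Y) is the same by symmetry
cov = e_xy - e_x * e_x               # E(Y) = E(X) by symmetry
r = cov / var_x
print(r)  # close to -1/11
```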
Exercises 1:
1. What is the relationship between hours studying (X) and scores on a
quiz (Y)? Plot the scatter diagram.

Student   X (hours studying)   Y (quiz score)
A         1                    1
B         1                    3
C         3                    2
D         4                    5
E         6                    4
F         7                    5
2. Plot the scatter diagram for the hours of sleep and the
cognitive-function scores of 12 subjects:

Hours of Sleep:       5.4  6.1  7.4  8.5  6.4  7.6  8.1  9.0  9.4  8.7  9.3  9.2
Cognitive Function:   100  79   72   62   122  89   76   131  110  92   101  115
5. Regression:
The regression curve of Y on X is given by the conditional expectation of
Y given X = x, and similarly for X on Y:

E(Y/x) = \sum_{y} y \cdot w(y/x), \qquad E(X/y) = \sum_{x} x \cdot f(x/y)
Example 4: For the two random variables X and Y, the joint density
function is given by eq. (1).
Solution: The joint density of (X, Y) is given by (1). We can obtain the
marginal density of X by integrating eq. (1) w.r.t. y as follows:

g(x) = e^{-x}, \quad \text{for } x > 0; \qquad 0, \text{ otherwise.}

[Figure: regression curve E(Y/x) = 1/x, plotted for 0 < x < 6]
Example 5: A balanced die is rolled n = 30 times. Let X denote the number
of even outcomes and Y the number of 1s. Find the regression equation of
Y on X.
Solution: The joint distribution of X and Y is trinomial with
\theta_1 = \frac{1}{2} and \theta_2 = \frac{1}{6}:

f(x, y) = \binom{n}{x,\, y,\, n-x-y}\, \theta_1^{x}\, \theta_2^{y}\, (1 - \theta_1 - \theta_2)^{n-x-y}

Summing over y gives the marginal distribution of X:

g(x) = \sum_{y=0}^{n-x} \binom{n}{x,\, y,\, n-x-y}\, \theta_1^{x}\, \theta_2^{y}\, (1 - \theta_1 - \theta_2)^{n-x-y} = \binom{n}{x}\, \theta_1^{x}\, (1 - \theta_1)^{n-x}

Since there are 3 equally likely possibilities (1, 3 or 5) for each of the
30 - x outcomes that are not even, the regression equation can be obtained
as

E(Y/x) = (30 - x) \cdot \frac{1/6}{1 - \frac{1}{2}} = \frac{1}{3}(30 - x)
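The regression equation can be checked directly against the trinomial joint distribution, without using the shortcut argument (a sketch; the parameters n = 30, θ1 = 1/2, θ2 = 1/6 are those of the example above):

```python
from math import comb

# Check E(Y | x) = (30 - x)/3 for the trinomial of Example 5.
n, t1, t2 = 30, 1 / 2, 1 / 6

def joint(x, y):
    # trinomial pmf: n!/(x! y! z!) * t1^x * t2^y * t3^z with z = n - x - y
    z = n - x - y
    coef = comb(n, x) * comb(n - x, y)       # equals the multinomial coefficient
    return coef * t1 ** x * t2 ** y * (1 - t1 - t2) ** z

for x in (0, 10, 21):
    g = sum(joint(x, y) for y in range(n - x + 1))                   # marginal of X
    e_y = sum(y * joint(x, y) for y in range(n - x + 1)) / g         # E(Y | x)
    print(x, e_y, (n - x) / 3)  # the last two columns agree
```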
3
Example 6: If the joint density of X1, X2, X3 is given by

f(x_1, x_2, x_3) = (x_1 + x_2)\, e^{-x_3}, \quad \text{for } 0 < x_1 < 1,\ 0 < x_2 < 1,\ x_3 > 0; \qquad 0, \text{ otherwise,}

find the regression equation of X2 on X1 and X3.
Solution: The joint marginal density of X1 and X3 for the given
distribution is given by

m(x_1, x_3) = \int_0^1 (x_1 + x_2)\, e^{-x_3}\, dx_2 = \left(x_1 + \frac{1}{2}\right) e^{-x_3}, \quad \text{for } 0 < x_1 < 1,\ x_3 > 0; \qquad 0, \text{ otherwise.}

Therefore,

E(X_2 / x_1, x_3) = \int_0^1 x_2\, \frac{f(x_1, x_2, x_3)}{m(x_1, x_3)}\, dx_2 = \int_0^1 \frac{x_2 (x_1 + x_2)}{x_1 + \frac{1}{2}}\, dx_2 = \frac{3x_1 + 2}{3(2x_1 + 1)}

We can see that the obtained conditional expectation depends on x_1 only,
not on x_3. This happens because the joint density factors into
(x_1 + x_2) and e^{-x_3}, so X3 is independent of (X1, X2).
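A numerical spot-check of Example 6 (a midpoint-rule sketch, stdlib only): since e^{-x_3} cancels in the ratio, only the integrals over x_2 matter.

```python
# Check E(X2 | x1, x3) = (3*x1 + 2) / (3*(2*x1 + 1)) from Example 6.

def cond_mean(x1, n=2000):
    # midpoint rule for ∫ x2 (x1 + x2) dx2 / ∫ (x1 + x2) dx2 over [0, 1]
    h = 1.0 / n
    num = den = 0.0
    for i in range(n):
        x2 = (i + 0.5) * h
        w = (x1 + x2) * h          # unnormalised conditional weight of x2
        num += x2 * w
        den += w
    return num / den

for x1 in (0.0, 0.5, 1.0):
    print(x1, cond_mean(x1), (3 * x1 + 2) / (3 * (2 * x1 + 1)))
```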
6. Linear Regression:
Let us assume that both E(Y | x) and E(X | y) are linear, i.e.

E(Y | x) = \int y\, f(y|x)\, dy = \alpha + \beta x, \quad (1)

and

E(X | y) = \int x\, f(x|y)\, dx = \alpha' + \beta' y. \quad (2)

Multiplying (1) by f_1(x), the marginal density of X, and integrating
over x gives

\int f_1(x)\, E(Y|x)\, dx = \int f_1(x)(\alpha + \beta x)\, dx = \alpha + \beta E(X)

or, since f_1(x)\, f(y|x) = f(x, y),

\iint y\, f(x, y)\, dx\, dy = \alpha + \beta E(X), \quad \text{i.e.} \quad \mu_2 = \alpha + \beta\mu_1, \quad (3)

which shows that the regression line of Y on x passes through the means
of X and Y, i.e. E(X) and E(Y).
Similarly, multiplying (1) by x\, f_1(x) and integrating gives

\int x\, f_1(x)\, E(Y|x)\, dx = \int x\, f_1(x)(\alpha + \beta x)\, dx = \alpha \int x\, f_1(x)\, dx + \beta \int x^2 f_1(x)\, dx

or

\iint xy\, f(x, y)\, dx\, dy = \alpha E(X) + \beta E(X^2), \quad \text{i.e.} \quad E(XY) = \alpha\mu_1 + \beta E(X^2). \quad (6)

Solving \mu_2 = \alpha + \beta\mu_1 and E(XY) = \alpha\mu_1 + \beta E(X^2)
for \alpha and \beta, and making use of the facts that
E(XY) = \sigma_{12} + \mu_1\mu_2 and E(X^2) = \sigma_1^2 + \mu_1^2, we
find that

\beta = \frac{\sigma_{12}}{\sigma_1^2} = \rho\,\frac{\sigma_2}{\sigma_1} \qquad \text{and} \qquad \alpha = \mu_2 - \rho\,\frac{\sigma_2}{\sigma_1}\,\mu_1

Finally, we can write the linear regression equation of Y on X as

E(Y/x) = \mu_2 + \rho\,\frac{\sigma_2}{\sigma_1}\,(x - \mu_1)

and the linear regression equation of X on Y, using similar steps, as

E(X/y) = \mu_1 + \rho\,\frac{\sigma_1}{\sigma_2}\,(y - \mu_2)
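Estimating the moments from a sample and plugging them into the regression equation of Y on X gives exactly the least-squares line, since the slope ρσ2/σ1 equals Cov(X, Y)/Var(X). A small sketch with arbitrary illustrative data:

```python
import math

# Regression line E(Y|x) = mu2 + rho*(s2/s1)*(x - mu1), estimated from sample moments.
x = [1, 2, 4, 5, 7]
y = [3, 4, 8, 9, 12]   # arbitrary illustrative values

n = len(x)
m1, m2 = sum(x) / n, sum(y) / n
s1 = math.sqrt(sum((xi - m1) ** 2 for xi in x) / n)
s2 = math.sqrt(sum((yi - m2) ** 2 for yi in y) / n)
rho = sum((xi - m1) * (yi - m2) for xi, yi in zip(x, y)) / (n * s1 * s2)

b = rho * s2 / s1          # slope of the Y-on-X regression line
a = m2 - b * m1            # intercept: the line passes through (m1, m2)
print(f"E(Y|x) ~ {a:.3f} + {b:.3f} x")
```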
Exercises 2:
1. Five children aged 2, 3, 5, 7 and 8 years old weigh 14, 20, 32, 42
and 44 kilograms respectively.
Calculate:
(i) The regression line of y on x.
4. For the two random variables X and Y the joint density function is
given by

f(x, y) = \frac{2x}{(1 + x + xy)^3}, \quad \text{for } x > 0 \text{ and } y > 0; \qquad 0, \text{ otherwise.}

Show that E(Y/x) = 1 + \frac{1}{x} and that Var(Y/x) does not exist.
Given the joint probability distribution of X and Y:

         x = 0    x = 1    x = 2
y = 0    1/12     1/6      1/24
y = 1    1/4      1/4      1/40
y = 2    1/8      1/20     0
y = 3    1/120    0        0

(a) Find the conditional distribution of X given Y = 1 and then the
regression line E(X/1).
(b) Find the conditional distribution of Y given X = 0 and then the
regression line E(Y/0).
Show that E(Y/x) = \frac{x}{2} and E(X/y) = \frac{1 + y}{2}.
10. Given the joint density

f(x, y) = 24xy, \quad \text{for } x > 0,\ y > 0 \text{ and } x + y < 1; \qquad 0, \text{ otherwise,}

show that E(Y/x) = \frac{2}{3}(1 - x) and E(X/y) = \frac{2}{3}(1 - y).
7. Moment Generating Function:
For a single random variable X, the moment generating function is

M_X(t) = E(e^{tX}) = E\left[1 + tX + \frac{(tX)^2}{2!} + \frac{(tX)^3}{3!} + \cdots\right]
Practically, in many situations we find variables occurring in pairs rather
than singly. In such a scenario, we are looking at joint probability
distributions which further give way to marginal and conditional
distributions. For such distributions, we define the moment generating
functions as
M_{X,Y}(t, s) = E(e^{tX + sY}) = E(e^{tX} e^{sY})

= E\left[\left(1 + tX + \frac{(tX)^2}{2!} + \frac{(tX)^3}{3!} + \cdots\right)\left(1 + sY + \frac{(sY)^2}{2!} + \frac{(sY)^3}{3!} + \cdots\right)\right]

= E\left[1 + tX + sY + tsXY + \frac{t^2X^2}{2!} + \frac{s^2Y^2}{2!} + \frac{ts^2XY^2}{2!} + \frac{t^2sX^2Y}{2!} + \frac{t^2s^2X^2Y^2}{2!\,2!} + \cdots\right]
The following results hold true for the multivariate moment generating
function:

1. M_{X,Y}(0, s) = E(e^{sY}) = M_Y(s)

2. M_{X,Y}(t, 0) = E(e^{tX}) = M_X(t)

3. X and Y are independent if and only if M_{X,Y}(t, s) = M_X(t)\, M_Y(s)

4. E(XY) = \left.\dfrac{\partial^2 M_{X,Y}(t, s)}{\partial t\, \partial s}\right|_{t=0,\, s=0}

Hint: differentiating the series expansion term by term,

\frac{\partial M_{X,Y}(t, s)}{\partial t} = E\left[X + sXY + \frac{s^2XY^2}{2!} + \frac{2tX^2}{2!} + \frac{2tsX^2Y}{2!} + \frac{2ts^2X^2Y^2}{2!\,2!} + \cdots\right]

\frac{\partial^2 M_{X,Y}(t, s)}{\partial t\, \partial s} = E\left[XY + \frac{2sXY^2}{2!} + \frac{2tX^2Y}{2!} + \frac{4tsX^2Y^2}{2!\,2!} + \cdots\right]

which reduces to E(XY) at t = 0, s = 0.

5. E(Y) = \left.\dfrac{\partial M_{X,Y}(0, s)}{\partial s}\right|_{s=0}

6. E(X) = \left.\dfrac{\partial M_{X,Y}(t, 0)}{\partial t}\right|_{t=0}

7. E(Y^2) = \left.\dfrac{\partial^2 M_{X,Y}(0, s)}{\partial s^2}\right|_{s=0}

8. E(X^2) = \left.\dfrac{\partial^2 M_{X,Y}(t, 0)}{\partial t^2}\right|_{t=0}

9. In general, \left.\dfrac{\partial^{r+s} M_{X,Y}(t, s)}{\partial t^r\, \partial s^s}\right|_{t=0,\, s=0} = E(X^r Y^s)
8. Covariance:
Results 4, 5 and 6 can be used to calculate the covariance between a
pair of variables of a bivariate distribution as:

COV(X, Y) = E(XY) - E(X)\, E(Y) = \left.\frac{\partial^2 M_{X,Y}(t, s)}{\partial t\, \partial s}\right|_{t=0,\, s=0} - \left.\frac{\partial M_{X,Y}(t, 0)}{\partial t}\right|_{t=0} \cdot \left.\frac{\partial M_{X,Y}(0, s)}{\partial s}\right|_{s=0}
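Results 4-6 and the covariance formula can be sanity-checked numerically for a small discrete pair (X, Y) whose probabilities below are arbitrary illustrative values, approximating the partial derivatives of the MGF by central finite differences:

```python
import math

# Toy discrete distribution: ((x, y), probability)
pts = [((0, 0), 0.1), ((1, 0), 0.2), ((0, 1), 0.3), ((1, 2), 0.4)]

def M(t, s):
    # joint MGF: E[exp(tX + sY)]
    return sum(p * math.exp(t * x + s * y) for (x, y), p in pts)

h = 1e-4
# mixed second partial at (0, 0) ~ E(XY); first partials ~ E(X), E(Y)
e_xy = (M(h, h) - M(h, -h) - M(-h, h) + M(-h, -h)) / (4 * h * h)
e_x = (M(h, 0) - M(-h, 0)) / (2 * h)
e_y = (M(0, h) - M(0, -h)) / (2 * h)
cov = e_xy - e_x * e_y

# direct computation for comparison
E_xy = sum(p * x * y for (x, y), p in pts)
E_x = sum(p * x for (x, y), p in pts)
E_y = sum(p * y for (x, y), p in pts)
print(cov, E_xy - E_x * E_y)  # the two values agree closely
```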
Exercise 3:

Hint: M_{x,y}(s, t) = \frac{2(e^{s+t} - 1)}{s(s + t)} - \frac{2(e^t - 1)}{st}

Hint: M_x(t) = \frac{2(e^t - t - 1)}{t^2}

Hint: M_y(s) = \frac{2(se^s - e^s + 1)}{s^2}
Partial Solution:
f(x, y) = 2, for 0 ≤ x ≤ y ≤ 1

M_{x,y}(s, t) = \int_0^1\!\!\int_0^y 2e^{sx} e^{ty}\, dx\, dy = 2\int_0^1 e^{ty}\left[\int_0^y e^{sx}\, dx\right] dy = \frac{2}{s}\int_0^1 e^{ty}\left(e^{sy} - 1\right) dy

= \frac{2}{s}\left[\int_0^1 e^{(s+t)y}\, dy - \int_0^1 e^{ty}\, dy\right] = \frac{2}{s}\left[\frac{e^{s+t} - 1}{s + t} - \frac{e^t - 1}{t}\right]
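The closed form derived in the partial solution can be spot-checked by integrating 2e^{sx+ty} numerically over the triangle 0 ≤ x ≤ y ≤ 1 (a sketch at one arbitrary point (s, t) = (0.5, 0.3); the midpoint grid only approximates the triangle's boundary, hence the loose tolerance):

```python
import math

s, t = 0.5, 0.3   # arbitrary evaluation point

closed = (2 / s) * ((math.exp(s + t) - 1) / (s + t) - (math.exp(t) - 1) / t)

# midpoint-rule double integral of 2*exp(s*x + t*y) over {0 <= x <= y <= 1}
n = 800
h = 1.0 / n
num = 0.0
for i in range(n):
    y = (i + 0.5) * h
    for j in range(n):
        x = (j + 0.5) * h
        if x <= y:
            num += 2 * math.exp(s * x + t * y) * h * h

print(closed, num)  # agree to about 3 decimal places
```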
Hint: M_{x,y}(s, t) = \frac{e^{s+t}(2st - s - t) - e^s(st - s - t) - e^t(st - s - t) - (s + t)}{s^2 t^2}

Hint: M_x(t) = \frac{3te^t - 2e^t - t + 2}{2t^2}

Hint: M_y(s) = \frac{3se^s - 2e^s - s + 2}{2s^2}
3: (a) If (U, V) ~ BVN(0, 0, 1, 1, ρ), then using the MGF, show that the
correlation between U and V is ρ.
(b) If (X, Y) ~ BVN(μx, μy, σx², σy², ρ), then using the MGF, show that
the correlation between X and Y is ρ.
(c) If (U, V) ~ BVN(0, 0, 1, 1, ρ), find the distribution of the linear
combination lU + mV using the MGF.
Given the joint moment generating function

M_{x,y}(t, s) = \frac{(2e^t + 3)(e^s + 1)}{10}