Regression Analysis Material
If two variables X and Y are known to be correlated, we can find a mathematical equation describing the relationship between them; this equation is called the regression equation of the two correlated variables. One of the two variables is called the dependent variable and the other the independent variable, and we predict the value of the dependent variable from a given value of the independent variable with the help of the regression equation.
There are two cases, as below.
Case I) Y is the dependent variable and X is the independent variable. The regression equation is written as Y = a + bX.
Using this equation, we can predict or estimate the value of the dependent variable Y from a value of the independent variable X.
It can be proved that for a given bivariate data set (X, Y) this equation reduces to
y - ȳ = b_yx (x - x̄), where b_yx is called the regression coefficient of Y on X.
Using this equation we cannot predict X given Y, because it contains b_yx, the regression coefficient of Y on X; to predict X given Y we require the equation based on b_xy. It can be proved that b_yx = Cov(x, y) / V(X).
Case II) X is the dependent variable and Y is the independent variable. The regression equation is written as X = a' + b'Y, using which we predict or estimate the value of X given a value of Y.
It can be proved that for a given data set (X, Y) this equation reduces to
x - x̄ = b_xy (y - ȳ), where b_xy is called the regression coefficient of X on Y.
But we cannot estimate Y using this equation, as it does not contain b_yx. It can be proved that b_xy = Cov(x, y) / V(Y).
Hence, we require two regression equations: one for estimating Y and another for estimating X.
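A minimal numerical sketch of the two regression equations above (the data and function names here are illustrative assumptions, not from the notes):

```python
import numpy as np

# Small bivariate sample (illustrative values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Cov(X, Y) and the two regression coefficients (population convention, /n)
cov_xy = np.mean(x * y) - x.mean() * y.mean()
b_yx = cov_xy / np.var(x)   # regression coefficient of Y on X
b_xy = cov_xy / np.var(y)   # regression coefficient of X on Y

def predict_y(x0):
    # line of Y on X: y - ybar = b_yx (x - xbar)
    return y.mean() + b_yx * (x0 - x.mean())

def predict_x(y0):
    # line of X on Y: x - xbar = b_xy (y - ybar)
    return x.mean() + b_xy * (y0 - y.mean())
```

Note that predicting X requires the second line; inverting the first line would give slope 1/b_yx, which is in general a different line.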
Slope of the line:- For a line of the form Y = mX + c, the slope is 'm', the coefficient of X.
If the equation of the line is y - ȳ = b_yx (x - x̄), then the slope of the line is b_yx. Let m1 denote the slope of this line; hence m1 = b_yx.
But if the equation of the line is x - x̄ = b_xy (y - ȳ),
then first put the equation in the form y = mx + c, as below:
b_xy (y - ȳ) = x - x̄
y - ȳ = (1/b_xy)(x - x̄). Hence the slope of this line = coefficient of x = 1/b_xy = m2.
Then the acute angle θ between the two regression lines is found as θ = tan⁻¹ |(m1 - m2) / (1 + m1·m2)|.
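The acute-angle formula above can be sketched numerically. The slope values m1 = b_yx and m2 = 1/b_xy used here are illustrative assumptions:

```python
import math

m1 = 0.6   # slope of the line of Y on X (= b_yx), assumed value
m2 = 1.0   # slope of the line of X on Y (= 1 / b_xy), assumed value

# acute angle between the two regression lines, in radians
theta = math.atan(abs((m1 - m2) / (1 + m1 * m2)))
```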
Proof:-
Consider a sample of n pairs of observations (Xi, Yi) on the variables X and Y, i = 1, 2, …, n.
We take Y as the dependent variable; hence the equation of the line of regression of Y on X is Y' = a + bX, where Y is the dependent variable and X is the independent variable, Y is the observed value of the variable, and Y' is the expected (fitted) value of the variable Y.
Hence the error component e_i = Y - Y' = observed value of Y - expected value of Y.
The equation can be written as Yi = a + bXi + e_i, i = 1, 2, …, n, where the e_i are the errors, taking Y as the dependent variable.
Let D = ∑e_i² = ∑(y_i - a - b·x_i)². To obtain the values of a and b such that D is minimum is called fitting the line Y = a + bX to the given bivariate data (Xi, Yi), i = 1, 2, …, n.
Using the method of least squares we fit a straight line Y = a + bX to the given bivariate data: we apply the principle of minima to find values of a and b such that D is minimum.
Conditions are:-
1) The value of a must satisfy the two conditions: 1st, ∂D/∂a = 0, and 2nd, ∂²D/∂a² > 0.
2) The value of b must satisfy the two conditions: 1st, ∂D/∂b = 0, and 2nd, ∂²D/∂b² > 0.
Differentiating D = ∑(y_i - a - b·x_i)²:
∂D/∂a = -2 ∑(y_i - a - b·x_i) = 0, which gives ∑y_i = n·a + b ∑x_i ……..(1)
∂D/∂b = -2 ∑x_i (y_i - a - b·x_i) = 0, which gives ∑x_i·y_i = a ∑x_i + b ∑x_i² ……..(2)
Also ∂²D/∂a² = 2n > 0, so the value of a obtained from equation (1) minimizes D, and
∂²D/∂b² = (-2) ∑(0 - 0 - x_i²) = 2 ∑x_i² > 0, as it is a positive term. The value of b obtained from equation (2) will minimize D.
Dividing equation (1) by n gives ȳ = a + b·x̄, hence a = ȳ - b·x̄ ……..(a)
So substituting the value of a in equation (2) we get
∑xy = (ȳ - b·x̄) ∑x + b ∑x²
Dividing both sides by n:
∑xy/n = x̄·ȳ - b·x̄² + b (∑x²/n)
∑xy/n - x̄·ȳ = b [(1/n)∑x² - x̄²]
b = (∑xy/n - x̄·ȳ) / ((1/n)∑x² - x̄²) = Cov(x, y) / V(X), hence the proof.
Since it is the regression of Y on X, we call this b the regression coefficient of Y on X; hence b_yx = Cov(x, y) / V(X) ……..(b)
To get the equation of the regression line of Y on X, substitute the values of a and b into y = a + bx, where a = ȳ - b·x̄ and b = Cov(x, y)/V(X) = b_yx:
y = (ȳ - b_yx·x̄) + b_yx·x, i.e. y - ȳ = b_yx (x - x̄).
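The closed-form result just derived (b = Cov(x, y)/V(X) and a = ȳ - b·x̄) can be checked against a generic least-squares fit; the data below are illustrative assumptions:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.5, 3.0, 4.5, 4.0, 6.0])

# closed-form least-squares solution from the derivation above
b = (np.mean(x * y) - x.mean() * y.mean()) / np.var(x)  # slope b_yx
a = y.mean() - b * x.mean()                             # intercept

# independent check: numpy's polynomial least-squares fit of degree 1
b_fit, a_fit = np.polyfit(x, y, 1)   # returns (slope, intercept)
```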
We have
1 = (unexplained variation / total variation) + r², where r² is the coefficient of determination.
There are two cases:
i) r² = 1. This happens if the unexplained variation is zero, for then we get 1 = 0 + r².
For frequency data:
Variance(u) = V(u) = ∑f(u - ū)² / n, Variance(v) = V(v) = ∑f(v - v̄)² / n, where n = ∑f.
Regression coefficient of Y on X: b_yx = Cov(X, Y) / V(X)
Regression coefficient of X on Y: b_xy = Cov(X, Y) / V(Y)
Covariance between X and Y: Cov(X, Y) = ∑(x - x̄)(y - ȳ) / n; also Cov(u, v) = ∑(u - ū)(v - v̄) / n.
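A short sketch of the frequency-data formulas above (the values of u, v and the frequencies f are illustrative assumptions):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 3.0, 5.0])
f = np.array([2.0, 3.0, 1.0])   # frequencies
n = f.sum()                     # n = sum of the frequencies

u_bar = (f * u).sum() / n
v_bar = (f * v).sum() / n
var_u = (f * (u - u_bar) ** 2).sum() / n            # V(u)
cov_uv = (f * (u - u_bar) * (v - v_bar)).sum() / n  # Cov(u, v)
```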
If x and y are obtained from u and v by a change of origin and scale, say x = a + c·u and y = b + d·v, then ȳ = b + d·v̄ ……(2')
Since (x - x̄) = c(u - ū) and (y - ȳ) = d(v - v̄), we get Cov(x, y) = cd·Cov(u, v) and V(y) = d²·V(v). Hence
b_xy = Cov(x, y) / V(y) = cd·Cov(u, v) / (d²·V(v)) = (c/d)·b_uv.
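A numerical check of the change-of-scale rule above, assuming x = a + c·u and y = b + d·v (the constants and data are illustrative assumptions):

```python
import numpy as np

u = np.array([1.0, 2.0, 4.0, 7.0])
v = np.array([3.0, 5.0, 4.0, 8.0])
a, c, b, d = 10.0, 2.0, -5.0, 3.0
x = a + c * u
y = b + d * v

def bxy(s, t):
    # regression coefficient of the first variable on the second: Cov / V(t)
    return (np.mean(s * t) - s.mean() * t.mean()) / np.var(t)

lhs = bxy(x, y)             # b_xy computed in the (x, y) scale
rhs = (c / d) * bxy(u, v)   # (c/d) * b_uv computed in the (u, v) scale
```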
Karl Pearson's correlation coefficient: r = Cov(X, Y) / (σx·σy).
Consider b_yx · b_xy = Cov(X, Y)² / (σx²·σy²) = [Cov(X, Y) / (σx·σy)]² = r², and r² ≤ 1.
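A quick numerical check of the identity above, b_yx · b_xy = r² (the data are illustrative assumptions):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 6.0, 5.0])

cov = np.mean(x * y) - x.mean() * y.mean()
b_yx = cov / np.var(x)            # regression coefficient of Y on X
b_xy = cov / np.var(y)            # regression coefficient of X on Y
r = cov / (x.std() * y.std())     # Karl Pearson's r
```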
r = -√((3/4) × (1/5)) if both b_yx and b_xy are negative (the sign of r matches the common sign of the regression coefficients).
Define the s.d. of x as σx = +√(∑f(x - x̄)² / n), n = ∑f,
and the s.d. of y as σy = +√(∑f(y - ȳ)² / n), n = ∑f.
Correlation coefficient: r = Cov(X, Y) / (σx·σy)
Proof:- Consider b_xy = Cov(X, Y) / σy²
= [Cov(X, Y) / σy²] · (σx/σx) = [Cov(X, Y) / (σx·σy)] · (σx/σy)
b_xy = r·(σx/σy)
Consider b_yx = Cov(X, Y) / σx²
= [Cov(X, Y) / σx²] · (σy/σy) = [Cov(X, Y) / (σx·σy)] · (σy/σx)
b_yx = r·(σy/σx), hence the proof.
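The two relations just proved (b_xy = r·σx/σy and b_yx = r·σy/σx) can be verified numerically; the data are illustrative assumptions:

```python
import numpy as np

x = np.array([1.0, 3.0, 4.0, 6.0, 8.0])
y = np.array([2.0, 3.0, 5.0, 5.0, 9.0])

cov = np.mean(x * y) - x.mean() * y.mean()
sx, sy = x.std(), y.std()   # population standard deviations (/n)
r = cov / (sx * sy)

b_yx = cov / sx**2   # should equal r * sy / sx
b_xy = cov / sy**2   # should equal r * sx / sy
```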
7. If one of the regression coefficients is greater than unity, the other must be less than unity. That is, if b_yx > 1 then b_xy < 1, and if b_xy > 1 then b_yx < 1.
But the converse need not be true: if b_yx < 1 then b_xy need not be > 1, and if b_xy < 1 then b_yx need not be > 1.
Proof:- We have b_yx = Cov(X, Y) / σx² = regression coefficient of Y on X,
b_xy = Cov(X, Y) / σy² = regression coefficient of X on Y,
and the coefficient of correlation r = Cov(X, Y) / (σx·σy).
We know r² ≤ 1, and r² = b_yx · b_xy, so
b_yx · b_xy ≤ 1
a) To prove: if b_xy > 1 then b_yx < 1.
If b_xy > 1, then 1/b_xy < 1 ……[1]
b_yx ≤ 1/b_xy …… dividing both sides of the inequality by b_xy ……[2]
Hence b_yx ≤ 1/b_xy < 1, from [1]; that is, b_yx < 1 if b_xy > 1.
b) To prove: if b_yx > 1 then b_xy < 1.
If b_yx > 1, then 1/b_yx < 1 ……[1']
b_xy ≤ 1/b_yx …… dividing both sides of the inequality by b_yx ……[2']
Hence we get b_xy ≤ 1/b_yx < 1, using [1'].
(1/2)[b_yx + b_xy] ≥ r, i.e. the arithmetic mean of the two regression coefficients is at least equal to the correlation coefficient r.
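A numerical check of the inequality above: when both regression coefficients are positive, their arithmetic mean is at least r, since AM ≥ GM and √(b_yx · b_xy) = r. The data are illustrative assumptions:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 8.0])
y = np.array([2.0, 5.0, 4.0, 7.0, 9.0])

cov = np.mean(x * y) - x.mean() * y.mean()
b_yx = cov / np.var(x)
b_xy = cov / np.var(y)
r = cov / (x.std() * y.std())

am = 0.5 * (b_yx + b_xy)   # arithmetic mean of the two coefficients
```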
b) If r = -1, then b_xy = -σx/σy and b_yx = -σy/σx.
So the regression line of X on Y becomes x - x̄ = b_xy (y - ȳ)
=> (x - x̄) = -(σx/σy)(y - ȳ)
3. The smaller the angle between the regression lines, the higher the value of the correlation (r) between the variables X and Y.
When r = ±1 the two regression lines coincide; that is, for the maximum values of r, the angle between the lines is zero (the lines are the same).
**********