Solution to Quiz 1
Prepared by Sai Vivek and Siddhant
Solution to Question 1: Consider a dataset S = {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)})} where x^{(i)} ∈ R^d and y^{(i)} ∈ R for all i = 1, 2, ..., m. Let r := (r_1, r_2, ..., r_m) be any real-valued vector. For any Θ ∈ R^d and Θ_0 ∈ R, consider the following weighted mean squared error:

L(S, r, Θ) = Σ_{i=1}^m r_i (y^{(i)} − x^{(i)T} Θ − Θ_0)^2
1.1 [2 points] Find the value of (Θ_0, Θ) ∈ R^{d+1} that minimizes it.
L(S, r, Θ) = Σ_{i=1}^m r_i (y^{(i)} − x^{(i)T} Θ − Θ_0)^2

Let Θ = [θ_0, θ_1, ..., θ_d]^T (with θ_0 = Θ_0) and redefine x^{(i)T} = [1, x_1^{(i)}, ..., x_d^{(i)}], so that the intercept is absorbed into Θ. Then

L(S, r, Θ) = Σ_{i=1}^m r_i (y^{(i)} − x^{(i)T} Θ)^2
           = Σ_{i=1}^m (√r_i y^{(i)} − √r_i x^{(i)T} Θ)^2
Now define

Y = [y^{(1)}, y^{(2)}, ..., y^{(m)}]^T,
R = diag(√r_1, √r_2, ..., √r_m)   (the m × m diagonal matrix with √r_i on the diagonal),
X = [x^{(1)T}; x^{(2)T}; ...; x^{(m)T}]   (the m × (d+1) matrix whose i-th row is x^{(i)T}),

so that L(S, r, Θ) = (RY − RXΘ)^T (RY − RXΘ).
Setting the gradient with respect to Θ to zero, ∇_Θ L = 0:

−2 X^T R^T R Y + 2 X^T R^T R X Θ = 0
X^T R^T R X Θ = X^T R^T R Y
Θ = (X^T R^T R X)^{-1} X^T R^T R Y   (assuming X^T R^T R X is invertible).
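As a numerical sanity check (this sketch is ours, not part of the quiz; the synthetic data and variable names are illustrative, and NumPy is assumed), the closed form can be compared with an ordinary least-squares solve on the rescaled data (√r_i x^{(i)}, √r_i y^{(i)}):

```python
import numpy as np

# Checks the closed form Theta = (X^T R^T R X)^{-1} X^T R^T R Y on synthetic data.
rng = np.random.default_rng(0)
m, d = 200, 3

X_raw = rng.normal(size=(m, d))
X = np.hstack([np.ones((m, 1)), X_raw])        # prepend a column of ones for Theta_0
true_theta = np.array([1.0, -2.0, 0.5, 3.0])   # [Theta_0, theta_1, ..., theta_d]
y = X @ true_theta + rng.normal(scale=0.1, size=m)

r = rng.uniform(0.5, 2.0, size=m)              # arbitrary positive weights
R = np.diag(np.sqrt(r))                        # R as defined above: diag(sqrt(r_i))

# Closed form from the derivation.
theta_closed = np.linalg.solve(X.T @ R.T @ R @ X, X.T @ R.T @ R @ y)

# Cross-check: the same problem as an OLS fit on (sqrt(r_i) x^(i), sqrt(r_i) y^(i)).
theta_lstsq, *_ = np.linalg.lstsq(R @ X, R @ y, rcond=None)

assert np.allclose(theta_closed, theta_lstsq)
print(theta_closed)
```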
1.2 [2 points] Briefly explain how the vector r affects the optimal values.
The vector r lets the model handle situations where the variance of the residuals is not uniform across observations (heteroscedasticity). Observations whose errors have high variance are given small weights, and observations whose errors have low variance are given large weights, so noisy points pull the fit less.
One common choice is r_i = 1/σ_i^2, where σ_i^2 is the variance of the error for x^{(i)}.
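To illustrate (a sketch of ours with synthetic heteroscedastic data; all names are illustrative and NumPy is assumed), inverse-variance weights typically recover the true parameters more accurately than uniform weights:

```python
import numpy as np

# Illustration of the r_i = 1 / sigma_i^2 choice: noisier points get smaller weights.
rng = np.random.default_rng(1)
m = 500
x = rng.uniform(-1, 1, size=m)
sigma = np.where(x > 0, 2.0, 0.1)              # right half of the data is much noisier
y = 1.0 + 3.0 * x + rng.normal(scale=sigma)

X = np.column_stack([np.ones(m), x])

def weighted_fit(weights):
    """Closed-form weighted least squares with weights r_i."""
    W = np.diag(weights)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print("uniform weights:     ", weighted_fit(np.ones(m)))
print("inverse-variance r_i:", weighted_fit(1.0 / sigma**2))
# The inverse-variance fit typically recovers [1.0, 3.0] more accurately.
```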
Solution to Question 2: Consider a dataset S = {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)})} where x^{(i)} ∈ R^d and y^{(i)} ∈ R for all i = 1, 2, ..., m. The samples are drawn i.i.d. Let ȳ = (1/m) Σ_{i=1}^m y^{(i)} be the average of the labels. Consider a linear model with parameters Θ ∈ R^d and Θ_0 ∈ R, giving predictions ŷ^{(i)} = Θ_0 + x^{(i)T} Θ for all i. Does the following relationship hold? Prove or disprove.

Σ_{i=1}^m (y^{(i)} − ȳ)^2 = Σ_{i=1}^m (ŷ^{(i)} − ȳ)^2 + Σ_{i=1}^m (y^{(i)} − ŷ^{(i)})^2
Solution:
For a multivariate linear regression, suppose we have a dataset S = {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)})} where x_i ∈ R^d and y_i ∈ R for all i = 1, 2, ..., m.
Let x_{ij}, i ∈ {1, ..., m}, j ∈ {1, ..., d}, be the j-th feature of the i-th sample.
The linear model is
Y = Xβ + ε,
and the predictions are given by Ŷ = X β̂ for an estimate β̂ of β.
For ordinary least squares estimation, we want to minimize the sum of squared errors (SSE); that is, the objective function is ε^T ε. Substituting ε = Y − Xβ into the SSE formula, we get the optimization problem

min_β { ε^T ε : ε = Y − Xβ }.
Let us recall the first-order derivative rules in matrix form (you can expand and verify them in scalar form). Here (·)' denotes the derivative with respect to X, and W is a symmetric matrix.

Rule #1: (β^T X)' = β and (W X)' = W
Rule #2: (X^T W X)' = 2 W X
In the special case of Rule #2 when W = I, (X^T X)' = 2X.
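As a sanity check (this snippet is ours; W, x, and eps are illustrative names), Rule #2 can be verified numerically with finite differences:

```python
import numpy as np

# Finite-difference check of Rule #2: d(x^T W x)/dx = 2 W x for symmetric W.
rng = np.random.default_rng(2)
d = 4
A = rng.normal(size=(d, d))
W = (A + A.T) / 2.0                      # make W symmetric
x = rng.normal(size=d)

f = lambda v: v @ W @ v                  # scalar function x^T W x

eps = 1e-6
numeric = np.array([
    (f(x + eps * np.eye(d)[j]) - f(x - eps * np.eye(d)[j])) / (2 * eps)
    for j in range(d)
])

assert np.allclose(numeric, 2 * W @ x, atol=1e-5)
print(numeric, 2 * W @ x)
```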
Therefore, for this continuous SSE objective, the first-order necessary optimality condition is given by (ε^T ε)' = 0; that is, by the chain rule,

−2 X^T (Y − Xβ) = 0.

Hence the optimal β̂ satisfies X^T Y = X^T X β̂, and thus we get

β̂ = (X^T X)^{-1} X^T Y   and   Ŷ = X β̂,

where (X^T X)^{-1} X^T is called the left pseudo-inverse of X.
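A short numerical sketch (ours, not part of the quiz; names and data are illustrative) that checks the closed form β̂ = (X^T X)^{-1} X^T Y against NumPy's least-squares and pseudo-inverse routines:

```python
import numpy as np

# Checks the OLS closed form against NumPy's solvers on synthetic data.
rng = np.random.default_rng(3)
m, d = 100, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, d))])   # include an intercept column
beta_true = np.array([0.5, 2.0, -1.0])
Y = X @ beta_true + rng.normal(scale=0.2, size=m)

beta_closed = np.linalg.solve(X.T @ X, X.T @ Y)   # normal equations
beta_pinv = np.linalg.pinv(X) @ Y                 # left pseudo-inverse of X
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

assert np.allclose(beta_closed, beta_lstsq) and np.allclose(beta_pinv, beta_lstsq)
print(beta_closed)
```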
Note that for a simple linear regression (only one explanatory variable), the above reduces to

β_1 = cov(x, y) / var(x).
To see this, we write out the variables in their explicit form.
X_{m×2} = [1, x_1; 1, x_2; ... ; 1, x_m]   (writing a matrix row by row, with rows separated by semicolons),   Y = [y_1, y_2, ..., y_m]^T.
We get
β̂_{2×1} = [β_0; β_1] = (X^T X)^{-1} X^T Y
= ( [1, 1, ..., 1; x_1, x_2, ..., x_m] [1, x_1; 1, x_2; ... ; 1, x_m] )^{-1} [1, 1, ..., 1; x_1, x_2, ..., x_m] [y_1; y_2; ... ; y_m]
= [ m, Σ_{i=1}^m x_i ; Σ_{i=1}^m x_i, Σ_{i=1}^m x_i^2 ]^{-1} [ Σ_{i=1}^m y_i ; Σ_{i=1}^m x_i y_i ].
Bear in mind that we have
[a, b; c, d]^{-1} = (1 / (ad − bc)) [d, −b; −c, a].
We can get
β_1 = (m Σ_{i=1}^m x_i y_i − Σ_{i=1}^m x_i · Σ_{i=1}^m y_i) / (m Σ_{i=1}^m x_i^2 − Σ_{i=1}^m x_i · Σ_{i=1}^m x_i) = cov(x, y) / var(x).
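As a quick check (our own sketch; the data and names are illustrative, NumPy assumed), the slope returned by a least-squares fit matches cov(x, y)/var(x):

```python
import numpy as np

# Verifies numerically that the simple-regression slope equals cov(x, y) / var(x).
rng = np.random.default_rng(4)
m = 1000
x = rng.normal(size=m)
y = 0.7 + 2.5 * x + rng.normal(scale=0.3, size=m)

X = np.column_stack([np.ones(m), x])
beta0, beta1 = np.linalg.lstsq(X, y, rcond=None)[0]

# Population-style (biased) covariance/variance; the 1/m factors cancel anyway.
slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)

assert np.isclose(beta1, slope)
print(beta1, slope)
```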
We now focus on proving the decomposition SST = SSM + SSE. Throughout, let e = [1, 1, ..., 1]^T ∈ R^m, let Ȳ := ȳ e be the vector whose entries all equal ȳ, and let ε = Y − X β̂ denote the residual vector.
The SSM/regression sum of squares (RSS), a.k.a. explained sum of squares (ESS), is given by

SSM = Σ_{i=1}^m (ŷ^{(i)} − ȳ)^2 = (Ŷ − Ȳ)^T (Ŷ − Ȳ)
= (X β̂ − Ȳ)^T (X β̂ − Ȳ)
= β̂^T X^T X β̂ + Ȳ^T Ȳ − 2 β̂^T X^T Ȳ.
Similarly,

SSE = Σ_{i=1}^m (y^{(i)} − ŷ^{(i)})^2 = (Y − X β̂)^T (Y − X β̂) = Y^T Y + β̂^T X^T X β̂ − 2 β̂^T X^T Y,
SST = Σ_{i=1}^m (y^{(i)} − ȳ)^2 = (Y − Ȳ)^T (Y − Ȳ) = Y^T Y + Ȳ^T Ȳ − 2 Y^T Ȳ.

Therefore,

SSM + SSE − SST = 2 β̂^T X^T X β̂ − 2 β̂^T X^T Y − 2 β̂^T X^T Ȳ + 2 Y^T Ȳ,

where

β̂ = (X^T X)^{-1} X^T Y.
We see that

Y^T X β̂ − β̂^T X^T X β̂
= Y^T X β̂ − β̂^T X^T Y      (using the normal equations X^T X β̂ = X^T Y)
= Y^T X β̂ − Y^T X β̂ = 0    (β̂^T X^T Y is a scalar, so it equals its transpose Y^T X β̂),

so β̂^T X^T X β̂ = β̂^T X^T Y and the first two terms above cancel: 2 β̂^T X^T X β̂ − 2 β̂^T X^T Y = 0.
It therefore remains to show that

2 β̂^T X^T Ȳ − 2 Y^T Ȳ = 0

to get SST = SSM + SSE.
We may ask: is this true in general? No! But we do have assumptions when we conduct ordinary least squares (OLS) regression.
Recall the moment restrictions for a simple linear OLS regression:

E[(y − β_0 − β_1 x)] = 0   and   E[x (y − β_0 − β_1 x)] = 0.

The expected value of the error term should be zero, and the error term should be uncorrelated with the explanatory variables.
With an intercept in the model, the sample analogue of the first restriction is ε^T e = 0 (the residuals sum to zero). Hence

β̂^T X^T Ȳ − Y^T Ȳ = −(Y − X β̂)^T Ȳ = −ε^T Ȳ = −ȳ ε^T e = 0,

where e_{m×1} = [1, 1, ..., 1]^T.
Therefore
SST = SSM+SSE
If the assumption that the expected value of the residual term is zero is violated (for example, when the model is fit without an intercept), then in general
SST ≠ SSM + SSE.
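The decomposition can also be checked numerically. The sketch below is ours (synthetic data, illustrative names, NumPy assumed); it confirms the identity when the design matrix contains an intercept column and shows that it generally fails without one:

```python
import numpy as np

# With an intercept column in X, SST = SSM + SSE holds up to floating-point error;
# without an intercept the residuals need not sum to zero and the identity fails.
rng = np.random.default_rng(5)
m = 300
x = rng.normal(size=(m, 2))
y = 4.0 + x @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=m)

def decomposition(X):
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat
    sst = np.sum((y - y.mean()) ** 2)
    ssm = np.sum((y_hat - y.mean()) ** 2)
    sse = np.sum((y - y_hat) ** 2)
    return sst, ssm + sse

print("with intercept:   ", decomposition(np.hstack([np.ones((m, 1)), x])))
print("without intercept:", decomposition(x))
# The two numbers match in the first case and differ in the second.
```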
Reference
Larry Li, http://www.larrylisblog.net/WebContents/SST_EQ_RSS_PLUS_SSE.pdf