
DS303: Introduction to Machine Learning

Solution to Quiz 1
Prepared by - Sai Vivek and Siddhant

Solution to Question 1: Consider a dataset S = {(x^(1), y^(1)), (x^(2), y^(2)), . . . , (x^(m), y^(m))} where x^(i) ∈ R^d and y^(i) ∈ R for all i = 1, 2, . . . , m. Let r := (r_1, r_2, . . . , r_m) be any real-valued vector. For any Θ ∈ R^d and Θ_0 ∈ R, consider the following weighted mean squared error

L(S, r, Θ) = Σ_{i=1}^{m} r_i (y^(i) − x^(i)T Θ − Θ_0)^2

1.1 [2 points] Find the value of (Θ_0, Θ) ∈ R^{d+1} that minimizes it.

L(S, r, Θ) = Σ_{i=1}^{m} r_i (y^(i) − x^(i)T Θ − Θ_0)^2

Let Θ = [θ_0, θ_1, . . . , θ_d]^T with θ_0 = Θ_0, and augment each input as x^(i)T = [1, x_1^(i), . . . , x_d^(i)]. Assuming each r_i ≥ 0 so that √r_i is real,

L(S, r, Θ) = Σ_{i=1}^{m} r_i (y^(i) − x^(i)T Θ)^2
           = Σ_{i=1}^{m} (√r_i y^(i) − √r_i x^(i)T Θ)^2
Now define

Y = [y^(1), y^(2), . . . , y^(m)]^T,
R = diag(√r_1, √r_2, . . . , √r_m),
X = [x^(1)T; x^(2)T; . . . ; x^(m)T], the m × (d + 1) matrix whose i-th row is x^(i)T.

Loss function in matrix notation:

L(S, r, Θ) = ‖RY − RXΘ‖^2
           = (RY − RXΘ)^T (RY − RXΘ)
           = (Y^T R^T − Θ^T X^T R^T)(RY − RXΘ)
           = Y^T R^T R Y − Y^T R^T R X Θ − Θ^T X^T R^T R Y + Θ^T X^T R^T R X Θ

Set the gradient of the loss function with respect to Θ to zero to find the optimal value:

∇_Θ L = −X^T R^T R Y − X^T R^T R Y + 2 X^T R^T R X Θ = 0
X^T R^T R X Θ = X^T R^T R Y
Θ̂ = (X^T R^T R X)^{-1} X^T R^T R Y

The first entry of Θ̂ is the optimal Θ_0 and the remaining d entries are the optimal Θ.
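As a quick sanity check, here is a minimal numerical sketch of the closed-form solution Θ̂ = (X^T R^T R X)^{-1} X^T R^T R Y derived above (Python/NumPy; the data, true parameters, and weights below are synthetic and purely illustrative):

import numpy as np

# Minimal sketch of the closed-form weighted least-squares solution derived above.
# The dataset and the weights r are synthetic; only the formula itself comes from the text.
rng = np.random.default_rng(0)
m, d = 200, 3
X_raw = rng.normal(size=(m, d))
theta_true = np.array([1.5, -2.0, 0.5])
theta0_true = 0.7
y = theta0_true + X_raw @ theta_true + rng.normal(size=m)

# Augment with a column of ones so that Theta_0 is absorbed into Theta, as in the solution.
X = np.hstack([np.ones((m, 1)), X_raw])
r = rng.uniform(0.5, 2.0, size=m)       # arbitrary positive weights r_i
R = np.diag(np.sqrt(r))                 # R^T R = diag(r)

# Theta_hat = (X^T R^T R X)^{-1} X^T R^T R Y
theta_hat = np.linalg.solve(X.T @ R.T @ R @ X, X.T @ R.T @ R @ y)
print("Theta_0:", theta_hat[0], "Theta:", theta_hat[1:])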

1.2 [2 points] Briefly explain how the vector r affects the optimal values.

The vector r lets the model handle situations where the variance of the errors is not uniform across observations. Observations whose errors have high variance receive small weights and influence the fit less, while observations whose errors have low variance receive large weights and influence the fit more.

1.3 [1 point] Is there any particular way you like to choose r?

One choice of the vector r is r_i = 1/σ_i^2, where σ_i^2 is the variance of the error for the i-th observation x^(i).
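A small sketch of this choice (Python/NumPy; the noise model and numbers below are hypothetical): when the noise level grows with x, inverse-variance weights downweight the noisiest observations, and the weighted estimate is typically closer to the true coefficients than the unweighted one.

import numpy as np

# Inverse-variance weighting r_i = 1 / sigma_i^2 under heteroscedastic noise.
# The noise model sigma_i = 0.2 + 0.4 * x_i is an assumption made for illustration.
rng = np.random.default_rng(1)
m = 500
x = rng.uniform(0, 10, size=m)
sigma = 0.2 + 0.4 * x                       # noise level grows with x
y = 3.0 + 2.0 * x + sigma * rng.normal(size=m)

X = np.column_stack([np.ones(m), x])
W = np.diag(1.0 / sigma**2)                 # W = R^T R with r_i = 1 / sigma_i^2

theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
theta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print("OLS:", theta_ols)                    # both estimate [3.0, 2.0];
print("WLS:", theta_wls)                    # the weighted estimate has lower variance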

Solution to Question 2: Consider a dataset S = {(x^(1), y^(1)), (x^(2), y^(2)), . . . , (x^(m), y^(m))} where x^(i) ∈ R^d and y^(i) ∈ R for all i = 1, 2, . . . , m. The samples are drawn i.i.d. Let ȳ = (1/m) Σ_{i=1}^{m} y^(i) be the average of the labels. Consider a linear model with parameters Θ ∈ R^d and Θ_0 ∈ R, giving predictions ŷ^(i) = Θ_0 + x^(i)T Θ for all i. Does the following relationship hold? Prove or disprove.

Σ_{i=1}^{m} (y^(i) − ȳ)^2 = Σ_{i=1}^{m} (ŷ^(i) − ȳ)^2 + Σ_{i=1}^{m} (y^(i) − ŷ^(i))^2

Solution:
For multivariate linear regression, suppose we have a dataset S = {(x^(1), y^(1)), (x^(2), y^(2)), . . . , (x^(m), y^(m))} where x^(i) ∈ R^d and y^(i) ∈ R for all i = 1, 2, . . . , m. Let x_ij, i ∈ {1, . . . , m}, j ∈ {1, . . . , d}, be the j-th feature of the i-th sample.
The model for y_i is

y_i = x_i1 · β_1 + x_i2 · β_2 + · · · + x_id · β_d + 1 · β_0 + ε_i,   i ∈ {1, . . . , m},

where ε_i is the i-th error term, and ŷ_i denotes the fitted value obtained from the estimated coefficients.
Putting everything in matrix form (vectors/matrices written in bold), let

Y = [y_1, y_2, . . . , y_m]^T,   β = [β_0, β_1, . . . , β_d]^T,   ε = [ε_1, ε_2, . . . , ε_m]^T,

and let X be the m × (d + 1) matrix whose i-th row is [1, x_i1, . . . , x_id]. Then the model in matrix form is

Y = Xβ + ε
For ordinary least squares estimation, we want to minimize the sum of squared errors (SSE); that is, the objective function is ε^T ε. Substituting ε = Y − Xβ into the SSE formula, we get the optimization problem

min_β ε^T ε = min_β (Y − Xβ)^T (Y − Xβ)

Okay, let us recall the first-order partial derivatives in matrix form; you can expand and verify the rules below in their scalar form. Assume W is symmetric and differentiate with respect to the vector X:
Rule #1: (β^T X)' = β and (W X)' = W
Rule #2: (X^T W X)' = 2W X
In the special case of Rule #2 when W = I, (X^T X)' = 2X
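Rule #2 can be checked numerically with central finite differences; a brief sketch (Python/NumPy, with a random symmetric W and a random point X, both chosen only for illustration):

import numpy as np

# Finite-difference check of Rule #2: the gradient of f(X) = X^T W X with respect
# to the vector X equals 2 W X when W is symmetric.
rng = np.random.default_rng(6)
n = 5
A = rng.normal(size=(n, n))
W = (A + A.T) / 2                       # symmetrize W
X = rng.normal(size=n)

f = lambda v: v @ W @ v
h = 1e-6
numeric_grad = np.array([
    (f(X + h * np.eye(n)[k]) - f(X - h * np.eye(n)[k])) / (2 * h) for k in range(n)
])
print(np.allclose(numeric_grad, 2 * W @ X, atol=1e-5))   # True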
Therefore, for this continuous function SSE(β) = (Y − Xβ)^T (Y − Xβ), the first-order necessary optimality condition is (ε^T ε)' = 0; that is, by the chain rule,

−2X^T (Y − Xβ) = 0

Hence, the optimal β̂ satisfies X^T Y = X^T X β̂, thus we get

β̂ = (X^T X)^{-1} X^T Y

and

Ŷ = X β̂,

where (X^T X)^{-1} X^T is called the left pseudo-inverse of X.
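The normal-equation formula and the left pseudo-inverse give the same β̂; here is a minimal numerical sketch (Python/NumPy) with synthetic data, where the shapes and noise level are arbitrary assumptions:

import numpy as np

# beta_hat from the normal equations versus the left pseudo-inverse of X.
rng = np.random.default_rng(2)
m, d = 100, 4
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, d))])   # intercept column + features
beta_true = rng.normal(size=d + 1)
Y = X @ beta_true + 0.1 * rng.normal(size=m)

beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)   # solves X^T X beta = X^T Y
beta_pinv = np.linalg.pinv(X) @ Y                 # (X^T X)^{-1} X^T applied to Y
print(np.allclose(beta_normal, beta_pinv))        # True
Y_hat = X @ beta_normal                           # fitted values Y_hat = X beta_hat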
Note that for simple linear regression (only one explanatory variable), the above reduces to

β_1 = cov(x, y) / var(x)

To see this, we write out the variables in their explicit form:

X_{m×2} = [ [1, x_1], [1, x_2], . . . , [1, x_m] ] (written row by row),   Y = [y_1, y_2, . . . , y_m]^T

We get

β̂_{2×1} = [β_0, β_1]^T = (X^T X)^{-1} X^T Y
        = [ [m, Σ_i x_i], [Σ_i x_i, Σ_i x_i^2] ]^{-1} [Σ_i y_i, Σ_i x_i y_i]^T

Bear in mind that we have

[ [a, b], [c, d] ]^{-1} = (1/(ad − bc)) [ [d, −b], [−c, a] ]

We can get

β_1 = (m Σ_i x_i y_i − Σ_i x_i · Σ_i y_i) / (m Σ_i x_i^2 − Σ_i x_i · Σ_i x_i) = cov(x, y) / var(x)
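A quick numerical check of this reduction (Python/NumPy, synthetic data): the slope obtained from the normal equations matches cov(x, y)/var(x) when both are computed with the same normalization.

import numpy as np

# Check that the normal-equation slope equals cov(x, y) / var(x) for simple regression.
rng = np.random.default_rng(3)
m = 50
x = rng.normal(size=m)
y = 1.0 + 2.5 * x + 0.3 * rng.normal(size=m)

X = np.column_stack([np.ones(m), x])
beta0, beta1 = np.linalg.solve(X.T @ X, X.T @ y)

slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # both use the 1/m normalization
print(np.isclose(beta1, slope))                     # True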
We now focus on proving

SST = SSM + SSE

Let Ȳ := ȳ · [1, 1, . . . , 1]^T denote the m-vector with every entry equal to ȳ. The total sum of squares (SST) is given by

Σ_{i=1}^{m} (y^(i) − ȳ)^2 = (Y − Ȳ)^T (Y − Ȳ)
                          = Y^T Y + Ȳ^T Ȳ − 2 Ȳ^T Y
The sum of squared errors (SSE), a.k.a. sum of squared residuals (SSR), is given by

Σ_{i=1}^{m} (y^(i) − ŷ^(i))^2 = (Y − Ŷ)^T (Y − Ŷ)
                              = (Y − Xβ̂)^T (Y − Xβ̂)
                              = Y^T (Y − Xβ̂) − β̂^T X^T (Y − Xβ̂)
                              = Y^T Y − Y^T Xβ̂

where the last step uses the normal equations, X^T (Y − Xβ̂) = 0.

The SSM/regression sum of squares (RSS), a.k.a. explained sum of squares (ESS), is given by

Σ_{i=1}^{m} (ŷ^(i) − ȳ)^2 = (Ŷ − Ȳ)^T (Ŷ − Ȳ)
                           = (Xβ̂ − Ȳ)^T (Xβ̂ − Ȳ)
                           = β̂^T X^T X β̂ + Ȳ^T Ȳ − 2 β̂^T X^T Ȳ

Therefore,

SST − SSM − SSE
= (Y^T Y + Ȳ^T Ȳ − 2 Ȳ^T Y) − (β̂^T X^T X β̂ + Ȳ^T Ȳ − 2 β̂^T X^T Ȳ) − (Y^T Y − Y^T Xβ̂)
= 2 β̂^T X^T Ȳ − 2 Ȳ^T Y + Y^T X β̂ − β̂^T X^T X β̂

where β̂ = (X^T X)^{-1} X^T Y. We see that

Y^T X β̂ − β̂^T X^T X β̂ = Y^T X β̂ − β̂^T X^T Y      (using X^T X β̂ = X^T Y)
                      = Y^T X β̂ − Y^T X β̂ = 0      (a scalar equals its own transpose)

so the last two terms cancel.

So, it suffices to prove that

β̂^T X^T Ȳ − Ȳ^T Y = 0

to get SST = SSM + SSE.
We may ask: is this true in general? No! But we do have assumptions when we conduct Ordinary Least Squares (OLS) regression.
Remember the moment restrictions for a simple linear OLS regression:

E[(y − β_0 − β_1 x)] = 0   and   E[x (y − β_0 − β_1 x)] = 0

The expected value of the error term should be zero, and the error term should be uncorrelated with the explanatory variables. The in-sample counterparts of these conditions are the normal equations X^T ε̂ = 0 for the residual vector ε̂ = Y − Xβ̂; in particular, since the first column of X is the vector of ones e_{m×1} = [1, 1, . . . , 1]^T, the residuals sum to zero, e^T ε̂ = 0. Hence

β̂^T X^T Ȳ − Ȳ^T Y = −(Y − Xβ̂)^T Ȳ = −ε̂^T Ȳ = −ȳ ε̂^T e = 0
Therefore

SST = SSM + SSE

If the assumption that the residuals sum to zero is violated (for example, when the regression is fit without an intercept term), then in general

SST ≠ SSM + SSE
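A short numerical sketch of this conclusion (Python/NumPy; the data and shapes are synthetic assumptions): with an intercept column the residuals sum to zero and the decomposition holds to machine precision, while dropping the intercept generally breaks it.

import numpy as np

# Verify SST = SSM + SSE with an intercept, and show it can fail without one.
rng = np.random.default_rng(4)
m = 80
x = rng.normal(size=(m, 2))
y = 1.0 + x @ np.array([2.0, -1.0]) + 0.5 * rng.normal(size=m)

def decompose(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    y_hat = X @ beta
    sst = np.sum((y - y.mean())**2)
    ssm = np.sum((y_hat - y.mean())**2)
    sse = np.sum((y - y_hat)**2)
    return sst, ssm, sse

sst, ssm, sse = decompose(np.hstack([np.ones((m, 1)), x]), y)   # with intercept
print(np.isclose(sst, ssm + sse))                               # True

sst, ssm, sse = decompose(x, y)                                 # no intercept column
print(np.isclose(sst, ssm + sse))                               # generally False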

Reference
Larry Li @ http://www.larrylisblog.net/WebContents/SST EQ RSS PLUS SSE.pdf
