EE 615: Pattern Recognition & Machine Learning Fall 2016
Lecture 3 — August 11
Lecturer: Dinesh Garg Scribe: Harsha Vardhan Tetali
3.1 Properties of Least Square Regression Model (Cont.)
3.1.1 The sum of the residual errors over training set is zero
For the least square regression model, the following always holds true:

$$\sum_{i=1}^{n} e_i = 0 \qquad (3.1)$$

$$\Rightarrow \quad \sum_{i=1}^{n} (\hat{y}_i - y_i) = 0 \qquad (3.2)$$
where $\hat{y}_i$ is the predicted value and $y_i$ is the true value of the target variable. To prove this claim, consider the Residual Sum of Squares (RSS), also called the Sum of Squared Errors (SSE), parameterized by the weight vector $w$:

$$RSS(w) = SSE(w) = \sum_{i=1}^{n} (w^T x_i - y_i)^2 \qquad (3.3)$$
To minimize the function $RSS(w)$, we set the gradient vector equal to zero. That is,

$$\nabla RSS(w) = \begin{bmatrix} \dfrac{\partial RSS(w)}{\partial w_0} \\ \dfrac{\partial RSS(w)}{\partial w_1} \\ \vdots \\ \dfrac{\partial RSS(w)}{\partial w_d} \end{bmatrix} = 0 \qquad (3.4)$$
Let us consider the first component of the above gradient vector, evaluated at the optimum $w^*$:

$$\left. \frac{\partial RSS(w)}{\partial w_0} \right|_{w^*} = 0 \qquad (3.5)$$

$$\Rightarrow \quad \frac{\partial}{\partial w_0} \sum_{i=1}^{n} (w^T x_i - y_i)^2 = 0 \qquad (3.6)$$

$$\Rightarrow \quad \sum_{i=1}^{n} (w^T x_i - y_i) = 0 \qquad (3.7)$$

The last step follows because $x_{i0} = 1$ for every $i$, so the derivative of each squared term with respect to $w_0$ is simply $2(w^T x_i - y_i)$.
From the assumption made on the fitting curve, we have

$$\hat{y}_i = w_0 + w_1 x_{i1} + w_2 x_{i2} + \cdots + w_d x_{id} = w^T x_i \qquad (3.8)$$

Substituting (3.8) in (3.7), we get

$$\sum_{i=1}^{n} (\hat{y}_i - y_i) = 0 \qquad (3.9)$$

which proves the required claim. Q.E.D.
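The claim is easy to verify numerically. The sketch below (a hypothetical example with synthetic data; the variable names are our own) fits a least squares model whose design matrix includes the intercept column of ones, and checks that the residuals sum to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3

# Design matrix with a leading column of ones (the intercept term w0)
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)

# Least squares solution w* minimizing ||Xw - y||^2
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals e_i = yhat_i - y_i sum to zero (up to floating point error)
residuals = X @ w_star - y
print(np.allclose(residuals.sum(), 0.0))  # True
```

Note that the property depends on the intercept column: it is the $\partial/\partial w_0$ equation that forces the residual sum to zero.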
3.1.2 Total amount of overestimation is equal to the total amount of underestimation
This fact follows from rewriting Equation (3.9) in the following equivalent form:

$$\sum_{i \,:\, \hat{y}_i \ge y_i} (\hat{y}_i - y_i) \;=\; \sum_{i \,:\, \hat{y}_i \le y_i} (y_i - \hat{y}_i) \qquad (3.10)$$
3.1.3 Vector ˆy is a projection of the vector y on to the column
space of X
While fitting the linear regression model, we ideally want a parameter vector $w$ for which

$$Xw = y \qquad (3.11)$$

For the above linear system to have an exact solution, $y$ must lie in the column space of the coefficient matrix $X$. When this system of equations is unsolvable (i.e., $y$ does not lie in the column space of $X$), we solve it approximately: we try to find a vector $\hat{y}$ lying in the column space of $X$ that is as close to $y$ as possible. The coefficient vector $w$ giving this vector $\hat{y}$ is then our desired solution. To find the optimal vector $\hat{y}$, we need to solve the following optimization problem:

$$\arg\min_{\hat{y} \in \mathrm{colspace}(X)} \; \| y - \hat{y} \|_2^2 \qquad (3.12)$$
Note that any vector $\hat{y}$ lying in the column space of $X$ must be of the following form:

$$\hat{y} = \sum_{j=0}^{d} \alpha_j X_{*j} \qquad (3.13)$$

for some $\alpha_j \in \mathbb{R}$, $j = 0, 1, \ldots, d$, where $X_{*j}$ denotes the $j$-th column of $X$. Substituting this form of the vector $\hat{y}$ into the previous optimization problem, we get the following equivalent problem:

$$\arg\min_{\alpha \in \mathbb{R}^{d+1}} \; \| y - X\alpha \|_2^2 \qquad (3.14)$$
One can easily verify that the optimal value of $\alpha$ is

$$\alpha^* = (X^T X)^{-1} X^T y \qquad (3.15)$$

which is the same as $w^*$. The matrix $P = X (X^T X)^{-1} X^T$ is known as the projection matrix: pre-multiplying any vector by $P$ yields the projection of that vector onto the column space of $X$. In particular, $\hat{y} = X\alpha^* = P y$.
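The projection-matrix interpretation can be sketched numerically as follows (a hypothetical example on synthetic data; forming $P$ explicitly is fine at this scale but would not be done in practice for large $n$):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 2))])
y = rng.normal(size=30)

# Projection matrix P = X (X^T X)^{-1} X^T
P = X @ np.linalg.inv(X.T @ X) @ X.T

# Fitted values from the least squares solution
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w_star

print(np.allclose(P @ y, y_hat))  # True: P y equals the fitted vector yhat
print(np.allclose(P @ P, P))      # True: P is idempotent
print(np.allclose(P, P.T))        # True: P is symmetric
```

Idempotence and symmetry are exactly the algebraic properties of an orthogonal projection, and they reappear in the proof of Lemma 3.1 below.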
We can visualize this interpretation of the regression parameters $w^*$ using the figure below. Consider the matrix equation $Ax = b$, to be solved for the best possible solution, where $A$ is a matrix, $b$ is a given column vector, and $x$ is the column vector to be estimated. The mustard colored hyperplane passes through the origin (only three axes are drawn, since more than three cannot be depicted) and represents the column space of $A$. The orange vectors, which form the columns of $A$, span this entire hyperplane (only two are shown, for clarity). The problem $Ax = b$, with $b$ drawn as the green vector pointing out of the plane, can then be restated as finding the vector on the mustard colored plane nearest to the green vector. The natural geometric solution is to drop a perpendicular from the tip of the green vector onto the plane. The foot of this perpendicular is the optimal, i.e., the best possible, vector that can be estimated; in the figure above, this optimal vector is shown in black.
3.2 Mean and Variance of the Target and the Predicted Variables

In this section, we define the sample mean and the sample variance of the target variables as well as the predicted variables, and relate these variances to the earlier defined RSS function (evaluated at the optimal $w^*$).
3.2.1 Mean
Recall the following quantities defined earlier:

$$y = [y_1, y_2, \ldots, y_n]^T, \qquad \hat{y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n]^T$$

$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1d} \\ 1 & x_{21} & x_{22} & \ldots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{nd} \end{bmatrix} = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix}$$
Using these quantities, we define the following:

$$\text{Sample mean of } y := \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \qquad (3.16)$$

$$\text{Sample mean of } \hat{y} := \bar{\hat{y}} = \frac{1}{n} \sum_{i=1}^{n} \hat{y}_i \qquad (3.17)$$

From the property of least square regression, we can say that

$$\bar{y} = \bar{\hat{y}} \qquad (3.19)$$

This implies that the mean target value is the same as the mean predicted value for the least square regression.
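Equation (3.19) follows directly from $\sum_i (\hat{y}_i - y_i) = 0$ after dividing by $n$, and can be confirmed on synthetic data (a hypothetical example; names and data are our own):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.hstack([np.ones((25, 1)), rng.normal(size=(25, 2))])
y = rng.normal(size=25)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w

# Mean of the targets equals the mean of the predictions
print(np.isclose(y.mean(), y_hat.mean()))  # True
```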
3.2.2 Variance

The variances of the target variable, the predicted variable, and the residual error are given as follows:

$$\mathrm{Var}(y) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 \qquad (3.20)$$

$$\mathrm{Var}(\hat{y}) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2 = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \qquad (3.21)$$

$$\mathrm{Var}(e) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} RSS(w^*) \qquad (3.22)$$

The second equality in (3.21) uses $\bar{\hat{y}} = \bar{y}$, and the last expression follows because the mean of the residual error vector $e$ is zero, as per the previous claim.
Let us write the above expressions for the variances in vector notation. For this, we define the vector $\bar{\mathbf{y}}$ as follows:

$$\bar{\mathbf{y}} = [\bar{y}, \bar{y}, \ldots, \bar{y}]^T \qquad (3.23)$$

In view of this vector, we can rewrite the expressions for the variances in the following manner:

$$\mathrm{Var}(y) = \frac{1}{n} (y - \bar{\mathbf{y}})^T (y - \bar{\mathbf{y}}) \qquad (3.24)$$

$$\mathrm{Var}(\hat{y}) = \frac{1}{n} (\hat{y} - \bar{\mathbf{y}})^T (\hat{y} - \bar{\mathbf{y}}) \qquad (3.25)$$

$$\mathrm{Var}(e) = \frac{1}{n} (y - \hat{y})^T (y - \hat{y}) \qquad (3.26)$$
Below is an important result expressing the relationship between these three variances.

Lemma 3.1.

$$\mathrm{Var}(y) = \mathrm{Var}(\hat{y}) + \mathrm{Var}(e) \qquad (3.27)$$

$$\text{total variance} = \text{explained variance} + \text{unexplained variance} \qquad (3.28)$$

Proof: Let us start with the following expression:

$$n\,\mathrm{Var}(y) = (y - \bar{\mathbf{y}})^T (y - \bar{\mathbf{y}}) \qquad (3.29)$$

$$= (y - \hat{y} + \hat{y} - \bar{\mathbf{y}})^T (y - \hat{y} + \hat{y} - \bar{\mathbf{y}}) \qquad (3.30)$$

$$= (y - \hat{y})^T (y - \hat{y}) + (\hat{y} - \bar{\mathbf{y}})^T (\hat{y} - \bar{\mathbf{y}}) + 2 (y - \hat{y})^T (\hat{y} - \bar{\mathbf{y}}) \qquad (3.31)$$

$$= n\,\mathrm{Var}(e) + n\,\mathrm{Var}(\hat{y}) + 2 (y - \hat{y})^T (\hat{y} - \bar{\mathbf{y}}) \qquad (3.32)$$

Now let us examine the cross term $(y - \hat{y})^T (\hat{y} - \bar{\mathbf{y}})$.
This expression can be written as

$$(y - \hat{y})^T (\hat{y} - \bar{\mathbf{y}}) = (y - \hat{y})^T \hat{y} - (y - \hat{y})^T \bar{\mathbf{y}} \qquad (3.33)$$

Let us analyze each of the terms on the RHS one by one.

Second term:

$$(y - \hat{y})^T \bar{\mathbf{y}} = (y - \hat{y})^T \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \bar{y} = [\, y_1 - \hat{y}_1, \; y_2 - \hat{y}_2, \; \cdots, \; y_n - \hat{y}_n \,] \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \bar{y} = \bar{y} \sum_{i=1}^{n} (y_i - \hat{y}_i) = 0$$

where the last equality follows from the property of least square regression discussed earlier.

Now, let us analyze the first term, $(y - \hat{y})^T \hat{y}$. Substituting $\hat{y} = X w^* = X (X^T X)^{-1} X^T y$, and using the fact that the projection matrix $P = X (X^T X)^{-1} X^T$ is symmetric and idempotent (so that $\hat{y}^T \hat{y} = y^T P^T P y = y^T P y$), both $y^T \hat{y}$ and $\hat{y}^T \hat{y}$ reduce to $y^T P y$. Hence

$$(y - \hat{y})^T \hat{y} = y^T X (X^T X)^{-1} X^T y - y^T X (X^T X)^{-1} X^T y = 0$$

This completes the proof of the desired lemma.

We call the residual sum of squares the unexplained variance because it comes from the measured error, and the error is assumed to arise from some unexplained process; hence the name.
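Lemma 3.1 can be confirmed numerically on synthetic data (a hypothetical example; names and data are our own), computing the three variances exactly as in (3.20)-(3.22):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 3))])
y = rng.normal(size=n)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w

var_y = np.mean((y - y.mean()) ** 2)          # total variance
var_yhat = np.mean((y_hat - y.mean()) ** 2)   # explained variance
var_e = np.mean((y - y_hat) ** 2)             # unexplained variance, RSS/n

# Total variance = explained variance + unexplained variance
print(np.isclose(var_y, var_yhat + var_e))  # True
```

The decomposition holds here because the model includes an intercept; without the column of ones, the residuals need not sum to zero and the cross term in the proof no longer vanishes.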