
CSE 546: Machine Learning Lecture 2

Least Squares

Instructor: Sham Kakade

1 Supervised Learning and Regression

We observe data:

$$T = \{(x_1, y_1), \ldots, (x_n, y_n)\}$$

from some distribution. Our goal may be to predict Y given some X. If Y is real, we may wish to learn the
conditional expectation E[Y | X].
Typically, in supervised learning, we are interested in some notion of our prediction loss. For example, in regression, the average squared error for a function f is:

$$L_{\text{squared error}}(f) = \mathbb{E}\big[(f(X) - Y)^2\big]$$

where the expectation is with respect to a random X, Y pair. (Note: sometimes the average error in machine learning
is referred to as the risk.)
Note that minimizing the squared loss also corresponds to doing maximum likelihood estimation under the
model Pr(Y | X, f) = N(f(X), σ^2). To see this, observe that

$$-\log \Pr(Y \mid X, f) = \log \sqrt{2\pi\sigma^2} + \frac{(f(X) - Y)^2}{2\sigma^2}$$

which is the square loss up to a positive scaling and an additive constant.
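
As a quick numeric sanity check of this correspondence, the sketch below uses made-up values for f(X), Y, and σ (none of them from the notes) and confirms that the Gaussian negative log-likelihood equals the squared loss after a fixed scaling and shift:

```python
import numpy as np

# Hypothetical values, chosen only for illustration: a prediction f(X),
# an observed outcome Y, and a noise level sigma.
f_x, y, sigma = 2.0, 3.5, 1.5

# Negative log-likelihood of Y under the model N(f(X), sigma^2).
neg_log_lik = np.log(np.sqrt(2 * np.pi * sigma**2)) + (f_x - y) ** 2 / (2 * sigma**2)

# The squared loss, then the same scaling and shift applied by hand.
sq_loss = (f_x - y) ** 2
shifted_scaled = sq_loss / (2 * sigma**2) + np.log(np.sqrt(2 * np.pi * sigma**2))

print(np.isclose(neg_log_lik, shifted_scaled))  # True
```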
Our goal is to use our training set T to estimate a function f̂ which has low error. Also, note that the lowest possible squared error is achieved by:

$$f_*(X) = \mathbb{E}[Y \mid X]$$

which is the conditional expectation.

1.1 Risk (and some terminology clarifications)

A learning algorithm (or a decision rule) δ is a mapping from T to some hypothesis space. In the case of regression,
it is a mapping from T to a function f. The notion of risk in statistics measures the quality of this procedure, on
average.
Let f^* be a minimizer of L over some set F, e.g.

$$f^* \in \arg\min_{f \in \mathcal{F}} L(f)$$

The regret of f (sometimes referred to as the loss) is defined as:

$$L(f) - L(f^*)$$

which is a measure of the sub-optimality of f .
The risk is a measure of the (average) performance of a decision rule, where, importantly, the expectation is taken
over the training set T. One natural definition of the risk function is:

$$\mathrm{Risk}(\delta) = \mathbb{E}_T\big[L(\delta(T))\big] - L(f_*)$$

Note that the expectation is over the training set. Other definitions may also be appropriate (though, technically, the
risk should always refer to the performance of the decision rule δ).
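
As an illustration of this definition, the toy simulation below estimates Risk(δ) for the least squares rule by averaging its excess loss over many independently drawn training sets; the Gaussian data-generating model, the sample size, and the choice of least squares as δ are all assumptions made purely for this sketch:

```python
import numpy as np

# A toy simulation (not from the notes): the data-generating model, the
# sample size, and the use of least squares as the decision rule delta
# are all assumptions made only for illustration.
rng = np.random.default_rng(0)
n, d, sigma, n_trials = 50, 5, 1.0, 500
w_true = rng.normal(size=d)                     # assumed "true" linear model

def squared_loss(w_hat, n_test=5000):
    """Monte Carlo estimate of L(w_hat) = E[(X . w_hat - Y)^2]."""
    X = rng.normal(size=(n_test, d))
    Y = X @ w_true + sigma * rng.normal(size=n_test)
    return np.mean((X @ w_hat - Y) ** 2)

def delta(X, Y):
    """The decision rule: map a training set T = (X, Y) to an estimated w."""
    return np.linalg.lstsq(X, Y, rcond=None)[0]

# Risk(delta) ~= average over training sets of L(delta(T)) - L(f_*),
# where L(f_*) = sigma^2 for this data-generating model.
excess_losses = []
for _ in range(n_trials):
    X = rng.normal(size=(n, d))
    Y = X @ w_true + sigma * rng.normal(size=n)
    excess_losses.append(squared_loss(delta(X, Y)) - sigma**2)

print(np.mean(excess_losses))   # roughly sigma^2 * d / n for this setup
```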

2 Linear Regression

Suppose that X ∈ R^d. Our prediction loss on our training set for a linear predictor w is:

$$\frac{1}{n}\sum_{i=1}^{n} (X_i \cdot w - Y_i)^2 \;=\; \frac{1}{n}\|\mathbf{X}w - \mathbf{Y}\|^2$$

where X (in boldface) is defined to be the n × d matrix whose rows are the X_i, and Y (in boldface) is the vector
[Y_1, Y_2, ..., Y_n]^T.
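
As a quick numeric check of this identity (the sketch below uses random data, chosen purely for illustration), the averaged sum of squared residuals agrees with the scaled squared norm:

```python
import numpy as np

# A quick numeric check of the identity above, on random data chosen
# purely for illustration.
rng = np.random.default_rng(0)
n, d = 10, 3
X = rng.normal(size=(n, d))     # rows are the X_i
Y = rng.normal(size=n)
w = rng.normal(size=d)

sum_form = np.mean([(X[i] @ w - Y[i]) ** 2 for i in range(n)])
norm_form = np.linalg.norm(X @ w - Y) ** 2 / n

print(np.isclose(sum_form, norm_form))   # True
```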
The least squares estimator using an outcome Y is just:

$$\hat{\beta} = \arg\min_w \frac{1}{n}\|\mathbf{Y} - \mathbf{X}w\|^2$$
The first derivative condition is that:

$$\mathbf{X}^\top(\mathbf{Y} - \mathbf{X}\hat{\beta}) = 0$$

which is referred to as the normal equations.


The least squares estimator (the MLE) is then:

$$\hat{\beta}_{\text{least squares}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{Y}$$
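
A minimal sketch of the estimator on synthetic data; the dimensions, the true weights, and the noise level are assumptions made only for this example, and the normal-equations solution is compared against numpy's built-in least squares routine:

```python
import numpy as np

# A minimal sketch on synthetic data; the dimensions, the true weights,
# and the noise level are assumptions made only for this example.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                        # rows are the X_i
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# beta_hat = (X^T X)^{-1} X^T Y, computed via a linear solve
# rather than forming the inverse explicitly.
beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference solution from numpy's least squares routine.
beta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]

print(np.allclose(beta_normal_eq, beta_lstsq))     # True
```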

3 Review: The SVD; the “Thin” SVD; and the pseudo-inverse

Theorem 3.1. (SVD) Let X ∈ R^{n×d}. There exist orthogonal matrices U ∈ R^{n×n} and V ∈ R^{d×d} (e.g. matrices with
orthonormal rows and columns, so that UU^T = I_n and VV^T = I_d, where I_k is the k × k identity matrix) such that:

$$\mathbf{X} = \sum_i \lambda_i u_i v_i^\top = U \,\mathrm{diag}(\lambda_1, \ldots, \lambda_{\min\{n,d\}})\, V^\top$$

where diag(·) is a diagonal R^{n×d} matrix and the λ_i's are referred to as the singular values.
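
As a numerical sanity check of the theorem (the matrix below is random and purely illustrative, and numpy's SVD routine is used for the factorization):

```python
import numpy as np

# The matrix below is random and serves only to check the factorization
# numerically with numpy's SVD routine.
rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))

# Full SVD: U is n x n, Vt is d x d, s holds the min(n, d) singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Rebuild X = U diag(lambda_1, ..., lambda_min(n,d)) V^T, with diag padded to n x d.
D = np.zeros((n, d))
D[:min(n, d), :min(n, d)] = np.diag(s)
print(np.allclose(X, U @ D @ Vt))    # True

# Equivalently, X = sum_i lambda_i u_i v_i^T over the singular triples.
X_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(min(n, d)))
print(np.allclose(X, X_sum))         # True
```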

For X ∈ R^{n×d} and Y ∈ R^n, suppose that the equation:

$$\mathbf{X}\beta = \mathbf{Y}$$

has a unique solution. Then:

$$\beta = \mathbf{X}^{-1}\mathbf{Y}$$

where X^{-1} is the inverse of X (it exists and is unique since we have assumed the linear system has a unique solution).
In regression, there is typically noise, and we find a β which minimizes:

$$\|\mathbf{X}\beta - \mathbf{Y}\|^2$$

Clearly, if there is no noise, then a solution is given by β = X^{-1}Y, assuming no degeneracies. In general though, the
minimizer of this error, referred to as the least squares estimator, is:

$$\beta = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{Y}. \tag{1}$$

Furthermore, Equation 1 above only holds if X is of rank d (else X^T X would not be invertible).
Now let us define the Moore-Penrose pseudo-inverse.
First, let us define the “thin” SVD.
Definition 3.2. We say X = UDV^T is the “thin” SVD of X ∈ R^{n×p} if: U ∈ R^{n×r} and V ∈ R^{p×r} have orthonormal columns
(r being the number of columns of each), and D ∈ R^{r×r} is diagonal, with all its diagonal entries being non-zero.
Here, r is the rank of X.
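
In practice one might compute the thin SVD as in the sketch below; the rank-2 matrix is an arbitrary choice, made only so that r < min(n, p) is visible, and numpy's reduced SVD plus a truncation to the non-negligible singular values is one (assumed) way to obtain the thin factors:

```python
import numpy as np

# A sketch of computing the thin SVD in practice. The rank-2 matrix below
# is arbitrary, built only so that r < min(n, p) is visible.
rng = np.random.default_rng(0)
n, p, r_true = 8, 5, 2
X = rng.normal(size=(n, r_true)) @ rng.normal(size=(r_true, p))

# Reduced SVD, then keep only the non-negligible singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))                  # numerical rank of X
U_thin, D_thin, V_thin = U[:, :r], np.diag(s[:r]), Vt[:r, :].T

print(r)                                           # 2
print(np.allclose(X, U_thin @ D_thin @ V_thin.T))  # True: X = U D V^T with the thin factors
```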

Now we define the pseudo-inverse as follows:


Definition 3.3. Let X = UDV^T be the thin SVD of X. The Moore-Penrose pseudo-inverse of X, denoted by X^+, is
defined as:

$$\mathbf{X}^+ = V D^{-1} U^\top$$

Let us make some observations:

1. First, if X is invertible (so X is square), then X^+ = X^{-1}.

2. Suppose that X isn't square and that Xw = Y has a (unique) solution; then w = X^+ Y.

3. Now suppose that Xw = Y has (at least one) solution. Then one solution is given by w = X^+ Y. This solution
is the minimum norm solution.

4. (geometric interpretation) The matrix X^+ maps any point in the range of X to the minimum norm point in the
domain which maps to it.
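
The following sketch checks these observations numerically; the wide random matrix is an assumption made only so that Xw = Y has many solutions and the minimum-norm property (observation 3) is visible. It builds X^+ from the thin SVD and compares against numpy's pinv:

```python
import numpy as np

# A random wide matrix (rank n < p), for illustration only: Xw = Y then
# has many solutions, so the minimum-norm property is visible.
rng = np.random.default_rng(0)
n, p = 4, 6
X = rng.normal(size=(n, p))

# Thin SVD (X has full row rank here, so r = n) and X^+ = V D^{-1} U^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
print(np.allclose(X_pinv, np.linalg.pinv(X)))       # True

# Observation 3: w = X^+ Y solves Xw = Y and has the smallest norm.
Y = rng.normal(size=n)
w = X_pinv @ Y
print(np.allclose(X @ w, Y))                        # True: it is a solution
# Any other solution differs from w by a null-space vector, and is no shorter.
w_other = w + (np.eye(p) - X_pinv @ X) @ rng.normal(size=p)
print(np.allclose(X @ w_other, Y))                  # True
print(np.linalg.norm(w) <= np.linalg.norm(w_other)) # True
```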

Using this terminology, we can write the least squares estimator in a more interpretable way:

Lemma 3.4. The least squares estimator is:

$$\beta = \mathbf{X}^+ \mathbf{Y}$$

(Note that the above is always a minimizer, while the solution provided in Equation 1 only holds if X^T X is invertible,
in which case the minimizer is unique.)
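
A short sketch of why the pseudo-inverse form is the more robust statement; the deliberately rank-deficient design matrix below is an illustrative assumption. Equation 1 fails because X^T X is singular, while β = X^+ Y still returns a minimizer:

```python
import numpy as np

# A deliberately rank-deficient design matrix (two collinear columns),
# chosen only to show the difference between the two formulas.
rng = np.random.default_rng(0)
n = 20
x = rng.normal(size=n)
X = np.column_stack([x, 2 * x])            # rank 1, so X^T X is singular
Y = 3 * x + 0.1 * rng.normal(size=n)

# Equation 1 breaks down: the inverse of X^T X does not exist.
print(np.linalg.matrix_rank(X.T @ X))      # 1

# beta = X^+ Y is still a minimizer of ||X beta - Y||^2 (the minimum-norm one).
beta = np.linalg.pinv(X) @ Y
beta_ref = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose(beta, beta_ref))         # True
```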

4 Analysis: what is the risk?

We will return to this in the next lecture.

5 What about if d > n?

We will examine this in the next few lectures.
