Kayatu
1 Least Squares
We observe data:
\[
T = (x_1, y_1), \ldots, (x_n, y_n)
\]
drawn from some distribution. Our goal may be to predict $Y$ given some $X$. If $Y$ is real, we may wish to learn the conditional expectation $E[Y|X]$.
Typically, in supervised learning, we are interested in some notion of our prediction loss. For example, in regression, the average squared error for a function $f$ is:
\[
L(f) = E\big[(f(X) - Y)^2\big]
\]
where the expectation is with respect to a random $X, Y$ pair. (Note: sometimes the average error in machine learning is referred to as the risk.)
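For concreteness, the following sketch estimates this average squared error by Monte Carlo for a toy setup; the data distribution, the predictor $f$, and the sample size are all made up for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Toy data distribution (an assumption for illustration): Y = 2X + Gaussian noise.
def sample_xy(n):
    X = rng.uniform(-1.0, 1.0, size=n)
    Y = 2.0 * X + rng.normal(scale=0.5, size=n)
    return X, Y

# A candidate predictor f (also made up for the example).
def f(x):
    return 1.5 * x

# Monte Carlo estimate of L(f) = E[(f(X) - Y)^2] using a large fresh sample.
X, Y = sample_xy(100_000)
L_hat = np.mean((f(X) - Y) ** 2)
print(f"estimated squared-error loss L(f) ~ {L_hat:.3f}")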
Note that minimizing the squared loss function also corresponds to doing maximum likelihood estimation under the model $\Pr(Y|X, f) = \mathcal{N}(f(X), \sigma^2)$. To see this, observe that
\[
-\log \Pr(Y|X, f) = \log \sqrt{2\pi\sigma^2} + \frac{(f(X) - Y)^2}{2\sigma^2}
\]
which is the square loss (up to a linear transformation).
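To see the same point numerically, here is a minimal sketch (with a made-up one-parameter linear model and noise level) checking that the average negative Gaussian log-likelihood is an affine transformation of the average squared loss, so both are minimized at the same parameter.

import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5

# Made-up data from a one-parameter linear model f_w(x) = w * x.
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + rng.normal(scale=sigma, size=200)

ws = np.linspace(0.0, 4.0, 401)  # candidate parameters

# Average squared loss and average negative Gaussian log-likelihood for each w.
sq_loss = np.array([np.mean((w * x - y) ** 2) for w in ws])
neg_ll = np.log(np.sqrt(2 * np.pi * sigma**2)) + sq_loss / (2 * sigma**2)

# Both objectives are minimized at the same w (affine maps preserve the argmin).
assert np.argmin(sq_loss) == np.argmin(neg_ll)
print("argmin of squared loss:", ws[np.argmin(sq_loss)])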
Our goal is to use our training set $T$ to estimate a function $\hat{f}$ which has low error. Also, note that the lowest possible squared error is achieved by
\[
f_*(X) = E[Y|X]
\]
which is the conditional expectation.
A learning algorithm (or a decision rule) $\delta$ is a mapping from $T$ to some hypothesis space. In the case of regression, it is a mapping from $T$ to a function $f$. The notion of risk in statistics measures the quality of this procedure, on average.
Let $f^*$ be the minimizer of $L$ in some set $\mathcal{F}$, e.g. $f^* \in \arg\min_{f \in \mathcal{F}} L(f)$. The quantity
\[
L(f) - L(f^*)
\]
is a measure of the sub-optimality of $f$.
The risk is some measure of the (average) performance of a decision rule, where, importantly, an expectation is taken over the training set $T$. One natural definition of the risk function is:
\[
R(\delta) = E_T\big[L(\delta(T)) - L(f^*)\big]
\]
Note that the expectation is over the training set. Other definitions may also be appropriate (though, technically, the risk should always refer to the performance of the decision rule $\delta$).
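As a rough sketch of how one might estimate such a risk by simulation, the code below draws many independent training sets from a made-up linear-Gaussian model and averages the excess loss of the least squares decision rule; all distributional choices here are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 20, 3, 0.5
beta_true = np.array([1.0, -2.0, 0.5])   # made-up "true" linear predictor

def sample_training_set():
    X = rng.normal(size=(n, d))
    Y = X @ beta_true + rng.normal(scale=sigma, size=n)
    return X, Y

def delta(X, Y):
    # The decision rule: map a training set to a least squares fit.
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta_hat

# For this model, with standard normal covariates,
# L(beta) - L(beta_true) = E[(X . (beta - beta_true))^2] = ||beta - beta_true||^2.
excess = []
for _ in range(2000):
    X, Y = sample_training_set()
    b = delta(X, Y)
    excess.append(np.sum((b - beta_true) ** 2))
print("estimated risk (expected excess loss):", np.mean(excess))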
2 Linear Regression
Suppose that $X \in \mathbb{R}^d$. Our prediction loss on our training set for a linear predictor is:
\[
\frac{1}{n} \sum_{i=1}^{n} (X_i \cdot w - Y_i)^2 = \frac{1}{n} \|\mathbf{X} w - \mathbf{Y}\|^2
\]
where $\mathbf{X}$ ($X$ in boldface) is defined to be the $n \times d$ matrix whose rows are the $X_i$, and $\mathbf{Y}$ ($Y$ in boldface) is the vector $[Y_1, Y_2, \ldots, Y_n]^\top$.
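A minimal numpy check of this identity, with made-up dimensions and data:

import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 4
X = rng.normal(size=(n, d))   # rows are the X_i
Y = rng.normal(size=n)
w = rng.normal(size=d)

# Average squared error written as a sum over examples ...
loss_sum = np.mean([(X[i] @ w - Y[i]) ** 2 for i in range(n)])
# ... and written in matrix form as (1/n) ||Xw - Y||^2.
loss_matrix = np.linalg.norm(X @ w - Y) ** 2 / n

assert np.allclose(loss_sum, loss_matrix)
print(loss_sum, loss_matrix)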
The least squares estimator using an outcome $\mathbf{Y}$ is just:
\[
\hat{\beta} = \arg\min_{w} \frac{1}{n} \|\mathbf{Y} - \mathbf{X} w\|^2
\]
The first derivative condition, often referred to as the normal equations, is that:
\[
\mathbf{X}^\top (\mathbf{Y} - \mathbf{X} \hat{\beta}) = 0
\]
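A quick numerical sanity check of the normal equations, using numpy's least squares solver on made-up data:

import numpy as np

rng = np.random.default_rng(4)
n, d = 100, 5
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + rng.normal(scale=0.1, size=n)

# Least squares fit; lstsq minimizes ||Y - X w||^2.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The first-order condition: the residual is orthogonal to the columns of X.
grad_term = X.T @ (Y - X @ beta_hat)
assert np.allclose(grad_term, 0.0, atol=1e-8)
print("X^T (Y - X beta_hat) =", grad_term)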
Theorem 3.1. (SVD) Let $\mathbf{X} \in \mathbb{R}^{n \times d}$. There exist orthogonal matrices $U \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{d \times d}$ (e.g. matrices with orthonormal rows and columns, so that $U U^\top = I_n$ and $V V^\top = I_d$, where $I_k$ is the $k \times k$ identity matrix) such that:
\[
\mathbf{X} = \sum_i \lambda_i u_i v_i^\top = U \, \mathrm{diag}(\lambda_1, \ldots, \lambda_{\min\{n,d\}}) \, V^\top
\]
where $\mathrm{diag}(\cdot)$ is a diagonal matrix in $\mathbb{R}^{n \times d}$, the $u_i$ and $v_i$ are the columns of $U$ and $V$, and the $\lambda_i$'s are referred to as the singular values.
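A small numpy illustration of this decomposition; np.linalg.svd with full_matrices=True (the default) returns the square $U$ and $V$ of the theorem, with the dimensions below made up for the example.

import numpy as np

rng = np.random.default_rng(5)
n, d = 6, 4
X = rng.normal(size=(n, d))

# Full SVD: U is n x n, Vt is d x d, s holds the min(n, d) singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Rebuild the n x d "diagonal" matrix diag(lambda_1, ..., lambda_min{n,d}).
D = np.zeros((n, d))
D[:min(n, d), :min(n, d)] = np.diag(s)

# Check X = U diag(s) V^T and the orthogonality of U and V.
assert np.allclose(X, U @ D @ Vt)
assert np.allclose(U @ U.T, np.eye(n))
assert np.allclose(Vt @ Vt.T, np.eye(d))
print("singular values:", s)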
First, consider the noiseless case, where we seek a $\beta$ that solves the linear system:
\[
\mathbf{X} \beta = \mathbf{Y}
\]
If this system has a unique solution, it is $\beta = \mathbf{X}^{-1} \mathbf{Y}$, where $\mathbf{X}^{-1}$ is the inverse of $\mathbf{X}$ (it exists and is unique since we have assumed the linear system has a unique solution).
In regression, there is typically noise, and we find a $\beta$ which minimizes:
\[
\|\mathbf{X} \beta - \mathbf{Y}\|^2
\]
Clearly, if there is no noise, then a solution is given by $\beta = \mathbf{X}^{-1} \mathbf{Y}$, assuming no degeneracies. In general though, the minimizer of this error, referred to as the least squares estimator, is:
\[
\beta = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}. \tag{1}
\]
Furthermore, Equation 1 above only holds if $\mathbf{X}$ is of rank $d$ (else $\mathbf{X}^\top \mathbf{X}$ would not be invertible).
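A numerical check of Equation 1 on made-up full-rank data, compared against numpy's built-in least squares solver:

import numpy as np

rng = np.random.default_rng(6)
n, d = 200, 5
X = rng.normal(size=(n, d))                  # full column rank with probability 1
Y = X @ rng.normal(size=d) + rng.normal(scale=0.3, size=n)

# Closed form of Equation 1: beta = (X^T X)^{-1} X^T Y (valid since rank(X) = d).
beta_closed = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference solution from numpy's least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

assert np.allclose(beta_closed, beta_lstsq)
print(beta_closed)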
Now let us define the Moore-Penrose pseudo-inverse. First, let us define the "thin" SVD.
Definition 3.2. We say $\mathbf{X} = U D V^\top$ is the "thin" SVD of $\mathbf{X} \in \mathbb{R}^{n \times p}$ if: $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{p \times r}$ have orthonormal columns (e.g. where $r$ is the number of columns) and $D \in \mathbb{R}^{r \times r}$ is diagonal, with all its diagonal entries being non-zero. Here, $r$ is the rank of $\mathbf{X}$.
Definition 3.3. (Moore-Penrose pseudo-inverse) If $\mathbf{X} = U D V^\top$ is the thin SVD of $\mathbf{X}$, then the pseudo-inverse of $\mathbf{X}$ is
\[
\mathbf{X}^{+} = V D^{-1} U^\top .
\]
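To make the definition concrete, the sketch below builds $\mathbf{X}^{+}$ from the thin SVD and compares it with numpy's pinv; the example matrix is made up and deliberately rank-deficient.

import numpy as np

rng = np.random.default_rng(7)
n, p, r = 8, 5, 3
# A made-up rank-r matrix: product of an n x r and an r x p factor.
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, p))

# Thin SVD: keep only the r strictly positive singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

# Moore-Penrose pseudo-inverse via the thin SVD: X^+ = V D^{-1} U^T.
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

# Agrees with numpy's pinv (rcond chosen to discard the numerically-zero singular values),
# and satisfies the defining identity X X^+ X = X.
assert np.allclose(X_pinv, np.linalg.pinv(X, rcond=1e-10))
assert np.allclose(X @ X_pinv @ X, X)
print("rank of X:", np.linalg.matrix_rank(X))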
Using this terminology, we can write the least squares estimator in a more interpretable way:
Lemma 3.4. The least squares estimator is:
\[
\beta = \mathbf{X}^{+} \mathbf{Y}
\]
(Note that the above is always a minimizer, while the solution provided in Equation 1 only holds if $\mathbf{X}^\top \mathbf{X}$ is invertible, in which case the minimizer is unique.)
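As an illustration of the lemma, the sketch below uses a made-up rank-deficient design, so that $(\mathbf{X}^\top \mathbf{X})^{-1}$ does not exist, and checks that $\beta = \mathbf{X}^{+} \mathbf{Y}$ still attains the minimum squared error (matching numpy's least squares solver).

import numpy as np

rng = np.random.default_rng(8)
n, d = 30, 6
# Made-up rank-deficient design: the last column duplicates the first,
# so X^T X is singular and Equation 1 does not apply.
X = rng.normal(size=(n, d))
X[:, -1] = X[:, 0]
Y = rng.normal(size=n)

beta_pinv = np.linalg.pinv(X) @ Y                      # beta = X^+ Y
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)     # numpy's minimizer

# Both attain the same (minimal) value of ||X beta - Y||^2.
loss = lambda b: np.linalg.norm(X @ b - Y) ** 2
assert np.isclose(loss(beta_pinv), loss(beta_lstsq))
print(loss(beta_pinv), loss(beta_lstsq))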